|Statistical considerations of L2L|
|Introduction||Why binomial P values?|
|Multiple hypothesis problem: Corrections||Multiple hypothesis problem: Simulations|
|Multiple hypothesis problem: FDR||P value inflation|
|Reliability of L2L results|
Adapted from the 2005 Genome Biology L2L manuscript.
If we accept in principle that measuring the overlap between a user's list and the various lists in the L2L database can produce biological insights, we still must resolve how to quantify that overlap with a meaningful statistic that provides at least a relative gauge of which overlaps deserve the most attention. Three major considerations are the choice of statistic, the multiple-hypothesis problem, and the issue of p-value inflation. We performed a variety of analyses on our sample data set of genes down-regulated in diabetic nephropathy in order to determine how well the relatively simple binomial distribution calculation performs under real-world circumstances. The results for a selection of 22 lists, those upon which we based our conclusions about diabetic nephropathy, are presented in Table 1. Complete results for all lists, and more detailed information about how these analyses were performed, are available on the L2L website.back to top
The essential task for a statistical test in over-abundance analysis is to quantify how surprised we should be to see a particular degree of overlap or, conversely, how likely is the overlap to have occurred by chance? If the likelihood of success in a trial is p, and we perform n trials, what are the odds that we will see m or more successes? In the case of L2L, n is the number of probes that map to a list in the database, and p is the likelihood that any one of them will be found in the data by chance - the proportion of probes in the user's data out of all the probes on the microarray. A "trial" tests whether one of the n probes derived from a database list is found in the user's data; success is a match. The binomial distribution permits the exact calculation of the odds of achieving a particular number of matches out of n trials. The cumulative probability of achieving m or more matches is found as follows:
L2L uses the Double Precision Cumulative Distribution Function Library (DCDFLIB), implemented in the Math::CDF Perl module, to compute binomial probabilities. The binomial distribution performs trials with replacement - the odds of scoring a success remain constant for all trials. In reality, a probe can only be selected once, so the hypergeometric distribution, which calculates probabilities without replacement, is more accurate. But it is more difficult to calculate than the binomial distribution, and in any event approaches the binomial distribution at large values of n and m, where replacement has little impact on the odds of the next trial. Alternatively, the Poisson distribution is easier to calculate than the binomial distribution, and approaches it where values of n are large and p small (as in most L2L analyses). In our sample data set of genes up-regulated in diabetic nephropathy, the p-values calculated from the hypergeometric distribution or Poisson distribution closely followed those calculated from the binomial distribution (Table 1, compare columns 5, 6 and 7). We therefore chose to use the binomial distribution as a reasonable compromise between accuracy and computational requirements.back to top
The multiple-hypothesis problem is that when testing a large number of hypotheses simultaneously - here, that each of the hundreds of lists in the L2L database might overlap significantly with the user's data - the odds of producing a low p-value by chance become substantial. For example, with 357 lists in the original L2L database, we might expect purely random data to produce about 18 "significant" overlaps with p-values < 0.05 (357 * 0.05). There are two common approaches that either reduce the odds of seeing any such false-positive p-values, or mitigate their effect. The former approach is to control the family-wise error rate, usually by applying some adjustment to the calculated p-values. This adjustment can be the same for all p-values (termed "single-step") or can vary as we evaluate each p-value in order ("step-down" or "step-up"). The single-step Bonferroni is the most common adjustment, and is simply the multiplication of the p-value by the number of hypotheses (p * n, n being 357 here). We found the Bonferroni adjustment to be excessively conservative, based on the simulation-adjusted p-values and false discovery rate (see below, and Table 1 in the L2L manuscript). The single-step Sidak, which uses the adjustment (1 - (1 - p)^n), produced near-identical results to the Bonferroni for low p-values. Since n has a large initial value, step-down procedures for these two adjustments - where n is decremented by 1 as we adjust each p-value in ascending order - did not produce substantially different adjusted p-values.back to top
An attractive alternative to simple adjustments based on the number of hypotheses is to perform simulations with random data, and adjust p-values based on their frequency of occurrence among the random results. We therefore undertook a 10,891-trial simulation using data sets of the same size as our diabetic nephropathy sample (513 probes), drawn randomly from all the probes on the U95Av2 microarray (10,877 probes). We used true random numbers from Random.org for all simulations. As expected, the median binomial p-value calculated from these random data was not significant for any list (Table 1, Column 9). We compared each p-value from the actual sample data to the simulation-generated p-values for that specific list, and for all lists together. In both cases, the frequency of occurrence of a p-value equal to or less than the actual p-value (i.e., the simulation-adjusted p-value) was generally lower than the actual p-value (Table 1). This shows that, at least for the diabetic nephropathy data set on the U95Av2 platform, a simple calculation of p-values based on the binomial distribution gives a good approximation of the actual likelihood of seeing an overlap by chance. The capability to perform a simulation analysis will be included in a future release of the downloadable L2L application. However, the utility of a simulation analysis is proportional to the number of trials run, because an adjusted p-value cannot be lower than (1 / number of trials). Each "trial" is a full-fledged L2L analysis, so a 10,000-trial simulation takes four orders of magnitude longer to run than a single analysis, not considering the time required to create random data sets. The computational requirements are therefore daunting, and preclude it from being practical in a web-based tool.back to top
All such p-value adjustments, however they are made, aim to reduce the chances of seeing any false positives. They can therefore be too conservative if, as in most biological questions, permitting a few false-positives is a reasonable trade-off for seeing more true data. The false-discovery rate (FDR) is an increasingly popular approach to the multiple-hypothesis problem that mitigates the effect of false-positives by estimating how many there are at a given level of significance, rather than trying to eradicate them. It can therefore be substantially more powerful than controlling the family-wise error rate. We used our random-data simulation to calculate the FDR at all levels of significance by dividing the average number of random occurrences of a p-value less than or equal to a given number by the number of occurrences in the actual data of a p-value less than or equal to that number. Column 12 of Table 1 shows that even for the least-significant of our 22 sample lists, only 2% of the lists with a p-value less than it are expected to be false-positives. Overall, a binomial p-value of 0.05 corresponded to a FDR of about 10%, and 0.01 to 2.5%. The capability to calculate FDR from simulation data will be included in a future version of the downloadable L2L application, but these sample data suggest that the simple and economical binomial calculation of L2L, with a rough p-value threshold of 0.05-0.01, strikes a reasonable balance between stringency and power.back to top
Finally, we must address the issue of p-value inflation: the generation of p-values that, while genuinely statistically significant, are devoid of biological meaning. One way this can occur is through the statistics of small numbers - the anthropic principle of over-representation analysis. When only very few genes in the universe being tested posses a given characteristic, even one occurring in the data may be calculated as highly significant. Unlike a Fisher's exact test, the binomial distribution makes no explicit accommodation for small numbers. However, in creating L2L we assumed that comparisons to very short database lists would not be meaningful, and excluded lists (including those generated from Gene Ontology annotations) with fewer than five genes. For a moderate-sized data set like our sample (513 genes), 2 of 5 probes must match the data for a significant p-value (0.01) to be generated. For much smaller data sets, only a single matching probe could produce a significant p-value (for 50 genes, 0.02). However, the goal of L2L is to tease out complex patterns of gene expression that might be produced by a kaleidoscope of pathways. There is simply too little signal among a few dozen genes to identify meaningful patterns, unless the investigator is certain that only a single pathway is at work - in which case L2L is unlikely to be helpful anyway. We therefore intend L2L to be used with relatively large database lists and relatively larger data sets, and in such circumstances the dangers of small numbers should be minor. We quantitatively tested the robustness of L2L's results by performing a 10,891-trial permutation simulation. In each trial, 52 probes from the sample data (10%) were thrown out and replaced with 52 different probes drawn randomly from the universe of the U95Av2 microarray. We found that the median p-values generated by the permutations were only slightly reduced from the actual values. In no case was the actual p-value a significant outlier among the permuted data: all had list-specific p-values of > 0.05, most with FDRs of 70-80% (Table 2 in the L2L manuscript).
The second potential source of p-value inflation arises from the universal nature of the database. The common language, HUGO symbols, must be translated to platform-specific probes identifiers for the user's microarray. If only a handful of genes in some database list are represented on the microarray, and one of those happens to represented by several probes, all of which are differentially expressed in the user's experiment, the list will generate a highly significant p-value on the questionably narrow basis of that single gene. A user can see on a Listmatch page exactly which genes or probes created a small but significant overlap, and judge if it appears to be an artifact of translation. Users should be particularly wary of genes used as hybridization controls. We re-analyzed our diabetic nephropathy sample data without probe translation, using only gene symbols (Table 2 in the L2L manuscript). Several of the 22 sample lists dropped out of statistical significance; most of these were due to STAT1, an Affymetrix hybridization control, being represented by six probes in the data. Users may wish to remove control probes from their data before analyzing it with L2L. A future release of L2L will incorporate a directed-permutation algorithm into the statistical analysis to ensure that a reported p-value is not overly reliant on a single gene.back to top
Concerns about whether the results of an L2L analysis can be trusted fall into two major categories, which might be described as qualitative and quantitative. The qualitative concern is whether the lists of differentially expressed genes in the database are trustworthy, and if comparison to a user's data can be meaningful. The quantitative concern is whether the statistics we use to judge the significance of the overlaps between a user's data and lists from the database provide a useful metric of biological meaning.
Could a small amount of poorly analyzed or biased data in the L2L database poison the well for all who drink? Much like the scientific process as a whole, L2L takes a distributed-competence approach, augmented by independent replication and careful statistical analysis, to mitigate this concern. Our working assumption is that investigators themselves are best qualified to judge the quality of their own data, and that published lists usually include only those genes for which a change call can be assigned with a reasonable probability. We augment this assumption by including in the database, whenever possible, microarray datasets generated by independent groups that have addressed the same or a closely related question. Given the noise inherent in any microarray experiment, a user can feel much more secure interpreting results which reflect overlap with several related database lists from different sources, rather than idiosyncratic overlap with just one list. Finally, L2L calculates a p-value for each comparison that provides a quantitative assessment of the significance of an overlap. If an experiment is contaminated with random data due to experimental error or systematic bias, the likelihood of the L2L list derived from that experiment overlapping significantly with any other experimental data would be purely stochastic - unless both experiments suffer from a common systematic bias. For example, we performed a 10,891-trial simulation with randomized data to help validate our sample analysis of diabetic nephropathy. The odds of achieving a p-value below 0.05 with random data was no greater than 0.05 for any list in the database, and as low as 0.001 (see Supplemental Data). Absent common systemic bias, therefore, random data is very unlikely to produce spuriously significant L2L results.
There are two major potential sources of systematic bias: genes that are considered a priori to be "interesting" or "critical" based on previous data or theory, and platform-specific bias. Certain often-studied, well-understood genes - the very kind that lend themselves to "critical gene" hypotheses - are represented on virtually all microarray platforms, and thus could be more likely to be found in random data acquired with any platform. Certain genes may also be more likely to be flagged as differentially expressed on a particular type of chip, perhaps because the chip is more sensitive to small variations at particular expression levels or because of probe-specific effects. If any systematic bias exists, it could only represent a higher likelihood of a random change in signal for that gene or probe - the chip does not know whether the control or experimental RNA is washed onto it, or with which dye color. So proper experimental design and data analysis should eliminate these false-positives before a user turns to L2L. The same applies to the published data from which the L2L lists are derived. If any false-positive genes do persist on database lists, the fact that L2L separately analyzes "up" and "down" lists mitigates their impact, because they will be randomly distributed between the two lists. These separate lists also provide great potential assurance for the user, if the "up" and "down" lists in the user's data both correlate significantly and respectively with the "up" and "down" lists (or vice versa) for a particular condition in the database (see Figure 4b, diabetic nephropathy and adipogenesis). The inclusion of data from independent groups can provide further assurance, because the same set of randomly changing genes is unlikely to be found in independent data sets from different platforms. Still, both sources of systematic bias can be directly addressed in a future release of L2L by more sophisticated statistical analysis algorithms. Each list in the database is annotated with the platform that produced it, so the frequency of occurrence of genes among lists from a given platform (platform-specific bias) as well as the overall occurrence of genes in the database (bias toward "interesting genes") could be used to weight the contribution of each gene match to the overall significance of the overlap between two lists.back to top