
Investigations into refinements of Storey’s method of multiple hypothesis testing minimising the FDR, and its application to test binomial data. (English) Zbl 1255.62212

Summary: Storey’s method for multiple hypothesis testing, “the Optimal Discovery Procedure” (ODP), which minimises the false discovery rate (FDR) and gives \(p\)-values and \(q\)-values (estimates of the FDR) for each test, was extended by iteration to enforce consistency between the \(p\)-values of the tests and the binary parameters defining which data points contribute to the fitted null hypothesis. These parameters arise when the null hypothesis has to be estimated from the data. The ODP as previously described is optimal only for fixed values of these parameters. The extension proposed here requires the introduction of a cut-off parameter for the \(p\)-values. The motivating application was the analysis of a set of pairs of frequencies representing gene expression for a set of genes in two libraries, from which it was desired to select the genes most likely to depart from the null hypothesis that the frequency ratio is a fixed unknown number. The method was tested by analysing many similar simulated datasets. The results showed that the ODP modified by iteration could be improved, sometimes greatly, by a suitable choice of the cut-off parameter. However, varying this parameter alone may not lead to the globally optimal solution: statistical testing based on the binomial distribution is more efficient than a form of the ODP when the number of non-null hypotheses in the data is small, whereas the reverse is true when it is large. This may be an effect of using discrete data. Efficiency here is defined in terms of the expected proportion of errors (\(q\)-value) that occur when a given proportion of the data is declared “significant” (i.e., the null hypothesis is believed not to hold for them).
An improved version of the ODP along these lines is likely to have numerous applications, such as the optimised search for candidate genes that show unusual expression patterns, for example when more than two experimental conditions are compared simultaneously, and cases where additional categorical variables or a time series are present in the experimental design.
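As a rough illustration of the binomial baseline and the iterated null fit mentioned in the summary, the following Python sketch may help. It is not the ODP and not the authors' implementation. It assumes that each gene is a pair of counts (x, y) from the two libraries, that the null proportion is estimated by pooling the genes currently treated as null, and that a gene leaves the null set when its exact two-sided binomial \(p\)-value falls at or below the cut-off. The function names, the pooling rule and the use of Benjamini-Hochberg adjusted \(p\)-values as a simple stand-in for \(q\)-values are all assumptions made for this example.

```python
from math import comb


def two_sided_binom_p(k, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    if n == 0:
        return 1.0
    pmf = [comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(v for v in pmf if v <= threshold))


def iterate_null_fit(pairs, cutoff=0.05, max_iter=50):
    """Hypothetical iteration: refit the null proportion using only the
    genes whose p-value exceeds the cut-off, until the set is stable."""
    in_null = [True] * len(pairs)
    for _ in range(max_iter):
        a = sum(x for (x, _), keep in zip(pairs, in_null) if keep)
        b = sum(y for (_, y), keep in zip(pairs, in_null) if keep)
        if a + b == 0:
            raise ValueError("no counts left in the null set")
        p0 = a / (a + b)  # pooled estimate of the null proportion
        pvals = [two_sided_binom_p(x, x + y, p0) for x, y in pairs]
        new_in_null = [pv > cutoff for pv in pvals]
        if new_in_null == in_null:
            break
        in_null = new_in_null
    return p0, pvals, in_null


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (a conservative stand-in
    for q-values; no estimate of the null proportion is used)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted


# Made-up counts for five genes in two libraries (illustration only).
example_pairs = [(10, 10), (11, 9), (10, 10), (9, 11), (40, 2)]
p0, pvals, in_null = iterate_null_fit(example_pairs, cutoff=0.01)
adjusted = bh_adjust(pvals)
```

In this sketch the cut-off plays the role of the parameter that decides which data points contribute to the fitted null, so rerunning `iterate_null_fit` with different `cutoff` values gives a feel for how sensitive the fitted ratio and the resulting ranking are to that choice.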

MSC:

62J15 Paired and multiple comparisons; multiple testing
62P10 Applications of statistics to biology and medical sciences; meta analysis
92D10 Genetics and epigenetics
65C60 Computational problems in statistics (MSC2010)
92C40 Biochemistry, molecular biology

References:

[1] Altschul, S. F.; Madden, T. L.; Schaeffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 3389-3402 (1997)
[2] Benjamini, Y.; Hochberg, Y., Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Statist. Soc. B, 57, 289-300 (1995) · Zbl 0809.62014
[3] Chou, H. H.; Holmes, M. H., DNA sequence quality trimming and vector removal, Bioinformatics, 17, 1093-1104 (2001)
[4] Käll, L.; Storey, J. D.; Noble, W. S., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry, Bioinformatics, 24, i42-i48 (2008)
[5] Käll, L.; Storey, J. D.; Noble, W. S., QVALITY: non-parametric estimation of \(q\)-values and posterior error probabilities, Bioinformatics, 25, 964-966 (2009)
[6] Leung, Y. F.; Cavalieri, D., Fundamentals of cDNA microarray data analysis, Trends in Genetics, 19, 649-659 (2003)
[7] Pertea, G.; Huang, X.; Liang, F.; Antonescu, V.; Sultana, R.; Karamycheva, S.; Lee, Y.; White, J.; Cheung, F.; Parvizi, B., TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets, Bioinformatics, 19, 651-652 (2003)
[8] Santner, T. J.; Duffy, D. E., The Statistical Analysis of Discrete Data (1989), Springer-Verlag · Zbl 0702.62005
[9] Simonoff, J. S., Smoothing Methods in Statistics (1996), Springer-Verlag: New York · Zbl 0859.62035
[10] Storey, J. D., The optimal discovery procedure: a new approach to simultaneous significance testing, J. R. Statist. Soc. B, 69, 347-368 (2007) · Zbl 07555356
[11] Storey, J. D.; Dai, J. Y.; Leek, J. T., The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments, University of Washington Biostatistics Working Paper Series, Paper 260 (2005) · Zbl 1213.62175
[12] Storey, J. D.; Tibshirani, R., Statistical significance for genomewide studies, Proc. Natn. Acad. Sci. USA, 100, 9440-9445 (2003) · Zbl 1130.62385
[13] Zelterman, D., Models for Discrete Data: Revised Edition (2006), Oxford University Press