zbMATH — the first resource for mathematics

Marginal asymptotics for the “large \(p\), small \(n\)” paradigm: with applications to microarray data. (English) Zbl 1123.62005
Summary: The “large \(p\), small \(n\)” paradigm arises in microarray studies, image analysis, high throughput molecular screening, astronomy, and in many other high dimensional applications. False discovery rate (FDR) methods are useful for resolving the accompanying multiple testing problems. In cDNA microarray studies, for example, \(p\)-values may be computed for each of \(p\) genes using data from \(n\) arrays, where typically \(p\) is in the thousands and \(n\) is less than 30. For FDR methods to be valid in identifying differentially expressed genes, the \(p\)-values for the nondifferentially expressed genes must simultaneously have uniform distributions marginally. While feasible for permutation \(p\)-values, this uniformity is problematic for asymptotic-based \(p\)-values, since the number of \(p\)-values involved goes to infinity and intuition suggests that at least some of the \(p\)-values should behave erratically.
We examine this neglected issue when \(n\) is moderately large but \(p\) is almost exponentially large relative to \(n\). We show the somewhat surprising result that, under very general dependence structures and for both mean and median tests, the \(p\)-values are simultaneously valid. A small simulation study and data analysis are used for illustration.
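As a rough illustration of the setting described in the summary, the following Python sketch computes an asymptotic (normal-approximation) two-sided \(p\)-value for a one-sample mean test and then applies the Benjamini-Hochberg step-up rule of reference [1] to a list of marginal \(p\)-values. It is not the authors' procedure: the function names `normal_p_value` and `benjamini_hochberg`, the plain normal calibration, and the FDR level `q` are assumptions made only for this example.

```python
import math


def normal_p_value(sample):
    """Two-sided p-value for H0: mean = 0, using a normal approximation.

    Illustrative assumption: the studentized mean is calibrated against
    N(0, 1), which is an asymptotic (large n) approximation.
    """
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance (requires n >= 2 and non-constant data).
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    z = math.sqrt(n) * mean / math.sqrt(var)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    # Step-up: find the largest rank whose p-value is below rank * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])
```

In the microarray setting, `normal_p_value` would be applied to each of the \(p\) genes using its \(n\) array measurements, and the resulting list passed to `benjamini_hochberg`. The validity of that second step rests on the simultaneous marginal uniformity of the null \(p\)-values, which is the question the paper addresses.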

MSC:
62A01 Foundations and philosophical topics in statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
62H99 Multivariate analysis
62G20 Asymptotic properties of nonparametric inference
92C40 Biochemistry, molecular biology
References:
[1] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. · Zbl 0809.62014
[2] Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley, New York. · Zbl 0822.60002
[3] Bretagnolle, J. and Massart, P. (1989). Hungarian construction from the nonasymptotic viewpoint. Ann. Probab. 17 239–256. · Zbl 0667.60042 · doi:10.1214/aop/1176991506
[4] Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics. Academic Press, New York. · Zbl 0539.60029
[5] Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 77–87. · Zbl 1073.62576 · doi:10.1198/016214502753479248
[6] Dvoretzky, A., Kiefer, J. and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27 642–669. · Zbl 0073.14603 · doi:10.1214/aoms/1177728174
[7] Fan, J., Hall, P. and Yao, Q. (2005). To how many simultaneous hypothesis tests can normal, Student’s \(t\) or bootstrap calibration be applied? Unpublished manuscript. · Zbl 1332.62063
[8] Fan, J., Peng, H. and Huang, T. (2005). Semilinear high-dimensional model for normalization of microarray data: A theoretical analysis and partial consistency (with discussion). J. Amer. Statist. Assoc. 100 781–813. · Zbl 1117.62330 · doi:10.1198/016214504000001781
[9] Fan, J., Tam, P., Vande Woude, G. and Ren, Y. (2004). Normalization and analysis of cDNA microarrays using within-array replications applied to neuroblastoma cell response to a cytokine. Proc. Natl. Acad. Sci. USA 101 1135–1140.
[10] Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 499–517. · Zbl 1090.62072 · doi:10.1111/1467-9868.00347
[11] Ghosh, D. and Chinnaiyan, A. M. (2005). Classification and selection of biomarkers in genomic data using LASSO. J. Biomedicine and Biotechnology 2005 147–154.
[12] Gui, J. and Li, H. (2005). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21 3001–3008.
[13] Huang, J., Kuo, H.-C., Koroleva, I., Zhang, C.-H. and Bento Soares, M. (2003). A semilinear model for normalization and analysis of cDNA microarray data. Technical Report 321, Dept. Statistics and Actuarial Science, Univ. Iowa.
[14] Huang, J., Wang, D. and Zhang, C.-H. (2005). A two-way semilinear model for normalization and analysis of cDNA microarray data. J. Amer. Statist. Assoc. 100 814–829. · Zbl 1117.62358 · doi:10.1198/016214504000002032
[15] Komlós, J., Major, P. and Tusnády, G. (1975). An approximation of partial sums of independent rv’s and the sample df. I. Z. Wahrsch. Verw. Gebiete 32 111–131. · Zbl 0308.60029 · doi:10.1007/BF00533093
[16] Kosorok, M. R. (1999). Two-sample quantile tests under general conditions. Biometrika 86 909–921. · Zbl 0942.62052 · doi:10.1093/biomet/86.4.909
[17] Kosorok, M. R. (2002). On global consistency of a bivariate survival estimator under univariate censoring. Statist. Probab. Lett. 56 439–446. · Zbl 0994.62097 · doi:10.1016/S0167-7152(02)00044-5
[18] Kosorok, M. R. and Ma, S. (2005). Comment on “Semilinear high-dimensional model for normalization of microarray data: A theoretical analysis and partial consistency,” by J. Fan, H. Peng and T. Huang. J. Amer. Statist. Assoc. 100 805–807. · Zbl 1117.62330 · doi:10.1198/016214504000001781
[19] Kosorok, M. R. and Ma, S. (2005). Marginal asymptotics for the “large \(p\), small \(n\)” paradigm: With applications to microarray data. Technical Report 188, Dept. Biostatistics and Medical Informatics, Univ. Wisconsin, Madison. · Zbl 1123.62005
[20] Massart, P. (1990). The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. Ann. Probab. 18 1269–1283. · Zbl 0713.62021 · doi:10.1214/aop/1176990746
[21] Skorohod, A. V. (1976). On a representation of random variables. Theory Probab. Appl. 21 628–632. · Zbl 0362.60004
[22] Spang, R., Blanchette, C., Zuzan, H., Marks, J., Nevins, J. and West, M. (2001). Prediction and uncertainty in the analysis of gene expression profiles. In Proc. German Conference on Bioinformatics GCB 2001 (E. Wingender, R. Hofestädt and I. Liebich, eds.) 102–111.
[23] Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479–498. · Zbl 1090.62073 · doi:10.1111/1467-9868.00346
[24] Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 187–205. · Zbl 1061.62110 · doi:10.1111/j.1467-9868.2004.00439.x
[25] van der Laan, M. J. and Bryan, J. (2001). Gene expression analysis with the parametric bootstrap. Biostatistics 2 445–461. · Zbl 1097.62571 · doi:10.1093/biostatistics/2.4.445
[26] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York. · Zbl 0862.60002
[27] West, M. (2003). Bayesian factor regression models in the “large \(p\), small \(n\)” paradigm. In Bayesian Statistics 7 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 733–742. Oxford Univ. Press.
[28] West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J. A. Jr., Marks, J. R. and Nevins, J. R. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. USA 98 11462–11467.
[29] Yang, Y. H., Dudoit, S., Luu, P. and Speed, T. P. (2001). Normalization for cDNA microarray data. In Microarrays: Optical Technologies and Informatics (M. L. Bittner, Y. Chen, A. N. Dorsal and E. R. Dougherty, eds.) 141–152. Proc. SPIE 4266.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.