Hypothesis tests for principal component analysis when variables are standardized. (English) Zbl 1426.62178

Summary: In principal component analysis (PCA), the first few principal components possibly reveal interesting systematic patterns in the data, whereas the last may reflect random noise. The researcher may wonder how many principal components are statistically significant. Many methods have been proposed for determining how many principal components to retain in the model, but most of these assume non-standardized data. In agricultural, biological and environmental applications, however, standardization is often required. This article proposes parametric bootstrap methods for hypothesis testing of principal components when variables are standardized. Unlike previously proposed methods, the proposed parametric bootstrap methods do not rely on any asymptotic results requiring large dimensions. In a simulation study, the proposed parametric bootstrap methods for standardized data were compared with parallel analysis for PCA and methods using the Tracy-Widom distribution. Parallel analysis performed well when testing the first principal component, but was much too conservative when testing higher-order principal components not reflecting random noise. When variables are standardized, the Tracy-Widom distribution may not approximate the distribution of the largest eigenvalue. The proposed parametric bootstrap methods maintained the level of significance approximately and were up to twice as powerful as the methods using the Tracy-Widom distribution. SAS and R computer code is provided for the recommended methods.
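For orientation, parallel analysis, one of the comparison methods named in the summary, can be sketched for standardized variables as follows. This is a minimal illustration, not the SAS or R code supplied with the article and not the proposed parametric bootstrap: the function name, the default of 1000 simulations, the 95% quantile and the sequential stopping rule are all assumptions made for the example.

```python
import numpy as np


def parallel_analysis(X, n_sim=1000, quantile=0.95, seed=0):
    """Horn-style parallel analysis on standardized data (illustrative sketch).

    X is an (n, p) data matrix. Standardizing the variables is equivalent to
    working with the correlation matrix, so eigenvalues of the sample
    correlation matrix are compared with those of uncorrelated normal noise.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Observed eigenvalues of the correlation matrix, largest first.
    observed = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

    # Reference distribution: eigenvalues from pure-noise data of the same shape.
    simulated = np.empty((n_sim, p))
    for b in range(n_sim):
        Z = rng.standard_normal((n, p))
        simulated[b] = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]

    # Component-wise thresholds (assumed choice: upper quantile of the noise eigenvalues).
    thresholds = np.quantile(simulated, quantile, axis=0)

    # Retain leading components until the first one that fails to exceed its threshold.
    exceeds = observed > thresholds
    n_retained = p if exceeds.all() else int(np.argmin(exceeds))
    return observed, thresholds, n_retained
```

Because every reference eigenvalue comes from data with no structure at all, the thresholds for the second and later components ignore the variance already absorbed by earlier significant components. This is consistent with the summary's remark that parallel analysis is too conservative for higher-order components.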

MSC:

62H25 Factor analysis and principal components; correspondence analysis
62F40 Bootstrap, jackknife and other resampling methods
62H15 Hypothesis testing in multivariate analysis
62P12 Applications of statistics to environmental and related topics

Software:

SAS; R
