A clustering approach to interpretable principal components. (English) Zbl 07265816

Summary: A new method for constructing interpretable principal components is proposed. The method first clusters the variables, and then interpretable (sparse) components are constructed from the correlation matrices of the clustered variables. For the first step of the method, a new weighted-variances method for clustering variables is proposed. It reflects the nature of the problem that the interpretable components should maximize the explained variance and thus provide sparse dimension reduction. An important feature of the new clustering procedure is that the optimal number of clusters (and components) can be determined in a non-subjective manner. The new method is illustrated using well-known simulated and real data sets. It clearly outperforms many existing methods for sparse principal component analysis in terms of both explained variance and sparseness.
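The two-step idea in the summary (cluster the variables, then build one sparse component per cluster from that cluster's correlation matrix) can be sketched as follows. This is only an illustration under stated assumptions: the summary does not specify the weighted-variances clustering criterion, so a plain average-linkage clustering on 1 - |r| is swapped in for step one, and the number of clusters k is fixed by hand rather than determined automatically. The function names and the simulated data are hypothetical.

```python
import numpy as np

def cluster_variables(R, k):
    """Step 1 (stand-in): greedy average-linkage clustering of variables
    using the distance 1 - |r|. NOT the paper's weighted-variances method."""
    D = 1.0 - np.abs(R)
    clusters = [[j] for j in range(R.shape[0])]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # mean distance between all pairs of variables in the two clusters
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def sparse_components(R, clusters):
    """Step 2: one loading vector per cluster, taken as the leading
    eigenvector of that cluster's correlation submatrix; zeros elsewhere."""
    A = np.zeros((R.shape[0], len(clusters)))
    for c, idx in enumerate(clusters):
        sub = R[np.ix_(idx, idx)]
        _, vecs = np.linalg.eigh(sub)   # eigenvalues in ascending order
        A[idx, c] = vecs[:, -1]         # leading eigenvector
    return A

def component_variances(R, A):
    """Variance a' R a of each component (variables assumed standardized)."""
    return np.einsum("ij,ik,kj->j", A, R, A)

# Simulated data (assumption): two independent latent factors, three noisy
# indicator variables each.
rng = np.random.default_rng(0)
n = 200
f1, f2 = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack(
    [f1 + 0.3 * rng.standard_normal(n) for _ in range(3)]
    + [f2 + 0.3 * rng.standard_normal(n) for _ in range(3)]
)
R = np.corrcoef(X, rowvar=False)

clusters = cluster_variables(R, k=2)
A = sparse_components(R, clusters)
variances = component_variances(R, A)
print(clusters)
print(np.round(A, 3))
print(np.round(variances / R.shape[0], 3))  # share of total variance per component
```

Each column of A has nonzero loadings on one cluster only, which is what makes the components sparse and easy to read. Components from different clusters are generally correlated, so the per-component shares printed here should not simply be summed as a total explained variance without an adjustment.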


62-XX Statistics

