An empirical comparison of two approaches for CDPCA in high-dimensional data. (English) Zbl 1480.62116

Summary: Modified principal component analysis techniques, specially those yielding sparse solutions, are attractive due to its usefulness for interpretation purposes, in particular, in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. In particular, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components and has, simultaneously, the particularity of identifying clusters of objects. Based on simulated and real gene expression data sets where the number of variables is higher than the number of the objects, we empirically compare the performance of two different heuristic iterative procedures, namely ALS and two-step-SDP algorithms proposed in the specialized literature to perform CDPCA. To avoid possible effect of different variance values among the original variables, all the data was standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish their performance, in particular related to the two-step-SDP algorithm: it provides faster results than ALS and, since it employs a clustering procedure (k-means) on the variables, outperforms ALS algorithm in recovering the true variable partitioning unveiled by the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, in which two-step-SDP outperforms ALS approach for data sets with lower sample size and more structure complexity (i.e., error level in the CDPCA model). The proportion of explained variance by the components estimated by both algorithms is affected by the data structure complexity (higher error level, the lower variance) and presents similar values for the two algorithms, except for data sets with two object clusters where the two-step-SDP approach yields higher variance. Moreover, experimental tests suggest that the two-step-SDP approach, in general, presents more ability to recover the true number of object clusters, while the ALS algorithm is better in terms of quality of object clustering with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.


62H25 Factor analysis and principal components; correspondence analysis
62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI


[1] Adachi, K.; Trendafilov, NT, Sparse principal component analysis subject to prespecified cardinality of loadings, Comput Stat, 31, 4, 1403-1427 (2016) · Zbl 1348.65014
[2] Boulesteix AL, Durif G, Lambert-Lacroix S, Peyre J, Strimmer K (2015) plsgenomics: PLS Analyses for Genomics, R package version 1.3-1 https://CRAN.R-project.org/package=plsgenomics
[3] Calinski, T.; Harabasz, J., A dendrite method for cluster analysis, Commun Stat, 3, 1-27 (1974) · Zbl 0273.62010
[4] Cavicchia C, Vichi M, Zaccaria G (2020) The ultrametric correlation matrix for modelling hierarchical latent concepts, Adv Data Anal Classif. doi:10.1007/s11634-020-00400-z · Zbl 1459.62097
[5] Charrad, M.; Ghazzali, N.; Boiteau, V.; Niknafs, A., NbClust: an R package for determining the relevant number of clusters in a data set, J Stat Softw, 61, 6, 1-36 (2014)
[6] Chung D, Chun H, Keles S (2013) spls: sparse partial least squares (SPLS) regression and classification. R package version 2.2-1. https://CRAN.R-project.org/package=spls
[7] d’Aspremont, A.; El Ghaoui, L.; Jordan, MI; Lanckriet, GRG, A direct formulation for sparse PCA using semidefinite programming, SIAM, 49, 3, 434-448 (2007) · Zbl 1128.90050
[8] DeSarbo, WS; Jedidi, K.; Cool, K.; Schendel, D., Simultaneous multidimensional unfolding and cluster analysis: an investigationof strategic groups, Mark Lett, 2, 129-146 (1990)
[9] Enki, DG; Trendafilov, NT; Jolliffe, IT, A clustering approach to interpretable principal components, J Appl Stat, 40, 3, 583-599 (2013) · Zbl 07265816
[10] Erichson NB, Zheng P, Aravkin S (2018) sparsepca: Sparse Principal Component Analysis (SPCA), R package version 0.1.2. https://CRAN.R-project.org/package=sparsepca
[11] Erichson NB, Zheng P, Manohar K, Brunton S, Kutz JN, Aravkin AY (2018) Sparse principal component analysis via variable projection. IEEE J Sel Top Signal Process (available at arXiv 1804.00341) · Zbl 1440.62231
[12] Hennig C (2015) fpc: Flexible Procedures for Clustering. R package version 2.1-10. https://CRAN.R-project.org/package=fpc
[13] Hunter, MA; Takane, Y., Constrained principal component analysis: various applications, J Educ Behav Stat, 27, 41-81 (2002)
[14] Jolliffe, IT, Principal component analysis (2002), New York: Springer, New York · Zbl 1011.62064
[15] Jolliffe, IT; Trendafilov, NT; Uddin, M., A modified principal component technique based on the lasso, J Comput Graph Stat, 12, 3, 531-547 (2003)
[16] Ma, Z., Sparse principal component analysis and iterative thresholding, Ann Stat, 41, 2, 772-801 (2013) · Zbl 1267.62074
[17] Macedo, E., Two-step-SDP approach to clustering and dimensionality reduction, Stat Optim Inf Comput, 3, 3, 294-311 (2015)
[18] Macedo E, Freitas A (2015) The alternating least-squares algorithm for CDPCA. In: Plakhov A et al (eds) Optimization in the natural sciences, communications in computer and information science (CCIS), vol 499. Springer, pp 173-191
[19] Nieto-Librero AB, Galindo-Villardón MP, Freitas A (2019)biplotbootGUI: Bootstrap on Classical Biplots and Clustering Disjoint Biplot, R package version 1.2. http://www.R-project.org/package=biplotbootGUI
[20] Nieto-Librero, AB; Sierra, C.; Vicente-Galindo, MP; Ruíz-Barzola, O.; Galindo-Villardón, MP, Clustering disjoint HJ-Biplot: a new tool for identifying pollution patterns in geochemical studies, Chemosphere, 176, 389-396 (2017)
[21] Overton, ML; Womersley, RS, Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices, Math Program, 62, 321-357 (1993) · Zbl 0806.90114
[22] Peng, J.; Wei, Y., Approximating k-means-type clustering via semidefinite programming, SIAM J Optim, 18, 1, 186-205 (2007) · Zbl 1146.90046
[23] Peng J, Xia Y (2005) A new theoretical framework for k-means-type clustering. In: Chu W et al (eds) Foundations and advances in data mining studies in fuzziness and soft computing, vol 180. Springer, pp 79-96 · Zbl 1085.68132
[24] R Development Core Team (2019) R: a language and environment for statistical computing. http://www.R-project.org/
[25] Rocci, R.; Vichi, M., Two-mode multi-partitioning, Comput Stat Data Anal, 52, 1984-2003 (2008) · Zbl 1452.62463
[26] Takane, Y.; Hunter, MA, Constrained principal component analysis: a comprehensive theory, Appl Algebra Eng Commun Comput, 12, 391-419 (2001) · Zbl 1040.62050
[27] Vichi, M., Disjoint factor analysis with cross-loadings, Adv Data Anal Classif, 11, 3, 563-591 (2017) · Zbl 1414.62222
[28] Vichi, M.; Saporta, G., Clustering and disjoint principal component analysis, Comput Stat Data Anal, 53, 3194-3208 (2009) · Zbl 1453.62230
[29] Vines, S., Simple principal components, Appl Stat, 49, 441-451 (2000) · Zbl 0965.62052
[30] Xu, R.; Wunsch, D., Survey of clustering algorithms, IEEE Trans Neural Netw, 16, 645-648 (2005)
[31] Zou, H.; Hastie, T.; Tibshirani, R., Sparse principal component analysis, J Comput Graph Stat, 15, 2, 262-286 (2006)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.