On consistency and sparsity for principal components analysis in high dimensions.

*(English)* Zbl 1388.62174

Summary: Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of \(n\) observations (or cases) of a vector with \(p\) variables. Contemporary datasets often have \(p\) comparable with or even much larger than \(n\). Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) that the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if \(p(n)/n \to 0\). We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if \(p(n)\gg n\).
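
The selection step mentioned in the summary can be illustrated with a minimal Python sketch. It is not the authors' algorithm as published: the summary does not state how many coordinates to keep or in which basis the data are expressed, so the function name `subset_pca_leading_component` and the parameter `k` (the number of retained coordinates) are assumptions made for illustration, and the data are assumed to be already represented in a basis where the signal is sparse.

```python
import numpy as np

def subset_pca_leading_component(X, k):
    """Estimate the leading principal component after variance-based selection.

    X : (n, p) array of n observations of p variables.
    k : number of coordinates to retain (hypothetical tuning parameter;
        the summary does not specify a selection rule).
    Returns (v, keep): a unit-norm length-p vector that is zero outside the
    selected coordinates, and the sorted indices of the selected coordinates.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if not 1 <= k <= p:
        raise ValueError("k must satisfy 1 <= k <= p")

    # Step 1: sample variance of each coordinate.
    variances = X.var(axis=0, ddof=1)

    # Step 2: keep the k coordinates with the largest sample variances.
    keep = np.sort(np.argsort(variances)[-k:])

    # Step 3: standard PCA on the reduced data. The leading right singular
    # vector of the centred matrix is the leading eigenvector of its
    # sample covariance matrix.
    Xc = X[:, keep] - X[:, keep].mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)

    # Step 4: embed the estimate back into p dimensions.
    v = np.zeros(p)
    v[keep] = vt[0]
    return v, keep
```

The returned vector is determined only up to sign, so any comparison with a known direction should use the absolute value of the inner product.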

##### MSC:

62H25 Factor analysis and principal components; correspondence analysis