On using principal components before separating a mixture of two multivariate normal distributions. (English) Zbl 0538.62050

Summary: In applying principal components for reducing the dimension of the data before clustering, it has ordinarily been the practice to use components with the largest eigenvalues. We prove, by means of a mixture of two multivariate normal distributions, that this practice is not justified in general. A relationship between the distance of the two subpopulations and any subset of principal components is derived, showing that the components with the larger eigenvalues do not necessarily contain more information (distance).
This result is further demonstrated on hypothetical examples as well as on real data. The effect of scaling the variables on how the information is distributed among the components is investigated. An application to a mixture of two normal distributions is illustrated with a set of generated data in which the information is concentrated in the components with the largest and the smallest eigenvalues.
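The main claim can be checked numerically with a small sketch. It is not taken from the paper: the within-group covariance `W`, mean difference `d`, mixing proportion `p`, and the function name `subset_distance_sq` are all hypothetical choices made for illustration. The sketch assumes both subpopulations share the covariance `W`, so the mixture covariance is `W + p(1-p) d d^T`, and it measures "information" as the squared Mahalanobis distance between the two means after projecting onto a chosen subset of principal components.

```python
import numpy as np


def subset_distance_sq(W, d, p, keep):
    """Squared Mahalanobis distance between the two subpopulation means
    after projecting onto the principal components listed in `keep`
    (index 0 = largest eigenvalue of the mixture covariance).

    Assumes a common within-group covariance W (an illustrative assumption).
    """
    c = p * (1.0 - p)
    sigma = W + c * np.outer(d, d)            # covariance of the two-component mixture
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    V = eigvecs[:, keep]                      # loadings of the retained components
    d_s = V.T @ d                             # mean difference in the subspace
    W_s = V.T @ W @ V                         # within-group covariance in the subspace
    return float(d_s @ np.linalg.solve(W_s, d_s)), eigvals


# Hypothetical example: the means differ mostly along the low-variance direction.
W = np.diag([9.0, 1.0, 0.25])
d = np.array([0.5, 0.0, 1.0])
p = 0.5

total, eigvals = subset_distance_sq(W, d, p, [0, 1, 2])
first, _ = subset_distance_sq(W, d, p, [0])   # component with the largest eigenvalue
last, _ = subset_distance_sq(W, d, p, [2])    # component with the smallest eigenvalue

print("eigenvalues (descending):", eigvals)
print("distance, all components:", total)
print("distance, first component only:", first)
print("distance, last component only:", last)
```

In this constructed case the component with the smallest eigenvalue carries nearly all of the separation, so keeping only the leading component would discard most of it. Rescaling the variables changes `W` and `d` together, which is one way to explore the scaling effect the summary mentions.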


62H25 Factor analysis and principal components; correspondence analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)