A principal component method to impute missing values for mixed data. (English) Zbl 1414.62206

Summary: We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, the factorial analysis for mixed data, which balances the influence of the continuous and categorical variables in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based both on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations, and the quality of the imputation is assessed on real data sets. The method is compared with a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113-118, 2011) and performs better, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
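The idea described in the summary can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it is a plain (unregularized) iterative PCA imputation on a weighted data matrix, written only to show how principal axes and components can predict missing cells. The function name `impute_mixed_pca`, the convention of coding missing categories as `-1`, and the specific weighting (continuous columns standardized, indicator columns divided by the square root of the category proportion) are assumptions made for this illustration. The sketch also assumes every category is observed at least once and that no column is entirely missing.

```python
import numpy as np

def impute_mixed_pca(X_num, X_cat, n_components=2, n_iter=500, tol=1e-10):
    """Illustrative iterative PCA imputation for mixed data (hypothetical helper).

    X_num : (n, p) float array, np.nan marks missing values.
    X_cat : (n, q) int array of category codes 0..k-1, -1 marks missing values.
    """
    n, p = X_num.shape
    # Indicator (one-hot) coding; rows with a missing category become NaN.
    blocks, sizes = [], []
    for j in range(X_cat.shape[1]):
        codes = X_cat[:, j]
        k = int(codes.max()) + 1
        block = np.full((n, k), np.nan)
        observed = codes >= 0
        block[observed] = np.eye(k)[codes[observed]]
        blocks.append(block)
        sizes.append(k)
    Z = np.hstack([X_num.astype(float)] + blocks)
    missing = np.isnan(Z)

    # Initial fill: column means (category proportions for indicator columns).
    Z_filled = np.where(missing, np.nanmean(Z, axis=0), Z)

    for _ in range(n_iter):
        mean = Z_filled.mean(axis=0)
        scale = np.empty(Z.shape[1])
        std = Z_filled[:, :p].std(axis=0)
        scale[:p] = np.where(std > 0, std, 1.0)            # standardize continuous
        scale[p:] = np.sqrt(np.clip(mean[p:], 1e-12, None))  # weight indicators
        W = (Z_filled - mean) / scale
        # Rank-limited reconstruction from the leading principal components.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        W_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        Z_hat = W_hat * scale + mean
        # Only the missing cells are updated; observed cells stay fixed.
        Z_new = np.where(missing, Z_hat, Z)
        change = np.sum((Z_new - Z_filled) ** 2)
        Z_filled = Z_new
        if change < tol:
            break

    X_num_imp = Z_filled[:, :p]
    X_cat_imp = X_cat.copy()
    start = p
    for j, k in enumerate(sizes):
        block = Z_filled[:, start:start + k]
        miss_rows = X_cat[:, j] < 0
        # Pick the category with the largest imputed indicator value.
        X_cat_imp[miss_rows, j] = block[miss_rows].argmax(axis=1)
        start += k
    return X_num_imp, X_cat_imp
```

Calling `impute_mixed_pca(X_num, X_cat, n_components=1)` returns the two arrays with only the missing cells filled in. The weighting step is what puts continuous and categorical variables on a comparable footing before the decomposition. A practical implementation would likely need refinements that this sketch omits, such as regularization and a data-driven choice of the number of components.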


62H25 Factor analysis and principal components; correspondence analysis


[1] Benzécri JP (1973) L’analyse des données. L’analyse des correspondances. Dunod, Tome II
[2] Breiman, L., Random forests, Mach Learn, 45, 5-32, (2001) · Zbl 1007.68152
[3] Bro, R.; Kjeldahl, K.; Smilde, AK; Kiers, HAL, Cross-validation of component model: a critical look at current methods, Anal Bioanal Chem, 390, 1241-1251, (2008)
[4] Cornillon PA, Guyader A, Husson F, Jégou N, Josse J, Kloareg M, Matzner-Løber E, Rouvière L (2012) R for Statistics. Chapman and Hall/CRC, Boca Raton
[5] de Leeuw J, Mair P (2009) Gifi methods for optimal scaling in R: The package homals. J Statist Software 31(4):1-20, URL http://www.jstatsoft.org/v31/i04/
[6] Escofier, B., Traitement simultané de variables quantitatives et qualitatives en analyse factorielle, Les cahiers de l’analyse des données, 4, 137-146, (1979)
[7] Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester · Zbl 0697.62048
[8] Greenacre M, Blasius J (2006) Multiple correspondence analysis and related methods. Chapman and Hall/CRC. · Zbl 1277.62156
[9] Husson F, Josse J (2012) missMDA: Handling missing values with/in multivariate data analysis (principal component methods). URL http://www.agrocampus-ouest.fr/math/husson, r package version 1.4 · Zbl 1316.62006
[10] Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957-2000, URL http://dl.acm.org/citation.cfm?id=1859890.1859917 · Zbl 1242.62047
[11] Josse, J.; Husson, F., Selecting the number of components in PCA using cross-validation approximations, Comput Statist Data Anal, 56, 1869-1879, (2011) · Zbl 1243.62082
[12] Josse, J.; Husson, F., Handling missing values in exploratory multivariate data analysis methods, Journal de la Société Française de Statistique, 153, 1-21, (2012) · Zbl 1316.62006
[13] Josse, J.; Pagès, J.; Husson, F., Gestion des données manquantes en analyse en composantes principales, Journal de la Société Française de Statistique, 150, 28-51, (2009) · Zbl 1311.62091
[14] Josse, J.; Chavent, M.; Liquet, B.; Husson, F., Handling missing values with regularized iterative multiple correspondence analysis, J Classif, 29, 91-116, (2012) · Zbl 1360.62306
[15] Kiers, HAL, Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables, Psychometrika, 56, 197-212, (1991) · Zbl 0850.62461
[16] Kiers, HAL, Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 62, 251-266, (1997) · Zbl 0873.62058
[17] Lafaye de Micheaux P, Drouilhet R, Liquet B (2011) Le logiciel R. Springer, Paris · Zbl 1216.68006
[18] Lang DT, Swayne D, Wickham H, Lawrence M (2012) rggobi: Interface between R and GGobi. URL http://CRAN.R-project.org/package=rggobi, r package version 2.1.19
[19] Lebart L, Morineau A, Warwick KM (1984) Multivariate descriptive statistical analysis. Wiley, New York · Zbl 0658.62069
[20] Little RJA, Rubin DB (1987, 2002) Statistical analysis with missing data. Wiley series in probability and statistics, New York
[21] Mazumder, R.; Hastie, T.; Tibshirani, R., Spectral regularization algorithms for learning large incomplete matrices, J Mach Learn Res, 11, 2287-2322, (2010) · Zbl 1242.68237
[22] Michailidis, G.; de Leeuw, J., The Gifi system of descriptive multivariate analysis, Statist Sci, 13, 307-336, (1998) · Zbl 1059.62551
[23] Peters A, Hothorn T (2012) ipred: Improved Predictors. URL http://CRAN.R-project.org/package=ipred, R package version 0.9-1
[24] R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, URL http://www.R-project.org/, ISBN 3-900051-07-0
[25] Rubin, DB, Inference and missing data, Biometrika, 63, 581-592, (1976) · Zbl 0344.62034
[26] Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC, London · Zbl 0997.62510
[27] Stekhoven, D.; Bühlmann, P., MissForest - nonparametric missing value imputation for mixed-type data, Bioinformatics, 28, 113-118, (2011)
[28] Tenenhaus, M.; Young, FW, An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data, Psychometrika, 50, 91-119, (1985) · Zbl 0585.62104
[29] Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, RB, Missing value estimation methods for DNA microarrays, Bioinformatics, 17, 520-525, (2001)
[30] van Buuren, S., Multiple imputation of discrete and continuous data by fully conditional specification, Statist Method Med Res, 16, 219-242, (2007) · Zbl 1122.62382
[31] van Buuren, S.; Boshuizen, H.; Knook, D., Multiple imputation of missing blood pressure covariates in survival analysis, Statist Med, 18, 681-694, (1999)
[32] van der Heijden P, Escofier B (2003) Multiple correspondence analysis with missing data. In: Analyse des correspondances, Presse universitaire de Rennes, pp 153-170
[33] Vermunt, JK; van Ginkel, JR; van der Ark, LA; Sijtsma, K., Multiple imputation of incomplete categorical data using latent class analysis, Sociol Methodol, 33, 369-397, (2008)