A sequential distance-based approach for imputing missing data: forward imputation. (English) Zbl 1414.62220

Summary: Missing data affect datasets in almost every field of quantitative research. The subject is vast and complex and has given rise to a rich literature of very different approaches. Within an exploratory framework, distance-based methods such as nearest-neighbour imputation (NNI) and procedures involving multivariate data analysis (MVDA) techniques appear to handle the problem well. In NNI, the metric and the number of donors can be chosen freely. MVDA-based procedures explicitly account for associations among variables. The new approach proposed here, called Forward Imputation, combines these features. It is designed as a sequential procedure that imputes missing data step by step, involving subsets of units according to their “completeness rate”. Two methods are developed within this framework for the imputation of quantitative data: one applies NNI with the Mahalanobis distance, the other combines NNI with principal component analysis. Statistical properties of the two methods are discussed, and their performance is assessed and compared with that of alternative imputation methods. To this end, a simulation study under different data patterns and an application to real data are carried out, and practical hints for users are provided.
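
The sequential idea described in the summary can be illustrated with a minimal Python sketch. It is not the authors' GenForImp implementation: the function name `forward_impute`, the parameter `k`, the grouping of units by their number of missing values, the covariance estimate taken from the current donors, and the use of the donor mean as the imputed value are all assumptions made for illustration. Only the NNI variant with the Mahalanobis distance is sketched; the variant with principal component analysis is not covered.

```python
import numpy as np

def forward_impute(X, k=1):
    """Illustrative sketch: impute NaNs sequentially, most complete rows first."""
    X = np.array(X, dtype=float)              # work on a copy of the input
    n_missing = np.isnan(X).sum(axis=1)       # missing entries per unit
    donors = np.flatnonzero(n_missing == 0)   # complete units are the first donors
    if donors.size <= X.shape[1]:
        raise ValueError("need more complete rows than columns")

    # Visit incomplete units in groups of increasing number of missing values.
    for m in np.unique(n_missing[n_missing > 0]):
        group = np.flatnonzero(n_missing == m)
        D = X[donors]                                   # snapshot of current donors
        cov = np.atleast_2d(np.cov(D, rowvar=False))    # covariance from donors only
        for i in group:
            obs = ~np.isnan(X[i])
            if not obs.any():
                # Nothing observed: fall back to donor column means (assumption).
                X[i] = D.mean(axis=0)
                continue
            # Squared Mahalanobis distance restricted to the observed variables.
            diff = D[:, obs] - X[i, obs]
            S_inv = np.linalg.pinv(cov[np.ix_(obs, obs)])
            d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
            nearest = np.argsort(d2)[:k]
            # Fill the missing entries with the mean of the k nearest donors.
            X[i, ~obs] = D[nearest][:, ~obs].mean(axis=0)
        # Units imputed at this step become donors for the next step.
        donors = np.concatenate([donors, group])
    return X

# Hypothetical toy data: four complete units and one unit with a missing value.
X_demo = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 6.0],
                   [4.0, 8.5],
                   [2.1, np.nan]])
print(forward_impute(X_demo, k=1))
```

In the toy data, the incomplete unit is closest to the second donor on its observed variable, so with `k=1` its missing entry is filled with that donor's value, 4.0. Larger values of `k` average over several donors.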


62H25 Factor analysis and principal components; correspondence analysis
62-07 Data analysis (statistics) (MSC2010)
62-04 Software, source code, etc. for problems pertaining to statistics
62H99 Multivariate analysis


[1] Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the Forward Search. Springer, New York · Zbl 1049.62057
[2] Azzalini A (2015) R package “sn”: the skew-normal and skew-t distributions (version 1.2-4). http://azzalini.stat.unipd.it/SN
[3] Azzalini, A.; Capitanio, A., Statistical applications of the multivariate skew normal distribution, J R Stat Soc B, 61, 579-602, (1999) · Zbl 0924.62050
[4] Azzalini, A.; Dalla Valle, A., The multivariate skew-normal distribution, Biometrika, 83, 715-726, (1996) · Zbl 0885.62062
[5] Breiman, L., Random forests, Mach Learn, 45, 5-32, (2001) · Zbl 1007.68152
[6] Cox TF, Cox MAA (2001) Multidimensional scaling, 2nd edn. Chapman & Hall/CRC, Boca Raton · Zbl 1004.91067
[7] Ferrari, PA; Annoni, P.; Barbiero, A.; Manzi, G., An imputation method for categorical variables with application to nonlinear principal component analysis, Comput Stat Data Anal, 55, 2410-2420, (2011) · Zbl 1328.65028
[8] Gower, JC; Armitage, P. (ed.); Colton, T. (ed.), Principal coordinates analysis, (2005), New York
[9] Greenacre M (1984) Theory and applications of correspondence analysis. Academic Press, London · Zbl 0555.62005
[10] Groves RM, Dillman DA, Eltinge JL, Little RJA (2002) Survey nonresponse. Wiley, New York · Zbl 0976.00027
[11] Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, inference and prediction, 2nd edn. Springer, New York · Zbl 1273.62005
[12] Hollander M, Wolfe DA (1999) Nonparametric statistical methods, 2nd edn. Wiley-Interscience, New York · Zbl 0997.62511
[13] Husson F, Josse J (2015) missMDA: Handling missing values with/in multivariate data analysis (principal component methods). R package version 1.8.2. http://CRAN.R-project.org/package=missMDA
[14] Josse, J.; Pagès, J.; Husson, F., Multiple imputation in principal component analysis, Adv Data Anal Classif, 5, 231-246, (2011) · Zbl 1274.62409
[15] Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York · Zbl 1011.62004
[16] Mardia, KV, Measures of multivariate skewness and kurtosis with applications, Biometrika, 57, 519-530, (1970) · Zbl 0214.46302
[17] Nora-Chouteau C (1974) Une méthode de reconstitution et d’analyse de données incomplètes. PhD thesis, Université Pierre et Marie Curie
[18] R Core Team (2015) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org
[19] Rässler, S.; Rubin, DB; Zell, ER, Imputation, Wiley Interdiscip Rev Comput Stat, 5, 20-29, (2013)
[20] Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley, New York · Zbl 0711.62030
[21] Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC, London · Zbl 0997.62510
[22] Solaro N, Barbiero A, Manzi G, Ferrari PA (2014) Algorithmic-type imputation techniques with different data structures: alternative approaches in comparison. In: Vicari D, Okada A, Ragozini G, Weihs C (eds) Analysis and modeling of complex data in behavioural and social sciences. Studies in classification, data analysis, and knowledge organization. Springer International Publishing, Cham, pp 253-261
[23] Solaro N, Barbiero A, Manzi G, Ferrari PA (2015a) A comprehensive simulation study on the Forward Imputation. Working Paper 2015_4, Università degli Studi di Milano, Italy. https://ideas.repec.org/p/mil/wpdepa/2015-04.html
[24] Solaro N, Barbiero A, Manzi G, Ferrari PA (2015b) GenForImp: a sequential distance-based approach for imputing missing data. R package version 1.0.0. http://CRAN.R-project.org/package=GenForImp
[25] Stekhoven DJ (2013) missForest: nonparametric missing value imputation using random forest. R package version 1.4. http://CRAN.R-project.org/package=missForest
[26] Stekhoven, DJ; Bühlmann, P., MissForest—nonparametric missing value imputation for mixed-type data, Bioinformatics, 28, 112-118, (2012)
[27] Tarsitano A, Falcone M (2010) Missing values adjustment for mixed-type data. Working Paper n. 15-2010, Università della Calabria, Italy. https://ideas.repec.org/p/clb/wpaper/201015.html · Zbl 1229.62039
[28] Wasito, I.; Mirkin, B., Nearest neighbour approach in the least-squares data imputation algorithms, Inf Sci, 169, 1-25, (2005) · Zbl 1084.62043