Imputation and low-rank estimation with missing not at random data. (English) Zbl 1452.62174

Summary: Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion methods based on low-rank assumptions are a very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values, and a low-rank structure for the exponential family is assumed on this new matrix in order to encode links between variables and missing mechanisms. This methodology has the great advantage of handling different missing value mechanisms and is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase\(^{\circledR}\)) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict whether doctors should administer tranexamic acid, which would limit excessive bleeding, to patients with traumatic brain injury.
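The second contribution (concatenating the data matrix with the missingness mask and fitting a low-rank structure to the joint matrix) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a quadratic (Gaussian) loss on both the data block and the mask block instead of the exponential family model described in the summary, and it swaps in a plain softImpute-style iteration of soft-thresholded SVDs. The function names `svd_soft_threshold` and `impute_with_mask` and the parameters `lam`, `n_iter` and `tol` are hypothetical choices made for this illustration.

```python
import numpy as np


def svd_soft_threshold(Z, lam):
    """Shrink the singular values of Z by lam (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt


def impute_with_mask(X, lam=1.0, n_iter=200, tol=1e-6):
    """Impute the NaN entries of X from a low-rank fit of the joint matrix [X, mask].

    Assumption for this sketch: quadratic loss on both blocks, which is a
    simplification of the exponential family model in the summary.
    """
    mask = np.isnan(X)                       # True where a value is missing
    M = mask.astype(float)                   # mask block, always fully observed
    Y = np.hstack([np.where(mask, 0.0, X), M])
    observed = np.hstack([~mask, np.ones_like(mask, dtype=bool)])

    Theta = np.zeros_like(Y)                 # current low-rank estimate of [X, mask]
    for _ in range(n_iter):
        # Keep observed cells, fill missing cells with the current estimate.
        filled = np.where(observed, Y, Theta)
        Theta_new = svd_soft_threshold(filled, lam)
        change = np.linalg.norm(Theta_new - Theta)
        Theta = Theta_new
        if change <= tol * max(1.0, np.linalg.norm(Theta)):
            break

    X_hat = Theta[:, : X.shape[1]]           # drop the mask block
    return np.where(mask, X_hat, X)          # observed values are left untouched
```

Calling `impute_with_mask(X, lam=0.5)` on a NumPy array `X` whose missing cells are coded as `NaN` returns an array of the same shape with those cells filled in. Because the mask columns enter the same low-rank fit as the data columns, the estimate can pick up dependence between a variable's value and its probability of being missing, which is the intuition behind the surrogate approach. In practice `lam` would need tuning, for instance by cross-validation.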

MSC:

62D10 Missing data
62H12 Estimation in multivariate analysis
62P10 Applications of statistics to biology and medical sciences; meta analysis
