
Model selection and model averaging after multiple imputation. (English) Zbl 1471.62181

Summary: Model selection and model averaging are two important techniques to obtain practical and useful models in applied research. However, it is now well known that many complex issues arise, especially in the context of model selection, when the stochastic nature of the selection process is ignored and estimates, standard errors, and confidence intervals are calculated as if the selected model were known a priori. While model averaging aims to incorporate the uncertainty associated with the model selection process by combining estimates over a set of models, there is still some debate over appropriate interpretation and confidence interval construction. These problems become even more complex in the presence of missing data, and it is currently not entirely clear how to proceed. To deal with such situations, a framework for model selection and model averaging in the context of missing data is proposed. The focus lies on multiple imputation as a strategy to deal with the missingness: combining it consistently with model averaging aims to incorporate both the uncertainty associated with the model selection and that associated with the imputation process. Furthermore, the performance of bootstrapping as a flexible extension to the framework is evaluated. Monte Carlo simulations are used to reveal the nature of the proposed estimators in the context of the linear regression model. The practical implications of the approach are illustrated by means of a recent survival study on sputum culture conversion in pulmonary tuberculosis.
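The general idea of the summary (average over candidate models within each imputed data set, then pool across imputations) can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the MAMI package. Several choices here are assumptions made only for illustration: AIC-based weights with the variance formula of Buckland et al. [2], a toy stochastic regression imputation (which, unlike a proper multiple imputation, does not draw the imputation-model parameters), pooling by Rubin's rules, and all function names (`ols`, `model_average`, `impute_once`, `rubin`) and simulated data.

```python
import itertools
import numpy as np

def ols(X, y):
    """OLS fit: coefficients, their variances, and the Gaussian AIC."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    var = rss / (n - p) * np.diag(np.linalg.inv(X.T @ X))
    aic = n * np.log(rss / n) + 2 * (p + 1)
    return beta, var, aic

def model_average(X, y):
    """AIC-weighted average over all covariate subsets (column 0 = intercept, always kept)."""
    p = X.shape[1]
    betas, variances, aics = [], [], []
    for k in range(p):
        for subset in itertools.combinations(range(1, p), k):
            cols = [0, *subset]
            b, v, aic = ols(X[:, cols], y)
            beta, var = np.zeros(p), np.zeros(p)  # excluded coefficients are set to 0
            beta[cols], var[cols] = b, v
            betas.append(beta); variances.append(var); aics.append(aic)
    betas, variances, aics = map(np.array, (betas, variances, aics))
    w = np.exp(-0.5 * (aics - aics.min()))
    w /= w.sum()                                   # Akaike weights
    beta_bar = w @ betas
    # Standard error in the style of Buckland et al. (1997)
    se = w @ np.sqrt(variances + (betas - beta_bar) ** 2)
    return beta_bar, se ** 2

def impute_once(X, y, rng):
    """Toy stochastic regression imputation of column 1 (illustrative assumption)."""
    Xi = X.copy()
    miss = np.isnan(X[:, 1])
    Z = np.column_stack([np.delete(X, 1, axis=1), y])  # fully observed predictors
    b, *_ = np.linalg.lstsq(Z[~miss], X[~miss, 1], rcond=None)
    sd = (X[~miss, 1] - Z[~miss] @ b).std(ddof=Z.shape[1])
    Xi[miss, 1] = Z[miss] @ b + rng.normal(0.0, sd, miss.sum())
    return Xi

def rubin(estimates, variances):
    """Pool M estimates: mean, and within + (1 + 1/M) * between variance."""
    est, var = np.asarray(estimates), np.asarray(variances)
    m = est.shape[0]
    total = var.mean(axis=0) + (1 + 1 / m) * est.var(axis=0, ddof=1)
    return est.mean(axis=0), total

# Simulated example: linear model, 30% of one covariate missing completely at random
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)
X_obs = X.copy()
X_obs[rng.random(n) < 0.3, 1] = np.nan

M = 5  # number of imputations
results = [model_average(impute_once(X_obs, y, rng), y) for _ in range(M)]
beta_mi, var_mi = rubin([r[0] for r in results], [r[1] for r in results])
print(np.round(beta_mi, 2), np.round(np.sqrt(var_mi), 2))
```

The pooled variance therefore contains a within-imputation part, which already reflects model-selection uncertainty through the averaging step, and a between-imputation part reflecting the uncertainty due to the missing data.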

MSC:

62-08 Computational methods for problems pertaining to statistics
62J05 Linear regression; mixed models
62D10 Missing data
62F10 Point estimation

Software:

MAMI; copula; BMA; Amelia

References:

[1] Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Proceedings of the Second International Symposium on Information Theory, Budapest, pp. 267-281. · Zbl 0283.62006
[2] Buckland, S. T.; Burnham, K. P.; Augustin, N. H., Model selection: an integral part of inference, Biometrics, 53, 603-618, (1997) · Zbl 0885.62118
[3] Cavanaugh, J.; Shumway, R., An Akaike information criterion for model selection in the presence of incomplete data, Journal of Statistical Planning and Inference, 67, 45-65, (1998) · Zbl 1067.62504
[4] Chatfield, C., Model uncertainty, data mining and statistical inference, Journal of the Royal Statistical Society A, 158, 419-466, (1995)
[5] Claeskens, G.; Consentino, F., Variable selection with incomplete covariate data, Biometrics, 64, 1062-1069, (2008) · Zbl 1152.62388
[6] Claeskens, G.; Hjort, N. L., The focused information criterion (with discussion), Journal of the American Statistical Association, 98, 900-916, (2003)
[7] Draper, D., Assessment and propagation of model uncertainty, Journal of the Royal Statistical Society B, 57, 45-97, (1995) · Zbl 0812.62001
[8] Drechsler, J.; Rässler, S., Does convergence really matter?, (Shalabh; Heumann, C., Recent Advances in Linear Models and Related Areas, (2008), Springer), 342-355
[9] Fletcher, D.; Dillingham, P., Model-averaged confidence intervals for factorial experiments, Computational Statistics and Data Analysis, 55, 3041-3048, (2011)
[10] Hansen, B. E., Least squares model averaging, Econometrica, 75, 1175-1189, (2007) · Zbl 1133.91051
[11] Hansen, B. E.; Racine, J., Jackknife model averaging, Journal of Econometrics, 167, 38-46, (2012) · Zbl 1441.62721
[12] Hens, N.; Aerts, M.; Molenberghs, G., Model selection for incomplete and design based samples, Statistics in Medicine, 25, 2502-2520, (2006)
[13] Heumann, C., Grenke, M., 2010. An efficient model averaging procedure for logistic regression models using a Bayesian estimator with Laplace prior. In: Kneib, T., Tutz, G. (Eds.), Statistical Modelling and Regression Structures. Physica, pp. 79-90.
[14] Hjort, N. L.; Claeskens, G., Frequentist model average estimators, Journal of the American Statistical Association, 98, 879-945, (2003) · Zbl 1047.62003
[15] Hjort, N. L.; Claeskens, G., Focussed information criteria and model averaging for Cox’s hazard regression model, Journal of the American Statistical Association, 101, 1449-1464, (2006) · Zbl 1171.62350
[16] Hoeting, J. A.; Madigan, D.; Raftery, A. E.; Volinsky, C. T., Bayesian model averaging: a tutorial, Statistical Science, 14, 382-417, (1999) · Zbl 1059.62525
[17] Honaker, J.; King, G., What to do about missing values in time series cross-section data, American Journal of Political Science, 54, 561-581, (2010)
[18] Honaker, J., King, G., Blackwell, M., 2010. Amelia 2: a program for missing data. R Package version 1.5. http://gking.harvard.edu/amelia.
[19] Horton, N.; Kleinman, K., Much ado about nothing: a comparison of missing data methods and software to fit incomplete regression models, The American Statistician, 61, 79-90, (2007)
[20] Ishwaran, H.; Rao, J., Discussion, Journal of the American Statistical Association, 98, 922-925, (2003)
[21] Kabaila, P.; Leeb, H., On the large-sample minimal coverage probability of confidence intervals after model selection, Journal of the American Statistical Association, 101, 619-629, (2006) · Zbl 1119.62322
[22] Leeb, H.; Pötscher, B. M., Model selection and inference: facts and fiction, Econometric Theory, 21, 21-59, (2005) · Zbl 1085.62004
[23] Leeb, H.; Pötscher, B. M., Can one estimate the conditional distribution of post-model-selection estimators?, Annals of Statistics, 34, 2554-2591, (2006) · Zbl 1106.62029
[24] Leeb, H.; Pötscher, B. M., Can one estimate the unconditional distribution of post-model-selection estimators?, Econometric Theory, 24, 338-376, (2008) · Zbl 1284.62152
[25] Liang, H.; Zou, G.; Wan, A.; Zhang, X., Optimal weight choice for frequentist model average estimators, Journal of the American Statistical Association, 106, 1053-1066, (2011) · Zbl 1229.62090
[26] Lipsitz, S.; Parzen, M.; Zhao, L., A degrees-of-freedom approximation in multiple imputation, Journal of Statistical Computation and Simulation, 72, 309-318, (2002) · Zbl 0995.62006
[27] Little, R.; Rubin, D., Statistical analysis with missing data, (2002), Wiley New York · Zbl 1011.62004
[28] Magnus, J.; Powell, O.; Prüfer, P., A comparison of two model averaging techniques with an application to growth empirics, Journal of Econometrics, 154, 139-153, (2010) · Zbl 1431.62654
[29] Magnus, J.; Wan, A.; Zhang, X., Weighted average least squares estimation with nonspherical disturbances and an application to the Hong Kong housing market, Computational Statistics and Data Analysis, 55, 1331-1341, (2011) · Zbl 1328.65034
[30] May, M.; Boulle, A.; Phiri, S.; Messou, E.; Myer, L.; Wood, R.; Sterne, J.; Dabis, F.; Egger, M., Prognosis of patients with HIV-1 infection starting therapy in sub-Saharan Africa: a collaborative analysis of scale-up programmes, Lancet, 376, 449-457, (2010)
[31] Molenberghs, G.; Fitzmaurice, G., Incomplete data: introduction and overview, (Fitzmaurice, G.; Davidian, M.; Verbeke, G.; Molenberghs, G., Longitudinal Data Analysis, (2009), CRC Press), 395-408
[32] Pötscher, B., The distribution of model averaging estimators and an impossibility result regarding its estimation, (Ho, H.; Ing, C.; Lai, T., IMS Lecture Notes: Time Series and Related Topics, vol. 52, (2006)), 113-129 · Zbl 1268.62066
[33] Raftery, A., Hoeting, J., Volinsky, C., Painter, I., Yeung, K., 2011. BMA: Bayesian model averaging. R package version 3.14. http://CRAN.R-project.org/package=BMA.
[34] Rao, C.; Wu, Y., On model selection, IMS Lecture Notes - Monograph Series, 38, 1-64, (2001)
[35] Rubin, D., The Bayesian bootstrap, Annals of Statistics, 9, 130-134, (1981)
[36] Rubin, D.; Schenker, N., Multiple imputation for interval estimation from simple random samples with ignorable nonresponse, Journal of the American Statistical Association, 81, 366-374, (1986) · Zbl 0615.62011
[37] Schomaker, M., Shrinkage averaging estimation, Statistical Papers, 53, 1015-1034, (2012) · Zbl 1254.62082
[38] Schomaker, M.; Heumann, C., Model averaging in factor analysis: an analysis of Olympic decathlon data, Journal of Quantitative Analysis in Sports, 7, 1, (2011), Article 4
[39] Schomaker, M.; Wan, A. T.K.; Heumann, C., Frequentist model averaging with missing observations, Computational Statistics and Data Analysis, 54, 3336-3347, (2010) · Zbl 1284.62063
[40] Shimodaira, H., A new criterion for selecting models from partially observed data, (Cheeseman, P.; Oldford, R., Selecting Models from Data: Artificial Intelligence and Statistics, Vol. IV, (1994), Springer), 21-29 · Zbl 0828.62004
[41] Stone, M., Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society B, 36, 111-147, (1974) · Zbl 0308.62063
[42] Turek, D.; Fletcher, D., Model-averaged Wald confidence intervals, Computational Statistics and Data Analysis, 56, 2809-2815, (2012) · Zbl 1255.62141
[43] Visser, M.; Stead, M.; Walzl, G.; Warren, R.; Schomaker, M.; Grewal, H.; Swart, E.; Maartens, G., Baseline predictors of sputum conversion in pulmonary tuberculosis: importance of cavities, smoking, time to detection and W-Beijing genotype, PLoS ONE, 7, e29588, (2012)
[44] Wan, A. T.K.; Zhang, X.; Zou, G. H., Least squares model averaging by Mallows criterion, Journal of Econometrics, 156, 277-283, (2010) · Zbl 1431.62291
[45] Wang, H.; Zhang, X.; Zou, G., Frequentist model averaging: a review, Journal of Systems Science and Complexity, 22, 732-748, (2009) · Zbl 1300.93164
[46] Wang, H., Zhou, S., 2012. Interval estimation by frequentist model averaging, Communications in Statistics—Theory and Methods (2013) (forthcoming). · Zbl 1462.62117
[47] Wang, H.; Zou, G.; Wan, A., Model averaging for varying-coefficient partially linear measurement error models, Electronic Journal of Statistics, 6, 1017-1039, (2012) · Zbl 1281.62054
[48] White, I.; Royston, P.; Wood, A., Multiple imputation using chained equations, Statistics in Medicine, 30, 377-399, (2011)
[49] Wood, A.; White, I.; Royston, P., How should variable selection be performed with multiply imputed data?, Statistics in Medicine, 27, 3227-3246, (2008)
[50] Yan, J., Enjoy the joy of copulas: with package copula, Journal of Statistical Software, 21, 1-21, (2007)
[51] Zhang, X.; Wan, A.; Zhou, S., Focused information criteria, model selection and model averaging in a Tobit model with a non-zero threshold, Journal of Business & Economic Statistics, 30, 132-142, (2012)