Boosted multivariate trees for longitudinal data. (English) Zbl 1453.68156

Summary: Machine learning methods provide a powerful approach for analyzing longitudinal data, in which repeated measurements are observed for a subject over time. We boost multivariate trees to fit a novel, flexible semi-nonparametric marginal model for longitudinal data. In this model, features are assumed to be nonparametric, while feature-time interactions are modeled semi-nonparametrically using \(P\)-splines with an estimated smoothing parameter. To avoid overfitting, we describe a relatively simple in-sample cross-validation method that can be used to estimate the optimal boosting iteration and that has the surprising added benefit of stabilizing certain parameter estimates. Our new multivariate tree boosting method is shown to be highly flexible, robust to covariance misspecification and unbalanced designs, and resistant to overfitting in high dimensions. Feature selection can be used to identify important features and feature-time interactions. An application to longitudinal measurements of forced expiratory volume in 1 second (FEV1) for lung transplant patients identifies an important feature-time interaction and illustrates the ease with which our method can find complex relationships in longitudinal data.
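The two core ingredients of the summary above, gradient boosting of trees and a validation-based choice of the stopping iteration, can be illustrated with a minimal sketch. This is not the authors' multivariate-tree/\(P\)-spline procedure: it is a simplified single-response gradient boosting of depth-1 regression trees (stumps) under squared-error loss, with the optimal number of iterations estimated on held-out data in the spirit of the cross-validated stopping rule described in the summary. All function names here are illustrative.

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-1 regression tree (stump) to residuals r by
    exhaustive search over split points, minimizing squared error."""
    n, p = X.shape
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, rs = X[order, j], r[order]
        csum = np.cumsum(rs)
        total = csum[-1]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # cannot split between identical values
            nl, nr = i, n - i
            ml, mr = csum[i - 1] / nl, (total - csum[i - 1]) / nr
            score = -(nl * ml * ml + nr * mr * mr)  # SSE up to a constant
            if score < best[0]:
                best = (score, j, (xs[i] + xs[i - 1]) / 2, ml, mr)
    return best[1:]  # (feature index, threshold, left value, right value)

def predict_stump(stump, X):
    j, s, vl, vr = stump
    return np.where(X[:, j] <= s, vl, vr)

def boost(X, y, X_val, y_val, n_iter=200, nu=0.1):
    """Gradient boosting of stumps for squared-error loss.  Validation
    error is tracked at every iteration so the stopping iteration can
    be chosen as its minimizer afterwards."""
    f0 = y.mean()
    F, F_val = np.full(len(y), f0), np.full(len(y_val), f0)
    stumps, val_err = [], []
    for _ in range(n_iter):
        stump = fit_stump(X, y - F)          # fit to current residuals
        F += nu * predict_stump(stump, X)    # shrunken update
        F_val += nu * predict_stump(stump, X_val)
        stumps.append(stump)
        val_err.append(np.mean((y_val - F_val) ** 2))
    m_opt = int(np.argmin(val_err)) + 1      # estimated optimal iteration
    return f0, stumps[:m_opt], m_opt, val_err

# Illustrative usage on synthetic data with a nonlinear signal.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.2 * rng.standard_normal(300)
f0, stumps, m_opt, val_err = boost(X[:200], y[:200], X[200:], y[200:])
```

The shrinkage factor `nu` plays the usual gradient-boosting role of slowing learning so that the validation curve has a well-defined minimum; the full method in the paper replaces the stumps with multivariate trees whose terminal values are expanded over a \(P\)-spline basis in time.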

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62G86 Nonparametric inference and fuzziness
62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH)
62P10 Applications of statistics to biology and medical sciences; meta analysis
