Tree-structured modelling of categorical predictors in generalized additive regression. (English) Zbl 1416.62364

Summary: Generalized linear and additive models are efficient regression tools, but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focuses on the main effects of categorical predictors, using tree-type methods to obtain clusters of categories. When a predictor has many categories, one wants to know in particular which categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. A fitting algorithm is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on \(p\) values derived from a conditional inference procedure. In addition, the stability of the clusters and the relevance of the predictors are investigated by bootstrap methods. Several applications show the usefulness of the tree-structured approach, and small simulation studies demonstrate that the fitting procedure works well.
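The core idea of clustering categories by a tree-type split can be illustrated with a minimal sketch. This is a hypothetical toy version, not the authors' implementation: it assumes a Gaussian response, orders the categories by their mean response, takes the ordered binary split with the smallest residual sum of squares, and uses a plain F-test \(p\) value as a simple stand-in for the paper's conditional inference stopping criterion.

```python
import numpy as np
from scipy import stats

def split_categories(y, cat, alpha=0.05):
    """One greedy step of category clustering: order categories by their
    mean response, find the ordered binary split that minimizes the
    residual sum of squares, and accept it only if an F-test p-value
    (a simple stand-in for the conditional inference criterion) is
    below alpha. Returns (left_cluster, right_cluster, p) or None."""
    cats = np.unique(cat)
    means = {c: y[cat == c].mean() for c in cats}
    order = sorted(cats, key=means.get)
    best = None
    for k in range(1, len(order)):
        left = np.isin(cat, order[:k])
        rss = (((y[left] - y[left].mean()) ** 2).sum()
               + ((y[~left] - y[~left].mean()) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, order[:k], order[k:])
    rss0 = ((y - y.mean()) ** 2).sum()
    n = len(y)
    # F-statistic for the split, which adds one parameter to the model
    f_stat = (rss0 - best[0]) / (best[0] / (n - 2))
    p = 1 - stats.f.cdf(f_stat, 1, n - 2)
    return (best[1], best[2], p) if p < alpha else None

# Simulated example: six categories, but only two distinct effects.
rng = np.random.default_rng(1)
cat = np.array(list("abcdef"))[rng.integers(0, 6, 400)]
effect = np.where(np.isin(cat, ["a", "b", "c"]), 0.0, 2.0)
y = effect + rng.normal(0.0, 0.5, 400)
res = split_categories(y, cat)  # recovers {a, b, c} vs {d, e, f}
```

Applied recursively within each resulting cluster until no split is significant, this yields the tree-structured partition of categories; the paper's full method additionally accommodates smooth additive terms for metric predictors, which the sketch omits.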


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J12 Generalized linear models (logistic models)
62J02 General nonlinear regression


[1] Belitz C, Brezger A, Kneib T, Lang S, Umlauf N (2015) BayesX: software for Bayesian inference in structured additive regression models. R package version 1.0-0
[2] Berger M (2017) structree: tree-structured clustering. R package version 1.1.4
[3] Bondell, HD; Reich, BJ, Simultaneous factor selection and collapsing levels in ANOVA, Biometrics, 65, 169-177, (2009) · Zbl 1159.62048
[4] Breiman, L., Random forests, Mach Learn, 45, 5-32, (2001) · Zbl 1007.68152
[5] Breiman L, Friedman JH, Olshen RA, Stone JC (1984) Classification and regression trees. Wadsworth, Monterey · Zbl 0541.62042
[6] Bühlmann, P.; Yu, B., Boosting with the L2 loss: regression and classification, J Am Stat Assoc, 98, 324-339, (2003) · Zbl 1041.62029
[7] Bürgin, R.; Ritschard, G., Tree-based varying coefficient regression for longitudinal ordinal responses, Comput Stat Data Anal, 86, 65-80, (2015) · Zbl 1468.62033
[8] Chen, J.; Yu, K.; Hsing, A.; Therneau, TM, A partially linear tree-based regression model for assessing complex joint gene-gene and gene-environment effects, Genet Epidemiol, 31, 238-251, (2007)
[9] Dusseldorp, E.; Meulman, JJ, The regression trunk approach to discover treatment covariate interaction, Psychometrika, 69, 355-374, (2004) · Zbl 1306.62405
[10] Dusseldorp, E.; Conversano, C.; Os, BJ, Combining an additive and tree-based regression model simultaneously: STIMA, J Comput Graph Stat, 19, 514-530, (2010)
[11] Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton · Zbl 0835.62038
[12] Eilers, PHC; Marx, BD, Flexible smoothing with B-splines and penalties, Stat Sci, 11, 89-121, (1996) · Zbl 0955.62562
[13] Fan, J.; Li, R., Variable selection via nonconcave penalized likelihood and its oracle properties, J Am Stat Assoc, 96, 1348-1360, (2001) · Zbl 1073.62547
[14] Fisher, WD, On grouping for maximum homogeneity, J Am Stat Assoc, 53, 789-798, (1958) · Zbl 0084.35904
[15] Friedman, JH, Greedy function approximation: a gradient boosting machine, Ann Stat, 29, 1189-1232, (2001) · Zbl 1043.62034
[16] Friedman, JH; Hastie, T.; Tibshirani, R., Additive logistic regression: a statistical view of boosting, Ann Stat, 28, 337-407, (2000) · Zbl 1106.62323
[17] Gertheiss, J.; Tutz, G., Sparse modeling of categorial explanatory variables, Ann Appl Stat, 4, 2150-2180, (2010) · Zbl 1220.62092
[18] Hastie T, Tibshirani R (1990) Generalized additive models. Chapman & Hall, London · Zbl 0747.62061
[19] Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning, 2nd edn. Springer, New York · Zbl 1273.62005
[20] Hothorn, T.; Hornik, K.; Zeileis, A., Unbiased recursive partitioning: a conditional inference framework, J Comput Graph Stat, 15, 651-674, (2006)
[21] Ishwaran, H., Variable importance in binary regression trees and forests, Electron J Stat, 1, 519-537, (2007) · Zbl 1320.62158
[22] McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, New York · Zbl 0588.62104
[23] Morgan, JN; Sonquist, JA, Problems in the analysis of survey data, and a proposal, J Am Stat Assoc, 58, 415-435, (1963)
[24] Oelker M-R (2015) gvcm.cat: regularized categorical effects/categorical effect modifiers/continuous/smooth effects in GLMs. R package version 1.9
[25] Oelker, M-R; Tutz, G., A uniform framework for the combination of penalties in generalized structured models, Adv Data Anal Classif, 1, 97-120, (2015)
[26] Quinlan, JR, Induction of decision trees, Mach Learn, 1, 81-106, (1986)
[27] Quinlan JR (1993) Programs for machine learning. Morgan Kaufmann, San Francisco
[28] Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
[29] Sandri, M.; Zuccolotto, P., A bias correction algorithm for the Gini variable importance measure in classification trees, J Comput Graph Stat, 17, 611-628, (2008)
[30] Sela, RJ; Simonoff, JS, Re-EM trees: a data mining approach for longitudinal and clustered data, Mach Learn, 86, 169-207, (2012) · Zbl 1238.68131
[31] Strobl, C.; Boulesteix, A-L; Kneib, T.; Augustin, T.; Zeileis, A., Conditional variable importance for random forests, BMC Bioinform, 9, 307, (2008)
[32] Strobl, C.; Malley, J.; Tutz, G., An introduction to recursive partitioning: rationale, application and characteristics of classification and regression trees, bagging and random forests, Psychol Methods, 14, 323-348, (2009)
[33] Su, X.; Tsai, C-L; Wang, MC, Tree-structured model diagnostics for linear regression, Mach Learn, 74, 111-131, (2009) · Zbl 1200.68083
[34] Tutz, G.; Gertheiss, J., Rating scales as predictors—the old question of scale level and some answers, Psychometrika, 79, 357-376, (2014) · Zbl 1308.62151
[35] Tutz, G.; Gertheiss, J., Regularized regression for categorical data, Stat Model, 16, 161-200, (2016)
[36] Tutz, G.; Oelker, M., Modeling clustered heterogeneity: fixed effects, random effects and mixtures, Int Stat Rev, 85, 204-227, (2016)
[37] Umlauf, N.; Adler, D.; Kneib, T.; Lang, S.; Zeileis, A., Structured additive regression models: an R interface to BayesX, J Stat Softw, 63, 1-46, (2015)
[38] Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, London
[39] Wood, SN, Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models, J R Stat Soc B, 73, 3-36, (2011)
[40] Yu, K.; Wheeler, W.; Li, Q.; Bergen, AW; Caporaso, N.; Chatterjee, N.; Chen, J., A partially linear tree-based regression model for multivariate outcomes, Biometrics, 66, 89-96, (2010) · Zbl 1187.62182
[41] Zeileis, A.; Hothorn, T.; Hornik, K., Model-based recursive partitioning, J Comput Graph Stat, 17, 492-514, (2008)
[42] Zhang H, Singer B (1999) Recursive partitioning in the health sciences. Springer, New York · Zbl 0920.62135