×

zbMATH — the first resource for mathematics

Classification tree analysis using TARGET. (English) Zbl 1452.62445
Summary: Tree models are valuable tools for predictive modeling and data mining. Traditional tree-growing methodologies such as CART are known to suffer from problems including greediness, instability, and bias in split rule selection. Alternative tree methods, including Bayesian CART [H. A. Chipman et al., “Bayesian CART model search”, J. Am. Stat. Assoc. 93, No. 443, 935–948 (1998; doi:10.1080/01621459.1998.10473750); D. G. T. Denisonet al., “Comment”, ibid. 93, No. 443, 954–957 (1998; doi:10.1080/01621459.1998.10473753)], random forests [L. Breiman, Mach. Learn. 45, No. 1, 5–32 (2001; Zbl 1007.68152)], bootstrap bumping [R. Tibshirani and K. Knight, “Model search by bootstrap ‘bumping”’, J. Comput. Graph. Stat. 8, No. 4, 671 (1999; doi:10.2307/1390820)], QUEST [W.-Y. Loh and Y.-S. Shih, Stat. Sin. 7, No. 4, 815–840 (1997; Zbl 1067.62545)], and CRUISE [H. Kim and W.-Yin Loh, “Classification trees with unbiased multiway splits”, J. Am. Stat. Assoc. 96, No. 454, 589–604 (2001; doi:10.1198/016214501753168271)], have been proposed to resolve these issues from various aspects, but each has its own drawbacks.
In [TARGET: tree analysis with randomly generated and evolved trees. Technical Report, Applied Statistics Program. Tuscaloosa, AL: The University of Alabama. (2003)], the authors described a genetic algorithm approach to constructing decision trees called tree analysis with randomly generated and evolved trees (TARGET) that performs a better search of the tree model space and largely resolves the problems with current tree modeling techniques. Utilizing the Bayesian information criterion (BIC), the authors developed a version of TARGET for regression tree analysis [“Regression tree analysis using TARGET”, J. Comput. Graph. Stat. 14, No. 1, 206–218 (2005; doi:10.1198/106186005x37210)]. In this article, we consider the construction of classification trees using TARGET. We modify the BIC to handle a categorical response variable, but we also adjust its penalty component to better account for the model complexity of TARGET. We also incorporate the option of splitting rules based on linear combinations of two or three variables in TARGET, which greatly improves the prediction accuracy of TARGET trees. Comparisons of TARGET to existing methods, using simulated and real data sets, indicate that TARGET has advantages over these other approaches.

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-08 Computational methods for problems pertaining to statistics
68T05 Learning and adaptive systems in artificial intelligence
90C59 Approximation methods and heuristics in mathematical programming
Software:
rpart; ElemStatLearn
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Baragona, R.; Battaglia, F.; Calzini, C., Genetic algorithms for the identification of additive and innovation outliers in time series, Comput. statist. data anal., 37, 1-12, (2001) · Zbl 1030.62063
[2] Breiman, L., The heuristics of instability and stabilization in model selection, Ann. statist., 24, 2350-2383, (1996) · Zbl 0867.62055
[3] Breiman, L., Bagging predictors, Mach. learn., 24, 123-140, (1996) · Zbl 0858.68080
[4] Breiman, L., Random forests, Mach. learn., 45, 5-32, (2001) · Zbl 1007.68152
[5] Breiman, L., Statistical modeling: the two cultures, Statist. sci., 16, 199-231, (2001) · Zbl 1059.62505
[6] Breiman, L.; Friedman, J.; Olshen, R.; Stone, C., Classification and regression trees, (1984), Wadsworth Belmont, CA · Zbl 0541.62042
[7] Chatterjee, S.; Laudatto, M.; Lynch, L.A., Genetic algorithms and their statistical applications: an introduction, Comput. statist. data anal., 22, 633-651, (1996) · Zbl 0900.62336
[8] Chipman, H., 2004. Personal communication.
[9] Chipman, H.A.; George, E.I.; McCulloch, R.E., Bayesian CART model search (with discussions), J. amer. statist. assoc., 93, 935-960, (1998)
[10] Denison, D.G.T.; Smith, A.F.M.; Mallick, B.K., Comment, J. amer. statist. assoc., 93, 954-957, (1998)
[11] Fan, G.; Gray, J.B., Regression tree analysis using TARGET, J. computat. graphical statist., 14, 206-218, (2005)
[12] Freund, Y., Boosting a weak learning algorithm by majority, Inform. and comput., 121, 256-285, (1995) · Zbl 0833.68109
[13] Goldberg, D., Genetic algorithms in search, optimization, and machine learning, (1988), Addison-Wesley Reading, MA
[14] Gray, J.B.; Fan, G., TARGET: tree analysis with randomly generated and evolved trees, (2003), Technical Report, Applied Statistics Program The University of Alabama
[15] Hamada, M.; Martz, H.F.; Reese, C.S.; Wilson, A.G., Finding near-optimal Bayesian experimental designs via genetic algorithms, Amer. statist., 55, 175-181, (2001) · Zbl 1182.62156
[16] Hastie, T.; Tibshirani, R.; Friedman, J., The elements of statistical learning, data mining, inference and prediction, (2001), Springer New York · Zbl 0973.62007
[17] Holland, J.H., Adaptation in natural and artificial systems, (1975), University of Michigan Press Ann Arbor, MI
[18] Kass, G.V., An exploratory technique for investigating large quantities of categorical data, Appl. statist., 29, 119-127, (1980)
[19] Kim, H.; Loh, W.-Y., Classification trees with unbiased multiway splits, J. amer. statist. assoc., 96, 589-604, (2001)
[20] Kim, H.; Loh, W.-Y., Classification trees with bivariate linear discriminant node models, J. comput. graphical statist., 12, 512-530, (2003)
[21] Loh, W.-Y.; Shih, Y.-S., Split selection methods for classification trees, Statist. sin., 7, 815-840, (1997) · Zbl 1067.62545
[22] Mangasarian, O.L.; Wolberg, W.H., Cancer diagnosis via linear programming, SIAM news, 23, 1, (1990)
[23] Morgan, J.N.; Sonquest, J.A., Problems in the analysis of survey data and a proposal, J. amer. statist. assoc., 58, 415-435, (1963) · Zbl 0114.10103
[24] Schwarz, G., Estimating the dimension of a model, Ann. statist., 6, 461-464, (1978) · Zbl 0379.62005
[25] Therneau, T.M.; Atkinson, E.J., An introduction to recursive partitioning using the RPART routines, (1997), Technical Report Mayo Foundation
[26] Tibshirani, R.; Knight, K., Model search and inference by bootstrap bumping, J. comput. graphical statist., 8, 671-686, (1999)
[27] Wallet, B.C.; Marchette, D.J.; Solka, J.L.; Wegman, E.J., A genetic algorithm for best subset selection in linear regression, (), 545-550
[28] Zhang, H., Comment, J. amer. statist. assoc., 93, 948-950, (1998)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.