Boosting insights in insurance tariff plans with tree-based machine learning methods. (English) Zbl 1475.91306

Summary: Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and severity of claims. We adapt the loss functions used in the algorithms so that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side, combined with scarce but potentially long-tailed data on the severity side. A key requirement is the need for transparent and interpretable pricing models that are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work toward more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme. In addition, we present visualization tools to obtain insights from the resulting models, and we evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.
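
The summary does not say which loss functions are adapted. As a minimal sketch, assuming the common actuarial choice of a Poisson deviance for claim counts (with exposure scaling the expected count) and a gamma deviance for claim amounts, the two losses could look as follows. The function names, argument names, and the optional weights are illustrative assumptions, not taken from the paper.

```python
import math


def poisson_deviance(claims, predicted_rate, exposure):
    """Mean Poisson deviance for claim counts with varying exposure.

    claims: observed claim counts per policy (non-negative integers)
    predicted_rate: predicted claim frequency per unit of exposure (> 0)
    exposure: period each policy was in force, e.g. fraction of a year (> 0)
    All names are hypothetical and chosen for illustration only.
    """
    total = 0.0
    for y, rate, e in zip(claims, predicted_rate, exposure):
        mu = e * rate  # expected count scales with exposure
        # y * log(y / mu) is defined as 0 when y == 0 (the many zero-claim policies)
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(claims)


def gamma_deviance(severity, predicted_severity, weights=None):
    """Weighted mean gamma deviance for strictly positive claim amounts.

    weights: optional per-observation weights (assumed here to be, for
    example, the number of claims behind each average amount).
    """
    if weights is None:
        weights = [1.0] * len(severity)
    total = 0.0
    for y, mu, w in zip(severity, predicted_severity, weights):
        total += 2.0 * w * ((y - mu) / mu - math.log(y / mu))
    return total / sum(weights)


if __name__ == "__main__":
    # Three policies: two without claims, one with a single claim in half a year.
    print(poisson_deviance([0, 0, 1], [0.10, 0.15, 0.20], [1.0, 0.5, 0.5]))
    # Two observed claim amounts against their predictions.
    print(gamma_deviance([1200.0, 300.0], [1000.0, 450.0]))
```

Both functions return zero for a perfect prediction and grow as predictions move away from the observations, so either can serve as the out-of-sample criterion when comparing tuning parameters across cross-validation folds.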


91G05 Actuarial mathematics
68T05 Learning and adaptive systems in artificial intelligence

