Model-based and nonparametric approaches to clustering for data compression in actuarial applications. (English) Zbl 07059858

Summary: Clustering is used by actuaries in a data compression process to make massive or nested stochastic simulations practical to run. A large data set of assets or liabilities is partitioned into a user-defined number of clusters, each of which is compressed to a single representative policy. The representative policies can then simulate the behavior of the entire portfolio over a large range of stochastic scenarios. Such processes are becoming increasingly important in understanding product behavior and assessing reserving requirements in a big-data environment. This article proposes a variety of clustering techniques that can be used for this purpose. Initialization methods for performing clustering compression are also compared, including principal components, factor analysis, and segmentation. A variety of methods for choosing a cluster’s representative policy is considered. A real data set comprising variable annuity policies, provided by Milliman, is used to test the proposed methods. It is found that the compressed data sets produced by the new methods, namely, model-based clustering, Ward’s minimum variance hierarchical clustering, and k-medoids clustering, can replicate the behavior of the uncompressed (seriatim) data more accurately than those obtained by the existing Milliman method. This is verified within sample by examining location variable totals of the representative policies versus the uncompressed data at the five levels of compression of interest. More crucially, it is also verified out of sample by comparing the distributions of the present values of several variables after 20 years across 1000 simulated scenarios based on the compressed and seriatim data, using Kolmogorov-Smirnov goodness-of-fit tests and weighted sums of squared differences.
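
The compression idea in the summary can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names (`k_medoids`, `compress`, `totals`, `ks_statistic`), the evenly spaced default initialization, the choice of the medoid weighted by cluster size as the representative policy, and the toy data are all assumptions made for illustration. The clustering step is a simple alternating k-medoids heuristic rather than full PAM, and the variables are left unscaled.

```python
from bisect import bisect_right


def sq_dist(a, b):
    """Squared Euclidean distance between two policies (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_medoids(policies, k, init=None, max_iter=100):
    """Alternating k-medoids heuristic; returns (medoid indices, labels)."""
    n = len(policies)
    # Assumed default initialization: evenly spaced row indices.
    medoids = list(init) if init is not None else [i * n // k for i in range(k)]

    def assign(meds):
        # Label each policy with the position of its nearest medoid.
        return [min(range(k), key=lambda j: sq_dist(p, policies[meds[j]]))
                for p in policies]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[j])
                continue
            # New medoid: member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda c: sum(
                sq_dist(policies[c], policies[i]) for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


def compress(policies, medoids, labels):
    """One representative per cluster: the medoid policy, weighted by cluster size."""
    return [(policies[m], labels.count(j)) for j, m in enumerate(medoids)]


def totals(weighted):
    """Column totals of a list of (policy, weight) pairs."""
    dims = len(weighted[0][0])
    return [sum(w * p[d] for p, w in weighted) for d in range(dims)]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


# Toy seriatim data (made up): (account value, guaranteed benefit) per policy.
seriatim = [(100.0, 10.0), (110.0, 12.0), (90.0, 8.0),
            (1000.0, 200.0), (1050.0, 210.0), (950.0, 190.0)]
medoids, labels = k_medoids(seriatim, k=2)
compressed = compress(seriatim, medoids, labels)

# Within-sample check: variable totals of compressed versus seriatim data.
seriatim_totals = totals([(p, 1) for p in seriatim])
compressed_totals = totals(compressed)
```

On this deliberately symmetric toy data the compressed totals coincide with the seriatim totals; real portfolios would generally show a gap, which is what the within-sample comparison measures. For the out-of-sample comparison, `ks_statistic` could be applied to two lists of simulated present values, one from the compressed data and one from the seriatim data, with smaller values indicating closer distributions.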


91-XX Game theory, economics, finance, and other social and behavioral sciences
62-XX Statistics


[1] Ackerman, M.; Ben-David, S.; Loker, D., Proceedings of the 26th AAAI Conference on Artificial Intelligence, 858-863, (2012)
[2] Anderberg, M. R., Cluster Analysis for Applications, (1973), New York: Academic Press · Zbl 0299.62029
[3] Banfield, J. D.; Raftery, A. E., Model-Based Gaussian and Non-Gaussian Clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[4] Bauer, E.; Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning, 36, 1-2, 105-139, (1999)
[5] Bellas, A.; Bouveyron, C.; Cottrell, M.; Lacaille, J., Model-Based Clustering of High-Dimensional Data Streams with Online Mixture of Probabilistic PCA, Advances in Data Analysis and Classification, 7, 3, 281-300, (2013) · Zbl 1273.62137
[6] Ben-Hur, A.; Guyon, I.; Brownstein, M. J.; Khodursky, A., Detecting Stable Clusters Using Principal Component Analysis, Functional Genomics: Methods and Protocols, 159-182, (2003), Totowa, NJ: Humana Press
[7] Beygelzimer, A.; Kakadet, S.; Langford, J.; Arya, S.; Mount, D.; Li, S.; Li, M. S., FNN: Fast Nearest Neighbor Search Algorithms and Applications, (2013)
[8] Brown, J. D., Statistics Corner Questions and Answers about Language Testing Statistics: Principal Components Analysis and Exploratory Factor Analysis—Definitions, Differences, and Choices, Shiken: JALT Testing & Evaluation SIG Newsletter, 13, 1, 26-30, (2009)
[9] Celeux, G.; Govaert, G., Gaussian Parsimonious Clustering Models, Pattern Recognition, 28, 5, 781-793, (1995)
[10] Chang, W. C., On Using Principal Components before Separating a Mixture of Two Multivariate Normal Distributions, Journal of the Royal Statistical Society, Series C (Applied Statistics), 32, 3, 267-275, (1983) · Zbl 0538.62050
[11] De la Cruz-Mesía, R.; Quintana, F. A.; Marshall, G., Model-Based Clustering for Longitudinal Data, Computational Statistics and Data Analysis, 52, 3, 1441-1457, (2008) · Zbl 1452.62454
[12] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B (Methodological), 39, 1-38, (1977) · Zbl 0364.62022
[13] Donoho, D. L., High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, AMS Math Challenges Lecture, 1-32, (2000)
[14] DuMouchel, W.; Volinsky, C.; Johnson, T.; Cortes, C.; Pregibon, D., Squashing Flat Files Flatter, Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, 6-15, (1999)
[15] Fayyad, U.; Smyth, P., From Massive Data Sets to Science Catalogs: Applications and Challenges, Proceedings of the Workshop on Massive Data Sets, National Research Council, 129-142, (1995)
[16] Fraley, C.; Raftery, A. E., Model-Based Clustering, Discriminant Analysis and Density Estimation, Journal of the American Statistical Association, 97, 611-631, (2002) · Zbl 1073.62545
[17] Fraley, C.; Raftery, A. E.; Scrucca, L., mclust version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation, (2012)
[18] Fraley, C.; Raftery, A. E.; Wehrens, R., Incremental Model-Based Clustering for Large Datasets with Small Clusters, Journal of Computational and Graphical Statistics, 14, 3, 529-546, (2005)
[19] Freedman, A.; Reynolds, C., Cluster Analysis: A Spatial Approach to Actuarial Modelling, (2008)
[20] Friedman, J., Regularized Discriminant Analysis, Journal of the American Statistical Association, 84, 165-175, (1989)
[21] Harman, H. H., Modern Factor Analysis, (1960), Chicago: University of Chicago Press · Zbl 0095.13403
[22] Hoeting, J. A.; Madigan, D.; Raftery, A. E.; Volinsky, C. T., Bayesian Model Averaging: A Tutorial, Statistical Science, 14, 4, 382-401, (1999) · Zbl 1059.62525
[23] Husson, F.; Josse, J.; Le, S.; Mazet, J., FactoMineR: Multivariate Exploratory Data Analysis and Data Mining with R. R package version 1.26. Vienna: R Foundation, (2014)
[24] Johnson, S., Hierarchical Clustering Schemes, Psychometrika, 32, 3, 241-254, (1967) · Zbl 1367.62191
[25] Jolliffe, I., Principal Component Analysis, (2002), New York: Springer · Zbl 1011.62064
[26] Junus, N.; Motiwalla, Z., A Discussion of Actuarial Guideline 43 for Variable Annuities, (2009)
[27] Kaiser, H. F., The Varimax Criterion for Analytic Rotation in Factor Analysis, Psychometrika, 23, 187-200, (1958) · Zbl 0095.33603
[28] Lange, K. L.; Zhou, H., On the Bumpy Road to the Dominant Mode, Scandinavian Journal of Statistics, 37, 4, 612-631, (2010) · Zbl 1226.62027
[29] Lê, S.; Josse, J.; Husson, F., FactoMineR: An R Package for Multivariate Analysis, Journal of Statistical Software, 25, 1, 1-18, (2008)
[30] McLachlan, G. J.; Krishnan, T., The EM Algorithm and Extensions, (1997), New York: Wiley
[31] Madigan, D.; Raghavan, N.; DuMouchel, W.; Nason, M.; Posse, C.; Ridgeway, G., Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction, Data Mining and Knowledge Discovery, 6, 2, 173-190, (2002) · Zbl 0996.68564
[32] Maechler, M.; Rousseeuw, P.; Struyf, A.; Hubert, M.; Hornik, K., cluster: Cluster Analysis Basics and Extensions, (2015)
[33] Mar, J. C.; McLachlan, G. J., Model-Based Clustering in Gene Expression Microarrays: An Application to Breast Cancer Data, International Journal of Software Engineering and Knowledge Engineering, 13, 6, 579-592, (2003)
[34] McParland, D.; Gormley, I., Model Based Clustering for Mixed Data: clustMD, Advances in Data Analysis and Classification, (2014)
[35] Müllner, D., fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python, Journal of Statistical Software, 53, 9, 1-18, (2013)
[36] Murphy, T. B.; Dean, N.; Raftery, A. E., Variable Selection and Updating in Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications, Annals of Applied Statistics, 4, 396-421, (2010) · Zbl 1189.62105
[37] Murphy, T. B.; Scrucca, L., Using Weights in mclust, (2012)
[38] Neumann, J.; Cramon, D.; Lohmann, G., Model-Based Clustering of Meta-Analytic Functional Imaging Data, Human Brain Mapping, 29, 2, 177-192, (2008)
[39] Pearson, K., On Lines and Planes of Closest Fit to Systems of Points in Space, Philosophical Magazine, 2, 11, 559-572, (1901) · JFM 32.0246.07
[40] Posse, C., Hierarchical Model-Based Clustering for Large Datasets, Journal of Computational and Graphical Statistics, 10, 464-486, (2001)
[41] Reynolds, C.; Man, S., Nested Stochastic Pricing: The Time Has Come, Product Matters, Society of Actuaries, 71, 16-20, (2008)
[42] Sanche, R.; Lonergan, K., Variable Reduction for Predictive Modelling with Clustering, Casualty Actuarial Society Forum (winter), 89-100, (2006)
[43] Spearman, C., General Intelligence, Objectively Determined and Measured, American Journal of Psychology, 15, 2, 201-292, (1904)
[44] Van Der Laan, M.; Pollard, K.; Bryan, J., A New Partitioning around Medoids Algorithm, Journal of Statistical Computation and Simulation, 73, 8, 575-584, (2003) · Zbl 1054.62075
[45] Ward, J. H., Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, 58, 301, 236-244, (1963)
[46] Wehrens, R.; Buydens, L. M.; Fraley, C.; Raftery, A. E., Model-Based Clustering for Image Segmentation and Large Datasets via Sampling, Journal of Classification, 21, 2, 231-253, (2004) · Zbl 1083.62051