Model-based clustering of probability density functions. (English) Zbl 1273.62140

Summary: Complex data, in which each statistical unit under study is described not by a single observation (or vector variable) but by a unit-specific sample of several or even many observations, are becoming increasingly common. Reducing these sample data to summary statistics, such as the average or the median, discards most of the inherent information (about variability, skewness or multi-modality). Full information is preserved only if each unit is described by a whole distribution. This new kind of data, also known as “distribution-valued data”, requires the development of adequate statistical methods.
This paper presents a method for grouping a set of probability density functions (pdfs) into homogeneous clusters when the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. Model building is challenging here because the domain space is infinite-dimensional and has a non-Euclidean geometry. By adopting a wavelet-based representation for the elements of the space, the task is accomplished using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
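The link between pdfs and hyper-spherical data can be illustrated with a small sketch. If f is a pdf, its square root has unit L2 norm, so the coefficients of sqrt(f) in any orthonormal basis form a unit vector. The code below is a minimal illustration under stated assumptions, not the method of the paper: the histogram density estimate, the Haar basis, the bin grid, the simulated Gaussian samples, and all function names are choices made here for illustration, and spherical k-means with a farthest-first start stands in for the mixture models on the hypersphere that the summary describes.

```python
import numpy as np

def sqrt_density_coords(sample, edges):
    """Coefficients of sqrt(histogram density) in the orthonormal basis of
    normalized bin indicators; these equal sqrt(bin proportion)."""
    counts, _ = np.histogram(sample, bins=edges)
    probs = counts / counts.sum()
    return np.sqrt(probs)  # squared entries sum to 1

def haar_transform(v):
    """Orthonormal Haar transform of a vector whose length is a power of two.
    Being orthogonal, it keeps the vector on the unit hypersphere."""
    v = np.asarray(v, dtype=float)
    details = []
    while v.size > 1:
        even, odd = v[0::2], v[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        v = (even + odd) / np.sqrt(2.0)              # smooth coefficients
    return np.concatenate([v] + details[::-1])

def spherical_kmeans(X, k, n_iter=20):
    """Hard-assignment clustering of unit vectors by cosine similarity
    (an illustrative stand-in for a mixture model on the hypersphere)."""
    # Farthest-first start: each new center is the point least similar
    # to the centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        sim = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[np.argmin(sim)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # leave the center unchanged if the cluster is empty
                centers[j] = total / norm
    return labels, centers

# Simulated distribution-valued data: 20 units, each a sample of 200 values.
rng = np.random.default_rng(0)
edges = np.linspace(-4.0, 4.0, 33)  # 32 bins (a power of two for Haar)
samples = [rng.normal(-1.0, 0.5, 200) for _ in range(10)]
samples += [rng.normal(1.0, 0.5, 200) for _ in range(10)]

X = np.array([haar_transform(sqrt_density_coords(s, edges)) for s in samples])
labels, centers = spherical_kmeans(X, k=2)
print(labels)
```

Each row of X is one unit's estimated density expressed as a point on the unit hypersphere in 32 dimensions. A model-based approach would replace the hard assignments with posterior membership probabilities from a fitted mixture.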


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation
42C40 Nontrigonometric harmonic analysis involving wavelets and other special systems
62H11 Directional data; spatial statistics
65C60 Computational problems in statistics (MSC2010)


Software: wmtsa; AS 136

