Basic statistics for distributional symbolic variables: a new metric-based approach. (English) Zbl 1414.62017

Summary: In data mining, a group of measurements is commonly described by summary statistics or by an empirical distribution function. Symbolic data analysis (SDA) addresses such data, allowing the description and analysis of conceptual data or of macrodata that summarize classical data. Within the conceptual framework of SDA, the paper presents new basic statistics for distribution-valued variables, i.e., variables whose realizations are distributions. The proposed measures extend classical univariate statistics (mean, variance, standard deviation) and bivariate statistics (covariance and correlation) to distribution-valued variables, taking into account the nature and the variability of such data. The new statistics are based on a distance between distributions: the \(\ell_2\) Wasserstein distance. A comparison with other univariate and bivariate statistics from the literature highlights some relevant properties of the proposed ones. An application to a clinical dataset shows the main differences in the interpretation of results.
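
As a rough illustration of the metric-based idea in the summary, the sketch below computes a mean and a variance for a small set of one-dimensional empirical distributions under the \(\ell_2\) Wasserstein distance. It relies on two standard facts for distributions on the real line: the squared distance equals the integral over \((0,1)\) of the squared difference of the quantile functions, and the Fréchet mean has the average quantile function. The function names, the midpoint quantile grid, the simulated samples, and the choice of variance as the mean squared distance to the mean are all assumptions made for this example; the paper's own definitions may differ in detail.

```python
import numpy as np


def quantile_grid(sample, levels):
    """Empirical quantile function of a sample, evaluated at the given levels."""
    return np.quantile(np.asarray(sample, dtype=float), levels)


def wasserstein2_sq(q_a, q_b):
    """Approximate squared L2 Wasserstein distance between two 1-D distributions.

    Both inputs are quantile functions sampled on the same uniform grid of
    levels, so the integral over (0, 1) is approximated by a plain average.
    """
    return float(np.mean((q_a - q_b) ** 2))


def frechet_mean(quantiles):
    """Quantile function of the Wasserstein (Frechet) mean: the row-wise average."""
    return np.mean(quantiles, axis=0)


def frechet_variance(quantiles):
    """Mean squared L2 Wasserstein distance of each distribution to the mean."""
    mean_q = frechet_mean(quantiles)
    return float(np.mean([wasserstein2_sq(q, mean_q) for q in quantiles]))


# Hypothetical data: three simulated samples standing in for the
# distribution-valued observations of one variable.
levels = (np.arange(200) + 0.5) / 200  # midpoint grid on (0, 1)
rng = np.random.default_rng(0)
params = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]  # assumed (location, scale) pairs
samples = [rng.normal(loc=m, scale=s, size=500) for m, s in params]

Q = np.vstack([quantile_grid(s, levels) for s in samples])
mean_q = frechet_mean(Q)   # quantile function of the mean distribution
var = frechet_variance(Q)  # scalar variance of the distribution-valued variable
std = var ** 0.5
```

Working with quantile functions on a shared grid keeps every step a plain array operation. The resulting variance reflects differences in location as well as in scale and shape between the observed distributions.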


62-07 Data analysis (statistics) (MSC2010)
62A99 Foundational topics in statistics

