Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein distance. (English) Zbl 1414.62306

Summary: In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. To measure the error between the observed and the predicted distributions, the \(\ell_2\) Wasserstein distance is proposed. Some properties of this metric are exploited to predict the modal response variable as a linear combination of the explanatory modal variables. Because the metric is based on the quantile functions associated with the data, the model is subject to a positivity constraint on the estimated parameters. We propose solving the linear regression problem by starting from a particular decomposition of the squared distance. Accordingly, we estimate the model parameters from two separate models, one for the averages of the data and one for the centered distributions, using a constrained least squares algorithm. Measures of goodness-of-fit are also proposed and discussed. The method is validated by two applications, one on simulated data and one on two real-world datasets.
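
The two-model idea in the summary can be illustrated with a small sketch. It is not the authors' implementation, and it rests on stated assumptions: quantile functions are discretized on an equally spaced grid, there is a single explanatory variable, and the decomposition used is the standard identity \(d_W^2(x,y) = (\mu_x - \mu_y)^2 + \int_0^1 (Q_x^c(t) - Q_y^c(t))^2\,dt\), where \(Q^c = Q - \mu\) is the centered quantile function. With one predictor, the non-negative least squares step reduces to clipping the ordinary slope at zero. All function names and the simulated data are hypothetical.

```python
import random
from statistics import fmean

def quantiles(sample, m=50):
    """Empirical quantile function evaluated at the m grid points (k + 0.5) / m."""
    s = sorted(sample)
    n = len(s)
    return [s[min(n - 1, int((k + 0.5) / m * n))] for k in range(m)]

def split(q):
    """Split a discretized quantile function into its mean and its centered part."""
    mu = fmean(q)
    return mu, [v - mu for v in q]

def w2_squared(qa, qb):
    """Squared l2 Wasserstein distance, approximated on the common grid."""
    return fmean((a - b) ** 2 for a, b in zip(qa, qb))

def fit(X, Y):
    """Fit the two sub-models for ONE explanatory variable (a simplification).
    X, Y: lists of discretized quantile functions, one per observation."""
    mx, cx = zip(*(split(q) for q in X))
    my, cy = zip(*(split(q) for q in Y))
    # Model 1: ordinary least squares on the means (intercept b0, slope b1).
    mean_x, mean_y = fmean(mx), fmean(my)
    sxx = sum((a - mean_x) ** 2 for a in mx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(mx, my))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Model 2: least squares on the centered quantile functions with gamma >= 0.
    # For a single regressor, non-negative least squares is the clipped slope.
    num = sum(a * b for qa, qb in zip(cx, cy) for a, b in zip(qa, qb))
    den = sum(a * a for qa in cx for a in qa)
    gamma = max(0.0, num / den)
    return b0, b1, gamma

def predict(q, b0, b1, gamma):
    """Predicted quantile function; gamma >= 0 keeps it non-decreasing."""
    mu, centered = split(q)
    return [b0 + b1 * mu + gamma * v for v in centered]

# Simulated data (assumption): y has location 2 + 1.5 * loc and half the spread of x.
rng = random.Random(0)
X, Y = [], []
for _ in range(30):
    loc, scale = rng.uniform(0, 10), rng.uniform(1, 3)
    X.append(quantiles([rng.gauss(loc, scale) for _ in range(200)]))
    Y.append(quantiles([rng.gauss(2 + 1.5 * loc, 0.5 * scale) for _ in range(200)]))

b0, b1, gamma = fit(X, Y)
pred = predict(X[0], b0, b1, gamma)
print(f"b0={b0:.3f}  b1={b1:.3f}  gamma={gamma:.3f}")
print(f"squared Wasserstein error, first unit: {w2_squared(pred, Y[0]):.4f}")
```

The positivity constraint matters because a negative coefficient on a centered quantile function would make the predicted function decreasing, which is not a valid quantile function. With several explanatory variables, the clipped slope would be replaced by a general non-negative least squares solver.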


62J05 Linear regression; mixed models
62G30 Order statistics; empirical distribution functions
46F10 Operations with distributions and generalized functions


SODAS; bootstrap

