
Classifying real-world data with the \(DD\alpha\)-procedure. (English) Zbl 1414.62258

Summary: The \(DD\alpha\)-classifier, a fast and very robust nonparametric procedure, is described and applied to fifty classification problems covering a broad spectrum of real-world data. The procedure first transforms the data from their original property space into a depth space, which is a low-dimensional unit cube, and then separates them by a projective-invariant procedure, called the \(DD\alpha\)-procedure. The transformation assigns to each data point its depth values with respect to the given classes. Several alternative depth notions (spatial depth, Mahalanobis depth, projection depth, and Tukey depth, the latter two being approximated by univariate projections) are used in the procedure and compared with regard to their average error rates. With the Tukey depth, which fits the shape of the distributions best and is the most robust, 'outsiders' appear, that is, data points having zero depth in all classes; these need additional treatment for classification. Evidence is also given about the dimension of the extended feature space needed for linear separation. The \(DD\alpha\)-procedure is available as an R package.
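
Illustration (not part of the reviewed paper or of its R package): the depth transformation described in the summary can be sketched in Python with NumPy, assuming the usual moment-based Mahalanobis depth \(1/(1+d^2)\) and a Tukey depth approximated by the minimum univariate halfspace depth over random directions. All function names, the number of directions and the toy data are hypothetical choices made for this sketch. The separation step in the depth space is not shown.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Mahalanobis depth 1 / (1 + d^2), with mean and covariance
    estimated by plain moments (an assumption of this sketch)."""
    mu = sample.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    diff = points - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances
    return 1.0 / (1.0 + d2)

def random_tukey_depth(points, sample, n_dirs=500, seed=0):
    """Approximate Tukey depth: minimum univariate halfspace depth
    over n_dirs random directions (n_dirs is a hypothetical default)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, sample.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_s = np.sort(sample @ dirs.T, axis=0)  # sample projections, sorted per direction
    proj_p = points @ dirs.T                   # projections of the query points
    n = sample.shape[0]
    depth = np.ones(points.shape[0])
    for j in range(n_dirs):
        # number of sample points with projection <= (resp. >=) the query projection
        below = np.searchsorted(proj_s[:, j], proj_p[:, j], side="right")
        above = n - np.searchsorted(proj_s[:, j], proj_p[:, j], side="left")
        depth = np.minimum(depth, np.minimum(below, above) / n)
    return depth

def depth_transform(points, classes, depth_fn):
    """Map each point to its vector of depths, one coordinate per class."""
    return np.column_stack([depth_fn(points, c) for c in classes])

# Toy data: two Gaussian classes in the plane and three query points.
rng = np.random.default_rng(1)
class_a = rng.normal(size=(60, 2))
class_b = rng.normal(size=(60, 2)) + 3.0
query = np.array([[0.0, 0.0], [3.0, 3.0], [100.0, 100.0]])

dd_maha = depth_transform(query, [class_a, class_b], mahalanobis_depth)
dd_tukey = depth_transform(query, [class_a, class_b], random_tukey_depth)

# 'Outsiders' have zero depth in every class and need separate treatment.
outsiders = np.all(dd_tukey == 0.0, axis=1)
print(dd_maha)
print(dd_tukey)
print(outsiders)
```

In this sketch the Mahalanobis depth is strictly positive everywhere, so it never produces outsiders, whereas the approximated Tukey depth vanishes for any point that some direction separates from the whole class sample. This matches the summary's remark that outsiders arise with the Tukey depth.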

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G35 Nonparametric robustness
62-04 Software, source code, etc. for problems pertaining to statistics
