×

Classification using proximity catch digraphs. (English) Zbl 1496.68277

Summary: We employ random geometric digraphs to construct semi-parametric classifiers. These data-random digraphs belong to parameterized random digraph families called proximity catch digraphs (PCDs). A related geometric digraph family, class cover catch digraph (CCCD), has been used to solve the class cover problem by using its approximate minimum dominating set and showed relatively good performance in the classification of imbalanced data sets. Although CCCDs have a convenient construction in \(\mathbb{R}^d\), finding their minimum dominating sets is NP-hard and their probabilistic behaviour is not mathematically tractable except for \(d=1\). On the other hand, a particular family of PCDs, called proportional-edge PCDs (PE-PCDs), has mathematically tractable minimum dominating sets in \(\mathbb{R}^d\); however their construction in higher dimensions may be computationally demanding. More specifically, we show that the classifiers based on PE-PCDs are prototype-based classifiers such that the exact minimum number of prototypes (equivalent to minimum dominating sets) is found in polynomial time on the number of observations. We construct two types of classifiers based on PE-PCDs. One is a family of hybrid classifiers that depends on the location of the points of the training data set, and another type is a family of classifiers solely based on class covers. We assess the classification performance of our PE-PCD based classifiers by extensive Monte Carlo simulations, and compare them with that of other commonly used classifiers. We also show that, similar to CCCD classifiers, our classifiers tend to be robust to the class imbalance in classification as well.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
05C20 Directed graphs (digraphs), tournaments
05C62 Graph representations (geometric and intersection representations, etc.)
05C69 Vertex subsets with special properties (dominating sets, independent sets, cliques, etc.)
05C80 Random graphs (graph-theoretic aspects)
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68R10 Graph theory (including graph drawing) in computer science

Software:

UCI-ml; JStatCom
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In Proceedings of 15th European conference on machine learning, Pisa (pp. 39-50). · Zbl 1132.68523
[2] Alcalá, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F., Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Multiple-Valued Logic and Soft Computing, 17, 2-3, 255-287 (2011)
[3] Alpaydın, E., Combined \(5\times 2\) CV \(F\) test for comparing supervised classification learning algorithms, Neural Computation, 11, 8, 1885-1892 (1999)
[4] Angiulli, F., Prototype-based domain description for one-class classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 6, 1131-1144 (2012)
[5] Arora, S.; Lund, C., Approximation algorithms for NP-hard problems (1996), Boston, MA: PWS Publishing, Boston, MA
[6] Bereg, S.; Cabello, S.; Díaz-Báñez, JM; Pérez-Lantero, P.; Seara, C.; Ventura, I., The class cover problem with boxes, Computational Geometry, 45, 7, 294-304 (2012) · Zbl 1239.65015
[7] Bien, J.; Tibshirani, R., Prototype selection for interpretable classification, The Annals of Applied Statistics, 5, 4, 2403-2424 (2011) · Zbl 1234.62096
[8] Cannon, AH; Cowen, LJ, Approximation algorithms for the class cover problem, Annals of Mathematics and Artificial Intelligence, 40, 3-4, 215-223 (2004) · Zbl 1075.68609
[9] Ceyhan, E. (2005). An investigation of proximity catch digraphs in delaunay tessellations. PhD thesis, Johns Hopkins University, Baltimore, MD.
[10] Ceyhan, E., Extension of one-dimensional proximity regions to higher dimensions, Computational Geometry, 43, 9, 721-748 (2010) · Zbl 1229.65046
[11] Ceyhan, E.; Priebe, CE, The use of domination number of a random proximity catch digraph for testing spatial patterns of segregation and association, Statistics & Probability Letters, 73, 1, 37-50 (2005) · Zbl 1080.62031
[12] Ceyhan, E.; Priebe, CE; Marchette, DJ, A new family of random graphs for testing spatial segregation, Canadian Journal of Statistics, 35, 1, 27-50 (2007) · Zbl 1124.05039
[13] Ceyhan, E.; Priebe, CE; Wierman, JC, Relative density of the random \(r\)-factor proximity catch digraph for testing spatial patterns of segregation and association, Computational Statistics & Data Analysis, 50, 8, 1925-1964 (2006) · Zbl 1445.62140
[14] Chvatal, V., A greedy heuristic for the set-covering problem, Mathematics of Operations Research, 4, 233-235 (1979) · Zbl 0443.90066
[15] Deng, X.; Zhu, B., A randomized algorithm for the Voronoi diagram of line segments on coarse-grained multiprocessors, Algorithmica, 24, 3-4, 270-286 (1999) · Zbl 0943.68174
[16] DeVinney, J., Priebe, C., Marchette, D., & Socolinsky, D. (2002). Random walks and catch digraphs in classification. In Proceedings of the 34th symposium on the interface, volume 34: Computing science and statistics, Montreal, Quebec. · Zbl 1055.62074
[17] DeVinney, J. G. (2003). The class cover problem and its application in pattern recognition. PhD thesis, Johns Hopkins University, Baltimore, MD.
[18] Devroye, L.; Gyorfi, L.; Lugosi, G., A probabilistic theory of pattern recognition (1996), New York: Springer, New York · Zbl 0853.68150
[19] Dietterich, TG, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation, 10, 7, 1895-1923 (1998)
[20] Downey, RG; Fellows, MR, Fundamentals of parameterized complexity (2013), New York: Springer, New York
[21] Dua, D., & Graff, C. (2019). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml. Accessed 10 July 2019.
[22] Eveland, CK; Socolinsky, DA; Priebe, CE; Marchette, DJ, A hierarchical methodology for class detection problems with skewed priors, Journal of Classification, 22, 1, 17-48 (2005) · Zbl 1083.62049
[23] Frey, PW; Slate, DJ, Letter recognition using holland-style adaptive classifiers, Machine Learning, 6, 2, 161-182 (1991)
[24] Gao, BJ; Ester, M.; Xiong, H.; Cai, JY; Schulte, O., The minimum consistent subset cover problem: A minimization view of data mining, IEEE Transactions on Knowledge and Data Engineering, 25, 3, 690-703 (2013)
[25] Hammer, P.; Liu, Y.; Simeone, B.; Szedmák, S., Saturated systems of homogeneous boxes and the logical analysis of numerical data, Discrete Applied Mathematics, 144, 1-2, 103-109 (2004) · Zbl 1078.62503
[26] Jameson, A.; Possenti, A.; Stappers, BW; Levin, L.; Bailes, M.; Burgay, M.; Kramer, M.; D’Amico, N.; Bhat, NDR; McMahon, PL; Burke-Spolaor, S.; Johnston, S.; Milia, S.; Bates, SD; van Straten, W.; Keith, MJ, The high time resolution universe pulsar survey ? I. system configuration and initial discoveries, Monthly Notices of the Royal Astronomical Society, 409, 2, 619-627 (2010)
[27] Jaromczyk, JW; Toussaint, GT, Relative neighborhood graphs and their relatives, Proceedings of the IEEE, 80, 9, 1502-1517 (1992)
[28] Karr, AF, Probability (1992), New York: Springer, New York
[29] Lawson, CL, Properties of n-dimensional triangulations, Computer Aided Geometric Design, 3, 4, 231-246 (1986) · Zbl 0624.65018
[30] Lyon, RJ; Stappers, B.; Cooper, S.; Brooke, J.; Knowles, J., Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 459, 1, 1104-1123 (2016)
[31] Manukyan, A.; Ceyhan, E., Classification of imbalanced data with a geometric digraph family, Journal of Machine Learning Research, 17, 189, 1-40 (2016) · Zbl 1392.62190
[32] Marchette, DJ, Random graphs for statistical pattern recognition (2004), Hoboken: Wiley, Hoboken · Zbl 1057.62048
[33] Mehta, M., Rissanen, J., & Agrawal, R. (1995). Mdl-based decision tree pruning. In Knowledge discovery and data mining (pp. 216-221).
[34] Morello, V.; Barr, E.; Bailes, M.; Flynn, C.; Keane, E.; van Straten, W., Spinn: A straightforward machine learning solution to the pulsar candidate selection problem, Monthly Notices of the Royal Astronomical Society, 443, 2, 1651-1662 (2014)
[35] Narasimhan, H., Pan, W., Kar, P., Protopapas, P., & Ramaswamy, H. G. (2016). Optimizing the multiclass f-measure via biconcave programming. In 2016 IEEE 16th international conference on data mining (ICDM) (pp. 1101-1106). IEEE.
[36] Parekh, AK, Analysis of a greedy heuristic for finding small dominating sets in graphs, Information Processing Letters, 39, 237-240 (1991) · Zbl 0746.05061
[37] Pȩkalska, E.; Duin, RP; Paclík, P., Prototype selection for dissimilarity-based classifiers, Pattern Recognition, 39, 2, 189-208 (2006) · Zbl 1080.68646
[38] Priebe, CE; DeVinney, JG; Marchette, DJ, On the distribution of the domination number for random class cover catch digraphs, Statistics & Probability Letters, 55, 3, 239-246 (2001) · Zbl 0999.05082
[39] Priebe, CE; Marchette, DJ; DeVinney, J.; Socolinsky, D., Classification using class cover catch digraphs, Journal of Classification, 20, 1, 3-23 (2003) · Zbl 1055.62074
[40] Priebe, CE; Solka, JL; Marchette, DJ; Clark, BT, Class cover catch digraphs for latent class discovery in gene expression monitoring by DNA microarrays, Computational Statistics & Data Analysis, 43, 4, 621-632 (2003) · Zbl 1429.62594
[41] Rissanen, J., Stochastic complexity in statistical inquiry theory (1989), River Edge: World Scientific Publishing Co. Inc, River Edge · Zbl 0800.68508
[42] Schölkopf, B.; Platt, JC; Shawe-Taylor, J.; Smola, AJ; Williamson, RC, Estimating the support of a high-dimensional distribution, Neural Computation, 13, 7, 1443-1471 (2001) · Zbl 1009.62029
[43] Seidel, R., The upper bound theorem for polytopes: An easy proof of its asymptotic version, Computational Geometry, 5, 2, 115-116 (1995) · Zbl 0831.68114
[44] Serafini, P., Classifying negative and positive points by optimal box clustering, Discrete Applied Mathematics, 165, 270-282 (2014) · Zbl 1358.68240
[45] Takigawa, I.; Kudo, M.; Nakamura, A., Convex sets as prototypes for classifying patterns, Engineering Applications of Artificial Intelligence, 22, 1, 101-108 (2009)
[46] Toussaint, GT, The relative neighborhood graph of a finite planar set, Pattern Recognition, 12, 4, 261-268 (1980) · Zbl 0437.05050
[47] Toussaint, G. T. (2002). Proximity graphs for nearest neighbor decision rules: Recent progress. In Proceedings of the 34th symposium on the interface, Montreal, Quebec (Vol. 34).
[48] Ungar, AA, Barycentric calculus in euclidean and hyperbolic geometry: A comparative introduction (2010), Singapore: World Scientific Publishing Co. Pte. Ltd., Singapore · Zbl 1213.51016
[49] Vazirani, VV, Approximation algorithms (2001), New York: Springer, New York
[50] Wang, W.; Xu, Z.; Lu, W.; Zhang, X., Determination of the spread parameter in the Gaussian kernel for classification and regression, Neurocomputing, 55, 3-4, 643-663 (2003)
[51] Warren, J., Barycentric coordinates for convex polytopes, Advances in Computational Mathematics, 6, 1, 97-108 (1996) · Zbl 0873.52013
[52] Watson, DF, Computing the \(n\)-dimensional Delaunay tessellation with application to Voronoi polytopes, The Computer Journal, 24, 2, 167-172 (1981)
[53] West, DB, Introduction to graph theory (2000), Englewood Cliffs: Prentice Hall, Englewood Cliffs
[54] Woźniak, M.; Graña, M.; Corchado, E., A survey of multiple classifier systems as hybrid systems, Information Fusion, 16, 3-17 (2014)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.