Pruning boxes in a box-based classification method. (English) Zbl 1414.62274

Summary: In this work we address an extension of box clustering in supervised classification problems that makes use of optimization problems to refine the results obtained by agglomerative techniques. The central concept of box clustering is that of homogeneous boxes that give rise to overtrained classifiers under some conditions. Thus, we focus our attentions on the issue of pruning out redundant boxes, using the information gleaned from the other boxes generated under the hypothesis that such a choice would identify simpler models with good predictive power. We propose a pruning method based on an integer optimization problem and a family of sub problems derived from the main one. The overall performances are then compared to the accuracy levels of competing methods on a wide range of real data sets. The method has proven to be robust, making it possible to derive a more compact system of boxes in the instance space with good performance on training and test data.


62H30 Classification and discrimination; cluster analysis (statistical aspects)


Full Text: DOI


[1] Almuallim H, Dietterich TG (1991) Learning with many irrelevant features. In: Proceedings of the ninth national conference on artificial intelligence, vol 2. AAAI Press, Menlo Park, CA, pp 547-552
[2] Bache K, Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://www.ics.uci.edu/ mlearn/MLRepository.html
[3] Bertolazzi, P.; Felici, G.; Festa, P.; Lancia, G., Logic classification and feature selection for biomedical data, Comput Math Appl, 55, 889-899, (2008) · Zbl 1139.92013
[4] Boros, E.; Hammer, PL; Ibaraki, T.; Kogan, A., Logical analysis of numerical data, Math Progr, 79, 163-190, (1997) · Zbl 0887.90179
[5] Boros, E.; Hammer, PL; Ibaraki, T.; Kogan, A.; Mayoraz, E.; Muchnik, I., An implementation of logical analysis of data, Knowl Data Eng IEEE Trans, 12, 292-306, (2000)
[6] Davenport MA, Baraniuk MG, Scott CD (2006a) Controlling false alarms with support vector machines. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), Toulouse, France
[7] Davenport MA, Baraniuk MG, Scott CD (2006b) Learning minimum volume sets with support vector machines. In: IEEE workshop on machine learning for signal processing (MLSP), Maynooth, Ireland
[8] Davenport MA, Baraniuk RG, Scott C (2010) Tuning support vector machines for minimax and neyman-pearson classification. IEEE Trans Pattern Anal Mach Intell 32(10):1888-1898. http://www.ece.rice.edu/ md/np_svm.php
[9] Eckstein, J.; Hammer, PL; Liu, Y.; Nediak, M.; Simeone, B., The maximum box problem and its application to data analysis, Comput Optim Appl, 23, 285-298, (2002) · Zbl 1028.90039
[10] Felici G, Simeone B, Spinelli V (2010) Classification techniques and error control in logic mining. In: Stahlbock R, Crone SF, Lessmann S (eds) Data mining, Annals of information systems, vol 8. Springer, London, pp 99-119. ISBN:978-1-4419-1279-4
[11] Grudzinski K, Grochowski M, Duch W (2010) Pruning classification rules with reference vector selection methods. In: Rutkowski L, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM (eds) ICAISC (1), Lecture notes in computer science, vol 6113. Springer, Berlin, Heidelberg, pp 347-354
[12] Hammer, PL; Liu, Y.; Simeone, B.; Szedmàk, S., Saturated systems of homogeneous boxes and the logical analysis of numerical data, Discret Appl Math, 144, 103-109, (2004) · Zbl 1078.62503
[13] Harris E (2002) Information gain versus gain ratio: a study of split method biases. In: ISAIM. http://dblp.uni-trier.de/db/conf/isaim/isaim2002.html
[14] Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. Knowl Data Eng IEEE Trans 17(3):299-310
[15] Kaneko A, Kano M (2003) Discrete geometry on red and blue points in the plane-a survey. In: Discrete and computational geometry. Springer, Berlin, Heidelberg, pp 551-570 · Zbl 1079.52505
[16] Kohavi, R.; Provost, F., Glossary of terms, Mach Learn, 30, 271-274, (1998)
[17] Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets
[18] Maravalle M, Ricca F, Simeone B, Spinelli V (2014) Carpal tunnel syndrome automatic classification: electromyography vs. ultrasound imaging. In: Proceedings of TOP, pp 1-24 · Zbl 1312.92027
[19] McCarthy K, Zabar B, Weiss G (2005) Does cost-sensitive learning beat sampling for classifying rare classes? In: Proceedings of the 1st international workshop on utility-based data mining, pp 69-77
[20] Nitesh, VC; Maimon, O. (ed.); Rokach, L. (ed.), Data mining for imbalanced datasets: an overview, 853-867, (2005), New York
[21] Schaffer C (1994) A conservative law for generalization performance. In: Cohen WW, Hirsh H (eds) Eleventh international conference on machine learning, ICML. Morgan Kaufmann, San Francisco, CA, pp 259-265. ISBN:1-55860-335-2
[22] Shah D, Lakshmanan LVS, Ramamritham K, Sudarshan S (1999) Interestingness and pruning of mined patterns. In: 1999 ACM SIGMOD workshop on research issues in data mining and knowledge discovery. http://dblp.uni-trier.de/db/conf/dmkd/dmkd1999.html
[23] Simeone B, Spinelli V (2007) The optimization problem framework for box clustering approach in logic mining. In: Book of abstract of Euro XXII-22nd European conference on operational research, Euro XXII
[24] Simeone B, Felici G, Spinelli V (2007) A graph coloring approach for box clustering techniques in logic mining. In: Book of Abstract of Euro XXII-22nd European conference on operational research, Euro XXII
[25] Sogaard A (2013) Semi-supervised learning and domain adaptation in natural language processing. Morgan & Claypool, San Rafael
[26] Weka (2013) Machine learning group- data mining software in java. University of Waikato, Department of Computer Science. http://www.cs.waikato.ac.nz/ml/weka
[27] Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco, CA. ISBN:0123748569, 9780123748560
[28] Wu S, Flach P (2005) A scored AUC metric for classifier evaluation and selection In: ICML’05 workshop on ROC Analysis in Machine Learning, Bonn, Germany, August 2005
[29] Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: Proceedings of the third IEEE international conference on data mining
[30] Zilberstein, S., Using anytime algorithms in intelligent systems, AI Mag, 17, 73-83, (1996)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.