Supervised feature selection by clustering using conditional mutual information-based distances. (English) Zbl 1191.68514

Summary: A supervised feature selection approach is presented, based on a metric applied to continuous and discrete data representations. The method builds a dissimilarity space using information-theoretic measures, in particular the conditional mutual information between features given a relevant variable that represents the class labels. By applying hierarchical clustering, the algorithm searches for a compression of the information contained in the original set of features. The proposed technique is compared with other state-of-the-art methods also based on information measures. Finally, several experiments are presented to demonstrate the effectiveness of the selected features in terms of classification accuracy.
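The pipeline the summary describes (a conditional-information dissimilarity between features, hierarchical clustering of that dissimilarity matrix, and one representative feature kept per cluster) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the specific distance used here, H(Xi|Xj,C) + H(Xj|Xi,C), the average-linkage choice, and all function names are assumptions made for the example; the sketch also assumes discretized features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def joint_entropy(*cols):
    """Plug-in estimate of the joint Shannon entropy (bits) of discrete columns."""
    counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)[1]
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_vi_distance(xi, xj, c):
    """H(Xi|Xj,C) + H(Xj|Xi,C): a variation-of-information-style dissimilarity
    between two features, conditioned on the class variable C.
    (Illustrative choice; the paper's exact metric may differ.)"""
    return (2 * joint_entropy(xi, xj, c)
            - joint_entropy(xi, c) - joint_entropy(xj, c))

def select_by_cmi_clustering(X, y, n_select):
    """Cluster features by the conditional dissimilarity above, then keep
    one representative per cluster: the member most informative about y."""
    n_feat = X.shape[1]
    D = np.zeros((n_feat, n_feat))
    for i in range(n_feat):
        for j in range(i + 1, n_feat):
            D[i, j] = D[j, i] = cond_vi_distance(X[:, i], X[:, j], y)
    # Average linkage is an illustrative choice, not necessarily the paper's.
    Z = linkage(squareform(D, checks=False), method='average')
    labels = fcluster(Z, t=n_select, criterion='maxclust')
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # Representative = cluster member with the largest I(Xm; Y).
        mi = [joint_entropy(X[:, m]) + joint_entropy(y)
              - joint_entropy(X[:, m], y) for m in members]
        selected.append(int(members[int(np.argmax(mi))]))
    return sorted(selected)
```

Redundant features (e.g. exact duplicates) have zero conditional distance, merge first in the dendrogram, and therefore contribute only one representative to the selected subset, which is the "compression" effect the summary refers to.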


68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition


ITIP; UCI-ml; C4.5
Full Text: DOI


[1] L.D. Baker, A. McCallum, Distributional clustering of words for text classification, in: 21st Annual International ACM SIGIR, ACM, August 1998, pp. 96-103.
[2] Battiti, R., Using mutual information for selecting features in supervised neural net learning, IEEE trans. neural networks, 5, 4, 537-550, (1994)
[3] Blum, A.L.; Langley, P., Selection of relevant features and examples in machine learning, Artif. intell., 97, 1-2, 245-271, (1997) · Zbl 0904.68142
[4] Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A., Classification and regression trees, (1984), CRC Press Boca Raton · Zbl 0541.62042
[5] Cover, T.; Hart, P., Nearest neighbor pattern classification, IEEE trans. inf. theory, IT-13, 1, 21-27, (1967) · Zbl 0154.44505
[6] Cover, T.M.; Thomas, J.A., Elements of information theory, (1991), Wiley New York · Zbl 0762.94001
[7] Cover, T.M., The best two independent measurements are not the two best, IEEE trans. syst. man cybern., 4, 116-117, (1974) · Zbl 0283.68061
[8] Cristianini, N.; Shawe-Taylor, J., An introduction to support vector machines and other kernel-based learning methods, (2000), Cambridge University Press Cambridge, UK
[9] Dhillon, I.; Mallela, S.; Kumar, R., A divisive information-theoretic feature clustering algorithm for text classification, J. Mach. learn. res., 3, 1265-1287, (2003) · Zbl 1102.68545
[10] Friedman, M., The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. am. stat. assoc., 32, 200, 675-701, (1937) · JFM 63.1098.02
[11] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Special issue on variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157-1182. · Zbl 1102.68556
[12] I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.), Feature extraction, foundations and applications, in: Series Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006. · Zbl 1114.68059
[13] Jain, A.K.; Duin, R.P.W.; Mao, J., Statistical pattern recognition: a review, IEEE trans. pattern anal. Mach. intell., 22, 1, 4-37, (2000)
[14] John, G.H.; Kohavi, R.; Pfleger, K., Irrelevant features and the subset selection problem, (), 121-129
[15] Kohavi, R.; John, G.H., Wrapper for feature subset selection, Artif. intell., 97, 1-2, 273-324, (1997) · Zbl 0904.68143
[16] Kudo, M.; Sklansky, J., Comparison of algorithms that select features for pattern classifiers, Pattern recognition, 33, 25-41, (2000)
[17] Kwak, N.; Choi, C.-H., Input feature selection by mutual information based on parzen window, IEEE trans. pattern anal. Mach. intell., 24, 12, 1667-1671, (2002)
[18] Kwak, N.; Choi, C.-H., Input feature selection for classification problems, IEEE trans. neural networks, 13, 1, 143-159, (2002)
[19] Hsu, C.W.; Lin, C.J., A comparison of methods for multi-class support vector machines, IEEE trans. neural networks, 13, 415-425, (2002), [Online]. Available: \(\langle\)www.csie.ntu.edu.tw/∼cjlin/libsvm〉
[20] Lin, J., Divergence measures based on the Shannon entropy, IEEE trans. inf. theory, 37, 1, 145-151, (1991) · Zbl 0712.94004
[21] Dash, M.; Liu, H., Feature selection for classification, Intelligent data anal., 1, 131-156, (1997)
[22] P.M. Murphy, UCI Repository of Machine Learning Databases \(\langle\)http://archive.ics.uci.edu/ml/〉, Department of Information and Computer Science, University of California, Irvine, CA, 1995.
[23] Peng, H.; Long, F.; Ding, C., Feature selection based on mutual information: criteria of MAX-dependency, MAX-relevance, and MIN-redundancy, IEEE trans. pattern anal. Mach. intell., 27, 8, 1226-1238, (2005)
[24] F.C. Pereira, N. Tishby, L. Lee, Distributional clustering of English words, in: 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1993, pp. 183-190.
[25] Pudil, P.; Ferri, F.J.; Novovicova, J.; Kittler, J., Floating search methods for feature selection with nonmonotonic criterion functions, in: Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, 1994, pp. 279-283.
[26] Quinlan, J.R., Improved use of continuous attributes in C4.5, J. artif. intell. res., 4, 77-90, (1996) · Zbl 0900.68112
[27] N. Slonim, N. Tishby, Agglomerative information bottleneck, in: Proceedings of Neural Information Processing Systems (NIPS99), 1999, pp. 617-623.
[28] Tishby, N.; Pereira, F.; Bialek, W., The information bottleneck method, ()
[29] Ward, J.H., Hierarchical grouping to optimize an objective function, J. am. stat. assoc., 58, 301, 236-244, (1963)
[30] R.W. Yeung, A First Course in Information Theory, Information Technology: Transmission, Processing, and Storage (series), Springer Science + Business Media, LLC, 2002.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.