Variable selection in discriminant analysis based on the location model for mixed variables. (English) Zbl 1301.62066

Summary: Nonparametric smoothing of the location model is a potential basis for discriminating between groups of objects described by a mixture of continuous and categorical variables. However, it may yield unreliable parameter estimates when too many variables are involved. This paper proposes a method for variable selection based on the distance between groups as measured by smoothed Kullback-Leibler divergence. Search strategies using forward, backward and stepwise selection are outlined, and corresponding stopping rules derived from asymptotic distributional results are proposed. Results from a Monte Carlo study demonstrate the feasibility of the method. Examples on real data show that the method is generally competitive with, and sometimes better than, other existing classification methods.
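
The forward search mentioned in the summary can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes two groups with continuous variables only, Gaussian with a common covariance matrix, in which case the symmetric Kullback-Leibler divergence between the groups equals the squared Mahalanobis distance. The categorical part of the location model and the smoothing step are omitted, and the fixed threshold `min_gain` is a placeholder for the asymptotic stopping rules, which the summary does not specify. The names `j_divergence`, `forward_select` and `min_gain` are hypothetical.

```python
import numpy as np


def j_divergence(x1, x2, cols):
    """Symmetric KL divergence between two Gaussian groups with a common
    covariance, i.e. the squared Mahalanobis distance on the columns `cols`."""
    a, b = x1[:, cols], x2[:, cols]
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled within-group covariance; atleast_2d keeps the single-column case 2-D.
    ca = np.atleast_2d(np.cov(a, rowvar=False))
    cb = np.atleast_2d(np.cov(b, rowvar=False))
    pooled = ((len(a) - 1) * ca + (len(b) - 1) * cb) / (len(a) + len(b) - 2)
    return float(diff @ np.linalg.solve(pooled, diff))


def forward_select(x1, x2, min_gain):
    """Greedy forward search: add the variable that increases the divergence
    most, and stop when the increase falls below `min_gain` (an assumed,
    fixed threshold standing in for a formal stopping rule)."""
    selected, current = [], 0.0
    remaining = list(range(x1.shape[1]))
    while remaining:
        scores = {j: j_divergence(x1, x2, selected + [j]) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected, current


# Synthetic data (an assumption for illustration): only variable 0 separates the groups.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(200, 3))
x2 = rng.normal(size=(200, 3))
x2[:, 0] += 3.0

selected, distance = forward_select(x1, x2, min_gain=1.0)
print("selected variables:", selected, "divergence:", round(distance, 2))
```

Backward and stepwise searches would follow the same pattern, removing (or alternately adding and removing) the variable whose change in divergence is smallest.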


62H30 Classification and discrimination; cluster analysis (statistical aspects)



