Variable selection in model-based clustering using multilocus genotype data. (English) Zbl 1284.62397

Summary: We propose a variable selection procedure for model-based clustering of multilocus genotype data. Some loci may be irrelevant for clustering individuals into statistically distinct populations. Inferring the number \(K\) of clusters and the subset \(S\) of loci relevant for clustering is treated as a model selection problem. The competing models are compared using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator \((\widehat K_n,\widehat S_n)\). An associated algorithm, mixture model for genotype data (MixMoGenD), has been implemented in C++ and is available at http://www.math.u-psud.fr/~toussile. To avoid an exhaustive search for the optimal model, we propose a modified backward-stepwise algorithm, which enables a better search for the optimal model among all possible cardinalities of \(S\). We present numerical experiments on simulated and real datasets that illustrate the benefit of our loci selection procedure.
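
As a rough illustration of how a penalized maximum likelihood criterion can drive the choice of \((K, S)\), the sketch below may help. It is not the MixMoGenD code. The parameter count in `model_dimension`, the BIC-type penalty, and the plain greedy backward elimination (standing in for the modified backward-stepwise algorithm, whose details the summary does not give) are all assumptions made for illustration. The callable `crit` is a hypothetical stand-in for fitting a model and returning its penalized criterion.

```python
import math
from typing import Callable, FrozenSet, Iterable, Sequence, Tuple


def model_dimension(K: int, S: Iterable[int], n_alleles: Sequence[int]) -> int:
    """Count free parameters of a candidate model (K, S).

    ASSUMPTION (not stated in the summary): K mixing proportions,
    cluster-specific allele frequencies on the loci in S, and a single
    shared allele-frequency vector on every locus outside S.
    """
    selected = set(S)
    mixing = K - 1
    clustered = K * sum(n_alleles[l] - 1 for l in selected)
    shared = sum(a - 1 for l, a in enumerate(n_alleles) if l not in selected)
    return mixing + clustered + shared


def penalized_criterion(loglik: float, dim: int, n: int) -> float:
    """BIC-type criterion, -loglik + 0.5 * log(n) * dim (lower is better).

    The penalty shape is one common choice, used here as an assumption.
    """
    return -loglik + 0.5 * math.log(n) * dim


def backward_elimination(
    K: int,
    loci: Iterable[int],
    crit: Callable[[int, FrozenSet[int]], float],
) -> Tuple[FrozenSet[int], float]:
    """Plain greedy backward elimination over loci for a fixed K.

    Starting from all loci, repeatedly drop the locus whose removal gives
    the lowest criterion, and remember the best subset seen at any size.
    """
    current = frozenset(loci)
    best_S, best_val = current, crit(K, current)
    while len(current) > 1:
        # Evaluate every one-locus removal and keep the most favourable.
        val, drop = min((crit(K, current - {l}), l) for l in sorted(current))
        current = current - {drop}
        if val < best_val:
            best_S, best_val = current, val
    return best_S, best_val
```

In practice `crit(K, S)` would fit the mixture (for example by EM), then call `penalized_criterion` with the maximized log-likelihood and `model_dimension(K, S, n_alleles)`. Running `backward_elimination` for each candidate \(K\) and keeping the overall minimum gives one possible estimate of \((K, S)\).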


62H30 Classification and discrimination; cluster analysis (statistical aspects)
92D10 Genetics and epigenetics

