Semi-supervised model-based clustering with positive and negative constraints. (English) Zbl 1414.62255

Summary: Cluster analysis is a popular technique in statistics and computer science with the objective of grouping similar observations in relatively distinct groups generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of data and unavailable for the rest of observations. In this paper, we discuss a general type of semi-supervised clustering defined by so called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster. On the contrary, negative constraints specify that particular points must represent different data groups. We outline a general framework for semi-supervised clustering with constraints naturally incorporating the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.


62H30 Classification and discrimination; cluster analysis (statistical aspects)


mclust; OEIS; MixSim
Full Text: DOI


[1] Anderson, E., The Irises of the Gaspe Peninsula, Bull Am Iris Soc, 59, 2-5, (1935)
[2] Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the 19th International Conference on Machine Learning, pp 19-26
[3] Basu S, Bilenko M, Mooney RJ (2004) A Probabilistic Framework for Semi-Supervised Clustering. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 59-68
[4] Basu S, Davidson I, Wagstaff K (2008) Constrained clustering: advances in algorithms, theory, and application. Chapman and Hall/CRC · Zbl 1142.68005
[5] Bouveyron, C.; Brunet, C., Model-based clustering of high-dimensional data: a review, Comput Stat Data Anal, 71, 52-78, (2014) · Zbl 1471.62032
[6] Bridge, M., Locating the origins of wood resources: a review of dendroprovenancing, J Archaeol Sci, 39, 2828-2834, (2012)
[7] Campbell, NA; Mahon, RJ, A multivariate study of variation in two species of rock crab of genus Leptograsus, Aust J Zool, 22, 417-425, (1974)
[8] Chen, W-C; Maitra, R., Model-based clustering of regression time series data via APECM-An AECM Algorithm Sung to an even faster beat, Stat Anal Data Min, 4, 567-578, (2011)
[9] Côme, E.; Oukhellou, L.; Denœux, T.; Aknin, P., Learning from partially supervised data using mixture models and belief functions, Pattern Recognit, 42, 334-348, (2009) · Zbl 1181.68231
[10] Dempster, AP; Laird, NM; Rubin, DB, Maximum likelihood for incomplete data via the EM algorithm (with discussion), J Royal Stat Soc, Ser B, 39, 1-38, (1977) · Zbl 0364.62022
[11] Digalakis, VV; Rtischev, D.; Neumeyer, LG, Speaker adaptation using constrained estimation of Gaussian mixtures, IEEE Trans Speech Audio Process, 3, 357-366, (1995)
[12] Fisher, RA, The use of multiple measurements in taxonomic poblems, Ann Eugen, 7, 179-188, (1936)
[13] Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768-780
[14] Fraley, C.; Raftery, AE, How many clusters? Which cluster method? Answers via model-based cluster analysis, Comput J, 41, 578-588, (1998) · Zbl 0920.68038
[15] Fraley, C.; Raftery, AE, Model-based clustering and density estimation, J Am Stat Assoc, 97, 611-631, (2002) · Zbl 1073.62545
[16] Fraley C, Raftery AE (2006) MCLUST Version 3 for R: normal mixture modeling and model-based clustering, Tech. Rep. 504, University of Washington, Department of Statistics, Seattle, WA
[17] Gaffney, SJ; Smyth, P., Trajectory clustering with mixture of regression model, 63-72, (1999), USA
[18] Grissino-Mayeri, HD; Fritts, H., The international tree-ring data bank: an enhanced global database serving the Global Scientific Community, Holocene, 7, 235-238, (1997)
[19] Haneca, K.; Wazny, T.; Acker, J.; Beeckman, H., Provenancing Baltic timber from art historical objects: success and limitations, J Archaeol Sci, 32, 261-271, (2005)
[20] Hennig, C., Methods for merging Gaussian mixture components, Adv Data Anal Classif, 4, 3-34, (2010) · Zbl 1306.62141
[21] Huang J-T, Hasegawa-Johnson M (2009) On semi-supervised learning of Gaussian mixture models for phonetic classification. In: NAACL HLT workshop on semi-supervised learning
[22] Hughes MK, Swetnam TW, Diaz HF (2009) Dendroclimatology: progress and prospects, vol 11. Princeton, Developments in Paleoenvirnmental ResearchSpringer
[23] Johnson, S., Hierarchical clustering schemes, Psychometrika, 32, 241-254, (1967) · Zbl 1367.62191
[24] Law MHC, Topchy A, Jain AK (2005) Model-based clustering with probabilistic constraints. In: 2005 SIAM International Conference on Data Mining, pp 641-645
[25] Liu, B.; Shen, X.; Pan, W., Semi-supervised spectral clustering with application to detect population stratification, Front Genet, 4, 1-5, (2013)
[26] Lu, Z.; Leen, TK, Penalized probabilistic clustering, Neural Comput, 19, 1528-1567, (2007) · Zbl 1119.68183
[27] MacQueen, J., Some methods for classification and analysis of multivariate observations, Proc Fifth Berkeley Symp, 1, 281-297, (1967) · Zbl 0214.46201
[28] Maitra, R.; Melnykov, V., Simulating data to study performance of finite mixture modeling and clustering algorithms, J Comput Graph Stat, 19, 354-376, (2010)
[29] Martinez-Uso A, Pla F, Sotoca J (2010) A semi-supervised Gaussian mixture model for image segmentation. In: International Conference on Pattern Recognition, pp 2941-2944
[30] McLachlan G, Peel D (2000) Finite Mixture Models. Wiley, New York · Zbl 0963.62061
[31] Melnykov, V., Efficient estimation in model-based clustering of Gaussian regression time series, Stat Anal Data Min, 5, 95-99, (2012)
[32] Melnykov, V., On the distribution of posterior probabilities in finite mixture models with application in clustering, J Multivar Anal, 122, 175-189, (2013) · Zbl 1279.62114
[33] Melnykov, V.; Chen, W-C; Maitra, R., MixSim: R package for simulating datasets with pre-specified clustering complexity, J Stat Softw, 51, 1-25, (2012)
[34] Melnykov, V.; Maitra, R., Finite mixture models and model-based clustering, Stat Surv, 4, 80-116, (2010) · Zbl 1190.62121
[35] Nigam, K.; McCallum, AK; Thrun, S.; Mitchell, T., Text classification from labeled and unlabeled documents using EM, Mach Learn, 39, 103-134, (2000) · Zbl 0949.68162
[36] Pan, W.; Shen, X.; Jiang, A.; Hebbel, R., Semisupervised learning via penalized mixture model with application to microarray sample classification, Bioinformatics, 22, 2388-2395, (2006)
[37] Schwarz, G., Estimating the dimensions of a model, Ann Stat, 6, 461-464, (1978) · Zbl 0379.62005
[38] Shental N, Bar-Hillel A, Hertz T, Weinshall D (2003) Computing Gaussian mixture models with EM using equivalence constraints. In: Advances in NIPS, vol. 15 · Zbl 1161.68775
[39] Sloane NJA (2014) The online encyclopedia of integer sequences: A001349 Number of connected graphs with n nodes
[40] Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained \(K\)-means Clustering with Background Knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp 577-584
[41] Wang, L.; Zhu, J.; Zou, H., Hybrid Huberized Support Vector Machines for Microarray Classification, 983-990, (2007), USA
[42] Ward, JH, Hierarchical grouping to optimize an objective function, J Am Stat Assoc, 58, 236-244, (1963)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.