
Robust finite mixture regression for heterogeneous targets. (English) Zbl 1428.62309

Summary: Finite mixture regression (FMR) refers to the mixture modeling scheme that learns multiple regression models from the training data set, each responsible for a subset of the samples. FMR is an effective scheme for handling sample heterogeneity, where a single regression model cannot capture the complexity of the conditional distribution of the observed responses given the features. In this paper, we propose an FMR model that (1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, (2) achieves shared feature selection among tasks and cluster components, and (3) detects anomaly tasks or clustered structure among tasks and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets; the results show that it achieves state-of-the-art performance.
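To make the FMR scheme concrete, here is a minimal EM sketch for a plain K-component mixture of linear regressions. This illustrates only the basic scheme the summary describes, not the paper's robust, multi-target model with shared feature selection; the function name `fit_fmr` and all defaults are illustrative.

```python
import numpy as np

def fit_fmr(X, y, K=2, n_iter=200, seed=0):
    """EM for a K-component finite mixture of linear regressions.

    Illustrative sketch: each component k has coefficients beta_k,
    noise scale sigma_k, and mixing weight pi_k. Not the robust
    multi-target model proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(size=(K, p))      # random initialization
    sigma = np.ones(K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ∝ pi_k * N(y_i | x_i beta_k, sigma_k^2)
        resid = y[:, None] - X @ beta.T                   # (n, K)
        logp = (np.log(pi)
                - 0.5 * np.log(2 * np.pi * sigma**2)
                - 0.5 * resid**2 / sigma**2)
        logp -= logp.max(axis=1, keepdims=True)           # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component
        for k in range(K):
            w = r[:, k]
            Xw = X * w[:, None]
            beta[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            sigma[k] = max(np.sqrt((w * (y - X @ beta[k])**2).sum() / w.sum()),
                           1e-3)                          # floor avoids collapse
        pi = r.mean(axis=0)
    return beta, sigma, pi, r
```

On data generated from two regression regimes (e.g. slopes +2 and -2), the fitted `beta` rows typically recover the two component models, up to label permutation.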

MSC:

62J02 General nonlinear regression
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62R07 Statistical aspects of big data and data science
68T05 Learning and adaptive systems in artificial intelligence

Software:

TFOCS
Full Text: DOI arXiv

References:

[1] Aho K, Derryberry D, Peterson T (2014) Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3):631-636 · doi:10.1890/13-1452.1
[2] Alfò M, Salvati N, Ranalli MG (2016) Finite mixtures of quantile and M-quantile regression models. Stat Comput 27:1-24 · Zbl 1505.62017
[3] Argyriou A, Evgeniou T, Pontil M (2007a) Multi-task feature learning. In: Advances in neural information processing systems, pp 41-48 · Zbl 1470.68073
[4] Argyriou A, Pontil M, Ying Y, Micchelli CA (2007b) A spectral regularization framework for multi-task structure learning. In: Advances in neural information processing systems, pp 25-32
[5] Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. J Stat Comput Simul 86(4):771-787 · Zbl 1510.62272 · doi:10.1080/00949655.2015.1036431
[6] Bartolucci F, Scaccia L (2005) The use of mixtures for dealing with non-normal regression errors. Comput Stat Data Anal 48(4):821-834 · Zbl 1429.62284 · doi:10.1016/j.csda.2004.04.005
[7] Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141-148 · Zbl 0638.65055 · doi:10.1093/imanum/8.1.141
[8] Becker SR, Candès EJ, Grant MC (2011) Templates for convex cone problems with applications to sparse signal recovery. Math Program Comput 3(3):165-218 · Zbl 1257.90042 · doi:10.1007/s12532-011-0029-5
[9] Bhat HS, Kumar N (2010) On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California, Merced
[10] Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705-1732 · Zbl 1173.62022 · doi:10.1214/08-AOS620
[11] Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
[12] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends® Mach Learn 3(1):1-122 · Zbl 1229.90122
[13] Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717-772 · Zbl 1219.90124 · doi:10.1007/s10208-009-9045-5
[14] Chen X, Kim S, Lin Q, Carbonell JG, Xing EP (2010) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. ArXiv preprint arXiv:1005.3579
[15] Chen J, Zhou J, Ye J (2011) Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 42-50
[16] Chen J, Liu J, Ye J (2012a) Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Trans Knowl Discov Data (TKDD) 5(4):22
[17] Chen K, Chan KS, Stenseth NC (2012b) Reduced rank stochastic regression with a sparse singular value decomposition. J R Stat Soc Ser B (Stat Methodol) 74(2):203-221 · Zbl 1411.62182 · doi:10.1111/j.1467-9868.2011.01002.x
[18] Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken · Zbl 0762.94001
[19] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39:1-38 · Zbl 0364.62022
[20] Doğru FZ, Arslan O (2016) Robust mixture regression using mixture of different distributions. In: Agostinelli C, Basu A, Filzmoser P, Mukherjee D (eds) Recent advances in robust statistics: theory and applications. Springer, New Delhi, pp 57-79 · Zbl 1356.62084 · doi:10.1007/978-81-322-3643-6_4
[21] Doğru FZ, Arslan O (2017) Parameter estimation for mixtures of skew Laplace normal distributions and application in mixture regression modeling. Commun Stat Theory Methods 46(21):10879-10896 · Zbl 1462.62195 · doi:10.1080/03610926.2016.1252400
[22] Fahrmeir L, Kneib T, Lang S, Marx B (2013) Regression: models, methods and applications. Springer, Berlin · Zbl 1276.62046 · doi:10.1007/978-3-642-34333-9
[23] Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101-148 · Zbl 1180.62080
[24] Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 186-193
[25] Gong P, Ye J, Zhang C (2012a) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 895-903
[26] Gong P, Ye J, Zhang C (2012b) Multi-stage multi-task feature learning. In: Advances in neural information processing systems, pp 1988-1996
[27] Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29-36 · doi:10.1148/radiology.143.1.7063747
[28] He J, Lawrence R (2011) A graph-based framework for multi-task multi-view learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 25-32
[29] Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481-499 · Zbl 1331.62347 · doi:10.1214/12-STS392
[30] Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79-87 · doi:10.1162/neco.1991.3.1.79
[31] Jacob L, Vert J, Bach FR (2009) Clustered multi-task learning: a convex formulation. In: Advances in neural information processing systems, pp 745-752
[32] Jalali A, Sanghavi S, Ruan C, Ravikumar PK (2010) A dirty model for multi-task learning. In: Advances in neural information processing systems, pp 964-972
[33] Ji S, Ye J (2009) An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 457-464
[34] Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10(1):17-40 · doi:10.1007/s10115-005-0210-0
[35] Jin X, Zhuang F, Pan SJ, Du C, Luo P, He Q (2015) Heterogeneous multi-task semantic feature learning for classification. In: Proceedings of the 24th ACM international on conference on information and knowledge management. ACM, pp 1847-1850
[36] Jorgensen B (1987) Exponential dispersion models. J R Stat Soc Ser B (Methodol) 49:127-162 · Zbl 0662.62078
[37] Khalili A (2011) An overview of the new feature selection methods in finite mixture of regression models. J Iran Stat Soc 10(2):201-235 · Zbl 1244.62021
[38] Khalili A, Chen J (2007) Variable selection in finite mixture of regression models. J Am Stat Assoc 102(479):1025-1038 · Zbl 1469.62306 · doi:10.1198/016214507000000590
[39] Koller D, Sahami M (1996) Toward optimal feature selection. In: Proceedings of the 13th international conference on machine learning, pp 284-292
[40] Kubat M (2015) An introduction to machine learning. Springer, Berlin · Zbl 1330.68003 · doi:10.1007/978-3-319-20010-1
[41] Kumar A, Daumé III H (2012) Learning task grouping and overlap in multi-task learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1723-1730
[42] Lim H, Narisetty NN, Cheon S (2016) Robust multivariate mixture regression models with incomplete data. J Stat Comput Simul 87:1-20
[43] Law MH, Jain AK, Figueiredo M (2002) Feature selection in mixture-based clustering. In: Advances in neural information processing systems, pp 625-632
[44] Li S, Liu ZQ, Chan AB (2014) Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 482-489
[45] Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient ℓ2,1-norm minimization. In: Proceedings of the 25th conference on uncertainty in artificial intelligence. AUAI Press, pp 339-348
[46] McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken · Zbl 0963.62061
[47] Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. Kluwer, Dordrecht, pp 355-368 · Zbl 0916.62019 · doi:10.1007/978-94-011-5014-9_12
[48] Nelder JA, Baker RJ (1972) Generalized linear models. Encyclopedia of statistical sciences. Wiley, Hoboken
[49] Nesterov Y et al (2007) Gradient methods for minimizing composite objective function. Technical report, UCL
[50] Passos A, Rai P, Wainer J, Daumé III H (2012) Flexible modeling of latent task structures in multitask learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1283-1290
[51] Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299-1319 · doi:10.1162/089976698300017467
[52] She Y, Chen K (2017) Robust reduced-rank regression. Biometrika 104(3):633-647 · Zbl 07072232
[53] She Y, Owen AB (2011) Outlier detection using nonconvex penalized regression. J Am Stat Assoc 106(494):626-639 · Zbl 1232.62068 · doi:10.1198/jasa.2011.tm10390
[54] Städler N, Bühlmann P, van de Geer S (2010) ℓ1-penalization for mixture regression models. Test 19(2):209-256 · Zbl 1203.62128 · doi:10.1007/s11749-010-0197-z
[55] Strehl A, Ghosh J (2002a) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583-617 · Zbl 1084.68759
[56] Strehl A, Ghosh J (2002b) Cluster ensembles: a knowledge reuse framework for combining partitionings. In: 18th national conference on artificial intelligence. American Association for Artificial Intelligence, pp 93-98
[57] Tan Z, Kaddoum R, Wang LY, Wang H (2010) Decision-oriented multi-outcome modeling for anesthesia patients. Open Biomed Eng J 4:113 · doi:10.2174/1874120701004010113
[58] Van de Geer SA (2000) Applications of empirical process theory, vol 91. Cambridge University Press, Cambridge · Zbl 0953.62049
[59] Van Der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: a comparative. J Mach Learn Res 10:66-71
[60] Van Der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes: with applications to statistics. Springer, New York · Zbl 0862.60002
[61] Wedel M, DeSarbo WS (1995) A mixture likelihood approach for generalized linear models. J Classif 12(1):21-55 · Zbl 0825.62611 · doi:10.1007/BF01202266
[62] Weruaga L, Vía J (2015) Sparse multivariate Gaussian mixture regression. IEEE Trans Neural Netw Learn Syst 26(5):1098-1108 · doi:10.1109/TNNLS.2014.2334596
[63] Wang HX, Zhang QB, Luo B, Wei S (2004) Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recognit Lett 25(6):701-710 · doi:10.1016/j.patrec.2004.01.010
[64] Yang X, Kim S, Xing EP (2009) Heterogeneous multitask learning with joint sparsity constraints. In: Advances in neural information processing systems, pp 2151-2159
[65] Yuksel SE, Wilson JN, Gader PD (2012) Twenty years of mixture of experts. IEEE Trans Neural Netw Learn Syst 23(8):1177-1193 · doi:10.1109/TNNLS.2012.2200299
[66] Zhang D, Shen D, Initiative ADN et al (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2):895-907 · doi:10.1016/j.neuroimage.2011.09.069
[67] Zhang Y, Yeung DY (2011) Multi-task learning in heterogeneous feature spaces. In: Proceedings of the 25th AAAI conference on artificial intelligence (AAAI-11), San Francisco, CA, 7-11 August 2011, p 574
[68] Zhou J, Chen J, Ye J (2011) Clustered multi-task learning via alternating structure optimization. In: Advances in neural information processing systems, pp 702-710