Bayesian variable selection with sparse and correlation priors for high-dimensional data analysis. (English) Zbl 1417.62203

Summary: The main challenge in working with gene expression microarrays is that the sample size is small compared to the large number of variables (genes). In many studies, the main focus is on finding a small subset of genes that are the most important for differentiating between types of cancer, so that simpler and cheaper diagnostic arrays can be built. In this paper, a sparse Bayesian variable selection method in a probit model is proposed for gene selection and classification. We assign a sparse prior to the regression parameters and perform variable selection by indexing the covariates of the model with a binary vector. The correlation prior assigned to the binary vector in this paper is able to distinguish between models of the same size. The performance of the proposed method is demonstrated with one simulated data set and two well-known real data sets, and the results show that our method is comparable with other existing methods in variable selection and classification.
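
As a rough illustration of what "indexing the covariates of the model with a binary vector" means in a probit model, the following minimal Python sketch computes class probabilities using only the covariates switched on by an indicator vector. It is not the paper's sampler or its priors: the function name `probit_prob`, the symbols `gamma` and `beta`, and the problem sizes are assumptions made for the example.

```python
import math
import numpy as np

def probit_prob(X, beta, gamma):
    """P(y_i = 1) under a probit model restricted to covariates with gamma_j = 1.

    Hypothetical helper for illustration; gamma is a 0/1 inclusion vector.
    """
    mask = np.asarray(gamma, dtype=bool)
    eta = X[:, mask] @ beta[mask]  # linear predictor from the selected columns only
    # Standard normal CDF written with the error function.
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in eta])

# Assumed toy setting: few samples, many "genes", three of them active.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[3, 17, 42]] = [1.5, -2.0, 1.0]
gamma = (beta != 0).astype(int)  # inclusion indicators for the active covariates

probs = probit_prob(X, beta, gamma)
y = (rng.random(n) < probs).astype(int)  # simulated binary class labels
```

Two indicator vectors with the same number of ones define different models of the same size, which is the situation the summary says a prior on the binary vector should be able to tell apart.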

MSC:

62J07 Ridge regression; shrinkage estimators (Lasso)
62F15 Bayesian inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
