A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. (English) Zbl 1474.62273

Summary: High-dimensional gene expression data pose two common problems: many of the genes may be irrelevant, and the genes are often highly correlated with one another. Gene selection has proven to be an effective way to improve the results of many classification methods. Sparse logistic regression, using either the least absolute shrinkage and selection operator (lasso) or the smoothly clipped absolute deviation penalty, is one of the most widely used methods for gene selection in cancer classification. However, this method faces a critical challenge in practical applications when the genes are highly correlated. To address this problem, a two-stage sparse logistic regression is proposed. Its aim is to obtain an efficient subset of genes with high classification capability by combining a screening approach, used as a filter method, with an adaptive lasso with a new weight, used as an embedded method. In the first stage, sure independence screening retains the genes that individually show a high correlation with the cancer class. In the second stage, the adaptive lasso with the new weight is applied to handle the high correlations among the genes retained in the first stage. Experimental results on four publicly available gene expression datasets show that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
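
The two stages described in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The summary does not specify the new weight, so the sketch substitutes the standard adaptive lasso weights 1/|b_init|^gamma taken from a ridge-penalized initial fit. The screening step ranks standardized genes by absolute marginal correlation with the class label. All function names, the synthetic data, and the tuning values (d, lam, gamma, ridge, step, iters) are hypothetical.

```python
import math
import random

def standardize(X):
    """Center and scale each column (gene) to zero mean and unit variance."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n) or 1.0
        cols.append([(v - mu) / sd for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]

def sis_screen(X, y, d):
    """Stage 1: keep the d genes with the largest |marginal correlation| with y.
    X must already be standardized, so the inner product is proportional
    to the correlation."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    score = [abs(sum(X[i][j] * (y[i] - y_mean) for i in range(n))) for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: -score[j])[:d])

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, w, lam, ridge=0.0, step=0.1, iters=300):
    """Proximal gradient descent for the mean logistic loss plus
    lam * sum_j w[j] * |b[j]| + (ridge / 2) * sum_j b[j]**2.
    The intercept b0 is not penalized."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(iters):
        resid = [sigmoid(b0 + sum(xi[j] * b[j] for j in range(p))) - yi
                 for xi, yi in zip(X, y)]
        b0 -= step * sum(resid) / n
        for j in range(p):
            grad = sum(resid[i] * X[i][j] for i in range(n)) / n + ridge * b[j]
            z = b[j] - step * grad
            thr = step * lam * w[j]
            b[j] = math.copysign(max(abs(z) - thr, 0.0), z)  # soft-thresholding
    return b0, b

def two_stage(X, y, d=10, lam=0.05, gamma=1.0):
    """Screen to d genes, then fit an adaptive-lasso logistic regression."""
    Xs = standardize(X)
    kept = sis_screen(Xs, y, d)
    Xk = [[row[j] for j in kept] for row in Xs]
    # Assumption: ridge-penalized initial estimate (no L1 penalty) for the weights.
    _, b_init = fit_logistic(Xk, y, [0.0] * len(kept), lam=0.0, ridge=0.1)
    # Assumption: standard adaptive lasso weights, not the paper's new weight.
    w = [1.0 / (abs(v) ** gamma + 1e-8) for v in b_init]
    _, b = fit_logistic(Xk, y, w, lam=lam)
    return kept, {g: c for g, c in zip(kept, b) if c != 0.0}

# Synthetic stand-in for a microarray dataset: 100 samples, 200 genes,
# with only genes 0 and 1 driving the class label.
random.seed(0)
n, p = 100, 200
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1 if 2 * r[0] - 2 * r[1] + 0.5 * random.gauss(0, 1) > 0 else 0 for r in X]

kept, selected = two_stage(X, y, d=10)
print("screened genes:", kept)
print("selected genes:", sorted(selected))
```

The fixed step size of 0.1 is a conservative choice for a small number of standardized, screened genes; a larger screened set would call for a smaller step or a line search. In practice the penalty level would be chosen by cross-validation rather than fixed in advance.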


MSC:
62J07 Ridge regression; shrinkage estimators (Lasso)
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology


Software: SparseLOGREG; glmnet

