Prediction-based structured variable selection through the receiver operating characteristic curves. (English) Zbl 1274.62901

Summary: In many clinical settings, a common problem is to assess the accuracy of a screening test for early detection of a disease. In these applications, the predictive performance of the test is of interest, and variable selection may be useful in designing the test. An example is a research study conducted to design a new screening test by selecting variables from an existing screener with a hierarchical structure among variables: there are several root questions, each followed by its stem questions. The stem questions are asked only after a subject has answered the root question. It is therefore unreasonable to select a model that contains stem variables but not their root variable. In this work, we propose methods to perform variable selection with structured variables when the predictive accuracy of a diagnostic test is the main concern of the analysis. We take a linear combination of individual variables to form a combined test. We then maximize a direct summary measure of the predictive performance of the test, the area under the receiver operating characteristic curve (AUC of the ROC), subject to a penalty function to control for overfitting. Since maximizing the empirical AUC of a combined test is a complicated nonconvex problem [M. S. Pepe et al., Biometrics 62, No. 1, 221–229 (2006; Zbl 1091.62125)], we explore the connection between the empirical AUC and a support vector machine (SVM). We cast the problem of maximizing the predictive performance of a combined test as a penalized SVM problem and apply a reparametrization to impose the hierarchical structure among variables. We also describe a penalized logistic regression variable selection procedure for structured variables and compare it with the ROC-based approaches. We use simulation studies based on real data to examine the performance of the proposed methods. Finally, we apply the developed methods to design a structured screener to be used in primary care clinics to refer potentially psychotic patients for further specialty diagnostics and treatment.
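
The link between the empirical AUC and an SVM-type criterion mentioned in the summary can be illustrated with a minimal Python sketch. It is not taken from the paper. It assumes the usual Mann-Whitney form of the empirical AUC and a hinge loss on case-minus-control differences as the convex surrogate. The function names and the product-form reparametrization (stem coefficient = root coefficient times a free parameter) are hypothetical choices made here for illustration; the summary does not state which reparametrization the authors use.

```python
import numpy as np


def empirical_auc(scores, labels):
    """Mann-Whitney estimate of the AUC: the fraction of (case, control)
    pairs in which the case scores higher; ties count one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all case-minus-control score gaps
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def pairwise_hinge(beta, X, labels):
    """Mean hinge loss on case-minus-control covariate differences.
    This convex quantity is an upper bound on 1 - empirical AUC of the
    combined test X @ beta (assumed surrogate, not the paper's exact form)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    pos = X[labels == 1]
    neg = X[labels == 0]
    d = (pos[:, None, :] - neg[None, :, :]).reshape(-1, X.shape[1])
    return float(np.mean(np.maximum(0.0, 1.0 - d @ np.asarray(beta, dtype=float))))


def hierarchical_coefficients(gamma_root, theta_stem):
    """Hypothetical reparametrization: each stem coefficient is the root
    coefficient times a free parameter, so a zero root forces zero stems."""
    return gamma_root, gamma_root * np.asarray(theta_stem, dtype=float)


# Toy data: one marker, two controls (label 0) and two cases (label 1).
X = np.array([[0.1], [0.4], [0.35], [0.8]])
y = np.array([0, 0, 1, 1])
beta = np.array([1.0])

# 3 of the 4 case-control pairs are ordered correctly, so the AUC is 0.75.
print(empirical_auc(X @ beta, y))
print(pairwise_hinge(beta, X, y))
```

Adding a sparsity penalty on the coefficients to `pairwise_hinge` would give a penalized SVM-type objective on the pairwise differences. Under the assumed product form, shrinking a root coefficient to zero removes its stem variables as well, which is the hierarchy described in the summary.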

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C50 Medical applications (general)

Citations:

Zbl 1091.62125

References:

[1] Bebbington, The Psychosis Screening Questionnaire, International Journal of Methods in Psychiatric Research 5 pp 11– (1995)
[2] Becker, Penalized SVM: A R-package for feature selection SVM classification, Bioinformatics 25 pp 1711– (2009) · Zbl 05744081 · doi:10.1093/bioinformatics/btp286
[3] Brefeld, Proceedings of the 22nd International Conference on Machine Learning-Workshop on ROC Analysis in Machine Learning (2005)
[4] Briggs, The skill plot: A graphical technique for evaluating continuous diagnostic tests, Biometrics 64 pp 250– (2008) · Zbl 1138.62067 · doi:10.1111/j.1541-0420.2007.00781_1.x
[5] Calders, Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) pp 42– (2007)
[6] Efron, How biased is the apparent error rate of a prediction rule, Journal of the American Statistical Association 81 pp 461– (1986) · Zbl 0621.62073 · doi:10.2307/2289236
[7] Fan, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 pp 1348– (2001) · Zbl 1073.62547 · doi:10.1198/016214501753382273
[8] First, M. B., Spitzer, R. L., Gibbon, M., Williams, J., Structured Clinical Interview for DSM-IV Axis I Disorders (1998)
[9] Fung, A feature selection Newton method for support vector machine classification, Computational Optimization and Applications 28 pp 185– (2004) · Zbl 1056.90103 · doi:10.1023/B:COAP.0000026884.66338.df
[10] Han, Non-parametric analysis of a generalized regression model. The maximum rank correlation estimator, Journal of Econometrics 35 pp 303– (1987) · Zbl 0638.62063 · doi:10.1016/0304-4076(87)90030-3
[11] Hand, On Briggs and Zaretzki: The Skill Plot: A graphical technique for evaluating continuous diagnostic tests, Biometrics 64 pp 259– (2008) · Zbl 1138.62067 · doi:10.1111/j.1541-0420.2007.00781_3.x
[12] Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2001) · Zbl 0973.62007
[13] Heagerty, Survival model predictive accuracy and ROC curves, Biometrics 61 pp 92– (2005) · Zbl 1077.62077 · doi:10.1111/j.0006-341X.2005.030814.x
[14] Huang, A group bridge approach for variable selection, Biometrika 96 pp 339– (2009) · Zbl 1163.62050 · doi:10.1093/biomet/asp020
[15] Koo, A Bahadur representation of the linear support vector machine, Journal of Machine Learning Research 9 pp 1343– (2008) · Zbl 1225.68191
[16] Lewis-Fernández, Proceedings of the 15th Annual Scientific Symposium (2003)
[17] Ma, Regularized ROC method for disease classification and biomarker selection with microarray data, Bioinformatics 21 pp 4356– (2005) · doi:10.1093/bioinformatics/bti724
[18] Ma, Combining multiple markers for classification using ROC, Biometrics 63 pp 751– (2007) · Zbl 1128.62117 · doi:10.1111/j.1541-0420.2006.00731.x
[19] Obuchowski, An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale, Statistics in Medicine 25 pp 481– (2006) · doi:10.1002/sim.2228
[20] Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction (2003) · Zbl 1039.62105
[21] Pepe, Evaluating technologies for classification and prediction in medicine, Statistics in Medicine 24 pp 3687– (2005) · doi:10.1002/sim.2431
[22] Pepe, Combining predictors for classification using the area under the receiver operating characteristic curve, Biometrics 62 pp 221– (2006) · Zbl 1091.62125 · doi:10.1111/j.1541-0420.2005.00420.x
[23] Pinsky, Scaling of true and apparent ROC AUC with number of observations and number of variables, Communications in Statistics: Simulation and Computation 34 pp 771– (2005) · Zbl 1072.62112 · doi:10.1081/SAC-200068366
[24] Swets, Measuring the accuracy of diagnostic systems, Science 240 pp 1285– (1988) · Zbl 1226.92048 · doi:10.1126/science.3287615
[25] Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 pp 267– (1996) · Zbl 0850.62538
[26] Wahba, Advances in Large Margin Classifiers pp 297– (2000)
[27] Wang, Hierarchically penalized Cox regression with grouped variables, Biometrika 96 pp 307– (2009) · Zbl 1163.62089 · doi:10.1093/biomet/asp016
[28] Yuan, An efficient variable selection approach for analyzing designed experiments, Technometrics 49 pp 430– (2007) · doi:10.1198/004017007000000173
[29] Yuan, Structured variable selection and estimation, Annals of Applied Statistics 3 pp 1738– (2009) · Zbl 1184.62032 · doi:10.1214/09-AOAS254
[30] Zhang, Gene selection using support vector machine with non-convex penalty, Bioinformatics 22 pp 88– (2006) · doi:10.1093/bioinformatics/bti736
[31] Zou, One-step sparse estimates in nonconcave penalized likelihood models, Annals of Statistics 36 pp 1509– (2008) · Zbl 1142.62027 · doi:10.1214/009053607000000802
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.