Chained correlations for feature selection. (English) Zbl 1474.62233

Summary: Data-driven algorithms stand and fall with the availability and quality of existing data sources. Both can be limited in high-dimensional settings \((n \gg m)\). For example, supervised learning algorithms designed for molecular pheno- or genotyping are restricted to samples of the corresponding diagnostic classes. Samples of other related entities, such as those arising in differential diagnosis, are usually not utilized in this learning scheme. Nevertheless, they might provide domain knowledge on the background or context of the original diagnostic task. In this work, we discuss the possibility of incorporating samples of foreign classes, as they arise in differential diagnosis, into the training of diagnostic classification models. Especially in heterogeneous data collections comprising multiple diagnostic categories, the foreign classes can substantially increase the number of available samples. More precisely, we utilize this information in the internal feature selection process of diagnostic models. We propose the use of chained correlations of original and foreign diagnostic classes. This method allows the detection of intermediate foreign classes by evaluating the correlation between class labels and features for each pair of original and foreign categories. Interestingly, this criterion does not require direct comparisons of the initial diagnostic groups and therefore might be suitable for settings with restricted data access.
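The chained-correlation criterion described in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Pearson (point-biserial) correlation between each feature and binary class labels, and it assumes the per-feature chained score is aggregated as the minimum absolute correlation over the two links (original class A vs. foreign class F, and foreign class F vs. original class B); the paper's exact correlation measure and aggregation may differ. Note that the two original classes A and B are never compared with each other directly.

```python
import numpy as np

def point_biserial(feature, labels):
    """Pearson correlation between a feature vector and binary 0/1 labels."""
    return np.corrcoef(feature, labels)[0, 1]

def chained_correlation(x_a, x_b, x_f):
    """Chained-correlation score per feature (hypothetical sketch).

    x_a, x_b : (samples x features) arrays of the two original classes
    x_f      : (samples x features) array of a foreign (intermediate) class

    Each original class is correlated only against the foreign class;
    the score chains the two links via a min-of-|r| aggregation
    (an illustrative choice, not necessarily the paper's).
    """
    n_features = x_a.shape[1]
    scores = np.empty(n_features)
    for j in range(n_features):
        # link 1: original class A vs. foreign class F
        f1 = np.concatenate([x_a[:, j], x_f[:, j]])
        y1 = np.concatenate([np.zeros(len(x_a)), np.ones(len(x_f))])
        r1 = point_biserial(f1, y1)
        # link 2: foreign class F vs. original class B
        f2 = np.concatenate([x_f[:, j], x_b[:, j]])
        y2 = np.concatenate([np.zeros(len(x_f)), np.ones(len(x_b))])
        r2 = point_biserial(f2, y2)
        # a feature scores high only if it separates A from F and F from B
        scores[j] = min(abs(r1), abs(r2))
    return scores
```

Features whose expression changes monotonically across A, F, and B receive high scores through both links, so ranking features by this score and keeping the top \(k\) yields a selection for the A-vs-B task without ever pooling A and B samples.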


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H20 Measures of association (correlation, canonical correlation, etc.)
68T10 Pattern recognition, speech recognition
62P10 Applications of statistics to biology and medical sciences; meta analysis
Full Text: DOI


[1] Bellman, R., Dynamic programming (1957), Princeton: Princeton University Press, Princeton · Zbl 0077.13605
[2] Berchtold, NC; Cribbs, DH; Coleman, PD; Rogers, J.; Head, E.; Kim, R.; Beach, T.; Miller, C.; Troncoso, J.; Trojanowski, JQ; Zielke, HR; Cotman, CW, Gene expression changes in the course of normal brain aging are sexually dimorphic, Proc Natl Acad Sci USA, 105, 40, 15605-15610 (2008)
[3] Bittner M (2005) Expression project for oncology (expO). National Center for Biotechnology Information
[4] Breiman, L., Random forests, Mach Learn, 45, 1, 5-32 (2001) · Zbl 1007.68152
[5] Bühlmann, P.; van de Geer, S., Statistics for high-dimensional data (2011), Springer Series in Statistics. Springer, Heidelberg · Zbl 1273.62015
[6] Burkovski, A.; Lausser, L.; Kraus, J.; Kestler, H., Rank aggregation for candidate gene identification. In: Spiliopoulou M, Schmidt-Thieme L, Janning R (eds) Data analysis, machine learning and knowledge discovery, 285-293 (2014), Cham: Springer International Publishing, Cham
[7] Caruana, R., Multitask learning, Mach Learn, 28, 1, 41-75 (1997)
[8] Chapelle, O.; Schölkopf, B.; Zien, A., Semi-supervised learning (2010), Cambridge: The MIT Press, Cambridge
[9] Chevaleyre, Y.; Endriss, U.; Lang, J.; Maudet, N., A short introduction to computational social choice. In: van Leeuwen J, Italiano G, van der Hoek W, Meinel C, Sack H, Plášil F (eds) SOFSEM 2007: theory and practice of computer science, 51-69 (2007), Berlin, Heidelberg: Springer, Berlin, Heidelberg · Zbl 1131.91316
[10] Cover, TM, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans Electron Comput, 14, 3, 326-334 (1965) · Zbl 0192.08403
[11] Deb, K., Multi-objective optimization using evolutionary algorithms (2001), Hoboken: Wiley, Hoboken · Zbl 0970.90091
[12] Fix E, Hodges JL (1951) Discriminatory analysis: nonparametric discrimination: consistency properties. In: Technical reports project 21-49-004, report number 4. USAF School of Aviation Medicine, Randolf Field, Texas · Zbl 0715.62080
[13] François, D.; Rossi, F.; Wertz, V.; Verleysen, M., Resampling methods for parameter-free and robust feature selection with mutual information, Neurocomputing, 70, 7-9, 1276-1288 (2007)
[14] Gobble, RM; Qin, LX; Brill, ER; Angeles, CV; Ugras, S.; O’Connor, RB; Moraco, NH; DeCarolis, PL; Antonescu, C.; Singer, S., Expression profiling of liposarcoma yields a multigene predictor of patient outcome and identifies genes that contribute to liposarcomagenesis, Cancer Res, 71, 7, 2697-2705 (2011)
[15] Guyon, I.; Elisseeff, A., An introduction to variable and feature selection, J Mach Learn Res, 3, Mar, 1157-1182 (2003) · Zbl 1102.68556
[16] Haferlach, T.; Kohlmann, A.; Wieczorek, L.; Basso, G.; Kronnie, GT; Béné, MC; Vos, JD; Hernández, JM; Hofmann, WK; Mills, KI; Gilkes, A.; Chiaretti, S.; Shurtleff, SA; Kipps, TJ; Rassenti, LZ; Yeoh, AE; Papenhausen, PR; Liu, WM; Williams, PM; Foà, R., Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group, J Clin Oncol, 28, 15, 2529-2537 (2010)
[17] Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the 26th international conference on very large data bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 506-515
[18] Japkowicz, N.; Shah, M., Evaluating learning algorithms: a classification perspective (2011), New York: Cambridge University Press, New York · Zbl 1230.68020
[19] Jones, J.; Otu, H.; Spentzos, D.; Kolia, S.; Inan, M.; Beecken, WD; Fellbaum, C.; Gu, X.; Joseph, M.; Pantuck, AJ; Jonas, D.; Libermann, TA, Gene signatures of progression and metastasis in renal cell cancer, Clin Cancer Res, 11, 16, 5730-5739 (2005)
[20] Kearns, M.; Vazirani, U., An introduction to computational learning theory (1994), Cambridge: MIT Press, Cambridge
[21] Kimpel, MW; Strother, WN; McClintick, JN; Carr, LG; Liang, T.; Edenberg, HJ; McBride, WJ, Functional gene expression differences between inbred alcohol-preferring and non-preferring rats in five brain regions, Alcohol, 41, 2, 95-132 (2007)
[22] Kraus, J.; Lausser, L.; Kuhn, P.; Jobst, F.; Bock, M.; Halanke, C.; Hummel, M.; Heuschmann, P.; Kestler, HA, Big data and precision medicine: challenges and strategies with healthcare data, Int J Data Sci Anal, 6, 3, 241-249 (2018)
[23] Lattke R, Lausser L, Müssel C, Kestler HA (2015) Detecting ordinal class structures. In: Schwenker F, Roli F, Kittler J (eds) Multiple classifier systems, MCS 2015. Lecture notes in computer science, vol 9132, pp 100-111. Springer, Cham
[24] Lausser, L.; Schmid, F.; Schmid, M.; Kestler, HA, Unlabeling data can improve classification accuracy, Pattern Recogn Lett, 37, 15-23 (2014)
[25] Lausser, L.; Schmid, F.; Platzer, M.; Sillanpää, MJ; Kestler, HA, Semantic multi-classifier systems for the analysis of gene expression profiles, Arch Data Sci Ser A, 1, 1, 1-19 (2016)
[26] Lausser, L.; Schmid, F.; Schirra, LR; Wilhelm, A.; Kestler, H., Rank-based classifiers for extremely high-dimensional gene expression data, Adv Data Anal Classif, 12, 1-20 (2016) · Zbl 1416.62608
[27] Lausser, L.; Szekely, R.; Kessler, V.; Schwenker, F.; Kestler, HA, Selecting features from foreign classes. In: Pancioni L, Schwenker F, Trentin E (eds) Artificial neural networks in pattern recognition, 66-77 (2018), Cham: Springer International Publishing, Cham
[28] Lausser, L.; Szekely, R.; Schirra, LR; Kestler, HA, The influence of multi-class feature selection on the prediction of diagnostic phenotypes, Neural Process Lett, 48, 2, 863-880 (2018)
[29] Müssel, C.; Lausser, L.; Maucher, M.; Kestler, HA, Multi-objective parameter selection for classifiers, J Stat Softw, 46, 5, 1-27 (2012)
[30] Pan, SJ; Yang, Q., A survey on transfer learning, IEEE Trans Knowl Data Eng, 22, 10, 1345-1359 (2010)
[31] Pfister, TD; Reinhold, WC; Agama, K.; Gupta, S.; Khin, SA; Kinders, RJ; Parchment, RE; Tomaszewski, JE; Doroshow, JH; Pommier, Y., Topoisomerase I levels in the NCI-60 cancer cell line panel determined by validated ELISA and microarray analysis and correlation with indenoisoquinoline sensitivity, Mol Cancer Ther, 8, 7, 1878-1884 (2009)
[32] Sheffer, M.; Bacolod, MD; Zuk, O.; Giardina, SF; Pincas, H.; Barany, F.; Paty, PB; Gerald, WL; Notterman, DA; Domany, E., Association of survival and disease progression with chromosomal instability: a genomic exploration of colorectal cancer, Proc Nat Acad Sci, 106, 17, 7131-7136 (2009)
[33] Taudien, S.; Lausser, L.; Giamarellos-Bourboulis, EJ; Sponholz, C.; F, S.; Felder, M.; Schirra, LR; Schmid, F.; Gogos, C.; S, G.; Petersen, BS; Franke, A.; Lieb, W.; Huse, K.; Zipfel, PF; Kurzai, O.; Moepps, B.; Gierschik, P.; Bauer, M.; Scherag, A.; Kestler, HA; Platzer, M., Genetic factors of the disease course after sepsis: rare deleterious variants are predictive, EBioMedicine, 12, 227-238 (2016)
[34] Vapnik, VN, Statistical learning theory (1998), New York: Wiley, New York · Zbl 0935.62007
[35] Yu, S.; Príncipe, J., Simple stopping criteria for information theoretic feature selection, Entropy, 21, 1, 99 (2019)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.