Rank-based classifiers for extremely high-dimensional gene expression data. (English) Zbl 1416.62608

Summary: Predicting phenotypes from gene expression profiles is a classification task of growing importance in precision medicine. Although these expression signals are real-valued, it is questionable whether they can be analyzed on an interval scale. As with many biological signals, their influence on, e.g., protein levels is usually non-linear and can therefore be misinterpreted. In this article we study gene expression profiles with up to 54,000 dimensions. We analyze these measurements on an ordinal scale by replacing the real-valued profiles with their ranks. This rank transformation allows the construction of invariant classifiers that are unaffected by noise induced by data transformations that can occur in the measurement setup. Our \(10 \times 10\)-fold cross-validation experiments on 86 different data sets and 19 different classification models indicate that classifiers largely benefit from this transformation. In particular, random forests and support vector machines achieve improved classification results on a significant majority of the data sets.
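The rank transformation described in the summary can be sketched as follows. This is a minimal illustration, not the authors' code: each expression profile is mapped to its vector of ranks, which makes any classifier operating on the ranks invariant to strictly monotone transformations of the raw signal (ties are ignored here for simplicity; a library routine such as `scipy.stats.rankdata` would handle them):

```python
import numpy as np

def rank_transform(profile):
    """Replace real-valued expression values by their ranks (ordinal scale)."""
    order = np.argsort(profile)          # indices that sort the profile ascending
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(profile) + 1)  # 1-based ranks
    return ranks

# A toy expression profile and a strictly monotone (log) transformation of it
x = np.array([3.2, 150.0, 0.7, 42.0])
y = np.log2(x + 1)                       # e.g. a log-scale measurement setup

print(rank_transform(x))  # [2 4 1 3]
print(rank_transform(y))  # [2 4 1 3] -- identical: ranks are invariant
```

Because the ranks of `x` and `log2(x + 1)` coincide, a rank-based classifier gives the same prediction regardless of such monotone distortions in the measurement pipeline.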


62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence


Full Text: DOI

