High-dimensional classification using features annealed independence rules.

*(English)* Zbl 1360.62327

Summary: Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina [Bernoulli 10 (2004) 989-1010] show that the Fisher discriminant performs poorly due to diverging spectra and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as poor as random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as poorly as random guessing. Thus, it is important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample \(t\)-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistic, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
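
The two steps named in the summary, ranking features by the two-sample \(t\)-statistic and then applying the independence rule to the retained features, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`two_sample_t`, `fit_fair`, `predict_fair`), the unequal-variance form of the \(t\)-statistic, the pooled per-feature variance, and the synthetic Gaussian data are all assumptions made for illustration. In particular, the number of retained features `m` is supplied by hand here, whereas the paper chooses it from an upper bound on the classification error.

```python
import numpy as np


def two_sample_t(X1, X2):
    """Per-feature two-sample t-statistic (unequal-variance form, an assumption)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2)
    return diff / se


def fit_fair(X1, X2, m):
    """Keep the m features with the largest |t|; m is user-supplied in this sketch."""
    keep = np.argsort(-np.abs(two_sample_t(X1, X2)))[:m]
    A, B = X1[:, keep], X2[:, keep]
    n1, n2 = A.shape[0], B.shape[0]
    # Pooled per-feature variance; off-diagonal covariances are ignored.
    pooled = ((n1 - 1) * A.var(axis=0, ddof=1)
              + (n2 - 1) * B.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return keep, A.mean(axis=0), B.mean(axis=0), pooled


def predict_fair(X, model):
    """Independence rule on the kept features: label 1 if the score is positive, else 2."""
    keep, mu1, mu2, pooled = model
    score = ((X[:, keep] - (mu1 + mu2) / 2) * (mu1 - mu2) / pooled).sum(axis=1)
    return np.where(score > 0, 1, 2)


# Synthetic illustration (hypothetical data): only the first s of p features differ.
rng = np.random.default_rng(0)
n, p, s = 30, 2000, 10
shift = np.zeros(p)
shift[:s] = 1.0


def sample(k):
    """Draw k observations per class; class 1 is shifted on the first s features."""
    return rng.normal(size=(k, p)) + shift, rng.normal(size=(k, p))


X1, X2 = sample(n)
T1, T2 = sample(500)
truth = np.r_[np.ones(500), 2 * np.ones(500)]
for m in (s, p):
    pred = predict_fair(np.vstack([T1, T2]), fit_fair(X1, X2, m))
    print(f"m = {m:4d}: test error = {np.mean(pred != truth):.3f}")
```

Comparing the two printed error rates, one for a small `m` and one for `m = p`, gives a rough sense of the noise accumulation described in the summary; the exact numbers depend on the simulated data and are not results from the paper.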

##### MSC:

62H30 | Classification and discrimination; cluster analysis (statistical aspects) |

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

##### Keywords:

classification; feature extraction; high dimensionality; independence rule; misclassification rates

##### References:

[1] | Antoniadis, A., Lambert-Lacroix, S. and Leblanc, F. (2003). Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19 563-570. |

[2] | Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica 6 311-329. · Zbl 0848.62030 |

[3] | Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006). Prediction by supervised principal components. J. Amer. Statist. Assoc. 101 119-137. · Zbl 1118.62326 |

[4] | Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes,” and some alternatives when there are many more variables than observations. Bernoulli 10 989-1010. · Zbl 1064.62073 |

[5] | Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Stat. Appl. Genet. Mol. Biol. 3 1-33. · Zbl 1086.62119 |

[6] | Bühlmann, P. and Yu, B. (2003). Boosting with the \(L_2\) loss: Regression and classification. J. Amer. Statist. Assoc. 98 324-339. · Zbl 1041.62029 |

[7] | Bura, E. and Pfeiffer, R. M. (2003). Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics 19 1252-1258. |

[8] | Cao, H. (2007). Moderate deviations for two sample \(t\)-statistics. ESAIM Probab. Statist. 11 264-271. · Zbl 1181.60037 |

[9] | Chiaromonte, F. and Martinelli, J. (2002). Dimension reduction strategies for analyzing global gene expression data with a response. Math. Biosci. 176 123-144. · Zbl 0999.62090 |

[10] | Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061-1069. |

[11] | Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 77-87. · Zbl 1073.62576 |

[12] | Fan, J. (1996). Test of significance based on wavelet thresholding and Neyman’s truncation. J. Amer. Statist. Assoc. 91 674-688. · Zbl 0869.62032 |

[13] | Fan, J., Hall, P. and Yao, Q. (2006). To how many simultaneous hypothesis tests can normal, Student’s \(t\) or bootstrap calibration be applied? Manuscript. · Zbl 1332.62063 |

[14] | Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In Proceedings of the International Congress of Mathematicians (M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera, eds.) III 595-622. Eur. Math. Soc., Zürich. · Zbl 1117.62137 |

[15] | Fan, J. and Ren, Y. (2006). Statistical analysis of DNA microarray data. Clinical Cancer Research 12 4469-4473. |

[16] | Fan, J. and Lv, J. (2007). Sure independence screening for ultra-high dimensional feature space. Manuscript. |

[17] | Friedman, J. H. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 165-175. |

[18] | Ghosh, D. (2002). Singular value decomposition regression modeling for classification of tumors from microarray experiments. Proceedings of the Pacific Symposium on Biocomputing 11462-11467. |

[19] | Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531-537. · Zbl 1047.65504 |

[20] | Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62 4963-4967. |

[21] | Greenshtein, E. (2006). Best subset selection, persistence in high-dimensional statistical learning and optimization under \(\ell_1\) constraint. Ann. Statist. 34 2367-2386. · Zbl 1106.62022 |

[22] | Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971-988. · Zbl 1055.62078 |

[23] | Huang, X. and Pan, W. (2003). Linear regression and two-class classification with gene expression data. Bioinformatics 19 2072-2078. |

[24] | Meinshausen, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal. · Zbl 1452.62522 |

[25] | Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18 39-50. |

[26] | Shao, Q. M. (2005). Self-normalized limit theorems in probability and statistics. Manuscript. |

[27] | Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico, A. and Richie, J. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203-209. |

[28] | Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99 6567-6572. |

[29] | van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York. · Zbl 0862.60002 |

[30] | West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Marks, J. R. and Nevins, J. R. (2001). Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci. 98 11462-11467. |

[31] | Zou, H., Hastie, T. and Tibshirani, R. (2004). Sparse principal component analysis. Technical report. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.