
Noise accumulation in high dimensional classification and total signal index. (English) Zbl 1498.68284

Summary: Great attention has been paid to Big Data in recent years. Such data hold promise for scientific discoveries but also pose challenges to analyses. One potential challenge is noise accumulation. In this paper, we explore noise accumulation in high dimensional two-group classification. First, we revisit a previous assessment of noise accumulation with principal component analyses, which yields a different threshold for discriminative ability than originally identified. Then we extend our scope to its impact on classifiers developed with three common machine learning approaches – random forest, support vector machine, and boosted classification trees. We simulate four scenarios with differing amounts of signal strength to evaluate each method. After determining that noise accumulation may affect the performance of these classifiers, we assess factors that impact it. We conduct simulations with random forest classifiers, varying sample size, signal strength, signal strength proportional to the number of predictors, and signal magnitude. These simulations suggest that noise accumulation affects the discriminative ability of high-dimensional classifiers developed using common machine learning methods, and that this effect can be modified by sample size, signal strength, and signal magnitude. We develop a measure, the total signal index (TSI), to track the trends of total signal and noise accumulation.
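The kind of simulation the summary describes can be illustrated with a minimal sketch. It is not the authors' code or method: it swaps in a simple nearest-centroid rule in place of random forest, support vector machine, or boosted trees, and every name and parameter value (`simulate_accuracy`, `n_signal`, `shift`, the sample sizes) is an assumption chosen for illustration. Only the first `n_signal` predictors differ in mean between the two groups; the rest are pure noise.

```python
import random


def simulate_accuracy(p, n_signal=10, shift=1.0, n_train=20, n_test=200, seed=0):
    """Test accuracy of a nearest-centroid rule using p predictors.

    Hypothetical setup: only the first n_signal predictors carry signal
    (a mean shift in group 1); all other predictors are N(0, 1) noise.
    """
    rng = random.Random(seed)

    def draw(group):
        # One observation: unit-variance Gaussian predictors, shifted mean
        # on the signal predictors for group 1 only.
        return [
            rng.gauss(shift if (group == 1 and j < n_signal) else 0.0, 1.0)
            for j in range(p)
        ]

    def centroid(rows):
        # Column-wise mean of the training observations.
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0 = centroid([draw(0) for _ in range(n_train)])
    c1 = centroid([draw(1) for _ in range(n_train)])

    def predict(x):
        # Assign x to the group whose estimated centroid is closer.
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0

    correct = 0
    for group in (0, 1):
        for _ in range(n_test):
            correct += predict(draw(group)) == group
    return correct / (2 * n_test)


if __name__ == "__main__":
    # Same signal, increasing numbers of noise predictors.
    for p in (10, 100, 1000):
        print(p, simulate_accuracy(p))
```

Under these assumptions the total signal stays fixed while the number of noise predictors grows, so the estimated centroids pick up error in every added dimension and test accuracy tends to fall as `p` increases.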

MSC:

68T09 Computational aspects of data analysis and big data
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62R07 Statistical aspects of big data and data science

References:

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001. · Zbl 1007.68152
[2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2011. Available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995. · Zbl 0831.68098
[4] Miriam R. Elman. Noise accumulation for high dimensional classification code, 2018. Available at https://github.com/sink-or-swim/NoiseAccumulation.
[5] Jianqing Fan. Features of big data and sparsest solution in high confidence set. In Xihong Lin, Christian Genest, David L. Banks, Geert Molenberghs, David W. Scott, and Jane-Ling Wang, editors, Past, Present, and Future of Statistical Science, pages 531-548. Chapman and Hall/CRC, New York, NY, USA, 2014.
[6] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605-2637, 2008. · Zbl 1360.62327
[7] Jianqing Fan, Fang Han, and Han Liu. Challenges of Big Data analysis. National Science Review, 1(2):293-314, 2014.
[8] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000. · Zbl 1106.62323
[9] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232, 2001. · Zbl 1043.62034
[10] Peter Hall, Yvonne Pittelkow, and Malay Ghosh. Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):159-173, 2008. · Zbl 1400.62094
[11] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2009. · Zbl 1273.62005
[12] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002.
[13] David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2015. R package version 1.6-6.
[14] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017.
[15] G. Ridgeway. gbm: Generalized Boosted Regression Models, 2017. R package version 2.1.3.
[16] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485-1510, 2009. · Zbl 1235.68209