Robust multivariate nonparametric tests via projection averaging. (English) Zbl 1460.62087

Summary: In this work, we generalize the Cramér-von Mises statistic via projection averaging to obtain a robust test for the multivariate two-sample problem. The proposed test is consistent against all fixed alternatives, robust to heavy-tailed data and minimax rate optimal against a certain class of alternatives. Our test statistic is completely free of tuning parameters and is computationally efficient even in high dimensions. When the dimension tends to infinity, the proposed test is shown to have comparable power to the existing high-dimensional mean tests under certain location models. As a by-product of our approach, we introduce a new metric called the angular distance which can be thought of as a robust alternative to the Euclidean distance. Using the angular distance, we connect the proposed method to the reproducing kernel Hilbert space approach. In addition to the Cramér-von Mises statistic, we demonstrate that the projection-averaging technique can be used to define robust multivariate tests in many other problems.
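The projection-averaging idea in the summary can be illustrated with a minimal sketch. This is not the authors' implementation. The summary describes a statistic that is free of tuning parameters, whereas the sketch below approximates the average over projection directions by Monte Carlo, which introduces a number of random directions (`n_dirs`) that the actual test does not need. All function names and parameters are hypothetical, and the permutation calibration is an assumption chosen for simplicity.

```python
import numpy as np

def cvm_1d(x, y):
    """Plug-in univariate Cramér-von Mises distance: the mean squared
    difference of the two empirical CDFs, evaluated at the pooled sample."""
    pooled = np.concatenate([x, y])
    fx = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), pooled, side="right") / len(y)
    return float(np.mean((fx - fy) ** 2))

def random_directions(n_dirs, d, rng):
    """Directions drawn uniformly from the unit sphere in R^d
    (normalised standard Gaussian vectors)."""
    dirs = rng.standard_normal((n_dirs, d))
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def projection_averaged_cvm(X, Y, dirs):
    """Average the univariate distance over the projections onto each direction.
    Assumption: Monte Carlo stand-in for an exact average over the sphere."""
    px, py = X @ dirs.T, Y @ dirs.T  # each of shape (sample size, n_dirs)
    return float(np.mean([cvm_1d(px[:, k], py[:, k])
                          for k in range(dirs.shape[0])]))

def permutation_pvalue(X, Y, n_dirs=50, n_perm=99, seed=0):
    """Permutation p-value, reusing the same directions for every permutation."""
    rng = np.random.default_rng(seed)
    dirs = random_directions(n_dirs, X.shape[1], rng)
    observed = projection_averaged_cvm(X, Y, dirs)
    pooled, m = np.vstack([X, Y]), len(X)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        stat = projection_averaged_cvm(pooled[perm[:m]], pooled[perm[m:]], dirs)
        exceed += stat >= observed
    return (1 + exceed) / (1 + n_perm)
```

Because the sketch uses only the ordering of the projected observations within each direction, a few extreme observations cannot dominate the statistic, which gives some intuition for the robustness to heavy-tailed data claimed in the summary.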

MSC:

62H15 Hypothesis testing in multivariate analysis
62H20 Measures of association (correlation, canonical correlation, etc.)
62G10 Nonparametric hypothesis testing
62G35 Nonparametric robustness
46E22 Hilbert spaces with reproducing kernels (= (proper) functional Hilbert spaces, including de Branges-Rovnyak and other structured spaces)

Software:

energy; MNM
