Rosenblatt, Jonathan D.; Nadler, Boaz
On the optimality of averaging in distributed statistical learning. (English) Zbl 1426.68241
Inf. Inference 5, No. 4, 379-404 (2016).

Summary: A common approach to statistical learning with big data is to randomly split the data among \(m\) machines and learn the parameter of interest by averaging the \(m\) individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: first, a classical setting where the number of parameters \(p\) is fixed and the number of samples per machine \(n\rightarrow \infty\); second, a high-dimensional regime where both \(p,n\rightarrow \infty\) with \(p/n \rightarrow \kappa \in (0,1)\). For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-\(p\) setting, we prove that to leading order, averaging is as accurate as the centralized solution. We also derive the second-order error terms and show that these can be non-negligible, notably for nonlinear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner to make an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.

Cited in 28 Documents

MSC:
68T05 Learning and adaptive systems in artificial intelligence
62R07 Statistical aspects of big data and data science
68T09 Computational aspects of data analysis and big data
68W15 Distributed algorithms

Keywords: machine learning; M-estimation; distributed algorithms; empirical risk minimization; big data; high-order asymptotics

Software: Hadoop; MapReduce; Spark

Cite: \textit{J. D. Rosenblatt} and \textit{B. Nadler}, Inf. Inference 5, No. 4, 379--404 (2016; Zbl 1426.68241)

Full Text: DOI arXiv
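In symbols, the strategy studied in the paper can be summarized as follows (a schematic formulation consistent with the summary; the notation \(\hat{\theta}_j\), \(\bar{\theta}\), \(\ell\) and \(Z_{ij}\) is the reviewer's, not necessarily the authors'): each machine \(j = 1, \dots, m\) computes an M-estimator from its own \(n\) samples, and the final estimate is the average,
\[
\hat{\theta}_j = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; Z_{ij}), \qquad \bar{\theta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\theta}_j,
\]
where \(Z_{ij}\) is the \(i\)-th sample on machine \(j\) and \(\ell\) is the loss defining the empirical risk. The paper's fixed-\(p\) result says that \(\bar{\theta}\) matches the centralized minimizer over all \(nm\) samples to leading order, whereas in the regime \(p/n \rightarrow \kappa \in (0,1)\) the averaging estimator suffers a first-order accuracy loss growing linearly in \(m\).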
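A minimal simulation sketch of the split-and-average strategy, with ordinary least squares standing in for a generic M-estimator (the setup, dimensions, and all names below are illustrative assumptions, not the paper's experiments):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative fixed-p setting: p parameters, n samples per machine, m machines.
    p, n, m = 5, 1000, 10
    theta_true = np.ones(p)

    def fit_ols(X, y):
        # Per-machine empirical risk minimizer for the squared loss.
        return np.linalg.lstsq(X, y, rcond=None)[0]

    # Generate one data set of n*m samples and split it evenly among the machines.
    X = rng.standard_normal((n * m, p))
    y = X @ theta_true + rng.standard_normal(n * m)

    # Averaging estimator: fit separately on each split, then average the m estimates.
    theta_avg = np.mean(
        [fit_ols(X[j * n:(j + 1) * n], y[j * n:(j + 1) * n]) for j in range(m)],
        axis=0,
    )

    # Centralized estimator: fit once on all n*m samples.
    theta_central = fit_ols(X, y)

    print("split-and-average error:", np.linalg.norm(theta_avg - theta_true))
    print("centralized error:      ", np.linalg.norm(theta_central - theta_true))

In this fixed-\(p\), large-\(n\) regime, the two errors should be of comparable magnitude, in line with the first-order equivalence stated in the summary.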