
Computing confidence intervals from massive data via penalized quantile smoothing splines. (English) Zbl 1504.62064

Summary: New methodology is presented for computing pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals have empirical coverage rates that are generally within 2% of the nominal rates. The approach is broadly applicable to the analysis of large data sets in one or two dimensions. This work was originally motivated by comparative (or “A/B”) experiments conducted at Netflix to optimize the quality of streaming video, but the proposed methods have general applicability. The methodology is illustrated using an open-source application: the comparison of geo-spatial climate model scenarios from NASA’s Earth Exchange.
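
For readers unfamiliar with the “bag of little bootstraps” named in the summary, the following is a minimal sketch of the general idea, not the paper's algorithm. As a simplifying assumption it estimates a single scalar quantile rather than a penalized quantile smoothing spline. The function names (`weighted_quantile`, `blb_quantile_ci`), the subset exponent `gamma = 0.7`, and the subset and replicate counts are all hypothetical choices made for illustration. Each resample of nominal size n is stored as multinomial counts over the b points of a small subset, so the estimator only ever sees b weighted points; this is presumably why a weighted formulation of the estimator is useful in this setting.

```python
import numpy as np


def weighted_quantile(values, weights, tau):
    """Return the smallest value whose cumulative weight share reaches tau."""
    order = np.argsort(values)
    sorted_values = values[order]
    cumulative = np.cumsum(weights[order])
    idx = np.searchsorted(cumulative, tau * cumulative[-1], side="left")
    return sorted_values[min(idx, len(sorted_values) - 1)]


def blb_quantile_ci(y, tau=0.5, n_subsets=10, gamma=0.7, n_boot=50,
                    level=0.95, seed=0):
    """Percentile interval for the tau-quantile via a bag of little bootstraps.

    Illustrative assumptions: scalar quantile instead of a spline fit,
    subset size b = n**gamma, interval endpoints averaged across subsets.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    b = int(n ** gamma)          # size of each "little" subset
    alpha = 1.0 - level
    lowers, uppers = [], []
    for _ in range(n_subsets):
        # Draw a small subset without replacement from the full data.
        subset = rng.choice(y, size=b, replace=False)
        probs = np.full(b, 1.0 / b)
        estimates = np.empty(n_boot)
        for r in range(n_boot):
            # A resample of nominal size n, kept as counts on b points.
            counts = rng.multinomial(n, probs)
            estimates[r] = weighted_quantile(subset, counts, tau)
        lowers.append(np.quantile(estimates, alpha / 2))
        uppers.append(np.quantile(estimates, 1 - alpha / 2))
    # Average the per-subset interval endpoints.
    return float(np.mean(lowers)), float(np.mean(uppers))


if __name__ == "__main__":
    data = np.random.default_rng(1).standard_normal(20000)
    print(blb_quantile_ci(data, tau=0.5))
```

Because each subset is processed independently, the outer loop could be distributed across workers; the sketch runs it serially for simplicity.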

MSC:

62G15 Nonparametric tolerance and confidence regions
62G08 Nonparametric regression and quantile regression
62R07 Statistical aspects of big data and data science
