
Multilevel fine-tuning: closing generalization gaps in approximation of solution maps under a limited budget for training data. (English) Zbl 1468.65220

Summary: In scientific machine learning, regression networks have recently been applied to approximate solution maps (e.g., the potential-ground state map of the Schrödinger equation). To reduce the generalization error, the regression network needs to be fit on a large number of training samples (e.g., a collection of potential-ground state pairs). However, the training samples are produced by running numerical solvers, which takes significant time in many applications. In this paper, we aim to reduce the generalization error without spending more time on generating training samples. Inspired by few-shot learning techniques, we develop the multilevel fine-tuning algorithm by introducing levels of training: we first train the regression network on samples generated at the coarsest grid and then successively fine-tune the network on samples generated at finer grids. Within the same amount of time, numerical solvers generate more samples on coarse grids than on fine grids. We demonstrate a significant reduction of the generalization error in numerical experiments on challenging problems with oscillations, discontinuities, or rough coefficients. Further analysis can be conducted in the neural tangent kernel regime, and we provide practical estimators of the generalization error. The number of training samples at each level can be optimized to minimize the estimated generalization error under a fixed budget for training data. The optimized distribution of the budget over levels provides practical guidance with theoretical insight, as in the celebrated multilevel Monte Carlo algorithm.
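The following is a minimal, illustrative sketch of the multilevel training loop described above; it is not the authors' implementation. All names are assumptions introduced for illustration: `solve_on_grid` stands in for a numerical solver, `make_dataset` restricts every sample to the coarsest mesh so that a single network handles all levels, and the per-level sample counts mimic a multilevel Monte Carlo style allocation (many cheap coarse-grid samples, few expensive fine-grid samples). The fine-tuning stages reuse the coarse-trained weights with a smaller learning rate.

```python
# Hypothetical sketch of multilevel fine-tuning (illustrative, not the paper's code).
import torch
import torch.nn as nn

def solve_on_grid(x, n_grid):
    """Placeholder for a numerical solver: maps an input function sampled on an
    n_grid point mesh to a 'solution' on the same mesh. A smooth surrogate is
    used here only so that the sketch runs end to end."""
    return torch.tanh(torch.cumsum(x, dim=-1) / n_grid)

def make_dataset(n_samples, n_grid, n_coarse):
    """Generate samples on an n_grid mesh and restrict them to the coarsest mesh
    (n_coarse points) so that one fixed-size network handles every level."""
    x_fine = torch.randn(n_samples, n_grid)
    y_fine = solve_on_grid(x_fine, n_grid)
    stride = n_grid // n_coarse
    return x_fine[:, ::stride], y_fine[:, ::stride]

def train(model, x, y, epochs, lr):
    """Plain full-batch training/fine-tuning of the regression network."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    n_coarse = 32                  # coarsest mesh, also the network input/output size
    grids = [32, 64, 128]          # levels: coarse -> fine
    n_samples = [4096, 512, 64]    # more samples where the solver is cheap (MLMC-style)
    model = nn.Sequential(nn.Linear(n_coarse, 256), nn.ReLU(),
                          nn.Linear(256, n_coarse))
    # Level 0: ordinary training on the coarsest (cheapest, most plentiful) data.
    # Levels 1, 2, ...: fine-tune the same weights on finer-grid data,
    # typically with a smaller learning rate.
    for level, (n_grid, n) in enumerate(zip(grids, n_samples)):
        x, y = make_dataset(n, n_grid, n_coarse)
        lr = 1e-3 if level == 0 else 1e-4
        final_loss = train(model, x, y, epochs=200, lr=lr)
        print(f"level {level} (grid {n_grid}, {n} samples): loss {final_loss:.3e}")
```

How the budget is split across levels is the paper's central design question; the fixed counts above are only placeholders for the optimized allocation the authors derive from their generalization-error estimators.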

MSC:

65N55 Multigrid methods; domain decomposition for boundary value problems involving PDEs
65C20 Probabilistic models, generic numerical methods in probability and statistics
65C05 Monte Carlo methods
62J07 Ridge regression; shrinkage estimators (Lasso)
62M45 Neural nets and related approaches to inference from stochastic processes
68T05 Learning and adaptive systems in artificial intelligence
