
Mixed precision block fused multiply-add: error analysis and application to GPU tensor cores. (English) Zbl 1452.65425

Summary: Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, about which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
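
The block FMA in question computes D = C + A*B with low precision factors A and B and higher precision accumulation; the TC16 and TC32 variants described in the summary differ only in whether the block output is rounded back to fp16 or kept in fp32 before it feeds the next block. The following NumPy sketch is not the authors' code: the function names, the block size b, and the test dimensions are illustrative assumptions. It simulates chaining b x b block FMAs along the inner dimension and compares the two output precisions against an fp64 reference, mirroring the accuracy gap the error analysis predicts.

```python
# Minimal NumPy simulation of a mixed precision block FMA D = C + A @ B in the
# spirit of tensor cores: A and B are stored in fp16, products are accumulated
# in fp32, and the output is rounded to fp16 ("TC16") or kept in fp32 ("TC32").
import numpy as np

def block_fma(A, B, C, out_dtype=np.float32):
    """Simulate one block FMA with fp16 inputs and fp32 accumulation.

    out_dtype = np.float16 models the TC16 variant, np.float32 models TC32.
    """
    A16 = A.astype(np.float16)
    B16 = B.astype(np.float16)
    # Promote to fp32 before multiplying so products and sums are carried out
    # in fp32, mimicking the internal accumulation of the tensor core.
    D = C.astype(np.float32) + A16.astype(np.float32) @ B16.astype(np.float32)
    return D.astype(out_dtype)

def matmul_with_block_fma(A, B, b=4, out_dtype=np.float32):
    """Multiply n x n matrices (n divisible by b) by chaining b x b block FMAs."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=out_dtype)
    for i in range(0, n, b):
        for j in range(0, n, b):
            acc = np.zeros((b, b), dtype=out_dtype)
            for k in range(0, n, b):
                # In the TC16 case acc is rounded to fp16 after every block FMA,
                # which is the source of its larger error.
                acc = block_fma(A[i:i+b, k:k+b], B[k:k+b, j:j+b], acc, out_dtype)
            C[i:i+b, j:j+b] = acc
    return C

# Compare TC16 and TC32 accuracy against an fp64 reference product.
rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
ref = A @ B
for dt, name in [(np.float16, "TC16"), (np.float32, "TC32")]:
    C = matmul_with_block_fma(A, B, out_dtype=dt)
    err = np.max(np.abs(C - ref)) / np.max(np.abs(ref))
    print(f"{name}: relative error {err:.2e}")
```

Under these assumptions the TC32 run typically shows an error several orders of magnitude smaller than the TC16 run, consistent with the rounding error bounds the paper derives for the two variants.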

MSC:

65Y10 Numerical algorithms for specific classes of architectures
65F05 Direct numerical methods for linear systems and matrix inversion
65F08 Preconditioners for iterative methods
65G50 Roundoff error
