
Parallel matrix multiplication: a systematic journey. (English) Zbl 1355.65060

Summary: We present a systematic approach to developing distributed-memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementations of matrix-vector multiplication and rank-1 update, continues by revealing a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional (2D) mesh, and finishes by extending these 2D algorithms to so-called three-dimensional (3D) algorithms that view the nodes as a 3D mesh. A cost analysis shows that the 3D algorithms can attain the (order-of-magnitude) lower bound on the cost of communication. The paper introduces a taxonomy for the resulting family of algorithms and explains how each algorithm has merit depending on parameters such as the matrix sizes and architecture characteristics. The techniques described in this paper are at the heart of the Elemental distributed-memory linear algebra library. Performance results from implementations within and on top of this library are given on a representative distributed-memory architecture, the IBM Blue Gene/P supercomputer.
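To make the 2D family of algorithms concrete, the sketch below simulates a SUMMA-style matrix-matrix multiplication on an r × c mesh of nodes in plain Python, with no MPI: each node (i, j) owns one block of A, B, and C, and at step k the k-th block column of A is "broadcast" within mesh rows and the k-th block row of B within mesh columns, after which each node accumulates a local product. All function names here are illustrative, not the Elemental API, and the broadcasts are replaced by direct block reads since everything runs in one process; the grid is restricted to square (r = c) to keep the blocking simple.

```python
def matmul(A, B):
    """Naive dense product of two list-of-lists matrices."""
    n, m, l = len(A), len(B[0]), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(l)) for j in range(m)]
            for i in range(n)]

def add_into(C, D):
    """Accumulate D into C elementwise."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += D[i][j]

def block(M, i, j, bs):
    """Extract the (i, j) block of size bs x bs from matrix M."""
    return [row[j * bs:(j + 1) * bs] for row in M[i * bs:(i + 1) * bs]]

def summa(A, B, r, c, n):
    """Simulate SUMMA on an r x c node mesh for n x n matrices.

    Requires a square grid (r == c) with n divisible by r, so every
    block is bs x bs with bs = n // r.
    """
    assert r == c and n % r == 0
    bs = n // r
    # Local C block owned by node (i, j), initialized to zero.
    Cloc = {(i, j): [[0.0] * bs for _ in range(bs)]
            for i in range(r) for j in range(c)}
    for k in range(r):              # one broadcast step per block column/row
        for i in range(r):
            for j in range(c):
                Acol = block(A, i, k, bs)  # broadcast of A(:, k) within row i
                Brow = block(B, k, j, bs)  # broadcast of B(k, :) within column j
                add_into(Cloc[(i, j)], matmul(Acol, Brow))
    # Reassemble the global result from the local blocks.
    C = [[0.0] * n for _ in range(n)]
    for (i, j), blk in Cloc.items():
        for a in range(bs):
            for b in range(bs):
                C[i * bs + a][j * bs + b] = blk[a][b]
    return C
```

In a real implementation each of the √p steps broadcasts O(n²/p) data per node, for a total of O(n²/√p) words communicated per node, which is the communication volume the 2D algorithms are measured against; the 3D algorithms in the paper lower this further by replicating data across a third mesh dimension.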

MSC:

65F30 Other matrix algorithms (MSC2010)
65Y05 Parallel numerical computation
65Y20 Complexity and performance of numerical algorithms
