KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators. (English) Zbl 1369.65042


MSC:

65Fxx Numerical linear algebra
65Y10 Numerical algorithms for specific classes of architectures
65Y15 Packaged methods for numerical algorithms
65Y20 Complexity and performance of numerical algorithms
Full Text: DOI arXiv

References:

[1] Ahmad Abdelfattah, Jack Dongarra, David Keyes, and Hatem Ltaief. 2013a. Optimizing memory-bound SYMV kernel on GPU hardware accelerators. In High Performance Computing for Computational Science (VECPAR’12), Michel Daydé, Osni Marques, and Kengo Nakajima (Eds.). Lecture Notes in Computer Science, Vol. 7851. Springer, Berlin, 72–79. DOI:http://dx.doi.org/10.1007/978-3-642-38718-0_10 · Zbl 06232588
[2] Ahmad Abdelfattah, Eric Gendron, Damien Gratadour, David Keyes, Hatem Ltaief, Arnaud Sevin, and Fabrice Vidal. 2014. High performance pseudo-analytical simulation of multi-object adaptive optics over multi-GPU systems. In Euro-Par 2014 Parallel Processing, Fernando Silva, Inês Dutra, and Vítor Santos Costa (Eds.). Lecture Notes in Computer Science, Vol. 8632. Springer International Publishing, 704–715. DOI:http://dx.doi.org/10.1007/978-3-319-09873-9_59 · Zbl 06400014
[3] Ahmad Abdelfattah, David Keyes, and Hatem Ltaief. 2013b. Systematic approach in optimizing numerical memory-bound kernels on GPU. In Euro-Par 2012: Parallel Processing Workshops, Ioannis Caragiannis, Michael Alexander, Rosa Maria Badia, Mario Cannataro, Alexandru Costan, Marco Danelutto, Frédéric Desprez, Bettina Krammer, Julio Sahuquillo, Stephen L. Scott, and Josef Weidendorfer (Eds.). Lecture Notes in Computer Science, Vol. 7640. Springer, Berlin, 207–216. DOI:http://dx.doi.org/10.1007/978-3-642-36949-0_23 · Zbl 06151927
[4] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 180 (2009), 012037.
[5] E. Anderson, Z. Bai, C. Bischof, L. Susan Blackford, James W. Demmel, Jack J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and Danny C. Sorensen. 1999. LAPACK Users’ Guide (3rd ed.). Society for Industrial and Applied Mathematics, Philadelphia. · Zbl 0934.65030
[6] BLAS. 1979. Basic Linear Algebra Subprograms. Retrieved from http://www.netlib.org/blas/. · Zbl 0412.65022
[7] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. 2004. Brook for GPUs: Stream computing on graphics hardware. In ACM SIGGRAPH 2004 Papers (SIGGRAPH’04). ACM, New York, NY, 777–786. DOI:http://dx.doi.org/10.1145/1186562.1015800
[8] cuBLAS-XT. 2014. Accelerate BLAS calls with multiple GPUs. Retrieved from https://developer.nvidia.com/cublasxt.
[9] J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J. Kelmelis. 2010. CULA: Hybrid GPU accelerated linear algebra routines. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Vol. 7705. 1.
[10] KBLAS. 2014. KAUST Basic Linear Algebra Subprograms. Retrieved from http://cec.kaust.edu.sa/Pages/kblas.aspx.
[11] David B. Kirk and Wen-mei W. Hwu. 2010. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers, San Francisco, CA.
[12] MAGMA. 2009. Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee. Retrieved from http://icl.cs.utk.edu/magma/.
[13] John D. McCalpin. 1991-2007. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Technical Report. University of Virginia, Charlottesville, Virginia. Retrieved from http://www.cs.virginia.edu/stream/.
[14] John D. McCalpin. 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (Dec. 1995), 19–25.
[15] Rajib Nath, Stanimire Tomov, Tingxing “Tim” Dong, and Jack Dongarra. 2011b. Optimizing symmetric dense matrix-vector multiplication on GPUs. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, NY, Article 6, 10 pages. DOI:http://dx.doi.org/10.1145/2063384.2063392
[16] Rajib Nath, Stanimire Tomov, and Jack Dongarra. 2010a. An improved MAGMA GEMM for Fermi graphics processing units. International Journal of High Performance Computing Applications 24, 4 (Nov. 2010), 511–515. DOI:http://dx.doi.org/10.1177/1094342010385729
[17] Rajib Nath, Stanimire Tomov, and Jack Dongarra. 2010b. BLAS for GPUs. CRC Press, 57–80. DOI:http://dx.doi.org/10.1201/b10376-6 · Zbl 1323.65140
[18] Rajib Nath, Stanimire Tomov, and Jack Dongarra. 2011a. Accelerating GPU kernels for dense linear algebra. In Proceedings of the 9th International Conference on High Performance Computing for Computational Science (VECPAR’10). Springer-Verlag, Berlin, 83–92. · Zbl 1323.65140
[19] NVIDIA. 2009. NVIDIA Fermi Compute Architecture Whitepaper. Retrieved from http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[20] NVIDIA. 2012. NVIDIA Kepler GK110 Architecture Whitepaper. Retrieved from http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[21] NVIDIA. 2014a. CUDA C Programming Guide. Retrieved from http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[22] NVIDIA. 2014b. The NVIDIA CUDA Basic Linear Algebra Subroutines. Retrieved from https://developer.nvidia.com/cublas/.
[23] NVIDIA. 2014c. cuBLAS :: CUDA Toolkit Documentation. Retrieved from http://docs.nvidia.com/cuda/cublas/#appendix-acknowledgements.
[24] OpenACC. 2011. Directives for Accelerators. Retrieved from http://www.openacc-standard.org/.
[25] OpenCL. 2009. The open standard for parallel programming of heterogeneous systems. Retrieved from http://www.khronos.org/opencl/.
[26] Guangming Tan, Linchuan Li, Sean Treichler, Everett Phillips, Yungang Bao, and Ninghui Sun. 2011. Fast implementation of DGEMM on Fermi GPU. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, NY, Article 35, 11 pages. DOI:http://dx.doi.org/10.1145/2063384.2063431
[27] Stanimire Tomov, Rajib Nath, and Jack Dongarra. 2010. Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing. Parallel Computing 36, 12 (Dec. 2010), 645–654. DOI:http://dx.doi.org/10.1016/j.parco.2010.06.001 · Zbl 1214.65020
[28] V. Volkov and J. W. Demmel. 2008. Benchmarking GPUs to tune dense linear algebra. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2008 (SC’08). 1–11. DOI:http://dx.doi.org/10.1109/SC.2008.5214359
[29] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM 52, 4 (April 2009), 65–76. DOI:http://dx.doi.org/10.1145/1498765.1498785 · Zbl 05747769
[30] Ichitaro Yamazaki, Tingxing Dong, Raffaele Solcà, Stanimire Tomov, Jack Dongarra, and Thomas Schulthess. 2013. Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems. Concurrency and Computation: Practice and Experience 26, 16 (2013), 2652–2666. DOI:http://dx.doi.org/10.1002/cpe.3152
This reference list is based on information provided by the publisher or taken from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data-conversion errors. In some cases, these data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references of the original paper as accurately as possible, without claiming completeness or a perfect matching.