×

Multicore-optimized wavefront diamond blocking for optimizing stencil updates. (English) Zbl 1331.68286

Summary: The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.

MSC:

68W15 Distributed algorithms
65Y05 Parallel numerical computation
68M14 Distributed systems
68M20 Performance evaluation, queueing, and scheduling in the context of computer systems
68Q25 Analysis of algorithms and problem complexity
68W10 Parallel algorithms in computer science
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] V. Bandishti, I. Pananilath, and U. Bondhugula, {\it Tiling stencil computations to maximize parallelism}, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, New York, 2012, 40.
[2] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, {\it A practical automatic polyhedral parallelizer and locality optimizer}, ACM SIGPLAN Notices, 43 (2008), pp. 101-113.
[3] M. Christen, O. Schenk, and H. Burkhart, {\it PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures}, in International Parallel and Distributed Processing Symposium, IEEE, Piscataway, NJ, 2011, pp. 676-687.
[4] K. Datta, {\it Auto-tuning Stencil Codes for Cache-Based Multicore Platforms}, Ph.D. thesis, University of California, Berkeley, Berkeley, CA, Berkeley, CA, 2009.
[5] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. Yelick, {\it Optimization and performance modeling of stencil computations on modern microprocessors}, SIAM Rev., 51 (2009), pp. 129-159. · Zbl 1160.65359
[6] H. Dursun, M. Kunaseth, K. Nomura, J. Chame, R. F. Lucas, C. Chen, M. Hall, R. K. Kalia, A. Nakano, and P. Vashishta, {\it Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters}, J. Supercomput., 62 (2012), pp. 946-966.
[7] M. Frigo and V. Strumpen, {\it Cache oblivious stencil computations}, in Proceedings of the 19th Annual International Conference on Supercomputing, ACM, New York, 2005, pp. 361-366.
[9] T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege, {\it Hybrid hexagonal/classical tiling for GPUS}, in Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, ACM, New York, 2014, pp. 66-75.
[10] T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan, {\it The relation between diamond tiling and hexagonal tiling}, Parallel Process. Lett., 24 (2014), 1441002. · Zbl 1327.65289
[11] G. Hager, J. Treibig, J. Habich, and G. Wellein, {\it Exploring performance and power properties of modern multi-core chips via simple machine models}, Concurrency Comput. Pract. Exper., to appear.
[12] G. Hager and G. Wellein, {\it Introduction to High Performance Computing for Scientists and Engineers}, CRC Press, Boca Raton, FL, 2010.
[13] T. Henretty, R. Veras, F. Franchetti, L. N. Pouchet, J. Ramanujam, and P. Sadayappan, {\it A stencil compiler for short-vector SIMD architectures}, in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ACM, New York, 2013, pp. 13-24.
[14] K. Yelick J. Demmel, and S. Williams, {\it Automatic performance tuning (autotuning)}, in The Berkeley Par Lab: Progress in the Parallel Computing Landscape, M. Wrinn D. Patterson, and D. Gannon, eds., Microsoft Research, 2013, pp. 337-376.
[15] L. Lamport, {\it The parallel execution of do loops}, Commun. ACM, 17 (1974), pp. 83-93. · Zbl 0273.68012
[17] T. Malas, G. Hager, H. Ltaief, and D. Keyes, {\it Towards Energy Efficiency and Maximum Computational Intensity for Stencil Algorithms Using Wavefront Diamond Temporal Blocking}, preprint, arXiv:1410.5561, 2014.
[18] N. Maruyama, T. Nomura, K. Sato, and S. Matsuoka, {\it Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers}, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, Piscataway, NJ, 2011, 11.
[19] J. D. McCalpin, {\it STREAM: Sustainable Memory Bandwidth in High Performance Computers}, Technical report, University of Virginia, Charlottesville, VA, 1991-2007.
[20] J. D. McCalpin, {\it Memory bandwidth and machine balance in current high performance computers}, IEEE Comput. Soc. Tech. Committee Comp. Architect. Newsletter, 1995, pp. 19-25.
[21] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey, {\it \(3.5\)-D blocking optimization for stencil computations on modern CPUs and GPUs}, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, New York, 2010, pp. 1-13.
[22] D. Orozco and G. Gao, {\it Mapping the FDTD application to many-core chip architectures}, in Proceedings of the International Conference on Parallel Processing, IEEE, Piscataway, NJ, 2009, pp. 309-316.
[23] D. Orozco, E. Garcia, and G. Gao, {\it Locality optimization of stencil applications using data dependency graphs}, in Languages and Compilers for Parallel Computing, Springer, Berlin, 2011, pp. 77-91.
[24] W. Schönauer, {\it Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers}, http://www.rz.uni-karlsruhe.de/\string rx03/book (2000).
[25] S. Shrestha, J. Manzano, A. Marquez, J. Feo, and G. R. Gao, {\it Jagged tiling for intra-tile parallelism and fine-grain multithreading}, in Proceedings of the 27th International Workshop on Languages and Compilers for Parallel Computing, Hillsboro, OR, Springer-Verlag, New York, 2014, pp. 161-175.
[26] H. Stengel, J. Treibig, G. Hager, and G. Wellein, {\it Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model}, in Proceedings of the 29th International ACM Conference on Supercomputing, ACM, New York, 2015, pp. 207-216.
[27] R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel, {\it Cache oblivious parallelograms in iterative stencil computations}, in Proceedings of the 24th ACM International Conference on Supercomputing, ACM, New York, 2010, pp. 49-59.
[28] R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel, {\it Cache accurate time skewing in iterative stencil computations}, in Proceedings of the International Conference on Parallel Processing, IEEE Computer Society, Los Alamitos, CA, 2011, pp. 571-581.
[29] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson, {\it The Pochoir stencil compiler}, in Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, ACM, New York, 2011, pp. 117-128.
[30] D. Unat, X. Cai, and S. B. Baden, {\it Mint: Realizing CUDA performance in \(3\) D stencil methods with annotated C}, in Proceedings of the International Conference on Supercomputing, ACM, New York, 2011, pp. 214-224.
[31] G. Wellein, G. Hager, T. Zeiser, M. Wittmann, and H. Fehske, {\it Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization}, in 33rd Annual IEEE International Computer Software and Applications Conference, Vol. 1, IEEE, Piscataway, NJ, 2009, pp. 579-586.
[32] S. Williams, A. Waterman, and D. Patterson, {\it Roofline: An insightful visual performance model for multicore architectures}, Commun. ACM, 52 (2009), pp. 65-76.
[33] M. Wittmann, G. Hager, J. Treibig, and G. Wellein, {\it Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters}, Parallel Process. Lett., 20 (2010), pp. 359-376.
[34] D. G. Wonnacott, {\it Using time skewing to eliminate idle time due to memory bandwidth and network limitations}, in International Parallel and Distributed Processing Symposium, IEEE Computer Society, Los Alamitos, CA, 2000, pp. 171-180.
[35] D. G. Wonnacott and M. M. Strout, {\it On the Scalability of Loop Tiling Techniques}, in Proceedings of the 3rd International Workshop on Polyhedral Compilation Techniques, Berlin, Passau Universität, Passau, Germany, 2013, pp. 3-11.
[36] X. Zhou, {\it Tiling Optimizations for Stencil Computations}, Ph.D. thesis, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, 2013.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.