Research article | Open access

MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction

Published: 08 December 2014

Abstract

GPUs play an increasingly important role in high-performance computing. While developing naive code is straightforward, optimizing massively parallel applications requires a deep understanding of the underlying architecture: the developer must contend with complex index calculations and manual memory transfers. This article classifies the memory access patterns used in most parallel algorithms, based on Berkeley’s Parallel “Dwarfs.” It then proposes the MAPS framework, a device-level memory abstraction that facilitates memory access on GPUs, alleviating complex indexing through on-device containers and iterators. The article presents an implementation of MAPS and shows that its performance is comparable to that of carefully optimized implementations of real-world applications.
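
To make the idea concrete, here is a minimal sketch of what a device-level container and iterator can look like in CUDA. The names (Window1D, init, begin, end) are hypothetical illustrations, not the actual MAPS API; the point is only to show how shared-memory staging, halo handling, and index arithmetic can be hidden behind an iterator so that the kernel body reads like serial code.

```cuda
// Hypothetical sketch of a device-level container/iterator abstraction in the
// spirit of MAPS. Window1D and its members are illustrative names only.
#include <cuda_runtime.h>

template <typename T, int BLOCK, int RADIUS>
struct Window1D {
    // Per-block tile of the input, including a halo of RADIUS cells on each
    // side, staged cooperatively into shared memory by init().
    T tile[BLOCK + 2 * RADIUS];

    __device__ void init(const T* in, int n) {
        // Each thread loads a strided subset of the tile; out-of-range
        // elements are zero-padded. All boundary and index logic lives
        // here, not in the user's kernel.
        for (int i = threadIdx.x; i < BLOCK + 2 * RADIUS; i += BLOCK) {
            int g = blockIdx.x * BLOCK + i - RADIUS;
            tile[i] = (g >= 0 && g < n) ? in[g] : T(0);
        }
        __syncthreads();
    }

    // Iterators over the (2*RADIUS+1)-wide neighborhood of the calling
    // thread's own element.
    __device__ const T* begin() const { return tile + threadIdx.x; }
    __device__ const T* end()   const { return tile + threadIdx.x + 2 * RADIUS + 1; }
};

// A 1D mean-filter stencil written against the container: the kernel body
// contains no explicit shared-memory index arithmetic.
template <int BLOCK, int RADIUS>
__global__ void stencil(const float* in, float* out, int n) {
    __shared__ Window1D<float, BLOCK, RADIUS> win;
    win.init(in, n);
    float sum = 0.0f;
    for (const float* it = win.begin(); it != win.end(); ++it)
        sum += *it;
    int gid = blockIdx.x * BLOCK + threadIdx.x;
    if (gid < n) out[gid] = sum / float(2 * RADIUS + 1);
}
```

A caller would launch this as, for example, `stencil<256, 2><<<(n + 255) / 256, 256>>>(d_in, d_out, n)`. Because the staging and boundary handling live inside the container, the block size and halo width can change without touching the kernel body, which is the kind of decoupling the abstract describes.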




    Published In

ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
January 2015, 797 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2695583

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 December 2014
    Accepted: 01 October 2014
    Revised: 01 August 2014
    Received: 01 May 2014
    Published in TACO Volume 11, Issue 4


    Author Tags

    1. GPGPU
    2. heterogeneous computing architectures
    3. memory abstraction
    4. memory access patterns

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministry of Science and Technology, Israel


