DOI: 10.1145/1854273.1854318
research-article

Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Published: 11 September 2010

Abstract

Ocelot is a dynamic compilation framework designed to map the explicitly data-parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain-specific applications.
This paper presents a high-level overview of the implementation of the Ocelot dynamic compiler, highlighting design decisions and trade-offs and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications, and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
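
As a rough illustration of what mapping a bulk-synchronous kernel onto a multithreaded CPU can involve, the sketch below runs the logical threads of each thread block (CTA) one after another between barriers and distributes independent CTAs over a pool of OS threads. It is a toy model written for this page, not Ocelot's implementation: the names `reverse_block_kernel`, `run_cta`, and `launch` are hypothetical, a Python generator `yield` stands in for a CUDA barrier, and it assumes every thread in a CTA reaches the same barriers.

```python
from concurrent.futures import ThreadPoolExecutor


def reverse_block_kernel(tid, ctaid, ntid, shared, data):
    """Hypothetical kernel: each CTA reverses its own tile of `data`."""
    base = ctaid * ntid
    shared[tid] = data[base + tid]   # stage one element in per-CTA shared memory
    yield                            # barrier: every thread in the CTA waits here
    data[base + tid] = shared[ntid - 1 - tid]


def run_cta(kernel, ctaid, ntid, data):
    """Run one CTA by serializing its logical threads between barriers."""
    shared = [None] * ntid
    threads = [kernel(tid, ctaid, ntid, shared, data) for tid in range(ntid)]
    while threads:
        still_running = []
        for thread in threads:       # advance each thread to its next barrier
            try:
                next(thread)
                still_running.append(thread)
            except StopIteration:    # this logical thread has finished
                pass
        threads = still_running


def launch(kernel, grid, ntid, data, workers=4):
    """Distribute independent CTAs over a pool of OS threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda ctaid: run_cta(kernel, ctaid, ntid, data), range(grid)))


data = list(range(8))
launch(reverse_block_kernel, grid=2, ntid=4, data=data)
print(data)  # [3, 2, 1, 0, 7, 6, 5, 4]
```

Because CTAs do not synchronize with one another within a kernel launch, they can be handed to worker threads in any order; only the threads inside a CTA need the barrier-by-barrier schedule.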

References

[1] IMPACT, "The Parboil benchmark suite," 2007.
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IEEE International Symposium on Workload Characterization (IISWC 2009), October 2009.
[3] A. Kerr, D. Campbell, and M. Richards, "GPU VSIPL: High-performance VSIPL implementation for GPUs," in HPEC '08: High Performance Embedded Computing Workshop, Lexington, MA, USA, 2008.
[4] J. Hoberock and N. Bell, "Thrust: A parallel template library," 2009, version 1.2.
[5] L. G. Valiant, "A bridging model for parallel computation," Commun. ACM, vol. 33, no. 8, pp. 103-111, 1990.
[6] J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, "Rigel: An architecture and scalable programming interface for a 1000-core accelerator," in ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM, 2009.
[7] NVIDIA, "NVIDIA's next generation CUDA compute architecture: Fermi," NVIDIA Corporation, Tech. Rep., 2009.
[8] AMD, "R600/R700/Evergreen assembly language format," Tech. Rep., 2009.
[9] A. Kerr, G. Diamos, and S. Yalamanchili, "Modeling GPU-CPU workloads and systems," in Third Workshop on General-Purpose Computation on Graphics Processing Units, Pittsburgh, PA, USA, March 2010.
[10] C. Luk, S. Hong, and H. Kim, "Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping," in MICRO '09. New York, USA: IEEE, December 2009.
[11] V. J. Jimenez, L. Vilanova, I. Gelado, M. Gil, G. Fursin, and N. Navarro, "Predictive runtime code scheduling for heterogeneous architectures," in HiPEAC '09: Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 19-33.
[12] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson, "The Transmeta Code Morphing™ Software: Using speculation, recovery, and adaptive retranslation to address real-life challenges," in CGO '03: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, 2003, pp. 15-24.
[13] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient GPU control flow," in MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2007, pp. 407-420.
[14] J. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W.-m. Hwu, "Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs," in CGO 2010, Toronto, Canada, April 2010.
[15] G. Diamos, A. Kerr, and M. Kesavan, "Translating GPU binaries to tiered SIMD architectures with Ocelot," Georgia Institute of Technology, Tech. Rep. GIT-CERCS-09-01, January 2009.
[16] A. Kerr, G. Diamos, and S. Yalamanchili, "A characterization and analysis of PTX kernels," in IISWC '09: IEEE International Symposium on Workload Characterization, Austin, TX, USA, October 2009.
[17] C. Madriles, P. Lopez, J. M. Codina, E. Gibert, F. Latorre, A. Martinez, R. Martinez, and A. Gonzalez, "Anaphase: A fine-grain thread decomposition scheme for speculative multithreading," in PACT '09: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 2009, pp. 15-25.
[18] A. Chernoff and R. Hookway, "Digital FX!32: Running 32-bit x86 applications on Alpha NT," in NT '97: Proceedings of the USENIX Windows NT Workshop 1997. Berkeley, CA, USA: USENIX Association, 1997.
[19] B. Alpern, S. Augart, S. M. Blackburn, M. Butrico, A. Cocchi, P. Cheng, J. Dolby, S. Fink, D. Grove, M. Hind, K. S. McKinley, M. Mergen, J. E. B. Moss, T. Ngo, and V. Sarkar, "The Jikes research virtual machine project: Building an open-source research community," IBM Syst. J., vol. 44, no. 2, 2005.
[20] V. Bala, E. Duesterwald, and S. Banerjia, "Dynamo: A transparent dynamic optimization system," in PLDI '00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, 2000.
[21] V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, "Pin: A binary instrumentation tool for computer architecture research and education," in WCAE '04: Proceedings of the 2004 Workshop on Computer Architecture Education. New York, NY, USA: ACM, 2004, p. 22.
[22] N. Nethercote and J. Seward, "Valgrind: A framework for heavyweight dynamic binary instrumentation," SIGPLAN Not., vol. 42, no. 6, pp. 89-100, 2007.
[23] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Boston, MA, USA, April 2009.
[24] S. Collange, D. Defour, and D. Parello, "Barra, a modular functional GPU simulator for GPGPU," Tech. Rep. hal-00359342, 2009.
[25] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in CGO '04: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, 2004, p. 75.
[26] G. Diamos, "State explosion: An obvious limitation to strong scaling," NFinTes, Tech. Rep., 2009.
[27] G. Diamos, "Hydrazine: A high performance library for C++ and CUDA," November 2009.
[28] K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, "Sequoia: Programming the memory hierarchy," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.
[29] Y. Yan, J. Zhao, Y. Guo, and V. Sarkar, "Hierarchical place trees: A portable abstraction for task parallelism and data movement," in Proceedings of the 22nd Workshop on Languages and Compilers for Parallel Computing (LCPC), October 2009.


      Published In

      PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
      September 2010
      596 pages
      ISBN:9781450301787
      DOI:10.1145/1854273

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cuda
      2. ocelot
      3. ptx

      Qualifiers

      • Research-article

      Conference

      PACT '10
      Sponsor:
      • IFIP WG 10.3
      • IEEE CS TCPP
      • SIGARCH
      • IEEE CS TCAA

      Acceptance Rates

      Overall Acceptance Rate 121 of 471 submissions, 26%

      Article Metrics

      • Downloads (last 12 months): 42
      • Downloads (last 6 weeks): 3
      Reflects downloads up to 16 Oct 2024

      Cited By

      • (2024) CuPBoP: Making CUDA a Portable Language. ACM Transactions on Design Automation of Electronic Systems 29:4 (1-25). DOI: 10.1145/3659949. Online publication date: 23-Apr-2024
      • (2024) Low-Overhead Trace Collection and Profiling on GPU Compute Kernels. ACM Transactions on Parallel Computing 11:2 (1-24). DOI: 10.1145/3649510. Online publication date: 8-Jun-2024
      • (2023) Exploring OpenMP GPU Offloading for Implementing Convolutional Neural Networks. Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores (60-69). DOI: 10.1145/3582514.3582523. Online publication date: 25-Feb-2023
      • (2023) DrGPU: A Top-Down Profiler for GPU Applications. Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering (43-53). DOI: 10.1145/3578244.3583736. Online publication date: 15-Apr-2023
      • (2023) High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (119-134). DOI: 10.1145/3572848.3577475. Online publication date: 25-Feb-2023
      • (2022) Scalar Replacement Considering Branch Divergence. Journal of Information Processing 30 (164-178). DOI: 10.2197/ipsjjip.30.164. Online publication date: 2022
      • (2022) Breaking the Vendor Lock. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (494-504). DOI: 10.1145/3559009.3569687. Online publication date: 8-Oct-2022
      • (2022) COX: Exposing CUDA Warp-level Functions to CPUs. ACM Transactions on Architecture and Code Optimization 19:4 (1-25). DOI: 10.1145/3554736. Online publication date: 16-Sep-2022
      • (2022) Flexible Binary Instrumentation Framework to Profile Code Running on Intel GPUs. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (109-120). DOI: 10.1109/ISPASS55109.2022.00011. Online publication date: May-2022
      • (2022) Profiling Intel Graphics Architecture with Long Instruction Traces. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (1-11). DOI: 10.1109/ISPASS55109.2022.00001. Online publication date: May-2022
