Research Article | Public Access
DOI: 10.1145/3503222.3507778

Breaking the computation and communication abstraction barrier in distributed machine learning workloads

Published: 22 February 2022

Abstract

Recent trends towards large machine learning models require both training and inference tasks to be distributed. Given the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction enables many optimizations that improve the performance of distributed workloads. Manually applying these optimizations, however, requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone.
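As a concrete illustration of this barrier (a minimal sketch, not code from the paper), the conventional data-parallel step below hands the gradient all-reduce to a communication library and the weight update to a separate computation kernel, so neither can be specialized for, fused with, or overlapped with the other. The function and parameter names are illustrative, and a torch.distributed process group is assumed to be initialized.

```python
# Illustrative only: the conventional split between the communication kernel
# (NCCL all-reduce) and the computation kernel (weight update) in
# data-parallel training. Assumes torch.distributed is already initialized.
import torch
import torch.distributed as dist

def data_parallel_sgd_step(param: torch.Tensor, grad: torch.Tensor, lr: float = 1e-3) -> None:
    # Communication: all-reduce the gradient across all workers.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()
    # Computation: the parameter update launches as a separate GPU kernel and
    # cannot start until the entire all-reduce has finished.
    param.add_(grad, alpha=-lr)
```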
Therefore, we present CoCoNet, which contains (i) a domain-specific language to express a distributed machine learning program in the form of computation and communication operations, (ii) a set of semantics-preserving transformations to optimize the program, and (iii) a compiler to generate jointly optimized communication and computation GPU kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNet enabled us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Our experiments show that CoCoNet significantly outperforms state-of-the-art distributed machine learning implementations.
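For intuition, the sketch below hand-codes one of the optimizations the abstract names, overlapping communication with computation, by chunking the gradient and updating each parameter chunk as soon as its asynchronous all-reduce completes. It is written against the plain torch.distributed API with illustrative names and is not CoCoNet code; CoCoNet expresses the same intent in a few DSL lines and lets its compiler generate jointly optimized GPU kernels instead of hand-written loops like this.

```python
# Illustrative only: hand-written chunk-wise overlap of the gradient
# all-reduce with the parameter update, using asynchronous collectives.
import torch
import torch.distributed as dist

def overlapped_sgd_step(param: torch.Tensor, grad: torch.Tensor,
                        lr: float = 1e-3, num_chunks: int = 4) -> None:
    p_chunks = param.chunk(num_chunks)
    g_chunks = grad.chunk(num_chunks)
    # Launch one asynchronous all-reduce per gradient chunk.
    handles = [dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
               for g in g_chunks]
    world = dist.get_world_size()
    # Apply the update to each chunk as soon as its all-reduce completes,
    # overlapping the remaining communication with this computation.
    for p, g, h in zip(p_chunks, g_chunks, handles):
        h.wait()
        p.add_(g, alpha=-lr / world)
```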



Published In

ASPLOS '22: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
February 2022, 1164 pages
ISBN: 9781450392051
DOI: 10.1145/3503222

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. CUDA
        2. Code Generation
        3. Collective Communication
        4. Compiler Optimizations
        5. Distributed Machine Learning
        6. MPI

        Conference

        ASPLOS '22

        Acceptance Rates

Overall acceptance rate: 535 of 2,713 submissions, 20%
