DOI: 10.1145/1088149.1088183

Optimization of MPI collective communication on BlueGene/L systems

Published: 20 June 2005

Abstract

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives, which turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
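The generic MPICH2 collectives the abstract refers to are layered over point-to-point sends and receives. As a rough illustration of what such a scheme looks like, the following Python snippet simulates a recursive-doubling allreduce, a standard point-to-point collective algorithm; the function name and the single-process simulation are illustrative assumptions, not the paper's code or the BlueGene/L-optimized implementation.

```python
def allreduce_recursive_doubling(values):
    """Simulate allreduce(sum) over p ranks, where p is a power of two.

    Each round models every rank exchanging its partial sum with the
    partner whose rank differs in one bit, then combining locally.
    The operation completes in log2(p) rounds of point-to-point messages.
    Illustrative sketch only, not MPICH2 source.
    """
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two rank count, for simplicity"
    partial = list(values)
    dist = 1
    while dist < p:
        # All ranks exchange simultaneously: read partners' values from
        # the pre-round state, then update every rank's partial sum.
        received = [partial[rank ^ dist] for rank in range(p)]
        partial = [partial[r] + received[r] for r in range(p)]
        dist <<= 1
    return partial  # every rank now holds the global sum
```

In a real MPI run each rank would execute only its own exchange (e.g. with a send/receive pair per round); the simulation just makes the log2(p) communication rounds visible, and shows why a scheme built purely from point-to-point messages cannot exploit dedicated collective hardware such as the BlueGene/L tree network.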





Published In

ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing
June 2005, 414 pages
ISBN: 1595931678
DOI: 10.1145/1088149


Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. BlueGene
  2. MPI
  3. collective communication
  4. optimization
  5. performance


Conference

ICS '05: International Conference on Supercomputing 2005
June 20-22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By
  • (2024) gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 437-448. DOI: 10.1145/3650200.3656636
  • (2024) POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 454-456. DOI: 10.1145/3627535.3638467
  • (2024) LIBRA: Enabling Workload-Aware Multi-Dimensional Network Topology Optimization for Distributed Training of Large AI Models. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 205-216. DOI: 10.1109/ISPASS61541.2024.00028
  • (2024) Enhancing Collective Communication in MCM Accelerators for Deep Learning Training. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1-16. DOI: 10.1109/HPCA57654.2024.00069
  • (2024) SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications. Journal of Parallel and Distributed Computing, 183:104767. DOI: 10.1016/j.jpdc.2023.104767
  • (2023) Accelerating Parallel Applications Based on Graph Reordering for Random Network Topologies. IEEE Access, 11:40373-40383. DOI: 10.1109/ACCESS.2023.3269793
  • (2023) xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology, 38(1):166-195. DOI: 10.1007/s11390-023-2894-6
  • (2022) MiCS. Proceedings of the VLDB Endowment, 16(1):37-50. DOI: 10.14778/3561261.3561265
  • (2022) Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm. Proceedings of the 29th European MPI Users' Group Meeting, pp. 11-17. DOI: 10.1145/3555819.3555821
  • (2022) Optimizing the Bruck Algorithm for Non-uniform All-to-all Communication. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 172-184. DOI: 10.1145/3502181.3531468
