DOI: 10.1145/1088149.1088183

Optimization of MPI collective communication on BlueGene/L systems

Published: 20 June 2005

Abstract

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives, which turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
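The generic MPICH2 collectives the abstract refers to are layered over point-to-point sends and receives. As a rough illustration of what such a scheme looks like, the following Python snippet simulates a recursive-doubling allreduce, a standard point-to-point collective algorithm; the function name and the single-process simulation are illustrative assumptions, not the paper's code or the BlueGene/L-optimized implementation.

```python
def allreduce_recursive_doubling(values):
    """Simulate allreduce(sum) over p ranks, where p is a power of two.

    Each round models every rank exchanging its partial sum with the
    partner whose rank differs in one bit, then combining locally.
    The operation completes in log2(p) rounds of point-to-point messages.
    Illustrative sketch only, not MPICH2 source.
    """
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two rank count, for simplicity"
    partial = list(values)
    dist = 1
    while dist < p:
        # All ranks exchange simultaneously: read partners' values from
        # the pre-round state, then update every rank's partial sum.
        received = [partial[rank ^ dist] for rank in range(p)]
        partial = [partial[r] + received[r] for r in range(p)]
        dist <<= 1
    return partial  # every rank now holds the global sum
```

In a real MPI run each rank would execute only its own exchange (e.g. with a send/receive pair per round); the simulation just makes the log2(p) communication rounds visible, and shows why a scheme built purely from point-to-point messages cannot exploit dedicated collective hardware such as the BlueGene/L tree network.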





Published In

ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing
June 2005, 414 pages
ISBN: 1595931678
DOI: 10.1145/1088149


Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. BlueGene
  2. MPI
  3. collective communication
  4. optimization
  5. performance


Conference

ICS '05: International Conference on Supercomputing 2005
June 20-22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By
  • (2024) gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 437-448. DOI: 10.1145/3650200.3656636
  • (2024) POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 454-456. DOI: 10.1145/3627535.3638467
  • (2024) LIBRA: Enabling Workload-Aware Multi-Dimensional Network Topology Optimization for Distributed Training of Large AI Models. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 205-216. DOI: 10.1109/ISPASS61541.2024.00028
  • (2024) Enhancing Collective Communication in MCM Accelerators for Deep Learning Training. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1-16. DOI: 10.1109/HPCA57654.2024.00069
  • (2024) SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications. Journal of Parallel and Distributed Computing, 183:104767. DOI: 10.1016/j.jpdc.2023.104767
  • (2023) Accelerating Parallel Applications Based on Graph Reordering for Random Network Topologies. IEEE Access, 11:40373-40383. DOI: 10.1109/ACCESS.2023.3269793
  • (2023) xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology, 38(1):166-195. DOI: 10.1007/s11390-023-2894-6
  • (2022) MiCS. Proceedings of the VLDB Endowment, 16(1):37-50. DOI: 10.14778/3561261.3561265
  • (2022) Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm. Proceedings of the 29th European MPI Users' Group Meeting, pp. 11-17. DOI: 10.1145/3555819.3555821
  • (2022) Optimizing the Bruck Algorithm for Non-uniform All-to-all Communication. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 172-184. DOI: 10.1145/3502181.3531468
