Article

Lazy direct-to-cache transfer during receive operations in a message passing environment

Authors:

Farshad Khunjush,

Nikitas J. Dimopoulos

CF '06: Proceedings of the 3rd conference on Computing frontiers

Pages 331 - 340

https://doi.org/10.1145/1128022.1128066

Published: 03 May 2006

Abstract

The focus of this work is on techniques that promise to reduce message delivery latency in Message Passing Interface (MPI) environments. The main contributors to this latency are the copying operations needed to transfer a received message and bind it to the consuming process or thread. To reduce this copying overhead and to move toward finer granularity, we introduce architectural extensions comprising a specialized network cache and instructions that manage its operation. We study the caching environment and evaluate a new technique called Lazy Direct-to-Cache Transfer (DTCT). Our simulations show that messages can be bound to and kept in a network cache, where they persist long enough to be consumed. We also demonstrate that lazy DTCT significantly reduces access latency in I/O-intensive environments such as message passing configurations and SMPs, without polluting the data cache.
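The persistence claim in the abstract (messages stay in the network cache long enough to be consumed) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' design or simulator: the `NetworkCache` class, the oldest-first eviction policy, the capacity values, and the event trace are all hypothetical and chosen purely for illustration. It counts how many receives find their message still resident in a dedicated message cache.

```python
from collections import OrderedDict


class NetworkCache:
    """Toy fixed-capacity message cache, kept separate from the data cache.

    Hypothetical model: oldest-first eviction is an assumption, not the
    policy described in the paper.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # message tag -> payload, oldest first

    def bind(self, tag, payload):
        """Place an arriving message in the cache, evicting the oldest if full."""
        if tag in self.lines:
            self.lines.pop(tag)
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[tag] = payload

    def consume(self, tag):
        """Return the payload if still resident (hit), else None (miss)."""
        return self.lines.pop(tag, None)


def hit_rate(events, capacity):
    """Fraction of consume events that find their message in the cache."""
    cache = NetworkCache(capacity)
    hits = 0
    consumes = 0
    for kind, tag in events:
        if kind == "arrive":
            cache.bind(tag, payload=f"msg-{tag}")
        else:  # "consume"
            consumes += 1
            if cache.consume(tag) is not None:
                hits += 1
    return hits / consumes if consumes else 0.0


# Hypothetical trace: messages arrive slightly ahead of their consumption.
events = [
    ("arrive", 1), ("arrive", 2), ("consume", 1),
    ("arrive", 3), ("consume", 2), ("consume", 3),
]

if __name__ == "__main__":
    for capacity in (1, 2):
        print(capacity, hit_rate(events, capacity))
```

In this toy trace, a cache that holds two messages serves every receive, while a one-entry cache loses messages before they are consumed. Keeping received messages in a structure separate from the data cache is what lets them persist without displacing the application's working set, which is the property the abstract emphasizes.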

Index Terms

Lazy direct-to-cache transfer during receive operations in a message passing environment
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

CF '06: Proceedings of the 3rd conference on Computing frontiers

May 2006

430 pages

ISBN:1595933026

DOI:10.1145/1128022

General Chairs: Monica Alderighi (IASF - INAF), Valentina Salapura (IBM)

Program Chair: Sally A. McKee (Cornell University)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CF06

Sponsor:

CF06: Computing Frontiers Conference

May 3 - 5, 2006

Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Cited By

Hartmann T, Moawad A, Fouquet F, Nain G, Klein J, Traon Y, Lethbridge T (2015). Stream my models. Proceedings of the 18th International Conference on Model Driven Engineering Languages and Systems, pp. 80-89. Online publication date: 30-Sep-2015.
https://dl.acm.org/doi/10.5555/3351736.3351750
Khunjush F, Gong D, Dimopoulos N (2011). Single-port and multi-port collective communication operations on single and dual Cell BE processor systems. International Journal of Communication Networks and Distributed Systems 6:4, pp. 373-391. Online publication date: 1-Jun-2011.
https://dl.acm.org/doi/10.1504/IJCNDS.2011.040559
Khunjush F, Dimopoulos N (2009). Hiding message delivery latency using Direct-to-Cache-Transfer techniques in message passing environments. Microprocessors & Microsystems 33:7-8, pp. 430-440. Online publication date: 1-Oct-2009.
https://dl.acm.org/doi/10.1016/j.micpro.2009.07.001
Khunjush F, Dimopoulos N (2008). Extended characterization of DMA transfers on the Cell BE processor. 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1-8. Online publication date: Apr-2008.
https://doi.org/10.1109/IPDPS.2008.4536190
Khunjush F, Dimopoulos N (2007). Comparing direct-to-cache transfer policies to TCP/IP and M-VIA during receive operations in MPI environments. Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications, pp. 208-222. Online publication date: 29-Aug-2007.
https://dl.acm.org/doi/10.5555/2395970.2395994
Khunjush F, Dimopoulos N (2007). Using the Cell Processor As a Network Assist to Minimize Latency. 2007 Canadian Conference on Electrical and Computer Engineering, pp. 936-939. Online publication date: Apr-2007.
https://doi.org/10.1109/CCECE.2007.239
Khunjush F, Dimopoulos N (2007). Comparing Direct-to-Cache Transfer Policies to TCP/IP and M-VIA During Receive Operations in MPI Environments. Parallel and Distributed Processing and Applications, pp. 208-222. Online publication date: 2007.
https://doi.org/10.1007/978-3-540-74742-0_21
