Fighting the memory wall with assisted execution

Published: 14 April 2004
DOI: 10.1145/977091.977116

Abstract

Assisted execution is a form of simultaneous multithreading in which a set of auxiliary "assistant" threads, called nanothreads, is attached to each thread of an application. Nanothreads are lightweight threads that run on the same processor as the main (application) thread and help execute the main thread as fast as possible. Nanothreads exploit processor resources that are left idle by hazards due to program dependencies and memory access delays.

Assisted execution has the potential to alter the current trade-offs between static and dynamic execution mechanisms. Nanothreads can monitor and reconfigure the underlying hardware, emulate hardware, and profile applications with little or no interference to improve the program on-line or off-line.

We demonstrate the power of assisted execution with an important application, namely data prefetching to fight the memory wall problem. Simulation results on several SPEC95 benchmarks show that sequential and stride prefetching implemented with nanothreads perform just as well as ideal hardware prefetchers.
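For readers unfamiliar with the two prefetching policies named in the abstract, the following is a minimal software model of sequential and stride prefetching. It is an illustrative sketch only and is not the nanothread implementation evaluated in the paper: the block size, the `StridePrefetcher` class, the `sequential_prefetch` function, and the rule of requiring the same stride twice before prefetching are all assumptions made here for clarity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BLOCK_SIZE = 32  # bytes per cache block; an assumed value for illustration


def sequential_prefetch(miss_addr: int, degree: int = 1) -> List[int]:
    """On a miss, return the start addresses of the next `degree` blocks."""
    block = miss_addr // BLOCK_SIZE
    return [(block + i) * BLOCK_SIZE for i in range(1, degree + 1)]


@dataclass
class StrideEntry:
    """Per-load state: last address seen and the last observed stride."""
    last_addr: int
    stride: int = 0


class StridePrefetcher:
    """Hypothetical table indexed by the address (PC) of the load instruction."""

    def __init__(self) -> None:
        self.table: Dict[int, StrideEntry] = {}

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record one load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: nothing to predict yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        # Prefetch only when the same non-zero stride repeats.
        confirmed = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        return addr + stride if confirmed else None


if __name__ == "__main__":
    prefetcher = StridePrefetcher()
    for address in (1000, 1064, 1128, 1192):
        print(address, prefetcher.access(pc=0x400, addr=address))
    print(sequential_prefetch(100, degree=2))
```

In this model, sequential prefetching needs no history and simply fetches the following blocks, while stride prefetching keeps a small amount of per-load state so that it can follow regular but non-unit access patterns such as array traversals with a fixed step.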

Published In

CF '04: Proceedings of the 1st conference on Computing frontiers
April 2004
522 pages
ISBN: 1581137419
DOI: 10.1145/977091

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache memories
  2. latency tolerance
  3. prefetching
  4. simultaneous multithreading
  5. superscalar processors

Qualifiers

  • Article

Conference

CF04: Computing Frontiers Conference
April 14-16, 2004
Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Article Metrics

  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 2

Reflects downloads up to 17 Oct 2024

Cited By

  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct-2022
  • (2015) A case for core-assisted bottleneck acceleration in GPUs. ACM SIGARCH Computer Architecture News 43:3S, pp. 41-53. DOI: 10.1145/2872887.2750399. Online publication date: 13-Jun-2015
  • (2015) A case for core-assisted bottleneck acceleration in GPUs. Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 41-53. DOI: 10.1145/2749469.2750399. Online publication date: 13-Jun-2015
  • (2012) Reconfigurable Preexecution in Data Parallel Applications on Multicore Systems. Electrical Engineering and Intelligent Systems, pp. 29-38. DOI: 10.1007/978-1-4614-2317-1_3. Online publication date: 2-May-2012
  • (2011) Software Controlled Adaptive Pre-Execution for Data Prefetching. International Journal of Parallel Programming 40:4, pp. 381-396. DOI: 10.1007/s10766-011-0190-5. Online publication date: 30-Oct-2011
  • (2005) Moving Address Translation Closer to Memory in Distributed Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 16:7, pp. 612-623. DOI: 10.1109/TPDS.2005.84. Online publication date: 1-Jul-2005
  • (2005) Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, pp. 93-104. DOI: 10.1109/MICRO.2005.18. Online publication date: 12-Nov-2005
  • (2005) "Flea-flicker" Multipass Pipelining. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, pp. 319-330. DOI: 10.1109/MICRO.2005.1. Online publication date: 12-Nov-2005
