DOI: 10.1109/MICRO.2005.31
Store Memory-Level Parallelism Optimizations for Commercial Applications

Published: 12 November 2005

Abstract

This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps is then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they are unable to fully mitigate the performance impact of off-chip store misses, and they also leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the Store Miss Accelerator, an optimization of Hardware Scout, and a new application of Speculative Lock Elision, are demonstrated to virtually eliminate the impact of off-chip store misses.
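
The abstract does not state the model's equations. As a rough illustration only, the sketch below assumes a simplified reading of an epoch-style MLP model: off-chip misses issued while an earlier miss is still outstanding share one epoch, each epoch costs one full memory latency, and off-chip CPI is approximated as epochs times latency divided by instructions. The function name `offchip_cpi`, the timeline, and the latency value are all hypothetical and are not taken from the paper.

```python
def offchip_cpi(miss_cycles, latency, instructions):
    """Toy epoch-style estimate (an assumed simplification, not the paper's model).

    miss_cycles  -- issue times (in cycles) of off-chip misses, loads or stores
    latency      -- assumed fixed off-chip miss latency in cycles
    instructions -- instructions executed over the same interval
    Returns (epochs, mlp, off_chip_cpi).
    """
    epochs = 0
    epoch_end = None  # cycle at which the current epoch's leading miss returns
    for t in sorted(miss_cycles):
        # A miss issued after the leading miss has returned starts a new epoch;
        # a miss issued earlier is overlapped and adds no extra latency.
        if epoch_end is None or t >= epoch_end:
            epochs += 1
            epoch_end = t + latency
    mlp = len(miss_cycles) / epochs if epochs else 0.0
    cpi = epochs * latency / instructions
    return epochs, mlp, cpi


# Six misses that fall into three overlapping groups.
overlapped = offchip_cpi([0, 10, 20, 500, 510, 1200], latency=400, instructions=1000)

# The same six misses spaced so that none overlap, as if each had to wait
# for the previous one to complete.
serialized = offchip_cpi([0, 400, 800, 1200, 1600, 2000], latency=400, instructions=1000)
```

In this toy setting the overlapped timeline needs three epochs and the serialized one needs six, so the estimated off-chip CPI doubles. That is the kind of gap the abstract attributes to store handling and the memory consistency model, although the real model and numbers are in the paper itself.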

Published In

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
November 2005
350 pages
ISBN: 0769524400

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Conference

Micro-38
Acceptance Rates

MICRO 38 Paper Acceptance Rate 29 of 147 submissions, 20%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2018) Dynamic memory dependence predication. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 235-246. DOI: 10.1109/ISCA.2018.00029. Online publication date: 2-Jun-2018
  • (2015) zFENCE. Proceedings of the 29th ACM on International Conference on Supercomputing, pp. 295-305. DOI: 10.1145/2751205.2751211. Online publication date: 8-Jun-2015
  • (2014) Research Problems and Opportunities in Memory Systems. Supercomputing Frontiers and Innovations: an International Journal 1:3, pp. 19-55. DOI: 10.14529/jsfi140302. Online publication date: 12-Oct-2014
  • (2014) SAMO. Proceedings of the 11th ACM Conference on Computing Frontiers, pp. 1-10. DOI: 10.1145/2597917.2597940. Online publication date: 20-May-2014
  • (2009) InvisiFence. ACM SIGARCH Computer Architecture News 37:3, pp. 233-244. DOI: 10.1145/1555815.1555785. Online publication date: 20-Jun-2009
  • (2009) Reactive NUCA. ACM SIGARCH Computer Architecture News 37:3, pp. 184-195. DOI: 10.1145/1555815.1555779. Online publication date: 20-Jun-2009
  • (2009) InvisiFence. Proceedings of the 36th annual international symposium on Computer architecture, pp. 233-244. DOI: 10.1145/1555754.1555785. Online publication date: 20-Jun-2009
  • (2009) Reactive NUCA. Proceedings of the 36th annual international symposium on Computer architecture, pp. 184-195. DOI: 10.1145/1555754.1555779. Online publication date: 20-Jun-2009
  • (2007) Pipeline spectroscopy. Proceedings of the 2007 workshop on Experimental computer science, pp. 15-es. DOI: 10.1145/1281700.1281715. Online publication date: 13-Jun-2007
  • (2007) Mechanisms for store-wait-free multiprocessors. ACM SIGARCH Computer Architecture News 35:2, pp. 266-277. DOI: 10.1145/1273440.1250696. Online publication date: 9-Jun-2007
