Store Memory-Level Parallelism Optimizations for Commercial Applications

Authors: Lawrence Spracklen, Santosh G. Abraham

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

Pages 183 - 196

https://doi.org/10.1109/MICRO.2005.31

Published: 12 November 2005

Abstract

This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps is then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they cannot fully mitigate the performance impact of off-chip store misses, and they leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the Store Miss Accelerator, an optimization of Hardware Scout, and a new application of Speculative Lock Elision, are demonstrated to virtually eliminate the impact of off-chip store misses.
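
The step from miss overlap to off-chip CPI can be illustrated with a minimal sketch. It assumes a common first-order reading of an epoch-style MLP model, not the paper's exact formulation: off-chip misses are grouped into epochs of overlapping misses, and each epoch exposes roughly one memory latency. The names `EpochStats`, `mlp`, and `offchip_cpi`, and all numbers in the usage example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpochStats:
    """Counts for one measured window (all values are illustrative)."""
    instructions: int    # instructions retired in the window
    offchip_misses: int  # off-chip misses of any kind, stores included
    epochs: int          # groups of misses whose latencies overlap


def mlp(stats: EpochStats) -> float:
    """Average number of off-chip misses serviced per epoch."""
    return stats.offchip_misses / stats.epochs


def offchip_cpi(stats: EpochStats, miss_latency_cycles: int) -> float:
    """Assumed first-order model: one exposed miss latency per epoch."""
    return stats.epochs * miss_latency_cycles / stats.instructions


if __name__ == "__main__":
    latency = 500  # assumed off-chip latency in cycles
    # Hypothetical baseline: store misses rarely overlap with other misses.
    baseline = EpochStats(instructions=1_000_000, offchip_misses=4_000, epochs=2_000)
    # Hypothetical optimized case: same misses, folded into fewer epochs.
    optimized = EpochStats(instructions=1_000_000, offchip_misses=4_000, epochs=1_600)
    for name, stats in (("baseline", baseline), ("optimized", optimized)):
        print(name, "MLP:", mlp(stats), "off-chip CPI:", offchip_cpi(stats, latency))
```

Under this assumption, an optimization that lets store misses overlap with other off-chip misses leaves the miss count unchanged but reduces the number of epochs, which raises MLP and lowers off-chip CPI.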

Index Terms

Store Memory-Level Parallelism Optimizations for Commercial Applications
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware

Information

Proceedings: 350 pages, ISBN 0769524400

Sponsor: SIGMICRO (ACM Special Interest Group on Microarchitectural Research and Processing)

Publisher: IEEE Computer Society, United States

Conference: Micro-38, The 38th Annual IEEE/ACM International Symposium on Microarchitecture, November 12 - 16, 2005, Barcelona, Spain

Acceptance Rates: MICRO 38, 29 of 147 submissions (20%); overall, 484 of 2,242 submissions (22%)
