Memory Row Reuse Distance and its Role in Optimizing Application Performance

Authors:

Mahmut Kandemir,

Mustafa Karakoy

SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

Pages 137 - 149

https://doi.org/10.1145/2745844.2745867

Published: 15 June 2015

Abstract

Continuously increasing dataset sizes of large-scale applications overwhelm on-chip cache capacities and make the performance of last-level caches (LLC) increasingly important. That is, in addition to maximizing LLC hit rates, it is becoming equally important to reduce LLC miss latencies. One of the critical factors that influence LLC miss latencies is row-buffer locality (i.e., the fraction of LLC misses that hit in the large buffer attached to a memory bank). While there has been a plethora of recent work on optimizing row-buffer performance, to our knowledge, there is no study that quantifies the full potential of row-buffer locality and the impact of maximizing it on application performance.
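To make the definition of row-buffer locality concrete, the following is a minimal sketch that computes it for a trace of LLC misses. It assumes an open-page policy with one row buffer per bank and misses served in trace order; the function name `row_buffer_hit_rate` and the `(bank, row)` trace format are illustrative choices, not taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def row_buffer_hit_rate(misses: Iterable[Tuple[int, int]]) -> float:
    """Fraction of LLC misses whose row is already open in the bank's row buffer.

    Each miss is a (bank, row) pair. Assumption: open-page policy, one row
    buffer per bank, requests served strictly in trace order.
    """
    open_row: Dict[int, int] = {}  # bank -> row currently held in its row buffer
    hits = 0
    total = 0
    for bank, row in misses:
        total += 1
        if open_row.get(bank) == row:
            hits += 1             # row-buffer hit: the row is already open
        else:
            open_row[bank] = row  # row-buffer miss: the new row replaces the old one
    return hits / total if total else 0.0


# Hypothetical trace of LLC misses as (bank, row) pairs.
trace = [(0, 5), (0, 5), (1, 7), (0, 9), (1, 7), (0, 5)]
print(row_buffer_hit_rate(trace))
```

In this toy trace, two of the six misses find their row already open, so a real memory controller that reorders requests could only change the result by changing the order in which rows are visited.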

Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD). We show that, while intra-core RRDs are relatively small (increasing the chances for row-buffer hits), inter-core RRDs are quite large (increasing the chances for row-buffer misses). Motivated by this, we propose two schemes that measure the maximum potential benefits that could be obtained from minimizing RRDs, to the extent allowed by program dependencies. Specifically, one of our schemes (Scheme-I) targets only intra-core RRDs, whereas the other one (Scheme-II) aims at reducing both intra-core RRDs and inter-core RRDs. Our experimental evaluations demonstrate that (i) Scheme-I reduces intra-core RRDs but increases inter-core RRDs; (ii) Scheme-II reduces inter-core RRDs significantly while achieving a similar behavior to Scheme-I as far as intra-core RRDs are concerned; (iii) Scheme-I and Scheme-II improve execution times of our applications by 17% and 21%, respectively, on average; and (iv) both our schemes deliver consistently good results under different memory request scheduling policies.
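The abstract does not spell out how RRD is computed, so the sketch below is only one plausible reading, modeled on classic reuse distance: the RRD of an access is taken to be the number of distinct other rows touched in the same bank since the previous access to that row. A reuse is labeled intra-core when the previous access came from the same core and inter-core otherwise. The definition, the function name `row_reuse_distances`, and the `(core, bank, row)` trace format are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Access = Tuple[int, int, int]  # (core, bank, row); hypothetical trace format


def row_reuse_distances(trace: List[Access]) -> Dict[str, List[int]]:
    """Return RRDs split into intra-core and inter-core reuses.

    Assumed definition: the number of distinct other rows accessed in the
    same bank between two successive accesses to the same row.
    """
    history = defaultdict(list)             # bank -> [(core, row), ...] in trace order
    last_pos: Dict[Tuple[int, int], int] = {}  # (bank, row) -> index of latest access
    result: Dict[str, List[int]] = {"intra": [], "inter": []}
    for core, bank, row in trace:
        seen = history[bank]
        key = (bank, row)
        if key in last_pos:
            prev = last_pos[key]
            prev_core = seen[prev][0]
            # Rows accessed in this bank strictly after the previous access.
            distance = len({r for _, r in seen[prev + 1:]})
            kind = "intra" if prev_core == core else "inter"
            result[kind].append(distance)
        last_pos[key] = len(seen)
        seen.append((core, row))
    return result


# Two cores sharing bank 0.
demo = [(0, 0, 1), (0, 0, 1), (0, 0, 2), (1, 0, 3), (1, 0, 1), (0, 0, 2)]
print(row_reuse_distances(demo))
```

Under this reading and an open-page policy, a distance of 0 corresponds to a row-buffer hit, which is why shrinking RRDs would be expected to raise row-buffer locality.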

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

June 2015

488 pages

ISBN:9781450334860

DOI:10.1145/2745844

General Chairs: Bill Lin (University of California, San Diego), Jun (Jim) Xu (Georgia Tech)

Program Chairs: Sudipta Sengupta (Microsoft Research), Devavrat Shah (Massachusetts Institute of Technology)

ACM SIGMETRICS Performance Evaluation Review Volume 43, Issue 1
Performance evaluation review
June 2015
468 pages
ISSN:0163-5999
DOI:10.1145/2796314
Editors: Derek Eager (University of Saskatchewan), Carey Williamson (University of Calgary)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMETRICS: ACM Special Interest Group on Measurement and Evaluation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF
Intel Inc.

Conference

SIGMETRICS '15

Sponsor:

SIGMETRICS

SIGMETRICS '15: ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems

June 15 - 19, 2015

Portland, Oregon, USA

Acceptance Rates

SIGMETRICS '15 Paper Acceptance Rate 32 of 239 submissions, 13%;

Overall Acceptance Rate 459 of 2,691 submissions, 17%
