tutorial

Compiler Support for Optimizing Memory Bank-Level Parallelism

Authors:

Mahmut Kandemir

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 571 - 582

https://doi.org/10.1109/MICRO.2014.34

Published: 13 December 2014

Abstract

Many prior compiler-based optimization schemes have focused exclusively on cache data locality. However, cache locality is only one factor in the overall performance of applications running on emerging multicores or manycores. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons is a lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where last-level cache misses will occur in each tile, and then identify the set of memory banks that each tile will access. Using this information, we propose two tile scheduling algorithms to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed BLP optimization improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) raises our average improvement in memory access latency to 22.2%.
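
The idea of scheduling tiles by the banks they touch can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes tiles are independent (no inter-tile dependences), that the predicted miss addresses for each tile are already available (standing in for the Cache Miss Equations step), and a hypothetical address-to-bank mapping in which row-sized chunks interleave across banks. The names `bank_of`, `banks_touched`, `schedule_rounds`, and `round_blp` are illustrative only, and the number of distinct banks per round is used as a rough proxy for BLP.

```python
from typing import Dict, List, Set


def bank_of(addr: int, row_bytes: int = 8192, num_banks: int = 8) -> int:
    """Hypothetical mapping: consecutive row-sized chunks interleave across banks."""
    return (addr // row_bytes) % num_banks


def banks_touched(miss_addrs: List[int], row_bytes: int = 8192,
                  num_banks: int = 8) -> Set[int]:
    """Set of banks hit by a tile's predicted last-level cache misses."""
    return {bank_of(a, row_bytes, num_banks) for a in miss_addrs}


def schedule_rounds(tile_banks: Dict[str, Set[int]], cores: int) -> List[List[str]]:
    """Greedy sketch: fill rounds of up to `cores` tiles, each time picking the
    tile that adds the most banks not yet covered in the current round."""
    remaining = dict(tile_banks)
    rounds: List[List[str]] = []
    while remaining:
        chosen: List[str] = []
        covered: Set[int] = set()
        while remaining and len(chosen) < cores:
            # Iterating in sorted order makes ties resolve to the smallest name.
            best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
            covered |= remaining.pop(best)
            chosen.append(best)
        rounds.append(chosen)
    return rounds


def round_blp(round_tiles: List[str], tile_banks: Dict[str, Set[int]]) -> int:
    """Proxy for BLP: number of distinct banks touched by tiles run together."""
    return len(set().union(*(tile_banks[t] for t in round_tiles)))


if __name__ == "__main__":
    tiles = {"T0": {0, 1}, "T1": {0, 1}, "T2": {2, 3}, "T3": {2, 3}}
    for r in schedule_rounds(tiles, cores=2):
        print(r, "distinct banks:", round_blp(r, tiles))
```

With four tiles and two cores, running tiles in their original order would pair `T0` with `T1` (two distinct banks per round), whereas the greedy choice pairs tiles with disjoint bank sets (four distinct banks per round). A real compiler pass would also have to respect data dependences, cache locality, and the actual address mapping of the target memory system.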

Index Terms

Compiler Support for Optimizing Memory Bank-Level Parallelism
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 2014

697 pages

ISBN:9781479969982

General Chair: Krisztian Flautner (ARM)
Program Chairs: Thomas F. Wenisch (University of Michigan), Emre Ozer (ARM)
Publications Chair: Michael Ferdman (Stony Brook University)

Publisher

IEEE Computer Society

United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

MICRO-47

Sponsor:

SIGMICRO

MICRO-47: The 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 13 - 17, 2014

Cambridge, United Kingdom

Acceptance Rates

MICRO-47 Paper Acceptance Rate 53 of 279 submissions, 19%

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

Khadirsharbiyani S, Kotra J, Rao K, Kandemir M (2022). Data Convection. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6:1, pp. 1-25. Online publication date: 28-Feb-2022.
https://dl.acm.org/doi/10.1145/3508027
Tang X, Kandemir M, Karakoy M, Arunachalam M, McKinley K, Fisher K (2019). Co-optimizing memory-level parallelism and cache-level parallelism. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 935-949. Online publication date: 8-Jun-2019.
https://dl.acm.org/doi/10.1145/3314221.3314599
Vijaykumar N, Jain A, Majumdar D, Hsieh K, Pekhimenko G, Ebrahimi E, Hajinazar N, Gibbons P, Mutlu O (2018). A case for richer cross-layer abstractions. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 207-220. Online publication date: 2-Jun-2018.
https://dl.acm.org/doi/10.1109/ISCA.2018.00027
Tang X, Kandemir M, Yedlapalli P, Kotra J, Hsu W, Yang C, Lipasti M, Lee H (2016). Improving bank-level parallelism for irregular applications. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-12. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195708
Kandemir M, Zhao H, Tang X, Karakoy M (2015). Memory Row Reuse Distance and its Role in Optimizing Application Performance. ACM SIGMETRICS Performance Evaluation Review, 43:1, pp. 137-149. Online publication date: 15-Jun-2015.
https://dl.acm.org/doi/10.1145/2796314.2745867
Kandemir M, Zhao H, Tang X, Karakoy M, Lin B, Xu J, Sengupta S, Shah D (2015). Memory Row Reuse Distance and its Role in Optimizing Application Performance. Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 137-149. Online publication date: 15-Jun-2015.
https://dl.acm.org/doi/10.1145/2745844.2745867
