DOI: 10.1109/MICRO.2014.34

Compiler Support for Optimizing Memory Bank-Level Parallelism

Published: 13 December 2014

Abstract

Many prior compiler-based optimization schemes have focused exclusively on cache data locality. However, cache locality is only one factor in the overall performance of applications running on emerging multicores and manycores. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is a lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where last-level cache misses will occur in each tile, and then identify the set of memory banks that each tile will access. Using this information, we propose two tile scheduling algorithms that maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed optimization improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) raises the average reduction in memory access latency to 22.2%.
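To make the idea of bank-aware tile scheduling concrete, the following is a minimal sketch and not the paper's method. It assumes a hypothetical row-interleaved address-to-bank mapping (`bank_of`), takes the predicted miss addresses of each tile as given (standing in for the Cache Miss Equations step), and uses a simple greedy heuristic (`schedule_tiles`) that groups tiles into rounds so that the tiles running concurrently touch as many distinct banks as possible. All function names, parameters, and the greedy policy are illustrative assumptions; the abstract does not specify the two scheduling algorithms.

```python
def bank_of(addr, num_banks=8, row_bytes=8192):
    """Hypothetical mapping: consecutive row-sized blocks interleave across banks."""
    return (addr // row_bytes) % num_banks


def tile_bank_set(miss_addrs, num_banks=8, row_bytes=8192):
    """Set of banks touched by a tile's predicted last-level cache misses."""
    return frozenset(bank_of(a, num_banks, row_bytes) for a in miss_addrs)


def schedule_tiles(tile_banks, num_cores):
    """Greedy sketch (an assumption, not the paper's algorithm).

    tile_banks maps a tile id to the set of banks that tile accesses.
    Tiles are grouped into rounds of at most num_cores tiles; within a
    round, the next tile chosen is the one adding the most banks not yet
    covered by that round (ties go to the smallest tile id).
    Returns a list of (tile ids in the round, distinct banks covered).
    """
    remaining = dict(tile_banks)
    rounds = []
    while remaining:
        chosen, covered = [], set()
        while remaining and len(chosen) < num_cores:
            best = max(sorted(remaining),
                       key=lambda t: len(remaining[t] - covered))
            chosen.append(best)
            covered |= remaining.pop(best)
        rounds.append((chosen, len(covered)))
    return rounds


# Four tiles on two cores: tiles 0 and 1 share banks, as do tiles 2 and 3.
tiles = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}}
for tile_ids, banks_covered in schedule_tiles(tiles, num_cores=2):
    print(tile_ids, banks_covered)
```

In this toy input, running the tiles in their original order would pair tiles 0 and 1, which compete for the same two banks. The greedy grouping instead pairs tiles with disjoint bank sets, so each round spreads its requests over four banks. The number of distinct banks per round is only a rough proxy for BLP, which depends on which requests are actually in flight at the same time.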

Published In

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
December 2014
697 pages
ISBN: 9781479969982

Publisher

IEEE Computer Society, United States

Qualifiers

• Tutorial
• Research
• Refereed limited

Acceptance Rates

MICRO-47 paper acceptance rate: 53 of 279 submissions, 19%
Overall acceptance rate: 484 of 2,242 submissions, 22%
