DOI: 10.1109/MICRO56248.2022.00029
Research article

Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources

Published: 18 December 2023

Abstract

Graphics Processing Units (GPUs) are widely used accelerators for data-parallel applications. In many GPU applications, memory bandwidth bottlenecks performance and leaves GPU cores underutilized; hence, disabling many cores does not affect the performance of memory-bound workloads. While simply power-gating the unused cores would save energy, prior works instead try to utilize them for other (ideally compute-bound) applications, which increases the GPU's total throughput.
In this paper, we introduce Morpheus, a new hardware/software co-designed technique that boosts the performance of memory-bound applications. The key idea of Morpheus is to exploit unused core resources to extend the GPU last-level cache (LLC) capacity. In Morpheus, each GPU core has two execution modes: compute mode and cache mode. Cores in compute mode operate conventionally and run application threads. For cores in cache mode, Morpheus invokes a software helper kernel that uses the cores' on-chip memories (i.e., register file, shared memory, and L1 cache) to extend the LLC capacity for a running memory-bound workload. Morpheus adds a controller to the GPU hardware that forwards LLC requests either to the conventional LLC (managed by hardware) or to the extended LLC (managed by the helper kernel). Our experimental results show that Morpheus improves the performance and energy efficiency of a baseline GPU architecture by an average of 39% and 58%, respectively, across several memory-bound workloads. Morpheus' performance is within 3% of a GPU design with a conventional LLC four times as large. Morpheus can thus reduce the hardware dedicated to a conventional LLC by exploiting idle cores' on-chip memory resources as additional cache capacity.
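The abstract does not include the helper-kernel code, but the mechanism it describes maps naturally onto the well-known persistent-kernel pattern. The CUDA sketch below is purely illustrative, not the paper's implementation: all names (CacheRequest, helper_cache_kernel, NUM_SETS, QUEUE_LEN) are hypothetical, a shared-memory tag store stands in for the full register-file/shared-memory/L1 storage, and a software request queue stands in for the hardware controller that forwards LLC requests in the real design.

// Minimal, hypothetical sketch of a "cache mode" core (assumed names; not
// the paper's code): a persistent kernel keeps a direct-mapped tag store in
// shared memory and services lookup requests posted to a shared queue.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

constexpr int NUM_SETS  = 256;  // tag-store entries per core (assumed)
constexpr int QUEUE_LEN = 64;   // pending-request slots (assumed)

struct CacheRequest {
    unsigned long long tag;  // line-aligned address tag to probe
    int valid;               // 0 = free, 1 = posted, 2 = claimed by a thread
    int hit;                 // result written by the helper kernel
};

__global__ void helper_cache_kernel(CacheRequest* queue, volatile int* stop)
{
    // Shared memory plays the role of the extended-LLC tag array.
    __shared__ unsigned long long tags[NUM_SETS];
    for (int i = threadIdx.x; i < NUM_SETS; i += blockDim.x)
        tags[i] = ~0ull;  // all sets start invalid
    __syncthreads();

    // Persistent loop: poll the queue until the host signals stop.
    while (!*stop) {
        for (int i = threadIdx.x; i < QUEUE_LEN; i += blockDim.x) {
            if (atomicCAS(&queue[i].valid, 1, 2) == 1) {  // claim slot i
                unsigned long long t = queue[i].tag;
                int set = (int)(t % NUM_SETS);   // direct-mapped index
                queue[i].hit = (tags[set] == t); // probe the tag store
                tags[set] = t;                   // fill on miss (a real
                                                 // design would lock the set)
                __threadfence_system();          // publish hit before release
                queue[i].valid = 0;              // mark the request serviced
            }
        }
    }
}

int main()
{
    // Mapped pinned memory lets host and device share the queue while the
    // persistent kernel runs (assumes a 64-bit system with unified virtual
    // addressing, so host pointers are valid on the device).
    CacheRequest* q = nullptr;
    int* stop = nullptr;
    cudaHostAlloc((void**)&q, QUEUE_LEN * sizeof(CacheRequest),
                  cudaHostAllocMapped);
    cudaHostAlloc((void**)&stop, sizeof(int), cudaHostAllocMapped);
    memset(q, 0, QUEUE_LEN * sizeof(CacheRequest));
    *stop = 0;

    helper_cache_kernel<<<1, 128>>>(q, stop);

    q[0].tag = 42;                                 // post one lookup
    *(volatile int*)&q[0].valid = 1;
    while (*(volatile int*)&q[0].valid != 0) { }   // spin until serviced
    printf("serviced: hit=%d (cold miss expected)\n",
           *(volatile int*)&q[0].hit);

    *stop = 1;                                     // stop the helper kernel
    cudaDeviceSynchronize();
    cudaFreeHost(q);
    cudaFreeHost(stop);
    return 0;
}

In the actual design, the hardware controller forwards requests from the conventional LLC path to cache-mode cores; the host-side polling above only demonstrates the request/response handshake a cache-mode core would service.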


Published In

MICRO '22: Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture
October 2022, 1498 pages
ISBN: 9781665462723
Publisher: IEEE Press
