DOI: 10.1109/SC.2010.28
Article

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Published: 13 November 2010

Abstract

Scaling computations on emerging massive-core supercomputers is a daunting task, and the significantly lagging system I/O capabilities further degrade applications' end-to-end performance. The I/O bottleneck often negates the potential performance benefits of assigning additional compute cores to an application. In this paper, we address this issue via a novel functional partitioning (FP) runtime environment that allocates cores to specific application tasks (checkpointing, de-duplication, and scientific data format transformation) so that the deluge of cores can be brought to bear on the entire gamut of application activities. The focus is on using the extra cores to support HPC application I/O activities and on leveraging solid-state disks in this context. For example, our evaluation shows that dedicating 1 core on an oct-core machine to checkpointing and its assist tasks using FP can improve the overall execution time of a FLASH benchmark on 80 and 160 cores by 43.95% and 41.34%, respectively.
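
The idea of setting aside a core for I/O assist tasks can be illustrated with a minimal sketch. This is not the paper's FP runtime: it assumes a single helper thread standing in for the dedicated core, a toy integer state standing in for simulation data, and hypothetical names (`checkpoint_worker`, `run_simulation`). The compute loop hands checkpoints to the helper through a queue and continues without waiting; the helper de-duplicates identical checkpoints by content hash.

```python
import hashlib
import queue
import threading


def checkpoint_worker(requests, blocks, index):
    """Helper task: drain checkpoint requests and store each distinct payload once."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: the compute loop has finished
            return
        step, payload = item
        digest = hashlib.sha256(payload).hexdigest()
        blocks.setdefault(digest, payload)  # de-duplication by content hash
        index[step] = digest                # remember which block each step maps to


def run_simulation(steps, checkpoint_every):
    """Toy compute loop that offloads checkpoints to the helper thread."""
    requests = queue.Queue()
    blocks, index = {}, {}
    helper = threading.Thread(
        target=checkpoint_worker, args=(requests, blocks, index)
    )
    helper.start()

    state = 0
    for step in range(1, steps + 1):
        state = (state + step) % 7  # stand-in for real computation (assumption)
        if step % checkpoint_every == 0:
            # Hand the checkpoint off and keep computing without waiting for I/O.
            requests.put((step, str(state).encode()))

    requests.put(None)  # tell the helper to stop
    helper.join()
    return blocks, index


if __name__ == "__main__":
    blocks, index = run_simulation(steps=20, checkpoint_every=5)
    print(len(index), "checkpoints,", len(blocks), "distinct blocks stored")
```

A Python thread is not pinned to a core; on Linux, binding the helper to a specific core could be done with `os.sched_setaffinity`, which is omitted here to keep the sketch portable.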


Published In

SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
November 2010
634 pages
ISBN: 9781424475599

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Conference

SC '10

Acceptance Rates

SC '10 Paper Acceptance Rate 51 of 253 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2023) Autonomic Orchestration of in-situ and In-Transit Data Analytics for Simulation Studies. Proceedings of the Winter Simulation Conference, pp. 781-792. DOI: 10.5555/3643142.3643207. Online publication date: 10-Dec-2023
  • (2020) Scalable Coordination of Hierarchical Parallelism. Proceedings of the 49th International Conference on Parallel Processing, pp. 1-11. DOI: 10.1145/3404397.3404398. Online publication date: 17-Aug-2020
  • (2019) Modeling high-throughput applications for in situ analytics. International Journal of High Performance Computing Applications, 33:6, pp. 1185-1200. DOI: 10.1177/1094342019847263. Online publication date: 1-Nov-2019
  • (2019) UMR-EC. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp. 219-230. DOI: 10.1145/3307681.3325406. Online publication date: 17-Jun-2019
  • (2018) Topology-aware space-shared co-analysis of large-scale molecular dynamics simulations. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.5555/3291656.3291688. Online publication date: 11-Nov-2018
  • (2018) Scaling embedded in-situ indexing with deltaFS. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.5555/3291656.3291660. Online publication date: 11-Nov-2018
  • (2018) Topology-aware space-shared co-analysis of large-scale molecular dynamics simulations. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.1109/SC.2018.00027. Online publication date: 11-Nov-2018
  • (2018) Scaling embedded in-situ indexing with deltaFS. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.1109/SC.2018.00006. Online publication date: 11-Nov-2018
  • (2017) Supporting Fault-Tolerance in Presence of In-Situ Analytics. Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 304-313. DOI: 10.5555/3101112.3101155. Online publication date: 14-May-2017
  • (2016) To waffinity and beyond. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, pp. 419-434. DOI: 10.5555/3026877.3026910. Online publication date: 2-Nov-2016
