Research article. DOI: 10.1145/2925426.2926295

Mini-Ckpts: Surviving OS Failures in Persistent Memory

Published: 01 June 2016

Abstract

Concern is growing in the high-performance computing (HPC) community about the reliability of future extreme-scale systems. Current efforts have focused on application fault tolerance rather than the operating system (OS), even though recent studies suggest that failures in OS memory may be more likely. The OS is critical to the correct and efficient operation of a node and the processes it governs, and because HPC applications are parallel and tightly coupled through communication, any single node failure generally forces all processes of the application to terminate. A robust system therefore requires an OS that can itself tolerate failures. In this work, we introduce mini-ckpts, a framework that enables an application to survive a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds, with a failure-free overhead of 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and more dynamic method of fault tolerance than the current coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.

Cited By

  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications, 36:2 (251-285). DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
  • (2022) Software approaches for resilience of high performance computing systems: a survey. Frontiers of Computer Science, 17:4. DOI: 10.1007/s11704-022-2096-3. Online publication date: 12-Dec-2022
  • (2017) Resilience Design Patterns. Supercomputing Frontiers and Innovations: an International Journal, 4:3 (4-42). DOI: 10.14529/jsfi170301. Online publication date: 15-Sep-2017

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 0
Reflects downloads up to 16 Oct 2024.
