Microreboot — A technique for cheap recovery

Authors: Shinichi Kawamoto, Armando Fox

OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Page 3

Published: 06 December 2004

Abstract

A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we use separation of process recovery from data recovery to enable microrebooting - a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application.

We evaluate microrebooting in an Internet auction system running on an application server. Microreboots recover most of the same failures as full reboots, but do so an order of magnitude faster and result in an order of magnitude savings in lost work. This cheap form of recovery engenders a new approach to high availability: microreboots can be employed at the slightest hint of failure, prior to node failover in multi-node clusters, even when mistakes in failure detection are likely; failure and recovery can be masked from end users through transparent call-level retries; and systems can be rejuvenated by parts, without ever being shut down.
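
The recovery policy outlined in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Component` and `Application` classes, the `ComponentFailure` exception, the `curable` flag, and the `call_with_retry` helper are all hypothetical names introduced here. The sketch assumes that a failed call is first answered by microrebooting only the suspect component and retrying the call transparently, and that a full reboot is used only when the microreboot does not help.

```python
class ComponentFailure(Exception):
    """Hypothetical signal that a component could not serve a call."""


class Component:
    """A restartable component; its durable data is assumed to live elsewhere."""

    def __init__(self, name, curable=True):
        self.name = name
        self.curable = curable      # assumption: can a microreboot fix this fault?
        self.healthy = True
        self.microreboots = 0

    def invoke(self, request):
        if not self.healthy:
            raise ComponentFailure(self.name)
        return f"{self.name} handled {request}"

    def microreboot(self):
        # Restart only this component; other components are left undisturbed.
        self.microreboots += 1
        if self.curable:
            self.healthy = True


class Application:
    """Holds the components and applies the escalation policy."""

    def __init__(self, components):
        self.components = {c.name: c for c in components}
        self.full_reboots = 0

    def full_reboot(self):
        # Expensive fallback: every component is restarted.
        self.full_reboots += 1
        for component in self.components.values():
            component.healthy = True

    def call_with_retry(self, name, request, max_microreboots=1):
        component = self.components[name]
        for attempt in range(max_microreboots + 1):
            try:
                return component.invoke(request)
            except ComponentFailure:
                # Cheap recovery first: microreboot, then retry the same call.
                if attempt < max_microreboots:
                    component.microreboot()
        # Microreboots did not cure the failure, so escalate.
        self.full_reboot()
        return component.invoke(request)


if __name__ == "__main__":
    app = Application([Component("bids"), Component("search", curable=False)])
    app.components["bids"].healthy = False
    print(app.call_with_retry("bids", "request-1"))
    print("full reboots so far:", app.full_reboots)
```

In this sketch the caller never sees the failure of the `bids` component, because the retry happens inside `call_with_retry`. A fault that a microreboot cannot cure (modelled here by `curable=False`) is the only case that pays the cost of a full reboot.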

Index Terms

Microreboot — A technique for cheap recovery
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
    2. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software performance
      2. Software reliability

Information

Published In

OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

December 2004, 403 pages

Sponsors

USENIX Assoc

Publisher

USENIX Association, United States

Qualifiers

Article

Bibliometrics

Total Citations: 146
Total Downloads: 70
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Oct 2024

Cited By

Lou C, Huang P, Smith S, Bhagwan R, Porter G (2020). Understanding, detecting and localizing partial failures in large system software. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pp. 559-574. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388284
Lou C, Huang P, Smith S (2019). Comprehensive and Efficient Runtime Checking in System Software through Watchdogs. Proceedings of the Workshop on Hot Topics in Operating Systems, pp. 51-57. Online publication date: 13-May-2019.
https://dl.acm.org/doi/10.1145/3317550.3321440
Doddamani S, Sinha P, Lu H, Cheng T, Bagdi H, Gopalan K, Sartor J, Naik M, Rossbach C (2019). Fast and live hypervisor replacement. Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 45-58. Online publication date: 14-Apr-2019.
https://dl.acm.org/doi/10.1145/3313808.3313821
Veeraraghavan K, Meza J, Michelson S, Panneerselvam S, Gyori A, Chou D, Margulis S, Obenshain D, Padmanabha S, Shah A, Song Y, Xu T, Arpaci-Dusseau A, Voelker G (2018). Maelstrom. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pp. 373-389. Online publication date: 8-Oct-2018.
https://dl.acm.org/doi/10.5555/3291168.3291196
Ertel S, Goens A, Adam J, Castrillon J, Dubach C, Xue J (2018). Compiling for concise code and efficient I/O. Proceedings of the 27th International Conference on Compiler Construction, pp. 104-115. Online publication date: 24-Feb-2018.
https://dl.acm.org/doi/10.1145/3178372.3179505
Monperrus M (2018). Automatic Software Repair. ACM Computing Surveys 51:1, pp. 1-24. Online publication date: 23-Jan-2018.
https://dl.acm.org/doi/10.1145/3105906
Abdi F, Chen C, Hasan M, Liu S, Mohan S, Caccamo M, Gill C, Sinopoli B, Liu X, Tabuada P (2018). Guaranteed physical security with restart-based design for cyber-physical systems. Proceedings of the 9th ACM/IEEE International Conference on Cyber-Physical Systems, pp. 10-21. Online publication date: 11-Apr-2018.
https://dl.acm.org/doi/10.1109/ICCPS.2018.00010
Nielebock S, Rosu G, Di Penta M, Nguyen T (2017). Towards API-specific automatic program repair. Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 1010-1013. Online publication date: 30-Oct-2017.
https://dl.acm.org/doi/10.5555/3155562.3155697
Boos K, Vecchio E, Zhong L (2017). A Characterization of State Spill in Modern Operating Systems. Proceedings of the Twelfth European Conference on Computer Systems, pp. 389-404. Online publication date: 23-Apr-2017.
https://dl.acm.org/doi/10.1145/3064176.3064205
Setty S, Su C, Lorch J, Zhou L, Chen H, Patel P, Ren J, Keeton K, Roscoe T (2016). Realizing the fault-tolerance promise of cloud storage using locks with intent. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, pp. 501-516. Online publication date: 2-Nov-2016.
https://dl.acm.org/doi/10.5555/3026877.3026916
