DOI: 10.5555/1251254.1251257

Microreboot — A technique for cheap recovery

Published: 06 December 2004

Abstract

A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting, a fine-grained technique for surgically recovering faulty application components without disturbing the rest of the application.
We evaluate microrebooting in an Internet auction system running on an application server. Microreboots recover most of the same failures as full reboots, but do so an order of magnitude faster and result in an order of magnitude savings in lost work. This cheap form of recovery engenders a new approach to high availability: microreboots can be employed at the slightest hint of failure, prior to node failover in multi-node clusters, even when mistakes in failure detection are likely; failure and recovery can be masked from end users through transparent call-level retries; and systems can be rejuvenated by parts, without ever being shut down.
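
The recovery policy described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation; every name here (`Component`, `Application`, `ComponentFailure`, `bid_handler`) is hypothetical, and a plain dictionary stands in for the data store that lives outside the components. The sketch assumes a component keeps only discardable in-process state, so that rebooting it never touches application data, and that a failed call can safely be retried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class ComponentFailure(Exception):
    """Raised when a call into a component fails (hypothetical signal)."""


@dataclass
class Component:
    """Hypothetical component that holds only discardable process state."""
    name: str
    handler: Callable[..., str]          # handler(component, store, request)
    cache: dict = field(default_factory=dict)
    reboots: int = 0

    def microreboot(self) -> None:
        # Process recovery only: drop in-memory state, leave the store alone.
        self.cache.clear()
        self.reboots += 1


class Application:
    def __init__(self, store: dict):
        self.store = store               # stand-in for data kept outside components
        self.components: Dict[str, Component] = {}
        self.full_reboots = 0

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def full_reboot(self) -> None:
        # Coarse recovery: every component loses its process state.
        for component in self.components.values():
            component.microreboot()
        self.full_reboots += 1

    def call(self, name: str, request: str, retries: int = 1) -> str:
        component = self.components[name]
        # Try the cheap, targeted recovery first and retry the call,
        # so the caller never observes the failure.
        for _ in range(retries):
            try:
                return component.handler(component, self.store, request)
            except ComponentFailure:
                component.microreboot()
        # Microreboots exhausted: one more attempt, then escalate.
        try:
            return component.handler(component, self.store, request)
        except ComponentFailure:
            self.full_reboot()
            return component.handler(component, self.store, request)


def bid_handler(component: Component, store: dict, request: str) -> str:
    # Fails before touching the store while its in-memory state is corrupt.
    if component.cache.get("corrupt"):
        raise ComponentFailure(component.name)
    store["bids"] = store.get("bids", 0) + 1
    return f"bid accepted: {request}"


if __name__ == "__main__":
    app = Application(store={})
    bids = Component("bids", bid_handler)
    app.add(bids)
    bids.cache["corrupt"] = True         # inject a fault into process state
    print(app.call("bids", "item-7"), bids.reboots, app.full_reboots)
```

The retry inside `call` is only safe here because the hypothetical handler fails before it modifies the store; a real system would need requests to be idempotent or otherwise safe to replay.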

Published In

OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
December 2004
403 pages

Sponsors

  • USENIX Assoc

Publisher

USENIX Association

United States

Qualifiers

  • Article
