IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective

Published: 01 September 1999

Abstract

Fault tolerance in IBM S/390 systems during the 1980s and 1990s had three distinct phases, each characterized by a different uptime improvement rate. Early TCM-technology mainframes delivered excellent data integrity, instantaneous error detection, and positive fault isolation, but had limited on-line repair. Later TCM mainframes introduced capabilities for providing a high degree of transparent recovery, failure masking, and on-line repair. New challenges accompanied the introduction of CMOS technology. A significant reduction in parts count greatly improved intrinsic failure rates, but dense packaging disallowed on-line CPU repair. In addition, characteristics of the microprocessor technology posed difficulties for traditional in-line error checking. As a result, system fault-tolerant design, particularly in CPUs and memory, underwent another evolution from G1 to G5. G5 implements an innovative design for a high-performance, fault-tolerant single-chip microprocessor. Dynamic CPU sparing delivers a transparent concurrent repair mechanism. A new internal channel provides a high-performance, highly available Parallel Sysplex in a single mainframe. G5 is both the culmination of decades of innovation and careful implementation, and the highest achievement of S/390 fault-tolerant design.

        Information

        Published In

        IBM Journal of Research and Development  Volume 43, Issue 5
        September 1999
        305 pages

        Publisher

        IBM Corp.

        United States

        Publication History

        Published: 01 September 1999
        Accepted: 27 May 1999
        Received: 17 December 1998

        Qualifiers

        • Article

        Cited By

        • (2024) Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability. ACM Computing Surveys 56:11 (1-76). DOI: 10.1145/3663672. Online publication date: 28-Jun-2024
        • (2021) A Formal Approach to Accountability in Heterogeneous Systems-on-Chip. IEEE Transactions on Dependable and Secure Computing 18:6 (2926-2940). DOI: 10.1109/TDSC.2020.2970417. Online publication date: 9-Nov-2021
        • (2018) Tolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer Exploration for Architecting Resilience). IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37:9 (1839-1852). DOI: 10.1109/TCAD.2017.2752705. Online publication date: 1-Sep-2018
        • (2018) Error correlation prediction in lockstep processors for safety-critical systems. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (737-748). DOI: 10.1109/MICRO.2018.00065. Online publication date: 20-Oct-2018
        • (2017) InCheck. Proceedings of the 54th Annual Design Automation Conference 2017 (1-6). DOI: 10.1145/3061639.3062265. Online publication date: 18-Jun-2017
        • (2016) FlipBack. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.5555/3014904.3014943. Online publication date: 13-Nov-2016
        • (2016) Reverse replication of virtual machines (rRVM) for low latency and high availability services. Proceedings of the 9th International Conference on Utility and Cloud Computing (118-127). DOI: 10.1145/2996890.2996894. Online publication date: 6-Dec-2016
        • (2016) CLEAR. Proceedings of the 53rd Annual Design Automation Conference (1-6). DOI: 10.1145/2897937.2897996. Online publication date: 5-Jun-2016
        • (2016) A Case for Acoustic Wave Detectors for Soft-Errors. IEEE Transactions on Computers 65:1 (5-18). DOI: 10.1109/TC.2015.2419652. Online publication date: 1-Jan-2016
        • (2015) FluidCheck. ACM Transactions on Architecture and Code Optimization 12:4 (1-26). DOI: 10.1145/2842620. Online publication date: 22-Dec-2015
