DOI: 10.1109/MICRO.2005.8

A Mechanism for Online Diagnosis of Hard Faults in Microprocessors

Published: 12 November 2005

Abstract

We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field deconfigurable unit (FDU) granularity, and deconfigure FDUs with hard faults. In our reliable microprocessor design, we use DIVA dynamic verification to detect and correct errors. Our new scheme for diagnosing hard faults tracks instructions' core structure occupancy from decode until commit. If a DIVA checker detects an error in an instruction, it increments a small saturating error counter for every FDU used by that instruction, including that DIVA checker. A hard fault in an FDU quickly leads to an above-threshold error counter for that FDU and thus diagnoses the fault. For deconfiguration, we use previously developed schemes for functional units and buffers, and we present a scheme for deconfiguring DIVA checkers. Experimental results show that our reliable microprocessor quickly and accurately diagnoses each hard fault that is injected and continues to function, albeit with somewhat degraded performance.
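The diagnosis step described in the abstract can be illustrated with a minimal software sketch. It is not the authors' hardware design: the counter width (`COUNTER_MAX`), the threshold value (`THRESHOLD`), the FDU names, and the `Diagnoser` class are all hypothetical choices made for illustration. The sketch only models the stated idea that every FDU used by an erroneous instruction, including the checker, has its saturating counter incremented, and that an above-threshold counter diagnoses a hard fault.

```python
# Hypothetical parameters, not taken from the paper.
COUNTER_MAX = 15  # assumed 4-bit saturating counter
THRESHOLD = 8     # assumed diagnosis threshold


class Diagnoser:
    """Toy model of per-FDU saturating error counters."""

    def __init__(self):
        self.counters = {}         # FDU name -> error count
        self.deconfigured = set()  # FDUs diagnosed as having a hard fault

    def on_commit(self, fdus_used, error_detected):
        """Process one instruction at commit.

        fdus_used: names of the FDUs the instruction occupied between
        decode and commit, including the checker that verified it.
        Returns the FDUs newly diagnosed by this instruction.
        """
        newly_diagnosed = []
        if not error_detected:
            return newly_diagnosed
        for fdu in fdus_used:
            # Saturating increment: the counter never exceeds COUNTER_MAX.
            count = min(self.counters.get(fdu, 0) + 1, COUNTER_MAX)
            self.counters[fdu] = count
            if count > THRESHOLD and fdu not in self.deconfigured:
                self.deconfigured.add(fdu)
                newly_diagnosed.append(fdu)
        return newly_diagnosed


def simulate():
    """Inject a hard fault into the hypothetical unit 'alu1'."""
    diagnoser = Diagnoser()
    diagnosed = []
    for i in range(20):
        alu = f"alu{i % 2}"
        fdus = [alu, f"rob{i % 8}", f"checker{i % 4}"]
        error = alu == "alu1"  # every instruction using alu1 fails
        diagnosed += diagnoser.on_commit(fdus, error)
    return diagnoser, diagnosed


if __name__ == "__main__":
    diagnoser, diagnosed = simulate()
    print(diagnosed, diagnoser.counters)
```

In this toy run the faulty unit appears in every erroneous instruction, while the other FDUs those instructions touch vary, so only the faulty unit's counter crosses the threshold. That is the intuition behind blaming every FDU an erroneous instruction used.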

Published In

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
November 2005
350 pages
ISBN: 0769524400

Publisher

IEEE Computer Society, United States

Acceptance Rates

MICRO 38 Paper Acceptance Rate: 29 of 147 submissions, 20%
Overall Acceptance Rate: 484 of 2,242 submissions, 22%
