DOI:10.1145/1555349.1555372
research-article

DRAM errors in the wild: a large-scale field study

Published: 15 June 2009

Abstract

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.
The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, contrary to common fears, we observe no indication that newer generations of DIMMs have worse error behavior.
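To give a feel for the scale of the quoted rate, the arithmetic below converts "errors per billion device hours per Mbit" into an expected number of errors per DIMM per year. This is a back-of-the-envelope sketch only: the 1 GiB DIMM capacity, the assumption that errors scale linearly with capacity and time, and the function name `errors_per_dimm_year` are illustrative choices, not taken from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def errors_per_dimm_year(rate_per_mbit, capacity_mbit):
    """Expected errors per year for one DIMM.

    rate_per_mbit: errors per billion device hours per Mbit (unit used in the abstract).
    capacity_mbit: DIMM capacity in Mbit (hypothetical; chosen by the caller).
    Assumes errors scale linearly with capacity and time.
    """
    errors_per_hour = rate_per_mbit / 1e9 * capacity_mbit
    return errors_per_hour * HOURS_PER_YEAR


# Assumed example capacity: 1 GiB = 8 * 1024 Mbit (binary units).
ONE_GIB_IN_MBIT = 8 * 1024

low = errors_per_dimm_year(25_000, ONE_GIB_IN_MBIT)
high = errors_per_dimm_year(70_000, ONE_GIB_IN_MBIT)
print(f"{low:.0f} to {high:.0f} errors per DIMM-year")
# prints "1794 to 5023 errors per DIMM-year"
```

These figures are averages under the stated assumptions. The abstract's separate figure of roughly 8% of DIMMs affected per year suggests that errors are concentrated on a minority of DIMMs rather than spread evenly, so the average should not be read as the experience of a typical DIMM.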

Published In

SIGMETRICS '09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
June 2009
336 pages
ISBN:9781605585116
DOI:10.1145/1555349
  • ACM SIGMETRICS Performance Evaluation Review, Volume 37, Issue 1
    SIGMETRICS '09
    June 2009
    320 pages
    ISSN:0163-5999
    DOI:10.1145/2492101

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data corruption
  2. dimm
  3. dram
  4. dram reliability
  5. ecc
  6. empirical study
  7. hard error
  8. large-scale systems
  9. memory
  10. soft error

Qualifiers

  • Research-article

Conference

SIGMETRICS09

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%

Cited By

  • (2024) Shadow Filesystems: Recovering from Filesystem Runtime Errors via Robust Alternative Execution. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems. DOI:10.1145/3655038.3665942 (15-22). Online publication date: 8-Jul-2024
  • (2024) Supports for Testing Memory Error Handling Code of In-memory Key Value Stores. 2024 19th European Dependable Computing Conference (EDCC). DOI:10.1109/EDCC61798.2024.00020 (41-48). Online publication date: 8-Apr-2024
  • (2024) Investigating Memory Failure Prediction Across CPU Architectures. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S). DOI:10.1109/DSN-S60304.2024.00033 (88-95). Online publication date: 24-Jun-2024
  • (2024) Impact of variation of the cooling system operating strategy on energy efficiency and waste heat quality: a preliminary investigation on a hybrid-cooled data centre. Journal of Physics: Conference Series. DOI:10.1088/1742-6596/2766/1/012057. 2766:1 (012057). Online publication date: 1-May-2024
  • (2024) A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning. Journal of Parallel and Distributed Computing. DOI:10.1016/j.jpdc.2024.104879. 190:C. Online publication date: 1-Aug-2024
  • (2023) HashTag. Proceedings of the 32nd USENIX Conference on Security Symposium. DOI:10.5555/3620237.3620394 (2797-2814). Online publication date: 9-Aug-2023
  • (2023) PERSEUS. Proceedings of the 21st USENIX Conference on File and Storage Technologies. DOI:10.5555/3585938.3585942 (49-63). Online publication date: 21-Feb-2023
  • (2023) From Missteps to Milestones: A Journey to Practical Fail-Slow Detection. ACM Transactions on Storage. DOI:10.1145/3617690. 19:4 (1-28). Online publication date: 1-Nov-2023
  • (2023) Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). DOI:10.1109/ICCAD57390.2023.10323692 (01-09). Online publication date: 28-Oct-2023
  • (2023) A Systematic Study of DDR4 DRAM Faults in the Field. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). DOI:10.1109/HPCA56546.2023.10071066 (991-1002). Online publication date: Feb-2023
