DOI: 10.5555/2388996.2389102
research-article

Detection and correction of silent data corruption for large-scale high-performance computing

Published: 10 November 2012

Abstract

Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results.
This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications, and investigates the challenges inherent in detecting such errors by providing transparent MPI redundancy. Assuming a model in which corruption in application data manifests itself as differing MPI messages between replicas, we study the protocols best suited to detecting and correcting corrupted MPI messages.
Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption that in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
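
The detection model described in the abstract, where corruption shows up as differing messages between replicas, can be illustrated with a small sketch. This is not the paper's protocol or code: it assumes each logical message is available from every replica as a byte string, compares SHA-256 digests, and applies simple majority voting. The names `digest`, `vote`, and `flip_bit` are hypothetical helpers introduced only for illustration.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence


def digest(payload: bytes) -> bytes:
    """Hash a message payload so replicas can be compared cheaply."""
    return hashlib.sha256(payload).digest()


def vote(replica_payloads: Sequence[bytes]) -> Optional[bytes]:
    """Return the payload agreed on by a strict majority of replicas.

    Returns None when no strict majority exists: corruption is detected
    but cannot be corrected (always the case for a mismatch between
    only two replicas).
    """
    if not replica_payloads:
        raise ValueError("at least one replica payload is required")
    counts = Counter(digest(p) for p in replica_payloads)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(replica_payloads):
        return None
    # Hand back the first payload whose digest matches the majority.
    return next(p for p in replica_payloads if digest(p) == winner)


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Hypothetical fault injector: flip a single bit of the payload."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

With three replicas, `vote([good, flip_bit(good, 5), good])` returns `good`, so a single corrupted copy is both detected and corrected. With two replicas, a mismatch yields `None`: the error is detected, but there is no majority to correct it.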


Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2012
1161 pages
ISBN:9781467308045

Publisher

IEEE Computer Society Press

Washington, DC, United States

Qualifiers

  • Research-article

Conference

SC '12

Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2022) Communication-efficient distributed eigenspace estimation with arbitrary node failures. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 18197-18210. DOI: 10.5555/3600270.3601593. Online publication date: 28-Nov-2022
  • (2021) Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction. International Journal of High Performance Computing Applications 35:4, pp. 285-311. DOI: 10.1177/1094342021990433. Online publication date: 1-Jul-2021
  • (2021) BAASH. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-18. DOI: 10.1145/3458817.3476155. Online publication date: 14-Nov-2021
  • (2020) Functional Faults. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pp. 453-463. DOI: 10.1145/3350755.3400261. Online publication date: 6-Jul-2020
  • (2019) Bi-Source Verification Against Silent Data Corruption in High Performance Computing. Proceedings of the 9th Balkan Conference on Informatics, pp. 1-4. DOI: 10.1145/3351556.3351567. Online publication date: 26-Sep-2019
  • (2019) GreenMM. Proceedings of the ACM International Conference on Supercomputing, pp. 308-318. DOI: 10.1145/3330345.3330373. Online publication date: 26-Jun-2019
  • (2019) Dual-Page Checkpointing. ACM Transactions on Architecture and Code Optimization 15:4, pp. 1-27. DOI: 10.1145/3291057. Online publication date: 8-Jan-2019
  • (2018) Resilient co-scheduling of malleable applications. International Journal of High Performance Computing Applications 32:1, pp. 89-103. DOI: 10.1177/1094342017704979. Online publication date: 1-Jan-2018
  • (2018) Soft fault detection and correction for multigrid. International Journal of High Performance Computing Applications 32:6, pp. 897-912. DOI: 10.1177/1094342016684006. Online publication date: 1-Nov-2018
  • (2018) Programmer-guided reliability for extreme-scale applications. International Journal of High Performance Computing Applications 32:5, pp. 598-612. DOI: 10.1177/1094342016667625. Online publication date: 1-Sep-2018
