Detection and correction of silent data corruption for large-scale high-performance computing

Authors:

Christian Engelmann,

Ron Brightwell

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 78, Pages 1 - 12

Published: 10 November 2012

Abstract

Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results.

This paper studies the potential of redundancy to detect and correct soft errors in MPI message-passing applications, and investigates the challenges inherent in detecting such errors when MPI redundancy is provided transparently. Assuming a model in which corruption of application data manifests itself as differing MPI messages between replicas, we study the protocols best suited to detecting and correcting corrupted MPI messages.
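The replica-comparison model described above can be illustrated with a minimal sketch. It is not the paper's protocol: the function name `vote`, the use of SHA-256 digests, and the strict-majority rule are assumptions chosen for illustration. The idea it shows is that two replicas suffice to detect a mismatch, while three or more allow the corrupted copy to be outvoted and corrected.

```python
import hashlib
from collections import Counter


def vote(replica_payloads):
    """Majority vote over the copies of one message sent by each replica.

    Hypothetical helper, not taken from the paper. Returns
    (agreed_payload, corrupted_indices). Raises ValueError when the copies
    disagree and no strict majority exists, i.e. corruption is detected
    but cannot be corrected.
    """
    # Compare digests rather than full payloads (an illustrative choice).
    digests = [hashlib.sha256(p).digest() for p in replica_payloads]
    winner, count = Counter(digests).most_common(1)[0]
    if count * 2 <= len(replica_payloads):
        raise ValueError("mismatch detected, no majority to correct it")
    good = replica_payloads[digests.index(winner)]
    bad = [i for i, d in enumerate(digests) if d != winner]
    return good, bad


# Three replicas, one corrupted copy: the majority corrects it.
payload, corrupted = vote([b"abc", b"abX", b"abc"])
print(payload, corrupted)
```

With only two replicas that disagree, `vote` raises `ValueError`, which mirrors the difference between detection alone and detection plus correction.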

Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption which in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
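As a toy illustration of how a single error can cascade, the sketch below flips one bit in one process's value and then counts how many processes hold a wrong value after a few neighbor exchanges. It is not the paper's fault injector or any of its benchmark applications: the ring-averaging "application" and the names `flip_bit`, `ring_average`, and `corrupted_ranks` are assumptions made for this example.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of a float64 and return the result."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def ring_average(values, steps):
    """Each step, every rank takes the mean of itself and its two ring neighbors."""
    n = len(values)
    for _ in range(steps):
        values = [
            (values[i - 1] + values[i] + values[(i + 1) % n]) / 3.0
            for i in range(n)
        ]
    return values


def corrupted_ranks(n, steps, victim=0, bit=52):
    """Ranks whose value differs from a fault-free run after `steps` exchanges."""
    clean = ring_average([1.0] * n, steps)
    faulty_start = [1.0] * n
    faulty_start[victim] = flip_bit(1.0, bit)  # single injected bit flip
    faulty = ring_average(faulty_start, steps)
    return [i for i in range(n) if clean[i] != faulty[i]]


print(corrupted_ranks(8, 1))  # the victim and its two neighbors
print(corrupted_ranks(8, 4))  # every rank in the ring
```

In this toy setting the corruption reaches one more neighbor on each side per exchange, so an 8-rank ring is fully corrupted after four steps.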

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2012

1161 pages

ISBN: 9781467308045

General Chair:
Jeffrey K. Hollingsworth
University of Maryland

Publisher

IEEE Computer Society Press

Washington, DC, United States

Qualifiers

Research-article

Conference

SC '12

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC '12: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2012

Salt Lake City, Utah

Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
