[Colloquium] Fault-tolerant solvers via algorithm/system codesign

January 22, 2013

Watch Colloquium: M4V file (803 MB)

  • Date: Tuesday, January 22, 2013 
  • Time: 11:00 am – 11:50 am
  • Place: Mechanical Engineering 218

Mark Hoemmen
Sandia National Laboratories, USA

Protecting arithmetic and data from corruption due to hardware errors costs energy, yet energy increasingly constrains modern computer hardware, especially the largest parallel computers being built and planned today. As processor counts continue to grow, it will become too expensive to correct all of these “soft errors” at the system level before they reach user code. However, many algorithms need reliability only for certain data and phases of computation, and can be designed to recover from some corruption. This suggests an algorithm/system codesign approach. We will show that if the system gives applications a programming model that lets them apply reliability only when and where it is needed, we can develop “fault-tolerant” algorithms that compute the right answer despite hardware errors in arithmetic or data. We will demonstrate this for a new iterative linear solver we call “Fault-Tolerant GMRES” (FT-GMRES). FT-GMRES uses a system framework we developed that lets solvers control reliability per allocation and provides fault detection. This project has also inspired a fruitful collaboration between numerical algorithms developers and traditional “systems” researchers. The two groups have much to learn from each other, and will have to cooperate more to achieve the promise of exascale.


Bio: Mark Hoemmen is a staff member at Sandia National Laboratories in Albuquerque. He finished his PhD in computer science at the University of California, Berkeley in spring 2010. Mark has a background in numerical linear algebra and performance tuning of scientific codes. He is especially interested in the interaction between algorithms, computer architectures, and computer systems, and in programming models that expose the right details of the latter two to algorithms. He also spends much of his time working on the Trilinos library of (