
[Colloquium] Fault Resilience in Exascale Systems

October 8, 2010

Watch Colloquium: M4V file (704 MB)

  • Date: Friday, October 8, 2010 
  • Time: 12:00 noon to 12:50 pm
  • Place: Centennial Engineering Center, Room 1041

Rolf Riesen, Ph.D., Principal Member of Technical Staff, Scalable Computing Systems, Sandia National Laboratories

Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost.
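To make the scaling argument concrete, here is a rough sketch of a first-order checkpoint/restart cost model. It is not the analysis from the talk: it assumes independent, exponentially distributed node failures and uses Young's well-known approximation for the checkpoint interval. All function names and the parameter values in the usage loop (node MTBF, checkpoint time, restart time) are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole machine, assuming independent node failures."""
    return node_mtbf_hours / nodes


def young_interval_hours(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def useful_fraction(checkpoint_hours: float, restart_hours: float,
                    mtbf_hours: float) -> float:
    """Rough estimate of the fraction of wall-clock time spent on useful work.

    Waste has two parts: the time spent writing checkpoints, and, per
    failure, the restart time plus (on average) half a cycle of lost work.
    This is a crude model; it is clamped at zero when the waste exceeds 100%.
    """
    tau = young_interval_hours(checkpoint_hours, mtbf_hours)
    cycle = tau + checkpoint_hours
    checkpoint_waste = checkpoint_hours / cycle
    failure_waste = (restart_hours + cycle / 2.0) / mtbf_hours
    return max(0.0, 1.0 - checkpoint_waste - failure_waste)


if __name__ == "__main__":
    # Hypothetical inputs: 5-year node MTBF, 5-minute checkpoint, 10-minute restart.
    node_mtbf = 5 * 365 * 24.0
    for nodes in (1_000, 10_000, 50_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        frac = useful_fraction(5 / 60, 10 / 60, mtbf)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:6.2f} h, useful work {frac:.0%}")
```

The point of the sketch is the trend rather than the exact numbers: as the node count grows, the system MTBF shrinks in proportion, and the useful fraction of run time falls with it.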

Redundant computing is a method to allow an application to continue working even when failures occur. Instead of each failure interrupting the application, which loses work and requires restart time, multiple failures can be absorbed by the application until redundancy is exhausted. In this talk I will present a method to analyze the benefits of redundant computing, present simulation results of the cost, and discuss a prototype MPI implementation.
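The idea of absorbing failures until redundancy is exhausted can be illustrated with a small Monte Carlo sketch. This is not the analysis method, simulator, or MPI prototype from the talk. It assumes that every rank runs on a fixed number of replica nodes, that nodes fail one at a time in uniformly random order with no repair, and that the application is interrupted only when some rank has lost all of its replicas. The function names are hypothetical.

```python
import random


def failures_until_interrupt(ranks: int, replicas: int, rng: random.Random) -> int:
    """Count node failures until some rank has lost all of its replicas."""
    alive = [replicas] * ranks
    # One entry per node, labelled with the rank it serves.
    nodes = [rank for rank in range(ranks) for _ in range(replicas)]
    rng.shuffle(nodes)  # random order in which nodes fail (no repair)
    for count, rank in enumerate(nodes, start=1):
        alive[rank] -= 1
        if alive[rank] == 0:
            return count  # redundancy exhausted for this rank
    raise ValueError("ranks and replicas must both be at least 1")


def mean_failures_absorbed(ranks: int, replicas: int,
                           trials: int = 200, seed: int = 0) -> float:
    """Average number of node failures per application interrupt."""
    rng = random.Random(seed)
    total = sum(failures_until_interrupt(ranks, replicas, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        mean = mean_failures_absorbed(ranks=10_000, replicas=replicas)
        print(f"{replicas} replica(s) per rank: {mean:.1f} failures per interrupt")
```

With one replica per rank, every failure is an interrupt. With two or more, an interrupt requires all replicas of the same rank to fail, so many failures are typically absorbed first, at the price of the extra nodes.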

Bio: Rolf Riesen grew up in Switzerland, where he learned electronics. He became interested in software because he was tired of burning his fingers on a soldering iron, and he went on to earn a master’s and a Ph.D. in computer science from the University of New Mexico (UNM). His advisor was Barney Maccabe, who now leads the Computer Science and Mathematics (CSM) Division at Oak Ridge National Laboratory.

In 1990, while he was a research assistant at UNM, he started working with a group at Sandia and, after finishing his master’s, was hired as a member of the technical staff in 1993. Throughout this time he designed, implemented, and debugged various pieces of system software, starting with SUNMOS on the nCUBE 2 and Puma on the Intel Paragon. He created his own cluster, Cplant, before large clusters were common, and was involved in the Puma successors: Jaguar, Cougar, and Catamount for the Intel ASCI Red machine and the Cray XT3 Red Storm.