Managing Application Resilience: A Programming Language Approach
Pedro Diniz
USC Information Sciences Institute
Tuesday, July 21, 2015 15:00-16:00,
System resilience is an important challenge that needs to be addressed in the era of extreme scale computing. High-performance computing systems will be architected using millions of processor cores and memory modules. As process technology scales, the reliability of such systems will be challenged by the inherent unreliability of individual components due to extremely small transistor geometries, variability in silicon manufacturing processes, device aging, etc. Therefore, errors and failures in extreme scale systems will increasingly be the norm rather than the exception. Not all the errors detected warrant catastrophic system failure, but there are presently no mechanisms for the programmer to communicate their knowledge of algorithmic fault tolerance to the system.
In this talk we present a programming model approach for system resilience that allows programmers to explicitly express their fault tolerance knowledge. We propose novel resilience-oriented programming model extensions and programming directives, and illustrate their effectiveness. An inference engine leverages this information and combines it with runtime-gathered context to increase the dependability of HPC systems. The preliminary experimental results presented here, for a limited set of kernel codes from both scientific and graph-based computing domains, reveal that, with a very modest programming effort, the described approach incurs fairly low execution time overhead while allowing computations to survive a large number of faults that would otherwise always result in the termination of the computation.
As transient faults become the norm rather than the exception, it will become increasingly important to provide users with high-level programming mechanisms with which they can convey important application acceptability criteria. For best performance (in terms of time, power, or energy), the underlying systems need to leverage this information to better navigate the very complex system-level trade-offs and still deliver a reliable and productive computing environment. The work presented here is a simple first step towards this vision.
Speaker Bio: Pedro C. Diniz is a Research Associate in the Computational Sciences Division at the University of Southern California's Information Sciences Institute. Dr. Diniz has 20 years of experience in the areas of computer architecture, high-performance computing and compilation, program analysis and optimization. He has been a principal participant in major research programs funded by DARPA and DoE’s Office of Science. He has collaborated with universities, national laboratories and industry as prime contractor and sub-contractor. Dr. Diniz received a B.S. in Computer and Electrical Engineering and an M.S. in Electrical Engineering from the Technical University of Lisbon in 1988 and 1992, respectively, and a Ph.D. from the University of California at Santa Barbara in 1997. His current research focuses on program analysis for software resiliency and high-performance and reconfigurable computing.
Contact: M. Mascagni
Note: Visitors from outside NIST must contact Cathy Graham, (301) 975-3800, at least 24 hours in advance.