Introspective Fault Tolerance for Exascale Systems
This paper discusses introspective fault tolerance for exascale systems, highlighting the need for multi-way communication mechanisms between hardware, OS, runtime systems, and applications. It emphasizes tuning tradeoffs based on application characteristics, power, performance, and resiliency while addressing challenges in fault detection, system-level improvements, and hardware preparedness.
Presentation Transcript
Introspective Fault Tolerance for Exascale Systems
Rinku Gupta, Kamil Iskra, Kazutomo Yoshii, Pavan Balaji, Pete Beckman
Motivation
- Exascale systems will have faults
  - Power constraints, high-density silicon
  - Number of hardware/software components
- Both hardware and software have a role to play
  - Hardware techniques: ECC checks, 2D error coding
    - Can get too expensive when bit rates increase (both cost and power)
  - Software techniques need to complement hardware resilience with clearly defined roles
- Mechanisms are needed for lower-level hardware and the operating system to interface with upper levels for end-to-end resiliency and fault tolerance
- Threads at each level of the system:
  - Datacenter: 10^9 threads
  - Rack: 10^4 - 10^5 threads
  - Socket: 500 - 5000 threads
  - Die: 100 - 1000 threads
  - Core/tile: 1 - 10 threads
Image courtesy of Intel: SC 11 BOF on Resilience S/W on Exascale Computing
Introspective Fault Tolerance
- Current fault exchange models are too simplistic:
  - OS kills the application on a hard error
  - OS/hardware returns an error code saying something bad happened
  - Hardware/OS/low-level runtime automatically corrects errors and hides them from the application
- The fundamental concept of introspective fault tolerance: a multi-way communication mechanism between operating system, runtime systems and applications
  - Hardware/OS/runtime should continue to give information to applications (as they currently do)
  - Applications/runtime systems should also pass down information (or hints) to the low-level runtime/OS on what they can get away with
    - Tuning tradeoffs based on application characteristics (e.g., the OS can turn off ECC checks for some application-specified memory regions)
    - Tradeoffs based on power, performance and resiliency (e.g., lower voltage means lower power, but more faults)
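The downward flow of hints described on this slide can be pictured with a minimal Python sketch. Everything in it is hypothetical and is not taken from the presentation: the names `RegionHint`, `Protection`, `plan_protection` and `unprotected_fraction` are invented for illustration, and the policy is a toy stand-in for whatever the OS would actually do. It only shows the shape of the idea: the application declares which memory regions can tolerate bit flips, and the lower layer relaxes ECC checking for those regions alone.

```python
from dataclasses import dataclass
from enum import Enum


class Protection(Enum):
    """Protection level the (simulated) OS applies to a region."""
    ECC_ON = "ecc_on"
    ECC_OFF = "ecc_off"


@dataclass(frozen=True)
class RegionHint:
    """Hypothetical hint an application passes down about one memory region."""
    name: str
    size_bytes: int
    tolerates_bit_flips: bool  # e.g. data the application can recompute


def plan_protection(hints):
    """Toy OS-side policy: turn ECC checks off only where the application
    has said that errors are tolerable; keep them on everywhere else."""
    return {
        h.name: (Protection.ECC_OFF if h.tolerates_bit_flips else Protection.ECC_ON)
        for h in hints
    }


def unprotected_fraction(hints):
    """Fraction of the hinted bytes that would run without ECC checks."""
    total = sum(h.size_bytes for h in hints)
    relaxed = sum(h.size_bytes for h in hints if h.tolerates_bit_flips)
    return relaxed / total if total else 0.0


# Example hints (region names and sizes are made up for illustration).
example_hints = [
    RegionHint("checkpoint_metadata", 1 << 20, tolerates_bit_flips=False),
    RegionHint("scratch_buffer", 3 << 20, tolerates_bit_flips=True),
]
example_plan = plan_protection(example_hints)
```

With these example hints, only `scratch_buffer` loses ECC checking, so three quarters of the hinted bytes run unprotected. A real policy would also weigh the power and performance side of the tradeoff, which this sketch leaves out.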
Challenges
Research focus for achieving this goal:
- Understand which faults/system changes have a high impact on applications
- Understand how to improve fault detection at the OS or system level
- What interfaces are required between the operating system and upper-level software?
- What techniques would allow upper-level software to use information received from the OS?
- What mechanisms are needed in the OS to manipulate resilience, power and performance?
- Is hardware prepared for this?
An Example Interface
- Based on annotations and low-level interfaces/hooks
  - Allocate regular memory
    - Introspect soft ECC errors
  - Allocate memory with hard error returns
    - Introspect hard ECC errors
  - Allocate unreliable memory
    - Call routines for memory check
- Application can query the OS for soft/hard error information
  - Decide whether to continue execution or migrate/terminate
  - Better end-to-end fault tolerance
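The interface outlined on this slide could look roughly like the following Python sketch. It is a user-space simulation under stated assumptions, not the authors' API: `SimulatedOS`, `MemKind`, `Action`, `decide`, the `inject_error` test hook and the `soft_limit` threshold are all hypothetical names and values chosen for illustration. The sketch covers the three allocation kinds, the error query, and the application-side choice between continuing, migrating and terminating.

```python
from dataclasses import dataclass
from enum import Enum


class MemKind(Enum):
    REGULAR = "regular"            # soft ECC errors can be introspected
    HARD_ERROR_RETURNS = "hard"    # hard errors are reported to the application
    UNRELIABLE = "unreliable"      # application calls memory-check routines itself


class Action(Enum):
    CONTINUE = "continue"
    MIGRATE = "migrate"
    TERMINATE = "terminate"


@dataclass
class Region:
    kind: MemKind
    size: int
    soft_errors: int = 0
    hard_errors: int = 0


class SimulatedOS:
    """Stand-in for the OS side of the interface (hypothetical)."""

    def __init__(self):
        self._regions = {}
        self._next_handle = 1

    def allocate(self, size, kind):
        """Allocate a region of the requested kind and return a handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._regions[handle] = Region(kind, size)
        return handle

    def inject_error(self, handle, hard=False):
        """Test hook: record one simulated soft or hard error."""
        region = self._regions[handle]
        if hard:
            region.hard_errors += 1
        else:
            region.soft_errors += 1

    def query_errors(self, handle):
        """Return (soft_errors, hard_errors) recorded for a region."""
        region = self._regions[handle]
        return region.soft_errors, region.hard_errors

    def check_memory(self, handle):
        """Toy memory-check routine: True if no errors have been recorded."""
        return self.query_errors(handle) == (0, 0)


def decide(os_, handle, soft_limit=10):
    """Toy application policy (assumption): terminate on any hard error,
    migrate once soft errors exceed soft_limit, otherwise continue."""
    soft, hard = os_.query_errors(handle)
    if hard > 0:
        return Action.TERMINATE
    if soft > soft_limit:
        return Action.MIGRATE
    return Action.CONTINUE
```

An application would allocate each buffer with the kind that matches how much it can tolerate, then call `decide` at convenient points, such as between iterations. The threshold and the mapping from error counts to actions are placeholders; the point is that the decision moves to the application, which knows what an error in that region actually costs.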