Application-Level Correctness and its Impact on Fault Tolerance
Date
2006-08
Authors
Li, Xuanhua
Yeung, Donald
Abstract
Fundamental to any fault tolerance research is the definition of
correct program execution. Traditionally, correct program execution
requires architectural state to be numerically perfect. However, in
many cases, program execution that is not 100% numerically correct
may still be completely acceptable if the answers satisfy the user's
requirements. Hence, faults that cause such numerically faulty
execution are no longer intolerable.
The extent to which programs are more fault resilient at higher levels
of abstraction is application dependent. Programs that produce
inexact and/or approximate outputs can be very resilient at the
application level. We call such programs soft computations, and
we find they are common in multimedia workloads, as well as artificial
intelligence (AI) workloads. Programs that compute exact numerical
outputs offer less error resilience at the application level.
However, we find all programs studied in this paper exhibit some
enhanced fault resilience at the application level, including those
that are traditionally considered exact computations, e.g., SPECInt
CPU2000.
This report investigates definitions of program correctness that view
correctness from the application's standpoint rather than the
architecture's standpoint. Under application-level correctness,
a program's execution is deemed correct as long as the result it
produces is acceptable to the user. To quantify user satisfaction, we
rely on application-level fidelity metrics that capture user-perceived
program solution quality. We conduct a detailed fault susceptibility
study that measures how much more fault resilient programs are when
defining correctness at the application level compared to the
architecture level. Our results show that, for 6 multimedia and AI
benchmarks, 45.8% of the faults that cause architecturally incorrect
execution still produce results that are correct at the application
level. For 3 SPECInt CPU2000 benchmarks, the corresponding figure is
17.6%. Based on our study of algorithmic properties for fault tolerance,
we also investigate a lightweight fault recovery mechanism that
exploits the relaxed requirements on numerical integrity provided by
application-level correctness to reduce checkpoint cost. Our
lightweight fault recovery mechanism successfully recovers 66.3% of
program crashes in our multimedia and AI workloads, while incurring
minimal runtime overhead.
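As an illustration of the distinction the abstract draws, the following minimal Python sketch classifies the output of a faulty run against a fault-free reference. It is not the report's methodology: the choice of PSNR as the fidelity metric, the 30 dB threshold, and the names `psnr` and `classify_output` are all assumptions made for this example, and comparing outputs only approximates a comparison of architectural state.

```python
import math


def psnr(reference, observed, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length sample lists.

    PSNR is used here as a stand-in fidelity metric (an assumption);
    the report's own metrics are application specific.
    """
    if len(reference) != len(observed) or not reference:
        raise ValueError("outputs must be non-empty and of equal length")
    mse = sum((r - o) ** 2 for r, o in zip(reference, observed)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(peak ** 2 / mse)


def classify_output(reference, observed, threshold_db=30.0):
    """Classify a run's output relative to a fault-free reference.

    'exact'        : numerically identical output
    'acceptable'   : not identical, but fidelity meets the (assumed) threshold
    'unacceptable' : fidelity falls below the threshold
    """
    if list(reference) == list(observed):
        return "exact"
    if psnr(reference, observed) >= threshold_db:
        return "acceptable"
    return "unacceptable"


if __name__ == "__main__":
    golden = [100, 120, 140, 160]          # fault-free output (hypothetical)
    small_error = [100, 121, 140, 160]     # one sample off by 1
    large_error = [100, 0, 140, 160]       # one sample badly corrupted
    for run in (golden, small_error, large_error):
        print(classify_output(golden, run))
```

Under these assumptions, a run whose output differs slightly from the reference counts as incorrect under a strict numerical definition but acceptable under the application-level one, which is the kind of case the fault susceptibility study quantifies.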