Browsing by Author "Li, Xuanhua"
Now showing 1 - 2 of 2
Results Per Page
Sort Options
Item Application-Level Correctness and its Impact on Fault Tolerance(2006-08) Li, Xuanhua; Yeung, DonaldFundamental to any fault tolerance research is the definition of correct program execution. Traditionally, correct program's execution requires architectural state to be numerically perfect. However, in many cases, even if program execution is not 100% numerically correct, it may be completely acceptable if the answers can satisfy user's requirement. Hence, faults which have caused such numerically faulty execution are no longer intolerable. The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia workloads, as well as artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those that are traditionally considered exact computations-e.g., SPECInt CPU2000. This report investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality. We conduct a detailed fault susceptibility study that measures how much more fault resilient programs are when defining correctness at the application level compared to the architecture level. Our results show for 6 multimedia and AI benchmarks that 45.8% of architecturally incorrect faults are correct at the application level. For 3 SPECInt CPU2000 benchmarks, 17.6% of architecturally incorrect faults are correct at the application level. Based on our study on algorithmic properties for fault tolerance, we also investigate a lightweight fault recovery mechanism that exploits the relaxed requirements on numerical integrity provided by application-level correctness to reduce checkpoint cost. Our lightweight fault recovery mechanism successfully recovers 66.3% of program crashes in our multimedia and AI workloads, while incurring minimum runtime overhead.Item Exploiting Inherent Program Redundancy for Fault Tolerance(2009) Li, Xuanhua; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Technology scaling has led to growing concerns about reliability in microprocessors. Currently, fault tolerance studies rely on creating explicitly redundant execution for fault detection or recovery, which usually involves expensive cost on performance, power, or hardware, etc. In our study, we find exploiting program's inherent redundancy can better trade off between reliability, performance, and hardware cost. This work proposes two approaches to enhance program reliability. The first approach investigates the additional fault resilience at the application level. We explore program correctness definition that views correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, multiple numerical outputs can be deemed as correct as long as they are acceptable to users. Thus faults that cause program to produce such outputs can also be tolerated. We find programs which produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and find that they are common in multimedia workloads, as well as artificial intelligence (AI) workloads. Programs that only compute exact numerical outputs offer less error resilience at the application level. However, all programs that we have studied exhibit some enhanced fault resilience at the application level, including those that are traditionally considered as exact computations-e.g., SPECInt CPU2000. We conduct fault injection experiments and evaluate the additional fault tolerance at the application level compared to the traditional architectural level. We also exploit the relaxed requirements for numerical integrity of application-level correctness to reduce checkpoint cost: our lightweight recovery mechanism checkpoints a minimal set of program state including program counter, architectural register file, and stack; our soft-checkpointing technique identifies computations that are resilient to errors and excludes their output state from checkpoint. Both techniques incur much smaller runtime overhead than traditional checkpointing, but can successfully recover either all or a major part of program crashes in soft computations. The second approach we take studies value predictability for reducing fault rate. Value prediction is considered as additional execution, and its results are compared with corresponding computational outputs. Any mismatch between them is accounted as symptom of potential faults and incurs restoration process. To reduce misprediction rate caused by limitations of predictor itself, we characterize fault vulnerability at the instruction level and only apply value prediction to instructions that are highly susceptible to faults. We also vary threshold of confidence estimation according to instruction's vulnerability-instructions with high vulnerability are assigned with low confidence threshold, while instructions with low vulnerability are assigned with high confidence threshold. Our experimental results show benefit from such selective prediction and adaptive confidence threshold on balance between reliability and performance.