Study of the Impact of Hardware Failures on Software Reliability
Abstract
Software plays an increasingly important role in modern safety-critical systems, so reliable software is desirable for all stakeholders. Typical software-related failures include internal software failures, input failures, output failures, support failures, and multiple-interaction failures. This dissertation provides a methodology for studying the impact of hardware support failures on software reliability.
The hardware failures considered in this study are intrinsic semiconductor device failures that are directly related to software execution during device operation. Software execution on a hardware device is, in essence, a series of alternations between 0 and 1 on the inputs of hardware components. These signal alternations cause voltage changes and current flows in the microelectronic device, which act as electrical stresses on the device and may lead to physical failures. The failure mechanisms include Hot Carrier Injection (HCI), Electromigration (EM), and Time-Dependent Dielectric Breakdown (TDDB). During device operation, such hardware failures can propagate to the circuit level in the form of signal delays, changes in circuit functionality, and signals stuck at a logic value (0 or 1), and from there into the software layer, where they affect the reliability of the software.
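The idea that software execution appears to the hardware as a series of 0/1 alternations can be made concrete by counting bit toggles in a register. The following is a minimal sketch, not the dissertation's tooling: the function name `count_bit_toggles`, the 8-bit width, and the sample trace are all assumptions chosen for illustration.

```python
def count_bit_toggles(values, width=8):
    """Count 0->1 and 1->0 transitions per bit position across a value trace.

    `values` is a hypothetical sequence of register contents observed while
    a program runs; index 0 of the result is the least significant bit.
    """
    mask = (1 << width) - 1
    toggles = [0] * width
    for prev, curr in zip(values, values[1:]):
        changed = (prev ^ curr) & mask  # bits that differ between two steps
        for bit in range(width):
            if (changed >> bit) & 1:
                toggles[bit] += 1
    return toggles


# Hypothetical trace of an 8-bit register during execution.
trace = [0x00, 0x01, 0x03, 0x02, 0xFF]
print(count_bit_toggles(trace))
```

Per-bit toggle counts of this kind are one plausible way to summarize the electrical stress that a given program places on each storage element.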
The proposed methodology is divided into three parts: (i) analysis of the manifestations of permanent failures on circuit elements (logic gates, flip-flops, etc.), (ii) development of reliability models for the circuit elements as functions of the software execution, and (iii) calculation of failure probability distributions of the hardware circuit elements under the software execution.
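Parts (ii) and (iii) can be illustrated with a toy reliability model. The sketch below is purely hypothetical: it assumes a Weibull failure distribution whose characteristic life shrinks as the toggle rate of a circuit element grows. The functional form, the parameters `eta0`, `beta`, and `k`, and the element names are assumptions for illustration and are not taken from the dissertation.

```python
import math


def failure_probability(t_hours, toggle_rate, eta0=1.0e6, beta=2.0, k=1.0):
    """Probability that an element has failed by time `t_hours`.

    Hypothetical model (an assumption, not the dissertation's):
    a Weibull CDF with characteristic life
        eta = eta0 / (1 + k * toggle_rate),
    so elements that switch more often wear out sooner.
    """
    eta = eta0 / (1.0 + k * toggle_rate)
    return 1.0 - math.exp(-((t_hours / eta) ** beta))


# Hypothetical toggle rates (toggles per clock cycle) for three elements.
toggle_rates = {"reg_A_bit0": 0.40, "reg_A_bit7": 0.05, "alu_gate_12": 0.25}

for name, rate in toggle_rates.items():
    p = failure_probability(t_hours=5.0e5, toggle_rate=rate)
    print(f"{name}: P(fail by 5e5 h) = {p:.4f}")
```

Under these assumptions, the element with the highest toggle rate receives the highest failure probability at any fixed time, which is the qualitative behavior a software-dependent reliability model is meant to capture.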
The methodology is applied to a comprehensive case study targeting all the CPU registers and ALU logic gates of a computer system based on the Z80 microprocessor. About 120 different types of failure manifestations are observed, and more than 250 reliability models are developed for the different types of failure manifestations and circuit elements. These models allow us to calculate the failure probability distributions of the CPU registers and ALU gates of the Z80 computer system under software execution. We also extend the methodology and the case study to transient failures, also known as Single Event Upsets (SEUs).
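To give a sense of what a failure manifestation on an ALU gate looks like, the sketch below injects a stuck-at fault into a 1-bit full adder and lists the input vectors for which the outputs differ from the fault-free circuit. This is a generic illustration of stuck-at fault analysis, not the Z80 netlist or the dissertation's procedure; the gate-level structure and the node names `x1`, `a1`, and `a2` are assumptions.

```python
from itertools import product


def full_adder(a, b, cin, stuck=None):
    """1-bit full adder built from XOR, AND, and OR gates.

    `stuck` optionally forces one internal node to a fixed logic value,
    e.g. ("x1", 0) models node x1 stuck at 0. Node names are hypothetical.
    """
    def node(name, value):
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return value

    x1 = node("x1", a ^ b)     # first XOR stage
    a1 = node("a1", a & b)     # carry generate
    a2 = node("a2", x1 & cin)  # carry propagate
    s = x1 ^ cin
    cout = a1 | a2
    return s, cout


def manifestations(stuck):
    """Input vectors whose (sum, carry) differ from the fault-free adder."""
    return [
        (a, b, cin)
        for a, b, cin in product((0, 1), repeat=3)
        if full_adder(a, b, cin, stuck) != full_adder(a, b, cin)
    ]


print(manifestations(("x1", 0)))
print(manifestations(("a1", 1)))
```

Whether a program ever exercises one of the listed input vectors determines whether the hardware fault becomes visible at the software layer.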