On Concurrent Error Detection and Error Propagation

Abstract: This thesis addresses three important steps in the selection of error detection mechanisms for microprocessors: (i) the design and evaluation of error detection mechanisms, (ii) the study of microprocessor error behavior and propagation, and (iii) the design and use of error models.

The first part of the thesis evaluates four error detection methods with respect to parameters such as error detection coverage and performance loss, while the second and third parts focus on determining the error patterns most likely to occur in a computer system when different types of faults are present and how to incorporate those error patterns into error models. The models can be used to efficiently evaluate error detection mechanisms in error injection experiments when fault injection experiments would be too time-consuming.

To determine the efficiency of an error detection mechanism, it must be evaluated both analytically and experimentally. The first part of this thesis, DESIGN AND EVALUATION OF CONCURRENT ERROR DETECTION MECHANISMS, contains four papers, Papers A to D, on concurrent error detection. Papers A and C describe the design and evaluation of two signature monitoring methods: one similar to the basic technique, and one new method, Implicit Signature Monitoring. The control flow error detection coverage of the basic technique varied between 94% and 99% depending on workload characteristics, while the coverage for the new method was better than 99.99% for a 16-bit signature. Paper B describes how low-cost error detection mechanisms can be combined into a single method, the TTA scheme. The evaluation shows that the method is capable of detecting up to 98% of all control flow errors at a performance loss of about 30%. Paper D describes a method, Application Signature Checking, using time redundancy in combination with watchdog timers to detect transient faults. The evaluation of the method shows that it is capable of detecting more than 99.5% of all transient faults resulting in errors.

Error behavior studies provide a useful means to determine the error patterns most likely to occur in a computer system. The second part of this thesis, STUDIES ON ERROR BEHAVIOR AND ERROR PROPAGATION, contains three papers, Papers E to G, on error behavior studies. The major result of Paper E is that fewer control flow errors can be expected to occur for microprocessors with large register files, e.g. RISCs, than for simple microprocessors having only a few registers, indicating that systems cannot rely only on control flow checking mechanisms to detect errors. The main finding of Paper F is that faults injected into internal signals propagate more rapidly than faults injected directly into registers, and the primary result of Paper G is that control flow errors are not randomly distributed with respect to where they occur and to what location the erroneous branches go.

To speed up fault simulation, it is necessary to develop error models that are abstractions of low-level faults. The errors can then be injected instead of the faults. The third part of this thesis, OBSERVATIONS ON ERROR MANIFESTATION ON THE FUNCTIONAL LEVEL, presents an error model that models errors on the functional level. In Paper H, the accuracy of the error model (error behavior function) for bit-flip faults is evaluated by comparing the results of single bit-flip fault injections into a microprocessor with the results of error injections into the same processor. The results are very similar, indicating that the model is accurate. The fraction of faults that can be emulated using the error model presented in Paper H is experimentally evaluated in Paper I for two fault models and six workloads. The results show that 70% to 98% of all faults can be emulated.
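
As a purely illustrative sketch, and not code from the thesis, a single bit-flip fault in a register can be modelled as an XOR with a one-bit mask. The list-based register file, the function names `flip_bit` and `inject_register_fault`, and the 32-bit default width are all assumptions made for this example; the thesis's own fault injection setup and error behavior function are not reproduced here.

```python
def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return `word` with one bit inverted (hypothetical single bit-flip fault)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1          # keep the result within the register width
    return (word ^ (1 << bit)) & mask


def inject_register_fault(registers: list[int], reg: int, bit: int,
                          width: int = 32) -> list[int]:
    """Return a copy of an assumed register file with one bit flipped in one register."""
    faulty = list(registers)         # leave the fault-free state untouched
    faulty[reg] = flip_bit(faulty[reg], bit, width)
    return faulty
```

In an experiment of the kind described above, the faulty state would then be run on the same workload as the fault-free state and the outcomes compared; that comparison step depends on the simulator used and is left out of this sketch.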
