MICRO-50 Tutorial for Reliability Assessment

Microarchitecture Level Reliability Assessment: Throughput and Accuracy

Sunday, 15 October 2017, Boston, MA, USA
Morning tutorial held in conjunction with MICRO 2017

Tutorial Summary

Early assessment of the vulnerability of microprocessor components to hardware faults can drive effective protection decisions. Microarchitecture-level simulators are employed for such early assessments and can deliver reliability reports for a large number of hardware structures taking into consideration the masking effects of the entire stack of hardware and software layers. Statistical fault injection at the microarchitecture level is a very accurate approach which, however, may suffer from low throughput if a statistically significant assessment is required.
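To make the throughput concern concrete, the sketch below estimates how many faults a campaign must inject for a given error margin and confidence level. It uses the standard finite-population sample-size formula that is common in the statistical fault injection literature; the tutorial summary does not state which formula GeFIN uses, so the function name, parameter names, and defaults here are assumptions for illustration only.

```python
import math


def sample_size(population, margin=0.01, t=1.96, p=0.5):
    """Estimate the number of faults to inject (hypothetical helper).

    population: total number of possible faults (e.g. bits x cycles)
    margin:     tolerated error margin on the measured proportion
    t:          normal-distribution cut-off (1.96 for 95% confidence)
    p:          assumed failure proportion; 0.5 is the worst case
    """
    if population <= 0:
        raise ValueError("population must be positive")
    n = population / (1 + margin**2 * (population - 1) / (t**2 * p * (1 - p)))
    return math.ceil(n)


if __name__ == "__main__":
    # One billion candidate faults, 1% margin, 95% confidence.
    print(sample_size(10**9))
```

Under these assumptions, a population of one billion candidate faults needs 9,604 injections per structure and workload, each of which is a full simulation run. This is why per-run speed and fault-list pruning matter for throughput.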

This tutorial focuses on recent advances delivered by the Computer Architecture Lab of the University of Athens in the area of microarchitecture-level reliability assessment using statistical fault injection. We present GeFIN (Gem5-based Fault Injector), a state-of-the-art microarchitecture-level fault injection framework built on the Gem5 simulator. GeFIN supports massive and fast injection campaigns for all types of faults (transient, permanent, intermittent) on arbitrary combinations of several dozen microarchitectural components modeled in Gem5. We first present the baseline Gem5 engine as well as the AVF (Architectural Vulnerability Factor) and FIT (Failures in Time) measurements reported by the tool, which also reports fine-grained fault effect classifications.
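As a rough illustration of how AVF and FIT relate to the per-fault outcomes of an injection campaign, the sketch below computes AVF as the fraction of injected faults that are not masked and derives a structure's FIT rate from it. The outcome labels ("masked", "sdc", "crash"), the function names, and the numeric inputs are hypothetical and are not taken from GeFIN's actual reports.

```python
from collections import Counter


def avf_from_outcomes(outcomes, benign=("masked",)):
    """Estimate AVF as the fraction of injections that were not benign.

    outcomes: iterable of per-injection class labels (hypothetical names)
    benign:   labels that count as no visible effect on the program
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no injection outcomes given")
    failures = total - sum(counts[label] for label in benign)
    return failures / total


def structure_fit(avf, bits, raw_fit_per_bit):
    """FIT of one structure: raw per-bit FIT, scaled by its size and AVF."""
    return avf * bits * raw_fit_per_bit


if __name__ == "__main__":
    # Made-up campaign: 90 masked faults, 6 silent data corruptions, 4 crashes.
    outcomes = ["masked"] * 90 + ["sdc"] * 6 + ["crash"] * 4
    avf = avf_from_outcomes(outcomes)
    # Made-up structure size and raw FIT rate, for illustration only.
    print(avf, structure_fit(avf, bits=32 * 1024 * 8, raw_fit_per_bit=1e-4))
```

Keeping the per-class counts, rather than only the final AVF, is what allows a fine-grained breakdown of fault effects per structure.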

We also present two GeFIN add-ons designed to improve the throughput of injection campaigns while preserving the accuracy of the reliability measurements. The first add-on is a set of speed-up methods applied to individual GeFIN runs. The second add-on is MeRLiN, a fault classification approach based on dynamic instruction profiling, which aims at pruning the number of faults in extremely large fault lists. Both add-ons deliver large throughput improvements (several orders of magnitude) for comprehensive (and thus statistically significant) fault injection campaigns while preserving the reported AVF measurements.

The tutorial includes measurements for different microarchitectural configurations (corresponding to different CPU models), a discussion of ACE analysis and fault injection at the microarchitecture level, a discussion of CPU and GPU reliability assessment at the microarchitecture level, and a comparison between microarchitecture-level and register-transfer-level fault injection on a commercial CPU model.