Computer Architecture Lab
Dept. of Informatics & Telecommunications @ University of Athens
L. Yang, G. Papadimitriou, D. Sartzetakis, A. Jog, E. Smirni, and D. Gizopoulos, “GPU Reliability Assessment: Insights Across the Abstraction Layers”, IEEE International Conference on Cluster Computing (CLUSTER 2024), Kobe, Japan, September 2024.
CAL focuses on different aspects of Computer Architecture. We deal with computing systems built around general-purpose and specialized CPUs, memory systems, and accelerators such as GPUs. We care about the complex interactions among different parameters including performance, power/energy and dependability/reliability. We deliver methods and tools for fast evaluation of reliability, for energy-efficient computing, for error detection and recovery, as well as for silicon debug and validation.
We design mechanisms for improving the performance of computing systems, focusing on emerging application domains that stress the limits of the memory system. Our hardware, software, and co-design approaches span all layers of the computing stack, from the application, the runtime and the OS, to the architecture and microarchitecture layers.
We devise models, methods and tools for the silicon debug and validation of CPUs and memories, aiming to detect and locate hard-to-detect design bugs that escape pre-silicon, simulation-based verification. Our methods aim to improve the coverage of the silicon debug process (and thus its bug detection capability) while significantly reducing the time the process takes.
We deliver methods and tools for evaluating the reliability of computing systems under different fault models (transient or permanent) and for all major hardware components, including CPUs, GPUs and memories. The main focus is on the speed and accuracy of the reliability evaluation. We work at the microarchitecture level, the software level and the register-transfer (RT) level.
We investigate the design margins of modern computing hardware to reveal the potential for energy and power savings when systems operate beyond nominal conditions of voltage, frequency and refresh rate. We characterize how these margins vary among different chips, different cores within a chip and different workloads, aiming to predict safe and energy-efficient operating points of modern hardware.
We design hardware-based and software-based methods for the detection and tolerance of transient and permanent faults in the hardware components of computing systems. We provide solutions for CPUs, GPUs and memories. The main focus is the error coverage of the methods as well as the minimization of their cost in terms of system performance, energy/power and hardware area.