Thermal Tracking and Estimation for Microprocessors
Publication or External Link
Due to increasing integration density and operating frequency of today's high performance processors, the temperature of a typical chip can easily exceed 100 degrees Celsius. However, the runtime thermal state of a chip is very hard to predict and manage due to the random nature in computing workloads, as well as the process, voltage and ambient temperature variability (together called PVT variability). The uneven nature (both in time and space) of the heat dissipation of the chip could lead to severe reliability issues and error-prone chip behavior (e.g. timing errors). Many dynamic power/thermal management techniques have been proposed to address this issue such as dynamic voltage and frequency scaling (DVFS), clock gating and etc. However, most of such techniques require accurate knowledge of the runtime thermal state of the chip to make efficient and effective control decisions. In this work we address the problem of tracking and managing the temperature of microprocessors which include the following sub-problems: (1) how to design an efficient sensor-based thermal tracking system on a given design that could provide accurate real-time temperature feedback; (2) what statistical techniques could be used to estimate the full-chip thermal profile based on very limited (and possibly noise-corrupted) sensor observations; (3) how do we adapt to changes in the underlying system's behavior, since such changes could impact the accuracy of our thermal estimation.
The thermal tracking methodology proposed in this work is enabled by on-chip sensors which are already implemented in many modern processors. We first investigate the underlying relationship between heat distribution and power consumption, then we introduce an accurate thermal model for the chip system. Based on this model, we characterize the temperature correlation that exists among different chip modules and explore statistical approaches (such as those based on Kalman filter) that could utilize such correlation to estimate the accurate chip-level thermal profiles in real time. Such estimation is performed based on limited sensor information because sensors are usually resource constrained and noise-corrupted. We also took a further step to extend the standard Kalman filter approach to account for (1) nonlinear effects such as leakage-temperature interdependency and (2) varying statistical characteristics in the underlying system model. The proposed thermal tracking infrastructure and estimation algorithms could consistently generate accurate thermal estimates even when the system is switching among workloads that have very distinct characteristics. Through experiments, our approaches have demonstrated promising results with much higher accuracy compared to existing approaches. Such results can be used to ensure thermal reliability and improve the effectiveness of dynamic thermal management techniques.