Thermal, Power Delivery and Reliability Management for 3D ICS
Files
Publication or External Link
Date
Authors
Advisor
Citation
DRUM DOI
Abstract
Three-dimensional (3D) integration technology is promising to continuously improve the performance of electronic devices by vertically stacking multiple active layers and connecting them with Through-Silicon-Vias (TSVs). Meanwhile, the thermal and power integrity problems are exacerbated since the power flux in 3D integrated circuits (3D ICs) increases linearly with the number of stacked layers. Moreover, the TSV structure in 3D ICs introduces new reliability problems since TSVs are vulnerable to various failure mechanisms (e.g. electromigration) and the failure of power-ground TSVs will cause voltage drop thereby significantly degrading the performance of 3D ICs. To make things worse, the high temperature, thermal gradient and power load in 3D ICs accelerate the failure of TSVs. Therefore, in order to push the 3D integration technology to full commercialization, the thermal, power integrity and reliability problem should be properly addressed in both design-time and run-time.
In 3D ICs, the heat flux will easily exceed the capability of the traditional air cooling. Therefore, several aggressive cooling methods are applied to remove heat from the 3D IC, which include micro-fluidic cooling, the phase change material based cooling etc. These cooling schemes are usually implemented close to the heat source to gain high heat removal capability, thus causing more challenges to the design of 3D ICs. Unfortunately, physical design tools for 3D ICs with those aggressive cooling methods are lack. In this thesis, we will focus on 3D ICs with micro-fluidic (MF) cooling. The physical design for this kind of 3D ICs involves complex trade-offs between the circuit performance, power delivery noise, and temperature. For example, both TSVs and micro-cavities for MF cooling are fabricated in the substrate region. Therefore, they will compete in space: the allocation of signal TSVs should avoid micro-cavities to realize a feasible design, thus enforcing more constraints to the physical placement of 3D ICs. Moreover, power delivery networks (PDNs) in 3D ICs are enabled by power-ground (P/G) TSVs. The number and distribution of P/G TSVs are also constrained by micro-cavities which will influence the power integrity of the 3D IC. In addition, the capability of MF cooling degrades downstream the flow of coolant thereby causing large in-layer temperature gradient. The spatial temperature variance will affect the reliability of 3D ICs. in order to avoid it, the gate/modules in 3D ICs should be placed properly. In order to address the trade-offs 3D ICs with MF cooling, different design-time methods for application specific ICs (ASICs) and field programmable gate arrays (FPGAs) are proposed, respectively. For 3D ASICs, we propose a co-design method that integrates the design of MF cooling heat sink and P/G TSVs to the physical placement for 3D ICs. Experiments on publicly available benchmarks show that using our method, we can achieve better results compared to the traditional sequential design flow. The case for 3D FPGAs is more complicated than ASICs since the routing and logic resources are fixed and the chip power and temperature is hard to estimate until the circuit is routed. Therefore, in this thesis, we first build a design space exploration (DSE) framework to study how MF cooling affects the design of 3D FPGAs. Following this, we utilize an existing 3D FPGA placement and routing tool to develop a cooling-aware placement framework for 3D FPGAs to reduce the temperature gradient.
Since the activity of 3D ICs cannot be completely estimated at the design stage, the run-time management, besides design-time methods, is required to address the thermal, power and reliability problems in 3D ICs. However, the vertically stacked structure makes the run-time management for 3D ICs more complicated than 2D ICs. The major reason of this is that the power supply noise and temperature can be coupled across layers in 3D ICs. This means the activity of one layer may affect the performance and reliability of other layers through voltage/temperature coupling. As a result, we cannot perform run-time management for each layer (perhaps implemented with dierent chips) of 3D ICs separately as in 2D systems. Therefore, the space of control nodes will become larger and more complicated. To make things worse, the existing run-time management techniques have various drawbacks (e.g. large off-line characterization overhead, poor scalability etc. ), which needs more eort to improve. In this thesis, we propose a phase-driven Q-learning based run-time management technique which can tune the activity of the processor to maximize the 3D CPU performance subject to the reliability constraint.