Pre-execution is a novel latency-tolerance technique in which one or more helper threads run ahead of the main computation and trigger long-latency delinquent events early, so that the main thread makes forward progress without stalling. The most important issue in pre-execution is how to construct effective helper threads that quickly get ahead and compute the delinquent events accurately. Since manually constructing helper threads is error-prone and cumbersome for a programmer, automating this task is essential if pre-execution is to be widely used for a variety of real-world workloads. In this thesis, we study compiler-based pre-execution, which constructs prefetching helper threads using a source-level compiler. We first introduce several compiler algorithms to optimize the helper threads: program slicing removes noncritical code that is not needed to compute the delinquent loads, prefetch conversion reduces blocking in the helper threads by converting delinquent loads into nonblocking prefetches, and loop parallelization speculatively parallelizes the targeted code region so that more memory accesses are overlapped. In addition to these algorithms for speeding up the helper threads, we propose algorithms that select the right loops as pre-execution regions and pick the best thread initiation scheme for invoking helper threads. We implement all of these algorithms in the Stanford University Intermediate Format (SUIF) compiler infrastructure to automatically generate effective helper threads at the program source level. Furthermore, we replace the external tools that perform program slicing and offline profiling in our most aggressive compiler framework with static algorithms, reducing the complexity of the compiler implementation. We conduct a thorough evaluation of the compiler-generated helper threads using a simulator that models a research SMT processor.
Our experimental results show that compiler-based pre-execution effectively eliminates cache misses and improves program performance. To verify that prefetching helper threads provide wall-clock speedup in real silicon as well, we apply compiler-based pre-execution on a physical system with an Intel Pentium 4 processor with Hyper-Threading Technology. To generate helper threads, we use the pre-execution optimization module in the Intel research compiler infrastructure and propose three helper threading scenarios for invoking and synchronizing the helper threads. Our results on the physical system show that prefetching helper threads do improve the performance of selected benchmarks. We also observe several issues that must be addressed to achieve further speedup in real silicon. Unlike the research SMT processor, where most processor resources are shared or replicated, some critical hardware structures in the hyper-threaded processor are hard-partitioned in multithreading mode. Resource contention is therefore more intricate, and helper threads must be invoked judiciously. In addition, program behavior changes dynamically during execution, and the helper threads should adapt to it to maximize the benefit of pre-execution. Hence we implement user-level library routines that monitor dynamic program behavior with little overhead, and we show the potential of a runtime mechanism that dynamically throttles helper threads. Furthermore, activating and deactivating helper threads at a very fine granularity requires light-weight thread synchronization mechanisms. Finally, we apply compiler-based pre-execution to multiprogrammed workloads. When helper threads are introduced in a multiprogramming environment, multiple main threads compete with each other to acquire enough hardware contexts to launch helper threads.
To address this resource contention problem, we propose a mechanism to arbitrate among the main threads. Our simulation-based experiments show that pre-execution also boosts the throughput of a multiprogrammed workload by reducing the latencies in the individual applications. Moreover, when the helper thread occupancy of each main thread in the workload is not too high, multiple main threads effectively share the hardware contexts for helper threads and utilize the processor resources of the SMT processor.
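
As a rough illustration of the program slicing and prefetch conversion steps summarized above, the following C sketch contrasts a main loop with a hand-written helper-thread body. It is not output of the SUIF or Intel compilers described in the thesis. The node type, the names main_loop and helper_slice, and the choice of the payload dereference as the delinquent load are all hypothetical, and the prefetch uses the GCC/Clang builtin __builtin_prefetch as a stand-in for whatever prefetch instruction a real compiler would emit.

```c
#include <stddef.h>

/* Hypothetical node type for illustration; not taken from the thesis. */
typedef struct node {
    struct node *next;
    const int   *payload;  /* assumed to point at data that misses in cache */
    int          weight;
} node;

/* Main computation: walks the list and does arithmetic on each payload.
   The load of *n->payload is the assumed delinquent load. */
long main_loop(const node *head) {
    long sum = 0;
    for (const node *n = head; n != NULL; n = n->next) {
        sum += (long)(*n->payload) * n->weight;
    }
    return sum;
}

/* Helper-thread body after slicing and prefetch conversion (sketch):
   - slicing keeps only the pointer chase needed to form the address;
     the multiply and the accumulation are dropped;
   - the delinquent load is replaced by a non-blocking prefetch.
   Returns the number of prefetches issued so the sketch can be checked. */
size_t helper_slice(const node *head) {
    size_t issued = 0;
    for (const node *n = head; n != NULL; n = n->next) {
        /* GCC/Clang builtin: address, 0 = read access, 3 = keep in cache */
        __builtin_prefetch(n->payload, 0, 3);
        issued++;
    }
    return issued;
}
```

In an actual pre-execution setup, helper_slice would run on a spare hardware context ahead of main_loop. Here both are plain functions so that the sketch stays single-threaded and easy to check.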