Compiler-Based Pre-Execution

dc.contributor.advisorYeung, Donalden_US
dc.contributor.authorKim, Dongkeunen_US
dc.contributor.departmentElectrical Engineeringen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.date.accessioned2005-02-02T06:19:06Z
dc.date.available2005-02-02T06:19:06Z
dc.date.issued2004-10-22en_US
dc.description.abstractPre-execution is a novel latency-tolerance technique where one or more helper threads run in front of the main computation and trigger long-latency delinquent events early so that the main thread makes forward progress without experiencing stalls. The most important issue in pre-execution is how to construct effective helper threads that quickly get ahead and compute the delinquent events accurately. Since the manual construction of helper threads is error-prone and cumbersome for a programmer, automation of such an onerous task is inevitable for pre-execution to be widely used for a variety of real-world workloads. In this thesis, we study compiler-based pre-execution to construct prefetching helper threads using a source-level compiler. We first introduce various compiler algorithms to optimize the helper threads; program slicing removes noncritical code unnecessary to compute the delinquent loads, prefetch conversion reduces blocking in the helper threads by converting delinquent loads into nonblocking prefetches, and loop parallelization speculatively parallelizes the targeted code region so that more memory accesses are overlapped simultaneously. In addition to these algorithms to expedite the helper threads, we also propose several important algorithms to select the right loops for pre-execution regions and pick up the best thread initiation scheme to invoke helper threads. We implement all these algorithms in the Stanford University Intermediate Format (SUIF) compiler infrastructure to automatically generate effective helper threads at the program source level. Furthermore, we replace the external tools to perform program slicing and offline profiling in our most aggressive compiler framework with static algorithms to reduce the complexity of compiler implementation. We conduct thorough evaluation of the compiler-generated helper threads using a simulator that models the research SMT processor. Our experimental results show compiler-based pre-execution effectively eliminates the cache misses and improves the performance of a program. In order to verify whether prefetching helper threads provide wall-clock speedup even in real silicon, we apply compiler-based pre-execution in a real physical system with the Intel Pentium 4 processor with Hyper-Threading Technology. To generate helper threads, we use the pre-execution optimization module in the Intel research compiler infrastructure and propose three helper threading scenarios to invoke and synchronize the helper threads. Our physical experimentation results prove prefetching helper threads indeed improve the performance of selected benchmarks. Moreover, to achieve even more speedup in real silicon, we observe several issues need to be addressed a priori. Unlike the research SMT processor where most processor resources are shared or replicated, some critical hardware structures in the hyper-threaded processor are hard-partitioned in the multithreading mode. Therefore, the resource contention is more intricate, and thus helper threads must be invoked very judiciously. In addition, the program behavior dynamically changes during execution and the helper threads should adapt to it to maximize the benefit from pre-execution. Hence we implement user-level library routines to monitor the dynamic program behavior with little overhead and show the potential of having a runtime mechanism to dynamically throttle helper threads. Furthermore, in order to activate and deactivate the helper threads at a very fine granularity, having light-weight thread synchronization mechanisms is very crucial. Finally, we apply compiler-based pre-execution to multiprogrammed workloads. When introducing helper threads in a multiprogramming environment, multiple main threads compete with each other to acquire enough hardware contexts to launch helper threads. In order to address such a resource contention problem, we propose a mechanism to arbitrate the main threads. Our simulation-based experiment shows pre-execution also helps to boost the throughput of a multiprogrammed workload by reducing the latencies in the individual applications. Moreover, when the helper thread occupancy of each main thread in the workload is not too high, multiple main threads effectively share the hardware contexts for helper threads and utilize the processor resources in the SMT processor.en_US
dc.format.extent978968 bytes
dc.format.mimetypeapplication/pdf
dc.identifier.urihttp://hdl.handle.net/1903/1957
dc.language.isoen_US
dc.subject.pqcontrolledEngineering, Electronics and Electricalen_US
dc.subject.pquncontrolledCompileren_US
dc.subject.pquncontrolledPre-executionen_US
dc.subject.pquncontrolledPhysical Experimentationen_US
dc.subject.pquncontrolledMultithreadingen_US
dc.subject.pquncontrolledSMTen_US
dc.subject.pquncontrolledData Prefetchingen_US
dc.titleCompiler-Based Pre-Executionen_US
dc.typeDissertationen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
umi-umd-1907.pdf
Size:
956.02 KB
Format:
Adobe Portable Document Format