Optimizing SMT Processors for High Single-Thread Performance
Abstract
Simultaneous Multithreading (SMT) processors achieve high processor
throughput at the expense of single-thread performance. This paper
investigates resource allocation policies for SMT processors that
preserve, as much as possible, the single-thread performance of
designated "foreground" threads, while still permitting other
"background" threads to share resources. Since background threads
on such an SMT machine have a near-zero performance impact on
foreground threads, we refer to the background threads as transparent threads.
Transparent threads are ideal for performing low-priority or non-critical
computations, with applications in process scheduling, subordinate
multithreading, and on-line performance monitoring.
To realize transparent threads, we propose three mechanisms for
maintaining the transparency of background threads: slot
prioritization, background thread instruction-window partitioning, and
background thread flushing. In addition, we propose three mechanisms to boost background thread performance without sacrificing
transparency: aggressive fetch partitioning, foreground thread
instruction-window partitioning, and foreground thread flushing. We
implement our mechanisms on a detailed simulator of an SMT processor,
and evaluate them using 8 benchmarks, including 7 from the SPEC
CPU2000 suite. Our results show that when cache and branch predictor
interference are factored out, background threads introduce less than
1% performance degradation on the foreground thread. Furthermore,
maintaining the transparency of background threads reduces their
throughput by only 23% relative to an equal priority scheme.
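As a rough illustration of the idea behind slot prioritization, the toy model below hands out per-cycle issue slots to foreground threads first and lets background threads use only what is left over. It is a minimal sketch under stated assumptions, not the simulator or the exact policy evaluated in the paper: the `Thread` record, the `ready` count, the `allocate_slots` function, and the 8-slot width are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    foreground: bool
    ready: int  # instructions ready to issue this cycle (hypothetical field)


def allocate_slots(threads, width):
    """Grant issue slots to foreground threads first; background threads
    receive only the slots that foreground threads leave unused."""
    grants = {t.name: 0 for t in threads}
    remaining = width
    # sorted() is stable and False sorts before True, so foreground
    # threads (not t.foreground == False) are served first.
    for t in sorted(threads, key=lambda t: not t.foreground):
        take = min(t.ready, remaining)
        grants[t.name] = take
        remaining -= take
    return grants


# One cycle on an assumed 8-wide machine: the foreground thread takes the
# 5 slots it can use, and the background thread gets the remaining 3.
threads = [Thread("bg", False, 6), Thread("fg", True, 5)]
grants = allocate_slots(threads, width=8)
print(grants)
```

In this toy model the foreground thread receives exactly the slots it would have received when running alone, which is the sense in which the background thread is transparent. It ignores the shared instruction window, caches, and branch predictor, which the partitioning and flushing mechanisms above address.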
To demonstrate the usefulness of transparent threads, we study Transparent Software Prefetching (TSP), an implementation of software data prefetching using transparent threads. Due to its near-zero overhead, TSP enables prefetch instrumentation for all loads in a program, eliminating the need for profiling. TSP, without any profile information, achieves a 9.52% gain across 6 SPEC benchmarks, whereas conventional software prefetching guided by cache-miss profiles increases performance by only 2.47%.
Also UMIACS-TR-2003-07