A. James Clark School of Engineering

Permanent URI for this community: http://hdl.handle.net/1903/1654

The collections in this community comprise faculty research works, as well as graduate theses and dissertations.

Search Results

Now showing 1 - 2 of 2
  • Item
    Extended Split-Issue: Enabling Flexibility in the Hardware Implementation of NUAL VLIW DSPs
    (2004-06) Iyer, Bharath; Srinivasan, Sadagopan; Jacob, Bruce
    DSPs based on the VLIW architecture have become widespread due to the combined benefits of simple hardware and compiler-extracted instruction-level parallelism. However, the VLIW instruction set architecture and its hardware implementation are tightly coupled, especially so for Non-Unit Assumed Latency (NUAL) VLIWs. The problem of object-code compatibility across processors with different numbers of functional units or different hardware latencies has been the Achilles' heel of this otherwise powerful architecture. In this paper, we propose eXtended Split-Issue (XSI), a novel mechanism that breaks the instruction-packet syntax of an NUAL VLIW compiler without violating its dataflow dependences. XSI gives a designer the freedom to decouple the hardware implementation of an NUAL VLIW processor from the instruction set architecture. Further, we investigate fairly radical (in the context of VLIW) changes to the hardware, such as removing an adder, adding a multiplier, and incorporating simultaneous multithreading (SMT), to show that our technique works for a variety of hardware configurations without compromising performance. The technique can be used in both single-threaded and multithreaded architectures to achieve a level of flexibility heretofore unavailable in the VLIW arena.
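
A minimal Python sketch of the packet-regrouping idea behind split-issue-style mechanisms like the XSI proposal above. This is an illustration, not code from the paper: the `Op` record, the `split_packet` function, and the per-cycle unit budgets are hypothetical names invented here; the sketch assumes at least one functional unit of every class and that the packet's operations do not read or write one another's registers; and it models neither the NUAL latency bookkeeping nor the hardware mechanism XSI actually adds.

```python
from dataclasses import dataclass

@dataclass
class Op:
    dst: str      # register written
    srcs: list    # registers read
    unit: str     # functional-unit class, e.g. "alu" or "mul"
    latency: int  # assumed (NUAL) latency in cycles; not modeled below

def split_packet(packet, units):
    """Split one wide VLIW packet into smaller issue groups for a
    machine with fewer functional units. `units` maps a unit class to
    the number available per cycle, e.g. {"alu": 1, "mul": 1}. Because
    the operations of a single packet are assumed independent here,
    any grouping that respects the unit budget preserves the
    compiler's dataflow dependences."""
    groups, current, used = [], [], {}
    for op in packet:
        if used.get(op.unit, 0) >= units.get(op.unit, 0):
            groups.append(current)   # unit budget exhausted: start a new group
            current, used = [], {}
        current.append(op)
        used[op.unit] = used.get(op.unit, 0) + 1
    if current:
        groups.append(current)
    return groups

# A 3-wide packet retargeted to a machine with one ALU and one multiplier.
packet = [Op("r1", ["r2", "r3"], "alu", 1),
          Op("r4", ["r5", "r6"], "alu", 1),
          Op("r7", ["r8", "r9"], "mul", 3)]
for i, group in enumerate(split_packet(packet, {"alu": 1, "mul": 1})):
    print(f"group {i}:", [op.dst for op in group])  # group 0: ['r1']; group 1: ['r4', 'r7']
```

The point of the sketch is only the degree of freedom the abstract describes: the compiler's packet boundaries need not match the machine's issue width, so the same NUAL binary could, in principle, be regrouped for narrower or differently equipped implementations.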
  • Item
    VLIW Instruction Scheduling for Reduced Code Size
    (2005-12-05) Haga, Steve Wayne; Barua, Rajeev K; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Code size is important to the cost of embedded systems. Although VLIW architectures are popular for embedded systems, they impose constraints on instruction placement that make it difficult to find a compact schedule. Existing VLIW instruction scheduling methods primarily target run-time, not code size. The usual approach has two components. First, methods such as trace scheduling provide a mechanism for correctly moving instructions across basic blocks. Second, the instructions within a trace are scheduled, possibly moving instructions across the blocks inside the trace. Because run-time is the only consideration, this approach increases code size by inserting compensation code. Methods such as superblocking increase the size even further by duplicating code. We present a compiler method for instruction scheduling that, for the first time, uses the power of across-block scheduling methods such as trace scheduling to reduce code size as well as run-time. For a certain class of VLIWs, we show that trace scheduling, previously synonymous with increased code size, can in fact reduce it. Our within-trace scheduler uses a cost-model-driven, backtracking approach. Starting from an optimal, exponential-time algorithm, branch-and-bound techniques and non-optimal heuristics reduce the compile time to within a factor of 2 of the original, on average. The code size for our benchmarks is reduced by 16.3% versus the best existing across-block scheduler, while staying within 0.8% of its run-time, on a 6-wide VLIW. For a 3-wide VLIW, code size improves by 14.7%, with the same 0.8% run-time cost. Thus, the code-size improvements are fairly stable across VLIW widths. We further explore the impact of our techniques on machines with predication support or small I-cache sizes. In the process, we present a novel predication analysis of general applicability. When predication is present, the code-size improvement rises to 16.6%. In addition, for machines with small I-caches, the reduced code size of our approach tends to yield better cache hit rates. We find that, although this effect is modest, the performance improvement more than offsets the run-time costs of our method. Therefore, on machines with small I-caches, our code-size improvements come at no run-time cost.
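
The within-trace scheduler described above starts from an optimal, exponential-time algorithm and prunes it with branch-and-bound. Below is a minimal Python sketch of that general search strategy only, not the paper's algorithm or cost model: it assumes unit latencies, a single trace with dependences given as (producer, consumer) pairs, and a fixed-width VLIW in which every cycle is encoded as one instruction word (so minimizing cycles minimizes code size); the name `bb_schedule` and its input format are invented for illustration.

```python
from math import ceil

def bb_schedule(n_ops, deps, width):
    """Branch-and-bound sketch: place ops 0..n_ops-1 into cycles of at
    most `width` slots so every consumer lands in a strictly later
    cycle than its producers (unit latencies assumed), minimizing the
    cycle count. On a fixed-width VLIW that encodes NOPs explicitly,
    fewer cycles means fewer instruction words, i.e. smaller code."""
    preds = [set() for _ in range(n_ops)]
    for producer, consumer in deps:
        preds[consumer].add(producer)

    best = {"len": n_ops + 1, "sched": None}   # worst case: one op per cycle

    def search(cycles, placed):
        remaining = n_ops - len(placed)
        if remaining == 0:
            best["len"], best["sched"] = len(cycles), [c[:] for c in cycles]
            return
        free = width - len(cycles[-1])
        # Prune: cycles used so far plus a perfect packing of the rest
        # is a lower bound on any completion of this partial schedule.
        if len(cycles) + max(0, ceil((remaining - free) / width)) >= best["len"]:
            return
        earlier = set().union(*cycles[:-1])    # ops in strictly earlier cycles
        ready = [i for i in range(n_ops)
                 if i not in placed and preds[i] <= earlier]
        if free > 0:
            for op in ready:
                if cycles[-1] and op < cycles[-1][-1]:
                    continue                   # symmetry break: one order per op set
                cycles[-1].append(op); placed.add(op)
                search(cycles, placed)
                cycles[-1].pop(); placed.discard(op)
        if cycles[-1]:                         # close this cycle, open a new one
            cycles.append([])
            search(cycles, placed)
            cycles.pop()

    search([[]], set())
    return best["sched"]

# Five ops: 0 and 1 feed 2; 2 feeds 3 and 4; 2-wide machine.
print(bb_schedule(5, [(0, 2), (1, 2), (2, 3), (2, 4)], 2))
# -> [[0, 1], [2], [3, 4]]: three instruction words instead of five.
```

The pruning test is sound because the cycles already used, plus a perfect packing of the remaining operations into width-wide words, bounds any completion of the partial schedule from below; the paper's cost model and non-optimal heuristics go well beyond this toy.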