VLSI Algorithms and Architectures for Time-Recursive Discrete Sinuoidal Transforms with Applications to Real-Time Video Communications
Publication or External Link
In this dissertation, we address the problem of developing efficient VLSI algorithms and architectures for discrete sinusoidal transforms in real-time applications for video communication systems. The major difficulty of this problem is that the resulting architectures should compute a huge amount of data at very high speed for real-time video applications and match the requirement of VLSI architectures, regularity, modularity and locality. In traditional FFT based algorithms, the serial data is buffered and then transformed using the FFT scheme.
We propose a "time-recursive" approach to perform transforms that merge the buffering and transform operations into a single unit. The transformed data are updated according to a recursive formula, whenever a new datum arrives. Therefore the waiting time is completely eliminated. The unified lattice and IIR architectures for time-recursive transforms are proposed. The resulting architectures are regular, modular, and have only local interconnections and are better suited for VLSI implementations. There is no limitation on the transform size N and the number of multipliers required for computing the DCT by lattice and IIR structures are 6N - 8 and 2N - 2 respectively. In the case of dual generation of the DCT and DST by IIR structure, only 1.5N multipliers are required for each transform on average. The throughput of this scheme is one input sample per clock cycle.
We also apply the time-recursive approach to multidimensional separable transforms. The resulting d- dimensional structures are fully-pipelined and consist of only d 1-D transform arrays and shift registers for computing a d-D DXT. The delay time due to transpositions of the conventional d-D transforms is eliminated in our approach. It is shown that the architectures is optimal in the sense that the number of the multipliers used is minimum and both speed and area are asymptotically optimal.
The VLSI implementation of the lattice module based on the distributed arithmetic is also described. The chip can dually generate the DCT and DST simultaneously. It has been fabricated under 2 m double-metal CMOS technology and tested to be fully functional with a throughput rate 14.5-MHz and a data processing rate of 116Mb/s.