Institute for Systems Research
Permanent URI for this community: http://hdl.handle.net/1903/4375
Item Algorithm-Based Low-Power Digital Signal Processing System Designs (1995) Wu, A.Y.; Liu, K.J.R.; ISR
In most low-power VLSI designs, the supply voltage is reduced to lower the total power consumption. However, device speed degrades as the supply voltage goes down. To meet the low-power/high-throughput constraint, the key issue is to compensate for the increased delay so that the device can be operated at the slowest possible speed without affecting the system throughput rate. In this dissertation, new algorithmic-level techniques for compensating for the increased delays, based on the multirate approach, are proposed.
Given a digital signal processing (DSP) problem, we apply the multirate approach to reformulate the algorithm so that the desired outputs can be obtained from the decimated input sequences. Since the data rate in the resulting multirate architectures is M times slower than the original data rate while the throughput rate stays the same, the speed penalty caused by the low supply voltage is compensated at the algorithmic/architectural level.
This new low-power design technique is applied to several important DSP applications. The first one is a design methodology for the low-power design of FIR/IIR systems. By following the proposed design procedures, users can convert a speed-demanding system function into its equivalent multirate transfer function. This methodology provides a systematic way for VLSI designers to design low-power/high-speed filtering architectures at the algorithmic/architectural level.
The multirate approach is also applied to the low-power transform coding architecture design. The resulting time-recursive multirate transform architectures inherit all advantages of the existing time-recursive transform architectures such as local communication, regularity, modularity, and linear hardware complexity, but the speed for updating the transform coefficients becomes M-times slower.
The last application is a programmable video co-processor system architecture that is capable of performing FIR/IIR filtering, subband filtering, discrete orthogonal transforms (DT), and adaptive filtering for the host processor in video applications. The system can be easily reconfigured to perform multirate FIR/IIR/DT operations. Hence, we can either double the processing speed on the fly, using the same processing elements, or apply this feature to the low-power implementation of this co-processor.
The methodology and the applications presented in this dissertation constitute a design framework for achieving low-power consumption at the algorithmic/architectural level for DSP applications.
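The multirate idea in the abstract above can be illustrated with a standard polyphase (block) FIR decomposition. The following is a minimal sketch, not the dissertation's own reformulation: the function names `fir_direct` and `fir_multirate` are hypothetical, and the decomposition shown is the textbook one. It only demonstrates that the full-rate outputs can be recovered from M decimated input streams, so each sub-filter sees data at 1/M of the original rate.

```python
def fir_direct(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_multirate(x, h, M=2):
    """Same outputs, computed from M decimated input streams (polyphase form).

    Illustrative assumption: len(x) is a multiple of M. Each sub-filter
    h_ph[p] only ever touches a stream x_ph[q] that runs at 1/M of the
    original sample rate.
    """
    assert len(x) % M == 0
    x_ph = [x[q::M] for q in range(M)]   # decimated inputs: x_ph[q][m] = x[M*m + q]
    h_ph = [h[p::M] for p in range(M)]   # sub-filters:      h_ph[p][j] = h[M*j + p]
    y = [0.0] * len(x)
    for m in range(len(x) // M):         # one block of M outputs per slow clock tick
        for r in range(M):               # output phase inside the block
            acc = 0.0
            for p in range(M):
                q = (r - p) % M          # which decimated stream is needed
                s = 0 if r >= p else 1   # one extra block of delay when r < p
                for j, coeff in enumerate(h_ph[p]):
                    i = m - j - s
                    if i >= 0:
                        acc += coeff * x_ph[q][i]
            y[M * m + r] = acc
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    h = [1.0, -2.0, 3.0, 0.5, 2.0]
    diff = max(abs(a - b) for a, b in zip(fir_direct(x, h), fir_multirate(x, h, M=2)))
    print("max difference between direct and multirate outputs:", diff)
```

This sketch does not reduce the number of multiplications; it only shows the rate reduction per processing branch, which is the property the abstract relies on to tolerate slower, lower-voltage hardware.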
Item A Wavefront Array for URV Decomposition Updating (1995) Raghupathy, A.; Koc, Ut-Va; Liu, K.J. Ray; ISR
The rank-revealing URV decomposition is an effective tool in many signal processing applications that require the computation of the noise subspace of a matrix. In this paper, we consider a parallel architecture for updating the URV decomposition on a wavefront array. The wavefront array provides an efficient real-time mechanism for adaptive computation of the null space of a matrix as well as for handling rank changes during updating.
Item Built-In Self-Test and Fault Diagnosis for Analog Circuits in the Frequency Domain (1995) Chao, C-Y.; Milor, Linda; ISR
Due to the increasing complexity of analog circuits, finding out whether an analog circuit meets the required specification is a difficult and time-consuming task. We propose a complete Built-In Self-Test and Fault-Diagnosis circuit based on the frequency domain specifications for analog circuits in order to reduce the time and effort for testing. This built-in circuit supports automated yes-or-no testing when used in production test. Furthermore, the ability to access internal blocks inside an analog circuit provides a fault diagnosis capability when an engineer wants to find the cause of faulty circuits. Coupled with extra multiplexers, this circuit can be used to detect faults in the analog parts of a mixed-signal circuit.
Item ASIC Design of Bit-Serial and Bit-Parallel Discrete Cosine Transform Processors (1994) Karunakaran, Vignarajah; Liu, K.J.R.; ISR
Designs of the bit-serial and bit-parallel versions of the Discrete Cosine Transform Processor using the universal IIR filter module are presented, with emphasis on the bit-serial design. A bit-serial cell mini-library was created. The designs were performed with the AlliedSignal Aerospace Microelectronics Center's 1.2 micron double metal p-well CMOS standard cell library.
The core of the bit-serial design is the 18-bit data x 8-bit coefficient bit-serial multiplier, whose design is also presented in detail; the multiplier is capable of handling negative data and negative coefficients, and has an accuracy of O(2^-16). The 8-point 18-bit bit-serial DCT has a maximum clock speed of 139.0 MHz and 55.6 MHz under best and worst case conditions respectively. Two bit-parallel design implementations are presented, one with straight bit-parallel multiplier cells and the other with ROM multipliers using distributed arithmetic. The bit-parallel designs are also 8-point, but have an 8-bit wide input and a 12-bit wide output, thereby calculating with much less precision. The parallel multiplier chip's maximum speed under best and worst case conditions is 28.4 MHz and 11.4 MHz respectively, whereas the ROM multiplier chip's is 36.3 MHz and 14.5 MHz respectively. All three designs have a throughput of one clock cycle, with respect to their data input rates. The latencies for the bit-serial and bit-parallel designs are 38 and 5 cycles respectively.
Item Systolic Architectures for Signal Compression and Discrimination (1994) Yu, S-S.; JaJa, J.F.; ISR
In this dissertation we propose systolic architectures for several classes of signal processing computations, including schemes based on vector quantization and higher order crossings techniques. The systolic concept is adapted to design architectures that are simple and regular, and that achieve high concurrency, local communication, and high throughput. Our tree-structured vector quantization (TSVQ) architecture is composed of a linear array of processors, each processor performing the computations required at one level of the binary tree. Encoding is performed in a pipelined fashion, with each processor contributing a portion of the path decision through the tree until the final processor is reached to get the complete index.
The predictive TSVQ (PTSVQ) architecture for real-time video coding applications uses pipelined arithmetic components to speed up the computation and to provide for regularity in design. This high-throughput architecture is suitable for implementing a fully pipelined real-time PTSVQ system. Data and control in both architectures flow in a pipelined fashion, and no global control signals are needed. We also present a class of architectures for performing signal discrimination and classification based on higher order crossing (HOC) methods. Finally, we present a detailed design of a prototype HOC PCB system using off-the-shelf components that can be used for non-destructive testing.
Item VLSI Design of High-Speed Time-Recursive 2-D DCT/IDCT Processor for Video Applications (1994) Srinivasan, V.; Liu, K.J. Ray; ISR
In this paper we present a full-custom VLSI design of a high-speed 2-D DCT/IDCT processor based on the new class of time-recursive algorithms and architectures, which has not previously been implemented to prove its performance. We show that the VLSI implementation of this class of DCT/IDCT algorithms can easily meet the high-speed requirements of HDTV due to its modularity, regularity, local connectivity, and scalability. Our design of the 8 x 8 DCT/IDCT can operate at 50 MHz with a 400 Mbps throughput, based on a very conservative estimate under 1.2 µm CMOS technology. In comparison to the existing designs, our approach offers many advantages that can be further explored for even higher performance.
Item Split Recursive Least Squares: Algorithms, Architectures, and Applications (1994) Wu, A-Y.; Liu, K.J. Ray; ISR
In this paper, a new computationally efficient algorithm for recursive least-squares (RLS) filtering is presented. The proposed Split RLS algorithm can perform the approximated RLS with O(N) complexity for signals having no special data structure to be exploited, while avoiding the high computational complexity (O(N^2)) required in the conventional RLS algorithms.
Our performance analysis shows that the estimation bias will be small when the input data are less correlated. We also show that for highly correlated data, the orthogonal preprocessing scheme can be used to improve the performance of the Split RLS. Furthermore, the systolic implementation of our algorithm based on the QR-decomposition RLS (QRD-RLS) arrays, as well as its application to multidimensional adaptive filtering, is discussed. The hardware complexity for the resulting array is only O(N) and the system latency can be reduced to O(log_2 N). The simulation results show that the Split RLS outperforms the conventional RLS in the application of image restoration. A major advantage of the Split RLS is its superior tracking capability over the conventional RLS under non-stationary environments.
Item Full Custom VLSI Implementation of Time-Recursive 2-D DCT/IDCT Chip (1993) Srinivasan, V.; Liu, K.J.R.; ISR
Discrete Cosine Transform (DCT) based compression techniques play an important role in today's digital applications, such as high definition television (HDTV) and teleconferencing, which require high-speed transmission of digital video signals. In this thesis, a high-performance VLSI implementation of a DSP chip which computes the two-dimensional discrete cosine transform and its inverse (2-D DCT/IDCT) is presented. The chip is based on the fully pipelined time-recursive IIR structure and employs a highly modular and hierarchical design strategy. Architectural model simulations are performed to determine the system parameters required to achieve a high-speed and high-performance implementation. Based on these simulations, ROM and internal bus precision are chosen to ensure a minimum PSNR of 40 dB, which is required for most digital imaging applications. High-speed design is obtained by using distributed arithmetic to achieve fast multiplication through table lookups.
A two-phase nonoverlapping clock is employed to perform computations in both phases, resulting in twice the throughput. Various submodules such as ROM lookup tables, adders, half-latches, delay units, and multiplexors are implemented. Timing simulations of critical path modules indicate a clock frequency of 50 MHz, corresponding to a data rate of 400 Mb/s. The chip dimensions are 24550 λ x 27094 λ and its area is 240 mm^2. The chip has been submitted for fabrication in 1.2 µm CMOS N-well double-metal single-poly technology.
Item An Architectural Framework for VLSI Time-Recursive Computation with Applications (1993) Frantzeskakis, Emmanuel N.; Baras, J.; ISR
The time-recursive computation model has proven to be a particularly useful tool in audio, video, radar, and sonar real-time data processing architectures. Unlike the FFT-based architectures, the time-recursive ones require only local communication, imply linear implementation cost, and operate in a single-input multiple-output (SIMO) manner. This is appropriate for the above applications since the data are supplied serially. Also, the time-recursive architectures are modular and regular and allow a high degree of parallelism; thus they are very appropriate for VLSI implementation.
In this dissertation, we establish an architectural framework for parallel time-recursive computation. We consider a class of linear operators (or signal transformers) that are characterized by discrete-time, time-invariant, compactly supported, but otherwise arbitrary kernel functions. We specify the properties of linear operators that can be implemented efficiently in a time-recursive way. Based on these properties, we develop a systematic routine that produces a time-recursive architectural implementation for a given operator.
We demonstrate the use and effectiveness of this routine by means of specific examples, namely the Discrete Cosine Transform (DCT), the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT).
By using this architectural framework we obtain novel architectures for the uniform-DFT QMF bank, the cosine modulated QMF bank, the 1-D and 2-D Modulated Lapped Transform (MLT), as well as an Extended Lapped Transform (ELT). Furthermore, the architectural implementation of the Cepstral Transform and a Short Time Fourier Transform are considered based on the time-recursive architecture of the DFT. All of the above designs are modular, regular, with local communication and linear cost in operator counts. In particular, the 1-D MLT requires 1N + 3 adders and N - 1 rotation circuits, where N denotes the data block size. The 2-D MLT requires 3 1-D MLT circuits and no matrix transposition. The ELT has basis length equal to 4N and it requires 3N + 4 multipliers, 4N + 4 adders and N + 2 rotation circuits. These results are expected to have a significant impact on real-time audio and video data compression, in frequency domain adaptive filtering and in spectrum analysis.
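To make the time-recursive idea concrete, the sketch below uses the well-known sliding DFT recurrence, in which the transform of the next length-N window is obtained from the current one with one subtraction, one addition, and one rotation per coefficient. This is an illustrative assumption about what "time-recursive" looks like for the DFT, not the dissertation's architecture; the names `dft` and `sliding_dft_update` are hypothetical.

```python
import cmath


def dft(block):
    """Reference DFT computed from scratch: X[k] = sum_n block[n] * exp(-2j*pi*k*n/N)."""
    N = len(block)
    return [sum(block[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def sliding_dft_update(X, oldest, newest):
    """Time-recursive update of all N coefficients when the window slides by one sample.

    X      : DFT of the current window x[t], ..., x[t+N-1]
    oldest : x[t], the sample leaving the window
    newest : x[t+N], the sample entering the window
    Each coefficient needs only its own previous value (local communication)
    and O(1) work, so the total cost per new sample is linear in N.
    """
    N = len(X)
    return [(X[k] - oldest + newest) * cmath.exp(2j * cmath.pi * k / N)
            for k in range(N)]


if __name__ == "__main__":
    signal = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5, 0.0, 4.0]
    N = 4
    X = dft(signal[:N])                      # initialise once
    for t in range(len(signal) - N):
        X = sliding_dft_update(X, signal[t], signal[t + N])
        ref = dft(signal[t + 1:t + 1 + N])   # recompute from scratch for comparison
        err = max(abs(a - b) for a, b in zip(X, ref))
        print("window starting at", t + 1, "max error:", err)
```

The input arrives one sample at a time and all N outputs are refreshed on each step, which matches the serial-input, multiple-output behaviour described in the abstract.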
Item A Class of Square Root and Division Free Algorithms and Architectures for QRD-Based Adaptive Signal Processing (1993) Frantzeskakis, Emmanuel N.; Liu, K.J. Ray; ISR
The least squares (LS) minimization problem constitutes the core of many real-time signal processing problems, such as adaptive filtering, system identification, and adaptive beamforming. Recently, efficient implementations of the recursive least squares (RLS) algorithm and the constrained recursive least squares (CRLS) algorithm based on the numerically stable QR decomposition (QRD) have been of great interest. Several papers have proposed modifications to the rotation algorithm that circumvent the square root operations and minimize the number of divisions involved in the Givens rotation. It has also been shown that all the known square root free algorithms are instances of one parametric algorithm. Recently, a square root free and division free algorithm has been proposed [4].
In this paper, we propose a family of square root and division free algorithms and examine its relationship with the square root free parametric family. We choose a specific instance for each of the two parametric algorithms and make a comparative study of the systolic structures based on these two instances, as well as the standard Givens rotation. We consider the architectures for both the optimal residual computation and the optimal weight vector extraction.
The dynamic range of the newly proposed algorithm for QRD-RLS optimal residual computation and the wordlength lower bounds that guarantee no overflow are presented. The numerical stability of the algorithm is also considered. A number of obscure points relevant to the realization of the QRD-RLS and QRD-CRLS algorithms are clarified. Some systolic structures that are described in this paper are very promising, since they require less computational complexity (in various aspects) than the structures known to date and they make the VLSI implementation easier.
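For orientation, the sketch below shows the standard Givens-rotation update of a triangular factor, which the abstract names as the baseline for comparison. It is not the square root and division free family proposed in the paper: this version uses one square root (inside `math.hypot`) and two divisions per rotation, which is exactly the cost those algorithms aim to remove. The function name `givens_update` and the `forget` parameter are hypothetical choices for illustration.

```python
import math


def givens_update(R, x, forget=1.0):
    """Fold a new data row x into an upper-triangular factor R using Givens rotations.

    R      : n x n upper-triangular matrix as a list of lists
    x      : new data row of length n
    forget : scale applied to R before the update (1.0 means no forgetting)
    Returns R_new with R_new^T R_new = forget^2 * R^T R + x x^T.
    """
    n = len(R)
    R = [[forget * v for v in row] for row in R]   # work on a scaled copy
    x = list(x)
    for i in range(n):
        a, b = R[i][i], x[i]
        r = math.hypot(a, b)        # the square root the paper's algorithms avoid
        if r == 0.0:
            continue                # nothing to annihilate in this column
        c, s = a / r, b / r         # the divisions the paper's algorithms avoid
        for j in range(i, n):       # rotate row i of R against the data row
            rij, xj = R[i][j], x[j]
            R[i][j] = c * rij + s * xj
            x[j] = -s * rij + c * xj
    return R


if __name__ == "__main__":
    R = [[2.0, 1.0], [0.0, 3.0]]
    for row in givens_update(R, [1.0, 2.0]):
        print(row)
```

In a systolic realization each `(c, s)` pair would be generated in a boundary cell and passed along a row of internal cells; the loop over `j` plays that role here.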