Real Time Audio-Video Coding Using TMS320C80 (Completed 2001)

Eren Gurses, Serdar Ince, Anil Aksay

Real-time, high-quality video transmission over limited-bandwidth communication lines requires extensive computation for compression. Motion estimation is one of the blocks that both affects the quality and determines the computational complexity of the compression algorithm. Since the motion estimation method is not specified by the standards used for very-low-bit-rate video transmission (H.263), improvements to it can have a large effect on both of these problems.

In this work, we describe both the implementation of a low-cost motion estimation algorithm on a multi-processor DSP architecture and the general partitioning of the H.263 video codec on the same architecture. To decrease the bit rate and computational complexity while maintaining quality, we introduced both programming techniques and new strategies that exploit the parallel architecture of the overall system.

The overall algorithm is implemented on the Precision MX Video Engine, which includes the TMS320C80 (‘C80). We chose the ‘C80 MVP (Multimedia Video Processor) for its high computational power. The chip contains four advanced DSP parallel processors (PPs) with 64-bit instructions and 32-bit fixed-point data, and a RISC master processor (MP) with an integral IEEE-754 floating-point unit. The PPs can achieve up to four parallel operations per cycle with 32-bit operations. Other features are a transfer controller (TC) for mass data transfer and a video controller (VC) for image capture and display. The on-chip SRAM is shared among the processors, and a high-speed network called the "crossbar" provides the interface between these RAMs and the processors.
 

The implementation of the H.263 algorithm consists of video capture, DCT, quantization, motion estimation, VLC, and the decision steps. To utilize the power of the multi-processor DSP architecture, these steps must be partitioned efficiently among the processors. In our implementation we used one PP for DCT/quantization/VLC and one PP for motion estimation. The remaining PPs are planned to be used for decoding and the speech codec. The MP handles control between the processors and video capture/display.

The first step in the codec is to capture the video signal coming from the camera and make the frames ready for the PPs to process. The video frames are first captured to the DRAM of the ‘C80 in the required format. Once a frame/block is captured, the MP sends a message to PP0 to start its execution. The task of PP0 is to transfer the data from DRAM to on-chip memory and perform the motion estimation. After receiving the start message from the MP, PP0 waits for a message from PP1 confirming that PP1 is no longer using the frame/block in on-chip memory. When PP0 receives this message, it copies the frame, processes it, and sends a message to PP1 to allow use of the new input.

For motion estimation, an algorithm based on block matching is implemented on PP0. In the block matching technique, a block in the frame at time t - Δt is taken to be tracked in future frames. Motion field estimation consists of finding the coordinates of the block in the frame at time t that best corresponds to this reference image segment. The correspondence is measured with the Sum of Absolute Differences (SAD), where a lower SAD indicates a closer match. The efficiency of the motion estimation algorithm depends entirely on the speed of the SAD computation and of the search algorithm. The simplest but most computationally intensive search method is exhaustive search. Exhaustive search (sometimes called full search) evaluates the SAD at every location in the search area. This results in a complexity of O(n²).

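As an illustration of block matching with exhaustive search, the following is a minimal Python sketch, not the ‘C80 implementation. The function names `sad` and `full_search`, the frame layout (lists of rows of 8-bit pixel values), and the parameters `search_range` and `n` are assumptions made for this example only.

```python
def sad(block_frame, search_frame, bx, by, dx, dy, n=16):
    """SAD between the n x n block at (bx, by) in block_frame and the
    block displaced by (dx, dy) in search_frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            a = block_frame[by + y][bx + x]
            b = search_frame[by + dy + y][bx + dx + x]
            total += abs(a - b)
    return total


def full_search(block_frame, search_frame, bx, by, search_range, n=16):
    """Return (best_sad, dx, dy) after testing every displacement
    in the square window [-search_range, search_range]."""
    height, width = len(search_frame), len(search_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that would fall outside the searched frame.
            if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
                continue
            cost = sad(block_frame, search_frame, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

With a search range of 7, this sketch evaluates up to 15 x 15 = 225 candidate positions per block, each costing n x n absolute differences, which is why the search strategy and the SAD kernel dominate the run time.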
Instead of evaluating the SAD at every location, the proposed implementation reduces the search cost by evaluating the SAD only at predefined locations chosen according to the size of the search window, giving the highest priority to the motion vector (0,0). The results show that although the accuracy of the motion estimation (and with it the SNR of the reconstructed frame) changes slightly, the search cost drops to approximately O(log n).

Another improvement comes from the implementation of the SAD itself. The SAD can be implemented straightforwardly on an 8-bit architecture without parallel instructions, but such an implementation is very slow. The proposed implementation makes use of the ‘C80 architecture, i.e. 32-bit processing with the 8-bit split ALU facility and up to four parallel instructions per cycle using the 64-bit instruction opcode. This results in 282 cycles per 16x16 block using one PP, which is approximately 4.5 times faster than the straightforward approach.
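The two ideas above can be sketched in Python under stated assumptions. The exact search pattern of the implementation is not given here, so `log_search` uses a generic logarithmic step search (test eight neighbours around the current best, then halve the step) in which ties never displace an earlier candidate, so (0,0) keeps priority. `sad_word` only imitates the effect of an 8-bit split ALU by handling four packed pixels per 32-bit word with masking; it is not ‘C80 code. All function names are hypothetical.

```python
def pack4(p):
    """Pack four 8-bit pixels into one 32-bit word (first pixel in the low byte)."""
    return p[0] | (p[1] << 8) | (p[2] << 16) | (p[3] << 24)


def _absdiff_lanes(a, b):
    """|a - b| for two 8-bit values held in the two 16-bit lanes of a and b."""
    bias = 0x01000100                        # guard bit above each 8-bit value
    d_ab = (a | bias) - b                    # lane = 0x100 + (a - b), no borrow
    d_ba = (b | bias) - a                    # lane = 0x100 + (b - a), no borrow
    ge = ((d_ab >> 8) & 0x00010001) * 0xFF   # 0xFF in lanes where a >= b
    return (d_ab & ge) | (d_ba & ~ge & 0x00FF00FF)


def sad_word(wa, wb):
    """SAD of the four pixel pairs packed in two 32-bit words."""
    m = 0x00FF00FF
    even = _absdiff_lanes(wa & m, wb & m)                # bytes 0 and 2
    odd = _absdiff_lanes((wa >> 8) & m, (wb >> 8) & m)   # bytes 1 and 3
    s = even + odd                  # two 16-bit partial sums, at most 510 each
    return (s & 0xFFFF) + (s >> 16)


def sad_block(block_frame, search_frame, bx, by, dx, dy, n=16):
    """Block SAD computed four pixels at a time (n must be a multiple of 4)."""
    total = 0
    for y in range(n):
        c_row, r_row = block_frame[by + y], search_frame[by + dy + y]
        for x in range(0, n, 4):
            c = pack4(c_row[bx + x:bx + x + 4])
            r = pack4(r_row[bx + dx + x:bx + dx + x + 4])
            total += sad_word(c, r)
    return total


def log_search(block_frame, search_frame, bx, by, search_range, n=16):
    """Return (best_sad, dx, dy, evaluations) using a halving step size."""
    height, width = len(search_frame), len(search_frame[0])

    def cost(dx, dy):
        if abs(dx) > search_range or abs(dy) > search_range:
            return None
        if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
            return None
        return sad_block(block_frame, search_frame, bx, by, dx, dy, n)

    best_dx, best_dy = 0, 0
    best = cost(0, 0)               # the zero vector is always tested first
    evaluated = 1
    step = max(1, (search_range + 1) // 2)
    while True:
        cx, cy = best_dx, best_dy   # centre stays fixed during one pass
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if ox == 0 and oy == 0:
                    continue
                c = cost(cx + ox, cy + oy)
                if c is None:
                    continue
                evaluated += 1
                if c < best:        # strict: ties keep the earlier candidate
                    best, best_dx, best_dy = c, cx + ox, cy + oy
        if step == 1:
            break
        step //= 2
    return best, best_dx, best_dy, evaluated
```

For a search range of 7 this sketch evaluates at most 1 + 3 x 8 = 25 positions instead of the 225 visited by a full search, at the risk of settling on a local minimum when the error surface is not smooth.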