Real-Time Audio-Video Coding Using TMS320C80 (Completed 2001)
Eren Gurses, Serdar Ince, Anil Aksay
Real-time, high-quality video transmission over limited-bandwidth communication lines requires extensive computation for compression. Motion estimation is one of the blocks that both affects the quality and determines the computational complexity of the compression algorithm. Since the motion estimation method is not specified by the standards used for very-low-bit-rate video transmission (H.263), improvements to it may have a large effect on both quality and complexity.
In this work, we describe
both the implementation of a low-cost motion estimation algorithm on a
multi-processor DSP architecture and the general partitioning of the H.263
video codec on the same architecture. To decrease the bit rate
and computational complexity while maintaining quality, we introduced both
programming techniques and new strategies that exploit the parallel
architecture of the overall system. The overall algorithm is implemented
on the Precision MX Video Engine, which includes the TMS320C80 (‘C80).
The ‘C80 MVP (Multimedia Video Processor) was chosen for its high
computation power. The chip contains four parallel processing
advanced DSPs (PPs) with 64-bit instructions and 32-bit fixed-point data
and a RISC master processor (MP) with integral IEEE-754 floating-point
unit. The PPs are capable of achieving up to four parallel operations per
cycle with 32-bit operations. Other features are a transfer controller
(TC) for mass data transfer and a video controller (VC) for image capture
and display. The on-chip SRAM is shared among the processors, and a
high-speed network called the "crossbar" provides the interface between
these RAMs and the processors.
The implementation of the H.263
algorithm consists of the video capture, DCT, quantization, motion estimation,
VLC, and decision steps. To utilize the power of the multi-processor
DSP architecture, these steps should be partitioned efficiently between the processors.
In our implementation we used one PP for DCT/quantization/VLC and one PP
for motion estimation. The remaining PPs are planned to be used for decoding
and the speech codec. The MP takes care of the control between processors and
the video capture/display. The first step in the codec is to capture the
video signal coming from the camera and make the frames ready for the PPs to
process. The video frames are first captured to the DRAM of the ‘C80 in the required
format. Once a frame/block is captured, the MP sends a message to PP0 to
start its execution. The task of PP0 is to transfer the data from DRAM
to on-chip memory and perform the motion estimation. After receiving the start
message from the MP, PP0 waits for a message from PP1 confirming
that PP1 is no longer using the frame/block in on-chip memory.
When PP0 receives the message from PP1, it copies the frame, processes
it, and sends a message to PP1 to allow the usage of the new input.
For the motion estimation, an algorithm based on block matching is implemented
on PP0. In the block matching technique, a block is taken from the frame at
time t-Δt to be tracked in future frames. Motion field estimation consists
of finding the coordinates of this block in the frame at time t, i.e. the position
that minimizes the mismatch with the reference image segment. This mismatch
is calculated with the Sum of Absolute Differences (SAD). The efficiency
of the motion estimation algorithm depends entirely on the speed of the SAD
computation and of the search algorithm. The simplest, but most computationally intensive,
search method is exhaustive search. Exhaustive search (sometimes called
full search) evaluates the SAD at every location in the search area. This results
in a complexity of O(n×n) for an n×n search area. Instead of evaluating the SAD at every location,
in the proposed implementation we decrease the search cost by evaluating
the SAD only at predefined locations chosen according to the size of the search window, giving
the highest priority to the motion vector (0,0). The results show
that although the accuracy of the motion estimation (and hence
the SNR of the reconstructed frame) changes slightly, the search cost drops
to approximately O(log n). Another improvement we introduced comes from the implementation
of SAD. The SAD algorithm can be implemented straightforwardly on an 8-bit architecture
without parallel instructions, but it will be very slow. In
the proposed implementation, we make use of the ‘C80 architecture, i.e.
32-bit processing with the 8-bit split-ALU facility and up to four parallel
instructions per cycle using a 64-bit instruction opcode. This resulted
in 282 cycles per 16x16 block using one PP, which is approximately 4.5 times
faster than the straightforward implementation.
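
The difference between exhaustive search and a reduced search over predefined locations can be illustrated with a short sketch. The Python code below is not the ‘C80 implementation. The exact set of predefined locations is not given here, so the sketch assumes a generic step-halving pattern (similar in spirit to the well-known three-step search) as a stand-in. The function names (sad, full_search, log_step_search), the ±7 search range and the synthetic frames are all hypothetical and chosen only for illustration.

```python
import random


def sad(cur, ref, bx, by, dx, dy, n=16):
    """SAD between the n-by-n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for y in range(n):
        cur_row = cur[by + y]
        ref_row = ref[by + dy + y]
        for x in range(n):
            total += abs(cur_row[bx + x] - ref_row[bx + dx + x])
    return total


def candidate_ok(ref, bx, by, dx, dy, r, n):
    """True if (dx, dy) lies in the +/-r window and the block stays in the frame."""
    if max(abs(dx), abs(dy)) > r:
        return False
    return (0 <= bx + dx and bx + dx + n <= len(ref[0])
            and 0 <= by + dy and by + dy + n <= len(ref))


def full_search(cur, ref, bx, by, r=7, n=16):
    """Exhaustive search: evaluate the SAD at every position of the window."""
    best_cost, best_mv, evals = sad(cur, ref, bx, by, 0, 0, n), (0, 0), 1
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if (dx, dy) == (0, 0) or not candidate_ok(ref, bx, by, dx, dy, r, n):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evals += 1
            if cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost, evals


def log_step_search(cur, ref, bx, by, r=7, n=16):
    """Step-halving search (assumed pattern). (0, 0) is evaluated first and a
    candidate replaces the current best only if its SAD is strictly lower,
    so the zero vector wins all ties."""
    best_cost, best_mv, evals = sad(cur, ref, bx, by, 0, 0, n), (0, 0), 1
    step = max(1, (r + 1) // 2)
    while step >= 1:
        cx, cy = best_mv  # centre of this round's 3x3 pattern
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                dx, dy = cx + ox, cy + oy
                if (ox, oy) == (0, 0) or not candidate_ok(ref, bx, by, dx, dy, r, n):
                    continue
                cost = sad(cur, ref, bx, by, dx, dy, n)
                evals += 1
                if cost < best_cost:
                    best_cost, best_mv = cost, (dx, dy)
        step //= 2
    return best_mv, best_cost, evals


# Synthetic frames (illustration only): `cur` is `ref` shifted so that the
# true motion vector of an interior block is (dx, dy) = (-2, 3).
random.seed(0)
H = W = 48
ref = [[random.randrange(256) for _ in range(W)] for _ in range(H)]
cur = [[ref[(y + 3) % H][(x - 2) % W] for x in range(W)] for y in range(H)]

full_mv, full_cost, full_evals = full_search(cur, ref, 16, 16)
fast_mv, fast_cost, fast_evals = log_step_search(cur, ref, 16, 16)
print("full search:", full_mv, full_cost, full_evals)
print("fast search:", fast_mv, fast_cost, fast_evals)
```

With a ±7 window the exhaustive search evaluates the SAD at 225 positions, while the step-halving pattern needs at most 25 (one evaluation at (0,0) plus eight per round over three rounds). On noise-like content such as these synthetic frames the reduced search may settle on a vector with a higher SAD than the exhaustive optimum, which is the accuracy trade-off mentioned above.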