BDTI H.264 Decoder Benchmark Certified Results: Texas Instruments TMS320DM6446 DaVinci SoC

Overview of the Texas Instruments DM6446 DaVinci SoC

The Texas Instruments DM6446 is a system-on-chip in the DaVinci processor family. It consists of the following processing engines:

  • An ARM9 general-purpose RISC processor operating at up to 297 MHz.
  • A C64x+ digital signal processor (DSP) operating at up to 594 MHz.
  • A video hardware acceleration engine, known as the Video and Image Coprocessor (VICP), which is used to offload certain video and imaging processing tasks from the DSP. This processing unit is not utilized in the DM6446 H.264 decoder implementation certified by BDTI. However, the VICP could be utilized to offload the in-loop deblocking filter from the C64x+ DSP and reduce the processor utilization reported below.

In addition to numerous peripherals, the DM6446 also includes a Video Processing Subsystem that consists of a Video Processing Front End supporting video capture and a Video Processing Back End supporting video playback. The Video Processing Front End includes a CCD controller, a preview engine, a histogram module, an auto-exposure/white balance/focus module, and a resizer. The Video Processing Back End includes an on-screen display engine supporting alpha blending, and four DACs providing a means for composite NTSC/PAL video, S-video, and component video output.

BDTI used the TI Digital Video Software Development Kit (DVSDK) for assessing the performance of the DM6446 SoC. The DVSDK is a software package including:

  • A C64x+ DSP based H.264 video codec software implementation (as well as other video and audio codecs).
  • An ARM9 software application that invokes the H.264 codec, delivers/receives data to/from the DSP, and interfaces to the hard disk drive, camera, and LCD monitor.
  • The PC-based TI DM644x SoC Analyzer, which was used to make the performance measurements presented in this report.

The ARM and DSP software were run on the TI Digital Video Evaluation Module (DVEVM) hardware platform; for a BDTI analysis and overall description of the DVEVM, please see BDTI's white paper, "An Independent Analysis of the Texas Instruments Digital Video Evaluation Module."

BDTI H.264 Decoder Benchmark Certified Results

H.264 decoder solution performance is reported as the minimum processing engine clock rate and percent utilization required to decode BDTI’s Primary Operating Point (D1 resolution, 30 fps, 1.5 Mbps) H.264 bitstream in real time. External memory is required for the H.264 decoder implementation on the DM6446, and the DVEVM hardware platform utilizes a Micron MT47H64M16BT DDR2 SDRAM (128 megabytes) external memory device operating at 162 MHz. Since external memory performance can affect the utilization requirements for the DM6446, the characteristics of the external memory device should be considered if a different device is to be used in a particular design.

The DM6446 H.264 decoder processing requirements vary on a frame-by-frame basis over the test clip depending on the video content; therefore, another factor that affects the processing engine performance is the number of output “delay buffers” used. In recognition of this, BDTI has chosen to present the minimum clock rate and percent utilization required by the DM6446 for real-time operation at a number of output delay buffer sizes (see Figures 2 and 3). In the figures, “0 buffers” (i.e., no buffering of output frames) indicates the processing engine clock rate or percent utilization required to process the single most processing-intensive frame in the video clip in real time (i.e., within 1/30th of a second). Adding delay buffers (each of which holds one decoded frame) smooths the processing load across multiple frames and reduces the required clock rate. For the TI DM6446, using three buffers results in a minimum required clock rate essentially equal to the minimum clock rate achievable (i.e., the rate set by the average per-frame processing over the entire video clip). The 3-buffer case is the typical output buffering used in real-world applications; the 0-buffer case is not typical and would only be used in extremely delay-sensitive applications. A more detailed description of delay buffers, along with additional information about performance metrics for the Solution Benchmark for H.264 Decoders, is available from BDTI.
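The smoothing effect of delay buffers can be illustrated with a small model. The sketch below is not BDTI's measurement method; it assumes that frame i arrives at time i/30 s, that frames are decoded in order, and that with k delay buffers frame i must be finished by (i + 1 + k)/30 s. It also assumes the clock rate never drops below the clip-average rate, which matches the "theoretical minimum" described above. The function name and the per-frame cycle counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def min_clock_with_buffers_mhz(cycles_per_frame, delay_buffers, fps=30.0):
    """Lowest clock rate (MHz) that meets every frame deadline.

    Assumed model: a run of consecutive frames start..end contains
    (end - start + 1) frames and may use that many frame periods
    plus `delay_buffers` extra periods of slack.
    """
    n = len(cycles_per_frame)
    frame_period_s = 1.0 / fps

    # Floor: the clip-average rate (sustained real-time decoding of a
    # continuous stream cannot go below this).
    required_hz = sum(cycles_per_frame) / (n * frame_period_s)

    for start in range(n):
        run_cycles = 0
        for end in range(start, n):
            run_cycles += cycles_per_frame[end]
            periods_available = (end - start + 1) + delay_buffers
            rate_hz = run_cycles / (periods_available * frame_period_s)
            required_hz = max(required_hz, rate_hz)

    return required_hz / 1e6


# Hypothetical per-frame cycle counts (NOT taken from the BDTI clip).
example_cycles = [8_000_000, 8_000_000, 14_000_000, 14_000_000, 14_000_000,
                  8_000_000, 8_000_000, 8_000_000, 8_000_000, 10_000_000]

for k in range(4):
    mhz = min_clock_with_buffers_mhz(example_cycles, k)
    print(f"{k} buffers: {mhz:.1f} MHz")
```

With zero buffers the result is set by the single most expensive frame; as buffers are added, the requirement falls toward the clip average, which is the same qualitative behavior the figures report for the DM6446.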

The time required for the DM6446 to complete H.264 decoder processing on the worst-case frame in the BDTI test clip is 23.13 msec. This is 69.39 percent of the 33.33 msec (1/30 of a second) available per frame. The average time required for the DM6446 to completely process each of the frames in the clip is 17.15 msec, or 51.44 percent of the 33.33 msec available per frame. The DM6446 consists of multiple processing engines, and utilization is not uniform across them. As shown in Figure 1, the C64x+ DSP accounts for the vast majority of the processing, and the ARM adds a small amount of pre-processing and post-processing. The “inactive” segment indicates the time when no DM6446 processing engines are being utilized for the H.264 decoder, and therefore both DM6446 processing engines (i.e., the ARM9 and the C64x+ DSP) are available for other activities. As can be seen in Figure 1, the minimum inactive time for the DM6446 is 30.61 percent of a frame for the single worst-case frame in the BDTI H.264 test clip, and the average inactive time is 48.56 percent of each frame over the entire test clip.
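The percentages above follow from dividing the per-frame decode time by the frame period at 30 fps. A minimal sketch of that arithmetic follows; the function names are illustrative, and because the reported times are rounded to two decimals, the recomputed average may differ from the published figure in the last digit.

```python
FRAME_RATE_FPS = 30
FRAME_PERIOD_MS = 1000.0 / FRAME_RATE_FPS  # about 33.33 msec per frame


def utilization_percent(decode_time_ms, frame_period_ms=FRAME_PERIOD_MS):
    """Share of the frame period spent decoding, in percent."""
    return 100.0 * decode_time_ms / frame_period_ms


def inactive_percent(decode_time_ms, frame_period_ms=FRAME_PERIOD_MS):
    """Share of the frame period left over for other work, in percent."""
    return 100.0 - utilization_percent(decode_time_ms, frame_period_ms)


# Decode times reported in the text (msec).
worst_case_ms = 23.13
average_ms = 17.15

print(f"worst case: {utilization_percent(worst_case_ms):.2f}% busy, "
      f"{inactive_percent(worst_case_ms):.2f}% inactive")
print(f"average:    {utilization_percent(average_ms):.2f}% busy, "
      f"{inactive_percent(average_ms):.2f}% inactive")
```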

FIGURE 1: DM6446 Minimum Inactive Time.

The pre- and post-processing work done by the ARM includes delivering the input data to the DSP, retrieving the output data from the DSP, and making the inter-processor call to the DSP to invoke the decoder. The total processor utilization for these activities is small, requiring only approximately 3 to 4 percent utilization of the ARM operating at 297 MHz. Furthermore, the ARM utilization does not vary significantly from frame to frame, because it does not depend on the content of the data to be decoded. Thus, our results focus on the C64x+ DSP processor utilization, since it dominates the processing requirements for H.264 decoding on the DM6446.

As can be seen in the figures below, the time required for the C64x+ DSP to complete H.264 decoder processing on the worst-case frame in the BDTI test clip is 22.02 msec. This is 66.07 percent of the 33.33 msec (1/30 of a second) available per frame and is indicated by the “0 buffers” case shown in Figures 2 and 3. The average time required for the C64x+ DSP to completely process each of the frames in the clip is 16.13 msec, or 48.2 percent of the 33.33 msec available per frame. This is also shown in Figures 2 and 3. As can be seen, as the number of delay buffers approaches 3 (which is typical for real-world applications), the C64x+ DSP processing engine requirement approaches the average utilization over all frames, which is the theoretical minimum.

Note that the DM6446 VICP hardware acceleration engine is not utilized in the H.264 decoder implementation used to obtain the following results. It could be utilized to offload the in-loop deblocking filter from the C64x+ DSP and reduce the processor utilization reported below.

FIGURE 2: DM6446 H.264 Video Decoder Benchmark Certified Results: minimum clock rate required for real-time operation.  (See text for description of output buffers.)

FIGURE 3: DM6446 H.264 Video Decoder Benchmark Certified Results: CPU utilization required for real-time operation.  (See text for description of output buffers.)

Texas Instruments DM6446 video decoder performance on the BDTI H.264 Decoder Benchmark
Baseline Profile, D1 (720×480) Resolution, 30 fps, 1.5 Mbps

| Metric | Minimum DSP Clock Rate (MHz) | DSP Utilization @ 594 MHz (Percent) | ARM Program Memory Usage (KBytes) | DSP Program Memory Usage (KBytes) | ARM Static Data Memory Usage (KBytes) | DSP Static Memory Usage (MBytes) | Dynamic Data Memory Usage (Bytes) | Buffering Delay (Seconds) |
| Average over entire clip | 286 | 48.2% | 121.7 | 555.2 | 11.3 | 553.4 | N/A | N/A |
| Buffering 3 frames | 291 | 49.0% | 121.7 | 555.2 | 11.3 | 553.4 | 5.6 | 0.100 |
| Buffering 2 frames | 295 | 49.7% | 121.7 | 555.2 | 11.3 | 553.4 | 4.8 | 0.067 |
| Buffering 1 frame | 302 | 50.8% | 121.7 | 555.2 | 11.3 | 553.4 | 4.0 | 0.033 |
| No buffering (highest CPU load frame) | 392 | 66.0% | 121.7 | 555.2 | 11.3 | 553.4 | 3.2 | 0 |

Memory usage results are taken from Texas Instruments Application Report SPRAAF6, "DaVinci System Level Benchmarking Measurements." They include audio decoding algorithms and system/driver program memory usage (i.e., Linux, camera, monitor, etc.).
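The clock-rate and buffering-delay columns in the table above are related to the other columns by simple arithmetic: the minimum DSP clock rate is the utilization multiplied by the 594 MHz DSP clock, and the buffering delay is the number of buffered frames divided by the 30 fps frame rate. The sketch below reproduces that relationship from the table's own values; the function and variable names are illustrative.

```python
DSP_CLOCK_MHZ = 594  # C64x+ clock rate used for the utilization column
FRAME_RATE_FPS = 30  # frame rate of the BDTI Primary Operating Point


def min_dsp_clock_mhz(utilization_percent):
    """Clock rate at which the same workload would reach 100% utilization."""
    return DSP_CLOCK_MHZ * utilization_percent / 100.0


def buffering_delay_s(buffered_frames):
    """Output latency added by holding back this many decoded frames."""
    return buffered_frames / FRAME_RATE_FPS


# (label, DSP utilization in percent, buffered frames) from the table.
table_rows = [
    ("Average over entire clip", 48.2, None),
    ("Buffering 3 frames", 49.0, 3),
    ("Buffering 2 frames", 49.7, 2),
    ("Buffering 1 frame", 50.8, 1),
    ("No buffering", 66.0, 0),
]

for label, util, frames in table_rows:
    clock = round(min_dsp_clock_mhz(util))
    delay = "N/A" if frames is None else f"{buffering_delay_s(frames):.3f} s"
    print(f"{label}: {clock} MHz, delay {delay}")
```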

No reproduction or reuse of the above information is permitted without the express authorization of BDTI.  For reproduction permission or to obtain benchmark results for your processing engine, please contact BDTI.