Last week processor core licensor ARC introduced a multimedia subsystem for its ARC 700 family of configurable cores. This multimedia subsystem extends the ARC 700 CPU with new instructions, a powerful SIMD engine, memory, a DMA controller, and video decoding software. The SIMD engine is the most notable feature of this subsystem. This engine features a 128-bit-wide data path that can perform up to sixteen 8-bit operations, eight 16-bit operations or four 32-bit operations per cycle. In comparison, ARC’s existing DSP extensions can perform a maximum of one 32-bit or two 16-bit operations per cycle.
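The lane counts above follow directly from dividing the datapath width by the element width. The following Python sketch models a generic packed (SIMD) addition on a 128-bit word. It is an illustration under stated assumptions, not ARC's instruction set: the function names are hypothetical, lanes are unsigned, and overflow wraps within each lane.

```python
def lanes(width_bits: int, datapath_bits: int = 128) -> int:
    """Number of independent lanes a datapath of the given width holds."""
    return datapath_bits // width_bits

def pack(values, width_bits: int) -> int:
    """Pack unsigned lane values into one integer, lane 0 in the low bits."""
    mask = (1 << width_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * width_bits)
    return word

def unpack(word: int, width_bits: int, datapath_bits: int = 128):
    """Split a packed word back into its unsigned lane values."""
    mask = (1 << width_bits) - 1
    return [(word >> (i * width_bits)) & mask
            for i in range(lanes(width_bits, datapath_bits))]

def packed_add(a: int, b: int, width_bits: int, datapath_bits: int = 128) -> int:
    """Hypothetical lane-wise wrapping add: carries never cross lane boundaries."""
    mask = (1 << width_bits) - 1
    result = 0
    for i in range(lanes(width_bits, datapath_bits)):
        shift = i * width_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # wrap within the lane
    return result

# 128-bit engine: 16 x 8-bit, 8 x 16-bit, or 4 x 32-bit operations per instruction.
print([lanes(w) for w in (8, 16, 32)])
# A 32-bit datapath, by contrast, yields one 32-bit or two 16-bit lanes.
print([lanes(w, 32) for w in (32, 16)])
```

In the 8-bit case, adding 250 and 10 in lane 0 wraps to 4 without disturbing lane 1, which is what distinguishes a packed add from an ordinary 128-bit integer add.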
The basic features of ARC’s SIMD engine are remarkably similar to those of ARM’s NEON extensions. (For more on NEON, see the October 2004 edition of Inside DSP.) For example, the ARM NEON extensions can also perform up to sixteen 8-bit operations, eight 16-bit operations or four 32-bit operations per cycle.
However, the ARC SIMD engine differs from NEON in two respects. First, the ARC SIMD engine offers two operating modes. In the first mode, the SIMD engine draws instructions and data from the ARC 700 pipeline. This mode resembles the operation of the ARM NEON extensions. In the second mode, the SIMD engine operates from its own private instruction and data memories. This mode allows the SIMD engine to operate in parallel with the CPU. ARM does not currently offer similar functionality.
The ARC SIMD engine is also more specialized than the NEON extensions. For example, the ARC SIMD engine offers specialized instructions to accelerate the deblocking filters in H.264 and VC-1 video codecs. The ARM NEON extensions do not offer similarly specialized instructions. As with the specialized instructions found in DSPs, it’s likely that ARC’s specialized instructions will boost performance at the cost of increased programming complexity.
In addition to the SIMD extensions, ARC’s multimedia subsystem adds entropy decoding instructions to the ARC 700 CPU; these instructions are useful for the variable-length decoding tasks found in video decoders. The multimedia subsystem also includes a new DMA engine. Among other capabilities, the DMA engine can load data from external RAM directly into the SIMD engine’s private data memory.
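To give a sense of the bit-serial work that entropy decoding instructions target, here is a Python sketch of unsigned Exp-Golomb decoding, one of the variable-length codes H.264 uses for many syntax elements. This is a generic software illustration; ARC has not described its entropy decoding instructions in this level of detail, and the class and function names below are hypothetical.

```python
class BitReader:
    """Hypothetical helper: reads bits MSB-first from a bytes object."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current bit position

    def read_bit(self) -> int:
        if self.pos >= len(self.data) * 8:
            raise EOFError("bitstream exhausted")
        byte = self.data[self.pos // 8]
        bit = (byte >> (7 - self.pos % 8)) & 1
        self.pos += 1
        return bit

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            value = (value << 1) | self.read_bit()
        return value

def read_ue(reader: BitReader) -> int:
    """Decode one unsigned Exp-Golomb code: N zero bits, a 1, then N info bits."""
    leading_zeros = 0
    while reader.read_bit() == 0:
        leading_zeros += 1
    return (1 << leading_zeros) - 1 + reader.read_bits(leading_zeros)

# Codes "1", "010", "011", "00100" concatenated and zero-padded to two bytes.
reader = BitReader(bytes([0xA6, 0x40]))
print([read_ue(reader) for _ in range(4)])
```

Each code's length depends on its value, so a decoder must scan bit by bit (or count leading zeros) before it knows where the next symbol starts. That serial dependency is why such tasks map poorly onto a SIMD engine and are commonly handled by dedicated instructions on the CPU.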
According to ARC, the multimedia subsystem is capable of operating at the same clock speed as the ARC 700 cores. For example, ARC says the subsystem can achieve a worst-case speed of 533 MHz when implemented in TSMC’s 0.13-micron LVLK-OD process.
ARC expects most licensees to use the multimedia subsystem in lower-speed designs. For example, ARC projects that a multimedia subsystem running at 166 MHz will be able to decode H.264 baseline profile video at D1 resolution and 30 frames per second. (ARC has not yet completed development of this decoder, so this claim has not yet been proven.) According to ARC, the multimedia subsystem can achieve this 166 MHz clock speed in the low-cost TSMC CL013G process. In this same process, the area for the subsystem is 2.36 mm² (including an ARC 725D core, but not any memory).
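One way to gauge how aggressive this projection is: compute the cycle budget per 16×16 macroblock. The Python sketch below assumes D1 means 720×480 pixels (the NTSC variant, consistent with 30 frames per second) and that the whole 166 MHz clock is available to the decoder; neither assumption comes from ARC.

```python
import math

def cycles_per_macroblock(clock_hz: float, width: int, height: int,
                          fps: float, mb_size: int = 16) -> float:
    """Clock cycles available per macroblock, assuming all cycles go to decoding."""
    mbs_per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    mbs_per_second = mbs_per_frame * fps
    return clock_hz / mbs_per_second

# Assumed D1 (NTSC): 720 x 480 -> 45 x 30 = 1,350 macroblocks per frame.
budget = cycles_per_macroblock(166e6, 720, 480, 30)
print(round(budget))  # roughly 4,100 cycles per macroblock
```

At 1,350 macroblocks per frame and 30 frames per second, the decoder must process 40,500 macroblocks per second, leaving about 4,100 cycles for each one, including entropy decoding, motion compensation, inverse transform, and deblocking.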
If ARC’s performance figures are correct, it has created a solution that combines efficiency and flexibility. In terms of performance and die area, the ARC multimedia subsystem appears to be competitive with highly specialized video decoding engines. Yet ARC’s multimedia subsystem—particularly the SIMD engine—is relatively flexible and useful for tasks other than video decoding. Combining efficiency and flexibility is no easy feat, so ARC’s claims are likely to be greeted with skepticism. Before it can win customers, ARC will need to provide strong evidence to support its bold claims.
ARC expects to make the multimedia subsystem available to customers early next year.