Last week TI introduced the TMS320C6455, the first general-purpose DSP to use TI's new ‘C64x+ core. (The previously announced TCI6482 also uses the new core, but this part is available only to select customers. See the February 2005 edition of Inside DSP for details.) TI also revealed the details of the ‘C64x+ architecture.
As the name suggests, the ‘C64x+ is based on TI's well-established ‘C64x DSP architecture. The ‘C64x+ is object-code compatible with its predecessor and is similar to it in most respects. For example, both architectures can execute up to eight instructions per cycle. And like current ‘C64x-based chips, the new ‘C6455 will operate at up to 1 GHz. However, the ‘C64x+ includes some important upgrades that significantly improve both the throughput and the memory efficiency of the new architecture.
The most prominent upgrade is increased multiply-accumulate (MAC) throughput. The ‘C64x+ can perform up to eight 16-bit MAC operations per cycle, compared to a maximum of four MAC operations per cycle on the ‘C64x. The ‘C64x+ is also able to complete up to two 32 x 32 MAC operations per cycle. In contrast, the ‘C64x does not directly support 32 x 32 MAC operations. The ‘C64x+ also offers expanded add and subtract capabilities, as well as new bit-manipulation instructions that accelerate security and communications algorithms.
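To put the MAC figures in perspective, the short Python sketch below models a 16-bit dot product as a sequence of MAC operations and estimates an idealized cycle count from the per-cycle MAC rates quoted above. The function names are hypothetical, and the model assumes every cycle issues the maximum number of MACs; it ignores pipeline latency, memory bandwidth, and loop overhead, so it indicates a best case rather than measured performance.

```python
import math


def dot_product_16bit(x, y):
    """Sum of 16-bit x 16-bit products; each loop iteration is one MAC."""
    acc = 0
    for a, b in zip(x, y):
        for v in (a, b):
            if not -32768 <= v <= 32767:
                raise ValueError("operand does not fit in 16 bits")
        acc += a * b  # multiply, then accumulate
    return acc


def ideal_mac_cycles(num_macs, macs_per_cycle):
    """Best-case cycles if every cycle issues the maximum number of MACs."""
    return math.ceil(num_macs / macs_per_cycle)


def peak_macs_per_second(macs_per_cycle, clock_hz):
    """Peak MAC rate for a given per-cycle MAC rate and clock frequency."""
    return macs_per_cycle * clock_hz


if __name__ == "__main__":
    taps = 256  # assumed vector length, chosen only for illustration
    print("C64x  best-case cycles:", ideal_mac_cycles(taps, 4))
    print("C64x+ best-case cycles:", ideal_mac_cycles(taps, 8))
    print("C64x+ peak MACs/s at 1 GHz:", peak_macs_per_second(8, 1_000_000_000))
```

Under these assumptions, doubling the per-cycle MAC rate halves the best-case cycle count for a MAC-bound kernel. Real code rarely sustains the peak rate.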
Interestingly, the ‘C64x+ adds no video-specific instructions. This is striking because video applications are a key target for the new architecture. Although the ‘C64x+ will have respectable video-processing capabilities, it could have benefited from additional video-oriented instructions.
Moving beyond new instructions, the ‘C64x+ also takes a new approach to software-pipelined loops, which are used heavily in optimized ‘C64x code to reduce the impact of the deep pipeline. The ‘C64x+ adds a loop buffer that greatly reduces the need for loop setup and cleanup code. The obvious benefit of this change is that it reduces code size in loop-intensive signal-processing code. The loop buffer also allows the programmer to schedule instructions that execute only once in parallel with loop instructions. This feature makes use of execution slots that would otherwise go unused, significantly improving performance in some cases.
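For readers unfamiliar with software pipelining, the following Python sketch is a toy model of how successive loop iterations overlap. It is not TI's loop-buffer mechanism. It assumes a loop body split into a fixed number of single-cycle stages, with a new iteration starting every cycle; the function names are hypothetical. The "prolog" and "epilog" cycles, in which the pipeline is filling or draining, correspond to the loop setup and cleanup code discussed above, and the unused stage slots in those cycles illustrate where once-only instructions could in principle be placed.

```python
def pipeline_schedule(stages, iterations):
    """Per cycle, list the (iteration, stage) pairs that are in flight."""
    schedule = []
    for cycle in range(iterations + stages - 1):
        active = [
            (cycle - s, s)
            for s in range(stages)
            if 0 <= cycle - s < iterations
        ]
        schedule.append(active)
    return schedule


def classify_phases(schedule, stages):
    """Label each cycle as prolog (filling), kernel (full), or epilog (draining).

    If there are fewer iterations than stages, the pipeline never fills
    and this simple model labels every cycle as prolog.
    """
    phases = []
    seen_kernel = False
    for active in schedule:
        if len(active) == stages:
            phases.append("kernel")
            seen_kernel = True
        elif seen_kernel:
            phases.append("epilog")
        else:
            phases.append("prolog")
    return phases


if __name__ == "__main__":
    stages, iterations = 3, 5  # assumed values, for illustration only
    sched = pipeline_schedule(stages, iterations)
    for cycle, (active, phase) in enumerate(zip(sched, classify_phases(sched, stages))):
        idle = stages - len(active)
        print(f"cycle {cycle}: {phase:6s} in flight={active} idle slots={idle}")
```

In this model, only the kernel cycles keep every stage busy. Without hardware support, the partially filled prolog and epilog cycles must be written out as separate code around the kernel, which is the overhead the loop buffer is described as reducing.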
Although the loop buffer brings important benefits, it requires a style of programming that many assembly-level programmers will find unfamiliar and challenging. This is particularly problematic because the ‘C64x was already a challenging assembly-code target.
Last but not least, the ‘C64x+ supports 16-bit wide instruction words as well as the 32-bit instructions used by the ‘C64x. The use of mixed-width instruction sets is a common memory-saving feature, but the ‘C64x+ takes an unusual approach to implementing this feature. Under this approach, the programmer cannot specify which instructions use 16-bit encoding. Instead, the assembler determines where it can use 16-bit encoding. Because it is hard to predict where 16-bit instructions will be used, assembly-level programmers will find it difficult to minimize memory use. The upside of TI's approach is that re-assembling ‘C64x code for the ‘C64x+ will usually provide significant memory savings.
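As a rough illustration of why mixed-width encoding saves memory, the Python sketch below estimates program size when some fraction of instructions receives the 16-bit encoding. The function name and the fractions used are hypothetical; the model simply counts 2 bytes per 16-bit instruction and 4 bytes per 32-bit instruction and ignores any encoding overhead or alignment constraints the real architecture may impose.

```python
def program_bytes(num_instructions, fraction_16bit):
    """Estimated code size when a fraction of instructions is encoded in 16 bits."""
    if not 0.0 <= fraction_16bit <= 1.0:
        raise ValueError("fraction_16bit must be between 0 and 1")
    short = round(num_instructions * fraction_16bit)
    full = num_instructions - short
    return short * 2 + full * 4  # 2 bytes per 16-bit word, 4 per 32-bit word


if __name__ == "__main__":
    n = 1000  # assumed instruction count, for illustration only
    for fraction in (0.0, 0.25, 0.5):
        print(f"{fraction:.0%} 16-bit: {program_bytes(n, fraction)} bytes")
```

Because the assembler, not the programmer, chooses which instructions are shortened, the fraction is an outcome of assembly rather than an input the programmer controls.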
BDTI recently completed an analysis of the ‘C64x+ using its BDTI Benchmarks. Based on the results of this analysis, the combination of new instructions and the loop buffer gives the ‘C64x+ a 20% performance boost over its predecessor. On some algorithms, the ‘C64x+ also uses roughly half as much program memory as the ‘C64x. (An analysis of both program and data memory use shows that the ‘C64x+ uses about 15% less memory than its predecessor overall.) Benchmark results for the ‘C64x and ‘C64x+ are available at http://www.BDTI.com/Services/Benchmarks/DKB.
Overall, the ‘C64x+ is a significant, if not revolutionary, improvement over the ‘C64x. By improving both speed and memory use, TI is sure to strengthen its lead in high-performance DSP. The main challenge for TI will be helping its customers deal with the increased complexity in what was already a highly complicated architecture.
The ‘C6455 is expected to begin sampling in the third quarter of 2005. Volume production is scheduled for the second quarter of 2006. Planned pricing for 10,000-unit orders is $259 for the 1 GHz version, $219 for the 850 MHz version, and $179 for the 720 MHz version.