## Excerpts from BDTI's Report:

# Inside the LSI Logic ZSP500



A technical evaluation by the staff of Berkeley Design Technology, Inc.

The following are excerpts and abridged text from BDTI's report, Inside the LSI Logic ZSP500.

Contents of this excerpt include:

- Introduction
- Scope
- The ZSP500 Core
- BDTI Benchmark™ Performance:
  - Sample Execution Time Results
  - Sample Memory Usage Results
- Summary

The complete report may be ordered from BDTI. Details are on page 4.

#### Introduction

In 2002 LSI Logic introduced the ZSP500, a licensable, superscalar, fixed-point DSP core. The ZSP500 can execute up to four instructions per cycle and can perform up to two 16-bit MACs per cycle. The ZSP500 is based on LSI's earlier ZSP400 core but adds several significant features, including dedicated address generation hardware. LSI's ZSP architecture family includes the ZSP400, ZSP500, and the forthcoming quad-MAC ZSP600 core, which together offer a range of performance levels and die sizes. ZSP family members are code-compatible at the assembly level, allowing code to be ported between family members with relative ease.

The ZSP500 targets cost- and power-sensitive applications, including cellular handsets, voice-over-network (VoN) applications, and consumer devices like personal digital assistants.

The projected worst-case clock speed for the ZSP500 is 250 MHz in a 0.13-micron process. The ZSP500 has not yet been fabricated.

#### Scope

Inside the LSI Logic ZSP500 is intended for anyone interested in understanding the DSP performance and capabilities of the ZSP500 core. It presumes a basic knowledge of DSP processor concepts and terms, which are covered in BDTI's textbook, DSP Processor Fundamentals. Inside the LSI Logic ZSP500 is especially useful for electronic system designers, hardware and software engineers, system-on-chip (SoC) designers, engineering managers, and product marketing managers. This report will aid in the assessment of the ZSP500's suitability for a given application, and it will allow system and chip designers to make informed decisions when considering the ZSP500 for their latest designs.

The report includes brief analyses of several other DSP-oriented cores and processors: the LSI Logic ZSP400 core and LSI40x chips; the 3DSP SP-5 core; the ADI/Intel Micro Signal Architecture core (MSA1) and ADI ADSP-21535 chip; and the Texas Instruments TMS320C55xx chips. These processors have been included to give the reader insight into how the ZSP500 compares to other well-known DSP architectures.

#### The ZSP500 Core

The ZSP500 is a superscalar architecture that can execute up to four instructions in parallel. The superscalar design means that instructions are scheduled for parallel execution by the processor at run time instead of being statically scheduled by a programmer or compiler. The ZSP500 data path contains sixteen 16-bit data registers, two 16-bit ALUs, and a combined multiplier/ALU unit composed of two 16-bit multiply-accumulate (MAC) units and a 40-bit ALU.

The data registers can be concatenated in pairs to form eight 32-bit registers. Each register pair also has a corresponding 8-bit guard register that extends the register pair to 40 bits. These guard registers allow any extended register pair to be used as a 40-bit accumulator.
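To illustrate why the guard bits matter, the following Python sketch models an accumulator as a fixed-width two's-complement value and compares a 32-bit register pair with a 40-bit extended pair. The helper names (`wrap_signed`, `accumulate`) are illustrative only, and the wrap-on-overflow behavior is an assumption of this model, not a statement about how the ZSP500 handles overflow or saturation.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def accumulate(products, bits: int) -> int:
    """Sum products in an accumulator of the given width, wrapping on overflow."""
    acc = 0
    for p in products:
        acc = wrap_signed(acc + p, bits)
    return acc


# Largest product of two signed 16-bit values: (-32768) * (-32768) = 2**30.
worst_case = [(-32768) * (-32768)] * 256
exact = sum(worst_case)               # 2**38

acc32 = accumulate(worst_case, 32)    # no guard bits: the sum wraps around
acc40 = accumulate(worst_case, 40)    # 8 guard bits: the exact sum is kept
```

In this model the 40-bit accumulator holds the exact sum of 256 worst-case products, while the 32-bit accumulator has wrapped and no longer represents the true result.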

The two 16-bit ALUs can be used in parallel with any of the execution units in the combined multiplier/ALU unit; however, the 40-bit ALU cannot be used in parallel with the MAC units.

The combined multiplier/ALU unit can accept only one instruction per cycle; hence, its two MAC units cannot operate independently. Instead, the ZSP500 uses single-instruction multiple-data (SIMD) dual-MAC operations.

#### About BDTI

Founded in 1991, Berkeley Design Technology, Inc. (BDTI) helps companies develop, select, and use DSP technology. BDTI provides:

Consulting and Analysis

- Independent analysis of processors, tools, algorithms, and software
- Processor benchmarking
- Insightful seminars and training
- In-depth published reports
- Trusted advisory consulting services

Software Development Services

- Audio
- Video
- Communications
- General DSP component libraries

The ZSP500 data path is fundamentally a 16-bit data path: it uses 16-bit registers as inputs, and stores results to 16-bit registers. However, most instructions have variants that support 32-bit data, using paired 16-bit registers as operands. For example, the two multipliers can be combined to perform 32-bit MAC operations, and the two 16-bit ALUs can be combined to perform 32-bit ALU operations. As a result, the ZSP500 has better support for 32-bit precision than most DSP processors, a notable advantage in audio applications, for example. Several instructions have SIMD variants that perform the same operation on both halves of paired 16-bit operands. A few instructions, e.g., some arithmetic and shift instructions, support 40-bit data.
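The idea of a SIMD operation on paired 16-bit operands can be sketched in Python as follows. This is a toy model, not ZSP500 instruction semantics: the function names are hypothetical, the packing order (first element in the upper half) is an assumption, and adding both products into a single running sum is a simplification, since the excerpt does not describe how the dual-MAC results are accumulated.

```python
def pack16(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word (hi in the upper half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word: int):
    """Split a 32-bit word into two signed 16-bit values (hi, lo)."""
    def signed(x: int) -> int:
        return x - 0x10000 if x & 0x8000 else x
    return signed((word >> 16) & 0xFFFF), signed(word & 0xFFFF)


def dual_mac(acc: int, a_word: int, b_word: int) -> int:
    """One modeled SIMD step: two 16x16 multiplies, both added to acc."""
    a_hi, a_lo = unpack16(a_word)
    b_hi, b_lo = unpack16(b_word)
    return acc + a_hi * b_hi + a_lo * b_lo


def dot_product(a, b):
    """Dot product of even-length 16-bit vectors; returns (result, dual-MAC steps)."""
    acc, steps = 0, 0
    for i in range(0, len(a), 2):
        acc = dual_mac(acc, pack16(a[i], a[i + 1]), pack16(b[i], b[i + 1]))
        steps += 1
    return acc, steps


result, steps = dot_product([1, -2, 3, 4], [5, 6, -7, 8])
```

A four-element dot product takes two modeled dual-MAC steps here, because each step consumes one packed pair from each vector.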

The ALUs support the typical assortment of logical and arithmetic operations. The ZSP500 also supports bit-field insert/extract, comparison operations, and maximum/minimum operations.

Specialized operations include single-cycle exponent detection and normalization. Single-cycle add-compare-select instructions are also available; these instructions can be used to implement a Viterbi "butterfly" in one instruction cycle. The ZSP500 also supports specialized "C-like" instructions that are intended to improve compiler performance.
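For readers unfamiliar with add-compare-select (ACS), the generic operation behind a Viterbi butterfly can be sketched as below. This shows the textbook computation only; the function names, the minimum-metric convention, and the symmetric branch metrics (+bm and -bm) are assumptions of the sketch and say nothing about the ZSP500's actual instruction encoding or operands.

```python
def acs(metric_a: int, branch_a: int, metric_b: int, branch_b: int):
    """Add-compare-select: extend two candidate paths and keep the cheaper one.

    Returns (surviving path metric, decision bit), where the decision bit is
    0 if path A survives and 1 if path B survives.
    """
    cand_a = metric_a + branch_a   # add
    cand_b = metric_b + branch_b
    if cand_a <= cand_b:           # compare
        return cand_a, 0           # select
    return cand_b, 1


def butterfly(metric_0: int, metric_1: int, bm: int):
    """One Viterbi butterfly: two old states feed two new states.

    Assumes the common symmetric case where the two branches into each new
    state have metrics +bm and -bm.
    """
    new_0 = acs(metric_0, +bm, metric_1, -bm)
    new_1 = acs(metric_0, -bm, metric_1, +bm)
    return new_0, new_1


(new_0, dec_0), (new_1, dec_1) = butterfly(metric_0=10, metric_1=12, bm=3)
```

Each butterfly needs two ACS operations, which is why hardware support for ACS has a direct effect on Viterbi decoder throughput.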

#### **Memory System**

The only memory contained in the ZSP500 core is a small instruction prefetch buffer. Other attributes of the memory system will be determined by the chip designer.

The ZSP500 supports a range of memory architectures, e.g., either separate data and instruction memories or a single data/instruction memory. Licensees can design their own memory systems, or they can use the reference memory system provided by LSI.

The reference memory system contains a data cache and a data prefetch mechanism that helps to improve the ZSP500's performance over that of the ZSP400 (which suffers from frequent data cache misses).

The ZSP500 core interfaces to an off-core memory controller via a 128-bit instruction bus, two 32-bit data read buses, and two 32-bit data write buses. Although the ZSP500 can address only two data transfers per cycle, it can transfer up to 128 bits of data per cycle by tying the 32-bit data buses together for one 64-bit read and one 64-bit write. Hence, the ZSP500 can complete a maximum of four 16-bit reads and four 16-bit writes per cycle as long as the 16-bit data is arranged in groups of four in memory. The ZSP500 can perform data transfers on any 16-bit boundary.

On-chip memory bandwidth will vary depending on chip design choices. When the ZSP500 is connected to high-speed memory, the maximum sustainable on-chip data memory bandwidth is 1 billion 16-bit words/second for reads and 1 billion 16-bit words/second for writes at 250 MHz.
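The bandwidth figures follow directly from the bus widths and the projected clock speed. The short Python calculation below reproduces the arithmetic; the constant names are chosen here for illustration, and the values are the ones quoted in the text.

```python
CLOCK_HZ = 250_000_000        # projected worst-case clock speed
BUS_BITS = 32                 # width of each data bus
BUSES_PER_DIRECTION = 2       # two read buses and two write buses
WORD_BITS = 16                # native data word size

bits_per_cycle = BUS_BITS * BUSES_PER_DIRECTION        # 64 bits in each direction
words_per_cycle = bits_per_cycle // WORD_BITS          # four 16-bit words
read_words_per_second = words_per_cycle * CLOCK_HZ     # reads per second
write_words_per_second = words_per_cycle * CLOCK_HZ    # writes per second
total_bits_per_cycle = 2 * bits_per_cycle              # reads plus writes
```

Four words per cycle at 250 MHz gives the 1 billion words/second quoted for each direction, and the combined read and write traffic accounts for the 128 bits per cycle.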

#### Addressing

In contrast with LSI's earlier ZSP400, which does not contain dedicated address generation units (AGUs), the ZSP500 contains two AGUs. Each AGU can complete one 16-, 32-, 40-, or 64-bit transfer per cycle, and can also perform some basic arithmetic and shift operations. The ZSP500 also differs from its predecessor in that it has a dedicated address register file. This file contains eight 32-bit address registers and eight 16-bit modifier registers. The ZSP500 can also use data registers for certain addressing modes.

The new addressing hardware makes the ZSP500 more flexible than the ZSP400. For example, it is possible to implement an efficient radix-4 FFT on the ZSP500, but not on the ZSP400. The new addressing hardware also allows the ZSP500 to eliminate the "data linking" cache-management scheme that causes frequent cache misses on the ZSP400.

The ZSP500 supports a variety of addressing modes, including register-indirect with pre- or post-modification, and circular addressing. A bit-reversed addressing mode is supported for certain address and data registers.
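Circular and bit-reversed addressing are standard DSP techniques, and the index sequences they produce can be sketched in a few lines of Python. The functions below are generic illustrations with hypothetical names; they model only the resulting index patterns, not the ZSP500's registers or instruction syntax.

```python
def circular_next(index: int, step: int, length: int) -> int:
    """Advance an index through a circular buffer, wrapping at the end."""
    return (index + step) % length


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index (used to reorder FFT data)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Walk a 4-entry circular buffer, starting at index 2, for six accesses.
visited = []
idx = 2
for _ in range(6):
    visited.append(idx)
    idx = circular_next(idx, 1, 4)

# Bit-reversed access order for an 8-point FFT (3 index bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

Hardware support for these modes removes the explicit wrap test or bit manipulation from the inner loop, which is why they matter for filters (circular buffers) and FFTs (bit-reversed ordering).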

#### Pipeline

The ZSP500 uses an eight-stage pipeline (compared with five stages in the ZSP400) that is divided into the following stages: fetch/decode, grouping, address data read, address generation, memory access request, memory access receive, execute, and write-back. In each instruction cycle, the grouping stage groups up to four instructions. These instructions proceed through the pipeline in parallel, and are executed together.

The pipeline is fully interlocked; all data hazards are resolved by the grouping stage, so that instructions always behave as if they were executed serially.

Multiply operations have two-cycle latencies and single-cycle throughput. All other data path instructions have single-cycle latency and throughput.
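The difference between latency and throughput can be made concrete with a simple pipeline model. The Python sketch below is a generic textbook formula under stated assumptions (fully pipelined unit, one operation issued at a time, no other stalls); the function name is hypothetical, and the model ignores the ZSP500's grouping rules and dual-MAC capability.

```python
def cycles_for_back_to_back(n_ops: int, latency: int, issue_interval: int = 1) -> int:
    """Cycles until the last of n independent, pipelined operations completes."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency


# Eight independent multiplies with two-cycle latency and single-cycle throughput.
independent = cycles_for_back_to_back(8, latency=2)

# Eight multiplies in a dependency chain: each must wait for the previous
# result, so in this simple model the latencies add up instead of overlapping.
dependent = 8 * 2
```

In this model, independent multiplies cost roughly one cycle each, while a chain of dependent multiplies pays the full two-cycle latency every time.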

Unlike most processors, when the ZSP500 executes a branch instruction it can fetch instructions at the branch target address and issue the fetched instructions simultaneously with the branch instruction. Hence, the apparent branch latency can be as little as zero cycles, improving performance in some algorithms.

#### Instruction Set

The ZSP500 uses a RISC-like mixed-width instruction set. Most instructions are 16 bits wide; 32-bit encoding is used mainly for instructions that use 16-bit immediate data and for some three-operand instructions, including some variants of the multiply instruction.

The ZSP500 assembly language uses the traditional opcode-operand style. Most operations require two operands, where one of the operands is used as both a source and a destination register. Some instructions, most notably the multiply operations, store the result in a separate destination register.

In general, the ZSP500 instruction set is very orthogonal. Some addressing modes restrict the selection of address and/or operand registers, but these restrictions are fairly benign. The rules that govern instruction grouping are sometimes obscure, however, and must be carefully observed in order to obtain peak speed. On the whole, the ZSP500 is compiler-friendly.

#### **Benchmark Performance**

Inside the LSI Logic ZSP500 includes extensive benchmark results, used to quantitatively evaluate the processor's DSP performance. For each benchmark, BDTI reports cycle counts, execution times, and memory usage. BDTI also provides extensive analysis that compares the benchmark implementations used on the ZSP500 to those used on the other processors. Explanation is provided when a processor's performance is higher or lower than expected based on a high-level view of the architecture. In this section, we present sample execution time and memory usage results taken from the complete set of results in the report.

#### **Execution Time**

To determine the execution time of a particular benchmark on a given processor, the number of instruction cycles the processor requires to execute the benchmark is multiplied by the processor's instruction cycle time. *Inside the LSI Logic ZSP500* includes tables and charts illustrating the number of cycles required by each processor to execute each benchmark and uses these results to generate corresponding tables and charts for execution times.
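The calculation described above is a single multiplication, shown here as a small Python sketch. The cycle count used is a made-up placeholder for illustration, not a BDTI Benchmark result, and the function name is hypothetical.

```python
def execution_time_us(cycles: int, clock_hz: float) -> float:
    """Execution time in microseconds: cycle count times cycle time (1 / clock)."""
    return cycles * 1e6 / clock_hz


# Hypothetical cycle count, NOT a figure from the report.
cycles = 5000

t_250 = execution_time_us(cycles, 250e6)   # at a 250 MHz clock
t_200 = execution_time_us(cycles, 200e6)   # at a 200 MHz clock
```

The same cycle count takes longer at a lower clock speed, which is why cycle counts and clock speeds must be considered together when comparing processors.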

#### About the BDTI Benchmarks<sup>™</sup>

The BDTI Benchmarks are a set of DSP software functions that BDTI has independently designed to provide an objective basis for comparing processor performance characteristics such as speed and memory use for DSP applications. The BDTI Benchmark functions are implemented in assembly language to allow a realistic assessment of processor DSP performance. The resulting software is then verified for functional correctness, optimality, and adherence to the BDTI Benchmark specifications. Benchmark performance results are obtained either through manual analysis and careful, detailed simulation, or by measurement on sample devices.



The execution time results for the ZSP500 were obtained using a projected core clock speed of 250 MHz—the projected worst-case speed for the core in a 0.13-micron process. The benchmark results assume the use of LSI's reference memory system, which includes a data cache and data prefetch mechanism. Note that the number of cycles required to execute a benchmark—and the core clock speed—are implementation-dependent. For example, different memory systems may yield different benchmark cycle counts.

#### Sample Benchmark Results

The execution time results for BDTI's 256-point FFT benchmark are shown in the figure above. As illustrated in this figure, at 250 MHz the ZSP500 is much faster than the 200 MHz LSI402ZX and the 200 MHz TMS320C5509, and comparable to the other processors, even the quad-MAC SP-5. The ZSP500 cycle count on this benchmark (not shown) is much lower than that of the LSI402ZX because the ZSP500 is able to use its dedicated address generation hardware and more sophisticated data caching scheme to improve its performance. The ZSP500 is also able to efficiently implement housekeeping tasks, which map well to the processor's simple instructions and four-issue architecture.

#### Memory Use

Execution speed is often the primary metric used to compare processors. However, a processor's memory usage is also important. For example, the memory requirements of an application can have a significant impact on overall system cost. In addition, processors may experience significant performance degradation when application code and data do not fit in on-chip memory. Because of these and other factors, memory efficiency is an important metric in processor selection. For each of the BDTI Benchmarks<sup>™</sup>, BDTI reports each processor's program, constant data, non-constant data, and total memory use.

#### **Control Benchmark**

The BDTI Benchmarks<sup>™</sup> include one benchmark function specifically designed to evaluate memory use for control-oriented software. Control-oriented tasks usually constitute the bulk of an application's program memory requirements, but only a fraction of the application processing time. Thus, in control-oriented tasks, minimizing memory use is usually a more serious concern than maximizing execution speed.

BDTI's Control benchmark is designed to represent control-oriented software. While most of the BDTI Benchmarks<sup>™</sup> are optimized primarily for maximum speed, BDTI's Control benchmark is optimized for minimum memory usage. This optimization hierarchy mirrors the approach generally followed by control-code programmers. Note that memory usage results on the Control benchmark are not necessarily indicative of processor memory use in signal-processing-intensive code.

#### Sample Benchmark Results

The memory usage results for BDTI's Control benchmark are shown in the figure below. Because the ZSP500 uses a mixture of 16-bit and 32-bit instructions on this benchmark, it consumes about 15% more memory than the ZSP400, TMS320C55xx, or the MSA1, none of which use instructions wider than 16 bits on this benchmark. The ZSP500 uses significantly less memory than the SP-5, which only supports 32-bit instructions.

#### Conclusions

In terms of benchmark cycle counts, the ZSP500 is among the more efficient DSP-oriented cores currently available. This may surprise the casual observer, because the ZSP500 contains only two MAC units. MAC operations are central to many DSP algorithms, and MAC throughput is often used as a proxy for architectural efficiency. Nevertheless, the ZSP500 benchmark cycle counts are significantly lower than those of dual-MAC counterparts like the MSA1 and the TMS320C55xx, and in many cases the ZSP500 outpaces even quad-MAC competitors like the SP-5. This efficiency, combined with its mid-range projected clock speed of 250 MHz in a 0.13-micron process, gives the ZSP500 very strong DSP performance.

Today, the ZSP500 is available only as a licensable core. However, its predecessor, the ZSP400, is provided as a licensable core, in ASIC libraries, in application-specific standard product chips, and in off-the-shelf packaged processors. Few processors are available in more than one of these forms; the availability of the ZSP architecture in all four forms is unique among DSPs. If LSI Logic makes the ZSP500 available in multiple forms (it has yet to announce any such plans), it will give the ZSP500 a key advantage.

The ZSP500 is a strong offering. It is one of the fastest DSP cores available today, and is compatible with even higher-performance forthcoming family members. And, with the futures of many core vendors uncertain, LSI's size and stability should help attract customers. The greatest challenge for the ZSP500 may be in software development tools and infrastructure. While LSI's software and tools offerings are superior to those of most DSP core vendors, they still lag significantly behind those available for the best-supported DSP chips.

### Order Form

Inside the LSI Logic ZSP500: A BDTI Technical Evaluation

Mail this form along with a check, or fax it with a purchase order, to:

- Berkeley Design Technology, Inc.
- 2107 Dwight Way, Second Floor
- Berkeley, CA 94704 USA
- Tel: +1 (510) 665-1600
- Fax: +1 (510) 665-1680
- Email: info@BDTI.com

| Description                                           | Qty |   | Price   |   | Total |
|-------------------------------------------------------|-----|---|---------|---|-------|
| First copy                                            | 1   | x | \$1500  | = |       |
|                                                       |     | x | \$650   | = |       |
| Tax (for CA orders)                                   |     |   |         | = |       |
| International orders add \$75 for shipping & handling |     |   |         | = |       |
| TOTAL                                                 |     |   |         | = |       |

Customer information:

- Name
- Title, Division
- Company
- Address
- City, State, Zip, Country
- Tel
- Fax
- Email

Payment:

- Check enclosed, payable to Berkeley Design Technology, Inc.
- Purchase order, copy attached
- Credit card (contact BDTI for instructions)

International orders must be prepaid in US dollars. Contact BDTI at info@BDTI.com for volume discounts.