# Inside the 3DSP SP-5



A Technical Evaluation by the staff of Berkeley Design Technology, Inc.

The following are excerpts and abridged text from BDTI's report, Inside the 3DSP SP-5.

Contents of this excerpt include:

- Introduction
- Scope
- The SP-5 Core
- Benchmark Performance:
  - Sample Execution Time Results
  - Sample Memory Usage Results
- Summary

The complete report may be ordered from BDTI. Details are on page 4.

## Introduction

In 2000, 3DSP introduced the SP-5, a licensable superscalar fixed-point DSP core. The SP-5 is capable of executing up to four MACs per cycle and can be configured to include either $16 \times 24$-bit or $32 \times 32$-bit multipliers.

3DSP provides the SP-5 as a synthesizable core for use in system-on-chip (SoC) designs targeting telecommunications, audio, and video applications. Licensees can choose the fabrication process, select cell libraries, and optimize synthesis goals to tailor performance and power consumption to suit a particular application.

At the time this report was published, in December 2001, 3DSP was in the process of fabricating a demonstration chip in a 0.18-micron process with an expected clock speed of 220 MHz at 1.8 volts under typical conditions. 3DSP projects the SP-5 will execute at 320 MHz in a 0.13-micron process under typical conditions. Worst-case clock speed data was not provided by 3DSP; in general, worst-case speeds can be expected to be significantly slower than typical speeds—differences of 30% are common.

## Scope

Inside the 3DSP SP-5 is intended for anyone interested in understanding the DSP performance and capabilities of the SP-5 core. It presumes a basic knowledge of DSP processor concepts and terms, which are covered in BDTI's DSP Processor Fundamentals. Inside the 3DSP SP-5 is especially useful for electronic system designers, hardware and software engineers, system-on-chip (SoC) designers, engineering managers, and product marketing managers. This report will aid in the assessment of the SP-5's suitability for a given application, and it will allow system designers to make informed decisions when considering the SP-5 for their latest designs.

For purposes of comparison, this report includes brief analyses of several other DSP-oriented cores and processors: the LSI Logic ZSP400 core and LSI402ZX chip; the ADI/Intel Micro Signal Architecture core and ADI ADSP-21535 chip; the Infineon Carmel 10xx core; and the StarCore SC140 core and Motorola MSC8101 chip. These processors have been included to give the reader insight into how the SP-5 compares to other well-known DSP architectures.

## About BDTI

Berkeley Design Technology, Inc. (BDTI) was founded in 1991 to assist companies in creating, selecting, and using DSP technology. The technical staff of BDTI has extensive experience in the development of DSP-intensive software and hardware for commercial applications. BDTI offers a variety of technical products and services, including:

- Published reports on DSP processors and technology
- DSP software development services
- Technical advisory services
- Training

## The SP-5 Core

The SP-5 is a superscalar architecture that can execute up to two instructions in parallel. The superscalar design means that instruction scheduling is performed by the processor at run-time instead of being statically scheduled by a programmer or compiler.

The SP-5 data path consists of two multipliers, four accumulators, two adders, one shifter, one logic unit, and one "ASIC" unit. Any two of these units can be used concurrently to take advantage of the core's dual-issue capability. A register file contains 32 general-purpose 32-bit registers; these registers are shared by all execution units.

The SP-5 supports a wider range of data types than most processors, allowing 8-, 16-, 24-, and 32-bit fixed-point data types for many operations. It also provides hardware support for 16-bit complex data. This data type flexibility is useful in applications such as multimedia devices that incorporate modem, image/video, voice, and audio functionality. Each of these tasks has its own data type preference and precision requirements, and it is more efficient to implement them on a processor that can easily accommodate the differences.

The core supports SIMD (single instruction, multiple data) operations on 8- and 16-bit data. With this feature, operands are treated as multiple packed sub-operands, and operations are performed on all sub-operands in parallel. For example, by using 16-bit data packed into 32-bit operands, each SIMD multiplier can compute two multiply-accumulate operations per cycle for a combined throughput of four MAC operations per cycle. SIMD support is available for most operations, including adds, subtracts, multiplies, shifts, and comparisons.
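The packed-operand idea can be illustrated with a short Python model. This is a sketch for intuition only, not SP-5 code: the helper names `pack16`, `unpack16`, and `simd_mac16` are hypothetical, the position of each 16-bit half within the 32-bit word is an assumption, and this excerpt does not state whether the two products are accumulated separately or summed (the sketch keeps them separate).

```python
def pack16(hi, lo):
    """Pack two signed 16-bit values into one 32-bit word (hi in the upper half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word):
    """Split a 32-bit word into two signed 16-bit values (hi, lo)."""
    def to_signed(v):
        return v - 0x10000 if v & 0x8000 else v
    return to_signed((word >> 16) & 0xFFFF), to_signed(word & 0xFFFF)


def simd_mac16(acc_hi, acc_lo, a_word, b_word):
    """One SIMD MAC step: two independent 16x16 multiply-accumulates."""
    a_hi, a_lo = unpack16(a_word)
    b_hi, b_lo = unpack16(b_word)
    return acc_hi + a_hi * b_hi, acc_lo + a_lo * b_lo


# Two multipliers, each doing two 16x16 MACs per cycle, give four MACs per cycle.
MULTIPLIERS = 2
MACS_PER_MULTIPLIER = 2
macs_per_cycle = MULTIPLIERS * MACS_PER_MULTIPLIER

# One packed step: computes 3*5 and (-4)*6 side by side.
acc = simd_mac16(0, 0, pack16(3, -4), pack16(5, 6))
```

In this model a single call stands in for what one SIMD multiplier does in one cycle; the second multiplier would process another pair of packed words in parallel.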

For the standard SP-5 implementation, each multiplier can perform one $16 \times 24$ multiply, up to two $16 \times 16$ multiplies, or up to four $8 \times 16$ multiplies per cycle. When a multiplier performs more than one multiply per cycle, it does so using the SIMD features. The licensee has the option of replacing the $16 \times 24$ multipliers with $32 \times 32$ multipliers. A single-instruction, single-cycle complex multiplication using 16-bit data is supported. Products of the multiplier can replace, be added to, or be subtracted from the contents of an accumulator. All multiplier inputs are treated as integers in the sense that there are no mode bits to specify an automatic product shift, as might be used in fractional number representation.
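A minimal Python sketch of what a complex multiply computes, and of what integer-only products imply for fractional data, is given below. The function names are hypothetical, and the use of Q15 (a common 16-bit fractional format) is an assumption made for illustration; the excerpt says only that there is no automatic product shift.

```python
def complex_mul_int(a_re, a_im, b_re, b_im):
    """Complex multiply on integer parts: (a_re + j*a_im) * (b_re + j*b_im)."""
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    return re, im


def q15_from_product(product):
    """Rescale an integer product of two Q15 values back to Q15 with an explicit shift."""
    return product >> 15


# 0.5 in Q15 is 16384; 0.5 * 0.5 = 0.25, which is 8192 in Q15.
half = 1 << 14
re, im = complex_mul_int(half, 0, half, 0)
quarter = q15_from_product(re)
```

Because the hardware treats the operands as integers, a shift like the one in `q15_from_product` would have to be performed explicitly by software working with fractional data.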

The shift unit provides, in addition to shift operations, bit field and pack operations.

The logic unit supports bitwise logical operations as well as bit field set and clear, and exponent detect.

The ASIC unit contains SIMD maximum and minimum functions. Additionally, an instruction is provided to repeat the selection made by the last maximum or minimum operation. These functions are designed to accelerate add-compare-select operations used in Viterbi decoders.
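The role of these functions in a Viterbi decoder can be sketched as follows. This is an illustrative Python model of a generic add-compare-select step, not the SP-5 instruction sequence: the function names are hypothetical, survivor paths are kept as plain lists for clarity, and the choice of maximum rather than minimum depends on how path metrics are defined.

```python
def acs_max(metric_a, metric_b):
    """Compare-select: return the larger path metric and which input was chosen (0 or 1)."""
    if metric_a >= metric_b:
        return metric_a, 0
    return metric_b, 1


def repeat_select(choice, value_a, value_b):
    """Apply a previously recorded selection to a second pair of values."""
    return value_a if choice == 0 else value_b


def acs_step(prev0, prev1, branch0, branch1, history0, history1):
    """Add-compare-select for one trellis state with two incoming paths."""
    cand0 = prev0 + branch0                                # add
    cand1 = prev1 + branch1
    metric, choice = acs_max(cand0, cand1)                 # compare and select
    survivor = repeat_select(choice, history0, history1)   # reuse the selection
    return metric, survivor + [choice]


new_metric, new_history = acs_step(10, 12, 3, 0, [0], [1])
```

The `repeat_select` helper mirrors the idea of repeating the last selection: once the winning path metric is known, the same choice picks the matching survivor data without a second comparison.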

#### Memory System

Since the SP-5 is a synthesizable core, the memory subsystem will vary with the implementation. 3DSP provides a standard subsystem to support the SP-5 but licensees also have the option of designing their own. The licensee can specify the size of program and data memory based on the target application.

The core uses a Harvard memory architecture, characterized by independent data and program memory spaces, and allows access of both memory spaces in parallel. The program memory address bus is configurable to be up to 32 bits wide with a 64-bit data bus.

Data memory is divided into A and B memory banks, each supporting an address space of up to 32 bits. Each bank has two address and data buses that support two accesses of each bank per cycle. At 220 MHz, the SP-5's sustainable data memory read bandwidth is 1.76 billion 16-bit words per second, assuming the 16-bit data is packed into 32-bit words aligned on 32-bit boundaries.
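The bandwidth figure follows from the bank and bus counts given above. The short Python calculation below reproduces it, assuming each access reads one 32-bit word holding two packed 16-bit values; the constant names are chosen here for illustration.

```python
CLOCK_HZ = 220_000_000    # demonstration-chip clock from the text
BANKS = 2                 # A and B data memory banks
ACCESSES_PER_BANK = 2     # two accesses per bank per cycle
WORDS16_PER_ACCESS = 2    # two 16-bit values packed in each 32-bit word

# 16-bit words read per cycle, then per second.
words_per_cycle = BANKS * ACCESSES_PER_BANK * WORDS16_PER_ACCESS
words_per_second = words_per_cycle * CLOCK_HZ
```

Eight 16-bit words per cycle is also what four 16-bit MACs per cycle consume when both operands of every MAC come from memory.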

#### Addressing

The SP-5 supports a variety of addressing modes, including register-direct, register-indirect, paged-direct, and register-indirect with post-modification. Data memory access is supported by two address generation units (AGUs); each AGU generates up to two independent addresses per cycle.

Register-indirect addressing with post-modification provides modulo addressing, a special mode for accessing two-dimensional arrays, which can support reordering the output data generated by radix-2 or radix-4 FFT algorithms.
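Two of the address computations mentioned here can be sketched in Python. The code is a generic illustration with hypothetical function names. Bit reversal is the standard output reordering for a radix-2 FFT, but this excerpt does not describe how the SP-5 addressing hardware actually performs its reordering, so the second half should be read as background rather than as a description of the core.

```python
def modulo_post_increment(addr, step, base, length):
    """Return the next address in a circular buffer [base, base + length)."""
    return base + (addr - base + step) % length


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Bit-reversed index order for a radix-2 FFT of length n (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


# Stepping past the end of a 4-entry buffer at address 100 wraps back to the start.
wrapped = modulo_post_increment(addr=103, step=1, base=100, length=4)
order8 = bit_reversed_order(8)
```

Doing these computations in the address generation units means the wrap-around or reordering costs no extra instructions in the inner loop.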

#### Pipeline

The SP-5 uses a five-stage pipeline consisting of instruction fetch, instruction decode, operand fetch, execution, and write. Instructions are scheduled by the core in the ID stage. The pipeline splits into two branches after this stage, referred to as the "left" and "right" pipelines. Each of these contains an operand fetch, execute, and write stage. The instruction decode stage attempts to issue one instruction to each pipeline for every instruction cycle.

Although the SP-5 is a dual-issue architecture, the instruction decode stage cannot always issue two instructions at once. The instruction decoder detects all instruction sequences that would cause a data dependency pipeline hazard and inserts one or more NOPs into the left or right pipeline to prevent the hazard from corrupting the computation.
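The effect of data dependencies on dual issue can be illustrated with a toy Python model. It is a deliberate simplification and not a description of the SP-5 decoder: it considers only adjacent instruction pairs, ignores pipeline latency and resource conflicts, and uses a hypothetical `(dest, sources)` instruction representation.

```python
def count_issue_cycles(program):
    """Count cycles for a simplified in-order dual-issue machine.

    `program` is a list of (dest, sources) tuples. Two adjacent instructions
    issue together unless the second reads or rewrites the first one's
    destination; in that case the second slot is left empty (a NOP).
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program):
            dest, _ = program[i]
            next_dest, next_sources = program[i + 1]
            independent = dest not in next_sources and dest != next_dest
            if independent:
                i += 2          # both slots used this cycle
                continue
        i += 1                  # only one instruction issued
    return cycles


# Same four instructions in two different orders.
dependent = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")),
             ("r6", ("r7", "r8")), ("r9", ("r6", "r0"))]
reordered = [dependent[0], dependent[2], dependent[1], dependent[3]]

cycles_dependent = count_issue_cycles(dependent)
cycles_reordered = count_issue_cycles(reordered)
```

In this model the original order takes three cycles and the reordered version takes two, even though both compute the same results. This is the sense in which instruction order affects speed but not correctness on a superscalar core.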

#### Instruction Set

The SP-5 fetches two 32-bit instruction words at a time and can execute up to two instructions in parallel. Because instructions are scheduled by the processor at run-time, data dependencies will not affect the functionality of a block of code. However, since the core will insert NOP instructions to ensure proper execution, execution time does depend on data dependencies and resource conflicts. Although the processor handles instruction scheduling, the order in which instructions appear in the code can affect performance. The assembler does not provide information regarding potential conflicts; thus, the programmer must carefully analyze the simulator output to optimize code.

The basic SP-5 assembly language instruction uses a C-like algebraic syntax for nearly all arithmetic and multiply instructions and a C-like functional form (i.e., `funct_name(args)`) for other operations. The instruction set provides flags to control when the status register is updated and to conditionally execute instructions.

The orthogonality of the SP-5 instruction set is generally high, but there are some instructions in which addressing modes or operations are restricted. For example, in the register-indirect addressing mode, only registers 0 through 7 may be used for source operands.

## **Benchmark Performance**

Inside the 3DSP SP-5 includes extensive benchmark results, used to quantitatively evaluate the processor's DSP performance. For each benchmark, BDTI reports cycle counts, execution times, and memory usage. BDTI also provides extensive analysis that compares the benchmark implementation used on the SP-5 to those of the other processors. Explanation is provided when a processor's performance is lower than expected based on a high-level view of the architecture. In this section, we present sample execution time and memory usage results taken from the complete set of results in the report.

## **Execution Time**

To determine the execution time of a particular benchmark on a given processor, the number of instruction cycles the processor requires to execute the benchmark is multiplied by the processor's instruction cycle time. *Inside the 3DSP SP-5* includes tables and charts illustrating the number of cycles required by each processor to execute each benchmark and uses these results to generate corresponding tables and charts for execution times.
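The calculation can be written out in a few lines of Python. The cycle counts below are hypothetical values chosen only to show the trade-off between cycle count and clock speed; they are not BDTI Benchmark results.

```python
def execution_time_us(cycles, clock_mhz):
    """Execution time in microseconds: cycle count times cycle time (1 / clock)."""
    return cycles / clock_mhz


# Hypothetical cycle counts, chosen only to illustrate the trade-off.
core_a = execution_time_us(cycles=11_000, clock_mhz=220)   # fewer cycles, slower clock
core_b = execution_time_us(cycles=12_000, clock_mhz=300)   # more cycles, faster clock
```

Here the core with the lower cycle count still finishes later (50 µs versus 40 µs) because its clock is slower, which is why cycle counts and execution times are reported separately.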

The execution time results for the SP-5 were obtained using a projected core clock speed of 220 MHz, the expected speed of the SP-5 demonstration chip. The core is synthesizable, and the number of cycles required to execute a benchmark and the clock speed are implementation dependent (e.g., different memory systems may yield different benchmark cycle counts). Refer to the full report for a discussion of the trade-offs associated with different implementations of the SP-5.

## Sample Benchmark Results

The execution time results for BDTI's FFT benchmark are shown in the figure above. As the figure illustrates, at 220 MHz the SP-5 is faster on this benchmark than the 200 MHz LSI402ZX but somewhat slower than the other processors featured in this report. The SP-5 implementation of this benchmark makes extensive use of the core's specialized addressing modes, complex multiply support, and SIMD adders.

The cycle count (not shown) of the SP-5 on the FFT benchmark is below average for this group of processors, but the projected clock speed of the SP-5 demonstration chip is lower than that of most of the other processors. As a result, the execution time of the SP-5 is slower than that of most of the other processors in spite of its fairly low cycle count. For example, the SP-5 has a lower cycle count than the ADSP-21535, but when this result is combined with the core's projected 220 MHz instruction cycle rate, the SP-5 execution time is slower. The MSC8101 has the lowest cycle count on this benchmark and the highest clock speed. These factors combine to make its execution time significantly faster than those of the other processors on this benchmark.

## Memory Use

Execution speed is often the primary metric used to compare processors. However, a processor's memory usage is also important. For example, the memory requirements of an application can have a significant impact on overall system cost. In addition, processors may experience significant performance degradation when application code and data do not fit in on-chip memory. Because of these and other factors, memory efficiency is an important metric in processor selection. For each of the BDTI Benchmarks, BDTI reports each processor's program, constant data, non-constant data, and total memory use.

## About the BDTI Benchmarks<sup>™</sup>

The BDTI Benchmarks are a set of DSP software functions that BDTI has independently designed to provide an objective basis for comparing processor performance characteristics such as speed and memory use for DSP applications. The BDTI Benchmark functions are implemented in assembly language to allow a realistic assessment of processor DSP performance. The resulting software is then verified for functional correctness, optimality, and adherence to the BDTI Benchmark specifications. Benchmark performance results are obtained either through manual analysis and careful, detailed simulation, or by measurement on sample devices.

## **Control Benchmark**

The BDTI Benchmarks<sup>™</sup> include one benchmark function specifically designed to evaluate memory use for control-oriented software. Control-oriented tasks usually constitute the bulk of an application's program memory requirements, but only a fraction of the application processing time. Thus, in control-oriented tasks, memory use is usually a more serious concern than execution speed.

BDTI's Control benchmark is designed to represent control-oriented code. While most of the BDTI Benchmarks<sup>™</sup> are optimized primarily for maximum speed, BDTI's Control benchmark is optimized for minimum memory usage. This optimization hierarchy mirrors the approach generally followed by control-code programmers. Note that memory usage results on the Control benchmark are not necessarily indicative of processor memory use in signal-processing-intensive code.

#### Sample Benchmark Results

The memory usage results for BDTI's Control benchmark are shown in the figure at right. The SP-5 has the highest memory usage of the processors in this report. This is primarily due to the core's exclusive use of 32-bit instructions; all other processors in this report support instructions that are smaller than 32 bits. In addition, the SP-5 memory usage is increased by the core's limited support of immediate operands.

## Summary

Based on the BDTI Benchmark<sup>™</sup> results for the SP-5 running at 220 MHz, the core's speed will be lower than that of the quad-MAC SC140 at 300 MHz but significantly higher than that of the LSI ZSP400 core at 200 MHz. However, a particular implementation of the SP-5 may achieve higher or lower performance than that of the demonstration chip. For example, 3DSP expects that fabrication in a 0.13-micron process will achieve a clock speed of 320 MHz under typical conditions. By using the cycle count tables provided in the full report, the execution times for each benchmark can be calculated for this or any other clock speed to allow customized comparisons of processor speed.
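As a worked example of such a rescaling, suppose a benchmark's cycle count $N$ stays the same at the higher clock speed. This is an assumption, since a different memory system could change the cycle count. With execution time $t = N / f$, the ratio of execution times at the two projected clock speeds is:

$$
\frac{t_{320}}{t_{220}} = \frac{N / 320\ \text{MHz}}{N / 220\ \text{MHz}} = \frac{220}{320} = 0.6875
$$

Under that assumption, every benchmark would run in about 69% of its 220 MHz execution time, a reduction of roughly 31%.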

On the basis of architectural efficiency, the benchmark cycle counts of the SP-5 indicate that it is among the more powerful DSP-oriented cores currently available. One reason for its relatively low cycle counts is its ability to execute four MACs per cycle. This capability makes it much more powerful than cores from ARM, for example, which have only one multiplier. Most other DSP-oriented cores are dual-MAC architectures; one exception is the BOPS ManArray architecture, which can execute more than four MACs per cycle.

As is always the case with new architectures, the success of the SP-5 will depend to a large extent on the availability of competent development tools, particularly the compiler.

The SP-5 appears to be a very competent architecture; however, for the processor to succeed, 3DSP will need to establish its credibility by demonstrating working silicon and building its portfolio of licensees.

#### **Order Form**

Inside the 3DSP SP-5: A BDTI Technical Evaluation

Mail this form along with a check or fax with purchase order to:

Berkeley Design Technology, Inc.<br>2107 Dwight Way, Second Floor<br>Berkeley, CA 94704 USA<br>Tel: +1 (510) 665-1600<br>Fax: +1 (510) 665-1680<br>Email: info@BDTI.com

| Description                                            | Qty |   | Price  |   |  |
|--------------------------------------------------------|-----|---|--------|---|--|
| First copy                                             | 1   | x | \$1500 | = |  |
| Additional                                             |     | x | \$650  | = |  |
| Tax (for CA orders)                                    |     |   |        | = |  |
| International orders add \$75 for shipping & handling |     |   |        | = |  |
| TOTAL                                                  |     |   |        | = |  |

- Name:
- Title, Division:
- Company:
- Address:
- City, State, Zip, Country:
- Tel:
- Fax:
- Email:

Payment

- Check enclosed, payable to Berkeley Design Technology, Inc.
- Purchase order, copy attached
- Credit Card (contact BDTI for instructions)

International orders must be prepaid in US dollars. Contact BDTI at info@bdti.com for volume discounts.