Case Study: Maximizing DSP Software Performance on ARM Processors

A decade ago, ARM processors were mainly found in cell phones, disk drives, and a few other specialized applications. These days, they seem to be everywhere, from microcontrollers to tablet PCs. Over the same period, digital signal processing (DSP) tasks such as multimedia and communications functions have become increasingly common in a wide range of systems. Given these two trends, it's no surprise that there's been a big uptick in products using ARM processors to implement digital signal processing tasks.

The good news is that some of the newer ARM cores have strong DSP capabilities, such as the NEON SIMD (single instruction, multiple data) instructions in the Cortex-A8 and Cortex-A9 cores. The bad news is that it can be difficult to tap that performance potential. ARM does provide a vectorizing compiler, but there are real limits to what even a good compiler can achieve, particularly for DSP algorithms. For example, most compilers have a hard time making good use of SIMD capabilities. And even when a compiler does find a way to use SIMD instructions, the resulting code can wind up being slower than the equivalent code without them. Among other causes, this can happen when the compiler cannot determine key information about loop lengths and data structure alignment, and so has to wrap the vector loop in extra run-time checks and cleanup code.
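To make the point about loop lengths and alignment concrete, here is a minimal sketch in C of how a programmer can hand that information to a vectorizing compiler. The kernel, its name (scale_add), and its contract are hypothetical and are not taken from any particular project; `__builtin_assume_aligned` is a GCC/Clang extension, so other toolchains would need their own equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel for illustration: out[i] = in[i] * gain + offset.
 * Assumed contract (our own, not from the article):
 *   - n is a multiple of 4
 *   - in and out are 16-byte aligned and do not overlap */
void scale_add(int32_t *restrict out, const int32_t *restrict in,
               int32_t gain, int32_t offset, size_t n)
{
    /* Debug-build check of the length assumption. */
    assert(n % 4 == 0);

    /* GCC/Clang extension: promise the compiler 16-byte alignment,
     * so it need not generate code for misaligned starts. */
    const int32_t *src = __builtin_assume_aligned(in, 16);
    int32_t *dst = __builtin_assume_aligned(out, 16);

    /* Masking the low bits makes the "multiple of 4" trip count
     * visible to the compiler, so no scalar cleanup loop is needed. */
    size_t n4 = n & ~(size_t)3;

    for (size_t i = 0; i < n4; i++)
        dst[i] = src[i] * gain + offset;
}
```

Whether these hints actually produce faster code depends on the compiler, its version, and the optimization flags, so it is worth inspecting the generated assembly rather than assuming the loop was vectorized well.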

As a result, getting good DSP performance from an ARM core typically requires careful manual code optimization based on detailed knowledge of the application and the processor. If you're a DSP software engineer, this probably isn't news to you. But if you're new to implementing DSP applications and are accustomed to applications that are well-suited for compilers, this may come as an unpleasant surprise.

BDTI recently completed a DSP software optimization project on an ARM processor for an audio algorithm company. The company needed its algorithm to run in as few MIPS as possible on an ARM core, and the compiled version of the code simply wasn't cutting it. BDTI was able to leverage its expertise in ARM architectures and DSP algorithms to identify many opportunities for assembly-level optimization. Using NEON SIMD operations (including 4x32-bit load/store, addition, and multiplication operations), BDTI engineers sped up key algorithm kernels by as much as 5x. The net result? In four short weeks, the algorithm was running with the required performance, and the algorithm company was able to meet the needs of its customer.
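For readers who have not seen NEON code, the following is a rough sketch of what 4x32-bit load, multiply, add, and store operations look like. It is not BDTI's code: the kernel (mul_add_i32) is hypothetical, and it uses C intrinsics from arm_neon.h rather than the hand-written assembly described above. A plain C loop handles leftover elements and also serves as the fallback when the code is built for a target without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical kernel for illustration: out[i] = a[i] * b[i] + c[i]. */
void mul_add_i32(int32_t *out, const int32_t *a, const int32_t *b,
                 const int32_t *c, size_t n)
{
    assert(out && a && b && c);

    size_t i = 0;

#if HAVE_NEON
    /* Process four 32-bit values per iteration. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);      /* 4x32-bit load     */
        int32x4_t vb = vld1q_s32(b + i);
        int32x4_t vc = vld1q_s32(c + i);
        int32x4_t prod = vmulq_s32(va, vb);   /* 4x32-bit multiply */
        int32x4_t sum = vaddq_s32(prod, vc);  /* 4x32-bit add      */
        vst1q_s32(out + i, sum);              /* 4x32-bit store    */
    }
#endif

    /* Scalar loop: leftover elements, or everything on non-NEON builds. */
    for (; i < n; i++)
        out[i] = a[i] * b[i] + c[i];
}
```

Intrinsics still leave register allocation and instruction scheduling to the compiler, which is one reason hand-written assembly can do better on the most critical kernels.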

Find out how BDTI can help you get the most performance out of your ARM DSP software: contact Jeremy Giddings at giddings@bdti.com, or phone us at 1-510-451-1800.