Jeff Bier’s Impulse Response—Smartphone Benchmarks: Caveat Emptor

Submitted by Jeff Bier on Wed, 07/10/2013 - 22:01

Smartphones have become the most important application for high-performance, energy-efficient processors (see "ARM's 2015 Mid-Range Platform Prep: A 32-Bit Next-Step" in this month's edition of InsideDSP). That's because smartphones are a huge and growing business, and processors make a big difference in how smartphones perform – and how long their batteries last. As a result, interest has been growing in smartphone processor performance, and there's been quite a bit of benchmarking activity. Unfortunately, many of the benchmarks being used are of questionable value.

As I've been writing for years, there are many ways in which benchmarks can go wrong. A decade ago, I showed that some of the simplistic benchmarks being used to evaluate embedded processors were so bad that you could get a better estimate of a processor's performance by counting the pins on the package. Just because a piece of code can run on a smartphone doesn't mean the performance of that code tells you anything meaningful about the performance of the phone – or the processor. When looking at smartphone benchmark results, skepticism is advised.

One set of benchmark results that has received quite a bit of press attention recently is contained in a report by ABI Research entitled "Intel Apps Processor Outperforms NVIDIA, Qualcomm, Samsung." In reviewing these benchmark results, several potential issues caught my eye. I'll focus on just one in this column: the "RAM" benchmark. I'm especially skeptical of benchmarks that attempt to isolate one specialized element of a processor or system, because it can be very difficult (if not impossible) to relate the performance of that element to the performance of the processor or system as a whole. The RAM benchmark raised exactly that concern.

The published ABI Research report provides no information about the benchmark methodology, but the results appear to be based on the AnTuTu Android benchmark suite – a smartphone benchmark app popular with consumers. The AnTuTu RAM benchmark comprises several tests, one of which is the "bitfield" kernel taken from the NBench benchmark suite created by BYTE magazine in the mid-1990s. To understand the results, BDTI examined the NBench code running on the Lenovo K900 smartphone (based on the Intel Z2580 processor) and the Samsung Galaxy S4 i9500 smartphone (based on the Samsung Exynos 5410 processor).
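To ground the discussion, here's roughly what a bitfield kernel looks like. This is my own simplified sketch of the general pattern, not NBench's actual source: the kernel sets and clears bits in a packed array, and each update is a read-modify-write of a 32-bit word.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified sketch of a bitfield-style kernel (not NBench's actual
 * code): each bit update reads a 32-bit word, modifies one bit, and
 * writes the word back. */
static void set_bit(uint32_t *map, size_t i)
{
    map[i / 32] |= (uint32_t)1 << (i % 32);      /* read, OR in bit, write back */
}

static void clear_bit_run(uint32_t *map, size_t start, size_t count)
{
    for (size_t i = start; i < start + count; i++)
        map[i / 32] &= ~((uint32_t)1 << (i % 32));  /* read, AND out bit, write back */
}

int main(void)
{
    uint32_t map[4] = {0};
    set_bit(map, 37);               /* sets bit 5 of word 1 */
    clear_bit_run(map, 32, 8);      /* clears bits 0-7 of word 1 */
    return (int)map[1];             /* consume the result */
}
```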

What we found was very interesting: on the Intel Z2580 processor, the compiler removed a key element of the benchmark. Where the benchmark source code calls for a read-modify-write operation, the compiler substituted a write operation alone. It's not yet clear why the compiler omitted these steps. It may be a case of dead-code elimination: the compiler analyzes the source code, proves that certain statements cannot affect the program's output, and removes them. That's a good thing in real-world application development, but a bad thing in a benchmark that purports to compare processors in an apples-to-apples manner. What is clear is that the results are not meaningful for comparing the processors: the ARM-based Exynos chip performs every operation specified in the benchmark source code, while the Intel Z2580 skips some.
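To see how this can happen, consider a contrived example (mine, not the actual AnTuTu or NBench code). If the compiler can prove that the value read cannot affect the value written, it is free to drop the read entirely:

```c
#include <stdint.h>
#include <stddef.h>

/* Contrived example (not the benchmark's actual code) of a
 * read-modify-write in the source legally becoming a plain write.
 * OR-ing with all-ones makes the value read irrelevant, so an
 * optimizing compiler may emit a simple store for each element. */
void set_all(uint32_t *map, size_t n)
{
    for (size_t i = 0; i < n; i++)
        map[i] |= 0xFFFFFFFFu;   /* source asks for read, OR, write back */
    /* Optimized result: map[i] = 0xFFFFFFFFu -- the read is gone, and
     * the loop may even be lowered to a memset-style block store. */
}

/* One common defense: a volatile-qualified pointer obliges the
 * compiler to perform every read and write the source specifies. */
void set_all_measured(volatile uint32_t *map, size_t n)
{
    for (size_t i = 0; i < n; i++)
        map[i] |= 0xFFFFFFFFu;   /* both the read and the write survive */
}
```

Compiling with, say, gcc -O2 -S and comparing the generated assembly for the two functions makes the difference visible. Whether a given benchmark is vulnerable depends on how its results are consumed; the broader point is that the source code alone does not determine the work a compiled benchmark actually performs.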

This type of benchmarking pitfall has been well known in the industry for at least 25 years, so it's disappointing to see it still showing up today. The fact that such problems have persisted for decades underscores the need for benchmark users to be skeptical about benchmark results. Unless you understand the benchmark methodology in detail and have confidence that it yields accurate, relevant and fair results, it's unwise to rely on those results.

Stepping back and looking at the bigger picture, we should also ask: would a RAM benchmark be meaningful for comparing processors, even if the apples-to-oranges problem described above were corrected? That's debatable. For processor and system designers, benchmarks that isolate individual elements like RAM bandwidth can be helpful during the design process. Given the variety of complex tasks that users perform on smartphones, however, such narrow tests rarely mean much to a user. Imagine that Phone A has 1.5x the performance of Phone B on a fair RAM benchmark. How will that affect users' experience, for example, when reading email or browsing the web? Will these tasks run 1.5x faster on Phone A? Will the difference be noticeable at all? Given everything else that shapes system behavior, the relationship between RAM throughput and user-visible performance is very hard to predict, as the back-of-the-envelope sketch below suggests.
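As a rough illustration (my own numbers, not measurements from the ABI report), an Amdahl's-law estimate shows how little a 1.5x RAM advantage can matter once the rest of the system is accounted for: if memory access accounts for a fraction f of a task's time, the overall speedup is 1 / ((1 - f) + f / 1.5).

```c
#include <stdio.h>

/* Back-of-the-envelope Amdahl's-law estimate (illustrative only):
 * if memory access accounts for fraction f of a task's time, a 1.5x
 * RAM speedup yields an overall speedup of 1 / ((1 - f) + f / 1.5). */
int main(void)
{
    const double ram_speedup = 1.5;
    for (double f = 0.1; f <= 0.51; f += 0.2) {
        double overall = 1.0 / ((1.0 - f) + f / ram_speedup);
        printf("memory fraction %.0f%% -> overall speedup %.2fx\n",
               f * 100.0, overall);
    }
    return 0;
}
```

Even if memory access accounted for half of a task's time – generous for most interactive workloads – the user would see only a 1.2x difference, not 1.5x.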

If we really want to understand the performance of smartphones and their processors, I believe the answer is not to recycle 20-year-old benchmarks that were never designed for this purpose and have little relevance to how users actually use their smartphones. Instead, I think we need a new benchmark, designed from the ground up to reflect whole-system performance (including battery life) as experienced by the user.

If you're interested in helping to create a better smartphone benchmark, please drop me a line.

Jeff Bier is president of BDTI and founder of the Embedded Vision Alliance. Please post a comment here or send him your feedback at http://www.BDTI.com/Contact. To subscribe to BDTI's monthly InsideDSP newsletter, click here.
