Jeff Bier’s Impulse Response—Bamboozling with Benchmarks, Part 1

Submitted by Jeff Bier on Wed, 02/22/2006 - 17:00

My colleagues and I at BDTI believe very strongly in benchmarks. We’ve been developing and implementing signal processing benchmarks for over a decade, and we know that good benchmarks play an essential role in evaluating processing engines. You can see, then, why we get bent out of shape when benchmarks are used misleadingly. This happens pretty regularly in vendor marketing materials, but we’ve also seen it in training classes and technical articles.

Most people don’t set out to use benchmark results deceptively; they may simply be unaware of benchmarking pitfalls. (This is why BDTI requires people who want to publicize our benchmark results to submit the relevant pages for approval—we want to make sure that our benchmarks are not used to mislead.)

After being on the receiving end of hundreds of presentations that have included benchmarks from various sources, we’ve seen the same problems over and over. If you learn what to look for, you won’t be misled—and you won’t inadvertently create misleading materials yourself. To this end, we’ve created our Top Ten list of ways in which benchmarks are abused. In this column we’ll tell you about four of these; next month’s column will cover the rest. In no particular order:

  1. Comparing internal benchmark results to those generated by other sources. This practice is common partly because it’s convenient—a vendor only has to code a benchmark for its latest processor, and can grab competitors’ results off the Web or elsewhere. That isn’t always a bad thing, but it should be viewed warily. The problem is that benchmarks from different sources are unlikely to be truly comparable. For example, some FFT benchmarks include descrambling; others don’t (the first sketch after this list illustrates the difference). Some assume that instructions and data reside in on-chip memory; others don’t. And in general, assumptions such as these are not clearly noted—so a vendor may not even realize that its internal results are not comparable to those published by competitors.
  2. Making claims that aren’t supported by the benchmarks. For example, saying that a benchmark shows a processor is “best in class” when a key competitor hasn’t been benchmarked, or claiming that a benchmark shows a chip is “most area efficient” when the benchmark doesn’t consider the board area used by required external components.
  3. Using benchmarks implemented in unoptimized, high-level code to predict performance in applications that will be carefully optimized—or vice versa. The implementation approach used for a benchmark should resemble that of the corresponding application. In embedded signal processing applications, for example, developers usually hand-optimize their performance-critical code—and benchmarks for this application space should reflect this.
  4. Using cycle counts to represent speed. Cycle counts are interesting, but by themselves they tell you nothing about a processor’s speed. Processor architects trade off cycle efficiency for clock rate all the time; you need to know both to assess a processor’s execution speed, as the second sketch after this list shows.
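
To make the FFT example in item 1 concrete, here is a minimal sketch in C. The function names (fft_butterflies, bit_reverse) and the harness structure are hypothetical—they aren’t taken from any vendor’s benchmark. The point is simply that two results both labeled “FFT” can measure different amounts of work, depending on whether the descrambling (bit-reversal) pass is included.

```c
/* Hypothetical sketch: two "FFT benchmark" variants that are not directly
 * comparable, because only one includes the descrambling (bit-reversal) pass. */
#include <stddef.h>

typedef struct { float re, im; } cplx;

/* Placeholder for the FFT kernel under test; a real benchmark would call
 * the vendor's optimized butterfly routine here. */
static void fft_butterflies(cplx *x, size_t n) { (void)x; (void)n; }

/* Descrambling: reorder the kernel's bit-reversed output into natural order
 * (n must be a power of two). */
static void bit_reverse(cplx *x, size_t n)
{
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) { cplx t = x[i]; x[i] = x[j]; x[j] = t; }
    }
}

/* Variant A: some published cycle counts measure only this much work... */
void fft_benchmark_without_descrambling(cplx *x, size_t n)
{
    fft_butterflies(x, n);
}

/* ...Variant B: others measure this. Same label, different workload. */
void fft_benchmark_with_descrambling(cplx *x, size_t n)
{
    fft_butterflies(x, n);
    bit_reverse(x, n);
}
```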
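
And to illustrate item 4, the following sketch uses made-up numbers for two hypothetical processors. Execution time is cycle count divided by clock rate, so the processor with the “worse” cycle count can still finish the task sooner.

```c
/* Made-up numbers for two hypothetical processors: execution time =
 * cycle count / clock rate, so a cycle count alone says nothing about speed. */
#include <stdio.h>

int main(void)
{
    double cycles_a = 100.0, clock_a_hz = 150e6;  /* fewer cycles, slower clock */
    double cycles_b = 180.0, clock_b_hz = 600e6;  /* more cycles, faster clock  */

    double time_a_us = cycles_a / clock_a_hz * 1e6;
    double time_b_us = cycles_b / clock_b_hz * 1e6;

    printf("Processor A: %.0f cycles at %.0f MHz -> %.2f microseconds\n",
           cycles_a, clock_a_hz / 1e6, time_a_us);
    printf("Processor B: %.0f cycles at %.0f MHz -> %.2f microseconds\n",
           cycles_b, clock_b_hz / 1e6, time_b_us);

    /* A "wins" on cycle count, but B completes the task in less than half
     * the time, because its clock runs four times faster. */
    return 0;
}
```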

Check the March 2006 Impulse Response for six more ways in which benchmarks can be used misleadingly.
