Go Beyond The Kernel – Refocusing HPC Benchmarking on Total Application Performance

Want to improve application performance by 10x or 100x? Few HPC customers would say no. Yet in some cases, the promises of tremendous performance improvements from accelerators, attached processors, field-programmable gate arrays, and the like evaporate when total application performance is evaluated. Benchmarks that focus on kernel or even partial application performance provide an incomplete picture of the impact on total application performance. Although it is difficult, HPC customers should test total application performance.

Why benchmark?

Benchmarking is an essential means for helping end users choose and configure HPC systems. An end user has a problem and needs to know the best way to solve it. More specifically, the end user has a specific workload to run and needs to find hardware that can deliver the best performance, reliability, application portability, and ease of application maintenance. As Purdue University researchers wrote in a recent IEEE article that argued for real application benchmarking, an HPC benchmark should, among other things, produce metrics that help customers evaluate the overall or total time to solution for their problems.

The claim of a 10x to 100x improvement from a particular product can easily grab someone’s attention. But what does that 10x measurement really mean? In many cases, these claims are derived from kernel or partial application benchmarking, which might fail to tell the whole story. While an increase in floating-point performance or the addition of a CPU accelerator could deliver a significant improvement for one kernel, the total application improvement depends on additional HPC system elements. As one participant argued at a recent HPC conference reported by IDC, solution time can be represented as an equation:

Solution time = processing time + memory time + communication time + I/O time

All four terms combine to form total application time. The caveat emptor here is to analyze your application: understand what you are measuring, and make sure you have a balanced architecture that can deliver the best performance. I am biased, but I think Intel’s soon-to-be-announced Nehalem processor delivers just that.
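To see why a kernel-level gain can shrink at the application level, the solution-time equation can be turned into a quick calculation. The sketch below is illustrative only: the function name `total_speedup` and the time breakdown are assumptions made up for this example, not measurements from any real system. It speeds up only the processing term by 10x and leaves the memory, communication, and I/O terms unchanged.

```python
def total_speedup(components, accelerated, factor):
    """Overall speedup when one term of the solution-time sum is divided by `factor`."""
    before = sum(components.values())
    # Only the accelerated term shrinks; the other terms stay as they were.
    after = before - components[accelerated] + components[accelerated] / factor
    return before / after


# Hypothetical time breakdown in seconds (illustrative numbers, not measurements).
baseline = {
    "processing": 50.0,
    "memory": 20.0,
    "communication": 20.0,
    "io": 10.0,
}

speedup = total_speedup(baseline, "processing", 10.0)
print(f"Total application speedup: {speedup:.2f}x")  # prints 1.82x
```

With this assumed breakdown, a 10x faster processing term yields less than a 2x improvement in total solution time, because the remaining three terms now dominate.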

Kernel benchmarking has its place, but benchmarking total (or “real”) application performance is critical for accurately evaluating HPC systems.