Are You Realizing the Payoff of Parallel Processing?

By Andrey Vladimirov, head of HPC research at Colfax International

When it comes to high-performance computing, users can be divided into three basic groups. Perhaps the most common and obvious is performance-hungry users who crave faster time to insight on complex workloads, cutting throughput times from days down to hours or minutes. Another class of users seeks greater scalability, which is often achieved by adding more compute nodes. Yet another type of user looks for more efficient systems that consume less energy to do a comparable amount of processing work.

Not coincidentally, all of these situations can benefit greatly from parallel processing, which takes greater advantage of the capabilities of today's multicore processors to improve performance, scalability, and efficiency. And the first step to realizing these gains is to modernize your code.

I will explore the benefits of code modernization momentarily. But first, let’s take a step back and look at the underlying hardware picture.

In the last three decades of the 20th century, processors evolved by increasing clock frequencies. This approach enabled ongoing gains in application performance until processors hit a ceiling: clock speeds of around 3 GHz and the associated heat dissipation issues.

To gain greater performance, the computing industry moved to parallel processing. Starting in the 1990s, people used distributed frameworks, such as MPI, to spread workloads over multiple compute nodes, which worked on different aspects of a problem in parallel. In the 2000s, multicore processors emerged that allowed parallel processing within a single chip. The degree to which intra-processor parallelism has evolved is very significant, with tens of cores in modern processors. For a case in point, see the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), delivered via Intel® Xeon Phi™ coprocessors.

A simultaneous advance came in the form of vector processing, which adds to each core an arithmetic unit that can apply a single arithmetic operation to a short vector of multiple numbers in parallel. At this point, the math gets pretty interesting. Intel Xeon Phi products are available with up to 61 cores, each with 16 vector lanes in single precision. With one core typically reserved for the operating system, 60 cores remain for the application, so in theory the processor can accelerate throughput on a workload by a factor of 60 x 16, a 960x gain, in comparison to running the workload on a single core without vectors (in fact, the correct factor is 2 x 960 because of the dual-issue nature of Knights Corner architecture cores, but that is another story).

And here’s where application modernization enters the picture. To realize gains like this, applications need to be modified to take advantage of the parallel processing and vectorization capabilities in today’s HPC processors. If the application can’t take advantage of these capabilities, you end up paying for performance that you can’t receive.

That said, as Intel processor architectures evolve, you get performance boosts in some areas without doing anything with your code. For instance, such architectural improvements as bigger caches, instruction pipelining, smarter branch prediction, and prefetching improve performance of some applications without any changes in the code. However, parallelism is different. To realize the full potential of the capabilities of multiple cores and vectors, you have to make your application aware of parallelism. That is what code modernization is about: it is the process of adapting applications to new hardware capabilities, especially parallelism on multiple levels.

With some applications, this is a fairly straightforward task. With others it’s a more complex undertaking. The specifics of this part of the discussion are beyond the scope of this post. The big point is that you have to modify your code to get the payoff that comes with a multicore processing platform with built-in vectorization capabilities.

As for that payoff, it can be dramatic. This graphic shows the gains made when an astrophysical application HEATCODE was optimized to take advantage of the capabilities of the Intel platforms. In these benchmarks, the same performance-critical code written in C++ was used on an Intel Xeon processor and on an Intel Xeon Phi coprocessor. Review the study.


Here’s another example of the payoff that comes with code modernization. This graphic illustrates the importance of parallelism and optimization on a synthetic N-body application designed as an educational “toy model.” Review the example.


As these examples show, when code is modernized to take full advantage of today's HPC hardware platforms, the payoff can be enormous. That certainly applies to general-purpose multicore processors, such as Intel Xeon CPUs. However, on top of that, for applications that know how to use multiple cores, vectors, and memory efficiently, specialized parallel processors, such as Intel Xeon Phi coprocessors, can further increase performance and lower power consumption by up to 3x. For details, see this performance-per-dollar and performance-per-watt study.

Intel Xeon Phi coprocessors build on the capabilities of the Intel Xeon platform, which is used in servers around the world. General-purpose Intel Xeon processors are available with up to 18 cores per processor chip, or 36 cores in a dual-socket configuration. These processors are already highly parallel. Intel Xeon Phi coprocessors take the architecture to a new, massively parallel level with up to 61 cores per chip.

A great thing about the Intel Xeon Phi architecture is that code written for the general-purpose Intel Xeon platform can run unmodified on the Intel Xeon Phi coprocessor. But there's a catch: if the code isn't modernized, it can't take advantage of all of the capabilities of the Intel MIC Architecture used in the Intel Xeon Phi coprocessor. This makes code modernization essential.

Once you have a robust version of code, you are basically future-ready. You shouldn’t have to make major modifications to take advantage of new generations of the Intel architecture. Just like in the past, when computing applications could “ride the wave” of increasing clock frequencies, your modernized code will be able to automatically take advantage of the ever-increasing parallelism in future x86-based computing platforms.

At Colfax Research, these topics are close to our hearts. We make it our business to teach parallel programming and optimization, including programming for Intel Xeon Phi coprocessors, and we provide consulting services on code modernization.

We keep in close contact with the experts at Intel to stay on top of the current and upcoming technology. For instance, we started working with the Intel Xeon Phi platform early on, well before its public launch. We have since written a book on parallel programming and optimization with Intel Xeon Phi coprocessors, which we use as a basis for training software developers and programmers who want to take full advantage of the capabilities of Intel’s parallel platforms.

For a deeper dive into code modernization opportunities and challenges, explore our Colfax Research site. This site offers a wide range of technical resources, including research publications, tutorials, case studies, and videos of presentations. And should you be ready for a hands-on experience, check out the Colfax training series, which offers software developer trainings in parallel programming using Intel Xeon processors and Intel Xeon Phi coprocessors.


Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel Corporation in the United States and other countries. * Other names and brands may be claimed as the property of others.

©2015 Colfax International, All rights reserved