‘Accelerators’ Then and Now – What Has Really Changed?

I have been around the supercomputing market for over 25 years and have had an opportunity to see some interesting ideas come and go.  Let me share two that I experienced firsthand.

  • CDC’s Cyber 205 versus the Cray 1S.  Both machines offered effective vector processing; however, converting code between them could require significant algorithmic changes. Cray, of course, won the HPC race of that era.  Note that the Cyber 205 was a tremendous performer when you could keep its extremely long vector pipeline busy. However, a single branch or gap in the vector stream would flush the vector unit, and whatever performance advantage you appeared to have over a Cray 1S was quickly erased.
  • An early-day accelerator vendor was Floating Point Systems.  In particular, the FPS 164 was an impressive “offload” system that satisfied the needs of a few users with better throughput than the Cray X-MP and Y-MP of the day. Convex had a better idea: its systems served the needs of more users than an FPS 164 could, and it was simpler to develop, maintain, and scale software on them to next-generation systems.

So what are the lessons from history? Perhaps it is that there is a tight connection between applications, architectures, and algorithms, and that it is extremely important to maintain a level of application flexibility and versatility so that you can adopt new architectures as they become available in the market.  The old adage still holds true: software will outlive the useful life of hardware.  So it is important to be able to adapt quickly to new shifts.

The same questions probably still apply today as they did when Cray, CDC and FPS were around.

When does an accelerator computing strategy work best?

The easiest answer: if your application is extremely data parallel in nature, then it may be well suited to an accelerator strategy. The word extremely is the critical part.

If your application exhibits only some data parallelism and also relies on task-, thread-, and cluster-level parallelism, or contains even a small fraction of branching, or must handle irregular data sizes, then an accelerator may not be the best fit.
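
To make that concrete, here is a small, hypothetical sketch of my own (not drawn from any particular application): the first loop is the uniform, branch-free work an accelerator digests well; the second, with its data-dependent branch and irregular row lengths, is the kind of code that tends to fall off the fast path.

    /* Hypothetical sketch: "extremely data parallel" versus not quite. */
    #include <stddef.h>

    /* Accelerator-friendly: every element gets the same branch-free work. */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Harder to accelerate: a data-dependent branch plus irregular row
     * lengths, so the lanes of a wide vector unit diverge and sit idle. */
    void row_sums(size_t rows, const size_t *row_len,
                  const float *const *row, float *out)
    {
        for (size_t i = 0; i < rows; i++) {
            float acc = 0.0f;
            for (size_t j = 0; j < row_len[i]; j++)
                acc += (row[i][j] > 0.0f) ? row[i][j] : 0.5f * row[i][j];
            out[i] = acc;
        }
    }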

How much real performance will an accelerator strategy deliver?

Oftentimes we hear claims of 10X, 20X, or even greater than 30X.

These are great headlines, but as many have noted, you need to understand an accelerator’s impact on the total execution time of your application.  What may have been 10X to 30X or more on one kernel of the application may deliver a mere 2X to 3X, or even less, in total application performance improvement.
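
To put a rough formula on it (the numbers here are hypothetical, chosen only to show the arithmetic): if the offloaded kernel accounts for a fraction p of total run time and the accelerator speeds it up by a factor s, the whole application speeds up by only 1 / ((1 - p) + p/s).  With p = 0.5 and s = 20, that is 1 / (0.5 + 0.025), or roughly 1.9X – the familiar Amdahl’s Law ceiling.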

Of course, the real question is: what are we really comparing these performance speed-ups to?

I have seen well-tuned software on accelerators compared to “baseline” code running on one core of an old processor.  However, when you use the software technology that is already available, turn on the compiler optimization flags, and add in a math kernel library call, performance on multi-core solutions can jump by over 10X, and in some cases by more than 30X, in total execution time.  This standards-based, accelerated software will also scale forward as newer microarchitectures become available from Intel.
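
To show what “turn on the flags and call a math kernel library” can look like in practice, here is a minimal sketch of mine (not from the original post).  It assumes a CBLAS implementation such as Intel MKL is installed and linked; the exact headers and build flags vary by compiler and library.

    #include <stddef.h>
    #include <cblas.h>   /* MKL users would include <mkl.h> instead */

    /* Baseline: a naive, single-threaded triple loop. */
    void matmul_naive(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }

    /* Tuned: hand the same work to a threaded, vectorized library routine.
     * Build with optimization on, e.g. "cc -O3 -march=native ..." plus the
     * link line your BLAS vendor documents. */
    void matmul_blas(size_t n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    (int)n, (int)n, (int)n, 1.0, A, (int)n, B, (int)n,
                    0.0, C, (int)n);
    }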

Why is the difference between the promise and the actual performance so great?

Always a good question.

The promise deals with a small part, or kernel, of the software that is data parallel and can potentially scale linearly as more compute resources are added.  Again, if the application is extremely data parallel, then an accelerator strategy may be the correct approach.

However, when the actual result, measured as total application performance, is significantly different, it is often because of one of several things.

  • One common reason is that un-optimized software on a multi-core system is being compared to optimized software on an accelerator.  When I compare similarly optimized software on a multi-core system, that 20X to 30X difference often fades to less than 2X, and in most cases the multi-core solution comes out ahead of the hardware accelerator.  This is because optimized software on a multi-core solution accelerates all components of the application, not just one kernel.
  • Another situation is the bandwidth imbalance at the accelerator’s attach point: the attach speed typically does not match the memory bandwidth or the ALU speed of the accelerator, so the theoretical peak flops are tough to achieve.  Sometimes, for larger workloads, performance deteriorates because of the limited amount of memory on the accelerator card (see the rough arithmetic after this list).
  • Another situation may be that your application depends on other forms of parallelism, including task-, thread-, or cluster-level parallelism, and in some cases on sequential sections of your software.
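
A rough, hypothetical illustration of the attach-point issue (my numbers, invented only for the arithmetic): suppose a kernel takes 0.2 seconds on the host and 0.05 seconds on the accelerator, a 4X kernel speedup.  If moving the data to the card and back costs another 0.2 seconds over the attach link, the offloaded path takes 0.25 seconds in total and is now slower than simply staying on the host.  Unless the kernel does enough work per byte moved to amortize the transfer, the attach point, not the ALUs, sets the speed limit.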

So back to the differences in performance between the Cray 1 and CDC Cyber 205.

While the Cyber 205 was great at the edges of science, the Cray proved to be the workhorse of high performance computing: it offered better system balance than the Cyber 205.  Here is an example. If you take great care to optimize your software for a particular architecture, you will no doubt see tremendous performance gains.  However, as with the Cyber 205, if you break that pipeline you pay the overhead of restarting the long vector pipeline.  Oftentimes, even with today’s accelerators, that start-up cost reduces what appear to be the stellar performance gains of a Cyber 205 to no better than, and sometimes even slower than, a Cray 1S.  There were of course examples with the Cyber 205, as there are today with accelerators, where select sciences can see tremendous advantages over traditional computing solutions.

What other considerations may weigh in your decision to adopt an accelerator strategy?

Are you constantly refining your software?

Many researchers would probably answer yes.  They are constantly refining their software to improve the results, the performance, or both.

As I mentioned at the beginning of the blog, the old adage still holds true: software will outlive the useful life of hardware.  So it is important to be able to adapt quickly to new shifts.  One way to simplify these moves is to use standards-based tools, compilers, and libraries, which give you the flexibility to create applications that exploit the multiple types of parallelism mentioned above.  Standards-based tools also give you the versatility you need to scale your software across multiple architectures – e.g. large, many, and heterogeneous cores.  A small sketch of what that can look like follows below.
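
As one hedged illustration of “standards-based” (a sketch of mine, not a description of Intel tooling specifically): OpenMP lets a single source file express both thread-level and SIMD-level parallelism, and the same code compiles serially when the pragmas are ignored, so it can follow the hardware forward.

    #include <stddef.h>

    /* Dot product: threads across cores, SIMD lanes within each core, and a
     * portable reduction.  Build with, e.g., "cc -O3 -fopenmp dot.c"; without
     * OpenMP the pragma is ignored and the loop simply runs serially. */
    double dot(size_t n, const double *x, const double *y)
    {
        double sum = 0.0;
        #pragma omp parallel for simd reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }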

The caveat with non-standard tools is that you become locked into a specific architecture.  If that architecture happens to change, even when it comes from the same vendor, you may be required to make significant changes (e.g. re-tuning grain sizes).

Do you want to maintain, support and update multiple code bases?

I don’t.  I want to invest in the development of parallel algorithms.  The old adage that software will far outlive any hardware implementation still applies, and I need the flexibility and versatility to adopt new architectures as quickly and painlessly as possible when they become available.  I do not want to invest in maintaining, supporting, and updating an ever-increasing set of code streams as newer architectures appear.

Our team’s goal at Intel is to develop software tools and hardware technology that help you scale your application performance forward to future platforms without requiring a massive rebuild – just drop in a new runtime optimized for the new platform to get the improvement (akin to the printer/display driver model: buy a new printer or display, install the respective driver, and your system enjoys the improved capabilities).  That is the goal.

If you want to learn more about what we are doing to deliver high-performing HPC solutions that are both flexible and versatile, please visit www.intel.com/go/hpc.