The Next Giant Leap in Cray Adaptive Supercomputing – The Intel Xeon Phi Processor

By Jay Gould, Sr. Product Marketing Manager, Supercomputing Products, Cray Inc.

Cray and Intel collaborate closely for years before announcing new products: consulting on processor features, software tool support, benchmarking, and system integration and bring-up, and working jointly with our early customers on code modernization and performance optimization. Cray’s continued investment in programming environment and system management software innovation, decades of system design expertise, and the integration of the latest processing technology from Intel combine to produce our highest peak-performance supercomputer to date. This giant leap in Cray’s adaptive supercomputing strategy delivers a scalable production platform that supports state-of-the-art multi-core and many-core processing technologies in the same architecture, so users can choose the configuration that gets the best performance out of their diverse applications.

The Intel Xeon Phi processor family – Previously codenamed “Knights Landing”

The world’s most challenging compute applications are not all best served by a single processing node structure. Intel continues to evolve its Intel® Xeon® E5-2600 processor “multi-core” product line and has also introduced the Intel® Xeon Phi™ product family of “many-core” devices. The newest member of this “many-core” family, previously codenamed “Knights Landing”, adds new integrated device features to address performance, memory bandwidth and power efficiency. In addition to a large number of processor cores with more threads per core and a longer vector length, the new device family includes up to 16 GB of integrated on-chip High Bandwidth Memory (HBM) and delivers 3+ teraflops of peak performance per device. The embedded cores are low-power units that optimize performance per watt, and the HBM provides up to 5x the bandwidth of regular DDR memory. This device cries out “parallelism”.

Early Users and Pioneers

Numerous big-name HPC organizations have already publicly announced their commitment to this next step in the evolution of Cray multi-petascale computing, including Argonne National Laboratory (ANL), the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, Los Alamos National Laboratory (LANL), and Sandia National Laboratories (SNL). These leading-edge research facilities continuously pioneer the most challenging compute projects, which are increasingly constrained by growing data volumes and the data I/O movement these extreme applications require.

Users and their Data Intensive Compute Applications

Cray successfully executed early validation of HPC codes on the processor codenamed “Knights Landing” in 2015 and, at SC15, disclosed applications scaling to over 10,000 cores, including GTC, HPGMG-FV, HPL, MILC, miniDFT, miniGHOST, OMB, SNAP, and UMT.

Early customers like NERSC are very systematic about identifying codes that could benefit from the many-core architecture with HBM. With 600+ projects and 6000+ users, NERSC consumes a large number of core hours per year. However, careful analysis revealed that much of that run time is spent in the same top 25-30 base codes, so NERSC is evaluating multiple codes in each of the areas of Advanced Scientific Computing Research, Biological and Environmental Research, Basic Energy Sciences, Fusion Energy Sciences, High Energy Physics, and Nuclear Physics to find which are the best fit for the new many-core device family.

The Cray® XC™ Supercomputer Series

Leveraging these parallel capabilities, the Cray XC Series integrates four of the 64-to-68 core Intel® Xeon Phi™ product family devices as a four-node compute blade to build Cray’s highest-performing systems to date. The hybrid Cray XC supercomputer cabinets can adaptively support both Intel® Xeon® E5-2600 processor and Intel® Xeon Phi™ product family blades in the up to 48 slots available, delivering up to a record 586 teraflops of peak performance per XC cabinet.
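The cabinet figures above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses the numbers quoted in this article and assumes every one of the 48 slots holds a four-node Intel Xeon Phi blade. The variable names are made up for the example.

```python
# Back-of-the-envelope check of the per-cabinet figures quoted above.
# Assumption: all 48 slots are populated with four-node Xeon Phi blades.
SLOTS_PER_CABINET = 48         # "up to 48 slots"
NODES_PER_BLADE = 4            # four-node compute blade
CABINET_PEAK_TFLOPS = 586.0    # quoted peak per XC cabinet
CORES_PER_NODE_RANGE = (64, 68)  # 64-to-68 core devices

nodes_per_cabinet = SLOTS_PER_CABINET * NODES_PER_BLADE
peak_per_node_tflops = CABINET_PEAK_TFLOPS / nodes_per_cabinet
cores_per_cabinet = tuple(nodes_per_cabinet * c for c in CORES_PER_NODE_RANGE)

print(f"nodes per cabinet: {nodes_per_cabinet}")
print(f"implied peak per node: {peak_per_node_tflops:.2f} teraflops")
print(f"cores per cabinet: {cores_per_cabinet[0]} to {cores_per_cabinet[1]}")
```

Under that assumption, the implied per-node peak is slightly above 3 teraflops, which lines up with the "3+ teraflops per device" figure quoted earlier.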

With on-node DDR4 memory as well as the on-chip integrated HBM on this new compute blade, the Cray XC supercomputer flexibly supports a variety of memory modes for diverse codes and workloads.

The Cray® XC™ Series Software Advantage

Cray has invested decades in developing a robust software tool chain. Combined with years of “many-core” expertise, this pays off in support for the new device’s higher core counts, expanded threads per core, and wider vector unit. To optimize code execution performance, Cray provides a software stack that accelerates time to insight: the CrayPat and Apprentice tools ease code analysis and identify bottlenecks, Reveal provides porting assistance recommendations, and the proven Cray compiler delivers auto-vectorization. The compiler also supports optimization for parallelism via directive-based OpenMP programming (making use of ~10x more threads) and AVX-512 instructions, which double the vector length (up to an 8x speed increase). The Cray performance-enhancing software tools help applications get the most out of the silicon innovations of the Intel® Xeon Phi™ processors.
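To see where an "up to 8x" figure for AVX-512 can come from, it helps to count vector lanes. The sketch below is an illustration, not a Cray or Intel formula: it assumes 64-bit double-precision elements and compares against scalar code, and it uses an Amdahl's-law style bound to show why real codes typically see less than the lane count. The function names are hypothetical.

```python
# Illustrative lane arithmetic for 512-bit vectors.
# Assumptions: 64-bit double-precision elements; baseline is scalar code.

def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements processed by one vector instruction."""
    return vector_bits // element_bits

def speedup_bound(vector_fraction: float, n_lanes: int) -> float:
    """Amdahl-style upper bound when only part of the run time vectorizes."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / n_lanes)

dp_lanes_512 = lanes(512, 64)  # doubles per 512-bit vector
dp_lanes_256 = lanes(256, 64)  # doubles per 256-bit vector, for comparison

print(f"doubles per 512-bit vector: {dp_lanes_512}")
print(f"doubles per 256-bit vector: {dp_lanes_256}")
for fraction in (1.0, 0.9, 0.5):
    bound = speedup_bound(fraction, dp_lanes_512)
    print(f"vectorized fraction {fraction:.1f}: at most {bound:.2f}x")
```

The bound only reaches the full lane count when essentially all of the run time is spent in vectorized loops, which is why profiling and vectorization feedback tools matter for codes moving to wider vectors.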

The integrated High Bandwidth Memory can be configured as a cache or as directly addressable fast memory, and Cray has created a flexible software feature that lets a user configure the nodes at job launch. This capability enables Cray XC supercomputers to support a spectrum of use modes, from new code creation, to application tuning, to rebuilding, all the way to simply loading and executing preexisting ISV codes.
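As a rough way to think about choosing between the two configurations, a common rule of thumb is to compare an application's per-node working set with the HBM capacity. The sketch below encodes only that rule of thumb. It is not Cray's launch-time feature, the function name and the mode labels are hypothetical, and the right choice for a given code is best confirmed by measurement.

```python
# A simple rule-of-thumb sketch, not Cray's job-launch interface.
# Assumption: the per-node working set size is known in gigabytes.
HBM_CAPACITY_GB = 16.0  # "up to 16 GB" of integrated HBM per device

def suggest_memory_mode(working_set_gb: float,
                        hbm_gb: float = HBM_CAPACITY_GB) -> str:
    """Suggest how to configure HBM for a given per-node working set.

    Returns "addressable" when the whole working set fits in HBM, so the
    data can live entirely in the fast memory, and "cache" otherwise, so
    the HBM transparently caches the larger DDR4-resident data set.
    """
    if working_set_gb <= 0:
        raise ValueError("working_set_gb must be positive")
    if working_set_gb <= hbm_gb:
        return "addressable"
    return "cache"

for size_gb in (8.0, 16.0, 64.0):
    print(f"{size_gb:5.1f} GB working set -> {suggest_memory_mode(size_gb)}")
```

Codes that already run unmodified, such as prebuilt ISV applications, are a natural fit for the cache configuration because it needs no source changes, while tuned codes can place bandwidth-critical data in the directly addressable memory.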

Optimizing codes with best-in-class software tools and identifying the optimal configuration of multi-core/many-core processors and memory assignment for a specific application will result in faster execution. Better system performance and memory bandwidth mean faster time to results, more iterations per research session, and/or more codes run in any given period of time. In the end, you can use one heterogeneous Cray® XC™ supercomputer and “adapt” the best compute configuration of “multi-core” and/or “many-core” processors to your specific application needs.