Expanding Resources with Rapid HPC Orchestration in the Cloud

IT_Peer_Network · 06-22-2020

The industry has reached an interesting point for high performance computing in the cloud. You could almost measure this solely by the growing presence of cloud providers at the two major HPC conferences over the past few years. All the major cloud vendors had large, prime real estate on the exposition floor, and many other service providers were spread through kiosks and booths. HPC in the cloud is now one of the hottest topics at these shows, and there is plenty of interest from companies and research institutions eager to figure out how to expand their resource capabilities in fiscally responsible ways.

HPC in the cloud is not new. It’s been building and gaining momentum since the early days of cloud computing, but the past few years have seen dramatic changes and improvements in infrastructure that put HPC in reach of a much broader audience. You find more specialized instances that provide more processing power and memory bandwidth for computationally intensive applications. There’s also more focus on providing higher-performing fabric for scaling out workloads in the cloud. HPC-friendly system orchestration and software services have also advanced by leaps and bounds to help “assemble” cluster instances in the cloud. It’s been great progress, and it’s been really interesting to be part of helping this growth.

That said, migrating HPC applications to the cloud is still easier said than done, even with the advancements in cloud infrastructure. There’s still a learning curve involved in figuring out the interface and mechanisms for spinning up an instance on a particular cloud provider. If you want or need multiple cloud providers, then you have multiple paths to learn. Many service providers and companies help address the learning curve and simplify landing workloads in the cloud, but even they face the same issue. At some point in the workflow, there is a need for a common interface to launch a workload, with a back-end interface to the cloud provider of choice.

Rapid HPC Orchestration (RHOC)

Recently, Intel and Google started collaborating on tooling that would provide a common interface. That resulted in Intel launching an open source project called Rapid HPC Orchestration in the Cloud, or RHOC. Why spin up this effort? As alluded to above, we look at RHOC as an enabler to help users tap into HPC in the cloud. It’s the enzyme that provides an accelerated path for users to realize the collective value of Intel and our cloud partners. Talking a little bit about how RHOC works will help explain the value.

Rather than reinvent the wheel, RHOC builds on two common utilities from HashiCorp as the underlying mechanisms for multi-cloud support. Using Terraform’s infrastructure-as-code approach and Packer’s mechanisms for creating cloud images, RHOC relies on two kinds of templates: one defines the connection to specific instances within a cloud vendor, and the other defines the image to spin up in that vendor.

Users launch jobs directly from the command line, specifying which templates to use and providing their user account credentials for the cloud provider. RHOC then handles setting up a cluster in the cloud provider: it builds the compute image or reuses a prebuilt image, spins up the desired instances in the cloud, configures them as an HPC cluster, and launches the job. By default, it runs a job on an ephemeral cluster and takes care of grabbing your output data and spinning down the instances when the job has completed. It does provide an option to keep the cluster running, though, for when you need the instances to stay active. Just remember to shut it down later!
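
To make that sequence of steps concrete, here is a rough Python sketch of the ephemeral-cluster lifecycle described above. It is not RHOC's code or its command-line interface: RecordingBackend, run_cluster_job, and every method and parameter name are hypothetical, and the backend only records the calls it receives rather than talking to a real cloud. Tearing the cluster down even when the job fails is an assumption of this sketch, not something stated about RHOC.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordingBackend:
    """Hypothetical stand-in for a cloud provider; it only logs each call."""
    prebuilt_image: Optional[str] = None
    log: List[str] = field(default_factory=list)

    def build_image(self, image_template: str) -> str:
        self.log.append(f"build_image:{image_template}")
        return f"image-from-{image_template}"

    def create_instances(self, cluster_template: str, image: str) -> str:
        self.log.append(f"create_instances:{cluster_template}:{image}")
        return f"cluster-from-{cluster_template}"

    def configure_cluster(self, cluster: str) -> None:
        self.log.append(f"configure_cluster:{cluster}")

    def run_job(self, cluster: str, command: str) -> None:
        self.log.append(f"run_job:{command}")

    def fetch_output(self, cluster: str) -> str:
        self.log.append(f"fetch_output:{cluster}")
        return f"output-of-{cluster}"

    def destroy(self, cluster: str) -> None:
        self.log.append(f"destroy:{cluster}")


def run_cluster_job(backend: RecordingBackend, cluster_template: str,
                    image_template: str, command: str,
                    keep_cluster: bool = False) -> str:
    """Run one job on a cluster that is ephemeral unless keep_cluster is set."""
    # Reuse a prebuilt image when one exists; otherwise build it from the template.
    image = backend.prebuilt_image or backend.build_image(image_template)
    cluster = backend.create_instances(cluster_template, image)
    try:
        backend.configure_cluster(cluster)
        backend.run_job(cluster, command)
        return backend.fetch_output(cluster)
    finally:
        # Assumption of this sketch: spin down even if the job raised an error.
        if not keep_cluster:
            backend.destroy(cluster)


if __name__ == "__main__":
    backend = RecordingBackend()
    run_cluster_job(backend, "cluster.tpl", "image.tpl", "mpirun ./app")
    for entry in backend.log:
        print(entry)
```

Passing keep_cluster=True skips the final destroy call, which mirrors the option to keep instances active, and is exactly the case where you have to remember to shut the cluster down yourself.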

Optimized HPC

This means that through RHOC, Intel and our cloud partners can collaborate on providing templates that define optimized instances for HPC and optimized compute node images for a given cloud provider. For example, Intel collaborates with many of the individual cloud providers to optimize and tune fabric drivers utilized by the Intel MPI library. RHOC gives us the mechanism to provide this collaboration in the template for that given provider. Through RHOC, these optimizations become part of the images that users get when launching their jobs on a supported cloud. This is collective goodness that helps streamline access to expanded HPC resources.

Ilias Kastardis, HPC Solution Lead at Google, who contributes to the collaborative efforts, commented, “Google Cloud is looking forward to providing RHOC to our HPC customers, as it will give them a simplified way to set up and manage performance optimized HPC clusters. We believe it will be a valuable toolkit for our customers in running their HPC workloads.”

Google is the first provider included in RHOC, but more are coming. We are also expanding our partnerships to help enrich the feature set of the tool and continue to push out new optimizations within templates.

For a look at RHOC, you can check out this quick demonstration at the Intel® HPC +AI Pavilion, and if you are figuring out how to simplify your experience of running HPC in the cloud, watch this space!

Written by Brock Taylor, Director of HPC Solutions at Intel