Growing Pains: Scaling Deep Learning Inference

Training an effective deep neural network is one thing; deploying it in a way that keeps up with customer demand while remaining performant and cost-efficient is quite another. We’ve combined a heavily optimized software stack with deep learning-enabled hardware to address that challenge.

There’s an exciting change in the mix of problems that machine learning folks talk about. Teams have found their groove with data management and model training, and now have rapidly expanding user bases. Of course, as great as it is to see your user graph go vertical, success comes with new problems.

Deep learning-based products are computationally demanding. As user demand increases, teams can quickly outgrow the software and hardware stack that got them this far. Inference can get expensive quickly, and so can the engineering work required to make inference scale smoothly. And that’s before we even talk about the challenges of monitoring a model’s behavior in production! These problems can pull teams away from their core work of adding features and improving models.

Inference ≠ Training

One common question I get about putting models into production relates to the choice of hardware. Usually, the team and the software stack they’ve built have been shaped by their needs while training models, and are optimized for use with accelerators such as GPUs. Often, they’ll have heard that CPUs can be a great option for inference, will benchmark them (using their accelerator-optimized software stack) and get disappointing results. Is the CPU really the right option for inference? Absolutely, but you do need to use the right hardware/software mix.

Successful examples of inference-on-CPU deployments include many firms you’ve probably heard of. Facebook uses CPUs for inference to serve customized news feeds to its 2 billion users. Health care provider Montefiore is using an Intel® Xeon® Scalable processor-based system for its Patient-centered Analytical Learning Machine (PALM) AI platform, and Philips is using Intel Xeon Scalable processors to perform inference on patients’ X-rays and computed tomography (CT) scans without the need for accelerators. Crucially, the Intel Distribution of OpenVINO™ toolkit lets these teams take a trained model and produce a version optimized for speed of execution on Intel hardware. So although deployment changes, the training pipeline isn’t affected at all.
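To give a sense of what that looks like in practice, here is a minimal sketch of CPU inference with the OpenVINO Python runtime, assuming a model that has already been converted to OpenVINO IR format. Module paths differ between OpenVINO releases, and the model file and input shape below are placeholders.

```python
import numpy as np
from openvino.runtime import Core  # module layout varies across OpenVINO releases

core = Core()

# Read a model already converted to OpenVINO IR format; "model.xml" is a placeholder path.
model = core.read_model("model.xml")

# Compile the model for execution on the CPU device.
compiled_model = core.compile_model(model, device_name="CPU")

# Run one inference request on a dummy image-shaped batch (placeholder shape).
input_tensor = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled_model([input_tensor])[compiled_model.output(0)]
print(result.shape)
```

The key point is that the model itself is unchanged; only the runtime that executes it is swapped for one tuned to the target hardware.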

Inference at Scale

If you’ve had enough success with machine learning to worry about your inference workload, you are past the point where you want to be manually allocating new cores or instances to meet demand. Fortunately, although your product is special, most of the problems you’ll encounter as you design for deployment and monitoring at scale are common to all the machine learning pioneers out there.

Machine learning engineers have already started to converge on a set of best practices for deploying deep neural networks in a way that allows for smooth deployments, easy scaling, and sleep-well-at-night monitoring. Some people are calling it “MLOps”, inspired by the success of the DevOps movement.

At Intel, we’ve taken leading software frameworks for MLOps, worked out how best to integrate them with our inference libraries, and paired this stack with our latest Intel Xeon processors, which feature new instructions for accelerated inference. The result is Intel Select Solutions for AI Inferencing.

The solution is built around the Deep Learning Reference Stack (DLRS), an integrated, high-performance open source software stack packaged into a convenient Docker container. Because it is a pre-validated, pre-configured collection of the required libraries and components, the DLRS reduces the complexity of integrating multiple pieces of software in production AI environments.
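As a rough illustration, launching a DLRS-based container programmatically might look like the sketch below, here using the Docker SDK for Python. The image name, entry-point command, and port are placeholders; you would substitute the DLRS image that matches your framework and release.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Start a container from a DLRS-style image (placeholder tag) and run a
# hypothetical serving script, mapping its port to the host.
container = client.containers.run(
    image="example/dlrs-tensorflow:latest",  # placeholder image name
    command="python serve_model.py",         # hypothetical entry point
    ports={"8080/tcp": 8080},
    detach=True,
)
print(f"Serving container started: {container.short_id}")
```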

The stack also includes highly tuned containers for TensorFlow and PyTorch, along with OpenVINO (our library for high-performance inference of models trained in TensorFlow or PyTorch). We’ve also included Kubeflow to allow teams to smoothly roll out new versions of their models with zero downtime. Plus, Intel Select Solutions for AI Inferencing can use Seldon Core to help manage inference pipelines, speed up inferencing requests between servers, and monitor models for unexpected behavior.
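For a flavor of how a model plugs into that serving layer, Seldon Core’s Python wrapper expects a simple model class with a predict method. The sketch below, with placeholder file names and the same OpenVINO calls as the earlier example, shows roughly what such a class could look like.

```python
import numpy as np
from openvino.runtime import Core


class InferenceModel:
    """Model class for Seldon Core's Python wrapper (placeholder model paths)."""

    def __init__(self):
        core = Core()
        model = core.read_model("model.xml")  # placeholder IR path
        self.compiled = core.compile_model(model, device_name="CPU")
        self.output = self.compiled.output(0)

    def predict(self, X, features_names=None):
        # Seldon passes the request payload as a numpy array; return predictions.
        return self.compiled([np.asarray(X, dtype=np.float32)])[self.output]
```

A class like this is typically packaged into a container and started with the seldon-core-microservice entry point, which wraps it in a REST/gRPC service that Kubernetes and Kubeflow can roll out, scale, and monitor.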

Reducing Complexity

Offering precise, benchmarked software and hardware elements, Intel Select Solutions for AI Inferencing makes commercial deployment much, much easier than sourcing and tweaking individual components on your own. Our customers are already getting the benefits. Recently, a large retailer was able to replace workload-specific GPU platforms for inference workloads at the edge with Intel Select Solutions for AI Inferencing, thereby optimizing store operations, associated workflows, and product conversions.

By working closely with Intel and adopting solutions that have already been rigorously tested, we can help you achieve a faster time to market and free up your team to concentrate on delivering new features. Intel solutions are excellent for deep learning inference with low latency and high throughput, have a total cost of ownership (TCO) that beats inference accelerators, and give you the portability and flexibility to allow for different deployment models and new techniques as the machine learning revolution develops. For more information, read the solution brief and visit the Intel Select Solutions overview page.