Co-authored by Andy Davis, Software Engineer at Google
The availability of open source deep learning frameworks like TensorFlow* is making artificial intelligence (AI) available to everyone. And every day, researchers and engineers are using it to solve new business, engineering, and even societal problems. Intel and Google engineers have been working hand in hand to optimize TensorFlow for Intel® Xeon® and Intel® Xeon Phi™ processors. The goal of this work is to ensure that TensorFlow users can enjoy the full performance capabilities that Intel platforms deliver. This work can result in significantly higher performance on typical use cases, and can enable the AI community to tackle harder and more complex problems, iterating on solutions more quickly. All of this extra performance is possible on widely available Intel hardware.
TensorFlow is one of the most widely used deep learning frameworks. Developed by Google, it enables numerical computation using data flow graphs with a focus on speed, flexibility, and production-readiness. Its community of users and developers is one of the most vibrant in AI. It is critical to support this community with leading performance on the most widely used hardware in the industry. The work presented here introduces a number of changes to TensorFlow ensuring that it takes advantage of key performance features in Intel processors. These changes are implemented in such a way that existing Python*-based topologies can experience dramatic performance improvements with no modifications at the model level.
Tapping into the deep learning performance that Intel Xeon and Intel Xeon Phi processors offer requires the following:
- Ensure the code takes advantage of the modern vector instructions in the processors that enable parallel processing of data within a single core.
- Increase execution parallelism in the software to use all available cores efficiently.
- Pre-fetch and cache data efficiently to ensure that data is available when the execution units are ready to process it.
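As a loose illustration of the first point, vectorized array operations let the underlying math library apply SIMD vector instructions (such as AVX-512 on Intel Xeon Phi) to many elements per instruction, where an element-by-element loop cannot. NumPy stands in here for the vectorized kernels TensorFlow dispatches to; this sketch only contrasts the two styles of computation, not TensorFlow's actual kernel code.

```python
import numpy as np

a = np.arange(8, dtype=np.float32)
b = np.arange(8, dtype=np.float32)

# Scalar-style: one element per loop iteration, with Python overhead on each step.
loop_sum = np.empty_like(a)
for i in range(len(a)):
    loop_sum[i] = a[i] + b[i]

# Vectorized: a single call over the whole array, which the library can map
# onto vector instructions that process many elements per CPU cycle.
vec_sum = a + b

assert np.array_equal(loop_sum, vec_sum)
```

The results are identical; the difference is that the vectorized form exposes the whole array to the library at once, which is what makes instruction-level data parallelism possible.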
Optimized operations were implemented to use deep neural network (DNN) primitives provided by the Intel® Math Kernel Library (Intel® MKL) when running on Intel hardware. In addition to matrix multiplication, new optimized primitives include:
- Direct batched convolution
- Inner product
- Pooling: maximum, average
- Normalization: local response normalization across channels (LRN), batch normalization
- Activation: rectified linear unit (ReLU)
- Data manipulation: multi-dimensional transposition (conversion), split, concat, sum, and scale
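To make the primitives above concrete, the following NumPy sketch shows the math that a few of them compute: ReLU activation and 2×2 max/average pooling. These are only toy reference implementations of the primitive semantics; the function names are illustrative, not MKL APIs, and the MKL versions are vectorized, blocked, and threaded rather than written this way.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: elementwise max(0, x).
    return np.maximum(x, 0)

def max_pool_2x2(x):
    # x has shape (N, H, W, C) with even H and W; 2x2 windows, stride 2.
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

def avg_pool_2x2(x):
    # Same windowing as above, but averaging instead of taking the maximum.
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

# A single 2x2 single-channel "image" in NHWC layout.
x = np.array([[-1., 2.], [3., -4.]]).reshape(1, 2, 2, 1)
print(relu(x).ravel())         # [0. 2. 3. 0.]
print(max_pool_2x2(x).ravel())  # [3.]
print(avg_pool_2x2(x).ravel())  # [0.]
```

When TensorFlow is built with the Intel MKL support described here, graphs that contain these operations pick up the optimized implementations automatically, with no changes at the model level.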
There are also other updates that minimize data format conversions, share memory between TensorFlow and Intel MKL operations, and optimize threading for better CPU utilization.
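The threading behavior can also be tuned from user code. The sketch below shows the knobs commonly adjusted for MKL-backed TensorFlow: the OpenMP environment variables read by the runtime that Intel MKL uses, plus TensorFlow's own inter-/intra-op thread settings. The specific numbers are placeholders, not recommendations; the best values depend on the workload and the machine, and the variables must be set before the library is loaded.

```python
import os

# OpenMP / Intel KMP knobs read by the threading runtime MKL uses.
# Values here are illustrative only.
os.environ["OMP_NUM_THREADS"] = "16"   # threads available to each MKL-parallelized op
os.environ["KMP_BLOCKTIME"] = "0"      # ms a thread spin-waits after finishing work
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin threads to cores

# Within TensorFlow itself, op-level parallelism is set on the session config,
# e.g.:
#   config = tf.ConfigProto(intra_op_parallelism_threads=16,
#                           inter_op_parallelism_threads=2)
#   sess = tf.Session(config=config)
```

intra_op threads parallelize work inside a single operation (e.g. one large convolution), while inter_op threads run independent operations of the graph concurrently; oversubscribing both at once can cause thread contention rather than speedup.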
Some of the changes are already publicly available in the TensorFlow git repository, and the merging of the remaining changes is being finalized.
What it means for AI
Optimizing TensorFlow means deep learning applications built using this widely available and widely applied framework can now run much faster on Intel® processors to increase flexibility, accessibility, and scale. The Intel Xeon Phi processor, for example, is designed to scale out in a near-linear fashion across cores and nodes to dramatically reduce the time to train machine learning apps. And TensorFlow can now scale with future performance advancements as we continue enhancing the performance of Intel processors to handle even bigger and more challenging AI workloads.
The collaboration between Intel and Google to optimize TensorFlow is part of ongoing efforts to make AI more accessible to developers and data scientists, and to enable AI applications to run wherever they’re needed on any kind of device—from the edge to the cloud. Intel believes that’s the key to creating the next-generation AI algorithms and models to solve the most pressing problems in business, science, engineering, medicine, and society.
Stay tuned for the remaining optimizations to show up soon in Google's TensorFlow repository on GitHub.