At today’s Intel Data Centric Innovation Summit, I briefed a group of reporters and investors on how the Intel® Xeon® processor architecture will deliver optimal performance in the data center. When I meet with senior architects from cloud, network, supercomputing, and enterprise customers to define our future processors, they focus on the architecture and capabilities that deliver better value at the data center level. They want more service capacity for every infrastructure dollar they spend, and they want better performance for every workload they run. Let’s take a high-level look at how our balanced approach to processor architecture meets those needs.
Two Critical Metrics
Processor-level benchmark scores often measure the aggregate performance of a processor or a multi-socket server. Roughly speaking, these aggregate metrics measure Throughput (TPT): the total capacity the processor or server provides. TPT is critical and helps drive the total cost of ownership (TCO) of the infrastructure.
While TPT is important, it is even more critical that each task running on the infrastructure meets the response-time or latency target required by the application. The response time of a specific job, whether a virtual machine, a web transaction, or an online search, typically depends on the performance of each core, referred to as Per Core Performance (PCP). PCP speaks directly to how fast each of those tasks runs. If PCP is low, workload performance may not even meet the customer’s service-level agreement, and any level of TPT becomes secondary. You definitely won’t win their business with inadequate per-core performance.
Throughput and Per Core Performance through the Customer’s Eyes
Our customers demand a strong mix of both PCP and TPT because both matter to the bottom line. If you are a cloud service provider (CSP) offering Infrastructure- or Platform-as-a-Service, your goal is to increase revenue by maximizing the number of virtual machines (VMs) you can sell per server node, so cores per server and TPT are very important. But if PCP is low, your instances will not meet customers’ performance expectations, and you risk losing business to competing CSPs.
The same dynamic plays out in other scenarios. For example, many cores and high TPT let you run many search operations simultaneously, but will users prefer your service if each search takes longer because of lower PCP? Will you come out ahead if you can process many financial trades simultaneously, but a rival with higher PCP completes its trades first?
Data center operators can scale up overall TPT by adding server nodes or more processors per node, but there is really no way to scale out of insufficient PCP: more nodes let more tasks run in parallel, but they do not make any individual task finish sooner.
Intel’s Balanced Approach across Processors and Platforms
Our team designs Intel® processors and platforms to deliver an optimal balance of PCP and TPT. When architecting our processors, we usually think in terms of “effective TPT”: improving the throughput of the infrastructure while still meeting the response-time or performance requirements of every workload. The Intel® Xeon® Scalable processor family delivers both PCP and TPT through an advanced microarchitecture that drives higher instructions per cycle; a highly tuned design that delivers high frequency at low power; instruction set innovations, such as AVX-512, that allow more work to be completed in each clock cycle; a highly efficient cache design; on-die and off-die accelerators; memory innovations; and more.
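To make “effective TPT” concrete, here is a toy model, a minimal sketch in Python under deliberately simple assumptions: each task runs on one core, its response time is its work divided by per-core speed, and throughput is counted only when that response time meets the SLA. The names (`Config`, `effective_tpt`) and all numbers are hypothetical and illustrative, not measurements of any Intel product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """A hypothetical server configuration; every number below is illustrative."""
    name: str
    nodes: int          # server nodes
    cores: int          # cores per node
    core_speed: float   # per-core performance (PCP), in work units per second


def latency(cfg: Config, work: float) -> float:
    """Response time of one task, assuming it runs on a single core."""
    return work / cfg.core_speed


def throughput(cfg: Config, work: float) -> float:
    """Aggregate TPT: tasks per second with every core busy."""
    return cfg.nodes * cfg.cores * cfg.core_speed / work


def effective_tpt(cfg: Config, work: float, sla: float) -> float:
    """TPT counts only if each task meets its response-time target (SLA)."""
    return throughput(cfg, work) if latency(cfg, work) <= sla else 0.0


WORK = 1.0  # work units per task (hypothetical)
SLA = 0.5   # maximum acceptable response time in seconds (hypothetical)

many_slow = Config("many slow cores", nodes=1, cores=64, core_speed=1.5)
balanced = Config("balanced", nodes=1, cores=40, core_speed=2.5)
# Scaling out: four times the nodes, but the same per-core performance.
scaled_out = replace(many_slow, name="many slow cores x4 nodes", nodes=4)

for cfg in (many_slow, balanced, scaled_out):
    print(f"{cfg.name}: TPT={throughput(cfg, WORK):.0f}/s, "
          f"latency={latency(cfg, WORK):.2f}s, "
          f"effective TPT={effective_tpt(cfg, WORK, SLA):.0f}/s")
```

In this model the scaled-out configuration quadruples raw TPT to 384 tasks per second, yet its effective TPT stays at zero because each task still takes about 0.67 s, while the balanced configuration meets the 0.5 s target and delivers 100 tasks per second.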
Our unique Mesh Architecture, with support for advanced coherence flows, together with a balanced memory system, provides low latency and high bandwidth, enabling performance to scale almost linearly with core count. It delivers the best TPT and PCP and increases the predictability of workload response, a critical concern for multi-threaded and highly parallel workloads. In addition, over successive generations we have advanced our multi-processor coherence architecture so that our processors scale up to 8 sockets in a glueless fashion, maximizing total system TPT.
High Data Center Utilization Drives Maximum ROI
Our data center customers don’t think only at the processor level, but in terms of racks, services, and entire data centers. To maximize return on their investment, they must operate their data centers at high utilization. Over several generations, we have worked across different segments to understand how the infrastructure is used across many real workloads, usages, and deployment models, so that customers can achieve high utilization.
Intel® platforms increase utilization through greater workload consolidation and help reduce the guard-bands customers add to accommodate workload fluctuations. Intel® Virtualization Technology, large caches, and large memory capacities enable more VMs per node. Unpredictable or “greedy” workloads create resource conflicts or large peaks and valleys in utilization. Intel’s performance isolation capabilities like cache capacity enforcement, turbo/throttle isolation, and memory allocation reduce this performance “jitter” without resorting to added capacity or lower utilization.
Intel® Xeon® platform accelerators and features, such as Intel® QuickAssist Technology, Intel® Data Direct I/O Technology, and purpose-built instruction set extensions, help speed up crypto and compression functions in the data center. In addition, our manageability and RAS features, such as Intel® Run Sure Technology, increase system availability and uptime.
Ready for Artificial Intelligence
In the past, human-defined simulation models and algorithms determined application outcomes. AI, by contrast, learns from prior relevant data and determines results from it. All platforms need to adapt to this new paradigm as AI becomes embedded in more and more applications.
We have built a number of key architectural constructs, such as AVX-512 and a rich set of instructions, to efficiently process the key kernels in machine learning and especially deep neural networks. In addition, the large L2 cache and larger shared last-level cache (LLC) allow the right data structures to reside at the appropriate level of the cache hierarchy and be shared efficiently, improving both compute and power efficiency.
We continue to profile critical AI workloads to identify new microarchitecture features and instruction set semantics that grow AI performance, and you will see us drive additional innovations in our upcoming processors. For example, the combination of Intel® DL Boost and Intel® AVX-512 delivers the ability to process more lower-precision AI operations per clock cycle than any other CPU architecture.
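Intel DL Boost includes the AVX-512 Vector Neural Network Instructions (VNNI). As a hedged illustration of why lower precision means more work per instruction, the Python sketch below is a functional model of one such instruction, VPDPBUSD, which multiplies unsigned 8-bit values by signed 8-bit values in groups of four and adds each group’s sum into a 32-bit accumulator lane without saturation. It is for reading only: in real code the instruction is reached through compilers, libraries, or the `_mm512_dpbusd_epi32` intrinsic, and the helper names here (`vpdpbusd`, `to_int32`) are hypothetical.

```python
REG_BITS = 512  # width of an AVX-512 register


def to_int32(x: int) -> int:
    """Wrap an integer to the signed 32-bit two's-complement range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vpdpbusd(acc: list[int], a_u8: list[int], b_s8: list[int]) -> list[int]:
    """Functional model of VPDPBUSD: per 32-bit lane, add the sum of four
    unsigned-byte x signed-byte products to the accumulator (no saturation)."""
    lanes = len(acc)
    if len(a_u8) != 4 * lanes or len(b_s8) != 4 * lanes:
        raise ValueError("need four byte pairs per 32-bit accumulator lane")
    if not all(0 <= a <= 255 for a in a_u8) or not all(-128 <= b <= 127 for b in b_s8):
        raise ValueError("a must be unsigned 8-bit, b must be signed 8-bit")
    out = []
    for i in range(lanes):
        dot = sum(a_u8[4 * i + j] * b_s8[4 * i + j] for j in range(4))
        out.append(to_int32(acc[i] + dot))  # wraps around like the hardware
    return out


lanes = REG_BITS // 32               # 16 int32 accumulators
acc = [10] * lanes
a = [1, 2, 3, 4] * lanes             # 64 unsigned bytes
b = [1, -1, 2, -2] * lanes           # 64 signed bytes
print(vpdpbusd(acc, a, b))           # each lane: 10 + (1 - 2 + 6 - 8) = 7

# Multiply-accumulates per instruction: 8-bit inputs vs. 32-bit floats.
print("int8 MACs:", REG_BITS // 8, "fp32 FMAs:", REG_BITS // 32)
```

With 8-bit inputs, one 512-bit instruction performs 64 multiply-accumulates, four times the 16 lanes available for 32-bit floating point; that ratio is the basic arithmetic behind processing more lower-precision operations per instruction.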
Integrating Intel® Optane™ DC Persistent Memory
The introduction of SSDs radically improved workload performance and responsiveness in the data center by dramatically improving access to storage. Intel® Optane™ DC persistent memory offers compelling improvements in performance, reliability, and persistence to bring a similar transformation to the memory tier. As the volume of data that data centers work with continues to grow, the ability to bring more data closer to compute, in the memory tier rather than the storage tier, will directly improve application performance as well as data center utilization and responsiveness.
Bringing high-performance non-volatile memory into the memory subsystem requires fundamental changes to system architecture, in both hardware and software, so that applications can directly access its larger capacity and persistence. Our upcoming Intel® Xeon® Scalable processor integrates Intel® Optane™ DC persistent memory in a high-performance manner.
There are a thousand technical details behind every idea I’ve raised today, but I hope this simple framing provides a helpful lens for considering the important choice of platform architecture.