Cisco HyperFlex*, Intel® Xeon® Scalable processors and Intel® Optane™ SSDs: Co-innovating a new class of performance in Hyperconverged Infrastructure

Hyperconverged infrastructure (HCI) has had a great run in tier-2 and general-purpose workloads, but it’s time to rethink what HCI solutions can deliver. Cisco and Intel have done just that with the new Cisco HyperFlex All-NVMe nodes.

While most HCI options on the market are stuck behind a performance barrier that keeps them from being viable for latency-sensitive enterprise applications, Cisco and Intel are ushering in a new era of HCI usability by co-innovating a new class of HCI performance.

Some like to refer to HCI solely as a software-defined solution, but software and hardware depend on each other to deliver an optimal user experience. Cisco created HyperFlex as a fully engineered solution, knowing that what defines HCI is not just the software or the hardware but the outcomes delivered by the solution as a whole. Business demands have never been greater, and organizations of all sizes are realizing that technology can be a strategic enabler for achieving their goals. HCI solutions not only need to support a variety of workloads, including mission-critical applications; they need to do so with greater VM density to maximize TCO savings, while delivering the application performance that end users need to stay productive. These demands exceed the capabilities of SATA-based SSDs, so Cisco and Intel have teamed up to take HCI into a new era of performance and usability.

Value in the HCI space comes from workload efficiency and ease of deployment. IT departments need the flexibility to quickly deploy a mixed set of workloads onto a centrally managed platform, with confidence that the infrastructure will both deliver and be fully utilized. Running latency-sensitive workloads such as SQL, Oracle, or SAP databases on HCI is attainable today; however, adding more applications and creating mixed-workload environments in the cluster can become a delicate game of balancing resources, leaving some nodes underutilized. Even with all-flash nodes, disk I/O performance can limit scalability when using SATA-based SSDs, because of protocol inefficiencies, latency variability, and 6Gbps interfaces that can starve a modern CPU. HCI systems provide the most efficient workload placement when the platform is engineered to properly balance processor, storage, and networking resources using the most scalable ingredients. Cisco has chosen to innovate in order to deliver maximum performance and value to its customers. HyperFlex can already scale compute resources by simply incorporating bare UCS servers into the cluster to add more Intel Xeon Scalable processors. Now, by leveraging a tight partnership with Intel, Cisco has also engineered a well-balanced, all-NVMe HCI platform. With the new HX220c All NVMe system, Cisco is delivering the first fully engineered HCI appliance that uses Intel Optane SSDs for cache and Intel 3D NAND NVMe SSDs for capacity, along with the latest Intel Xeon Scalable processors.

Kaustubh Das, Vice President of Product Marketing at Cisco, said, “We are excited to co-innovate with Intel, utilizing Intel Optane & NVMe technology, for our caching and capacity storage tiers. This opens up a new frontier of performance for hyperconverged infrastructure systems, in our upcoming HyperFlex release.”

The partnership between Intel and Cisco on the HX220c centers on integrating three key technologies to balance the platform and provide new levels of performance for mission-critical applications: all-NVMe interface SSDs, such as the latest Intel® 3D NAND NVMe SSDs; the Intel® Volume Management Device feature of the latest Intel Xeon Scalable processors, which manages Reliability, Availability, and Serviceability (RAS) with NVMe SSDs; and Intel® Optane™ DC P4800X SSDs as the caching layer, which improve storage performance efficiency and TCO.

To balance storage performance with the capabilities of the Intel Xeon Scalable processor family, Cisco has made the move to all-NVMe interface SSDs. Compared with traditional SATA or SAS SSDs, NVMe gives the platform much faster access to data by both increasing interface bandwidth and reducing interface latency. The latency that NVMe SSDs deliver to applications can be significantly lower than that of SATA or SAS, thanks to a more efficient protocol and lower software overhead. Intel demonstrated this a couple of years back, when NVMe SSDs were first launched, as seen here.

[Figure: latency comparison of NVMe, SAS, and SATA SSDs]
See System Configuration Details #1.
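
To put the interface gap in rough numbers, the sketch below compares the theoretical payload bandwidth of a 6Gbps SATA link with that of a PCIe 3.0 x4 link. The PCIe generation and lane count are assumptions chosen for illustration and are not taken from the HX220c specification. The results are line-rate ceilings after encoding overhead, not measured drive throughput.

```python
# Theoretical link bandwidth after line-encoding overhead.
# Assumption: the NVMe SSD attaches over PCIe 3.0 x4 (illustrative only,
# not an HX220c specification).

def payload_mb_per_s(gigatransfers_per_s: float, payload_bits: int,
                     encoded_bits: int, lanes: int = 1) -> float:
    """Return the usable bandwidth of a serial link in MB/s."""
    usable_bits_per_s = gigatransfers_per_s * 1e9 * payload_bits / encoded_bits
    return usable_bits_per_s * lanes / 8 / 1e6

# SATA 3.0: 6 Gbps line rate, 8b/10b encoding, single lane.
sata = payload_mb_per_s(6.0, 8, 10)

# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, four lanes.
pcie3_x4 = payload_mb_per_s(8.0, 128, 130, lanes=4)

print(f"SATA 6 Gbps : {sata:7.1f} MB/s")
print(f"PCIe 3.0 x4 : {pcie3_x4:7.1f} MB/s")
print(f"Ratio       : {pcie3_x4 / sata:.1f}x")
```

Bandwidth is only part of the picture. The latency and CPU-efficiency gains described in this article come from the protocol and software stack, which this arithmetic does not capture.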

 

This move to all NVMe is about not only performance balance but also solution Total Cost of Ownership (TCO). Ask any IT system administrator, and they will tell you that CPU cycles are precious. Significant investments are made in mission-critical application software licenses, so it is paramount to let these applications consume the CPU resources and ensure those investments are not squandered. Fetching data for the CPU consumes some of these resources, so doing it more efficiently leaves more CPU cycles for application software. As we also showed a few years back, the NVMe storage interface is far more efficient in terms of the CPU cycles needed to fetch data, as seen here.

See System Configuration Details #2.

 

When the NVMe interface was introduced, it changed the way storage is connected within the system, which introduced some challenges in meeting the data availability demands of mission-critical applications. With NVMe, storage is attached directly to the CPU over the PCIe interface, whereas SATA or SAS SSDs were generally connected to the CPU through an add-in PCIe-to-SAS storage controller. These SAS storage controllers provided some much-needed data availability features. They managed drive carrier LEDs, so you know which drive to service when a failure occurs, and they isolated and managed storage device events, such as hot removal or addition of a device, to avoid service interruptions.

Cisco’s new HX220c platform can meet the demanding RAS challenges of mission-critical applications with NVMe SSDs because Cisco worked closely with Intel to use a feature of the Intel Xeon Scalable processor family called Intel® Volume Management Device (Intel® VMD). Intel® VMD gave Cisco a standard programming interface for overcoming RAS challenges such as surprise drive removal errors, firmware management, and drive LED status lights, and for enabling future enhancements such as hot-pluggable NVMe drives.

Cisco also implemented the Intel® Optane™ DC P4800X SSD as the caching layer in the HX220c platform to help improve system efficiency. The Optane SSD is an ideal cache for an HCI solution because of the unique properties of Intel’s Optane technology. A storage caching layer must endure a very demanding storage I/O workload while delivering consistent performance to applications. The caching SSDs have to manage incoming application I/O while responding to application read requests quickly and efficiently. The caching SSD must also deliver data to the capacity tier SSDs at the same time, all without slowing things down. The Intel Optane SSD, thanks to the unique capabilities of Optane memory media, can meet the demands of such a workload, as you can see here.

This chart shows the response time of read requests (lower is better) for an Intel Optane SSD (orange) compared with a NAND SSD (blue) while each also handles a heavy and increasing write workload (grey, right axis). As you can see, a NAND SSD can struggle to provide consistent, low-latency responses to read requests as write pressure on that SSD increases, whereas the Optane SSD provides consistent and extremely low response times even under this extreme write pressure. To mitigate these effects, most other storage systems use a large-capacity NAND SSD and “overprovision” it, meaning that much of the available capacity is left unused by the system, because this can improve the NAND SSD’s ability to manage latency variability in such demanding workloads. Overprovisioning is a very inefficient use of flash capacity. With the Optane SSD, no such overprovisioning is required to provide consistent performance, so all of the caching capacity can be used for the demanding workloads. Additionally, Optane SSDs provide significantly lower latency to applications, enabling IT administrators to easily mix a myriad of demanding workloads within the HyperFlex HCI cluster without worrying about “noisy neighbors” affecting their mission-critical workloads, which improves workload placement flexibility and workload density. See System Configuration Details #2 below.

Hyperconverged infrastructure has evolved greatly since it came to market, and that evolution has just taken a huge step forward through the partnership between Cisco and Intel. Mixed-workload environments and mission-critical workloads such as databases are increasingly common on HCI, and those capabilities are now greatly enhanced by the introduction of the HX220c M5 All NVMe node, which has the performance to provide even greater workload density. Software and hardware features are interdependent in an HCI solution, and adding features to only one of them creates bottlenecks in the other. The solution needs to be engineered with the software and hardware together in order to have the flexibility to address individual workload characteristics and maximize TCO benefits. This is where Cisco has an advantage over the competition. By owning the hardware, software, and network, Cisco is able to leverage partnerships with industry leaders like Intel to holistically create an optimal HCI platform that can adopt new technology quickly to enhance the overall solution.


  1. System Configuration Details: Source – Intel. System configuration: Intel® S2600CP Family, 2x Intel® Xeon® E5-2690v2 CPU, 64 GiB DIMM DDR3, 1x NVMe* PCIe* Intel® SSD DC P3700 400GB, 1x Intel® SSD DC S3700 400GB, 1x LSI 9207-8i + 6Gb SAS HGST SSD at 400GB + 1x Intel® SSD DC S3700 400GB, 1x LSI 9300-8i + 12Gb SAS HGST SSD 400GB, CentOS 6.9* distribution with 2.6.32 Linux* kernel. FIO storage workload: fio --ioengine=libaio --description=100Read100Random --iodepth=4 --rw=randread --blocksize=4096 --size=100% --runtime=600 --time_based --numjobs=1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 --name=/dev/nvme0n1 2>&1 | tee -a NVMeONpciE.log (8x workers, QD4, random read, 4k block, 100% span of target, unformatted partition). Estimated results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.
  2. System Configuration Details: Responsiveness defined as average read latency measured at queue depth 1 during 4k random write workload. Measured using FIO 2.15. Common Configuration - Intel 2U Server System, OS CentOS 7.2, kernel 3.10.0-327.el7.x86_64, CPU 2 x Intel® Xeon® E5-2699 v4 @ 2.20GHz (22 cores), RAM 396GB DDR @ 2133MHz. Configuration – Intel® Optane™ SSD DC P4800X 375GB and Intel® SSD DC P3700 1600GB.  Latency – Average read latency measured at QD1 during 4K Random Write operations using fio-2.15.  Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown".  Implementation of these updates may make these results inapplicable to your device or system.
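
The fio invocation quoted in configuration #1 can be assembled programmatically, which makes the workload shape (eight jobs, each at queue depth 4) easier to see. This is only a sketch: the helper `build_fio_args` and its parameters are hypothetical names introduced here, the flags are copied from the disclosure above, and the command is printed rather than executed.

```python
import shlex
from typing import List

JOBS = 8      # "8x workers" in the disclosure
IODEPTH = 4   # "QD4" in the disclosure

def build_fio_args(device: str, jobs: int = JOBS, iodepth: int = IODEPTH,
                   runtime_s: int = 600) -> List[str]:
    """Hypothetical helper: rebuild the argument list quoted in configuration #1."""
    args = [
        "fio",
        "--ioengine=libaio",
        "--description=100Read100Random",
        f"--iodepth={iodepth}",
        "--rw=randread",          # 100% random reads
        "--blocksize=4096",       # 4k blocks
        "--size=100%",            # full span of the target
        f"--runtime={runtime_s}",
        "--time_based",
        "--numjobs=1",
    ]
    # The disclosure repeats --name once per worker, giving eight jobs.
    args += [f"--name={device}"] * jobs
    return args

args = build_fio_args("/dev/nvme0n1")
outstanding = JOBS * IODEPTH  # I/Os in flight across all jobs

print(shlex.join(args))
print(f"Outstanding I/Os across all jobs: {outstanding}")
```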