This post continues our series of blogs on the performance of Windows Server* 2016 with Storage Spaces Direct in our Intel lab, which we introduced a few months back with our post: 3 Ready to Go Configurations for Windows Server 2016 with Storage Spaces Direct. In this blog, we present the IOPS results for the all-flash NVMe + SATA configuration and show the benefit of deploying an all-flash NVMe + SATA configuration versus a hybrid NVMe + HDD configuration.
We set up a four-node cluster for the all-flash NVMe + SATA configuration, consisting of four 2U Intel® Server Systems based on the Intel® Server Board S2600WT2R. Each server was configured with:
2x Intel® Xeon® processor E5-2695 v4 (45M Cache, 2.10GHz, 18 cores, 120W)
Cache Tier: 4x 2TB Intel® SSD DC P3700 Series (NVMe)
Capacity Tier: 20x 1.6TB Intel® SSD DC S3610 Series (SATA)
1x 10GbE dual-port Chelsio* T520 adapter
1x 10 GbE Extreme Networks Summit X670-48x switch for cluster networking
Let’s break down the storage capacity:
- 128 TB of raw capacity in the cluster [(32 TB/node) × 4 nodes]
- With 3-way mirroring in Storage Spaces Direct:
  - 42.6 TB of total usable space (128 TB / 3 = 42.6 TB)
  - 10.66 TB of available storage per node (42.6 TB / 4 nodes = 10.66 TB)
- Total shared space: (9 TB × 4 = 36 TB) + 2 TB = 38 TB
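As a quick sanity check, the usable-capacity arithmetic above can be reproduced in a few lines (a minimal sketch; all figures come straight from the configuration described in this post):

```python
# Usable-capacity math for the 4-node all-flash cluster (figures from the post).
nodes = 4
raw_per_node_tb = 20 * 1.6             # 20x 1.6 TB SATA SSDs in the capacity tier
raw_total_tb = raw_per_node_tb * nodes # 128 TB raw in the cluster

mirror_copies = 3                      # 3-way mirroring keeps three copies of all data
usable_total_tb = raw_total_tb / mirror_copies   # ~42.6 TB usable
usable_per_node_tb = usable_total_tb / nodes     # ~10.66 TB per node

print(raw_total_tb, usable_total_tb, usable_per_node_tb)
```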
36 Azure-like VMs were deployed per node, each with 2 cores, 3.5 GB RAM, and a 60 GB OS disk. Each VM was also equipped with a 150 GB data VHD (30.24 TB total space used from the shares) containing 2x 70 GB Diskspd files for the spill-over tests and 1x 70 GB Diskspd file for the cached-in tests.
- 36x Azure-like VMs per node
- 60 GB OS VHD + 150 GB data VHD per VM [30.24 TB total space used from the shares]
- Spill over: 2x 70 GB Diskspd files per VM
- Cached in: 1x 70 GB Diskspd file per VM
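The 30.24 TB figure follows directly from the per-VM Diskspd file layout (a quick arithmetic check using the numbers above):

```python
# Space consumed from the shares by the Diskspd working-set files (figures from the post).
vms = 36 * 4                # 36 VMs per node x 4 nodes = 144 VMs
files_per_vm = 3            # 2 spill-over files + 1 cached-in file per VM
file_size_gb = 70
total_tb = vms * files_per_vm * file_size_gb / 1000   # decimal TB
print(total_tb)             # 30.24 TB, matching the figure above
```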
Running Windows Server 2016 Technical Preview 5 (TP5) with 36 VMs per node, for a total of 144 VMs, we ran DISKSPD (version 2.0.15) on each virtual machine with 4 threads and 32 outstanding IOs, similar to our testing on the NVMe + HDD configuration. With the working set contained within the caching tier, we achieved ~1.54 million aggregate IOPS at an average CPU utilization of 89% for 4K 100% random 100% reads. For the 8K 70/30 read/write scenario, we achieved an aggregate 630,375 IOPS with an average CPU utilization of 79.8%.
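For reference, a DISKSPD invocation matching the stated parameters might look like the following. This is a sketch, not our exact command line: only the block size, thread count, queue depth, and read/write mix are given in this post; the target file path, test duration (`-d`), cache-bypass flag (`-Sh`), and latency-capture flag (`-L`) shown here are assumptions.

```shell
# 4K, 100% random, 100% reads: 4 threads, 32 outstanding IOs per thread (assumed path/duration)
diskspd.exe -b4K -t4 -o32 -r -w0 -d120 -Sh -L D:\data\testfile1.dat

# 8K, 70/30 read/write mix, same thread count and queue depth (assumed path/duration)
diskspd.exe -b8K -t4 -o32 -r -w30 -d120 -Sh -L D:\data\testfile1.dat
```

Here `-w` gives the write percentage, so `-w0` is 100% reads and `-w30` is a 70/30 read/write mix.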
When the entire working set was contained within the SSD caching tier (All Cached In):
| | 4K Random Reads | 8K 70/30 RW |
| --- | --- | --- |
| VMs | 144 | 144 |
| Avg. CPU Utilization (%) | 89 | 88.6 |
4K – 100% Random 100% Reads (All Cached In):
Increasing the working set to 75% of total storage on each node, with the same configuration as above, we achieved ~1.22 million aggregate IOPS and an average CPU utilization of 82.81% for 4K 100% random 100% reads. For the 8K 70/30 RW scenario, we achieved an aggregate 518,079 IOPS with an average CPU utilization of 60.31%.
When the working set is 75% of total storage on each node:
| | 4K Random Reads | 8K 70/30 RW |
| --- | --- | --- |
| VMs | 144 | 144 |
| Avg. CPU Utilization (%) | 82.81 | 60.31 |
4K – 100% Random 100% Reads (75% of total storage):
In an All Cached In scenario, the working set fits within the caching tier (3.2 TB per node in this configuration), and the results above show that IOPS performance stays consistent for 4K 100% random reads (~1.54 million aggregate IOPS). When the working set grows beyond the 3.2 TB cache tier, IOPS performance across all the nodes in the cluster remains consistent, with only a small delta versus the All Cached In scenario. For example, when the working set is 75% of the overall storage, IOPS performance is reduced by ~20%, but the load stays evenly distributed across all the nodes and does not result in node imbalance. Compare this to our NVMe + HDD configuration, where IOPS performance dropped by ~82% when the working set was 78% of the overall storage, and node imbalance resulted.
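The ~20% figure is simply the relative drop in aggregate 4K random-read IOPS between the two working-set sizes (a quick check using the numbers reported above):

```python
# Relative IOPS drop when the working set grows beyond the cache tier
# (aggregate 4K random-read figures from the post).
cached_in_iops = 1.54e6   # working set all cached in
large_ws_iops = 1.22e6    # working set at 75% of total storage
drop = (cached_in_iops - large_ws_iops) / cached_in_iops
print(round(drop * 100, 1))   # roughly a 20 percent reduction
```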
Similarly, for 8K 70/30 read/write, when the working set of the NVMe + SATA configuration is 75% of the overall storage, there is a delta of ~16% in performance compared to the entire working set being contained within the caching tier. In the NVMe + HDD configuration, when the working set was at 78% of the overall storage, we saw ~77% lower performance than with the working set contained within the caching tier, along with node imbalance.
We clearly see that IOPS performance across all the nodes in the cluster is very consistent on an all-flash configuration like NVMe + SATA, where the working set uses the SATA SSDs for capacity and the NVMe SSDs for cache. Overall performance is not gated by the size of the caching tier: as the working set grows onto the capacity SSDs, performance remains consistent and balanced across all the nodes, at ~1.54 million IOPS when all cached in and ~1.22 million IOPS at 75% utilization of the cluster's total storage.
Claus Joergensen’s recent blog highlighted the initial results of a 16-node cluster in Microsoft's labs using a similar configuration with Intel Xeon processor-based systems and Intel NVMe and SATA SSDs. These tests showed that the 4-node cluster configuration scales linearly to 16 nodes while maintaining consistent performance as the Storage Spaces Direct solution grows.
Coming up, we will share workload testing on these NVMe + SATA and NVMe + HDD configurations, as well as IOPS results for the all-NVMe configuration. We are very excited to share those results with you. For a preview, take a look at what was presented at IDF 2016: Storage Spaces Direct with an all-Intel® NVMe SSD configuration delivering ~2.4 million IOPS! Stay tuned for more results!
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as HammerDB, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing. *Other names and brands may be claimed as the property of others.