IOPS performance on NVMe + HDD configuration with Windows Server 2016 and Storage Spaces Direct

In our previous blog on Storage Spaces Direct, we discussed three different configurations that we jointly developed with Microsoft: IOPS optimized (all-flash NVMe), throughput/capacity optimized (all-flash NVMe and SATA SSD), and capacity optimized (hybrid NVMe and HDD). Since then, we have been testing these configurations in our lab with the Windows Server 2016 TP5 release and monitoring how they perform with Storage Spaces Direct activated. In this blog, we present the IOPS performance results for the hybrid NVMe and HDD configuration.

Configuration

The hybrid NVMe and HDD setup consisted of four 2U Intel® Server Systems equipped with the Intel® Server Board S2600WT2R. Each server was configured as follows:

Processor:

  • 2x Intel® Xeon® processor E5-2650 v4 (30M cache, 2.2 GHz, 12 cores, 105W)

Storage:

  • Cache tier: 2x 2TB Intel® SSD DC P3700 Series
  • Capacity tier: 8x 6TB 3.5” Seagate^ ST6000NM0024 HDD

Network:

  • 1x 10GbE dual-port Chelsio^ T520 adapter

With a total raw storage capacity of 192 TB in the cluster [(48 TB/node) * 4 nodes] and three-way mirroring, we had 64 TB of usable space (192 TB / 3 = 64 TB), or 16 TB of available storage per node (64 TB / 4 nodes = 16 TB). The total share space used was 4 * 14 TB (= 56 TB) + 2 TB = 58 TB. For cluster networking, a single 10 GbE Extreme Networks Summit X670-48x switch was used.
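For context, a Windows Server 2016 hyper-converged Storage Spaces Direct cluster and its mirrored CSV volumes are typically brought up with a handful of PowerShell cmdlets. The sketch below is illustrative only: the cluster and volume names are hypothetical, and the exact commands and resiliency settings used in our lab deployment are not reproduced here.

  # Hypothetical sketch: enable Storage Spaces Direct on an existing failover
  # cluster, then create mirrored CSV volumes roughly matching the share
  # layout described above (4 x 14 TB + 1 x 2 TB). Names are placeholders.
  Enable-ClusterStorageSpacesDirect -CimSession "S2D-Cluster"

  1..4 | ForEach-Object {
      New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Share0$_" `
          -FileSystem CSVFS_ReFS -Size 14TB
  }
  New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Share05" `
      -FileSystem CSVFS_ReFS -Size 2TB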

We deployed 24 Azure-like VMs per node; each VM had 2 cores, 3.5 GB of RAM, and a 60 GB OS disk. Each VM was also equipped with a 500 GB data VHD (53.76 TB total space used from the shares) containing 4*98 GB Diskspd files (spill-over) and 2*10 GB Diskspd files (cached-in), as summarized below; a deployment sketch follows the list.

VMs:

  • 24x Azure-like VMs per node
  • 60 GB OS VHD + 500 GB Data VHD per VM [53.76 TB total space used from the shares]
  • Spill over: 4*98GB Diskspd files per VM
  • Cached in: 2*10GB Diskspd files per VM
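A minimal Hyper-V sketch for provisioning one such VM is shown below. The VM name, CSV path, and use of dynamic VHDX files are assumptions for illustration; the actual provisioning scripts and base OS image are not shown here.

  # Hypothetical example: one Azure-like VM (2 vCPU, 3.5 GB RAM, 60 GB OS VHD)
  # with a 500 GB data VHD placed on one of the S2D CSV shares.
  $vmName = "S2D-VM-001"
  $csv    = "C:\ClusterStorage\Volume1"

  New-VHD -Path "$csv\$vmName\os.vhdx"   -SizeBytes 60GB  -Dynamic | Out-Null
  New-VHD -Path "$csv\$vmName\data.vhdx" -SizeBytes 500GB -Dynamic | Out-Null

  New-VM -Name $vmName -MemoryStartupBytes 3.5GB -Generation 2 `
      -VHDPath "$csv\$vmName\os.vhdx" -Path "$csv\$vmName"
  Set-VMProcessor -VMName $vmName -Count 2
  Add-VMHardDiskDrive -VMName $vmName -Path "$csv\$vmName\data.vhdx"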

Results

With 24 VMs per node, for a total of 96 VMs, we ran DISKSPD (version 2.0.15) in each virtual machine with 4 threads and 32 outstanding IOs. With the working set contained within the caching tier, we achieved 954,240 aggregate IOPS at an average CPU utilization of 80.23% for 4K 100% random 100% reads. For the 8K 70/30 read/write scenario, we achieved 641,979 aggregate IOPS at an average CPU utilization of 88.6%.
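For reference, DISKSPD command lines along the following lines produce the two access patterns described above (4 threads, 32 outstanding IOs per thread). The target file path and run duration are placeholders; the exact scripts, warm-up, and file selection used in our tests are not reproduced here.

  # 4K block size, 100% random, 100% read; 4 threads, 32 outstanding IOs per
  # thread, software caching disabled, latency statistics collected.
  diskspd.exe -b4K -r -w0 -t4 -o32 -d300 -Sh -L E:\diskspd\test1.dat

  # 8K block size, 100% random, 70% read / 30% write.
  diskspd.exe -b8K -r -w30 -t4 -o32 -d300 -Sh -L E:\diskspd\test1.dat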

When the entire working set is contained within the SSD caching tier (all cached in):

                              4K Random Reads    8K 70/30 RW
  VMs                         96                 96
  Aggregate IOPS              954,240            641,979
  Avg. CPU Utilization (%)    80.23              88.6

4K – 100% Random 100% Reads:

[Chart: CSVFS Reads/sec]

8K 70/30 Read & Write:

[Chart: CSVFS Reads/sec] [Chart: CSVFS Writes/sec]

When the working set was increased to use 78% of total storage on each node, with the same configuration as above, we achieved 176,613 aggregate IOPS at an average CPU utilization of 21.89% for 4K 100% random 100% reads. For the 8K 70/30 read/write scenario, we achieved 135,365 aggregate IOPS at an average CPU utilization of 16.42%.

When the entire working set is 78% of total storage on each node:

                              4K Random Reads    8K 70/30 RW
  VMs                         96                 96
  Aggregate IOPS              176,613            135,365
  Avg. CPU Utilization (%)    21.89              16.42

To understand cluster IOPS performance as the working set starts to spill over the caching tier, we did a theoretical analysis based on the results above to estimate IOPS performance vs. working set size (these are estimates only, since not all working set sizes were measured). In the all-cached-in scenario the working set fits within the caching tier, which in this configuration is 4 TB per node, and the results above show that IOPS performance stays consistent for 4K 100% random 100% reads (954,240 aggregate IOPS). If the working set grows beyond the 4 TB cache tier, we expect overall IOPS performance to drop. At a 6 TB working set, estimated IOPS performance for 4K 100% random 100% reads has dropped by ~50%. As the working set grows further, we expect IOPS performance to fall further still; in our testing, when the working set was increased to 9.4 TB, measured IOPS performance was ~82% lower than with the working set contained entirely within the caching tier.
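The starred 4K estimates in the table below follow a simple pattern: once the working set exceeds the 4 TB cache, the estimated per-node IOPS equal the all-cached-in per-node figure divided by the spill-over size in TB (for example, 238,560 / 2 = 119,280 at a 6 TB working set), and the aggregate figures assume all four nodes contribute equally. The short PowerShell sketch below reproduces that column under this assumption; it is our reading of the table values, not a formal performance model.

  # Reproduce the estimated per-node and aggregate 4K IOPS from the table
  # below, assuming estimated IOPS = all-cached-in IOPS / spill-over (TB)
  # once the working set exceeds the 4 TB cache tier.
  $cachedInIops = 238560   # measured per-node IOPS with the working set all cached in
  $nodes        = 4

  # (working set TB, spill-over TB) rows as listed in the table
  $rows = @(
      @(0.96, 0), @(1, 0), @(2, 0), @(3, 0), @(4, 0),
      @(6, 2), @(7, 3), @(8, 4), @(9, 5), @(9.408, 5.4)
  )

  foreach ($row in $rows) {
      $ws, $spill = $row
      $perNode = if ($spill -eq 0) { $cachedInIops } else { $cachedInIops / $spill }
      "{0,6} TB working set -> {1,9:N0} IOPS/node, {2,10:N0} aggregate" -f `
          $ws, $perNode, ($perNode * $nodes)
  }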

[Chart: Estimated IOPS vs. working set size, 4K 100% random 100% reads]

The table below outlines the theoretical analysis used in estimating IOPS vs. working set size performance.

4K 100% Random 100% Reads:

  Working Set Size [TB]   Cache Size [TB]   Spill-over [TB]   IOPS/Node    Aggregate IOPS
  0.96                    4                 0                 238,560**    954,240**
  1                       4                 0                 238,560*     954,240*
  2                       4                 0                 238,560*     954,240*
  3                       4                 0                 238,560*     954,240*
  4                       4                 0                 238,560*     954,240*
  6                       4                 2                 119,280*     477,120*
  7                       4                 3                 79,520*      318,080*
  8                       4                 4                 59,640*      238,560*
  9                       4                 5                 47,712*      190,848*
  9.408                   4                 5.4               44,178**     176,711**

  *Estimated IOPS   **Measured IOPS

Similarly, for 8K 70/30 read & write, the theoretical analysis suggests a ~60% reduction in IOPS performance as the working set grows from 4 TB to 6 TB. When the working set increased to 9.4 TB, measured IOPS performance was ~77% lower than with the working set contained within the caching tier.

[Chart: Estimated IOPS vs. working set size, 8K 70/30 read & write]

The table below outlines the theoretical analysis used in estimating IOPS vs. working set size performance for 8K 70/30 read & write.

8K 70/30 Read & Write:

  Working Set Size [TB]   Cache Size [TB]   Spill-over [TB]   IOPS/Node    Aggregate IOPS
  0.96                    4                 0                 160,495**    641,979**
  1                       4                 0                 160,495*     641,979*
  2                       4                 0                 160,495*     641,979*
  3                       4                 0                 160,495*     641,979*
  4                       4                 0                 160,495*     641,979*
  6                       4                 2                 66,873*      267,491*
  7                       4                 3                 51,773*      207,090*
  8                       4                 4                 42,235*      168,942*
  9                       4                 5                 35,666*      142,662*
  9.408                   4                 5.4               33,576**     134,305**

  *Estimated IOPS   **Measured IOPS

Conclusion

Hybrid storage configurations like NVMe SSD + HDD work well for workloads whose working set fits within the NVMe SSD cache. When the entire working set is resident within the capacity of the NVMe drives, we see aggregate IO performance of ~950K IOPS (4K 100% random 100% reads). As the working set increases, or a multi-tenant configuration changes the request profile on the storage, we would expect data to spill out of the NVMe SSDs. As this happens, performance will be gated by the IOPS capability of the HDDs, which can result in imbalance across the nodes. This can potentially be addressed with Windows Server 2016 Storage QoS, by pre-defining minimum and maximum performance for virtual machines. To support a growing working set while maintaining consistent performance across all nodes, it would be more effective to deploy SSDs in the capacity tier.
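As a hedged illustration of that Storage QoS option, Windows Server 2016 lets you define policies in PowerShell and attach them to a VM's virtual disks. The policy name and IOPS limits below are placeholders, not values we tested:

  # Hypothetical example: define a dedicated Storage QoS policy with minimum
  # and maximum IOPS, then apply it to a VM's virtual disks.
  # The name, limits, and VM are illustrative only.
  $policy = New-StorageQosPolicy -Name "TenantLimit" -PolicyType Dedicated `
      -MinimumIops 200 -MaximumIops 1000

  Get-VMHardDiskDrive -VMName "S2D-VM-001" |
      Set-VMHardDiskDrive -QoSPolicyID $policy.PolicyId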

We’ll be presenting the IOPS performance results for the all-flash NVMe and SATA SSD configuration soon, so stay tuned for our next blog.

Disclaimers

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as DISKSPD, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing.^ Other names and brands may be claimed as the property of others.