With the growing amount of data in every area of our lives, data movement has become a major challenge, especially in the modern data center. Supercomputing organizations face this problem across their areas of innovation and research, such as life sciences, energy, manufacturing, and machine learning. Because more data generally yields more accurate results, these innovators aim to exchange petabytes (PB) of data to take their projects and missions to ever greater levels. In many cases the network infrastructure is capable of high-speed transfers; the storage subsystems, however, may not be. Data Transfer Nodes (DTNs), buffers built around NVMe*-based SSDs, are deployed to deliver maximum transfer performance from one DTN to another and across cluster infrastructures. The following article shares the details of an optimized DTN architecture at the SLAC National Accelerator Laboratory utilizing Intel® SSDs.
Since the winter of 2014, the Office of the CIO (OCIO)/Computing Division of the Department of Energy (DOE)'s SLAC National Accelerator Laboratory (SLAC) has been collaborating with Zettar Inc. (Zettar), a software company focused on DTN solutions. Together they have been exploring a solution with the potential to achieve multi-100Gbps and faster file-to-file data transfers.
It should be noted that the approach for achieving this level of data transfers is very different from the typical data transfers performed interactively by general computer users. The former is like constructing the world-renowned Hetch Hetchy Project for transporting water from Yosemite to the City of San Francisco, whereas the latter is more like putting together an irrigation system for a corn field. See the following two figures.
From the above, it should also be evident that a solution for distributed data-intensive transfers must adopt a "co-design" approach, i.e. considering storage, compute, and networking all at the same time; such transfers are no longer network-only tasks. The most critical part is storage performance. Like a low water level in a reservoir (e.g. during a drought), without sufficient storage IOPS and throughput, the rest, e.g. how much compute power and network bandwidth are available, won't really matter.
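As a back-of-the-envelope illustration of why storage sets the ceiling, one can estimate how many NVMe drives are needed just to absorb a single 100Gbps stream on the write side. The per-drive sequential write rate used below (~1.9 GB/s) is an assumption roughly in line with vendor specifications for drives of the DC P3700 class; substitute the figure for your own hardware.

```python
import math

link_gbps = 100                       # line rate of one network link
link_GBps = link_gbps / 8             # 12.5 GB/s that must land on storage
ssd_write_GBps = 1.9                  # assumed sustained sequential write per drive

# Minimum drive count just to keep up with the link, before any headroom
drives_needed = math.ceil(link_GBps / ssd_write_GBps)
print(drives_needed)                  # 7 under these assumptions
```

A real design adds headroom on top of this minimum, so drive counts slightly above it are to be expected.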
As the end of this decade approaches, more and more US Department of Energy (DOE) projects have become "distributed" data-intensive, i.e. they not only generate stunning amounts of data in the multi-PB range, but also must transport such data to one or more DOE supercomputing centers for analysis, among other data transfer needs, both in country and internationally. Hosted at SLAC, the Linac Coherent Light Source (LCLS) free-electron laser project, a premier Exascale Computing Initiative project, is an excellent example of such an endeavor.
FIG. 3 LCLS-II Data Throughput, Data Storage and Data Processing Estimates (Source: LCLS-II Offline Computing Scaling, LCLS, SLAC internal presentation; unpublished)
The focus of SLAC and Zettar's efforts has been assessing the feasibility of the LCLS project's distributed computational approach and finding ways to realize it, so as to have a solution ready before phase II (aka LCLS-II) is brought online around 2020. Zettar leads these efforts, with the support and participation of the OCIO/Computing Division and the LCLS project at SLAC. In 2016, the Zettar team has been using the 3rd generation of its reference data transfer system design and its HPC data transfer software, named zx, for the effort.
All three generations of the solution are scale-out capable. In particular, the 3rd generation test bed can be configured to provide network bandwidth of up to 8 x 100Gbps.
FIG. 4 The functional diagram of the 3rd generation test bed. Intel® SSD DC Family for NVMe are critical to the setup.
The following design goals were given equal weight in the data transfer solution:
- High availability (HA) among data transfer nodes and their network interfaces (software: peer-to-peer; hardware: 4-node cluster, each node with 2x25G Ethernet interfaces and built-in failover of links, NICs, and servers)
- Scale-out (software: peer-to-peer; hardware: 1U at a time, i.e. finer granularity)
- High resource extraction efficiency (thus ensuring low cost, low component count, low complexity, and higher reliability)
- Low cost (thus 8 SSDs per storage server rather than 10; inexpensive models for reads, a pricier model for writes)
- Small form factor (thus 2U/4-node high-density servers and 1U servers as building blocks, not 2U units)
- Energy efficiency (fewer components is better)
- Forward-looking (high-speed, low-latency fabric and 25G Ethernet NICs are employed, a modern design)
- Storage tiering friendly - as illustrated below (Zettar handles the Compute Nodes and the (all-NVMe) High Performance Storage tier)
Owing to the high storage performance requirement, the Zettar team started using Intel DC P3700 NVMe SSDs in mid-2014. The main reason is that in high-speed data transfers, writes are where the bottlenecks lie. The outstanding write performance of the DC P3700 makes it a practical choice. Its endurance and consistent throughput are important too: transferring, for example, 20PB's worth of data non-stop over a period of 3 weeks makes performance consistency a "must-have".
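The 20PB-in-3-weeks figure translates into a sustained rate that leaves little room for throughput dips, as a quick calculation shows (decimal units assumed):

```python
data_bytes = 20e15                        # 20 PB, decimal units
duration_s = 3 * 7 * 86400                # 3 weeks, non-stop
rate_GBps = data_bytes / duration_s / 1e9 # ~11 GB/s sustained
rate_gbps = rate_GBps * 8                 # ~88 Gbps sustained
print(round(rate_gbps, 1))                # 88.2
```

In other words, the storage tier must sustain close to a full 100Gbps link's worth of writes for the entire three weeks, which is why consistency matters as much as peak speed.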
For cost reasons, all three generations of the test bed employ COTS hardware running the CentOS 7.x OS. Since clusters are employed, Zettar has experimented with various ways of aggregating distributed NVMe SSDs. At present it has found using a high-performance parallel file system (e.g. Lustre, BeeGFS) to be a reasonable near-term approach. It is also investigating other possible approaches, which may bypass file systems altogether in the long run; file systems tend to carry high overheads, reducing the performance available from the raw devices.
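As a toy illustration of the aggregation idea (this is a minimal sketch, not Zettar's zx software), one can fan out concurrent sequential writers, one per mount point, and measure the combined rate; a parallel file system does something analogous transparently, with striping. The mount paths in the usage comment are hypothetical placeholders.

```python
import concurrent.futures
import os
import time

def write_stream(path, size_mib=64, chunk=1 << 20):
    """Sequentially write size_mib MiB to path and fsync; returns bytes written."""
    buf = os.urandom(chunk)
    written = 0
    with open(path, "wb") as f:
        while written < size_mib << 20:
            f.write(buf)
            written += chunk
        f.flush()
        os.fsync(f.fileno())
    return written

def aggregate_write_rate(mounts, size_mib=64):
    """One writer per mount point; returns the combined rate in bytes/s."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(mounts)) as ex:
        total = sum(ex.map(
            lambda m: write_stream(os.path.join(m, "probe.bin"), size_mib),
            mounts))
    return total / (time.perf_counter() - start)

# Example: point each writer at a different NVMe-backed file system
# rate = aggregate_write_rate(["/mnt/nvme0", "/mnt/nvme1"])  # hypothetical paths
```

Note that Python threads suffice here because the work is I/O-bound; a production tool would use asynchronous or direct I/O to avoid page-cache effects.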
FIG. 5 How storage tiering works in a 3-tiered system, see Designing High-Performance Storage Tiers Intel® Enterprise Edition for Lustre* software and Intel® Non-Volatile Memory Express (NVMe) Storage Solutions.
As indicated in the SLAC Technical Note SLAC-TN-16-001, with the 2nd generation test bed of two 4-node clusters, 2 Intel DC P3700 U.2 1.6TB NVMe SSDs per node (i.e. 8 NVMe SSDs per cluster), and a test data set of 20000 x 50MiB files, the throughput attained was around 70Gbps. The same Note observes that if RDMA and external dedicated scale-out storage were employed, an increase of about 20% in throughput was anticipated. Please see the following figure.
Indeed, with the 3rd generation test bed shown above, using two 1U storage servers with 4 Intel DC P3700 U.2 1.6TB NVMe SSDs each (thus 8 NVMe SSDs per cluster, as before), 87Gbps was attained. Other predictions from the aforementioned Note have been confirmed as well.
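The numbers line up: the Note's anticipated 20% uplift over the 70Gbps baseline can be checked with one line of arithmetic (both figures taken from the text above).

```python
baseline_gbps = 70                       # 2nd generation test bed result
predicted_gbps = baseline_gbps * 1.2     # ~20% anticipated increase -> 84 Gbps
attained_gbps = 87                       # 3rd generation test bed result
print(attained_gbps >= predicted_gbps)   # True: prediction met and exceeded
```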
In the winter of 2016, LCLS and Zettar embarked on a pilot deployment. In 2017 and beyond, SLAC/Zettar will not only expand this pilot deployment but also continue pushing the envelope. Intel's innovations in storage technologies, coupled with NVMe over Fabrics (NVMe-oF), will be important elements in that work. For more information, please refer to the linked references, and visit the DOE booth for a joint talk given by DOE's ESnet, NERSC, LCLS/SLAC, and Zettar, offering both a comprehensive background and a live demo using a 5000-mile ESnet 100G loop.
Andrey Kudryavtsev, Intel Corp.
Chin Fang, Zettar Inc.