Reaching one million database transactions per second… Aerospike + Intel SSD


We’ve known the innovators at Aerospike for a few years now, and today we are announcing more than 1 million transactions per second (TPS) on a single server with Aerospike’s NoSQL database. That might not seem like such a big deal, until you realize we are not using DRAM for this, as you’ve seen in some previous posts about Aerospike doing 1 million TPS. We are trading out DRAM for NVM (non-volatile memory) in the classic form of NAND. NAND is hot for database fanatics like us, because you can store so much more. NoSQL innovators have learned how to utilize NVM with breathtaking performance and new data architectures. NVM is plenty fast when your specification is 1 millisecond per row “get”. In fact, it’s the perfect trade-off: fast, lower cost, and non-volatile. The best thing is the price. Did I tell you about the price yet?

NVM today, and even more so tomorrow, is a small fraction of the price of DRAM. Better still, you are not constrained by, say, 256GB, or some sweet spot of memory pricing that always leaves you a bit short of your goal. Terabyte-class servers with NVM give you so much more headroom to grow your business without reconstructing and upgrading your world in a few months. How does 6+ terabytes of NVM database memory on a single box sound?

Here at Intel, we say: be bold, go deep into the terabyte class of database server!

So how did we do this? Well, our friends at Aerospike make it possible with a special file system (often called a database storage engine) that keeps the hash index to the data in DRAM (a very small amount of DRAM; we set it to 64 GB), while the actual 1k-or-greater (key, value) row is kept in a large, growth-capable “namespace” on 4 PCIe SSDs (a minimal configuration sketch appears in the configuration section below). Aerospike likes Intel SSDs for their block-level response consistency, because when you replace DRAM and concurrently run at this level of process threading, consistency becomes paramount. In fact, we like to target 99% of reads completing in under 1 millisecond during our tests. Here are the core performance results.

95% read Database Results (Aerospike’s asmonitor and Linux iostat)

asmonitor data

| Record Size | Client Threads | Total TPS | % Below 1ms (Reads) | % Below 1ms (Writes) | Std Dev of Read Latency (ms) | Std Dev of Write Latency (ms) | Database Size |
|---|---|---|---|---|---|---|---|
| 1k | 576 | 1,124,875 | 97.16 | 99.90 | 0.79 | 0.35 | 100G |
| 2k | 448 | 875,446 | 97.33 | 99.57 | 0.63 | 0.18 | 200G |
| 4k | 384 | 581,272 | 97.22 | 99.85 | 0.63 | 0.05 | 400G |
| 1k with replication | 512 | 1,003,471 | 96.11 | 99.98 | 0.87 | 0.30 | 200G |

iostat data

| Record Size | Read MB/sec | Write MB/sec | Avg Queue Depth on SSD | Avg Drive Latency (ms) | CPU % Busy |
|---|---|---|---|---|---|
| 1k | 418 | 29 | 31 | 0.11 | 93 |
| 2k | 547 | 43 | 27 | 0.13 | 81 |
| 4k | 653 | 52 | 20 | 0.16 | 52 |
| 1k (replication) | 396 | 51 | 30 | 0.13 | 94 |

Notes:

1. Data is averaged and summarized across 2 hours of warmed-up runs. Many runs were executed for consistency.

2. The 4k test was network constrained, hence the lower CPU utilization attained during this test.

We ran our tests on 1k, 2k and 4k row sizes, and 1k again with asynchronous replication turned on. We kept the data rows small, which is common for operational databases that manage cookies, user profiles and trade/bidding information in an operational row structure. The Aerospike database does have a binning mechanism that can give you columns, but so many use cases are simple strings that we configured for single-bin (i.e., one column). This configuration gives you the highest performance from Aerospike.

The databases we built ranged from 100GB to 400GB, and as we made the database bigger we did not see any drop in performance. We used a small database to maintain some agility in building and re-working this effort over and over. Our scalability problems came about as we scaled the row sizes, and they were at the network level, no longer a balancing act between the SSDs and threading levels on the CPU. We would simply need more network infrastructure to go to larger row sizes: taking a server beyond 20Gbit of networking at 4k row sizes was a wall for us, and supporting nodes that produce 40Gbit and higher throughput rates can become an expensive undertaking. This network throughput and cost factor will affect your expense thresholds and be a deciding factor in just how dense an Aerospike cluster you wish to attain.
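The arithmetic bears this out: at the 4k row size, 581,272 TPS × 4KB is roughly 2.4 GB/s of payload, or about 19 Gbit/s of client-facing traffic, which is right at the limit of the two 10GbE ports in the configuration detailed below.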

Configuration and Key Results

We used Intel's best 18-core Xeon E5 v3 family servers, which support 72 CPU hardware threads per machine. Aerospike is very highly threaded and can use lots of cores and threads per server; with htop we were recording over 100 active threads per monitoring sample, loading the CPU queues nicely. As for the balance against the SSDs and their queue depths, we found that our target range of 95% to 100% of database record retrievals under 1 ms was best achieved at queue depths under 32 on these Intel NVMe (Non-Volatile Memory Express) SSDs. The numbers in the asmonitor data table show that we were actually getting roughly 97% of all transactions under 1 millisecond. A very high achievement.
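We watched that balance live with standard Linux tools. If you want to do the same, an iostat invocation along these lines (device names assumed from the drive list later in this post) reports per-SSD queue depth (avgqu-sz) and latency (await) once per second:

# extended stats (-x) in MB/s (-m), per device, refreshed every second
iostat -xm /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 1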

Configuration details are below, for those attempting to replicate this work. All components and software are available on the market today. Try the Aerospike Community Edition, free for download here.

AEROSPIKE DATABASE CONFIGURATION

| Description | Details |
|---|---|
| Edition | Community Edition |
| Version | 3.3.40 |
| Bin | Single bin |
| Number of nodes | Two |
| Replication factor | One (two used for the 1k-rows-with-replication run) |
| RAM size | 64 GB |
| Devices | Two P3700 PCIe devices per node (4 total) |
| Write block size | 128k |
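For those replicating this setup, the table above maps to a short namespace stanza in aerospike.conf. Here is a minimal sketch, with our device paths assumed; treat it as illustrative rather than the exact file we ran:

namespace test {
    replication-factor 1      # two for the 1k-with-replication run
    memory-size 64G           # DRAM holds only the primary index
    single-bin true           # one column, the highest-performance layout
    storage-engine device {
        device /dev/nvme0n1   # raw P3700 devices, no file system
        device /dev/nvme1n1
        write-block-size 128K
    }
}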

AEROSPIKE BENCHMARK TOOL CONFIGURATION

Example command used to load the database:

./run_benchmarks -h 172.16.5.32 -p 3000 -n test -k 100000000 -l 23 -b 1 -o S:2048 -w I -z 64

Example command used to run the benchmark from client:

./run_benchmarks -h 172.16.5.32 -p 3000 -n test -k 100000000 -l 23 -b 1 -o S:2048 -w RU,95 -z 64 -g 125000
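A few notes on those flags as used above: -w I performs the linear insert that loads the 100 million keys (-k 100000000), while -w RU,95 requests a 95% read / 5% update mix, matching the results tables. -o S:2048 generates 2048-byte string values (the 2k test; presumably S:1024 and S:4096 were used for the 1k and 4k runs). Each client ran 64 threads (-z 64), so the thread counts in the asmonitor table imply multiple client machines, e.g. nine of them behind the 576-thread 1k result. The -g flag is not in the list below; in the benchmark tool it caps each client at a target throughput, 125,000 TPS in this example.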

Flags of the Aerospike benchmark client:

-u    print full usage

-b    set the number of Aerospike bins (default is 1)

-h    set the Aerospike host node

-p    set the port on which to connect to Aerospike

-n    set the Aerospike namespace

-s    set the Aerospike set name

-k    set the number of keys the client is dealing with

-S    set the starting value of the working set of keys

-w    set the desired workload (I = linear insert; RU,n = read-update with n% reads, e.g. RU,95 above; the default is 80% reads and 20% writes)

-T    set read and write transaction timeout in milliseconds

-z    set the number of threads the client will use to generate load

-o    set the type of object(s) to use in Aerospike transactions (I = integer; S: = string, e.g. S:2048 for 2048-byte values; B: = Java blob)

-D    run benchmarks in debug mode

| System | Details |
|---|---|
| Dell R730xd server system | One primary (dual system for replication testing); dual-CPU-socket, rack-mountable server system; Dell A03 board, product name 0599V5 |
| CPU model used | 2 × Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (max turbo frequency 3.6GHz); 18 cores / 36 logical processors per CPU; 36 cores / 72 logical processors total |
| DDR4 DRAM memory | 128GB installed |
| BIOS version | Dell* 1.0.4, 8/28/2014 |
| Network adapters | Intel® Ethernet Converged Network Adapter X520-DA2 (dual-port 10G PCIe add-in card); 1 embedded 1G network adapter for management; 2 10Gb ports for the workload |
| Storage adapters | None |
| Internal drives and volumes | / (root) OS volume: Intel SSD for Data Center Family S3500, 480GB; /dev/nvme0n1 through /dev/nvme3n1: four Intel SSD for Data Center Family P3700, 1.6TB each, x4 PCIe add-in cards; 6.4TB of raw capacity for Aerospike database namespaces |
| Operating system, kernel & NVMe driver | Red Hat Enterprise Linux Server 6.5; Linux kernel version changed to 3.16.3; nvme block driver version 0.9 (vermagic: 3.16.3) |

Note: Intel PCIe drives use the NVM Express (NVMe) storage standard for non-volatile memory, which requires an NVMe SSD driver in your Linux kernel. At the time of writing, a 3.19-based kernel is the recommendation for benchmark work such as this.
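Before benchmarking, it is worth a quick sanity check that the kernel and nvme driver you think you are running are actually in place; the standard utilities are enough:

uname -r                        # running kernel, expect 3.16.3 here
modinfo nvme | grep -i version  # nvme driver version, if built as a module
ls /dev/nvme?n1                 # the four NVMe block devices Aerospike uses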

Intel PCIe NVMe drives: latest firmware update and tool

Intel embeds its most stable maintenance-release support software for Intel SSDs into a tool we call the Intel Solid State Drive Data Center Tool. Our latest release just landed, and it is important that you use the MR2 release included in the latest version, 2.2.0, to achieve these kinds of results with small blocks. Intel's firmware for the Intel SSD for Data Center PCIe family gets tested worldwide by hundreds of labs, many of them directly touched by software companies such as Aerospike. No other SSD manufacturer is as connected, both in the platform and in the software vendor collaboration space, as Intel is, guaranteeing you the solutions-level scalability you see in this blog. Intel's SSD products are truly platform connected and end-user software inspired.

https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=23931
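If you script the update with the Data Center Tool's command line, the flow is typically list-then-load. A sketch, assuming the isdct CLI syntax from the tool's documentation (verify the exact invocation for your release):

isdct show -intelssd      # list Intel SSDs with their current firmware
isdct load -intelssd 0    # load the bundled firmware onto drive index 0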

Conclusion

The world of deep servers that dish out row-based terabytes has arrived, and feeding a Hadoop cluster from these kinds of ultra-fast NoSQL clusters (or vice versa) is gaining traction. These are TPS numbers never heard of from a single server in the relational SQL world. NoSQL has gained traction as purpose-built, fast, and excellent for use cases such as trading, session management and profile management. Now you see this web-scale-friendly architecture move into the realm of immense data depth per node. If you are thinking 256GB of DRAM per node is your only option for critical memory scale, think again; those days are behind us now.

You can see the back story on this by visiting our partner Aerospike and looking for the webinar by Frank Ober. Here is the link:

Webinars | Aerospike NoSQL In-Memory Key Value Database

Special thanks to Swetha Rajendiran of Intel and Young Paik of Aerospike for their commitment and efforts in building and producing these test results with me.


About Frank Ober

Frank Ober is a Data Center Solutions Architect in the Non-Volatile Memory Group of Intel. He joined three years ago to delve into use cases for the emerging memory hierarchy, after a 25-year enterprise applications IT career spanning SAP, Oracle, cloud manageability and other domains. He regularly tests and benchmarks Intel SSDs against application and database workloads, and is responsible for many technology proof-point partnerships with Intel software vendors.