THIS PERFORMANCE JUST IN: Next level Business Analytics with the New Intel© XEON© Processor E7 v4 and Microsoft© SQL Server 2016 on a 100TB Dataset

Co-Authored By:

Mahmut Aktasoglu, Intel Software Solutions Group,  mahmut.s.aktasoglu@intel.com

Avaneesh Shetty, Intel Software Solutions Group, avaneesh.k.shetty@intel.com

 

With an increasing amount of data being stored in modern data warehouses, organizations need solutions that scale proportionally. Although building clusters of servers and distributing data across the systems is one of the more common approaches to scale, it is not desirable from a performance or TCO point of view.

Intel and Microsoft collaborate to ensure our data platform solutions delivers cutting edge performance and the best ROI. In a recent Accelerating Analytics Whitepaper, we showcased the capabilities of a data platform, powered by Intel® Xeon® processor E7 v4 family, Intel® Data Center SSDs and Intel® RAID Controllers; coupled with Microsoft® SQL Server 2016 and Windows Server 2016. We loaded a 100TB dataset following TPC-H specifications, which models a data warehouse with approximately 600 billion records in its largest table. Using SQL Server’s columnar database engine, the records were loaded into a clustered columnstore index at an impressive rate of 1.6TB/h. The resulting columnar-store database took about 35TB of total disk space.

In this blog, we will show our new results from the same workload after upgrading our processors to the latest Intel® Xeon® processor E7-8890 v4. We kept the rest of the configuration unchanged[1], demonstrating how SQL Server 2016 performance scales by simply upgrading your processors. Spoiler Alert:

  • Up to 36% shorter query run times on average on a 100TB database with 4-socket Intel® Xeon® processor E7-8890 v4 processor2
  • Up to 2.3x performance improvement in the SQL Server 2016 columnstore engine by utilizing Intel® AVX2 instruction set.4

Test Queries

We used the same 7 queries from previous experiments, plus 5 additional ad-hoc queries derived from TPC-H specs which involves aggregations over large number of records, to highlight AVX2 performance gains. You can find brief descriptions of the queries in the table below.

Test Query ID Notes
TQ1 Pricing summary report
TQ2 Revenue change forecast
TQ3 Important supplier stock identification
TQ4 Promotion effect
TQ5 Top supplier
TQ6 Product type profit measure
TQ7 Returned item report
TQ8 Minimum cost supplier
TQ9 Shipping priority
TQ10 Local supplier volume
TQ11 Shipping modes and order priority
TQ12 Small quantity order revenue

The new Intel® Xeon® E7 v4 performance

The Intel Xeon processor E7 v4 family is the newest member of data center processors, designed for your performance needs in your data platform. Compared to its predecessor, Intel Xeon E7 v4 family can have up to 48 logical cores (33% increase vs. Xeon E7-8890 v3) in terms of the maximum degree of parallelism that can be attained.

In the graph below, we show performance of the same queries executing on exactly the same configuration, the only difference being the processors. Before each query run, we clear all SQL Server caches. Then, each query is run twice to warm up SQL Server column-store object pool cache. We plot the query run time of the second run. With the new Intel Xeon E7-8890 v4 processor, we see performance gains across the board. For the same 7 queries we used in our previous work, we observed up to 36% performance gains on average compared to Xeon E7-8890 v3 for complex ad-hoc queries against a 100TB dataset[2]. As simple as a processor upgrade, customers can unlock the potential of SQL Server 2016 by coupling it with Intel’s latest data platform CPUs.

blog1

Test Query ID 4-socket

Intel® Xeon® Processor E7-8890 v3

4-socket

Intel® Xeon® Processor E7-8890 v4

Performance Increase
TQ1 161.23 137.79 1.17
TQ2 5.31 4.97 1.06
TQ3 3761.52 2987.22 1.25
TQ4 914.44 412.93 2.21
TQ5 492.94 401.17 1.22
TQ6 105.02 74.32 1.41
TQ7 79.23 51.75 1.53
Average Performance Increase 1.36x

 

Microsoft SQL SERVER 2016 optimized for Intel AVX2

Our next experiment demonstrates yet another value of Intel® Xeon® processor family technology. With its latest release, SQL Server supports Intel® SIMD instruction sets in columnstore batch-mode operations. What does this mean?  SQL Server 2016 has different versions of the same algorithms using vector instructions. When you upgrade your database software to SQL Server 2016, it will detect all vector instruction set extensions available on the CPU and choose the algorithms that provides the best performance[3].

To show the performance benefits of using vector instructions in SQL Server’s columnstore engine, we designed an experiment. In this experiment, we run same queries, on the same platform with the only difference being the instruction set being used by SQL Server. We have implemented SQL Server 2016 prototype that can override selection of algorithms via a trace flag. By using this trace flag, we turn off AVX2 support for the first set of runs (i.e., scalar). In the second run, we let SQL Server 2016 choose the best version of algorithms, which uses AVX2 instructions in our case.

Similar to our previous experiment, we clear SQL Server caches in-between each query run, and then run the queries twice. We use the run time from the second execution of the query. We plot the results in the chart below. In this chart, you can see the performance boost of up to 2.3x increase by utilizing Intel® AVX2 instruction set4.

blog2

Test Query ID Scalar AVX2 Performance Increase
TQ1 149.70 137.79 1.08
TQ2 11.52 4.97 2.31
TQ8 142.49 122.30 1.16
TQ9 593.99 451.82 1.31
TQ10 670.59 510.61 1.31
TQ11 113.71 98.72 1.15
TQ12 593.99 495.98 1.19
Average Performance Increase 1.31x

Test query 2 is the perfect value proposition scenario in our experiments. It is a query with a relatively simpler query plan, and spends most of query execution time in AVX2-optimized scan operator. This query calculates an aggregate value over a data set from a given year with some additional predicates (i.e., processing roughly 12 billion rows). Hence performance difference in query run times can be directly attributed to improvements from vectorizing the algorithms. As query plans gets more complicated, measurable benefits from utilizing instruction level parallelism will also reduce (following Amdahl’s law), but still impactful: A reduction of up to 25% in query run times on average with AVX2[4].

We encourage readers of this blog to explore the possibilities with Intel SIMD instruction set extension for their use case by consulting to Intel® 64 and IA-32 Architectures Software Developer Manuals and easy-to-use Intel® Intrinsics Guide web browser tool.

 

Disclaimer

Intel, the Intel logo, Intel Xeon, Intel SSDs, and Intel Raid Controllers are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more information, visit www.intel.com/benchmarks.

 

[1] We have installed a newer build of Microsoft Windows Server 2016.

[2] 36% more performance average on 7 ad-hoc data warehousing queries, comparing Intel® Xeon® Processor E7-8890 v4 to Intel® Xeon® Processor E7-8890 v3 on a 4 socket server with 6TB DDR4 memory, 100TB storage for data (64 Intel® SSDs DC3500 series) running Microsoft SQL Server 2016 RTM on Windows Server 2016.

[3] SQL Server 2016 supports Intel’s SSE4.2, AVX and AVX2 instruction set extensions.

[4] Up to 2.3x performance improvement and 25% average reduction in query run times on 7 ad-hoc data warehousing queries, comparing customized SQL Server 2016 RTM with AVX2 instruction set support to same engine without AVX2 support, on a 4 socket server with Intel® Xeon® Processor E7-8890 v4 processors, 6TB DDR4 memory, 100TB storage for data (64 Intel® SSDs DC3500 series) running on Windows Server 2016. Reduction in query run time calculated as: ( (1 -  (AVX2 run time/ scalar run time) ) x 100). Average reduction in run time is geometric mean of reduction in query run times for the 7 queries.