Pushing the Boundaries of Big-Data Analysis

If you follow me on social media, you know that Intel has been working hard with industry leaders to provide the means to crunch massive amounts of data. But how do we define “massive”? A small company might consider a few hundred gigabytes to be massive, whereas a large enterprise might see a few hundred gigabytes as nothing but noise. To explore the boundaries of what “massive” means, Intel and SAP set out to build an analytics system capable of handling dozens of terabytes of data and flexible enough to scale up to a petabyte or more. These were our goals:

  • Build a system that can ingest hundreds of millions of rows of data per hour
  • Provide ad-hoc query and reporting capabilities across millions of rows of data
  • Enable enterprises to use exploratory and predictive analysis tools to better understand past activities and future trends
  • Dynamically manage data movement across multiple storage and compute system tiers

To achieve these goals, our system used a multi-tiered storage and analytics concept based on “hot” and “warm” data temperatures. Hot data is data that users access frequently, while warm data is typically older and does not require frequent access. In our system, hot data resided in SAP HANA*—a fast, in-memory database running on Lenovo servers powered by the Intel® Xeon® processor E7 v2 and Intel Xeon processor E7 v3 families.1 Persistent storage for this data tier was provided by an EMC VMAX3* storage system.2


The drawback with in-memory databases is the high cost of RAM compared to disk-based storage. When dozens of terabytes or more of data are involved, keeping everything in RAM can be cost-prohibitive. That’s where the benefits of the warm data tier come in. In our system, this tier stored older, less frequently accessed data on cost-effective disk-based storage or on solid-state drives (SSDs) while continuing to provide fast access and analysis speeds. Our warm data tier was powered by SAP HANA Dynamic Tiering running on a Lenovo server powered by the Intel Xeon processor E7-8850 v2 product family.3
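The hot/warm split described above boils down to a placement decision based on how recently data is accessed. Here is a minimal sketch of such a recency-based policy in Python; the 30-day threshold and the `classify` function are illustrative assumptions, not SAP HANA Dynamic Tiering’s actual mechanism:

```python
from datetime import datetime, timedelta

# Illustrative hot/warm placement policy based on last access time.
# The 30-day cutoff is an assumed example, not a SAP HANA default.
WARM_THRESHOLD = timedelta(days=30)

def classify(last_access: datetime, now: datetime) -> str:
    """Return 'hot' for recently accessed data (keep in RAM),
    'warm' for older data (move to disk/SSD tier)."""
    return "hot" if now - last_access <= WARM_THRESHOLD else "warm"

now = datetime(2015, 6, 1)
print(classify(datetime(2015, 5, 25), now))  # hot  (accessed a week ago)
print(classify(datetime(2015, 1, 1), now))   # warm (five months old)
```

In a real deployment the database manages this movement dynamically, so queries can span both tiers without the application knowing where each row lives.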

The performance results are impressive:

  • The system was able to load 24 billion rows of data per hour.4
  • Extract, transform, and load (ETL) of 2 billion rows took only 288 seconds, while aggregating and analyzing 1.8 billion rows with SAP HANA Predictive Analysis Library took less than 120 seconds.4
  • A total of 800 billion rows of data were loaded, equal to 51 TB of uncompressed data.4
  • The process of making 5.6 billion rows of data ready for analysis only required 6 minutes.4
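A quick back-of-envelope check, using only the figures above, helps put these numbers in perspective (this is illustrative arithmetic on the published results, not a new measurement):

```python
# Derived figures from the published results: 800 billion rows = 51 TB
# uncompressed, loaded at 24 billion rows per hour.
TB = 10**12

rows_total = 800 * 10**9          # total rows loaded
data_total = 51 * TB              # total uncompressed bytes

bytes_per_row = data_total / rows_total
print(f"{bytes_per_row:.1f} bytes per uncompressed row")   # ~63.8

rows_per_hour = 24 * 10**9        # sustained ingest rate
print(f"{rows_per_hour / 3600:.2e} rows per second")       # ~6.7e6

# The 2-billion-row ETL run implies a comparable rate:
print(f"{2 * 10**9 / 288:.2e} rows per second")            # ~6.9e6
```

In other words, both the bulk load and the ETL run sustained on the order of seven million rows per second.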


Want to find out more? Check out our solution brief, “Solving the Big Data Analytics Riddle,” to see how a combination of Intel Xeon processor-based servers and SAP HANA can accelerate your big data analytics. And be sure to follow me and my growing #TechTim community on Twitter: @TimIntel.


1 The hot data tier contained 14 servers with the following CPUs, with each server containing 1 TB of RAM and four sockets: two servers containing the Intel® Xeon® processor E7-8890 v3, four servers containing the Intel Xeon processor E7-8890 v2, five servers containing the Intel Xeon processor E7-8880 v2, two servers containing the Intel Xeon processor E7-4890 v2, and one server containing the Intel Xeon processor E7-4860 v2. Total number of cores in the hot data tier was 852, and total RAM was 14 TB.

2 For complete EMC VMAX3* benchmark information, visit

3 The warm data tier contained a single eight-socket server containing the Intel® Xeon® processor E7-8850 v2. The total number of cores was 96, and the total RAM was 2 TB.

4 Total data loaded was 800 billion rows, which was approximately 51 TB of uncompressed data. The EMC VMAX3* system achieved 800 MB per second (approximately 2 percent of the available capacity) while writing 24 billion rows per hour to the warm data tier. The data load consisted of 100 data files being loaded in parallel using 100 of the 192 cores on the warm data tier, with the remaining cores being used for ad-hoc queries. Each data file was 5 GB in size and contained approximately 77 million rows of data. Note that performance was not measured using industry-standard benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tim Allen

About Tim Allen

Tim is a strategic relationship manager for Intel driving enablement for enterprise software companies related to the cloud, big data, analytics, AEC, commercial VR, datacenter, and IoT. Tim has 20+ years of industry experience including work as a systems analyst, developer, system administrator, enterprise systems trainer, product marketing engineer, and marketing program manager. Prior to Intel, Tim worked at IBM, Tektronix, Intersolv, Sequent, and Con-Way Logistics. Tim holds a BSEE in computer engineering from BYU and an MBA in finance from the University of Portland. Specialties include PMP, MCSE, CNA, HP-UX, AIX, Shell, Perl, and C++.