Big Data and Data Analytics on the Cloud

Introduction

The volume of data, and the insights its analysis can provide, is growing in nearly every industry. As companies continue to analyze their data for better insight into their products, sales, customers, and more, so grows the need for infrastructure capable of handling that analysis. Running your big data and other data analysis workloads in the cloud is a great way to take advantage of the scaling and flexibility the cloud offers. In this blog, I’ll discuss Intel’s experience running big data workloads on Apache Spark™ clusters in the cloud, specifically on AWS and Azure. My goal is not to help you decide which cloud service provider (CSP) to choose, but to walk you through some decisions you’ll need to make and considerations to keep in mind when setting up your AWS or Azure environment.

Big Data and Data Analytics Offerings

To ensure a controlled benchmarking environment, we at Intel built our Spark clusters manually, directly on AWS EC2 instances or Azure Virtual Machines (VMs). However, both CSPs offer dedicated big data services that you may find better suit your needs.

Azure

For managed Apache Spark deployments in Azure, customers should look into the CSP’s HDInsight service. Customers can use HDInsight to create a managed cluster running open-source frameworks such as Apache Hadoop and Apache Spark. HDInsight clusters work with data stored on Azure’s storage offerings, such as Azure Blob storage and Azure Data Lake Storage.1 HDInsight uses an HDFS interface to allow compute clusters to operate on object storage (Azure Blob) for unstructured data. You can read more about HDInsight and Azure storage benefits on Azure’s documentation site.2 Azure also offers cloud-scale analytics services, including Azure Synapse Analytics, Azure Databricks, and Azure Machine Learning.3

AWS

For a managed deployment of a big data environment, Amazon recommends Amazon EMR, which supports many open-source tools, including Apache Spark. Amazon EMR automates tasks and simplifies setup and operation. You can deploy your Amazon EMR workloads on standard EC2 instances as well as on Amazon Elastic Kubernetes Service (EKS).4 AWS also offers several services dedicated to big data and data analytics, including Amazon Athena and Amazon QuickSight.5 With Amazon EMR, you can leverage tools such as S3 Select to improve performance when analyzing data on Amazon S3 object storage. Amazon recommendations and limitations can help you determine whether S3 storage is right for your big data analytics.6

Big Data Instance Sizing Considerations

When creating an Apache Spark cluster for your big data analytics workload, whether through a managed Spark service or manually, you must make sure that the underlying infrastructure can provide the level of performance your workload requires. Because Spark is a distributed processing framework that relies heavily on memory, compute and memory resources are of utmost concern. However, networking is also a major player, so you need to ensure that your cluster isn’t bottlenecked on network speeds. Finally, while disks and disk performance may not be as crucial to your Spark big data workload as they are to other workloads, make sure that your disks perform well enough to keep up with demand. Trying to find the right VM instance series and sizes while balancing all these requirements can be overwhelming. To guide you through the process, I’d like to discuss what we learned in our testing.

First, though, let me share a quick, high-level description of the testing that I’ll be discussing in this blog. On both AWS and Azure, we created a five-node Spark cluster for each VM or instance type and size we tested. One instance served as the driver node, while the other four served as the worker nodes that executed the tasks. To test the performance of the cluster, we installed the HiBench benchmark suite and ran two of its micro benchmarks: K-means clustering (Kmeans) and Bayesian Classification (Bayes). We executed three runs of every test and used the median throughput as the final result. To learn more about the tests we cite throughout this blog, please read the briefs we link to below.7,8
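
As a simple illustration (not our actual test harness), the median-of-three rule we used to pick each final result can be sketched in a few lines of Python; the throughput numbers below are hypothetical:

```python
from statistics import median

def final_result(run_throughputs):
    """Return the reported score for a test: the median of its three runs."""
    if len(run_throughputs) != 3:
        raise ValueError("expected exactly three runs per test")
    return median(run_throughputs)

# Hypothetical HiBench throughput numbers (MB/s) for three Kmeans runs:
print(final_result([412.0, 398.5, 420.3]))  # the middle value, 412.0
```

Using the median rather than the mean keeps a single outlier run (for example, one slowed by a noisy neighbor) from skewing the reported result.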

AWS

I won’t dive too deeply into all of the EC2 offerings from AWS because their documentation covers them well. Both Compute Optimized and Memory Optimized EC2 instances are good options for a big data cluster, as are the General Purpose instances if your big data workload is well balanced. The first thing you must decide on is your compute: the processor generation and number of vCPUs you need. It may be tempting to cut corners on processing power to save money, such as choosing the M4 series with older Intel processors, which is slightly less expensive than newer instances backed by 2nd Generation Intel Xeon Scalable processors. However, in our tests, the M5n instance cluster delivered up to 1.72x the throughput of the M4 cluster while, at the time of testing, costing only 1.19x as much. That translates to much better value for your dollar when you choose the newer hardware. Also, note that some instance series offer more than one processor model and generation. On some of these multi-processor series, AWS has disabled certain features of the newer processors to keep performance from varying too much from instance to instance. One such case is the C5 instances with up to 36 vCPUs; on these instances, AWS has disabled Intel DL Boost, a feature normally available on 2nd Gen Intel Xeon Scalable processors.
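
To make that value comparison concrete, here is a small Python sketch that turns the relative throughput and relative cost figures above into a performance-per-dollar ratio. The 1.72x and 1.19x figures come from our testing; the helper itself is just illustrative arithmetic:

```python
def perf_per_dollar_gain(throughput_ratio, cost_ratio):
    """Relative performance per dollar of a newer instance vs. an older one.

    throughput_ratio: newer throughput / older throughput
    cost_ratio: newer hourly price / older hourly price
    """
    return throughput_ratio / cost_ratio

# M5n vs. M4 from our AWS tests: 1.72x the throughput at 1.19x the cost.
gain = perf_per_dollar_gain(1.72, 1.19)
print(f"M5n delivers about {gain:.2f}x the throughput per dollar of M4")
```

Any ratio above 1.0 means the newer instance does more work per dollar spent, even though its hourly rate is higher.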

The number of vCPUs you need depends mostly on your workload needs, but note that the increased value of newer processors applies across all three of the instance sizes we tested: 8, 16, and 64 vCPUs. I say, “depends mostly,” because with cloud computing, every instance has various performance limits beyond the processors and memory capacity. Even if your compute needs aren’t too heavy, make sure that the instance you choose doesn’t limit the network bandwidth of your cluster too much. For example, even with the newer 2nd Gen Xeon Scalable processors, the M5 series doesn’t guarantee 10 Gbps networking bandwidth until the 32 vCPU size.9 For our testing, we chose the M5n series, which offers enhanced networking compared to the M5 series, especially for smaller-sized instances. On the M5n series, you can get up to 25 Gbps instead of just up to 10 Gbps for instances with 2 through 16 vCPUs. The Compute Optimized C5n and Memory Optimized R5n series also have AWS Enhanced Networking, so if you need more compute or memory than the General Purpose M5n series offers, be sure to check those out. Whatever instance size you choose for your workload, confirm that the network limits are high enough to support your performance needs.

If, like us, you choose a more manual Spark setup, I recommend leveraging AWS AMIs during deployment. For our environment, we created a base AMI with the OS, software installations, settings, and configurations all saved. From that image, we could create as many Spark nodes as we needed while ensuring that they remained consistent from one instance to the next.
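
As a rough sketch of that flow (the instance, AMI, key pair, and subnet IDs below are placeholders, and the instance type and count are assumptions for illustration), the AWS CLI lets you capture a configured node as an image and then launch identical workers from it:

```shell
# Capture the fully configured base node as an AMI (placeholder instance ID).
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "spark-base-node" \
    --description "OS, Spark, and benchmark tooling preinstalled and configured"

# Later, launch four identical worker nodes from that image
# (replace the AMI ID, key pair, and subnet with your own).
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --count 4 \
    --instance-type m5n.4xlarge \
    --key-name my-key-pair \
    --subnet-id subnet-0123456789abcdef0
```

Because every worker comes from the same image, configuration drift between nodes is far less likely than with hand-built instances.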

Azure

Let’s step through the same exercise on Azure VMs. Again, choose a VM series that you think will best fit your needs, whether that be a general-purpose series or one that is more compute- or memory-focused. Within those categories, Azure offers several series that range from VMs with Intel processors as old as the Intel Xeon E5-2673 v3 to those as new as the public-preview versions featuring the latest 3rd Generation Intel Xeon Scalable processors. Here again, our tests showed that newer processors offer better value for your dollar: our Ddsv4 clusters delivered up to 1.55 times the throughput of the Dsv3 clusters while, at the time of testing, costing only 1.17 times as much. Another advantage of newer VM series is that they narrow down which processor model you might receive. When you create a Dsv3 VM, you cannot choose which of the four different generations of Intel processors will back the VM. Thus, if you want a newer processor but spin up a VM only to find it uses an older E5-2673 v3, your only recourse is to deallocate that VM and try again until you get the right processor. You can save time by choosing a VM series, such as the Ddsv4 series, that offers only the processor you need.
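
If you do land on a multi-processor series, a quick way to see which CPU you actually received is to read the model name from inside the VM. This is a hedged sketch, assuming a Linux guest where /proc/cpuinfo is available; it returns the model string so you can decide whether to keep the VM or deallocate and retry:

```python
def cpu_model_name(cpuinfo_path="/proc/cpuinfo"):
    """Return the CPU model name reported by the Linux guest, or "" if unknown."""
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                # /proc/cpuinfo repeats a "model name : ..." line per logical CPU.
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return ""

# e.g., an Intel Xeon Platinum model string on a Ddsv4 VM
print(cpu_model_name())
```

Run this right after the VM boots; if the model string is an older processor than you wanted, deallocate and recreate before installing anything.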

As you decide which VM size you need for your workload, note any performance limits the VM has. Certain performance caps increase with vCPU count, such as the number of disks you can attach to the VM, the number of NICs you can have, and, most importantly, the network bandwidth available to each VM. For example, a Standard_D16ds_v4 VM has twice the network bandwidth of a Standard_D8ds_v4 VM.10 Read through the documentation carefully to avoid surprises from unexpected performance caps.

One more thing to note when looking at VM specifics: check the features that are available with each series. For instance, the Ddv4 and Ddsv4 series are very similar. However, only the Ddsv4 series supports Premium Storage and Premium Storage caching. Additionally, only some of the series offer Generation 2 VMs. If you want the benefits of a Generation 2 VM, such as Intel Software Guard Extensions (SGX) and virtualized persistent memory (vPMEM), choose a series that has Generation 2 available.11

Finally, leverage Azure snapshots and the Shared Image Gallery to create a baseline VM image. Ensure that your Spark nodes are consistent by spinning them up from the same base image, saving time and effort on repetitive tasks.
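
As an illustrative sketch (the resource group, gallery, subscription, and image names below are all placeholders, and the VM size is an assumption), the Azure CLI flow for this looks roughly like the following:

```shell
# Publish the configured base VM's managed image as a version in a shared gallery
# (create the gallery and image definition first, per Azure's image guidance).
az sig image-version create \
    --resource-group my-spark-rg \
    --gallery-name mySparkGallery \
    --gallery-image-definition spark-base-node \
    --gallery-image-version 1.0.0 \
    --managed-image /subscriptions/<sub-id>/resourceGroups/my-spark-rg/providers/Microsoft.Compute/images/spark-base-image

# Spin up a worker node from that shared image
az vm create \
    --resource-group my-spark-rg \
    --name spark-worker-1 \
    --image /subscriptions/<sub-id>/resourceGroups/my-spark-rg/providers/Microsoft.Compute/galleries/mySparkGallery/images/spark-base-node/versions/1.0.0 \
    --size Standard_D16ds_v4
```

Each worker created this way starts from an identical software stack, which keeps the Spark cluster consistent and repeatable.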

Conclusion

I hope I’ve helped you make sense of some of the issues to keep in mind as you set up your AWS or Azure environment for your big data workloads. There are many advantages to be had by running these workloads in the cloud, and it is worth persevering to make the right selections from the many available offerings.

[1] https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
[2] https://docs.microsoft.com/en-us/azure/hdinsight/overview-azure-storage
[3] https://azure.microsoft.com/en-us/solutions/big-data/
[4] https://aws.amazon.com/emr/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
[5] https://aws.amazon.com/big-data/datalakes-and-analytics/
[6] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html

[7] https://www.intel.com/content/www/us/en/partner/workload/amazon/analyze-more-for-apache-spark-benchmark.html
[8] https://www.intel.com/content/www/us/en/partner/workload/microsoft/analyze-data-for-apache-spark-benchmark.html
[9] https://aws.amazon.com/ec2/instance-types/
[10] https://docs.microsoft.com/en-us/azure/virtual-machines/ddv4-ddsv4-series
[11] https://docs.microsoft.com/en-us/azure/virtual-machines/generation-2

Notices & Disclaimers
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
All product plans and roadmaps are subject to change without notice.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.