Intel Distribution of Hadoop on Amazon EC2

Raghu Sakleshpur is an engineering manager at Intel who works on Hadoop deployments and Big Data technologies with partners, ISVs, and customers. He is a technologist to the core and loves to share his experiences with Big Data and Hadoop technologies whenever the opportunity presents itself. In his spare time, he pursues his other passions: running, hiking, biking, and watching sports.

Intel Hadoop on Amazon Web Services Cloud

Amazon Web Services Elastic Compute Cloud (EC2) supports both private and public instances of virtual systems. Amazon EC2 allows the use of static IP addresses and supports creating a Virtual Private Cloud (VPC), an isolated private network of virtual systems. Amazon EC2 also provides a variety of virtual instance types, each with different size and performance characteristics. Detailed instructions for creating a set of virtual instances with secure shell (SSH) access enabled on all instances can easily be found online.
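As a rough sketch, a set of instances with SSH access might be launched from the command line, assuming the AWS CLI is installed and configured. The AMI ID, key pair, and security group names below are placeholders, not values from this article.

```shell
# Hedged sketch: launch a small set of EC2 instances for the cluster.
# The AMI ID, key pair, and group name are placeholders -- substitute
# your own values.
AMI=ami-12345678     # placeholder AMI ID
KEY=my-keypair       # placeholder EC2 key pair name
SG=hadoop-cluster    # security group for the cluster nodes

# Create a security group and open port 22 so SSH works on every node
aws ec2 create-security-group \
    --group-name "$SG" \
    --description "IDH cluster nodes"
aws ec2 authorize-security-group-ingress \
    --group-name "$SG" \
    --protocol tcp --port 22 --cidr 0.0.0.0/0

# Launch four instances to serve as cluster nodes
aws ec2 run-instances \
    --image-id "$AMI" \
    --count 4 \
    --instance-type m1.large \
    --key-name "$KEY" \
    --security-groups "$SG"
```

The instance type and count are workload-dependent; larger clusters simply raise `--count`.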

Installing the Intel Manager

Once a set of virtual instances is created, the first step is to identify the virtual instance that will host the Intel Manager and install it there. A copy of the IDH installer, with a temporary license and detailed documentation, can be downloaded for free from the download section. The IDH installer guides the user through a series of steps to collect the required information before the Intel Manager installation completes. The installation can also be accomplished with a batch script, without the need for any user interaction.
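An unattended install might look something like the sketch below. The installer file name, the `--silent` and `--response-file` options, and the response-file path are illustrative assumptions, not documented IDH flags; consult the IDH documentation for the actual batch-mode options.

```shell
# Hypothetical unattended install of the Intel Manager on the chosen node.
# NOTE: the installer name and the --silent/--response-file options are
# illustrative assumptions, not documented IDH flags.
MANAGER_NODE=manager-node   # placeholder hostname of the manager instance

# Copy the installer and a pre-filled answer file to the manager node
scp idh-installer.bin install.cfg "$MANAGER_NODE":/tmp/

# Run the installer non-interactively, reading answers from the file
ssh "$MANAGER_NODE" '/tmp/idh-installer.bin --silent --response-file /tmp/install.cfg'
```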

Configuring the Hadoop Cluster

The Intel Manager supports both an interactive wizard and a REST API for configuring nodes into a Hadoop cluster. In interactive mode, a wizard in the Intel Manager guides the administrator through adding virtual instances to the cluster, helps assign Hadoop roles such as NameNode, JobTracker, and HBase Master to the various instances, and installs the software necessary to configure the Hadoop cluster. The REST API allows back-end scripts to configure a Hadoop cluster on demand, without any user interaction.
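Scripted configuration through the REST API might resemble the following curl sketch. The manager URL, endpoint path, credentials, and JSON payload schema are all illustrative assumptions, not the documented Intel Manager API.

```shell
# Hypothetical example of configuring a cluster via the Intel Manager
# REST API. The URL, endpoint, credentials, and payload schema are
# assumptions for illustration only.
MANAGER=https://manager-node:9443   # placeholder manager URL

curl -k -u admin:password -X POST "$MANAGER/api/clusters" \
     -H "Content-Type: application/json" \
     -d '{"name": "analytics",
          "nodes": ["node1", "node2", "node3"],
          "roles": {"node1": ["namenode", "jobtracker"]}}'
```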

IDH Configurations on EC2

There are three possible Intel Hadoop configurations on Amazon EC2 virtual instances. The choice of virtual instance type depends on the Hadoop configuration chosen.

HDFS over ephemeral storage: During cluster configuration, HDFS can be set up via the Intel Manager to use the temporary virtual instance storage. This configuration is best suited for Hadoop jobs that generate their own data, or that copy external data into the temporary HDFS file system from a blob storage system such as Amazon S3 before executing the MapReduce job. The output of the MapReduce job must then be copied to a permanent external storage file system. These Hadoop clusters cost less to spin up and are best suited for long, batch-style, deep data-analytics jobs that can run on temporary Hadoop clusters spun up on demand.
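The staging workflow above can be sketched with standard Hadoop commands. The bucket names, HDFS paths, and job jar/class are placeholders for illustration.

```shell
# Hedged sketch: stage data between Amazon S3 and the cluster's
# ephemeral HDFS around a MapReduce job. Buckets, paths, and the job
# jar/class are placeholders.
INPUT=s3n://my-input-bucket/dataset     # hypothetical input bucket
OUTPUT=s3n://my-output-bucket/results   # hypothetical output bucket

# Copy input data from S3 into the cluster's temporary HDFS
hadoop distcp "$INPUT" hdfs:///user/hadoop/input

# Run the MapReduce job against the local HDFS copy
hadoop jar my-analytics-job.jar com.example.MyJob \
    /user/hadoop/input /user/hadoop/output

# Persist the results back to S3 before the cluster is torn down
hadoop distcp hdfs:///user/hadoop/output "$OUTPUT"
```

The final `distcp` step is essential in this configuration: once the instances terminate, the ephemeral HDFS contents are gone.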

HDFS over permanent storage: Here, each server instance is configured with sufficient persistent local disk storage, and the Intel Manager installs and configures HDFS over that storage. Such Hadoop clusters can be installed once and reused for different MapReduce jobs. These clusters cost more and are preferred when a persistent HDFS data store is required. This type of Hadoop cluster is best suited for providing data analytics over Hadoop HDFS as a service over the cloud.
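As an illustrative sketch: if persistent volumes (for example, Amazon EBS volumes, an assumption not stated above) were mounted on each node at placeholder mount points such as `/mnt/ebs1` and `/mnt/ebs2`, HDFS could be pointed at them with the standard Hadoop 1.x properties in `hdfs-site.xml`:

```xml
<!-- hdfs-site.xml: place HDFS storage on persistent mounts.
     Mount points are placeholders for illustration. -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/ebs1/hdfs/data,/mnt/ebs2/hdfs/data</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/ebs1/hdfs/name</value>
</property>
```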

Hadoop over S3: Intel Hadoop supports a Hadoop-compatible file system plugin architecture that allows external blob stores such as Amazon S3 to be used in place of HDFS. During the configuration phase of the Hadoop cluster, the user can choose to disable HDFS, and IDH can be configured to work against an existing S3 bucket. MapReduce jobs run seamlessly on the files in S3, eliminating the need to move data in and out of the Hadoop cluster. The cost of such clusters can be higher, as it may include additional S3 blob-store charges. However, this configuration is ideally suited for scenarios where moving data in and out of S3 would be expensive or otherwise undesirable.
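In stock Hadoop 1.x, pointing a cluster at S3 instead of HDFS is done with the `s3n` scheme in `core-site.xml`; a sketch is shown below (the bucket name and credentials are placeholders). Whether IDH exposes exactly these properties through the Intel Manager is not stated above.

```xml
<!-- core-site.xml: use an S3 bucket in place of HDFS via the Hadoop 1.x
     s3n filesystem. Bucket name and credentials are placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>s3n://my-data-bucket</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```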