Deploying the Intel(r) Distribution for Apache Hadoop* (IDH) in Virtualized Environments

Raghu Sakleshpur is an engineering manager at Intel who works on Hadoop deployments and big data technologies with partners, ISVs and customers. He is a technologist to the core and loves to share his experiences with big data and Hadoop whenever the opportunity presents itself. In his spare time, he pursues his other passions: running, hiking, biking and watching sports.

So why would one want to run Hadoop on virtual system instances? After all, Hadoop was originally designed for bare-metal hardware. Wouldn't MapReduce jobs run slower on virtualized instances than on bare metal? With no concept of local storage, wouldn't disk I/O, which is critical for Hadoop performance, be slower on virtual instances?

These are all valid questions to ask before moving Hadoop to a virtual infrastructure. To answer them, one needs to look closely at the advances in server virtualization technology and at the storage options available in the cloud today. Doing so makes it apparent how the rapid evolution of technology in these areas is pushing enterprise computing services as a whole to the cloud.

There are many compelling reasons to support Hadoop in virtualized environments, among them energy cost savings and better resource utilization. The primary one, arguably, is the opportunity to move big data services to a cloud-based infrastructure, with the goal of making big data analytics accessible anytime and anywhere.

The Intel(r) Distribution for Apache Hadoop* (IDH) is agnostic to the nature of the underlying platform infrastructure and works transparently on virtual and physical hardware alike. Intel Manager, the control center for the IDH cluster, can be installed and run on virtual hardware, and it can configure a Hadoop cluster made up of virtual system instances. The locally accessible storage offered by some virtual infrastructures can meet, and in some cases even exceed, direct-attached disk I/O speeds by using a superior network fabric interconnect. Intel Manager also supports RESTful web service APIs, so an entire Hadoop cluster can be deployed on demand from scripts that call the REST API, with no user interaction. Dynamic Hadoop clusters created this way can be configured to load data and run deep analytics on it on a per-use basis. Depending on the application's requirements, the data may be generated on or loaded onto ephemeral storage, or accessed from permanent storage servers in the cloud. In place of HDFS, IDH can also work seamlessly on top of certain other types of external file systems. As a result, Hadoop-based big data application services can be provisioned in the cloud with the underlying IDH infrastructure running entirely on a shared public or private virtual platform infrastructure.
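To make the scripted, no-interaction deployment idea a little more concrete, here is a minimal Python sketch of how a script might prepare a cluster-creation call. Everything specific in it is an assumption made for illustration: the endpoint path `/api/clusters`, the manager URL, the host names and the JSON field names are hypothetical and are not taken from the Intel Manager REST API, so consult the product documentation for the real calls. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request

# Hypothetical endpoint path; the real Intel Manager REST API may differ.
CREATE_CLUSTER_PATH = "/api/clusters"


def build_cluster_spec(name, hosts):
    """Describe a cluster: the first host is the master, the rest are workers.

    The field names here are illustrative assumptions, not a documented schema.
    """
    if len(hosts) < 2:
        raise ValueError("need at least one master and one worker host")
    return {
        "name": name,
        "master": hosts[0],
        "workers": list(hosts[1:]),
    }


def build_create_request(base_url, spec):
    """Prepare (but do not send) a JSON POST that would create the cluster."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + CREATE_CLUSTER_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example with made-up virtual machine names and a placeholder manager URL.
spec = build_cluster_spec("adhoc-analytics", ["vm-01", "vm-02", "vm-03"])
request = build_create_request("https://manager.example.com:9443/", spec)
```

A provisioning script would send the prepared request with `urllib.request.urlopen(request)`, wait for the cluster to come up, run its jobs, and then tear the cluster down with a similar call.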

Creating and running Hadoop clusters on an ad hoc basis on virtualized infrastructure also makes it much easier to audit and enforce security policies that are local to an application domain. Charging customers only for the time during which a cluster was configured and used becomes simpler as well.

Taken together, these reasons make a compelling case for implementing Hadoop clusters on both public and private virtualized infrastructures. In the next post, I will talk more about running IDH on the Amazon Web Services cloud.