Flying elephants: Enterprise Hadoop in the cloud

As my flight glided over Chicago last week, the skyscrapers in the distance reminded me that these tall monuments stand there because a great fire once razed their predecessors. More importantly, they stand there because a number of independent technologies happened to connect and coincide in ways that made the whole greater than the sum of its parts. Skyscrapers would not have been possible without the Bessemer process, which enabled mass production of strong steel; building designs that replaced load-bearing brick walls and wrought-iron beams with steel girders and curtain walls of glass; the telephone, which allowed communication not only across distances but also across heights; and elevators that could carry people safely. And just as it was with the skyscrapers that touched the clouds at the turn of the twentieth century, so it is now with the servers, storage, and software that enable big data analytics in the cloud.

Infrastructure as a service means that enterprise workloads have a place to go with headroom to scale. This is useful not only for legacy enterprise workloads but also for new data-intensive applications that process unstructured data from diverse sources. In short, "big data analytics" is a natural fit for cloud infrastructure.

We bring this up because several independent efforts are now converging on the goal of enabling elastic Hadoop deployments in enterprise datacenters. This has long been possible in public clouds, where developers have used Elastic MapReduce on AWS to develop and test apps since 2009. But as enterprises bring Hadoop out of the POC closet and into production, will they deploy it on managed, virtualized infrastructure much like any other enterprise workload?

I, for one, think it is inevitable, but much work remains before we can declare victory.

Delivering enterprise-grade performance, security, and manageability for Hadoop in any environment, not just in the cloud, is a hard problem. But hard problems can be solved by smart groups working collaboratively, with support from a broad ecosystem of vendors, enthusiasts, and buyers. This is in part why Intel launched its own distribution based on Apache Hadoop. Taking a commercial stake in the Hadoop distribution business signifies the seriousness of purpose behind the advancement of Apache Hadoop for new usage models and use cases.

The Intel Distribution takes full advantage of hardware features such as AES-NI in the Intel Xeon processor, SSDs for storage, and 10GbE for networking, and all of that code has been submitted upstream as patches. Ever since we started testing and optimizing Hadoop on Xeon-based servers, we've also been testing Hadoop in VMs. Nothing thrills our developers more, it would seem, than the challenge of extracting optimum performance out of that virtualized infrastructure. We're also working closely with the community to harden Hadoop by providing a common framework for authentication, authorization, and auditing.

One of the newest projects in which we're involved is integrating and scaling Hadoop on virtualized infrastructures under various cloud management frameworks. As one example, we're involved in Project Savanna along with Mirantis and Red Hat. Savanna is designed as an OpenStack-native component that can quickly provision Hadoop clusters and resize them on demand. Savanna integrates with OpenStack Horizon to allow users to specify cluster parameters and deploy clusters quickly. In theory it can work with any distribution, but we showed a working demo of the Intel Distribution for Apache Hadoop on OpenStack (on Fedora images) at the Intel booth at the Red Hat Summit, where our friends from Mirantis joined the Intel team to answer questions.
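
To make "specify cluster parameters, provision, and resize on demand" a little more concrete, here is a minimal Python sketch of the idea. It is not Savanna's API. ClusterSpec, resize, and plan are hypothetical names invented for illustration, and the image and flavor values are placeholders. It only models the kind of information a user might enter and the work a provisioner would derive from a resize request.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical cluster parameters a user might enter in a dashboard form."""
    name: str
    image: str         # placeholder: VM image that carries the Hadoop distribution
    flavor: str        # placeholder: VM size used for every node
    worker_count: int  # number of worker nodes in the cluster


def resize(spec: ClusterSpec, worker_count: int) -> ClusterSpec:
    """Return a new spec with a different number of workers."""
    if worker_count < 1:
        raise ValueError("a cluster needs at least one worker")
    return replace(spec, worker_count=worker_count)


def plan(current: ClusterSpec, target: ClusterSpec) -> str:
    """Describe what a provisioner would have to do to reach the target spec."""
    delta = target.worker_count - current.worker_count
    if delta > 0:
        return f"boot {delta} worker VM(s)"
    if delta < 0:
        return f"decommission {-delta} worker VM(s)"
    return "no change"


# Example: start with 4 workers, then grow to 10 on demand.
spec = ClusterSpec(name="analytics", image="fedora-hadoop",
                   flavor="m1.large", worker_count=4)
bigger = resize(spec, 10)
print(plan(spec, bigger))
```

Keeping the spec immutable and computing a plan from the difference between two specs is one simple way to think about elasticity: the user declares the cluster they want, and the management layer works out which VMs to add or remove.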

We're excited about our collaboration with Mirantis on this project and look forward to announcing some great progress shortly. In my next post, I'll dive into the details of the integration and highlight some of the work Intel is doing in both the Hadoop and the OpenStack projects to enable a trusted platform for analytics in the cloud.