Intel Hadoop on Nutanix Virtual Storage Platform

Enterprise storage architectures are designed to provide scalability, reliability, and fault tolerance for large-scale business IT needs. Although enterprise-quality storage costs more than a plain disk array, or JBOD (just a bunch of disks), it is the preferred way to record transactions and persist data in many enterprise verticals, such as high finance, banking, and healthcare. Hadoop’s file system, HDFS, was architected to manage simple local-disk storage in an easily scalable design, with a rudimentary data replication mechanism for tolerating disk failures. Such a storage framework can easily satisfy non-critical computing requirements, but it falls short of the strict data availability and durability requirements of enterprise IT.

Intel’s distribution of Hadoop (IDH) addresses this gap between the core Hadoop design philosophy and enterprise IT storage needs through a partnered solution: IDH on Nutanix, a virtualized storage platform. Two key enterprise storage features now available to users of IDH clusters on the Nutanix platform are discussed below.

Virtual Disk Storage Infrastructure



With a converged compute and storage infrastructure for Hadoop, the Nutanix platform removes the need for an additional enterprise-quality disk storage management layer below HDFS. HDFS data replication can handle simple disk and connectivity failures, but handling multiple failures in the network and storage subsystems normally requires manually designed architectures; a VM-centric replication approach eliminates that need. By virtualizing the Hadoop disks, Nutanix takes responsibility for data placement on physical disks. It can therefore provide incremental performance gains through storage tiers that use server-side SSDs, higher data durability through always-on data scrubbing, and storage efficiencies through automatic disk balancing, compression, and de-duplication. Virtualization also makes Hadoop easier to manage, giving enterprises the agility they need while enabling enterprise-grade monitoring and management of the entire Hadoop infrastructure. Expensive block-level re-replication for a Hadoop DataNode, triggered by disk failures, is avoided entirely, which spares the cluster the burden of re-replicating terabytes of data.
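To put the re-replication burden in perspective, here is a rough back-of-the-envelope sketch in Python. The function name, its parameters, and the figures in the usage note are illustrative assumptions, not measurements from IDH or Nutanix. The sketch simply divides the data that must be re-copied by the aggregate copy throughput of the surviving nodes.

```python
def rereplication_hours(lost_data_tb: float,
                        surviving_nodes: int,
                        per_node_mb_per_s: float) -> float:
    """Estimate the hours needed to re-create replicas lost with a failed DataNode.

    Assumptions (hypothetical, for illustration only):
      * every lost block is copied exactly once from a surviving replica;
      * surviving nodes share the copy work evenly;
      * per_node_mb_per_s is the throughput each node can spare for recovery.
    """
    if lost_data_tb < 0:
        raise ValueError("lost_data_tb must not be negative")
    if surviving_nodes <= 0 or per_node_mb_per_s <= 0:
        raise ValueError("surviving_nodes and per_node_mb_per_s must be positive")

    lost_mb = lost_data_tb * 1_000_000            # decimal TB -> MB
    aggregate_mb_per_s = surviving_nodes * per_node_mb_per_s
    seconds = lost_mb / aggregate_mb_per_s
    return seconds / 3600
```

For example, `rereplication_hours(12, 9, 50)` models 12 TB lost on one DataNode and re-copied by 9 surviving nodes at an assumed 50 MB/s each, which comes to roughly 7.4 hours of background copy traffic.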

The Nutanix storage controller also ensures data locality for all virtual disks by writing data to the direct-attached disks on the physical nodes, unlike any other enterprise storage solution, so Hadoop deployments scale with optimal network utilization.

Disaster Recovery Architecture


Implementing scale-out architectures with support for disaster recovery enables high availability guarantees for Intel Hadoop. Using Nutanix's built-in support for disaster recovery (DR) at the infrastructure layer, administrators can configure periodic snapshot replication between two sites. Both sites can remain active with their own independent clusters, with separate or overlapping datasets as necessary. Incremental bi-directional replication between sites, run-book automation, and virtualization make migrating nodes across racks, and even across data centers, straightforward. Intelligent, incremental replication combined with compression ensures optimal network and storage utilization and delivers substantial cost and time savings.
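The general idea behind incremental, compressed snapshot replication can be sketched in a few lines of Python. This is a conceptual illustration only, not the Nutanix implementation: the `Snapshot` type, the `compute_delta` and `apply_delta` helpers, and the choice of SHA-256 digests and zlib compression are all assumptions made for the example.

```python
import hashlib
import zlib
from typing import Dict, Set, Tuple

# Hypothetical model: a snapshot maps a block index to that block's contents.
Snapshot = Dict[int, bytes]


def _digest(block: bytes) -> str:
    """Content fingerprint used to decide whether a block has changed."""
    return hashlib.sha256(block).hexdigest()


def compute_delta(previous: Snapshot,
                  current: Snapshot) -> Tuple[Dict[int, bytes], Set[int]]:
    """Return (changed blocks, compressed) and (indices of deleted blocks)."""
    changed = {
        index: zlib.compress(data)
        for index, data in current.items()
        if index not in previous or _digest(previous[index]) != _digest(data)
    }
    deleted = set(previous) - set(current)
    return changed, deleted


def apply_delta(replica: Snapshot,
                changed: Dict[int, bytes],
                deleted: Set[int]) -> Snapshot:
    """Bring the remote copy up to date without resending unchanged blocks."""
    updated = dict(replica)
    for index in deleted:
        updated.pop(index, None)
    for index, payload in changed.items():
        updated[index] = zlib.decompress(payload)
    return updated
```

Only the blocks that differ between two consecutive snapshots cross the link between sites, and they travel compressed, which is where the network and storage savings described above would come from in a scheme of this kind.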

In addition to disaster recovery, planned failover, in which both sites remain up during the failover process, can be performed for maintenance and for OS and site upgrades.

In summary, a scale-out architecture for an enterprise Hadoop cluster using IDH on Nutanix ensures business resiliency while saving the cost, time, and complexity that such a deployment would otherwise entail without this joint offering.