Intel® Distribution for Apache Hadoop* (IDH): The Value Proposition

Girish Juneja is CTO of Intel's Datacenter Software Division and General Manager of the Big Data and Expressway software businesses. Girish has over 21 years of experience building software businesses at Intel, as an entrepreneur, and earlier at Verizon Telecommunications. Prior to his current position, he was the Director of Application Security & Identity Products in the Software Services Division of SSG. In this role he was responsible for the conceptualization, product development, and sales and marketing of the Intel Expressway software product line for service providers, ISVs, OEMs, and select end users. Girish also led the development of the identity software strategy for SSG that resulted in the acquisition of Nordic Edge and the development and launch of McAfee Identity Manager and the Intel CloudSSO Identity-as-a-Service offering. Girish joined Intel in 2005 with the acquisition of Sarvega, a company he founded. He received his MBA from the University of Chicago, an MS in Computer Science from the University of Maryland, and a Bachelor’s degree in Electrical & Electronics Engineering from BITS, Pilani, India.


The amount of data that companies need to process is growing rapidly, and the problem of Big Data will only get bigger as more organizations realize the benefits of working through enormous data sets in real time.

In fact, the Big Data market was estimated at $5 billion for 2012 and is predicted to reach $53 billion by 2017, according to Wikibon Market Research. About 2.8 zettabytes of data were generated worldwide in 2012, a figure the IDC Digital Universe report expects to grow to 40 zettabytes by 2020.

Whether you need to quickly analyze consumer behavior, sort through personal location data from millions of smartphones, or detect credit card fraud as it happens, Big Data generates significant financial value across myriad sectors.

Many organizations are turning to the open source Hadoop framework to churn through vast sets of unstructured data. However, their efforts to move beyond the pilot phase are hampered by several issues:

  • Immature non-functional aspects of Hadoop deployment, particularly security and management

  • A shortage of experienced Hadoop specialists

  • The difficulty of deploying the Hadoop framework optimally on current and evolving compute, storage, virtualization, and network infrastructures to deliver the best performance and lowest query latencies

  • The lack of easy, seamless integration between Hadoop-based “data lakes” and other deployed enterprise software such as data warehouses, business intelligence tools, and in-memory databases

This is why we created the Intel® Distribution for Apache Hadoop* (IDH), built worldwide support and services capabilities around it, and are investing resources directly in the community to expand Apache Hadoop's ability to handle the data explosion in data center software.

Hadoop continues to evolve rapidly. Because it is open source, anyone can download it at no cost and evaluate its ability to address a diverse set of Big Data problems, including data-warehouse-style reporting queries, text searches over vast data sets, and low-latency, near-real-time querying. Meanwhile, the framework keeps improving as Intel and other companies invest aggressively in it in the open source community.

Enterprise Grade Management and Reliability

Intel has over three years of experience extending, building, deploying, and optimizing the Hadoop framework with some of our largest cloud, telecommunications, and enterprise partners.

This experience, combined with research from Intel Labs on optimizing Hadoop clusters through innovations in machine learning, has given us a unique perspective on how to provide the enterprise-grade manageability missing from the framework.

The Intel Manager in IDH includes wizards that guide you efficiently through tasks and workflows, while tabbed navigation helps you move quickly between components. It simplifies deployment of HDFS, HBase, Hive, and the rest of the Hadoop platform.

In addition, we have established worldwide enterprise-grade support and services backed by Intel and its ISV and OEM partners.

Ready for Production: Security & Performance

Scores of Intel engineers are working on Project Rhino, a completely open source effort to build the security framework that Apache Hadoop has lacked since its genesis. The intent is to give Apache Hadoop a common security framework, including a common authorization framework, single sign-on, authentication, enhanced audit logging, and data confidentiality.

As an example, several recent news events related to data leaks have raised awareness that data confidentiality is an absolute must-have for any production deployment of an Apache Hadoop based “data lake” environment.

However, neither HDFS nor HBase has encryption built into its architecture today. Intel has already extended the compression codec framework to enable encryption, validated and tested HDFS encryption and decryption on petabyte-scale data, accelerated encryption with the Intel AES-NI instruction set so that its cost drops to near zero, and provided management controls through Intel Manager and its APIs for configuring encryption and keystores. All of this has been contributed to the open source Hadoop community as well, making these innovations available to all distributions.
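
To make the AES-NI point concrete, here is a minimal, self-contained Java sketch of encrypting and decrypting a data block with AES in CTR mode through the standard JCE API. This is illustrative only, not the IDH codec itself; on AES-NI-capable Xeon processors, modern JVMs execute these AES rounds with hardware intrinsics, which is where the near-zero overhead comes from.

    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;

    // Illustrative only: AES/CTR encryption of a byte buffer via the
    // standard JCE API. The real IDH codec additionally handles key
    // management through Intel Manager and its keystores.
    public class AesCtrSketch {
        public static void main(String[] args) throws Exception {
            byte[] key = new byte[16];   // 128-bit demo key; use a managed keystore in practice
            byte[] iv = new byte[16];    // CTR counter block
            SecureRandom rng = new SecureRandom();
            rng.nextBytes(key);
            rng.nextBytes(iv);

            byte[] block = "example HDFS block contents".getBytes(StandardCharsets.UTF_8);

            // Encrypt the block
            Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] ciphertext = enc.doFinal(block);

            // Decrypt it again to verify the round trip
            Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
            dec.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] roundTrip = dec.doFinal(ciphertext);

            System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
        }
    }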

We have designed the Intel Manager Administration Console, as well as the entire IDH distribution, to integrate seamlessly with Kerberos, LDAP, and Active Directory. Even more importantly, we have done this holistically, considering the entire Apache Hadoop ecosystem as well as the organization's overall security picture. This provides a single point of entry for the administrators managing your Hadoop clusters and the developers looking for their job logs and other job statistics. You can rest assured that all security features developed for IDH will integrate seamlessly with your other systems rather than becoming a band-aid solution or a silo.
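
As a rough illustration of what that integration looks like at the configuration level, the following core-site.xml fragment shows the standard Apache Hadoop properties that switch a cluster from its default "simple" authentication to Kerberos. The values are placeholders; your realm, principals, and keytab locations depend on your environment.

    <!-- Standard Apache Hadoop security settings (core-site.xml).
         Shown for illustration; IDH and Intel Manager surface these
         through the Administration Console. -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>   <!-- default is "simple" -->
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>       <!-- enable service-level authorization -->
    </property>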

The other main area where IDH differentiates itself is performance. Leveraging Intel's strength in optimizing software for hardware capabilities, we have tuned IDH for maximum scalability and performance, especially on Intel Xeon servers. IDH runs on all hardware platforms supported by Apache Hadoop, but when it is paired with particular CPU, disk, and network equipment it detects the enhanced capabilities of that hardware and automatically takes advantage of them.
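
One quick way to see what hardware-related acceleration a given node has picked up: recent Apache Hadoop builds include a checknative utility that reports which native libraries (compression codecs and the like) the installation detected. Output varies with your build and hardware.

    # Ask the Hadoop build which native acceleration libraries
    # it detected on this node
    hadoop checknative -a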

For example, you can reduce processing time dramatically with our Big Data solution. Running a 1 TB TeraSort benchmark on Xeon 5600 series servers with hard disk drives and a 1 GbE network connection takes more than four hours, but when you:

  • Upgrade the processor to the latest generation, you reduce the required processing time by 50 percent,

  • Upgrade to SSDs and caching software, you gain another 80 percent speed improvement, and

  • Upgrade to a 10 GbE network connection, you cut the execution time in half once more.

Under this configuration with the Intel® Distribution for Apache Hadoop* (IDH), the time to complete the TeraSort benchmark plummets from more than four hours to less than 10 minutes.
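
If you want to reproduce this kind of measurement on your own cluster, the TeraSort suite ships with the standard Hadoop MapReduce examples. A typical run looks like the sketch below; the examples jar path varies by distribution and version, and the HDFS paths are placeholders.

    # Generate 1 TB of input: 10 billion rows of 100 bytes each
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        teragen 10000000000 /benchmarks/terasort-input

    # Sort it; the elapsed wall-clock time is the benchmark figure
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        terasort /benchmarks/terasort-input /benchmarks/terasort-output

    # Confirm the output is globally sorted
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        teravalidate /benchmarks/terasort-output /benchmarks/terasort-report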

Intel Manager and IDH are tested and qualified to be rock solid when configuring, deploying, and managing clusters, regardless of their size. Whether you are working on a one-node proof-of-concept cluster or a 1,000-node production environment, you can rest assured that IDH will meet your needs.