Securing Big Data for the Enterprise: Project Rhino and the Intel® Distribution for Apache Hadoop* (IDH)

Girish Juneja is CTO of Intel's Datacenter Software Division and General Manager of the Big Data and Expressway software businesses. Girish has over 21 years of experience building software businesses at Intel, as an entrepreneur, and earlier at Verizon Telecommunications.

Prior to his current position, he was the Director of Application Security & Identity Products in the Software Services Division in SSG. In this role he was responsible for the conceptualization, product development, and sales and marketing of the Intel Expressway software product line for service providers, ISVs, OEMs, and select end users. Girish also led the development of the identity software strategy for SSG, which led to the acquisition of Nordic Edge and the development and launch of McAfee Identity Manager and the Intel CloudSSO Identity-as-a-Service offering.

Girish joined Intel in 2005 with the acquisition of Sarvega, a company he founded. Girish received his MBA from the University of Chicago, an MS in Computer Science from the University of Maryland, and a Bachelor’s degree in Electrical & Electronics Engineering from BITS Pilani, India.

Avik Dey is Director of Worldwide Big Data Engineering at Intel. Avik and his team work on research and development for Apache Hadoop and make their work available to Intel customers through the Intel® Distribution for Apache Hadoop* software. Avik’s focus is on making Apache Hadoop enterprise-class software that works and plays well in today’s data center. Avik’s roots in Hadoop go back to his days as the Lead Program Manager for the Hadoop stack at Yahoo!, where he was responsible for delivering Hadoop as a service to over 1,000 users hosted on more than two dozen clusters, large and small, running more than 40,000 nodes. Avik was also the Program Chair for Hadoop Summit 2011. Prior to Intel, Avik worked at eBay and Yahoo!


As good as Hadoop is today, it's still not fully enterprise ready. This is one of the major reasons why we are developing the Intel® Distribution for Apache Hadoop* (IDH) now.

We have an acute awareness of how to make Hadoop run best in a large data center deployment. Hadoop functions optimally when compute, storage, and network resources are properly balanced and tuned, which is an area of Intel expertise.

However, organizations where security is a key requirement are not able to use Hadoop broadly today because it lacks enterprise-level security. To address this, we are working on Project Rhino, an open source security project that will establish a comprehensive security framework for Hadoop. Doing so will help organizations protect sensitive and legally protected consumer data from identity theft and fraud in industries ranging from financial services and healthcare to retail and telecommunications.

Intel is devoted to helping improve the overall open source platform and enabling a whole new ecosystem of partners, capabilities and end users of big data in security-sensitive industries. Project Rhino will address the security issues and compliance protocols for a wide range of use cases for government, defense and corporate environments.


First Goals

The first goals of Project Rhino are to provide framework support for encryption and key management, to integrate that support across the Hadoop ecosystem, and to create a common authentication and authorization framework.

We have also pledged to implement token-based authentication and single sign-on. We will extend HBase support for ACLs to the per-cell level and improve audit logging (audit messages from Hadoop components do not currently use a consistent or unified format).
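To make the per-cell ACL goal concrete, here is a minimal sketch (in Java, using the HBase client API) of what attaching an access control entry to an individual cell could look like, assuming an HBase release that carries the cell-level ACL support described above. The table name, column family, row key, and user name are purely illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.security.access.Permission;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellAclSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("patient_records"))) {

                // Write one sensitive cell and attach an ACL to that cell alone:
                // only the hypothetical user "analyst_bob" may read it, on top of
                // whatever table- or column-family-level grants already exist.
                Put put = new Put(Bytes.toBytes("patient-0001"));
                put.addColumn(Bytes.toBytes("pii"), Bytes.toBytes("ssn"),
                        Bytes.toBytes("xxx-xx-xxxx"));
                put.setACL("analyst_bob", new Permission(Permission.Action.READ));

                table.put(put);
            }
        }
    }

The point is that the access decision travels with the data itself, rather than living only at the table or column-family level.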

In financial services, Project Rhino will address the security and compliance challenges surrounding non-public information and personally identifiable information, while in healthcare it will support the Health Insurance Portability and Accountability Act, also known as HIPAA.  U.S. government security requirements include compliance with the Federal Information Security Management Act and the Department of Defense Information Assurance Certification and Accreditation Process.  For e-commerce, Project Rhino will take into account the Payment Card Industry Data Security Standard.

Project Rhino will contribute code to the Apache Software Foundation so that all players can benefit from our efforts to enhance Hadoop's security.

Key Considerations for Protecting Data

We have identified six key considerations when it comes to protecting data with enterprise-grade security:

  1. API Security: Some Hadoop APIs work without authentication, with security typically provided through whatever application is using the cluster, but this is not always the case. We must protect Hadoop APIs from code and command injection, buffer overflow attacks, and other common web service attacks.
  2. Authentication: While Hadoop can use Kerberos for authenticating users and services to a cluster, an attacker could steal or duplicate a Kerberos ticket or simply insert a rogue client into the network. We need strong authentication of nodes to keep rogue clients and improper requests out of the cluster (a minimal example of a client login against a Kerberized cluster appears below).
  3. Authorization: Each component in Hadoop currently has its own authorization engine. We will abstract the common functions into a reusable authorization framework with a consistent interface. Where necessary we will modify the components so they each offer equivalent functionality and build support into others.
  4. Auditing and Alerts: What do you do if you suspect that someone has breached your cluster? Can you actually detect the attack? Doing so requires a record, which means Hadoop needs auditing and alerts. Although open source and proprietary auditing and alerting tools can integrate with Hadoop, the platform still needs a consistent audit logging and alerting framework across all of the Apache Hadoop platform's projects.
  5. Role-Based Access Control: Many Hadoop deployments have not included built-in tools, facilities, and documentation to handle role-based access control. Using access controls, role separation, and the latest encryption technologies can help you keep unauthorized users and malicious attackers from getting their hands on crucial data.
  6. Encryption: Hadoop needs strong key management tools. Administrators concerned about attackers accessing their archives or reading data directly from disk will want encryption at rest, just as they have with traditional data management systems. However, the most difficult part of encryption isn't protecting the data so much as managing the encryption keys (a small illustration follows this list).
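To illustrate the key management point in item 6, here is a small, self-contained sketch using only the standard Java crypto APIs; it is not how IDH or Project Rhino implements encryption. It shows that encrypting a record is the easy part, while deciding where the key lives, who may fetch it, and how it is rotated is the problem a Hadoop key management framework has to solve.

    import java.nio.charset.StandardCharsets;

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class EncryptBeforeWrite {
        public static void main(String[] args) throws Exception {
            // Generating a key is trivial; in a real cluster the key would have
            // to come from (and be protected by) a key management service, not
            // be created and held inside every client like this.
            KeyGenerator keyGen = KeyGenerator.getInstance("AES");
            keyGen.init(128);
            SecretKey key = keyGen.generateKey();

            // Encrypt one record before it would be written into the cluster.
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] ciphertext = cipher.doFinal(
                    "ssn=123-45-6789".getBytes(StandardCharsets.UTF_8));

            System.out.println("Encrypted record is " + ciphertext.length + " bytes");
        }
    }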

Project Rhino is designed to address these security issues as we develop a common encryption, authentication and authorization framework for the Apache Hadoop family of projects.
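Returning to the authentication point from item 2 above, here is a minimal sketch of a client logging in to a Kerberos-secured cluster through Hadoop's UserGroupInformation API before touching HDFS. The principal, keytab path, and HDFS path are placeholders, and the two security properties are normally set in the cluster's core-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // On a secure cluster these come from core-site.xml; they are set
            // here only to keep the sketch self-contained.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hadoop.security.authorization", "true");

            UserGroupInformation.setConfiguration(conf);
            // Authenticate with a service keytab (placeholder principal and path).
            UserGroupInformation.loginUserFromKeytab(
                    "etl-service@EXAMPLE.COM",
                    "/etc/security/keytabs/etl-service.keytab");

            // Subsequent HDFS calls carry the authenticated Kerberos identity.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/data/landing")));
        }
    }

Without a valid login, the same file system call is rejected on a Kerberos-enabled cluster, which is the baseline that stronger node and client authentication builds on.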

A Commitment to Open Source

At Intel, we want to keep the open Hadoop platform rich, healthy and independent. As a long-term supporter of open source software with a history of deep investment in open source, we believe we are best equipped to do so.

Intel's stability and longevity in open source demonstrate that we are committed to staying for the long haul. Organizations can remain confident that they will be able to rely on Intel to support the Apache Hadoop community with both code and collaboration.

To find out more, please check out the IDH web site and let us know your thoughts about Big Data security by posting comments in the space provided below.

Girish & Avik