Securing Your Big Data: Hadoop & Security

It is no secret that the open source Apache Hadoop project has been notorious since its early days for lacking a formal security framework. Because Apache Hadoop evolved out of efforts to solve big data problems at social media companies, enterprise-quality security was never high on the list of features to implement until Hadoop started to become the big data platform of choice inside corporate data centers. Intel’s Distribution of Hadoop (IDH) tackles this security gap in open source Hadoop from the ground up: it makes the security infrastructure a core Hadoop component of the distribution and provides the much-desired data confidentiality and data access security. IDH can integrate with existing LDAP and Kerberos installations out of the box and supports seamless, fine-grained encryption and decryption for further data security across its ecosystem.

Even with a secured Hadoop such as IDH, deploying and exposing APIs to share data inside and outside of the data center still requires a strong, enterprise-tested security model for both on-premise and cloud deployments of Hadoop. A secure, dependable API model is needed not only to move data in and out of the Hadoop deployment but also to unlock value from the vast repository of data stored in Hadoop.

Intel Expressway API Manager for IDH


Figure: Intel Distribution of Hadoop


Intel Expressway API Manager (EAM) is a software appliance designed to securely share and consume application service APIs on-premise or in the cloud. EAM can provide a cloud service brokerage capability with features such as routing, data protection, and secure middleware for a wide range of applications, from legacy systems to mobile service enablement.

EAM with Intel’s Distribution of Hadoop forms a two-tier architecture, as shown in the figure above. EAM integrates with existing Hadoop deployments in a touchless manner and does not require any glue code to be written for a deployment. It provides a secure REST API facade, with authentication and authorization based on OAuth and identity stores such as LDAP, to share the value and intelligence in enterprise big data with customers and partners alike. In this model, the Intel Hadoop cluster trusts only calls from the EAM gateway, and all potential callers access data and services only through the EAM gateway.
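The authentication and authorization flow described above can be illustrated with a minimal Python sketch. This is not EAM's implementation or API: the in-memory dictionaries are hypothetical stand-ins for an OAuth token service and an LDAP directory, and the route, token values, user names, and group names are assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-ins. A real gateway would validate tokens against an
# OAuth authorization server and look up group membership in LDAP.
TOKENS = {"token-abc": "alice", "token-xyz": "bob"}        # bearer token -> user
LDAP_GROUPS = {"alice": {"analysts"}, "bob": {"guests"}}   # user -> groups
ROUTE_GROUPS = {"/traffic/records": {"analysts"}}          # API path -> allowed groups


@dataclass
class Decision:
    allowed: bool
    status: int   # HTTP-style status the gateway would return
    reason: str


def authorize(auth_header, path):
    """Decide whether a call may be forwarded to the Hadoop cluster."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return Decision(False, 401, "missing bearer token")
    user = TOKENS.get(auth_header[len("Bearer "):])
    if user is None:
        return Decision(False, 401, "unknown token")
    required = ROUTE_GROUPS.get(path)
    if required is None:
        return Decision(False, 404, "no such API")
    if not (LDAP_GROUPS.get(user, set()) & required):
        return Decision(False, 403, "user not in an authorized group")
    # Only at this point would the gateway forward the call to the cluster.
    return Decision(True, 200, f"forward as {user}")
```

Because every caller passes through a check like this at the gateway, the cluster itself only needs to trust one upstream party, which is the point of the two-tier design.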

Intel Service Gateway for Big Data

Decoupling the data access component (the security and API services gateway) from the data storage component (the IDH deployment) provides the flexibility to modify and customize each component independently, without making unnecessary changes in one just to accommodate enhancements or bug fixes in the other.

As the data access component evolves, it opens the door to newer services, such as mobile device access to big data, without any changes to the data storage deployment architecture. This removes the need to write code and develop technology for vertical silos each time a newer technology must be accommodated.


The figure above summarizes an example implementation of big data analytics that uses IDH as the platform to store image surveillance metadata captured from traffic cameras, part of a transportation vertical solution for Hadoop. In this example, traffic cameras extract vehicle information such as license plate number, vehicle speed, lane number, and other relevant vehicle data, and send it together with the captured images to a remote store that runs an HBase deployment of IDH. The camera data is sent using web services via the Intel API service gateway, using either SOAP APIs with attachments containing the image payload or REST APIs with MIME attachments. The API manager uses the corporate LDAP server to restrict user access to the exposed IDH REST API calls. Upstream analytics tools, the consumers of the data repository in HBase, can analyze the collected image data in real time through REST- or SOAP-based APIs. As the figure depicts, consumers of the IDH data resource are not limited to legacy analytics tools and environments but can be any application running on a mobile appliance or device.
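To make the "REST APIs with MIME attachments" option more concrete, here is a minimal sketch of how a camera-side client might package one record using only Python's standard library. The function name, field names, file names, and sample values are hypothetical; the actual message format expected by the gateway is not specified in this article.

```python
import json
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def build_camera_payload(metadata, image_bytes):
    """Bundle vehicle metadata and a captured image into one MIME message."""
    msg = MIMEMultipart("related")

    # Part 1: the extracted vehicle data, serialized as JSON (assumed format).
    meta_part = MIMEApplication(json.dumps(metadata).encode("utf-8"), _subtype="json")
    meta_part.add_header("Content-Disposition", "attachment", filename="metadata.json")
    msg.attach(meta_part)

    # Part 2: the captured image, assumed here to be a JPEG.
    image_part = MIMEImage(image_bytes, _subtype="jpeg")
    image_part.add_header("Content-Disposition", "attachment", filename="frame.jpg")
    msg.attach(image_part)
    return msg


# Hypothetical sample record; the values are invented for illustration.
sample_metadata = {"license_plate": "ABC1234", "speed_kmh": 72, "lane": 2}
sample_image = b"\xff\xd8\xff\xe0 placeholder image bytes"
payload = build_camera_payload(sample_metadata, sample_image)
```

Calling `payload.as_bytes()` would produce the serialized multipart body that a client could then send in an HTTP POST to whatever endpoint the gateway exposes.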