Open source scale-out data processing: Bringing Elephants into the Enterprise

Data processing and analytics are important focus areas for various groups within IT.

One of the applications we use internally to track R&D computing statistics ingests up to 100M records/day. This data is used for compute utilization analysis and optimization, capacity planning, and other purposes.

Today, this application is implemented on a traditional relational database with a SAN backend. Query performance is problematic at this scale: multiple aggregations and extensive tuning are required to achieve acceptable performance even for the pre-defined queries.

Ideally, we'd like to store more raw data and enable more ad-hoc analytics capabilities, but with the existing environment this becomes too expensive.

We recently studied the applicability of an open source MapReduce framework to our internal needs.

This is a popular framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
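To give a feel for the programming model, here is a minimal sketch of the kind of job one might run over records like ours. It assumes a Hadoop-style Java MapReduce API and a hypothetical input format of one record per line with the host name as the first comma-separated field; neither detail is taken from our actual application. The map step emits a (host, 1) pair for each record, and the reduce step sums the counts per host.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Counts records per host, assuming one record per line with the
    // host name as the first comma-separated field (hypothetical format).
    public class RecordsPerHost {

        public static class HostMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text host = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (host, 1) for every input record.
                String[] fields = value.toString().split(",", 2);
                if (fields.length > 0 && !fields[0].isEmpty()) {
                    host.set(fields[0]);
                    context.write(host, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                // Sum all the 1s emitted for this host.
                long count = 0;
                for (LongWritable v : values) {
                    count += v.get();
                }
                context.write(key, new LongWritable(count));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "records-per-host");
            job.setJarByClass(RecordsPerHost.class);
            job.setMapperClass(HostMapper.class);
            job.setCombinerClass(SumReducer.class); // local pre-aggregation
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Reusing the reducer as a combiner lets each node pre-aggregate its local map output before the shuffle, which cuts network traffic considerably on large inputs.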

During our exploration, we achieved almost linear performance scalability with the framework. We also evaluated various interface layers, both open source and commercial, which may provide a better user experience for analysts who are not experienced Java programmers and would find the framework's native interface difficult to use.

We also decided to harden the cluster by restricting login access and by blocking access to the framework's ports from outside the cluster.

Currently, we see this framework as a possible complement to the relational database for specific use cases.

Are you using any kind of MapReduce or NoSQL solution in your environment for similar purposes? How well does it serve your needs?

Till the next post,

   Gregory Touretsky