Enterprise Data Processing with IDH and Cascading

With Hadoop gaining acceptance as a viable big data repository in enterprise computing, the architecture of enterprise data processing applications is changing rapidly to exploit Hadoop's MapReduce processing framework. To facilitate the development of complex, enterprise-grade data processing and machine learning applications that can be deployed and managed across private or public cloud-based Hadoop clusters, the Intel Distribution of Hadoop (IDH) now supports the open source Cascading application framework.

With this framework running on IDH, one can build data flows (pipes) that transform data from a variety of data sources (taps), including Hadoop, to facilitate deep data analytics. Applications can now be developed in domain-specific languages using Cascading extensions and can be integrated with other external data sources and systems. Cascading is a community-driven open source project hosted on GitHub and comes with its own software development kit (SDK) and tools for developing big data analytics applications.
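The tap-and-pipe model described above can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the Cascading API: `source_tap`, `tokenize`, `count_words`, `sink_tap`, and `run_flow` are hypothetical names, and the in-memory list stands in for data that would normally live in Hadoop.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def source_tap(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical source tap: yields records from some data source."""
    for line in lines:
        yield line


def tokenize(records: Iterable[str]) -> Iterator[str]:
    """Pipe stage: split each record into lowercase words."""
    for record in records:
        for word in record.lower().split():
            yield word


def count_words(words: Iterable[str]) -> Dict[str, int]:
    """Pipe stage: group identical words and count them."""
    return dict(Counter(words))


def sink_tap(counts: Dict[str, int]) -> List[str]:
    """Hypothetical sink tap: format results as tab-separated lines."""
    return [f"{word}\t{n}" for word, n in sorted(counts.items())]


def run_flow(lines: Iterable[str]) -> List[str]:
    """Connect source tap -> pipe stages -> sink tap into one flow."""
    return sink_tap(count_words(tokenize(source_tap(lines))))


if __name__ == "__main__":
    for row in run_flow(["big data on Hadoop", "data flows on Hadoop"]):
        print(row)
```

Each stage only consumes the output of the previous one, which is the property that lets a framework plan such a flow as one or more MapReduce jobs instead of running it in a single process as this sketch does.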

Cascading –

Cascading extensions are user-contributed code built on top of the Cascading framework; they facilitate the development of big data applications on Intel Hadoop for programmers from different domains. Some popular Cascading tools that are now available on IDH as a result of this support are listed below:

Lingual –

A SQL command shell and JDBC driver for executing ANSI SQL queries as Cascading applications that run on Hadoop clusters.

Bixo –

A Cascading-based web crawling and data mining toolkit that serves as a more robust alternative to Apache Nutch.

Load –

A command line tool for creating high load jobs on Intel Hadoop clusters.

Multitool –

A command line tool for processing large files, similar to the sed and grep utilities available on UNIX platforms.
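For readers unfamiliar with sed and grep, the kind of line-oriented filtering and substitution those utilities perform can be sketched in plain Python. This is not Multitool's command syntax or implementation; `grep_sed` and its parameters are hypothetical names used only to illustrate the idea on a small in-memory input.

```python
import re
from typing import Iterable, Iterator


def grep_sed(lines: Iterable[str], keep: str, find: str, replace: str) -> Iterator[str]:
    """Keep lines matching `keep` (grep-like), then rewrite `find` as `replace` (sed-like)."""
    keep_re = re.compile(keep)
    find_re = re.compile(find)
    for line in lines:
        if keep_re.search(line):
            yield find_re.sub(replace, line)


if __name__ == "__main__":
    log = ["INFO job started", "ERROR disk full", "ERROR node lost"]
    for out in grep_sed(log, keep=r"^ERROR", find=r"^ERROR", replace="E:"):
        print(out)
```

On a cluster, the same filter-then-rewrite pattern would typically be applied in parallel across splits of a large file instead of over a single list.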

Some of the domain-specific languages supported through Cascading extensions are described below:

Cascalog –

This is a fully featured data processing and querying library for applications written in Clojure or Java. With Cascalog, one can process big data inside Hadoop; it can be used in place of Pig or Hive and offers a higher level of abstraction than those tools.

Cascading JRuby –

This domain-specific language extension for Cascading exposes the data flow API (implemented in Java) to Ruby programmers, who can now rapidly develop efficient MapReduce jobs for the Intel Hadoop distribution.

PyCascading –

This is essentially a Python wrapper for Cascading that gives Python programmers the full MapReduce data processing capability of the Cascading framework running on Intel Hadoop clusters.

Scalding –

This is a Scala library built on top of the Cascading framework that makes it easy to specify MapReduce jobs for Intel Hadoop in Scala. It is comparable to Pig but offers tighter integration with Scala.