Boosting Big Data Workflows for Big Results

When working with small data, it is relatively easy to manipulate, wrangle, and cope with all of the different steps in the data access, data processing, data mining, and data science workflow. All of the various steps become familiar and reproducible, often manually. These steps (and their sequences) are also relatively simple to adjust and extend. However, as the data collection becomes increasingly massive, distributed, and diverse, while also demanding more real-time response and action, it becomes enormously challenging to extend, modify, reproduce, document, or do anything new within your data workflow. This is a serious problem, because data-driven workflows are the lifeblood of big data professionals everywhere: data scientists, data analysts, and data engineers.


Workflows for Big Data Professionals

Data professionals perform all types of data functions in their workflow processes: archive, discover, access, visualize, mine, manipulate, fuse, integrate, transform, feed models, learn models, validate models, deploy models, etc. It is a dizzying day’s work. We start manually in our workflow development, identifying what needs to happen at each stage of the process, what data are needed, when they are needed, where the data need to be staged, what the inputs and outputs are, and more. If we are really good, we can improve our efficiency in performing these workflows manually, but not substantially. A better path to success is to employ a workflow platform that is scalable (to larger data), extensible (to more tasks), more efficient (shorter time-to-solution), more effective (better solutions), adaptable (to different user skill levels and to different business requirements), comprehensive (providing a wide scope of functionality), and automated (to break the time barrier of manual workflow activities). The "Big Data Integration" graphic below from http://www.apervi.com/ identifies several of the business needs, data functions, and challenge areas associated with these big data workflow activities.

big_data_inforgaphic_Edit.jpg
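
To make the jump from manual steps to an automated workflow a little more concrete, here is a minimal sketch in plain Python. It is not Conflux code and does not use any Apervi API; the stage names (access, transform, mine), the toy data, and the run_workflow helper are all hypothetical, chosen only for illustration. Each stage declares which upstream stages it needs, and a small runner works out the execution order from those declarations.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages. Each one receives a dict of its upstream outputs.
def access(_inputs):
    # Stand-in for reading from a real data source.
    return [3, 1, 2, None, 5]

def transform(inputs):
    # Drop missing values from the accessed data.
    return [x for x in inputs["access"] if x is not None]

def mine(inputs):
    # Compute two simple summary statistics.
    values = inputs["transform"]
    return {"count": len(values), "mean": sum(values) / len(values)}

# Stage name -> (names of upstream stages, function to run).
STAGES = {
    "mine": (("transform",), mine),
    "transform": (("access",), transform),
    "access": ((), access),
}

def run_workflow(stages):
    """Run every stage once, in dependency order, and return all outputs."""
    graph = {name: deps for name, (deps, _func) in stages.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, func = stages[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return results

results = run_workflow(STAGES)
print(results["mine"])
```

Because the order is derived from the declared inputs rather than written out by hand, adding or rearranging a stage means editing one entry in the table, which is the kind of extensibility that a manual workflow lacks.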


All-in-one Data Workflow Platform

A workflow platform that performs a few of those data functions for a specific application is nothing new – you can find solutions that deliver workflows for business intelligence reporting, or analytic processing, or real-time monitoring, or exploratory data analysis, or predictive analytic deployments. However, when you find a unified big data orchestration platform that can do all of those things – that brings all the rivers of data into one confluence (like the Allegheny and Monongahela Rivers, which merge to form the Ohio River in the eastern United States) – then you have a powerful enterprise-level big data orchestration capability for numerous applications, users, requirements, and data functions. The good news is that there is a company that offers such a platform: Apervi is that company, and Conflux is that confluence.

Apervi is a big data integration development company. From Apervi’s comprehensive collection of product documentation, you can learn about all of the features and benefits of their Conflux product. For example, the system has several components: Designer, Monitor, Dashboard, Explorer, Scheduler, and Connector Pack. We highlight and describe each of these components below:

    • The Conflux Designer is an intuitive HTML5 user interface for designing, building, and deploying workflows, using simple drag-and-drop interactivity. Workflows can be shared with other users across the business.
    • The Conflux Monitor keeps track of job progress, with key statistics available in real-time, from any device, any browser, anywhere.  Drilldown capabilities empower exploratory analysis of any job, enabling rapid response and troubleshooting.
    • The Conflux Dashboard provides rich visibility into KPIs and job stats, on a fully customizable screen that includes a variety of user-configurable alert and notification widgets. The extensible dashboard framework can also integrate custom dashboard widgets.
    • The Conflux Explorer puts search, discovery, and navigation powers into the hands of the data scientist, enabling that functionality across multiple data sources simultaneously. A mapping editor allows the user to locate and extract the relevant, valuable, and interesting information nuggets within targeted data streams.
    • The Conflux Scheduler is a flexible, intuitive scheduling and execution tool, which is extensible and can be integrated with third-party products.
    • The Conflux Connector Pack is perhaps the single most important piece of the workflow puzzle: it efficiently integrates and connects data that are streaming from many disparate, heterogeneous sources. Apervi provides several prebuilt connectors for specific industry segments, such as Telecom, Healthcare, and Electronic Data Interchange (EDI).

AperviConfluxDiagram.png


Big Benefits from a Seamless Confluence of Data Workflow Functions

For organizations that are trying to cope with big data and to manage complex big data workflows, a multi-functional, user-oriented workflow platform like Apervi's Conflux can be leveraged to boost results in several ways. These benefits include:

  • Reduce operational costs
  • Drive faster results, from data discovery to information-based decision-making
  • Accelerate development of data-based products across verticals and business functions
  • Manage integration effectively through monitoring and intelligent insights



For more information, Apervi provides detailed white papers, datasheets, product documentation, case studies, and infographics on their website at http://www.apervi.com/.

Dr. Kirk Borne is a Data Scientist and Professor of Astrophysics and Computational Science in the George Mason University School of Physics, Astronomy, and Computational Sciences. He received his B.S. degree in physics from LSU and his Ph.D. in astronomy from the California Institute of Technology. He has been at Mason since 2003, where he teaches graduate and undergraduate courses in Data Science and advises many doctoral dissertation students in Data Science research projects. He focuses on achieving big discoveries from big data and promotes the use of data-centric experiences with big data in the STEM education pipeline at all levels. He promotes the "Borne Ultimatum" -- data literacy for all!

Connect with Kirk on LinkedIn.

Follow Kirk on Twitter at @KirkDBorne.

Read more of his blogs at http://rocketdatascience.org/