Throughout human history, insights and innovation have fueled scientific discovery, economic growth, and social progress. That process is accelerating like never before, thanks to big data analytics—a new field that is truly shaping our world.
Data scientists use big data analytics solutions to efficiently capture, process, analyze, and store vast amounts of data of all types. Together with Intel® architecture platforms, open-standards-based software contributions from Intel, like BigDL, support the most ambitious analytics-driven initiatives.
Intel is a major upstream contributor to the Apache* Spark* project, the leading open source software engine for large-scale data processing. From June 5-7, 2017, more than 3,000 developers, engineers, data scientists, researchers, and business professionals gathered to learn and network at the Spark Summit in San Francisco, the world’s largest event for the Spark community. I was honored to be an invited keynote speaker, presenting a talk titled “Unleashing Data Intelligence with Intel and Apache Spark.”
Unleashing BigDL at Spark Summit
During the keynote, I discussed Intel® software innovations that further accelerate big data analytics and the pace of insight and innovation, such as BigDL, our open source distributed deep learning framework (open sourced on Dec 30, 2016). Our Spark developers initiated the BigDL project in 2015 after seeing an emerging trend in the data center: deep learning (DL) workloads that perform both training and inference. With the Intel® Xeon® processor as the incumbent in the data center and Apache Spark as the prevalent big data platform, we saw that Apache Spark lacked good deep learning capabilities. Our team moved quickly to develop a DL library on top of Apache Spark, with feature parity with popular DL frameworks and high single-node Xeon performance through the Intel® Math Kernel Library. This open source project has received strong community support to date, along with growing cloud and enterprise support: it has been adopted by top cloud service providers (CSPs) such as AWS*, Azure*, AliCloud*, and Databricks*, and by enterprises such as Cloudera* and Cray*, to name a few.
I highlighted some of BigDL’s newly released features. Python language support delivers one of the features most requested by the BigDL user community. Notebook integration, using systems such as Jupyter notebooks distributed across the cluster, combines Python libraries, Spark SQL and DataFrames, MLlib, deep learning models in BigDL, and interactive visualization tools. TensorBoard support helps data scientists visualize and understand the behavior of BigDL programs. I also previewed some of our plans for expanding the BigDL ecosystem. For example, our Free Compute for BigDL program will make infrastructure available to researchers, data scientists, and deep learning explorers who are ready to scale out deep learning algorithms on Apache Spark.
New Optimized Analytics Package for Spark
I introduced the Optimized Analytics Package for Spark (OAP for Spark), which accelerates Online Analytical Processing (OLAP). OAP for Spark enables customers to use Spark for their ad-hoc query workloads while making full use of their memory and CPU power. This is a new open source project that is now available to the community at OAP Code. Lin Xiaodong, Director of Baidu* Infrastructure Department, commented on Baidu’s use of OAP: “OAP for Spark is quite fit for Baidu’s data analytics requirements, and brings 1.5X-5X performance gain for ad-hoc query. We’d like to dive into the OAP open source community with Intel for more significant acceleration in the future releases, to unleash the power of new hardware platforms.”
Artificial Intelligence Will Usher In a Better World
We also demonstrated resources and technologies for data scientists and framework developers, including: how the Intel® Nervana™ AI Academy sharpens data scientists’ and developers’ machine learning skills; a comprehensive scheduling solution for Apache* Spark* on Intel® Xeon® + FPGA, which provides Spark with an API for Intel FPGA resource discovery, configuration, management, and intelligent scheduling; an ad-hoc SQL query engine on top of Spark SQL, which gave attendees a close look at the Spinach project’s user scenarios, architecture, performance, and real-world adoption; and Deep Learning to Big Data Analytics on Apache Spark Using BigDL, which demonstrated speech recognition and object detection applications we built on BigDL.
In addition, these other Intel sessions are good references: BigDL: Bringing Ease of Use of Deep Learning for Apache Spark by Jason Dai & Radhika Rangarajan; Accelerating SparkML Workloads on the Intel® Xeon®+FPGA Platform by Zhankun Tang & Zhongyue Nah; Optimized Analytics Package for Spark by Daoyuan Wang & Yuanjian Li (Baidu); A Predictive Analytics Workflow on DICOM Images using Apache Spark by Anahita Bhiwandiwalla & Karthik Vadla; Deep Learning to Big Data Analytics on Apache Spark Using BigDL by Xianyan Jia & Yuhao Yang; and Distributed End-to-End Drug Similarity Analytics and Visualization Workflow by Anahita Bhiwandiwalla & Dina Suehiro.
Follow Michael Greene on Twitter @greene1of5 for the latest on Intel in BigDL and more.