Last year I read an article in which Hadoop co-developers Mike Cafarella and Doug Cutting explained how they originally set out to build an open-source search engine. They saw it as serving a specific need to process massive amounts of data from the Internet, and they were surprised to find so much pent-up demand for this kind of computing across all businesses. The article suggested it was a happy coincidence.
I see it more as a happy intersection of computing, storage and networking technology with business needs to use a growing supply of data more efficiently. Most of us know that Intel has developed much of the hardware technology that enables what we've come to call Big Data, but Intel is working hard to make the algorithms that run on Intel systems as efficient as possible, too. My colleague Pradeep Dubey presented a session this week at the Intel Developer Forum in San Francisco on how developers can take advantage of optimized data analytics and machine learning algorithms on Intel® Architecture-based data center platforms. In this blog I thought I would back up a bit and explain how this came about and why it's so important.
The explosion of data available on the Internet has driven market needs for new ways to collect, process, and analyze it. In the past, companies mostly processed the data they created in the course of doing business. That data could be massive. For example, in 2012 it was estimated that Walmart collected data from more than one million customer transactions per hour. But it was mostly structured data, which is relatively well behaved. Today the Internet offers up enormous amounts of mostly unstructured data, and the Internet of Things promises yet another surge. What businesses seek now goes beyond business intelligence. They seek business insight, which is intelligence applied.
What makes the new data different isn't just that there's so much of it, but that an estimated 80 percent of it is unstructured: text, images, and audio that defy confinement to the rows and columns of traditional databases. It also defies attempts to tame it with traditional analytics because it needs to be interpreted before it can be used in predictive algorithms. Humans just can't process data efficiently or consistently enough to analyze all this unstructured data, so the burden of extracting meaning from it lands on the computers in the data center.
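To make that interpretation step concrete, here's a minimal sketch, in plain Python, of turning a fragment of free-form text into a structured feature vector that a predictive algorithm could consume. This is purely illustrative; the vocabulary and function names are my own, not part of any Intel library.

```python
import re
from collections import Counter

def text_to_features(text, vocabulary):
    """Turn free-form text into a fixed-length vector of word counts,
    one entry per vocabulary term -- rows and columns, restored."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[term] for term in vocabulary]

# A tiny hand-picked vocabulary, just for illustration.
vocab = ["data", "insight", "hadoop"]
row = text_to_features("Data drives insight, and more data drives more insight.", vocab)
print(row)  # -> [2, 2, 0]
```

Real systems use far richer representations, but the principle is the same: before unstructured data can feed an analytic model, something has to map it onto a structure the model understands, and that mapping is compute-intensive at scale.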
First, let's look at this burden a little more closely. A key element of the approach I described above is machine learning. We ask the machine to actually learn from the data, to develop models that represent this learning, and to use the models to make predictions or decisions. There are many machine learning techniques that enable this, but they all have two things in common: they require a lot of computing horsepower, and they are complex for the programmer to implement in a way that uses data center resources efficiently. So our approach at Intel is two-fold:
Optimize the Intel® Xeon processor and the Intel® Xeon Phi™ coprocessor hardware to handle the key parts of machine learning algorithms very efficiently.
Make these optimizations readily available to developers through libraries and applications that take advantage of the capabilities of the hardware using standard programming languages and familiar programming models.
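The learn/model/predict cycle I described above can be sketched in a few lines. This is a toy nearest-centroid classifier in plain Python, my own illustration rather than anything from Intel's libraries: the "learning" is computing one centroid per class, the centroids are the model, and prediction is finding the closest centroid.

```python
def train(points, labels):
    """Learn a model from labeled 2-D points: the mean (centroid) of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def predict(model, point):
    """Use the model: assign the label of the nearest centroid."""
    x, y = point
    return min(model, key=lambda c: (model[c][0] - x) ** 2 + (model[c][1] - y) ** 2)

model = train([(0, 0), (1, 1), (8, 8), (9, 9)], ["low", "low", "high", "high"])
print(predict(model, (7, 8)))  # -> high
```

Even this toy makes the two common traits visible: training touches every data point (horsepower), and doing it efficiently across a cluster's worth of data is where the programming complexity comes in.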
Intel Xeon Phi enhances parallelism and provides a specialized instruction set to implement key data analytics functions in the hardware. To access those capabilities, we provide an array of supporting software like the Intel Data Analytics Accelerator Library, a set of optimized building blocks that can be used in all stages of the data analytics workflow; the Intel Math Kernel Library, math processing routines that increase application performance on Xeon processors and reduce development time; and the Intel Analytics Toolkit for Apache Hadoop that lets data scientists focus on analytics instead of mastering the details of programming for Hadoop and myriad open source tools.
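As a sketch of what "optimizations readily available through familiar programming models" looks like in practice: many NumPy distributions (Anaconda's, for example) are built against the Intel Math Kernel Library, so an ordinary matrix multiply in Python is dispatched to MKL's tuned BLAS routines under the hood. The developer writes standard code; the library supplies the hardware-specific tuning. Whether MKL is actually the backend depends on how your NumPy was built, so treat this as illustrative.

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)  # 2x3 matrix
b = np.arange(6, dtype=np.float64).reshape(3, 2)  # 3x2 matrix

# Ordinary NumPy syntax; on an MKL-linked build this call lands in
# MKL's optimized dgemm rather than a generic implementation.
c = a @ b

print(c.tolist())  # -> [[10.0, 13.0], [28.0, 40.0]]
```

You can check which BLAS your installation uses with `np.show_config()`.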
Furthermore, like the developers of Hadoop itself, we believe it's important to foster a community around data analytics tools that engages experts from all quarters to make them better and easier to use. We think distributing these tools freely makes them more accessible and speeds progress across the whole field, so we rely on the open source model to empower the data analytics ecosystem we are creating around Intel Xeon systems. That's not new for Intel; we're already a top contributor to open source programs like Linux and Spark, and to Hadoop through our partnership with Cloudera. And that is definitely not a coincidence. Intel recognizes that open source brings talent and investment together to create solutions that people can build on rather than a bunch of competing solutions that diffuse the efforts of developers. Cafarella says it's what made Hadoop so successful—and it's the best way we've found to make Intel customers successful, too.