Interesting Times In Database Land: Oracle and Big Data

Perhaps you've heard the ancient Chinese curse: "May you live in interesting times".

After personally attending (and presenting at) both Oracle's OpenWorld and IBM's Information on Demand conferences in the past two months, and closely following Microsoft's announcements at the recent PASS Summit, I feel that the future of database technology is extremely interesting.

The Chinese intended their saying as a curse. In the database world, though, it's more of a blessing: as a practitioner, you're presented with a far richer set of capabilities than you've ever had before. That plethora of choices can also become overwhelming, however.

I can't begin to cover everything from these conferences and announcements in a single post.  What I hope to convey here and in the coming weeks is my take on where the broad database landscape stands today vs. even just one year ago, and where I think it's headed in the future.

OpenWorld was the first event in the sequence, so I'll start there (and stop there for this post). A year ago, OpenWorld was all about Exadata and Exalogic.  Meanwhile, Oracle's many aspiring competitors promoted their ability to use NoSQL and in-memory techniques to do things that Big Red couldn't do.

Apparently Big Red took notice, because this year's event saw the announcement of two appliance offerings that are clearly a response to the NoSQL and in-memory opportunities.  Consider Oracle's approach to Big Data and NoSQL.  Perhaps unsurprisingly, it's a bit different from most.

The conventional approach to Big Data is to use Hadoop. The standard recipe seems to be: take a big pile of the cheapest servers you can find, load them up with big, cheap disks, interconnect them with cheap but slow gigabit Ethernet, install the Apache Hadoop stack on each server, and then throw reams of mostly useless data at the resulting Hadoop cluster in the hope of finding the useful needles in the data-exhaust haystack.

What to actually DO with those useful needles once you've found them with Hadoop (using algorithms you've hand-coded yourself to run in the Hadoop run-time environment) is generally left as an exercise for the reader.
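To make that concrete, here's a minimal sketch of the sort of hand-coded job I mean: a Java MapReduce program that scans raw log lines for "ERROR" records and counts them per error code. The class names, the input layout, and the idea of an error code sitting in the second field are all hypothetical, purely for illustration; a real needle-finding job would be considerably more involved.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative "needle in the haystack" job: count ERROR lines per error code.
    public class NeedleCount {

      public static class ErrorMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text errorCode = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          // Keep only the "needles": lines flagged as errors (hypothetical format).
          if (line.contains("ERROR")) {
            String[] fields = line.split("\\s+");
            errorCode.set(fields.length > 1 ? fields[1] : "UNKNOWN");
            context.write(errorCode, ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "needle count");
        job.setJarByClass(NeedleCount.class);
        job.setMapperClass(ErrorMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

You'd package this into a jar and submit it to the cluster with something like hadoop jar needlecount.jar NeedleCount /raw/logs /needles; everything beyond that (scheduling, shuffling, retries) is the Hadoop run-time's job.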

Oracle takes a different approach. Instead of old, underpowered, slow servers, they use modern dual-socket servers based on Intel's E5-series Xeon processors. Instead of slow gigabit Ethernet, they use 40Gbit InfiniBand interconnects, which also connect to Exadata.

Instead of merely integrating the standard open-source Hadoop stack on their Big Data Appliance, Oracle augments that stack with two key Oracle-proprietary software elements that might just prove useful to Big Data practitioners: 1) the Oracle Database Loader for Hadoop, and 2) a variant of the open-source 'R' statistical analysis package adapted to run on the appliance. (There are other elements as well, but these two are the ones I view as the biggies.)

If you've ever loaded bulk data into an Oracle database (or any other database, for that matter), you know that it can take a while.  Much of that time goes to the database engine extensively massaging the incoming data stream to prepare it for proper storage in the real database. But what if you knew ahead of time how the data needed to be formatted in the real database, so you could do that processing 'out of band', in a massively parallel manner, and present only pre-processed data to the database for final storage?

Sounds more efficient, doesn't it? That's what the Oracle Database Loader for Hadoop does.
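Oracle hasn't published the internals of its loader, so what follows is only a rough sketch of the underlying 'out of band' idea, not the product itself: a map-only Hadoop job that parses raw records and emits them already formatted to match a target table's column order, leaving the database's bulk loader with little more than final storage to do. The input layout, field names, and pipe-delimited output are all hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: reformat raw records into the delimited column layout the
    // target table expects, so the database does minimal work at load time.
    public class PreFormatForLoad {

      public static class FormatMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text formatted = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Hypothetical raw input: "2011-10-04 12:00:01 user=42 action=click"
          String[] parts = value.toString().split("\\s+");
          if (parts.length < 4) {
            return; // discard malformed records out of band, not inside the database
          }
          String eventDate = parts[0];
          String eventTime = parts[1];
          String userId    = parts[2].replace("user=", "");
          String action    = parts[3].replace("action=", "");
          // Emit a pipe-delimited row matching the target table's column order.
          formatted.set(eventDate + "|" + eventTime + "|" + userId + "|" + action);
          context.write(NullWritable.get(), formatted);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pre-format for bulk load");
        job.setJarByClass(PreFormatForLoad.class);
        job.setMapperClass(FormatMapper.class);
        job.setNumReduceTasks(0); // map-only: formatting parallelizes trivially
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The resulting files could then be fed to a conventional bulk-load path (SQL*Loader or external tables, say) for final insertion; the point is simply that the expensive parse-and-format work runs in parallel across the Hadoop cluster rather than inside the database engine.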

Of course, Oracle's stuff isn't free.  Neither are the similar offerings from EMC/Greenplum (who kind of started this Big Data frenzy), IBM, or Microsoft.  In all cases, though, you get something for your money.  Implementing a working Hadoop cluster using software downloaded from the Apache project is a time-consuming exercise, to say the least.  If you've got the time to fiddle with that stuff, by all means go for it.  But if your mission is to start making productive use of your company's big data as quickly as possible, then one of the commercial options is probably a better choice.

In that case, pay the vendors to do the intricate technical work for you, and be aware that all of them share a common underlying platform: the Intel Xeon® processor.

Oracle's Big Data Appliance won't ship until next year, so it's appropriate to take a wait-and-see approach, but the specifications look interesting. In my next post, I'll talk about in-memory. For now, I welcome your comments and questions: the future of Big Data looks fascinating, and I'd love to hear your thoughts on it.