Big Memory, Big Data and the Semantic Web Part 2

OK, I said I’d update the story in a week, and it turns out to have been two months.  Fortunately, that delay has given me more stuff to talk about!

Previously, I introduced a database technology called a triplestore, explained what triples are and how they're stored, and described Franz's goal of loading and querying a trillion triples on a large-scale Intel server. Triplestores are perfect for making sense out of extremely complex data. However, a triplestore is only useful if massive quantities of information can be loaded, updated and effectively queried in a reasonable amount of time.
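
To make that concrete, here's a minimal sketch of what triples and a query over them look like. It uses the open-source rdflib Python library rather than Franz's AllegroGraph, and all of the names and data are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")

g = Graph()
# Every statement is a (subject, predicate, object) triple.
g.add((EX.alice, RDF.type, EX.Subscriber))
g.add((EX.alice, EX.subscribesTo, EX.goldPlan))
g.add((EX.goldPlan, EX.monthlyFee, Literal(49.95)))

# SPARQL walks relationships without a fixed, up-front schema.
rows = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?who ?fee WHERE {
        ?who ex:subscribesTo ?plan .
        ?plan ex:monthlyFee ?fee .
    }""")
for who, fee in rows:
    print(who, fee)
```

The point is that relationships are first-class data: you can ask about connections nobody planned for when the data was loaded.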

That is why Franz's announcement is so interesting. Less than a month before the June 7 announcement, Intel gave Franz access to one of our lab systems: a high-performance server from IBM.

The system was an IBM x3850 X5 8-socket server configured with:

  • 8 Intel® Xeon® E7-8870 processors, each with 10 cores and 30MB of L3 cache, running at 2.4GHz
  • 2 terabytes of 1066MHz DDR3 DRAM
  • 22TB of fast Fibre Channel SAN storage

This particular system can be configured with much more memory and storage, but we had to go with what we had in the lab at the time.

Before Franz had an opportunity to work with this system, the largest triplestore they'd been able to assemble contained roughly 50 billion triples.

Running on the 8-socket Xeon E7 system, Franz was able to load and efficiently query more than 320 billion triples, and the factor limiting scale wasn't memory or processors; it was the amount of disk space available. With some additional spindles and memory, Franz is confident they can achieve the previously unthinkable result of a trillion triples.
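
A back-of-envelope calculation shows why disk became the ceiling. The bytes-per-triple figure below is my own assumption, not a published AllegroGraph number (actual footprints depend heavily on indexing), but it puts the numbers in the right neighborhood:

```python
# Why disk was the ceiling: a rough on-disk footprint estimate.
triples = 320e9
bytes_per_triple = 60          # assumption: encoded triple plus index overhead
footprint_tb = triples * bytes_per_triple / 1e12
print(f"~{footprint_tb:.0f} TB on disk vs. the 22 TB SAN")   # ~19 TB
```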

It's difficult for the human mind to grasp a trillion of anything: dollars, stars, or triples. The important thing to understand here is that the amount of processing that goes into loading and querying a trillion triples is enormous. Unless you have a hardware platform that can deliver a corresponding amount of concentrated processing power at an affordable cost, it's all kind of pointless.
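
To put some rough numbers on "enormous," here's what sustained load rates would mean for a trillion triples; the rates are illustrative assumptions, not measured results:

```python
# Loading a trillion triples at assumed sustained rates.
total = 1e12
for rate in (500_000, 1_000_000, 2_000_000):        # triples per second
    days = total / rate / 86_400                    # seconds in a day
    print(f"{rate:>9,} triples/s -> {days:5.1f} days")
# Even at two million triples per second, the load runs for nearly a week.
```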

What Franz demonstrated was that such a hardware platform exists, it performs even better than expected, and it delivers a level of capacity that allows customers to think about putting the full potential of the Semantic Web to use in important and creative ways.

The other thing that's so interesting about this example is that triplestores are perfect for making 'fuzzy' (i.e., probabilistic) decisions. Combine a triplestore with a Bayesian Belief Network (BBN) reasoning/machine-learning application, and you've got a very powerful combination. Instead of just retrieving data that satisfies a predefined query, a BBN combined with a triplestore can 'discover' relationships in the data based on patterns of recurrence and feedback loops.
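
Here's a deliberately tiny sketch of the pattern: evidence pulled out of the data updates a belief via Bayes' rule. The events and probabilities are invented, and a real BBN would have many interdependent variables with learned parameters rather than a single update:

```python
# Toy belief update: one piece of evidence shifts the probable call reason.
prior = {"billing_question": 0.3, "service_outage": 0.2, "other": 0.5}

# P(evidence | reason) for one observed event, e.g. a new charge
# appearing on the subscriber's account this week (invented numbers).
likelihood = {"billing_question": 0.8, "service_outage": 0.1, "other": 0.2}

def update(prior, likelihood):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnorm = {r: prior[r] * likelihood[r] for r in prior}
    z = sum(unnorm.values())
    return {r: p / z for r, p in unnorm.items()}

print(update(prior, likelihood))
# billing_question rises from 0.30 to about 0.67 after the evidence
```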

One of Franz's key customers, Amdocs, used the SemTech conference to present their vision of how they plan to use this technique to anticipate what a telephony services subscriber might be calling about before the customer service rep picks up the phone. If that actually works, at scale and affordably (which is what the Xeon E7 processor is all about), then I think it'll be a pretty amazing breakthrough.

On August 16, Franz announced their Semantic Web breakthrough: they had achieved the trillion-triple mark. This time, they did it on an Intel-based HPC cluster-in-the-cloud provided by Stillwater SuperComputing. This achievement demonstrates that Franz's graph-oriented database engine is capable of scaling out as well as scaling up.

It's always good to have choices in deployment architecture, and Franz's approach to triplestores provides an ideal test bed for comparing scale-out and scale-up designs on this workload. My bet is that a small cluster of large machines will prove better suited to the realities of processing the Semantic Web than a large cluster of smaller machines.

The reason I say that is that with triplestores, it's virtually impossible to predict in advance the path any particular query will take through the data. So if the data is sharded across a large number of nodes, a select/join operation is very likely to bottleneck on the network connections between nodes. If all of the data is local to a single large machine, then joins run at the full speed of memory, which is typically orders of magnitude faster than any network.
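
Some rough arithmetic makes the point. The bandwidth figures below are illustrative assumptions for circa-2011 hardware, not benchmarks of any particular system:

```python
# In-memory joins vs. cross-shard shuffles, in round numbers.
join_bytes = 1e12          # suppose a hard query touches 1 TB of triples
mem_bw = 200e9             # assumed aggregate DRAM bandwidth, bytes/s
net_bw = 1.25e9            # one 10 GbE link, bytes/s

print(f"in-memory join:      {join_bytes / mem_bw:6.1f} s")       # ~5 s
print(f"cross-shard shuffle: {join_bytes / net_bw / 60:6.1f} min")  # ~13 min
```

Even granting the cluster multiple links and perfect overlap, the gap is wide enough that unpredictable join paths strongly favor keeping the data in one big memory.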

It will be interesting to monitor progress of the old scale-out vs. scale-up debate in this new arena of triplestores.  Watch this space for updates.  I promise to let you know when something interesting happens.

What do you think? Are triplestores interesting to your business?