Tackling Genome Sequencing with Hadoop…

Genome sequencing is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. Steve Jobs, for example, had his genome sequenced for $100,000. Commercialization of full genome sequencing is still in an early stage and growing rapidly. At the ISC Big Data Conference 2013 I talked to Girish Juneja, CTO, Datacenter Software Division & General Manager at Intel.

Nanometer-small DNA produces terabyte-scale data sets

It’s interesting: to give you an example of how far the journey with big data may take us, let us start by looking at DNA. Each human genome has about 3.2 billion base pairs. Our early problem with genome sequencing was its outrageous cost. As of today, the cost of sequencing a genome has fallen by a factor of roughly a million, to somewhere north of $1,000. This drop in sequencing cost is a trend that will continue, resulting in a significant uptick of data originating from genome sequencing. In addition, huge amounts of data are generated by the pharmaceutical industry every day. All of this information can now be leveraged to research new drugs based on findings from genome markers attributed to specific diseases. When you additionally take patients' electronic health records into consideration, the amount of data that needs to be processed is immense, but the potential merits for humankind are beyond imagination.
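To put the data volumes into perspective, here is a rough back-of-envelope estimate of the raw read data behind a single sequenced genome. The coverage depth and bytes-per-base figures are assumptions for illustration only, not numbers from the interview.

```python
# Rough, illustrative estimate of the raw data behind one sequenced human genome.
# All figures below are assumptions for illustration, not from the interview.

base_pairs = 3_200_000_000   # ~3.2 billion base pairs in the human genome
coverage = 30                # assumed sequencing depth (reads per position)
bytes_per_base = 2           # assumed: ~1 byte base call + ~1 byte quality score

raw_bytes = base_pairs * coverage * bytes_per_base
print(f"Raw read data per genome: ~{raw_bytes / 1e9:.0f} GB")
# -> roughly 190 GB before compression; a few thousand genomes already
#    approach petabyte scale, which is why distributed storage matters.
```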

The first question is: how do you collect such a huge amount of data?

The Hadoop environment, which SAP calls "infinite storage," is responsible for collecting the data. If one of the nodes is full, the cluster automatically rebalances across the other nodes and takes on more servers to process the load when necessary. That is the only way we are capable of storing these amounts of data. Once the data is stored, you have to analyze it.
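Before moving on to the analysis, here is a minimal sketch of what collecting sequencing output into Hadoop's distributed storage can look like in practice. The file names, HDFS paths, and replication factor below are illustrative assumptions, not details from the interview.

```python
import subprocess

# Illustrative only: push a batch of sequencing output files into HDFS.
# HDFS splits each file into blocks and spreads replicas across the cluster,
# so adding nodes transparently adds capacity.
local_files = ["sample_001.fastq.gz", "sample_002.fastq.gz"]  # assumed file names

for path in local_files:
    subprocess.run(
        ["hdfs", "dfs", "-put", path, "/genomes/raw/"],  # assumed target directory
        check=True,
    )

# Replication (here the common default of 3 copies) is what lets the cluster
# tolerate full or failed nodes without losing data.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/genomes/raw/"], check=True)
```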

You have to interact with other data sources. That is where the electronic health records come into play. You can combine the electronic health records with the genome sequencing data. Now you get a lot of subgroups with different markers. This is where the analysis piece comes in. The infinite storage of Hadoop is very good at analyzing this data; sometimes it can be done in a couple of hours. Once you have cleared the base analysis, such as gender, race, and so on, and created some reference data, you can start analyzing the "sweet or treat" data. It means that when A happens, B is the result, or when C is related to D, then E must be connected somehow. The in-memory database comes in really handy for that analysis. You pull the references from Hadoop into SAP HANA, then you do the real-time interactive analytics. The combination of the two systems becomes very powerful: it can deal with varied structures and data sizes, combined with the speed and performance of a response time under a second.
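As a minimal sketch of the kind of combined analysis described here, assume the genome-marker reference data has already been pulled from Hadoop into SAP HANA tables. The connection details, table names, and column names (GENOME_MARKERS, HEALTH_RECORDS, and so on) are hypothetical and invented for this example.

```python
from hdbcli import dbapi  # SAP HANA Python client

# Hypothetical connection details and schema, for illustration only.
conn = dbapi.connect(address="hana-host", port=39015, user="ANALYST", password="...")
cursor = conn.cursor()

# Combine electronic health records with genome-marker reference data and
# count patients per (marker, diagnosis) subgroup - the kind of interactive
# query the in-memory database is meant to answer in under a second.
cursor.execute("""
    SELECT m.MARKER_ID, r.DIAGNOSIS, COUNT(*) AS PATIENTS
    FROM GENOME_MARKERS m
    JOIN HEALTH_RECORDS r ON r.PATIENT_ID = m.PATIENT_ID
    GROUP BY m.MARKER_ID, r.DIAGNOSIS
    ORDER BY PATIENTS DESC
""")
for marker, diagnosis, patients in cursor.fetchall():
    print(marker, diagnosis, patients)

cursor.close()
conn.close()
```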

We are all working with big data

When you look at data sets from Yahoo, Facebook, or Twitter, you will see that they are working with big data too. Their total data sets may be larger, but they are mostly dealing with web clicks or login information. That data interaction is rather one- or two-dimensional.

What is the difference between scientific data and some of the web-click data?

Scientific data tends to have a lot of metadata, and it is usually layered. That means it is often not just tracking a web click or a "like" — that would be only one-dimensional. Scientific data instead looks at things from different angles and many perspectives. Intel is taking the web-click data from the social networks, combining it with the multi-layered, multi-dimensional data sets, and making sure the ecosystem works for both types of data. We at Intel work closely with components like SAP HANA that actually make this multi-dimensional analysis available in real time.
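To make the contrast concrete, here is a purely illustrative comparison of a flat click-stream record and a layered scientific record. All field names are invented for this sketch and are not from the interview.

```python
# One-dimensional web-click event: a single flat fact.
click_event = {
    "user_id": 42,
    "url": "/product/123",
    "timestamp": "2013-09-25T10:14:00Z",
}

# Layered scientific record: metadata, nested measurements, and links to
# other data sources (here an electronic health record) - invented names.
genome_record = {
    "sample_id": "S-0001",
    "metadata": {"sequencer": "assumed-model", "coverage": 30, "lab": "site-A"},
    "markers": [
        {"marker_id": "rs12345", "chromosome": 7, "genotype": "A/G"},
        {"marker_id": "rs67890", "chromosome": 2, "genotype": "C/C"},
    ],
    "linked_records": {"ehr_patient_id": "P-9876"},
}
```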

Education should adjust to industry experience

Intel has been working with SAP for a long time, but we invest in other environments too, to push the big data agenda forward. We have two major research facilities: one at the Massachusetts Institute of Technology (MIT) and the other at Carnegie Mellon University (CMU). CMU is focused on figuring out new technologies to analyze big data better using predictive analysis, and MIT is focused on visualization. We are investing in those places quite heavily, bringing the innovations researched at the institutes over and working closely with the SAP side. One of the challenges is the skillset needed to deal with big data: true experts and data scientists on this topic are thin on the ground.

What would be your recommendation for universities and study programs? Which subjects should they focus on?

I think big data is the new reality, and we are all still trying to figure it out. It would be better to put more effort into the mathematical and statistical skills that eventually produce data scientists. You get an interesting combination of programmatic knowledge of infrastructure and mathematical, statistical, and machine learning skill sets. That is where the work has to be done in preparing students for the future of data analysis.

Jaroslaw Animucki, SAP HANA P&M