Part II: Data-Driven Science and the Coming Era of Petascale Genomics

Read Part 1 of this two-part blog series

What do the next 17 years hold in store for us? I believe we are at the dawn of the era of petascale genomics. In 2015, we can manipulate gigascale genomic datasets without much difficulty. Deriving insight from data at this scale is pervasive today, and the IT infrastructure needed isn’t much more advanced than a small Linux cluster or an easily affordable amount of time on a large cloud provider’s infrastructure. Manipulation of terascale datasets, however, is still not easy. It is possible, to be sure, and researchers are busy attempting to derive insight from genomic data at these scales. But it is definitely not easy, and again the reason is the IT infrastructure.

Terascale datasets do not fit neatly into easily affordable computational architectures in 2015: one needs either advanced techniques to split up the data for analysis (e.g., Hadoop-style workflows) or advanced systems well beyond the average Linux HPC cluster (e.g., the SGI UV server). Indeed, the skilled IT observer would say that these techniques and systems were invented for data analysis at terascales. But true petascale genomics research? No, we’re not there yet. We can certainly create data at petascales, and storage infrastructure for petabytes of data is fairly common (a petabyte stored on hard drives can easily fit into half a rack in 2015), but this is not petascale analysis. To be adept at analyzing and deriving scientific insight from petascale genomic datasets requires IT architectures that have not yet been produced (although theoretical designs abound, including future generations of systems from SGI!).

We are headed in this direction. NGS technologies are only getting more affordable. If there’s anything the past 17 years have taught us, it is that once data can be generated at some massive scale, it will be.

Perhaps “consumer” genomics will be the driver. The costs of DNA sequencing will be low enough that individuals with no scientific or HPC background will want to sequence their own DNA for healthcare reasons. Perhaps the desire for control over one’s genomic data will become pervasive (giving a whole new meaning to “personalized medicine”), as opposed to having that information controlled by healthcare providers or (gasp!) insurance companies. Once millions of individuals are capturing their own genomes on digital media, we will have petascale genomics analysis.
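To put a rough number on "millions of genomes," here is a back-of-envelope sketch. It assumes a human genome of roughly 3.1 billion bases stored at one byte per base with no compression. Both figures are illustrative assumptions; raw sequencer output with quality scores and redundant coverage would be considerably larger, and compressed or variant-only formats considerably smaller. The function name `total_petabytes` is hypothetical.

```python
# Back-of-envelope estimate; every constant here is an illustrative assumption.
BASES_PER_GENOME = 3.1e9   # approximate length of one human genome, in bases
BYTES_PER_BASE = 1.0       # assumption: one byte per base, uncompressed
PETABYTE = 1e15            # decimal petabyte, in bytes


def total_petabytes(num_genomes, bases=BASES_PER_GENOME,
                    bytes_per_base=BYTES_PER_BASE):
    """Return the estimated storage, in petabytes, for num_genomes genomes."""
    return num_genomes * bases * bytes_per_base / PETABYTE


if __name__ == "__main__":
    # Print the estimate for a few population sizes.
    for n in (1_000, 1_000_000, 10_000_000):
        print(f"{n:>12,} genomes: {total_petabytes(n):8.3f} PB")
```

Under these assumptions a million genomes comes to roughly 3 PB of bare sequence alone, which is why a population of individually sequenced consumers lands squarely in petascale territory.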

Imagine the insights we can gain from manipulation of data at these scales. Genomic analysis of not one human genome, but millions of genomes, and perhaps also tracking genomic information through time. Why not? If the cost of DNA sequencing is not a barrier, why not sequence individuals or even whole populations through time? That’ll give new meaning to “genome-wide association studies”, that’s for sure. Whatever the reason and whatever the timeline, the destination is not in doubt: we will one day manipulate petascale genomics datasets, and we will derive new scientific insight simply because of the scale and pace of the research. And it will be advanced IT architectures from companies like SGI and Intel that will make this possible.

Here’s to the next 17 years. I’ll see you in 2032 and we’ll talk about how primitive your 50,000-core cluster and your 100PB filesystems are then.

What questions do you have?


James Reaney is Senior Director, Research Markets for Silicon Graphics International (SGI).