Part I: Data-Driven Science and the Coming Era of Petascale Genomics

Seventeen years. That’s how long it has taken us to move from the dawn of automated DNA sequencing to the data tsunami that defines next-generation sequencing (NGS) and genomic analysis today. I’m remembering, with some fondness, the year 1998, which I’ll consider the year the life sciences got serious about automated DNA sequencing, and about sequencing the human genome in particular. That was the year the train left the station: genomics research went from benchtop science to a prime mover of high-performance computing (HPC) architectures and never looked back.

1998 was the year Perkin Elmer formed PE Biosystems, an amalgam of Applied Biosystems, PerSeptive Biosystems, Tropix, and PE Informatics, among other acquisitions. That was the year PE decided it could sequence the human genome before the academics could, competing against its own customers, and that it would do so by brute-force application of automated sequencing technology. That was the year Celera Genomics was born and Craig Venter became a household name, at least if you lived in a household where molecular biology was a common dinnertime subject.

Remember Zip Drives?

In 1998, PE partnered with Hitachi to produce the ABI PRISM 3700, and hundreds of these machines were sold worldwide, kick-starting the age of genomics. PE Biosystems’ revenues that year were nearly a billion dollars. The 3700 was such a revolutionary product that it purportedly could produce as much DNA data in a single day as a typical academic lab could produce in a whole year. And yet, from an IT perspective, the 3700 was quite primitive. The computational engine driving the instrument was a Mac Centris, later upgraded to a Quadra, and finally to a Dell running Windows NT. There was no provision for data collection other than local storage, which, if you wanted any portability, meant the ubiquitous Iomega Zip Drive. You remember those? Those little purplish-blue boxes that sat on top of your computer and gave you a whopping 100 megabytes of portable storage. The pictures on my phone would easily fill several Zip disks today.

Networking the 3700 was no mean feat either. We had networking in 1998, of course; gigabit Ethernet and most wireless networking technologies were still just ideas, but 100-megabit (100Base-TX) connections were common enough, and just about anyone in an academic research setting had at least a 10-megabit (10Base-T) connection available. The problem was the 3700 itself, and specifically the little Dell PC paired with the instrument and responsible for all the data collection and subsequent transfer of data to some computational facility (Beowulf-style Linux HPC clusters were just becoming commonplace in 1998 as well). As shipped from PE at that time, there was zero provision for networking and zero provision for data management beyond the local hard drive and/or the Zip Drive.

It seems laughable today, but PE did not consider storage and networking, i.e., the collection and transmission of sequencing data, a strategic platform element. I guess it didn’t matter, since they were making a BILLION DOLLARS selling 3700s and all those reagents, even if a local hard drive and sneakernet were your only realistic data management options. Maybe they just didn’t have the proper expertise at the time. After all, PE was in the business of selling laboratory instruments, not computers, storage, or networking infrastructure.

Changing Times

How times have changed. NGS workflows today practically demand HPC-style computational and data management architectures. The capillary electrophoresis technology in the 3700 was long ago superseded by newer and more advanced sequencing technologies, dramatically increasing the data output of these instruments while simultaneously lowering costs. It is not uncommon today for DNA sequencing centers to output many terabytes of sequencing data every day from each machine, and there can be dozens of machines all running concurrently. Being a major NGS center now also means being adept at collecting, storing, transmitting, managing, and ultimately archiving petascale amounts of data. That’s seven orders of magnitude removed from the Zip Drive. And if you are also in the business of genomics analysis, you need to be expert in computational systems capable of handling data and data rates at these scales as well.
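To put that “seven orders of magnitude” in perspective, here is a quick back-of-envelope sketch in Python. The per-instrument output and instrument count are illustrative assumptions on my part, not figures from any particular sequencing center:

    import math

    # Capacity comparison: a 1998-era Zip disk vs. a petabyte-scale archive.
    zip_disk_bytes = 100e6            # ~100 MB Iomega Zip disk
    petabyte_bytes = 1e15             # 1 PB
    print(f"1 PB ~ 10^{math.log10(petabyte_bytes / zip_disk_bytes):.0f} Zip disks")

    # Hypothetical daily output for a large sequencing center (assumed figures):
    tb_per_instrument_per_day = 2     # "many terabytes" per machine, assumed
    instruments = 24                  # "dozens of machines", assumed
    daily_tb = tb_per_instrument_per_day * instruments
    print(f"~{daily_tb} TB/day, or ~{daily_tb * 365 / 1000:.0f} PB/year to store and manage")

Even with conservative numbers, the arithmetic lands squarely in petascale territory within a single year of operation.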

Today, this means either massively scalable cloud-based genomics platforms or the more traditional, even higher-scale HPC architectures that dominate large research computing centers worldwide. We are far, far beyond the days of a single Mac Quadra or Dell server. Maybe if PE had been paying closer attention to the IT side of the NGS equation, they would still be making billions of dollars today.

In Part II of this blog, I’ll look at what’s in store for the next 17 years in genomics. Watch for the post next week.


James Reaney is Senior Director, Research Markets for Silicon Graphics International (SGI).