All In One Day by 2020 – the phrase encompasses our real ambition here at Intel to empower researchers to give clinicians the information they need to deliver a targeted treatment plan for patients in just one 24-hour period. I wanted to provide you with some insight into where we are today and what’s driving forward the journey to All In One Day by 2020.
Genomics Code Optimization
We have been working with industry-leader experts, and commercial and open source authors of key genomic codes for several years on code optimization to ensure that genome processing runs as fast as possible on Intel®-based systems and clusters. The result is a significant improvement on the speed of key genomic programs which will help get sequencing and processing down to minutes, for example:
- Intel has sped up a key piece of the Haplotype Caller in GATK, the pairHMM kernel to be 970x faster for an overall 1.8x increase in the pipeline performance;
- The acceleration of file compression for genomics files, e.g. BAM and SAM files by over 4x
- The acceleration of Python using Intel's Math Kernel Library (MKL) producing a 15x speedup on a 16-core Haswell CPU;
- Finally, using the enhanced MKL, in conjunction with its Data Analytics Acceleration Library (DAAL), has enabled DAAL to be 100x faster than R for k-means clusters and 35x faster than Weka on Apriori.
You can find out more about Intel’s work in code optimization at our dedicated Optimized Genomics Code webpage.
Scalability for Success
As we see an explosion in the volume of available data the importance of being able to scale a high performance computing system becomes ever more critical to accelerating success. We have put forth the Intel® Scalable System Framework to guide the market on the optimal construction of an HPC solution that is multi-purpose, expandable and scalable.
Combining the Scalable System Framework with optimized life sciences codes has resulted in a new, more flexible, scalable, and performant architecture. This reduces the need for purpose-built systems and instead offers an architecture that can span a variety of diverse workloads while offering increased performance.
Another key element of an architecture is the balance between three key factors: compute, storage, and fabric. And today we see the fruits of our work coming to life, for example, in a brilliant collaboration between TGen, Dell and Intel which optimized TGen’s RNA-Seq pipeline from 7 days to under 4 hours. TGen are successfully operating FDA-approved clinical trials, balancing research and providing clinical treatment of pediatric oncology patients.
From a week to a day
It’s useful, I think, to see just how far we’ve come in the last four years as we look ahead to the next four years to 2020. In 2012 it took a week to perform the informatics on a whole human in a cloud environment going from the raw sequence data to an annotated result. Today, the time for the informatics had decreased to just 1 day for whole genomes.
With the Dell and Qiagen reference architectures that are based on optimized code and the Intel® Scalable System Framework, a throughput-based solution has been created. This means that when fully loaded these base systems will perform the informatics on ~50 whole genomes per day.
However, it is important to note the genomes processed on these systems still take ~24 hours to run, but they are being processed in a highly parallel manner. If you use a staggered start time of ~30 minutes between samples, this results in a completed genome being produced approximately every 30 minutes. For the sequencing instrumentation, Illumina can process a 30x whole human genome in 27 hours using its “rapid-run mode”.
So, in 2016, we can sequence a whole genome and do the informatics processing in just over 2 days (51 hours consisting of 27 hours of sequencing + 24 hours of informatics time), that’s just ~1 day longer than our ambition of All In One Day by 2020.
Three final points to keep in mind:
- There are steps in the All In One Day process that are our outside of the sequencing and the informatics, such as the doctor's visit, the sample preparation for sequencing, the genome interpretation and the dissemination of results to the patient. These steps will add additional time to the above 51 hours.
- The reference architectures are highly scalable meaning a larger system can do more genomes per day. 4 times the nodes produce 4 times throughput.
- There are enhancements still to be made. For example, streaming the output from the sequencer to the informatics cluster such that the informatics can be started before the sequencing is finished will further compress the total time towards our all-in-one-day goal.
I’m confident our ambitions will be realized.