Primary and Secondary Genome Processing Without Moving Data

There is a data avalanche happening in life sciences right now. Research is advancing exponentially as genomic, clinical, and pharmaceutical work generates a staggering amount of data, and more and more research establishments are choosing the data-rich whole genome sequencing (WGS) approach over whole exome sequencing (WES).

Today, just one complete WGS run produces roughly half a terabyte of raw data and image files. Even though newer, cheaper sequencers are expected to produce smaller image files, the number of people having their genome sequenced will grow from thousands to millions over the next decade. This growth is further fueled by the sequencing cost of a human genome dropping below $1,000 in 2014, compared to a cool $3 billion in 2003.
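
To get a feel for the scale those figures imply, here is a quick back-of-envelope calculation in Python using only the numbers above (the one-million-genome count is an illustrative round number, not a forecast from this article):

    # Rough scale check using the figures quoted above.
    raw_per_run_tb = 0.5        # ~0.5 TB of raw data/image files per WGS run
    genomes = 1_000_000         # illustrative: "thousands to millions" over a decade

    total_pb = raw_per_run_tb * genomes / 1000   # TB -> PB
    print(f"~{total_pb:.0f} PB of raw data")     # prints "~500 PB of raw data"

That is roughly half an exabyte of raw data before a single analysis result has been written.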

Complex, highly granular, unstructured scientific data overwhelms legacy systems and strains the entire IT infrastructure. Traditional infrastructure cannot scale quickly enough, and new solutions are needed to keep up.

Starting from sample preparation of millions of tiny fragments or entire strands of DNA, a sequencing instrument can be thought of as a data factory that pumps out megabytes, gigabytes, and terabytes of data. That data is captured for processing in order to characterize its meaning. This involves analyzing the intensities in high-resolution camera images to extract the signal and then, through a sequence of steps, deciding what that signal represents in terms of the DNA alphabet: the base sequence of A, C, G, and T.
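
As a rough illustration of that base-calling step (a toy model, not any vendor's actual algorithm), the Python sketch below picks the brightest of four per-channel intensities for one sequencing cycle and derives a Phred-style quality score from how dominant that channel is; the intensity values are made up:

    import math

    BASES = "ACGT"  # one fluorescence channel per base

    def call_base(intensities):
        """Toy base call: pick the brightest of the four channels and
        derive a Phred-style quality from how dominant it is."""
        total = sum(intensities)
        idx = max(range(4), key=lambda i: intensities[i])
        p_correct = intensities[idx] / total if total else 0.25
        p_error = max(1.0 - p_correct, 1e-6)          # avoid log(0)
        return BASES[idx], int(round(-10 * math.log10(p_error)))

    # One cycle's channel intensities (A, C, G, T) -- made-up numbers
    print(call_base([12.0, 950.0, 20.0, 18.0]))  # -> ('C', 13)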

Data processing at this level typically requires a high-performance file system connected to a compute platform that runs the algorithms at each stage: image analysis, mapping and indexing of DNA fragments, aligning and merging reads to reconstruct the original sequence, and finally analysis and interpretation of large volumes of data to produce meaningful results.
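
To make that pipeline concrete, the hypothetical Python driver below wires the mapping/alignment stage around the widely used bwa and samtools command-line tools, keeping every input and intermediate on the local flash tier; the paths, thread count, and tool choice are assumptions for illustration, not a description of TGAC's actual pipeline:

    import subprocess
    from pathlib import Path

    # Hypothetical locations on the UV's local NVMe flash ("Tier Zero")
    FLASH = Path("/flash")
    REFERENCE = FLASH / "ref" / "genome.fa"          # assumed pre-indexed with `bwa index`
    READS_1 = FLASH / "run42" / "reads_1.fastq.gz"
    READS_2 = FLASH / "run42" / "reads_2.fastq.gz"
    OUT_BAM = FLASH / "run42" / "aligned.sorted.bam"

    def align_and_sort(threads: int = 32) -> None:
        """Map paired-end reads with `bwa mem` and sort the alignments with
        `samtools sort`, streaming between the two so nothing leaves flash."""
        bwa = subprocess.Popen(
            ["bwa", "mem", "-t", str(threads), str(REFERENCE), str(READS_1), str(READS_2)],
            stdout=subprocess.PIPE,
        )
        subprocess.run(
            ["samtools", "sort", "-@", str(threads), "-o", str(OUT_BAM), "-"],
            stdin=bwa.stdout,
            check=True,
        )
        bwa.stdout.close()
        if bwa.wait() != 0:
            raise RuntimeError("bwa mem failed")

    if __name__ == "__main__":
        align_and_sort()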

The challenge is that there is simply too much data that needs to be stored, analyzed, and then stored again. Each cycle of analysis requires the image data held on external storage to be brought across the network into a compute platform, analyzed, and then written back to external storage. This constant shuttling of data back and forth across the data center unnecessarily burdens and overwhelms the IT infrastructure.
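
To put that shuttling overhead in perspective, a quick calculation shows how long simply moving one ~0.5 TB run takes over a 10 GbE link (the link speed and ~80% efficiency are assumptions for illustration, not figures from the article):

    # Back-of-envelope: time spent just moving one WGS run across the data center.
    run_size_gb = 500                       # ~0.5 TB of raw data/image files per run
    link_gbit_s = 10                        # nominal 10 GbE network link (assumed)
    gb_per_s = link_gbit_s / 8 * 0.8        # assume ~80% of line rate -> 1.0 GB/s

    one_way_min = run_size_gb / gb_per_s / 60
    print(f"one-way transfer:  ~{one_way_min:.0f} minutes")       # ~8 minutes
    print(f"read + write back: ~{2 * one_way_min:.0f} minutes")   # ~17 minutes per cycle

And every additional analysis cycle repeats that cost, which is exactly the traffic the approach below avoids.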

What if we didn’t have to copy the data at all, and could instead process it on the same platform that captures and stores it? That is exactly what innovative genomics organizations such as The Genome Analysis Centre (TGAC) are doing with the help of some smart technology from SGI and Intel: the SGI UV 300 supercomputer paired with Intel® Solid State Drive Data Center P3700 NVMe* flash drives (see http://cis.nbi.ac.uk/new-uv300-big-memory-systems-for-tgac/).

Check out the workflow diagram below.

Figure 1: Pipeline Consolidation on UV & Object store

Here is what the processing sequence (pun intended) looks like:

  • Raw data from the lab equipment is streamed into object or file storage and simultaneously stored on local flash installed directly inside the UV. Compression is either performed at the source or triggered by an iRODS rule to run on the UV or another device (a simplified sketch of this trigger logic appears after the list).
  • The primary analysis pipeline (PAP) runs on the UV against the data on flash to produce QA results. Uncompressed quality scores are written back to flash, and the compressed scores are then stored with the raw data in the object or file store.
  • Assembly, alignment, and annotation pipelines run against the raw and scored data on flash inside the UV. The assembled/aligned sequence and annotation data is then stored with the raw data in the object or file store, and kept temporarily on flash as well.
  • Onward analysis can be performed on the data while it is still on the SSD, or the data can be deleted from the SSD if it is not required immediately.
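
The compress-on-ingest trigger in the first step is handled by an iRODS rule in the real deployment; the Python sketch below is only a simplified stand-in for that logic (the directory names, file extension, and gzip choice are assumptions), showing the idea of compressing each raw file as it lands on the flash tier while the uncompressed copy stays put for the primary analysis pipeline:

    import gzip
    import shutil
    from pathlib import Path

    # Hypothetical staging areas; in the real setup an iRODS rule fires on ingest.
    FLASH_INBOX = Path("/flash/incoming")    # raw files land here from the sequencer
    ARCHIVE_OUT = Path("/archive/staging")   # compressed copies bound for object/file store

    def compress_new_runs() -> None:
        """Compress every raw file in the flash inbox and stage the compressed
        copy for the object/file store, leaving the raw data on flash."""
        ARCHIVE_OUT.mkdir(parents=True, exist_ok=True)
        for raw in FLASH_INBOX.glob("*.bcl"):          # raw basecall files (illustrative)
            target = ARCHIVE_OUT / (raw.name + ".gz")
            if target.exists():
                continue                               # already staged
            with raw.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)

    if __name__ == "__main__":
        compress_new_runs()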

The complete data set remains in the object or file store, available for dissemination, sharing, and long-term use, while all of the processing occurs where the data first arrived; it never needs to be moved and stays at “Tier Zero” all along. This approach eliminates data transfer delays and unnecessary compression/decompression cycles, allowing the pipeline to run at maximum speed. The researcher’s workflow and overall time to solution are accelerated by running all of the pre-processing, assembly, and analysis operations on a single system, without the need to move data.

The key benefit of this approach is that scientists can accomplish more in less time, while the total cost of ownership (TCO) is reduced thanks to the ease of single-system administration and the consolidation of workloads.

Kirill Malkin, Director of Storage Engineering, SGI

Hard to believe? Come to SGI’s booth at ISC16 and talk to them about it.


About Andrey Kudryavtsev

Andrey Kudryavtsev is an SSD Solution Architect in the NVM Solution Group at Intel. His main focus is HPC, where he helps end customers and ecosystem partners take advantage of modern storage technologies and accelerate NVMe SSD adoption. He has more than 12 years of server experience, the last 10 of them at Intel. Known for his engineering creativity, he is an influence in his field. He graduated from Nizhny Novgorod State University in Russia with a degree in Computer Science in 2004. Outside of work, he is the owner and co-author of many experimental technologies in music, musical instruments, and multi-touch surfaces.