Reference Architectures for High-Volume Genome Analysis

Dr. Michael J. McManus, Senior Health & Life Sciences Solution Architect, Intel

I recently had the pleasure of joining Mikael Flensborg, Director and Solutions Lead at QIAGEN Bioinformatics, in delivering a webinar entitled Reference Architectures for the QIAGEN Biomedical Genomics Solutions. While the example architectures I walked through in the webinar could equally be applied to other reference-based genomics software tools, Intel has worked particularly closely with QIAGEN to define infrastructure architectures that, combined with optimized software, can deliver massively scalable whole genome analysis at ever-decreasing cost.

The webinar itself is available on demand, so I won’t cover all of its content in this post, but rather give an overview of the subjects covered, and point to a number of other useful resources available from QIAGEN and Intel related to planning a system that will meet the compute and storage needs of high-volume genome analysis.

As Mikael covers in the webinar, QIAGEN's focus is on developing software that delivers the most intuitive user interface for NGS scientists and, critically, keeping that interface consistent whether it sits on a desktop computer or a high-performance computing cluster. Mikael talks about how the combination of QIAGEN's Biomedical Genomics Workbench software and its Biomedical Genomics Server Solution effectively 'masks' the complexity of cluster computing, particularly through the work QIAGEN has done in partnership with Intel to optimize the software for the Intel chipset. And while I spend my days here at Intel deep in that complex world, we too, of course, are passionate about bringing massive computing power to scientists in intuitive and easy-to-use ways.

However, as Mikael points out, a common question from QIAGEN's customers is: what kind of hardware architecture do they need for a specific genome sequencing throughput and mix of workloads? Which is where I came in…

As you might expect, it's not a straightforward question to answer, as both sequencer output and the type of sequencing vary considerably: whole genomes, exomes, RNA-Seq, and gene panels all place different demands on compute power and storage. And at its core (excuse the pun), these two factors are central to planning an infrastructure architecture for today's needs and those of the future. But how best to estimate?

Taking a step back for context, I talk in the webinar about the principles involved in building highly performant, scalable systems, specifically our Scalable System Framework. This is designed to help overcome the barriers and bottlenecks we see in the Health and Life Sciences sector, from the memory and storage walls and unoptimized software to the challenges of divergent infrastructures, all driven by our desire to make high-performance computing available to everyone, at every scale.

To create benchmarks that could then be used to estimate system size, and to establish the total cost of ownership of the Biomedical Genomics Server Solution over a four-year period, we worked with QIAGEN to test at the highest known genome throughput: 18,000 whole genomes per year from the Illumina HiSeq X Ten sequencing system (also known as the '$1,000 sequencer'). The full results of that test can be found in the associated co-authored White Paper: Analyzing Whole Genomes for as Little as $22.
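To give a feel for how a per-genome figure like that is derived, here is a minimal sketch of the arithmetic: total cost of ownership amortized over the genomes processed during the TCO period. The throughput and four-year period come from the test described above; the TCO value in the sketch is a placeholder assumption of mine, not a figure published by QIAGEN or Intel (the real inputs are in the White Paper).

```python
# Illustrative only: per-genome cost as total cost of ownership divided by
# genomes processed over the amortization period.
# The TCO value below is a placeholder assumption, not a published figure.
annual_throughput = 18_000        # whole genomes per year (HiSeq X Ten scale)
amortization_years = 4            # TCO period used in the White Paper
assumed_tco_usd = 1_600_000       # placeholder covering hardware, software, and operations

genomes_processed = annual_throughput * amortization_years
cost_per_genome = assumed_tco_usd / genomes_processed
print(f"~${cost_per_genome:.2f} per genome")  # ~$22 with these placeholder inputs
```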

The test gave us some great benchmarks from which to estimate the size of system needed, in both compute and storage terms, for different workloads. For whole genomes, for example, we have defined a benchmark of 1.5 genomes per node per day, with a storage capacity need of 27 terabytes per node per month. Exomes, by contrast, can be processed at a much higher rate of 45 per node per day, but that speed also means storage accumulates very quickly, at 135 terabytes per node per month. We have established similar benchmarks for RNA-Seq and gene panels, which means that any organization can reliably estimate the size of system and infrastructure architecture it needs based on the output of the sequencers it uses, in terms of both volume and genomic mix.
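To make that arithmetic concrete, here is a minimal sketch (my own illustration, not a QIAGEN or Intel sizing tool) of how those per-node benchmarks translate into a node count and a monthly storage estimate for a given annual throughput. The whole-genome and exome figures are the benchmarks quoted above; the function name and structure are simply for illustration.

```python
import math

# Per-node benchmarks quoted above:
# (samples processed per node per day, TB of new storage per node per month)
BENCHMARKS = {
    "whole_genome": (1.5, 27.0),
    "exome": (45.0, 135.0),
}

def estimate_cluster(workload: str, samples_per_year: float) -> tuple[int, float]:
    """Estimate (compute nodes required, TB of storage accumulated per month)."""
    per_node_per_day, tb_per_node_per_month = BENCHMARKS[workload]
    nodes = math.ceil(samples_per_year / 365 / per_node_per_day)
    # Assumes every node is fully utilized, so the storage figure is a rough upper bound.
    return nodes, nodes * tb_per_node_per_month

# Example: the 18,000 whole genomes per year discussed above.
nodes, tb_per_month = estimate_cluster("whole_genome", 18_000)
print(f"{nodes} nodes, ~{tb_per_month:.0f} TB of new storage per month")
```

With the whole-genome benchmark, 18,000 genomes per year works out to roughly 33 nodes, and the same exercise can be repeated per workload and summed to size a mixed-use cluster.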

And really, to conclude, this is the critical point: any organization looking to estimate the size of system it requires, one that allows for the scalability to meet its future needs, should always start from the perspective of its specific workload.

I hope you find the webinar useful. Once again, you can find it on demand here.