How to Achieve 39 Percent Faster Performance for Whole Genome Analysis

The transition toward next-generation, high-throughput genome sequencers is creating new opportunities for researchers and clinicians. Population-wide genome studies and profile-based clinical diagnostics are becoming more common and more cost-effective. At the same time, such high-volume and time-sensitive usage models put more pressure on bioinformatics pipelines to deliver meaningful results faster and more efficiently.

Recently, Intel worked closely with Seven Bridges Genomics’ bioinformaticians to design the optimal genomics cluster building block for direct attachment to high-throughput, next-generation sequencers using the Intel Genomics Cluster solution. Though most use cases will involve variant calling against a known genome, more complex analyses can be performed with this system. A single 4-node building block is powerful enough to perform a full transcriptome analysis. As demands grow, additional building blocks can easily be added to a rack to support multiple next-generation sequencers operating simultaneously.

Verifying Performance for Whole Genome Analysis

To help customers quantify the potential benefits of the PCSD Genomics Cluster solution, Intel and Seven Bridges Genomics ran a series of performance tests using the Seven Bridges Genomics software platform. Performance of a whole genome pipeline running on the test cluster was compared with that of the same software platform running on a 4-node public cloud cluster based on the previous-generation Intel Xeon processor E5 v2 family.

The subset of the pipeline used for the performance tests includes four distinct computational phases:

  • Phase A: Alignment, deduplication, and sorting of the raw data reads
  • Phase B: Local realignment around indels
  • Phase C: Base quality score recalibration
  • Phase D: Variant calling and variant quality score recalibration
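
To compare systems phase by phase, each phase can be timed separately. The following is a minimal sketch of such a timing harness, not the Seven Bridges Genomics pipeline itself. It assumes the four phases run strictly in the order listed, and `placeholder_phase` is a hypothetical stand-in that does no genomics work; real alignment or variant-calling steps would take its place.

```python
import time


def placeholder_phase():
    """Hypothetical stand-in for a real pipeline step; does no genomics work."""
    return None


# Phase labels follow the list above; the callables are placeholders (assumption).
PHASES = [
    ("A: alignment, deduplication, sorting", placeholder_phase),
    ("B: local realignment around indels", placeholder_phase),
    ("C: base quality score recalibration", placeholder_phase),
    ("D: variant calling and variant quality score recalibration", placeholder_phase),
]


def run_pipeline(phases):
    """Run the phases in order and return a list of (name, elapsed_seconds)."""
    timings = []
    for name, step in phases:
        start = time.perf_counter()
        step()
        timings.append((name, time.perf_counter() - start))
    return timings


timings = run_pipeline(PHASES)
total_seconds = sum(seconds for _, seconds in timings)

for name, seconds in timings:
    print(f"Phase {name}: {seconds:.6f} s")
print(f"Total: {total_seconds:.6f} s")
```

Recording one elapsed time per phase, rather than a single end-to-end total, makes it possible to see which phase accounts for a difference between two systems.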

The results of the performance tests were impressive. The Intel Genomics Cluster solution based on the Intel® Xeon processor E5-2695 v3 family completed a whole genome pipeline in just 429 minutes versus 726 minutes for the cloud-based solution powered by the prior-generation Intel® Xeon processor E5 v2 family.

Based on these results, researchers and clinicians can potentially complete a whole genome analysis almost five hours sooner using the newer system. They can also use this 4-node system as a building block for constructing large, local clusters. With this strategy, they can easily scale performance to enable high utilization of multiple high-volume, next-generation sequencers.
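
The time savings can be checked directly from the two reported totals. The short sketch below uses only the 429-minute and 726-minute figures from the text; the function and variable names are illustrative. Percentage figures depend on which ratio is quoted, so it reports both the reduction in elapsed time and the speedup ratio.

```python
# Wall-clock totals reported in the text, in minutes.
NEW_CLUSTER_MIN = 429      # cluster based on the Intel Xeon processor E5-2695 v3
CLOUD_BASELINE_MIN = 726   # public cloud cluster based on the E5 v2 family


def compare_runtimes(new_min, baseline_min):
    """Return the time saved and two common ways of expressing the gain."""
    saved = baseline_min - new_min
    return {
        "minutes_saved": saved,
        "hours_saved": saved / 60,
        # Share of the baseline run time that is eliminated.
        "time_reduction_pct": 100 * saved / baseline_min,
        # How many times faster the new system finishes the same work.
        "speedup": baseline_min / new_min,
    }


result = compare_runtimes(NEW_CLUSTER_MIN, CLOUD_BASELINE_MIN)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```

The difference is 297 minutes, or 4.95 hours, which matches the "almost five hours" stated above. Measured on these two totals alone, that is roughly 40.9 percent less elapsed time, or a speedup of about 1.69 times.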

For a more in-depth look at these performance tests, we will soon release a detailed abstract covering the workloads and system behavior in each phase of the analysis.

What questions do you have?