elPrep: Fast, Single-pass, Parallel pre-Processing for Genomic Pipelines

This blog post was written by a colleague, Robert Sugar, a Software Architect at Intel Health and Life Sciences. It features some very exciting advances for those who work in the genomics field, particularly as we gather at BioData World Congress in Cambridge, UK, to hear how organisations across the world are advancing precision medicine. If you'd like to discuss any aspect of this post, you can find Robert's LinkedIn details at the end of his writing.


By Robert Sugar

With the aspiration that all-in-one-day genome sequencing will make precision medicine a reality, I wanted to share, ahead of BioData World Congress 2015, some of the work Intel has been undertaking with partners to speed up an important part of the sequencing process. Like most good stories, this one starts at the beginning, by which I mean the mapping phase of DNA sequencing analysis, which prepares sequence alignment files for variant calling in sequencing pipelines.

From Multiple to Single Pass

In a typical genomic pipeline, a number of tasks such as marking duplicates, sorting and filtering must take place after the mapping phase, and these usually require multiple passes by different preparation tools. Calling multiple command-line tools numerous times leads to repeated I/O between the steps and to multiple passes over the same data, which may have been only incrementally modified. Moreover, many of these tools (such as the Picard tool recommended by the gold-standard GATK workflow) use only a single CPU thread. As a result, more time is often spent on sorting and filtering than on variant calling itself. This not only slows down the entire process of sequencing a genome but also has financial implications.

Intel, alongside IMEC and in collaboration with Janssen Pharmaceutica (as part of the ExaScience Life Lab), developed elPrep, an open-source, high-performance tool for DNA sequence (BAM file) processing that uses a single-pass, parallel filtering architecture. elPrep simplifies not only the computational processes but also what the end user needs to understand about them, thus reducing both time and costs. Additional filters can also be added easily for customer-specific workflows.


Figure 1: traditional multi-pass BAM file processing (blue arrows) vs. a single-pass elPrep workflow (orange arrows). Source: Charlotte Herzeel (Imec) and Pascal Costanza (Intel)

Meeting Today’s Standards

Throughout the development of elPrep it was vitally important to ensure compatibility with existing tools and datasets, and this has been achieved. elPrep can be used as a drop-in replacement for existing tools today; for example, it is now the standard tool for exome workloads at Janssen Pharmaceutica. What truly makes elPrep a fantastic tool for the genomics community, though, is its single-pass, extensible filtering architecture. By merging the computations of multiple steps, it avoids repeated file I/O between the preparation steps. Unnecessary barriers between the steps are removed, allowing all preparation steps to be executed in parallel.
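To make the single-pass idea concrete, here is a minimal Python sketch. It is not elPrep's implementation or API: the Read record, the two toy filters and the single_pass and multi_pass functions are hypothetical names invented for illustration, the duplicate rule (same chromosome and position) is a deliberate oversimplification of what real tools do, and the loop is sequential rather than parallel. It only shows how chaining filters per read lets the data be traversed once instead of once per step.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Read:
    """Hypothetical, heavily simplified stand-in for one alignment record."""
    name: str
    chrom: str
    pos: int
    mapped: bool = True
    duplicate: bool = False


# A filter returns the (possibly modified) read, or None to drop it.
Filter = Callable[[Read], Optional[Read]]


def drop_unmapped(read: Read) -> Optional[Read]:
    return read if read.mapped else None


def make_duplicate_marker() -> Filter:
    """Toy rule (an assumption): a read is a duplicate if an earlier read
    had the same chromosome and position."""
    seen = set()

    def mark(read: Read) -> Optional[Read]:
        key = (read.chrom, read.pos)
        if key in seen:
            return replace(read, duplicate=True)
        seen.add(key)
        return read

    return mark


def _by_coordinate(read: Read):
    return (read.chrom, read.pos)


def multi_pass(reads: Iterable[Read], filters: List[Filter]) -> List[Read]:
    """Traditional style: every step traverses the whole dataset again."""
    current = list(reads)
    for f in filters:
        current = [r for r in (f(x) for x in current) if r is not None]
    return sorted(current, key=_by_coordinate)


def single_pass(reads: Iterable[Read], filters: List[Filter]) -> List[Read]:
    """Single-pass style: each read flows through all filters in one traversal."""
    kept = []
    for read in reads:
        current: Optional[Read] = read
        for f in filters:
            current = f(current)
            if current is None:
                break  # read was dropped; skip the remaining filters
        if current is not None:
            kept.append(current)
    return sorted(kept, key=_by_coordinate)


reads = [
    Read("r1", "chr2", 50),
    Read("r2", "chr1", 10),
    Read("r3", "chr1", 10),
    Read("r4", "chr1", 5, mapped=False),
]
result = single_pass(reads, [drop_unmapped, make_duplicate_marker()])
print([(r.name, r.duplicate) for r in result])
# prints [('r2', False), ('r3', True), ('r1', False)]
```

Both functions give the same result on this input; the difference is that multi_pass walks the full dataset once per filter, which in a file-based pipeline would mean a read and a write per step.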

Reducing Time, Increasing Value

It is worth illustrating the impact of this with figures from a real-life exome sequencing project by Janssen Pharmaceutica, as reported in ‘elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling’ (2015; see reference 1 below). The runtime of an exome workload can be reduced from around 1 hour 40 minutes to around 10 to 15 minutes. The gains also tend to increase with the complexity of the pipeline. This is a considerable time (and cost) saving when looked at in the context of mapping and analysing whole-genome data.
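As a quick back-of-the-envelope check, the runtimes quoted above can be turned into speedup factors. The speedup helper below is a hypothetical name used only for this calculation, and the inputs are the approximate figures from the text, not new measurements.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Ratio of the old runtime to the new runtime."""
    return baseline_minutes / new_minutes


baseline = 100  # around 1 hour 40 minutes, expressed in minutes

for new in (15, 10):
    saved = baseline - new
    print(f"{new} min: {speedup(baseline, new):.1f}x faster, {saved} min saved")
# prints:
# 15 min: 6.7x faster, 85 min saved
# 10 min: 10.0x faster, 90 min saved
```

In other words, the quoted exome figures correspond to a speedup of roughly 7x to 10x.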

elPrep is an important addition to the toolset of those working in the genomics field. I would draw your attention to the previously mentioned paper, which provides extensive detail on the benefits of elPrep and more information on how it compares to existing SAM/BAM manipulation tools. There is more work to be done as we look ahead to all-in-one-day genomic sequencing, but this is an exciting development. elPrep is available as open source for both academic and industrial customers at https://github.com/ExaScience/elprep and is being integrated into online genomic toolkits, such as DNAnexus.

Contact Robert Sugar on LinkedIn

1 Herzeel C, Costanza P, Decap D, Fostier J, Reumers J (2015) elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling. PLoS ONE 10(7): e0132868. doi:10.1371/journal.pone.0132868