Speeding Up Genomics Applications with Faster Compression

After several years of work on genomics with industry and academia, Intel has introduced the Genomics Kernel Library (GKL). This open-source code gives developers access to performance optimizations that accelerate genomics applications on Intel® architecture. It targets hardware based on Intel® Xeon® processors, FPGAs, and Intel® Xeon Phi™ coprocessors.

Intel is working with the Broad Institute of MIT and Harvard on this project. To make GKL as useful as possible, we provide native libraries for both Linux* and Mac OS X*, Java* wrappers for the GATK (Genome Analysis Toolkit) and HTSJDK (High-Throughput Sequencing JDK), and native C/C++ support. GKL sits in the solution stack between genomics applications and the hardware, helping the software take advantage of Intel architecture.

Accelerating Compression and Decompression

A large part of the GKL performance gain comes from accelerated compression and decompression. At compression level 1, GKL uses a fast, DEFLATE-compatible compression routine. (DEFLATE is the lossless compression algorithm used by zlib, gzip, and zip.) For compression levels 2 through 9, GKL uses a set of zlib patches created by Intel’s Open Source Technology Center.
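To make the notion of compression levels concrete, here is a minimal sketch that compresses the same buffer at level 1 and level 9 and then restores it. It uses only the stock java.util.zip.Deflater and Inflater classes from the JDK, not GKL itself. The class name DeflateLevels and the repeated-sequence sample data are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; it exercises the standard JDK codecs, not GKL.
public class DeflateLevels {

    // Compress input at the given level (1 = fastest, 9 = best compression).
    public static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end(); // release native zlib resources
        return out.toByteArray();
    }

    // Restore the original bytes from a zlib-wrapped DEFLATE stream.
    public static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("Truncated or unsupported DEFLATE stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt DEFLATE stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Illustrative sample data: a short nucleotide string repeated many times.
    public static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("ACGTTGCAAGGCTTAC\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] input = sampleData();
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            byte[] packed = compress(input, level);
            boolean ok = Arrays.equals(input, decompress(packed));
            System.out.printf("level %d: %d -> %d bytes, round trip ok: %b%n",
                    level, input.length, packed.length, ok);
        }
    }
}
```

Because every level produces a valid DEFLATE stream, the same decompress method reads the output of any level. That compatibility is what lets a faster level 1 routine stand in for the default one without changes on the reading side.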

Results: Substantial Speedup with Minimal Change to Compression Ratio

To showcase the performance benefit of GKL, Intel tested the standard Java Deflater* and Java Inflater* components used by GATK and other genomics applications against versions that have been optimized using GKL. We ran compression and decompression routines at various compression levels, using sample BAM files about 300 MB in size. The results are striking.

On the compression side, GKL optimization of Java Deflater increased performance by a factor of about 1.3x to 2.8x, depending on the compression level. Likewise, GKL optimization of Java Inflater accelerated decompression by a factor of about 1.7x to 2x. The cost of this performance increase in terms of compression ratio is minimal. For example, the largest performance increase in this testing (approximately 2.8x acceleration of Java Deflater at compression level 1) increased the compressed file size by only about 1.8 percent. The test system configuration is listed at the end of this article.
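The two figures quoted above, speedup factor and growth in compressed size, are simple ratios. The sketch below shows one way they could be computed, along with a small sweep over compression levels 1 through 9 using the stock JDK Deflater. It is not the harness Intel used. The class name DeflaterLevelSweep, the helper names, and the synthetic random-base input are all assumptions, and timings from such a simple loop are only indicative.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical measurement sketch; not the benchmark behind the published numbers.
public class DeflaterLevelSweep {

    // Speedup factor: baseline time divided by optimized time.
    public static double speedup(double baselineSeconds, double optimizedSeconds) {
        return baselineSeconds / optimizedSeconds;
    }

    // Percent growth of the optimized output relative to the baseline output.
    public static double sizeIncreasePercent(long baselineBytes, long optimizedBytes) {
        return 100.0 * (optimizedBytes - baselineBytes) / baselineBytes;
    }

    // Compress with the stock JDK Deflater and return the compressed size in bytes.
    public static long compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Synthetic stand-in for sequencing data: random A/C/G/T bytes (an assumption,
    // not a real BAM file).
    public static byte[] syntheticBases(int length, long seed) {
        Random rng = new Random(seed);
        byte[] alphabet = {'A', 'C', 'G', 'T'};
        byte[] data = new byte[length];
        for (int i = 0; i < length; i++) {
            data[i] = alphabet[rng.nextInt(alphabet.length)];
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] input = syntheticBases(4 * 1024 * 1024, 42L);
        for (int level = 1; level <= 9; level++) {
            long start = System.nanoTime();
            long size = compressedSize(input, level);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("level %d: %.3f s, compression ratio %.3f%n",
                    level, seconds, (double) input.length / size);
        }
    }
}
```

To compare two implementations, run the same sweep with each one, then pass the paired timings to speedup and the paired output sizes to sizeIncreasePercent. A serious measurement would also add warm-up iterations and repeated runs, which this sketch omits.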

For details of the performance testing and results, read the white paper, “Accelerating the Compression and Decompression of Genomics Data using GKL Provided by Intel.”

System under test: 2x Intel® Xeon® processor E5-2699 v4, 256 GB RAM, Intel® SSD DC P3700 Series (2 TB), sample input BAM file (~300 MB, provided by the Broad Institute), GKL release 4.3, Java Deflater* module, Java Inflater* module.