GATK4 (Genome Analysis Toolkit) Launch: Optimizing Genomics Analytics

Genomics holds real promise to improve healthcare for countless patients worldwide, and genomics analytics is the foundation for precision medicine. When the genomic roots of a disorder have been identified, pharmaceutical companies can develop treatments that target that specific disorder. Clinicians can then use these targeted therapies to develop treatment plans individualized to each patient, increasing the chances of successful outcomes. To reach individualized treatment plans, massive genomic datasets must be analyzed and compared to identify the variants that can spur these breakthroughs. Genomic data is doubling approximately every 7 months and is projected to reach about one zettabase by 2025 (“Big Data: Astronomical or Genomical?” PLOS Biology, July 7, 2015). The demand for technology in the genomics space is growing almost as quickly as the data.
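
To put that doubling rate in perspective, the short Python sketch below computes the growth factor implied by a fixed doubling period. Only the 7-month doubling period comes from the text above; the function name and the ten-year horizon are illustrative assumptions, not figures from the cited paper.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Return the multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


# Illustrative horizon (an assumption): ten years at a 7-month doubling period.
ten_year_factor = growth_factor(120)
print(f"Growth over 10 years: about {ten_year_factor:,.0f}x")
```

If the doubling period held steady for a decade, data volume would grow more than 100,000-fold, which is why storage, compute, and software all have to scale together.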

Intel is a data company that addresses big data problems and helps organizations sort, analyze, and interpret data to solve real-world problems. Intel has capabilities in software optimization, scaling solutions, and speeding up the time to meaningful outcomes. Intel applies new technologies like faster processors, NVMe* and PCIe* SSDs, FPGAs, high-speed fabrics, and artificial intelligence (AI) to address these big data problems, taking into account not just one part of an issue (for example, processing power or I/O speed) but the solution as a whole. And what better problem is there to address than curing disease?

Collaborating to Support Genomics Research

In late 2016, Intel and the Broad Institute of MIT and Harvard, the leader in genomics research, announced the Center for Genomic Data Engineering with a five-year, $25M commitment from Intel. This Center was a unique partnership for both organizations. Intel saw the opportunity to affect medical research and patient outcomes by combining the expertise of the two organizations to support, analyze, and manage the rapidly growing volume of genomics data available to researchers, pharmaceutical companies, and clinicians.

The Center for Genomic Data Engineering has profiled the GATK Best Practices Pipeline and identified areas where speed and scale could be improved. This work resulted in performance optimizations for PairHMM, the algorithm used in HaplotypeCaller, on Intel® Xeon® processors with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and on Intel FPGAs. Intel released the Genomics Kernel Library, which provides performance improvements for genomics workloads across the genomics pipeline. The Center, along with the Intel Science and Technology Center at MIT, also developed GenomicsDB, a variant data store that has provided a 5x improvement (Broad Institute, Geraldine Van der Auwera, Ph.D., BioIT World May 24, 2017, https://software.broadinstitute.org/gatk/blog?id=9644) in speed and scale for joint genotyping. These efforts have enabled GATK to run faster, at lower cost, and at larger scale.

Intel packaged the work and optimizations from the Center into the Intel Select Solution for Genomics Analytics (https://builders.intel.com/docs/intel-select-genomics-analytics.pdf). This is an end-to-end reference architecture that specifies hardware configurations, software requirements, and Workflow Description Language (WDL) scripts, together with a deployment recipe, to deliver the best performance. It is supported by key OEM partners (HPE, Colfax, Inspur, and Lenovo), and Intel is working to expand support to additional OEMs and cloud service providers (CSPs). We will continue to innovate and improve on our genomic analytics toolkit for the industry.

GATK4 Release

Intel is very excited about the release of the next version of GATK as an open-source solution: GATK4. GATK has been the leading software tool for genomics analysis for many years, and the move to GATK4 and open source solidifies it as the premier genomics analytics software moving forward.

The Broad Institute has committed to innovating GATK to provide scale, speed, and lower cost. Broad reports that the re-engineered GATK solves key bottlenecks, including an analysis step where GATK4 can analyze some types of genomic sequence data 15 times faster than GATK3 while increasing input capacity by a factor of five. Broad also included additional pipelines to address the requirements of the market, allowing more users to leverage the leading open-source solution for genomics analysis and standardizing the pipelines. With the growth of genomics data and the increased use of genomic analysis, it is important to process data in a standard fashion so results can be shared, compared, and leveraged for multiple studies.

GATK4 was completely rewritten for performance, flexibility, speed, and scalability. It will include end-to-end pipeline scripts that can be run on any local or cloud compute infrastructure. This innovation and focus on scale is required to meet the needs of the genomics market and tackle the coming zettabase of data. GATK4 is the first and only open-source software package that covers all major types of variant and mutation detection use cases for population genetics, rare and common diseases, and cancer genomics. GATK4 includes both well-established pipelines and new tools that take advantage of the latest developments in machine learning and neural network algorithms. Intel is committed to supporting these efforts with technology to further accelerate better analysis for genomics data.

For more information on GATK4, visit the Broad Institute’s site at: https://software.broadinstitute.org/gatk/gatk4.

More on the Intel and Broad collaboration can be found here: https://www.intel.com/content/www/us/en/healthcare-it/solutions/genomics-broad-data.html and info on the Intel Genomics Analytics Solution here:
https://www.intel.com/content/www/us/en/healthcare-it/solutions/genomics-analytics.html