How to Accelerate Human Genome Resequencing

Genome resequencing in patients is an important step in the detection of mutation for congenital diseases. Traditionally, genomics software has been run on High Performance Computing (HPC) architectures.

Hadoop and MapReduce technologies are slowly transforming the Life Sciences arena by allowing parallel read-mapping algorithms to scale effectively and resulting in shorter execution times and lower costs (from software execution and hardware). Michael Schatz (University of Maryland) and Ben Langmead (John Hopkins University) have introduced various software applications like Crossbow into the Hadoop ecosystem, enabling gene resequencing to run on Hadoop clusters as well as on the cloud (Amazon Web Services via Elastic Map Reduce service). Crossbow provides a scalable software pipeline that can analyze over 35x coverage of the human genome on a 10-node Hadoop cluster in about one day.

However, being open source, Hadoop seems less polished in some areas and can be difficult to manage in others. Companies like Intel Corporation have started with the Apache Hadoop Distribution and added components to it for better manageability and performance – optimized for Intel Xeon processors, in order to provide businesses with an open enterprise Hadoop platform for next generation analytics and life sciences, called the Intel® Distribution for Apache Hadoop Software.

In this new paper, we demonstrate how to install and configure Crossbow and its required components – Bowtie, SOAPsnp and SRA toolkit within Intel® Hadoop Distribution. Read the full paper here.

Technological advancement is far outpacing our knowledge and abilities to interpret genomic information.  However, technology is allowing newer opportunities to interpret more and more information about humans and other animals.

What are your thoughts on the usage, benefits and side-effects of these technologies?