Harnessing the Power of Hadoop for Genomic Assembly and Alignment

Genome resequencing helps us understand how genetic differences affect health and cause disease. It is an important step in detecting anomalies associated with many genetically inherited conditions, such as heart disorders, Down syndrome, cystic fibrosis and chromosomal abnormalities.

Next Generation Sequencing (NGS) technologies running on High Performance Computing (HPC) architectures have enabled the sequencing of DNA at groundbreaking speeds. However, storing, analyzing and managing the massive DNA sequence datasets that NGS research produces is a new challenge. Hadoop and MapReduce technologies come into play here: they allow parallel read-mapping algorithms to scale effectively, resulting in shorter execution times and lower software and hardware costs.
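To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python of how read mapping splits into a map phase and a reduce phase. It is an illustration only, not how Hadoop or any real aligner is implemented: the toy `REFERENCE` string, the function names (`map_read`, `shuffle`, `reduce_hits`, `run_job`) and the exact-match search are all assumptions made for this example. On a real cluster, the framework would distribute the map calls across nodes and perform the shuffle itself.

```python
from collections import defaultdict
from itertools import chain

# Toy reference sequence (hypothetical; real genomes are billions of bases).
REFERENCE = "ACGTACGGTCA"


def map_read(read):
    """Map phase: emit a (position, read) pair for every exact match.

    Each read is processed independently of the others, which is what
    lets a framework like Hadoop run many of these calls in parallel.
    """
    hits = []
    start = REFERENCE.find(read)
    while start != -1:
        hits.append((start, read))
        start = REFERENCE.find(read, start + 1)
    return hits


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for position, read in pairs:
        grouped[position].append(read)
    return grouped


def reduce_hits(position, reads):
    """Reduce phase: count how many reads start at this position."""
    return position, len(reads)


def run_job(reads):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_read(read) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_hits(pos, rs) for pos, rs in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["ACG", "ACG", "GTC", "TTT"]))
```

The point of the sketch is the shape of the computation: because no read depends on any other, adding nodes lets more reads be mapped at the same time, which is where the shorter execution times come from. A production pipeline would replace the exact-match search with a real aligner such as Bowtie.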

Hadoop technologies may also be useful for data storage, data management, statistical analysis and statistical association between various data sources. Organizations can now store large datasets in the Hadoop Distributed File System (HDFS) and use real-time analytics software to access the data directly from HDFS, avoiding data migration headaches. Myrna, developed by Ben Langmead, Kasper Hansen and Jeff Leek (Johns Hopkins University), is one such tool: it calculates differential gene expression in RNA-seq datasets, either in the cloud (Amazon Elastic MapReduce) or on Hadoop clusters.

Innovative companies like Intel Corporation are interested in collaborating with key partners in the life sciences to accelerate such work. Intel wants to offer businesses an open enterprise Hadoop platform for next generation analytics and life sciences: the Intel® Distribution for Apache Hadoop Software, which is optimized for Intel Xeon processors and provides better manageability and performance.

In this paper, we demonstrate how to install and configure Myrna and its required components (Bowtie, R/Bioconductor and the SRA Toolkit) within the Intel® Distribution for Apache Hadoop Software. Read the paper.

What is your experience with big data and Hadoop in life sciences? Do you think Hadoop is ready to become the life sciences research and analytics platform of the future?