By Aruna Kumar, HPC Solutions Architect Life Science, Intel
15,000 to 20,000 variants per exome (33 million bases) vs. 3 million single nucleotide polymorphisms per genome. HPC is a clearly welcome solution to deal with the computational and storage challenges of genomics at the crossroads of clinical deployment.
At the High Performance Computing User Forum held at Norfolk in mid-April, it was clear that the face of HPC is changing. The main theme was bioinformatics, a relative newcomer to the HPC user base. Bioinformatics, including high-throughput sequencing, has introduced computing to entire fields that have not utilized computing in the past. Just as in the social sciences, these fields appear to share a thirst for large amounts of data, in what is still largely a search for incidental findings, while simultaneously seeking architectural and algorithmic optimizations and usage-based abstractions. This is a unique challenge for HPC, and one that is testing HPC systems solutions.
What does this mean for the care of our health?
Health outcomes are increasingly tied to the real-time usage of vast amounts of both structured and unstructured data. Sequencing of the genome or targeted exome is distinguished by its breadth. Whereas clinical diagnostics such as blood work for renal failure, diabetes, or anemia are characterized by depth of testing, genomics is characterized by breadth of testing.
As aptly stated by Dr. Leslie G. Biesecker and Dr. Douglas R. Green in a 2014 New England Journal of Medicine paper, “The interrogation of variation in about 20,000 genes simultaneously can be a powerful and effective diagnostics method.”
However, it is amply clear from the work presented by Dr. Barbara Brandom, Director of Global Rare Diseases Patient Registry Data Repository (GRDR) at NIH, that the common data elements that need to be curated to improve therapeutic development and quality of life for many people with rare diseases are a relatively complex blend of structured and unstructured data.
The GRDR Common Data Elements table includes contact information, socio-demographic information, diagnosis, family history, birth and reproductive history, anthropometric information, patient-reported outcome, medications/devices/health services, clinical research and biospecimen, and communication preferences.
Now to some sizing of data and compute needs, to appropriately scale the problem from a clinical perspective. Current sequencing on Illumina HiSeq X systems samples at 30x. A three-day sequencing run generates 46 thousand files, adding up to 1.3 terabytes (TB) of data. This data is then converted into the variant calls referred to in the quotation earlier in the article. The analysis up to the point of generating variant call files accumulates an additional 0.5 TB of data per human genome. For clinicians and physicians to identify stratified subpopulation segments with specific variants, it is often necessary to sequence complex targeted regions at much higher sampling rates and with longer read lengths than current 30x sampling provides. This will undoubtedly exacerbate an already significant challenge.
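To make these figures concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (1.3 TB and 46 thousand files per three-day run, plus 0.5 TB of analysis output per genome). Everything else is an assumption made for illustration: decimal terabytes, back-to-back runs all year, and a hypothetical `genomes_per_run` parameter, since the number of genomes per run is not stated here.

```python
# Back-of-the-envelope storage sizing. Illustrative only, not vendor guidance.
TB = 10**12  # assumption: decimal terabytes

RAW_TB_PER_RUN = 1.3           # raw data per three-day run (quoted above)
FILES_PER_RUN = 46_000         # files per run (quoted above)
ANALYSIS_TB_PER_GENOME = 0.5   # extra data up to variant call files (quoted above)
RUN_DAYS = 3                   # length of one sequencing run (quoted above)


def average_file_size_mb(raw_tb=RAW_TB_PER_RUN, files=FILES_PER_RUN):
    """Average size of one raw output file, in decimal megabytes."""
    return raw_tb * TB / files / 10**6


def runs_per_year(instruments=1, run_days=RUN_DAYS):
    """Upper bound on runs per year, assuming back-to-back runs (assumption)."""
    return instruments * (365 // run_days)


def storage_tb(runs, genomes_per_run=1):
    """Raw plus analysis storage in TB for a number of runs.

    genomes_per_run is a hypothetical parameter; adjust it to the
    actual instrument configuration.
    """
    raw = runs * RAW_TB_PER_RUN
    analysis = runs * genomes_per_run * ANALYSIS_TB_PER_GENOME
    return raw + analysis


if __name__ == "__main__":
    runs = runs_per_year()
    print(f"average raw file size: {average_file_size_mb():.1f} MB")
    print(f"runs per instrument per year: {runs}")
    print(f"storage per instrument per year: {storage_tb(runs):.1f} TB")
```

Under these assumptions, a single instrument already needs a few hundred terabytes per year, before any increase in sampling rate or read length.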
So how do Intel’s solutions fit in?
Intel Genomics Solutions, together with the Intel Cluster Ready program, are providing much-needed sizing guidance so that clinicians and their associated IT data centers can deliver personalized medicine efficiently and scale with growing needs.
From a compute perspective, the broad need is to handle the volume of genomics data in real time to generate alignment mapping files. These mapping files contain the entire sequence information, along with quality and position information, and result from a largely single-threaded process of converting FASTQ files into alignment mapping files. The alignment mapping files are generated as text files and then converted to a more compact binary format known as BAM (binary alignment map). The difference between a reference genome and the aligned sample file (BAM) is what a variant call file (VCF) contains. Variants come in many forms, although the most common is a difference in a single base, or nucleotide, at a corresponding position. This is known as a single nucleotide polymorphism (SNP). The process of research and diagnostics involves the generation and visualization of BAM files, SNPs, and entire VCF files.
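The idea that a variant call file records where an aligned sample differs from the reference can be illustrated with a toy Python comparison. This is a deliberately simplified sketch, not how production variant callers work: real tools operate on BAM files and weigh read depth and base quality. The function names and the equal-length, gap-free input are assumptions made for illustration, and the output is VCF-style rather than a complete, valid VCF file.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    chrom: str
    pos: int   # 1-based position, following the VCF convention
    ref: str   # base in the reference
    alt: str   # base observed in the sample


def find_snps(chrom: str, reference: str, sample: str) -> List[Snp]:
    """Report single-base differences between two already-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("toy example expects equal-length, gap-free sequences")
    snps = []
    for i, (r, s) in enumerate(zip(reference.upper(), sample.upper())):
        # Only count clean base-to-base mismatches; skip N and other symbols.
        if r != s and r in "ACGT" and s in "ACGT":
            snps.append(Snp(chrom, i + 1, r, s))
    return snps


def to_vcf_lines(snps: List[Snp]) -> List[str]:
    """Format SNPs as tab-separated, VCF-style lines (column header plus records)."""
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
    body = [f"{s.chrom}\t{s.pos}\t.\t{s.ref}\t{s.alt}\t.\t.\t." for s in snps]
    return [header] + body


if __name__ == "__main__":
    for line in to_vcf_lines(find_snps("chr1", "ACGTACGT", "ACGTTCGT")):
        print(line)
```

Even in this toy form, the record for each variant is tiny compared with the sequences it was derived from, which is why VCF files are far smaller than the BAM files they summarize.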
Given the lack of penetrance of incidental findings across a large number of diseases, the final step to impacting patient outcomes, which draws on unstructured data and metadata, requires the use of parallel file systems such as Lustre and object storage technologies that provide the ability to scale out and support personalized medicine use cases.
More details on how Intel Genomics Solutions aid scale-out to directly impact personalized medicine in a clinical environment will follow in a future blog!
For more resources, you can find out about Intel’s role in Health and Life Sciences here, learn more about Intel in HPC at intel.com/go/hpc, or learn more about Intel’s boards and systems products at http://www.intelserveredge.com/