Dr. Peter White is the developer and inventor of the “Churchill” platform, and serves as GenomeNext’s principal genomic scientist and technical advisor.
Dr. White is a principal investigator in the Center for Microbial Pathogenesis at The Research Institute at Nationwide Children’s Hospital and an Assistant Professor of Pediatrics at The Ohio State University. He is also Director of Molecular Bioinformatics, serving on the research computing executive governance committee, and Director of the Biomedical Genomics Core, a nationally recognized microarray and next-gen sequencing facility that helps numerous investigators design, perform and analyze genomics research. His research program focuses on molecular bioinformatics and high-performance computing solutions for “big data”, including discovery of disease-associated human genetic variation and understanding the molecular mechanism of transcriptional regulation in both eukaryotes and prokaryotes.
We recently caught up with Dr. White to talk about population scale genomics and the 1000 Genomes Project.
Intel: What is population scale genomics?
White: Population scale genomics refers to the large-scale comparison of sequenced DNA datasets of a large population sample. While there is no minimum, it generally refers to the comparison of sequenced DNA samples from hundreds, even thousands, of individuals with a disease or from a sampling of populations around the world to learn about genetic diversity within specific populations.
The human genome is composed of approximately 3 billion DNA base pairs. The first human genome sequence was completed in 2006, the result of an international effort that took a total of 15 years to complete. Today, with advances in DNA sequencing technology, it is possible to sequence as many as 50 genomes per day, making it possible to study genomics on a population scale.
Intel: Why does population scale genomics matter?
White: Population scale genomics will enable researchers to understand the genetic origins of disease. Only by studying the genomes of thousands of individuals will we gain insight into the role of genetics in diseases such as cancer, obesity and heart disease. The larger the sample size that can be analyzed accurately, the better researchers can understand the role that genetics plays in a given disease, and from that we will be able to better treat and prevent disease.
Intel: What was the first population scale genomic analysis?
White: The 1000 Genomes Project is an international research project that, through the efforts of a consortium of over 400 scientists and bioinformaticians, set out to establish a detailed catalogue of human genetic variation. This multimillion-dollar project was started in 2008, and sequencing of 2,504 individuals was completed in April 2013. The data analysis of the project was completed 18 months later, with the release of the final population variant frequencies in September 2014. The project resulted in the discovery of millions of new genetic variants and successfully produced the first global map of human genetic diversity.
Intel: Can analysis of future large population scale genomics studies be automated?
White: Yes. The team at GenomeNext and Nationwide Children’s Hospital was challenged to analyze a complete population dataset compiled by the 1000 Genomes Consortium in one week as part of the Intel Heads In the Clouds Challenge on Amazon Web Services (AWS). The 1000 Genomes Project is the largest publicly available dataset of genomic sequences, sampled from 2,504 individuals from 26 populations around the world.
All 5,008 samples (2,504 whole genome sequences and 2,504 high-depth exome sequences) were analyzed on GenomeNext’s Platform, leveraging its proprietary genomic sequence analysis technology (recently published in Genome Biology) operating on the AWS Cloud powered by Intel processors. The entire automated analysis process was completed in one week, with as many as 1,000 genome samples being completed per day, generating close to 100 TB of processed result files. The team found a high degree of correlation with the original analysis performed by the 1000 Genomes Consortium, with additional variants potentially discovered during the analysis performed on GenomeNext’s Platform.
Intel: What does GenomeNext’s population scale accomplishment mean?
White: GenomeNext believes this is the fastest, most accurate and most reproducible analysis of a dataset of this magnitude. One benefit of this work is that it will enable researchers and clinicians using population scale genomic data to distinguish common genetic variation, as discovered in this analysis, from rare, pathogenic, disease-causing variants. As population scale genomic studies become routine, GenomeNext provides a solution through which the enormous data burden of such studies can be managed, analysis can be automated and results can be shared with scientists globally through the cloud. Access to a growing and diverse repository of DNA sequence data, including the ability to integrate and analyze the data, is critical to accelerating the promise of precision medicine.
Our ultimate goals are to provide a global genomics platform, automate the bioinformatics workflow from sequencer to annotated results, provide a secure and regulatory compliant platform, dramatically reduce the analysis time and cost, and remove the barriers of population scale genomics.