Dr. Michael McManus, speaking at the BioData World Congress, 2016
As you would expect, genomic sequencing has come a long way since its origins in the 1970s. Back then, researchers had only a small number of base pairs to analyse, whereas today, equipped with far more advanced technology, vastly reduced costs, and a heightened practical awareness, the healthcare industry can finally plan to tackle sequencing on a macro level: sequencing everyone on the planet.
This scale of sequencing is now seen as vitally important because it provides a snapshot of the genomic variation and evolutionary development of the human race. It will also improve patient outcomes globally, and go a long way towards reducing the inherent ethnic bias in the current reference genome.
Our improved understanding will allow us to identify key trends within our own genome that point towards conditions affected by our genetic makeup, like sickle cell anaemia, diabetes and even some forms of cancer. Sequencing everyone comes with costs, however, ranging from the financial to the ethical, and these must be considered and addressed if we are to be successful in our research.
My presentation slides from the recent BioData World Congress at the Wellcome Genome Campus near Cambridge, UK, are available to view in full via the link below, but it’s important to provide an overview of their content for ease of reference, and to complement the infographics which bring to life some of the more eye-opening facts surrounding genomic sequencing.
Challenges and considerations
To put it mildly, sequencing everyone comes with a significant set of challenges. The amount of data it would generate borders on the unimaginable, so the question of how to store that data becomes critical.
The data produced by sequencing the genome of every person on the planet would require 7.3 zettabytes of storage. That’s an almost incomprehensible size, so let’s try this: if each byte of that data were as thick as a $1 bill, the stack would extend all the way from Earth to the star Lambda Andromedae A, 84.2 light-years away. (That may still be fairly difficult to imagine, and you must ignore the laws of physics!)
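For readers who want to check the comparison, the arithmetic can be sketched in a few lines of Python. The 7.3 zettabyte total comes from the text; the bill thickness (0.0043 inches) and the world population (roughly 7.3 billion) are assumptions added here for illustration, so treat the output as a rough sanity check rather than the exact calculation behind the slides.

```python
# Back-of-the-envelope check of the dollar-bill comparison.
# Only TOTAL_BYTES comes from the article; the other constants are assumptions.

ZETTABYTE = 10**21                  # bytes, decimal (SI) definition
TOTAL_BYTES = 7.3 * ZETTABYTE       # storage figure quoted in the text

BILL_THICKNESS_M = 0.0043 * 0.0254  # assumption: 0.0043 inches per note, in metres
LIGHT_YEAR_M = 9.4607304725808e15   # metres in one light-year
WORLD_POPULATION = 7.3e9            # assumption: approximate 2016 population


def stack_height_light_years(n_bytes: float) -> float:
    """Height of a stack with one bill per byte, in light-years."""
    return n_bytes * BILL_THICKNESS_M / LIGHT_YEAR_M


def terabytes_per_person(n_bytes: float, population: float) -> float:
    """Average storage per person, in decimal terabytes."""
    return n_bytes / population / 10**12


if __name__ == "__main__":
    print(f"Stack height: {stack_height_light_years(TOTAL_BYTES):.1f} light-years")
    print(f"Per person:   {terabytes_per_person(TOTAL_BYTES, WORLD_POPULATION):.2f} TB")
```

Under these assumptions the stack comes out at a little over 84 light-years, in line with the figure above, and the total works out to about one terabyte per person.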
And storage isn’t the only roadblock. Handling this amount of data and turning it into actionable information requires appropriate interpretation and analytics, energy capabilities, and investment.
And, of course, there are also legal and ethical considerations which must be addressed, not least the questions of who owns the data, whether it will be secure, and how it will be used.
Big data industry potential
Genomics will undoubtedly be one of the domains dominating the big data discussion by 2025. This is down, in large part, to the growth since 2010 in the cumulative number of human genomes sequenced.
Some sequencing projects today have hundreds of thousands of participants, such as the UK genome project and National Geographic’s Genographic Project; some involve millions, including Kuwait’s country-wide sequencing; and others are projecting tens of millions and more: the Chinese PMI, for example, aims to sequence 100 million people by 2030.
In the genomics workflow, these programmes generate FASTQ files upon sequencing, which can be very large and compound the storage problem. Deleting these files and keeping only the BAM and VCF files created in subsequent stages would greatly reduce storage requirements. CPUs are also becoming much more power-efficient, and emerging sequencer technology will reduce compute and storage needs further. But storage remains a significant challenge.
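The effect of dropping FASTQ files can be estimated with a small helper like the one below. It is only a sketch: the function name is hypothetical, and the per-genome file sizes in the usage example are placeholder values chosen for illustration, not measurements from any sequencing programme.

```python
def storage_after_dropping_fastq(fastq_gb: float, bam_gb: float, vcf_gb: float):
    """Return (remaining_gb, fraction_saved) if FASTQ is deleted and BAM + VCF are kept.

    All three sizes are per-genome figures supplied by the caller.
    """
    before = fastq_gb + bam_gb + vcf_gb
    if before <= 0:
        raise ValueError("total storage must be positive")
    after = bam_gb + vcf_gb              # only the downstream files are retained
    fraction_saved = (before - after) / before
    return after, fraction_saved


if __name__ == "__main__":
    # Placeholder sizes in GB, purely illustrative.
    remaining, saved = storage_after_dropping_fastq(fastq_gb=100, bam_gb=50, vcf_gb=1)
    print(f"Remaining: {remaining} GB per genome, saving {saved:.0%} of the original")
```

Plugging in a programme's real file sizes would show how much of its footprint comes from the raw FASTQ output.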
The accelerating pace of innovation has increased collaboration in healthcare technology.
Intel’s Scalable System Framework comes with four key end-user benefits: breakthrough performance; a standards-based design; broad vendor availability; and the ability to build a common infrastructure across emerging workloads.
Cooperating with SAP and Dell in precision medicine, Intel is actively working to create a robust reference architecture optimised for the SAP Connected Health platform and SAP HANA. Built to handle the variety, velocity, and volume of big data analytics, the architecture will deliver outstanding scalability, performance, and reliability for high-impact health analytics in enterprise data centres, on servers such as the Dell PowerEdge R930.
This partnership exemplifies the type of innovation in technology needed to overcome some of the challenges facing global sequencing – but with such valuable healthcare benefits as a result of sequencing, the efforts are sure to be worth it.