As SC14 approaches, we have invited industry experts to share their views on high performance computing and life sciences. Below is a guest post from Ari E. Berman, Ph.D., Director of Government Services and Principal Investigator at BioTeam, Inc. Ari will be sharing his thoughts on high performance infrastructure and high speed data transfer during SC14 at the Intel booth (#1315) on Wednesday, Nov. 19, at 2 p.m. in the Intel Community Hub and at 3 p.m. in the Intel Theater.
There is a ton of hype these days about Big Data, both in what the term actually means, and what the implications are for reaching the point of discovery in all that data.
The biggest issue right now is the computational infrastructure needed to get to that mythical Big Data discovery place everyone talks about. Personally, I hate the term Big Data. The term “big” is very subjective and in the eye of the beholder. It might mean 3PB (petabytes) of data to one person, or 10GB (gigabytes) to someone else.
From my perspective, the thing that everyone is really talking about with Big Data is the ability to take the sum total of data that’s out there for any particular subject, pool it together, and perform a meta-analysis on it to more accurately create a model that can lead to some cool discovery that could change the way we understand some topic. Those meta-analyses are truly difficult and, when you’re talking about petascale data, require serious amounts of computational infrastructure that is tuned and optimized (also known as converged) for your data workflows. Without properly converged infrastructure, most people will spend all of their time just figuring out how to store and process the data, without ever reaching any conclusions.
Which brings us to life sciences. Until recently, life sciences and biomedical research could really be done using Excel and simple computational algorithms. Laboratory instrumentation really didn’t create that much data at a time, and it could be managed with simple, desktop-class computers and everyday computational methods. Sure, the occasional group was able to create enough data that required some mathematical modeling or advanced statistical analysis or even some HPC, and molecular simulations have always required a lot of computational power. But, in the last decade or so, the pace of advancement of laboratory equipment has left large swath of overwhelmed biomedical research scientists in the wake of the amount of data being produced.
The decreased cost and increased speed of laboratory equipment, such as next-generation sequencers (NGS) and high-throughput high-resolution imaging systems, has forced researchers to become very computationally savvy very quickly. It now takes rather sophisticated HPC resources, parallel storage systems, and ultra-high speed networks to process the analytics workflows in life sciences. And, to complicate matters, these newer laboratory techniques are paving the way towards the realization of personalized medicine, which carries the same computational burden combined with the tight and highly subjective federal restrictions surrounding the privacy of personal health information (PHI). Overcoming these challenges has been difficult, but very innovative organizations have begun to do just that.
I thought it might be useful to very briefly discuss the three major trends we see having a positive effect on life sciences research:
1. Science DMZs: There is a rather new movement towards the implementation of specialized research-only networks that prioritize fast and efficient data flow over security (while still maintaining security), also known as the Science DMZ model (http://fasterdata.es.net). These implementations are making it easier for scientists to get around tight enterprise networking restrictions without blowing the security policies of their organizations so that scientists can move their data effectively without pissing off their compliance officers.
2. Hybrid Compute/Storage Models: There is a huge push to move towards cloud-based infrastructure, but organizations are realizing that too much persistent cloud infrastructure can be more costly in the long term than local compute. The answer is the implementation of small local compute infrastructures to handle the really hard problems and the persistent services, hybridized with public cloud infrastructures that are orchestrated to be automatically brought up when needed, and torn down when not needed; all managed by a software layer that sits in front of the backend systems. This model looks promising as the most cost-effective and flexible method that balances local hardware life-cycle issues with support personnel, as well as the dynamic needs of scientists.
3. Commodity HPC/Storage: The biggest trend in life sciences research is the push towards the use of low-cost, commodity, white box infrastructures for research needs. Life sciences has not reached the sophistication level that requires true capability supercomputing (for the most part), thus, well-engineered capacity systems built from white-box vendors provide very effective computational and storage platforms for scientists to use for their research. This approach carries a higher support burden for the organization because many of the systems don’t come pre-built or supported overall, and thus require in-house expertise that can be hard to find and expensive to retain. But, the cost balance of the support vs. the lifecycle management is worth it to most organizations.
Biomedical scientific research is the latest in the string of scientific disciplines that require very creative solutions to their data generation problems. We are at the stage now where most researchers spend a lot of their time just trying to figure out what to do with their data in the first place, rather than getting answers. However, I feel that the field is at an inflection point where discovery will start pouring out as the availability of very powerful commodity systems and reference architectures come to bear on the market. The key for life sciences HPC is the balance between effectiveness and affordability due to a significant lack of funding in the space right now, which is likely to get worse before it gets better. But, scientists are resourceful and persistent; they will usually find a way to discover because they are driven to improve the quality of life for humankind and to make personalized medicine a reality in the 21st century.
What questions about HPC do you have?