Big Data in Life Sciences: The Cost of Not Being Prepared

For years, the term “Big Data” has been thrown around the healthcare and life science research fields as if it were a fashion that was trendy to talk about. In some manner, everyone knew the day was coming when the amount of data being generated would outpace our ability to process it unless major steps were taken immediately to stave off that eventuality. But many IT organizations chose to treat the warnings of impending overload the way Y2K was viewed in its aftermath: as a false threat, with no real issue to prepare for in advance. That was five years ago, and the time for big data has come.

The pace at which life science-related data can be produced has increased at a rate that far exceeds Moore’s Law, and it has never been cheaper or easier for scientists and clinical researchers to acquire data in vast quantities. Many research computing environments have found themselves in the middle of a data storm, in which researchers and healthcare professionals need enormous amounts of storage and need to analyze the stored data with alacrity so that discoveries can be made and cures for disease become possible. Because their organizations were not prepared, researchers have found themselves in a research computing desert with nowhere to go, and the weight of that data threatening to collapse onto them.

Storage and Compute

The net result of IT calling the scientists’ assumed bluff is that IT organizations are unprepared to provide the sheer amount of storage the research requires. Even when they can provide that storage, they don’t have enough compute power to help researchers get through the data (so that it can be archived), causing a backlog of stored data that compounds as more and more data pours into the infrastructure. To make matters worse, scientists are left with the option of moving the data elsewhere for processing and analysis. Sometimes well-funded laboratories purchase their own HPC equipment, sometimes cloud-based compute and storage is purchased, and sometimes researchers find a collaborator with access to an HPC system that they can use to help chunk through the backlog. Unfortunately, these solutions create another barrier: how to move that much data from one point to another. Most organizations don’t have Internet connections much above 1 Gbps for the entire organization, while most of these datasets are many terabytes (TB) in size and can take days or weeks to move over those connections, even at saturation (which would effectively shut down the Internet connection for the organization). So, being the resourceful folks they are, scientists take to physically shipping hard drives to their collaborators to move their data, which has its own complex set of issues to contend with.
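The transfer-time problem described above can be checked with rough arithmetic. The sketch below is only an illustration: the function name `transfer_days`, the dataset sizes, and the 50% utilization figure are assumptions chosen for the example, not numbers from any particular organization. It uses decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and ignores protocol overhead, so real transfers would be somewhat slower.

```python
def transfer_days(dataset_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate the days needed to move a dataset over a network link.

    dataset_tb  -- dataset size in terabytes (decimal: 1 TB = 1e12 bytes)
    link_gbps   -- nominal link speed in gigabits per second
    utilization -- fraction of the link the transfer may use (0 < u <= 1);
                   1.0 means the link is fully saturated by this transfer
    """
    if dataset_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("size must be >= 0, speed > 0, utilization in (0, 1]")
    total_bits = dataset_tb * 1e12 * 8                     # bytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * utilization
    seconds = total_bits / usable_bits_per_second
    return seconds / 86_400                                # seconds per day


if __name__ == "__main__":
    # Hypothetical dataset sizes on the 1 Gbps link mentioned in the text.
    for size_tb in (10, 100, 500):
        for utilization in (1.0, 0.5):
            days = transfer_days(size_tb, 1.0, utilization)
            print(f"{size_tb:>4} TB at {utilization:.0%} of 1 Gbps: {days:6.1f} days")
```

Under these assumptions, a 100 TB dataset needs a little over 9 days on a fully saturated 1 Gbps link and about 18.5 days if the transfer may use only half of it, which is why shipping hard drives starts to look attractive.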

The issues that have arisen from the lack of preparedness at research- and healthcare-based organizations are so profound that many of these organizations are finding it difficult to attract and hire the talent they need to accomplish their missions. New researchers, and those on the forefront of laboratory technologies, largely understand their own computational requirements. If a hiring organization can’t meet those requirements, they look elsewhere.

Today and Tomorrow

As a result, these organizations have finally started to make the proper investments in research computing infrastructure, and the problem is slowly starting to get better. But many of them are funding only what they must to get today’s jobs done. This approach is a bit like expanding a highway in a busy city to meet the current population’s needs rather than building it for 10 years from now; by the time the highway is completed, it won’t make a difference, because the population will have already exceeded its capacity. Building this stuff the correct way for an unpredictable point in the future is scary, and quite expensive, but the alternative is the likely failure of the organization to meet its mission. Research computing is now a reality in life science and healthcare research, and not investing will only slow things down and cost these organizations much more in the future.

So, if this situation describes your organization, encourage its leaders to invest now in technologies for the five-years-from-now timeframe. Ask them to think big and to think strategically, instead of putting tactical bandages on the problems at hand. If we can get most organizations to invest in the needed technologies, scientists will be able to stop worrying about where their data goes and get back to work, which will result in an overall improvement in our health-span as a society.

What questions do you have?