SC14: Understanding Gene Expression through Machine Learning

This guest blog is by Sanchit Misra, Research Scientist, Intel Labs, Parallel Computing Lab, who will be presenting a paper by Intel and Georgia Tech this week at SC14.

Did you know that the process of winemaking relies on yeast optimizing itself for survival? When we put yeast in a sugar solution, it turns on genes that produce the enzymes that convert sugar molecules to alcohol. The yeast cell makes a living from this process (by gaining energy to multiply) and humans get wine.

This process of turning on a gene is called expression. The genes that an organism can express are all encoded in its DNA. In multi-cellular organisms like humans, the DNA of each cell is the same, but cells in different parts of the body express different genes to perform the corresponding functions. A gene also interacts with several other genes during the execution of a biological process. These interactions, modeled mathematically using “gene networks,” are not only essential in developing a holistic understanding of an organism’s biological processes, they are invaluable in formulating hypotheses to further the understanding of numerous interesting biological pathways, thus playing a fundamental role in accelerating the pace and diminishing the costs of new biological discoveries. This is the subject of a paper presented at the SC14 by Intel Labs and Georgia Tech.

Owing to the importance of the problem, numerous mathematical modeling techniques have been developed to learn the structure of gene networks. There appears, not surprisingly, to be a correlation between the quality of learned gene networks and the computational burden imposed by the underlying mathematical models. A gene network based on Bayesian networks is of very high quality but requires a lot of computation to construct. To understand Bayesian networks, consider the following example.

A patient visits a doctor for diagnosis with symptoms A, B and C. The doctor says that there is a high probability that the patient is suffering from ailments X or Y and recommends further tests to zero in on one of them. What the doctor does is an example of probabilistic inference, in which the probability that a variable has a certain value is estimated based on the values of other related variables. Inference that is based on Bayes’ laws of probability is called Bayesian inference. The relationships between variables can be stored in the form of a Bayesian network. Bayesian networks are used in a wide range of fields including science, engineering, philosophy, medicine, law, finance, etc. In the case of gene networks, the variables are genes and the corresponding Bayesian network models for each gene what other genes are related to it and what is the probability of expression of the gene given the expression values of the related genes.

Through a collaboration between Intel Labs’ Parallel Computing Lab and researchers at Georgia Tech and IIT Bombay, we now have the first ever genome-scale approach for construction of gene networks using Bayesian network structure learning. We have demonstrated this capability by constructing the whole-genome network of the plant Arabidopsis thaliana from over 168.5 million gene expression values by computing a mathematical function 7.3 trillion times with different inputs. For this, we collected a total of 11,760 Arabidopsis gene expression datasets (from NASC, AtGenExpress and GEO public repositories). A problem of this scale would have consumed about six months using the state-of-the-art solution. We can now solve the same problem in less than 3 minutes!

To achieve this, we have not only scaled the problem to a much bigger machine - 1.5 million cores of Tianhe-2 supercomputer with 28 PFLOP/s peak performance, we also applied algorithm-level innovations including avoiding redundant computation, a novel parallel work decomposition technique and dynamic task distribution. We also made implementation optimizations to extract maximum performance out of the underlying machine.

sanchit image3.jpg

sanchit image 2.jpg

  • (Top)    Root Development subnetwork                 (Bottom) Cold Stress subnetwork

Using our software, we generated gene regulatory networks for several datasets - subsets of the Arabidopsis dataset - and validated them using known knowledge from the TAIR (The Arabidopsis Information Resource) database. As a demonstration of the validity and how genome-scale networks can be used to aid biological research, we conducted the following experiment. We picked the genes that are known to be involved in root development and cold stress and randomly picked a subset of those genes (red nodes in the above figures). We took the whole-genome network generated by our software for Arabidopsis and extracted subnetworks that contain our randomly picked subset of genes and all the other genes that are connected to them. The extracted subnetworks contain a rich presence of other genes known to be in the respective pathways (green nodes) and closely associated pathways (blue nodes), serving as a validation test. The nodes shown in yellow are genes with no known function. Their presence in the root development subnetwork indicates they might function in the same pathway. The biologists at Georgia Tech are performing experiments to see if the genes corresponding to yellow nodes are indeed involved in root development. Similar experiments are being conducted for several other biological processes.

Arabidopsis is a model plant for which NSF had launched a 10 year initiative in 2000 to find the functions of all of its genes, yet the functions of 40 percent of its genes are still not known. This method can help accelerate the discovery of the functions of the rest of the genes. Moreover, it can easily be scaled to other species including human beings. The understanding of how genes function and interact with each other in a broad variety of organisms can pave the way for new medicines and treatments. Moreover, we can also compare the gene networks across organisms to enhance our understanding of the similarities and differences between them ultimately aiding in a deeper understanding of evolution.

What questions do you have?