Systems Biology
by Robert Jones
Think of the DNA in a cell, the genome, as a set of blueprints. The proteins are the molecular machines encoded in that genome. Their interactions with DNA, other proteins, water, and all of the other molecules constitute the production lines in the factory of the living cell. Energy production, protein synthesis, signaling, and a hundred other processes all involve an exquisite choreography of this molecular machinery. Exploring this big picture of cellular function at the fine resolution of the molecules themselves is what systems biology is all about.
The grand vision is to integrate information from all of the remarkable sources that we have available today to explore ever more complex aspects of biology. It is only by grasping the entire molecular complexity of a process that we can hope to understand the function of the brain, the development of an embryo, and the changes that take place in a disease like breast cancer.
That’s the grand vision. The reality is a little tamer. A lot of effort today is spent characterizing the proteins in the cell and figuring out which ones interact with each other. Other groups are using microarray-based gene expression experiments to show how sets of genes are turned on and off in response to a stimulus. And some groups try to integrate all of the data to produce the “big picture” that everyone wants to see.
I’ll introduce what I think of as the five components of systems biology, and then describe a hands-on example that lets you explore a protein network. I will finish up with a set of resources that you can use to delve further into this emerging field of study.
You can’t just look at the sequence of a protein and tell what it interacts with; you need to do some work in the lab. Typically, this means identifying individual types of proteins in the cytoplasm, tagging them with some chemical “label,” and using that to track where in the cell they are located and to what other proteins they bind.
Scientists have been doing this for years with specific proteins, but the current efforts combine a variety of new biochemical techniques with automation to improve throughput. This new burst of progress has earned this field a new name, proteomics, to distinguish it from good old protein biochemistry. This echoes the emergence of genomics from molecular biology.
Mass spectrometry (MS) is a major tool in proteomics because of its ability to identify the components of complex mixtures of proteins. The technology behind mass spectrometry has made some amazing advances in the past few years, but the basic idea remains the same: you ionize the molecules and separate them on the basis of their mass-to-charge ratio (in effect, how much they weigh), with a resolution of a single atomic mass unit.
Some of the buzzwords to look out for in proteomics include:
MALDI-QTOF Mass Spectrometry (and many variants thereof)
Isotope-Coded Affinity Tag (ICAT) analysis
Yeast Two-Hybrid Screens
Green Fluorescent Protein (GFP) tagging
2-D Polyacrylamide Gel Electrophoresis (2-D PAGE)
The next step is to take all of these one-to-one protein interactions and build a graph, or network, that represents the entire set. Each node represents a protein, and each edge represents a known interaction. In principle, we can locate the proteins that we know are involved in a specific process and, from their interactions, perhaps discover additional proteins. We can define “pathways” or “cascades” of proteins that, for example, transmit a signal from the cell surface to the nucleus or that cooperate to construct a complex molecule.
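The idea of turning pairwise interactions into a graph can be sketched in a few lines of Python. This is only an illustration, not how any particular tool stores its data: the protein names (`P1` to `P6`), the interaction list, and the function names `build_network` and `candidate_partners` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical pairwise interactions; the protein names are placeholders,
# not real experimental data.
interactions = [
    ("P1", "P2"),
    ("P2", "P3"),
    ("P3", "P4"),
    ("P5", "P6"),
]


def build_network(pairs):
    """Return an undirected graph as a dict mapping protein -> set of partners."""
    network = defaultdict(set)
    for a, b in pairs:
        # An interaction is symmetric, so record the edge in both directions.
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def candidate_partners(network, known):
    """Return proteins that interact directly with the known set but are not in it."""
    known = set(known)
    found = set()
    for protein in known:
        found |= network.get(protein, set())
    return found - known


network = build_network(interactions)
# Suppose P1 and P2 are already known to take part in some process.
candidates = candidate_partners(network, ["P1", "P2"])
print(sorted(candidates))
```

Here the only protein that touches the known set without belonging to it is `P3`, which makes it a candidate for further study. Real data sets would also carry annotation on each edge, such as the experiment that detected the interaction.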
The problem is that most networks involve hundreds of proteins. Displaying all of these can result in a tangled mess that is uninterpretable. Here is a relatively simple network that shows protein interactions in yeast.
Figure: Yeast network displayed in Osprey
The display of complex graphs is not just an issue for systems biology, and algorithms from other fields are being brought to bear on our networks. It's an interesting mix of graph theory, visualization, and user interface design. We need a way to view the entire network that is comprehensible. Also, we hope to find a way to limit our view to specific subgraphs, hiding or collapsing the rest of the network when it is not relevant. And finally, we need to interact with individual nodes and edges to view any annotation associated with them. While the tools that I describe below are making significant advances, there is still a lot of work to be done before they become really useful.
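One simple way to limit the view to a subgraph is to keep only the nodes within a few interaction steps of a protein of interest. The sketch below does this with a breadth-first search. It assumes the network is held as a plain Python dict of sets; the node names and the functions `neighborhood` and `induced_subgraph` are hypothetical, and real visualization tools may well use different strategies.

```python
from collections import deque

# Hypothetical network: a chain A-B-C-D plus a separate pair E-F.
network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}


def neighborhood(network, seed, max_hops):
    """Return the set of nodes within max_hops edges of seed (breadth-first)."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # Do not expand beyond the hop limit.
        for partner in network.get(node, ()):
            if partner not in distance:
                distance[partner] = distance[node] + 1
                queue.append(partner)
    return set(distance)


def induced_subgraph(network, keep):
    """Keep only the chosen nodes and the edges that run between them."""
    return {node: network[node] & keep for node in keep if node in network}


nearby = neighborhood(network, "A", 2)
subgraph = induced_subgraph(network, nearby)
print(sorted(nearby))
```

With a limit of two hops from `A`, the view contains `A`, `B`, and `C`; node `D` and the unconnected pair `E`-`F` are hidden.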
Interaction networks are part of the puzzle. They show us the “circuit diagram.” We want to understand how the network operates and how it responds to changes in the inputs. For many processes we already have a lot of relevant information from “conventional” cell biology, microarray experiments, etc. What we’re working on is a way to integrate all of these data together, with the interaction network as one possible framework on which to display everything.
Most of the processes that we are interested in include some component of gene regulation as well as protein interactions. For example, the detection of a protein on the surface of a cell may trigger a cascade of protein interactions that results in one or more genes being expressed in the genome. This interplay of the “worlds” of proteins and DNA is perhaps the biggest challenge for data integration. Whenever a number of proteins interact to accomplish a specific process, chances are that some of the genes that encode them will be expressed in some coordinated manner. So it is reasonable to overlay microarray gene expression data on the protein interaction network and look for correlations.
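As a rough sketch of what overlaying expression data on the network might look like, the code below scores each interaction by the Pearson correlation of the two genes' expression profiles. The gene names, the expression values, and the function names are invented for illustration, and Pearson correlation is just one possible choice of measure.

```python
from math import sqrt

# Hypothetical expression levels across four conditions (made-up numbers).
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [4.0, 3.0, 2.0, 1.0],
}

# Hypothetical interactions between the proteins these genes encode.
edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G4")]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # A flat profile carries no correlation information.
    return cov / sqrt(var_x * var_y)


def edge_correlations(edges, expression):
    """Score each edge whose two genes both have expression data."""
    scores = {}
    for a, b in edges:
        if a in expression and b in expression:
            scores[(a, b)] = pearson(expression[a], expression[b])
    return scores


scores = edge_correlations(edges, expression)
for pair, r in sorted(scores.items()):
    print(pair, round(r, 2))
```

Edges with strongly positive or negative scores are the ones worth a closer look. The `G2`-`G4` edge is skipped because the example has no expression data for `G4`, which mirrors the incomplete coverage you would expect when combining two independent data sets.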
In the cell itself, most of the processes that we care about have been studied for many years in individual labs and the results have been written up in countless scientific papers. As a result, a major source of knowledge on interactions and regulation is the scientific literature.
It’s an interesting phenomenon that biology today dines at two tables. It draws heavily on the specific data from the databases, but still finds its interpretations of those data in the scientific literature. Automated extraction of the knowledge embedded in the literature remains a distant goal. But some success has been had in extracting specific terms, such as gene names, from papers and inferring interactions where pairs of gene names are frequently found together.
The problem lies in the diversity of the (mostly) English language text of these papers, and the false-positive rate for these inferred interactions is high. With the abstracts of more than 14 million papers available on the PubMed site at NIH, textual analysis is getting a lot of attention and is still an area where a bright idea can make a big impact.
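The co-occurrence approach described above can be sketched very simply: count how often each pair of gene names appears in the same abstract and keep the pairs that pass a threshold. Everything in this example is an assumption made for illustration, including the placeholder gene names, the toy "abstracts," and the threshold of two; it makes no attempt to deal with synonyms or ambiguous names, which is where much of the false-positive problem comes from.

```python
import re
from collections import Counter
from itertools import combinations

# Placeholder gene names and invented sentences standing in for abstracts.
gene_names = {"GENEA", "GENEB", "GENEC"}
abstracts = [
    "GENEA binds GENEB in the nucleus.",
    "Loss of GENEA alters GENEB and GENEC expression.",
    "GENEC is expressed in the cytoplasm.",
]


def cooccurrence_counts(abstracts, gene_names):
    """Count the abstracts in which each pair of gene names appears together."""
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        mentioned = sorted(tokens & gene_names)
        # Each unordered pair is counted once per abstract.
        for pair in combinations(mentioned, 2):
            counts[pair] += 1
    return counts


def inferred_interactions(counts, min_count):
    """Keep pairs seen together in at least min_count abstracts."""
    return sorted(pair for pair, n in counts.items() if n >= min_count)


counts = cooccurrence_counts(abstracts, gene_names)
print(inferred_interactions(counts, min_count=2))
```

Only the `GENEA`-`GENEB` pair appears together in two abstracts, so it is the only inferred interaction. Raising the threshold trades missed interactions for fewer false positives.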