Bioinformatics and Comparative Genomics
by Robert Jones
The complete DNA sequence of the Human Genome is a remarkable achievement for molecular biology and represents the work of many people in a number of large sequencing centers. Far from resting on their laurels, those centers have gone on to sequence the genomes of the mouse, rat, pufferfish, zebrafish, chicken, chimpanzee ... you name it, they're sequencing it.
Why this drive to sequence every animal in the zoo? Do we really care about the genetics of pufferfish? In isolation, not so much, but comparisons with the other genomes yield tremendous insights into the genes that are essential for life and those that define the species. They reveal the mechanisms of evolution and the hidden mechanisms of gene regulation.
This article will give a brief introduction to comparative genomics and will show you how to start exploring this treasure trove of data.
A Tale of Two Genomes
Geographic maps are a useful analogy for how we study genomes. If you were given a detailed map of London, you could learn a lot about what defines a large cosmopolitan city. You would see a large number of apartments, shops, and restaurants and might reasonably conclude that these are essential for life in the city. But you could not assess the relative importance of unique features like Buckingham Palace or the Brick Lane street market.
Things would be clearer if you were also given a detailed map of Paris. That too has apartments, shops, and restaurants, confirming your earlier hypothesis. It also has street markets, so perhaps those are an important, albeit secondary, aspect of city life. In contrast, Paris has no "active" royal palaces. Why not? One interpretation might be that Buckingham Palace is an important feature that distinguishes London from other cities. Another might be that a royal family has no function whatsoever in a modern society and survives in London merely as an evolutionary remnant.
It's the same with the human genome. Analysis of the sequence by itself has yielded a vast amount of information. But there are many regions to which we are unable to assign any features. Other regions clearly represent genes but we have no idea what roles the encoded proteins perform in the cell. Some of these will perform vital functions, whereas others will be the genetic equivalents of the floppy drive: once-important functions that human evolution has rendered obsolete.
Comparing the sequence to a second genome can answer many of these questions. We can compare one with the other, locate conserved sequence segments and assess their significance. The more genomes we have, the more confident we can become of our assignments and the higher the "resolution" at which we can examine the subtleties.
In choosing the next species to sequence, after the human, one might be tempted to pick a close relative like the chimpanzee. But we can learn more from a distant relation like the mouse, because chimpanzee and human DNA are so similar that nearly everything appears conserved. As far as biology is concerned, mouse and human are not that different. We both have four limbs and two eyes. We pee, poop, fornicate, and enjoy a nice piece of cheese. The genes responsible for these common structures and functions should be highly conserved and we might expect them to stand out against a background of dissimilar sequences. As we shall see in a moment, that is exactly what happens.
The DNA sequence is the most complete representation of a genome that we have available, and it is the form of the data that we actually compare and align, using the BLAST software tools, for example. But we interpret those comparisons in several different ways, which you can think of as multiple resolutions.
First is what we might think of as a medium-resolution view. Take all the known genes from one genome and find their matching genes, if they exist, in the other genome. This gives you a broad picture of which genes have been conserved between species and which are unique.
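One common way to pair up genes between two genomes is the "reciprocal best hit" rule: two genes are matched only if each is the other's highest-scoring partner. The sketch below illustrates that idea in Python. It is not necessarily how any particular genome project does it, and the gene names and similarity scores are invented for illustration; in practice the scores would come from a comparison tool such as BLAST.

```python
# Hypothetical similarity scores between human and mouse genes.
# Gene names and numbers are invented for illustration only.
scores = {
    ("humA", "musA"): 950,
    ("humA", "musB"): 120,
    ("humB", "musB"): 870,
    ("humB", "musA"): 100,
    ("humC", "musB"): 300,
}

def best_hits(scores, query_index):
    """Return the best-scoring partner for each gene on one side of the pairs."""
    best = {}
    for pair, score in scores.items():
        query = pair[query_index]
        target = pair[1 - query_index]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def reciprocal_best_hits(scores):
    """Keep only pairs where each gene is the other's best match."""
    forward = best_hits(scores, 0)   # human -> mouse
    backward = best_hits(scores, 1)  # mouse -> human
    return sorted((h, m) for h, m in forward.items() if backward.get(m) == h)

matches = reciprocal_best_hits(scores)
# Human genes left without a confident mouse counterpart.
unmatched = sorted({h for h, _ in scores} - {h for h, _ in matches})
print(matches)
print(unmatched)
```

Here humC does hit a mouse gene, but that mouse gene prefers humB, so humC is reported as unmatched rather than forced into a pairing.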
For each conserved gene, the high-resolution view is the detailed sequence alignment. Translated into the protein sequence, this shows which parts of the protein, encoded by the gene, have been conserved and are, by implication, important for structure and function.
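To make the high-resolution view concrete, here is a minimal Python sketch that slides a window along two protein sequences that have already been aligned and reports the percent identity in each window. The two fragments are invented, the window size and the 80 percent cutoff are arbitrary choices for this example, and a real analysis would start from an alignment produced by a dedicated tool.

```python
def window_identity(seq_a, seq_b, window=5):
    """Percent identity in each sliding window of two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    identities = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        # A gap character never counts as a match.
        same = sum(1 for a, b in pairs if a == b and a != "-")
        identities.append(100.0 * same / window)
    return identities

# Invented, pre-aligned protein fragments; "-" marks an alignment gap.
human = "MKTAYIAKQRGDS-LLE"
mouse = "MKTAYIAKQRNETPLIE"

profile = window_identity(human, mouse, window=5)
# Window start positions where at least 80% of residues are identical.
conserved = [i for i, pct in enumerate(profile) if pct >= 80.0]
print(profile)
print(conserved)
```

In this toy pair the first ten residues are identical and the rest diverge, so the high-identity windows cluster at the start of the sequence, which is the kind of signal that hints at a functionally important region.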
Here we need to introduce the way that genes in "higher" organisms are arranged. Instead of being a single block of sequence that is translated into a protein, as is the case with bacteria, our genes and those of mice, fruit fly, etc. are split into blocks called exons. These are arranged in linear order on the genome with spacer sequences, called introns, in between. The introns can be, and usually are, much longer than the exons. When the gene is turned on, the entire region (introns and exons) is copied into messenger RNA and then the introns are spliced out.
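The splicing step can be sketched in a few lines of Python. The toy gene below is invented, exon coordinates are given as half-open (start, end) positions, and the sequence is kept in DNA letters for simplicity (real messenger RNA uses U in place of T).

```python
def splice(genomic, exons):
    """Join the exon segments of a gene, dropping the introns in between."""
    return "".join(genomic[start:end] for start, end in sorted(exons))

def intron_lengths(exons):
    """Length of each spacer between consecutive exons."""
    ordered = sorted(exons)
    return [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]

# Invented toy gene: exons in upper case, introns in lower case for readability.
genomic = "ATGGCC" + "gtaagtttcag" + "GATTAC" + "gtgagcag" + "TGGTAA"
exons = [(0, 6), (17, 23), (31, 37)]  # half-open (start, end) coordinates

mature = splice(genomic, exons)
print(mature)
print(intron_lengths(exons))
```

Comparing exon coordinate lists like this one for the same gene in two species is one simple way to see where the intron and exon structure differs.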
This gene structure of introns and exons varies between species for the same gene, and so that adds another level at which we need to compare genomes. We can think of this as a low-resolution view.
The final level zooms out even further. It describes how groups of genes are arranged between the genomes. This is a concept called synteny.
Evolution never makes things simple for biologists. We can't just line up the mouse and human genomes starting at one end of a chromosome and expect to find matching regions one after another. On the time scale of evolution, the process of recombination -- the genetic equivalent of cut-and-paste -- is continually at work rearranging the genome. Large blocks of genes are moved around within, and between, genomes.
In addition, genetic drift -- the appearance and chance spread of random mutations -- is continually trying to introduce minor changes into the sequence. We know natural selection will conserve important genes but we would not expect, a priori, the arrangement of groups of unrelated genes to be conserved between mouse and human. But Nature always likes to surprise us.
This figure shows the correspondence between regions of human chromosomes on the right, and their counterparts in mouse. Perhaps more striking than the "shuffling" of blocks between the species are the large blocks of sequence that are common to both, and in particular the complete conservation, at this level of resolution, in the sex chromosomes X and Y.
Image credit: U.S. Department of Energy Human Genome Program
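One simple way to find conserved blocks like these is to walk along the genes of a human chromosome in order and group together neighbors whose mouse counterparts are also neighbors on a single mouse chromosome. The Python sketch below does this under stated assumptions: the gene names, chromosome labels, and positions are all invented, each gene has at most one mouse counterpart, and real synteny tools tolerate gaps and small rearrangements that this version does not.

```python
def synteny_blocks(human_order, mouse_pos, min_genes=2):
    """Group consecutive human genes whose mouse counterparts are also adjacent.

    mouse_pos maps a gene name to (mouse chromosome, index along it).
    """
    blocks, current = [], []
    for gene in human_order:
        if gene not in mouse_pos:
            continue  # no mouse counterpart: skip without breaking the block
        if current:
            prev_chrom, prev_idx = mouse_pos[current[-1]]
            chrom, idx = mouse_pos[gene]
            # Adjacent in either direction, so inverted blocks still count.
            adjacent = chrom == prev_chrom and abs(idx - prev_idx) == 1
            if not adjacent:
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append(gene)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks

# Invented gene order on one human chromosome and invented mouse positions.
human_order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
mouse_pos = {
    "g1": ("chr4", 10), "g2": ("chr4", 11), "g3": ("chr4", 12),
    "g4": ("chr11", 5), "g5": ("chr11", 4),   # same block, reversed order
    "g6": ("chr2", 7),                        # isolated gene, too short to keep
}

blocks = synteny_blocks(human_order, mouse_pos)
print(blocks)
```

The toy data yields two blocks: three genes that stay together on one mouse chromosome, and a pair that moved to another chromosome and was flipped along the way.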
Exploring similarities and differences between genomes at all these levels, visualizing them and relating them to other types of information, are just some of the challenges and opportunities facing bioinformatic scientists in this area.
The Software: Genome Browsers
This section introduces you to some of the software tools being used in comparative genomics. Because of the volume and nature of the data involved, almost all the visualization tools in this field use a web interface to access large databases of pre-computed sequence comparisons and annotations.
Although this means the topic doesn't have a real Mac OS X angle to it, I think you will find the following examples interesting. I plan to include hands-on examples throughout this series of articles so you can get a real feel for the data and the software tools we use to explore it. Get in there and get your hands dirty!
To explore comparative genomics we will use the VISTA Genome Browser from Ed Rubin's group at Lawrence Berkeley National Laboratory (LBNL) in Berkeley, Calif. The browser is an excellent way to illustrate some of the ideas discussed here. For more information on the software and its authors, check out the paper listed in the resources section below.
The browser is a Java applet that is invoked from your web browser; it requires Java 2. To get started, go to the VISTA Browser home page.