Bioinformatics and Comparative Genomics

The BRCA1 Gene via VISTA

We are going to look at conservation in the BRCA1 gene on human chromosome 17. This gene is of great medical importance: when it carries specific mutations, it predisposes women to breast cancer. One of the earliest diagnostic tests that came out of the genome project looks for these mutations.

In the Position box on the VISTA home page, enter the coordinates of this gene (chr17:41,560,000-41,660,000) and click Go.
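If you want to work with position strings like this one in your own scripts, it helps to break them into a chromosome name and two integers. Here is a minimal sketch in Python. The `Region` type and the `parse_region` function are names I made up for illustration; they are not part of VISTA or any other browser.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic interval as typed into a browser's Position box."""
    chrom: str
    start: int
    end: int


# Chromosome name, a colon, then two numbers that may contain commas.
_REGION_RE = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_region(text: str) -> Region:
    """Parse a string such as 'chr17:41,560,000-41,660,000'."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom, start, end = match.groups()
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError(f"start is after end in {text!r}")
    return Region(chrom, start_i, end_i)


region = parse_region("chr17:41,560,000-41,660,000")
print(region)                     # Region(chrom='chr17', start=41560000, end=41660000)
print(region.end - region.start)  # 100000
```

The region we are asking VISTA to display is 100,000 nucleotides wide.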

The window that appears shows two "peak and valley" plots of a similarity score. In this case the horizontal dimension has been split in the middle to form two panes.

Let's just look at the top pane for now:

There are five tracks in this figure. The x-axis at the bottom shows the sequence coordinates on this chromosome in the human genome. The top track shows the organization of genes that cover this region. In this case we can see from the arrowhead at the left end that BRCA1 is encoded on the reverse strand of the DNA, running right to left.

The small blue blocks represent the exons in this gene, separated by the much larger introns. The track below this shows the location of sequence repeats, with the colors representing different types. The genome is packed full of these so-called "junk" sequences, the functions of which are still unclear. Notice that the repeats account for most, but not all, of the intron sequences.

The two large tracks with the peaks and valleys display how well the human sequence is conserved in the genomes of mouse and rat, respectively. The similarity plot is truncated at 50% so any peaks shown indicate strong similarity. These two plots look broadly the same, indicating that the mouse and rat sequences are very similar, as you might expect.

Overall, the similarity between these and the human sequence is fairly low, but there are many strikingly well-conserved segments, as indicated by the peaks. How these line up with the exons of the human BRCA1 gene is even more striking. All the BRCA1 exons in this region are conserved in mouse and rat. The plots highlight this by shading these segments the same blue color as the exons. This is a strong indication that mouse and rat have a functional equivalent of BRCA1.

But what about the peaks that don't line up with the exons? Those segments with more than 75% similarity are colored pink. Their locations fall in between the human sequence repeats. Evolution has conserved these segments between these distant species for a reason, but what is it? Sequence features like these are the big mystery of the human genome. The presumption is that they are involved in regulation of gene expression or that they play some role in chromosome structure, the higher degrees of packing necessary to fit all this DNA into the cell.
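To make the idea behind these plots concrete, here is a toy sketch of window-based conservation scoring. It is not VISTA's actual algorithm, and the window size, the made-up sequences, and the function names are all my own assumptions. It takes two sequences that have already been aligned, scores each window by percent identity, and labels windows above the 75% cutoff as either exon (the blue peaks) or conserved noncoding (the pink ones).

```python
from typing import List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns where both sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def classify_windows(
    human: str,
    other: str,
    exons: List[Tuple[int, int]],
    window: int = 10,
    threshold: float = 75.0,
) -> List[Tuple[int, int, float, str]]:
    """Score non-overlapping windows and label each one.

    Exons are 0-based, half-open (start, end) intervals on the alignment.
    """
    results = []
    for start in range(0, len(human) - window + 1, window):
        end = start + window
        identity = percent_identity(human[start:end], other[start:end])
        if identity <= threshold:
            label = "low"
        elif any(start < ex_end and ex_start < end for ex_start, ex_end in exons):
            label = "exon"       # conserved and overlapping an exon
        else:
            label = "noncoding"  # conserved, but outside any exon
        results.append((start, end, identity, label))
    return results


# Toy alignment: one exon, one diverged stretch, one conserved noncoding stretch.
human = "ACGTACGTAC" + "GGGGGGGGGG" + "TTTTTTTTTT"
mouse = "ACGTACGTAC" + "GGAAGGAAAA" + "TTTTTTTTAT"

for start, end, identity, label in classify_windows(human, mouse, exons=[(0, 10)]):
    print(f"{start:3d}-{end:3d}  {identity:5.1f}%  {label}")
```

A real tool works on whole-genome alignments and uses overlapping windows, but the principle is the same: the interesting segments are the ones that score high without any known exon to explain them.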

Let's now bring another genome into the comparison and see if that sheds additional light. Go to the Select/Add menu on the left of the applet, select Chimp, and click OK on the dialog that pops up. For clarity I will just show the top panel of the display:

This looks different. With the exception of the region at the right of the panel, and two or three small regions, the chimp sequence is essentially identical to the human. So close, in fact, that a comparison at this resolution is not very informative. This is a good example of why it was important to sequence the genomes from a number of distant species first, in order to identify the conserved regions against a background of non-conserved sequences. Where the chimpanzee/human comparison has greater value is in the high-resolution comparisons of individual genes, exons, and proteins, as well as in the low-resolution view of synteny.

Here are a few other genes that you might like to look at within the VISTA Browser. You can enter the exact coordinates in the Position box, or just enter the gene name and let VISTA look up the location for you.

HBA1 (chr16:166,679-167,520)
The gene for one of the two protein chains in hemoglobin. HBA1 is small and simple with three exons. It is less than 1,000 nucleotides in length.

MYD88 (chr3:38,140,778-38,145,135)
This gene is involved in the stimulation of cells in the immune system in response to infection. It is relatively small, around 4,000 nucleotides long.

NFKB1 (chr4:103,880,889-103,996,877)
This is an important gene involved in inflammation. The protein encoded by the gene regulates the expression of certain other genes. The exons of NFKB1 are spread over a large segment of the genome, and the gene is about 115,000 nucleotides in length.
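You can check the quoted gene sizes directly from the coordinates. This short sketch assumes the positions are 1-based and inclusive, which is the usual convention for browser position boxes; the `span_length` helper is just a name I chose for illustration.

```python
# Coordinates as listed above: (chromosome, start, end).
GENES = {
    "HBA1": ("chr16", 166_679, 167_520),
    "MYD88": ("chr3", 38_140_778, 38_145_135),
    "NFKB1": ("chr4", 103_880_889, 103_996_877),
}


def span_length(start: int, end: int) -> int:
    """Length of an interval whose start and end are both included."""
    return end - start + 1


for name, (chrom, start, end) in GENES.items():
    length = span_length(start, end)
    print(f"{name:6s} {chrom}:{start:,}-{end:,}  {length:,} nt")
```

Under that assumption the spans come out at 842, 4,358, and 115,989 nucleotides, in line with the rough sizes given above.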

Try those genes or just pick a random segment of a chromosome. Play with the interface, zoom in and out, move left and right, and try adding additional genome tracks.

Other Genome Browsers

There are more ways than one to visualize gene organization and conservation. Here are a few other browsers you can try out. Most of these include additional tracks or annotations such as alternative models for gene structure, matches to partial transcript sequences, and other features that I don't have room to describe. Just bear in mind that genome data is not as defined and clear-cut as some of the hype would have you believe.

To illustrate this let's look at BRCA1 in the UC Santa Cruz Genome Browser. Go to the UCSC Genome Browser and enter the BRCA1 coordinates we used earlier.

Yowza. I'm not sure what Edward Tufte would have to say about the design of this, but there are a lot of annotations available and you need to see all of them to evaluate any given region. Many of the features on the image are linked to more-detailed information. Click away and see where it takes you.
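If you would rather jump to a region from a script than type it into the form, the UCSC browser accepts the assembly and position as `db` and `position` parameters on its `hgTracks` URL. Here is a small sketch that builds such a link. The `ucsc_url` helper is my own name, and the assembly value in the example is only a placeholder assumption: coordinates shift between assemblies, so you need to pass the one that matches the coordinates you are using.

```python
from urllib.parse import urlencode

UCSC_TRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_url(db: str, position: str) -> str:
    """Build a UCSC Genome Browser link for an assembly and a region string."""
    query = urlencode({
        "db": db,                               # assembly identifier
        "position": position.replace(",", ""),  # drop thousands separators
    })
    return f"{UCSC_TRACKS}?{query}"


# "hg16" is a placeholder assumption, not something stated in this article.
print(ucsc_url("hg16", "chr17:41,560,000-41,660,000"))
```

Paste the printed link into your web browser to open the region.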

For a display that favors tables over images take a look at the Ensembl Viewer from the European Bioinformatics Institute in Cambridge, UK.

The National Center for Biotechnology Information (NCBI) at the NIH has yet another perspective on visualization with the Map Viewer program. This beast of a URL will show you BRCA1.

Or you can go to the Map Viewer home page and work from there.

The Major Players in Comparative Genomics

Not surprisingly, most of the action in this field is taking place in the big genome centers that generate the data:

  • The Broad Institute (nee Whitehead Center for Genome Research), Cambridge, Mass.
  • The Sanger Institute, Cambridge, UK.
  • Washington University Genome Sequencing Center, St. Louis, Mo.
  • Baylor College of Medicine Human Genome Sequencing Center, Houston, Texas.
  • DOE Joint Genome Institute, various sites, U.S.
  • RIKEN Genomic Sciences Center, Japan.

There is plenty of software development at each of these centers, but as often happens in this field, the bioinformaticians have their hands full just managing the data. The integrated browsers described here have all been developed away from the primary data sources: at UC Santa Cruz, LBNL in Berkeley, the EBI in the UK, and the NCBI in Bethesda, Md., just outside Washington, D.C.

One area that I've not been able to touch on here is that of comparing bacterial genomes. These sequences are only a few million nucleotides long, and their genomic organization is much simpler. There are a lot of well-characterized bacterial genomes available, and comparison of these gets really interesting. The Institute for Genomic Research (TIGR) in Maryland and the Sanger Institute in the UK are two of the main centers for this work.

Final Thoughts

You can see from the data shown in the examples given above that comparative genomics is full of challenges. From efficient handling of the vast amounts of data to quantifying similarity from different perspectives, from genome visualization to reverse engineering the course of human evolution, there is a huge amount of work to be done even with the data we already have on disk. Every one of these challenges is an opportunity for developers and creative thinkers to be involved in this remarkable area of science.

I hope this article and its worked examples have given you a taste for this side of bioinformatics and will inspire you to learn more. Let me know what you think about it.