Introduction to Bioinformatics
by Robert Jones
Bioinformatics is the intersection of molecular biology and computer science. For software developers, it's a fascinating and challenging area in which to work. During the coming months, the Mac DevCenter will touch on different areas of bioinformatics: what the hot topics are right now, how Mac OS X and open source software are playing a role, and how you can get involved.
In this article I want to introduce this exciting field and set the scene for the articles that will follow.
What is Bioinformatics?
When molecular biologists started to generate DNA sequence data 27 years ago, it was natural that computer scientists and mathematicians would take a keen interest. Here in the messy, wet, analog world of biology was digital information: a linear string of four chemical groups encoding the entire blueprints for the protein machinery of the living cell. How could you not be interested in cracking that code?
This field of study gained a real identity, and the name bioinformatics, in the mid-1980s, as DNA sequencing became a fundamental tool for molecular biology and sequence data started to appear in significant volume. Right from the start, three concepts emerged that remain central to bioinformatics today.
The first is data representation. The DNA in the human genome is not neatly arranged in the pristine double helix we all recognize. It is coated with proteins that bind to specific sequences, which untwist the helix to allow gene expression and wind it up into tightly packed supercoils. Far from being a static archive of blueprints, DNA is a complex, dynamic, three-dimensional molecule. And yet we represent all of this as a simple string of the characters A, C, G and T.
This is a remarkable abstraction. Most of the processes involving genes that we know about have been discovered using this grossly simplified representation of reality. It is the perfect representation for computer analysis, and without it we could never have approached a project on the scale of the human genome.
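To make the abstraction concrete, here is a minimal Python sketch that treats a DNA sequence as nothing more than a string. The fragment is made up for illustration, and the function names `reverse_complement` and `base_counts` are my own, not part of any particular bioinformatics library.

```python
from collections import Counter

# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def base_counts(seq: str) -> Counter:
    """Count how many times each base occurs in the sequence."""
    return Counter(seq)


# A made-up fragment, used only to exercise the functions.
fragment = "ATGGCCATTGTAATGGGCCGC"

print(reverse_complement(fragment))
print(base_counts(fragment))
```

Everything the code knows about the molecule is the order of four letters, which is exactly why such large volumes of sequence can be handled with ordinary string operations.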
Second is the concept of similarity. Evolution has operated on every sequence that we see today. It conserves genes that encode important proteins and sequences that are involved in gene regulation. Sequences that encode useful functions are transferred, like code modules, from one organism to another. Because of evolution, similar sequences tend to have similar functions.
Algorithms for comparing sequences and finding similar regions are at the heart of bioinformatics. At many different levels, they are used to find genes, determine their functions, study their regulation and assess how they, and entire genomes, have evolved over time.
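As a rough illustration of what such a comparison involves, the sketch below scores the best-matching region shared by two sequences using the Smith-Waterman local alignment method, one classic approach among many. The scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for the example; real tools use carefully tuned scoring schemes and much faster heuristics.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Only the score is computed, not the alignment itself, so two rows
    of the dynamic programming table are enough.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                             # start a fresh local alignment
                previous[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                previous[j] + gap,             # gap in b
                current[j - 1] + gap,          # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# Made-up sequences that share the region "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A high score relative to the sequence lengths suggests a shared region worth a closer look. Turning that score into a claim about shared function takes statistics and biological judgment that this sketch leaves out.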
Third is the reality that bioinformatics is not a theoretical science; it is driven by the data, which in turn is driven by the needs of biology. Relatively few researchers have the luxury to develop algorithms and theories in the traditional academic sense. Most people are fully consumed in the day-to-day management and analysis of data.
We have a lot of data. The introduction of automated DNA sequencing in the early 1990s created what was, at the time, a torrent of sequence data. But it was the Human Genome Project, with its massive automation, production lines, and money, that really opened the floodgates in the past few years. Compare the rate of growth of sequence data in GenBank, the NIH sequence database, to Moore's Law, that well-known measure of technical advancement, and you will appreciate the challenge facing biology.
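If you want to make that comparison yourself, the arithmetic is simple: under steady exponential growth, the doubling time is the elapsed time multiplied by ln 2 / ln(end size / start size). The sketch below applies it to purely hypothetical numbers, not real GenBank statistics, and takes the commonly quoted two-year doubling period as its approximation of Moore's Law.

```python
import math


def doubling_time(start_size: float, end_size: float, years: float) -> float:
    """Years needed to double in size, assuming steady exponential growth."""
    if start_size <= 0 or end_size <= start_size or years <= 0:
        raise ValueError("need positive sizes, net growth, and a positive time span")
    return years * math.log(2) / math.log(end_size / start_size)


# Hypothetical figures for illustration only: a database that grows
# tenfold over five years. Substitute real release statistics here.
HYPOTHETICAL_START = 1.0
HYPOTHETICAL_END = 10.0
HYPOTHETICAL_YEARS = 5.0

# Commonly quoted approximation of Moore's Law (an assumption of this sketch).
MOORES_LAW_DOUBLING_YEARS = 2.0

database_doubling = doubling_time(
    HYPOTHETICAL_START, HYPOTHETICAL_END, HYPOTHETICAL_YEARS
)
print(f"Database doubles every {database_doubling:.2f} years")
print(f"Moore's Law doubles every {MOORES_LAW_DOUBLING_YEARS:.2f} years")
```

Whenever the first number is smaller than the second, the data is outgrowing the hardware used to analyze it.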
And those are just the sequences! Microarray technologies, able to measure the expression of thousands of genes in a single experiment, have developed over the past decade and now produce huge amounts of data. New techniques for looking at genetic variations in large human populations, and for identifying interactions between sets of proteins in cells, are pouring data onto file servers around the world. Bioinformatics is charged with managing and making sense of all of the data, keeping pace with both data production and technology development. There's plenty of work to go around.
What are the Hot Topics in Bioinformatics?
Edited screenshot taken from the UC Santa Cruz Genome Browser (genome.ucsc.edu)
Genome sequencing has the highest profile, thanks to the Human Genome Project. It has been, and still is, the focus of a huge amount of work. The first "tier" of genome sequences (human, rat, mouse, and fruit fly) is now complete, and the big sequencing labs are moving on to organisms like the chimpanzee, rhesus macaque, cow, chicken, and sea urchin.
Why this huge effort to sequence the entire contents of the zoo? Comparative genomics: the same approach to biology used by Charles Darwin, but based on sequences instead of the beaks of finches. By comparing the genomes of related species, we can learn a tremendous amount about how genomes are organized and how major evolutionary changes take place. At the level of individual genes, we can uncover novel mechanisms for regulation that were hidden when we just had one sequence to work with. Similarity is everything!
Single Nucleotide Polymorphisms (SNPs)
Edited screenshot taken from Ensembl SNPView (www.ensembl.org)
Another avenue that opens up once we have a "reference" human genome is the study of sequence differences between individuals -- the fine details: what makes you different from me. It turns out that the genome is full of single nucleotide differences, called polymorphisms, or SNPs for short. Most of these have no direct impact on anything. But their distribution throughout the genome, their frequency in the human population, and their patterns of inheritance make them extremely useful markers for differences that we do care about. By measuring sets of SNPs in thousands of individuals and correlating them with the incidence of a disease, we can identify which regions of the genome are involved and eventually pinpoint the genes themselves.
The combination of these molecular assays with large clinical studies of populations generates huge amounts of data and a whole new set of challenges for bioinformatics.
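At its simplest, spotting single nucleotide differences means walking along two sequences and noting where they disagree. The sketch below does this for a reference and one individual's sequence. It assumes the two have already been aligned to the same length and contain no insertions or deletions, and both the sequences and the function name `find_snps` are made up for illustration.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """List single-base differences as (position, reference base, sample base).

    Positions are 1-based. The sequences must already be aligned and of
    equal length; insertions and deletions are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base)
        in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Made-up sequences that differ at a single position.
reference = "ACGTACGTAC"
individual = "ACGTTCGTAC"

for position, ref_base, sample_base in find_snps(reference, individual):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

The hard part in practice is everything this sketch assumes away: producing reliable alignments, telling real polymorphisms from sequencing errors, and doing it for thousands of individuals.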