Introduction to Bioinformatics
Pages: 1, 2
Microarray technologies show us which genes are turned on in different cell types in different circumstances. In response to infection, for example, certain cell types will express sets of genes and synthesize certain proteins that respond to the stress. Messenger RNA (mRNA) is like a photocopy of a blueprint that is used in the shop to build a specific type of protein. In a microarray, we can attach sequences from a range of genes to a glass slide in a series of dots, and then bind the mRNA extracted from a population of cells and measure how much binds to each dot. That gives us a snapshot of which genes are being expressed at any given time. Compare the patterns for mRNA from, for example, normal breast tissue and from a breast tumor, and you can identify proteins that are only present in the tumor. Those proteins are potential targets for cancer treatments, vaccines, and other therapeutics.
Edited screenshot taken from CytoScape tutorial (www.cytoscape.org)
The genome gives us all of the genes in an organism, and microarrays tell us which subset is expressed in a particular biological process. Now the bottleneck in understanding biology is shifting to the world of proteins and the interactions between them. The traditional approach of dissecting out individual interactions with the help of mutations and inhibitors just doesn't scale. That is where systems biology comes in with a slew of novel technologies aimed at seeing the big picture of everything going on in a cell.
New advances in mass spectrometry have allowed this established chemical analysis technology to identify the components of complex mixtures of proteins. Inventive chemical labeling techniques provide insight into the transient interactions between different proteins in the cell. This bundle of new technologies is called proteomics.
The integration of all of these results with gene expression data and the collective knowledge of cell biology, contained in the scientific literature, becomes another huge challenge. This is leading to exciting work in textual analysis, pathway modeling, and network visualization.
Nitrogenase structure 1CP2 displayed in MacPyMol (pymol.sourceforge.net)
While our abstraction of the DNA sequence works remarkably well, in the world of proteins the nuances of three-dimensional structure are everything. Structural biologists determine the structure of proteins using X-ray crystallography and nuclear magnetic resonance, a slew of heavy numerical methods, and a lot of computing. This is a huge field in its own right that predates bioinformatics by several decades. It focuses on the details of structure, the dynamics of molecular motion, and the specific interactions with drugs and other proteins. Bioinformatics, with its focus on huge volumes of data, has often had an uneasy interface with structural biology; "quantity versus quality" some might say, but that distinction is becoming every more blurred as all of these data sources become more integrated.
Software in Bioinformatics
Two main factors have shaped the current landscape of bioinformatics software. As already mentioned, the field has been driven by the massive amount of data and the research projects that generate it. As a result, most people in bioinformatics work on very focused projects and few have the luxury to sit back and write the ideal program for gene prediction, for example.
In addition, the technologies used in the lab, and the data they produce, have evolved very rapidly. That has made it very difficult to commit a lot of resources and time to specific pieces of software. The lifespan of a software project is often quite short and the lead time before deployment is minimal. Being able to understand the essence of a problem and hack up a quick solution that gets the job done are critical skills for a good bioinformatics developer.
A classic example is the genome assembler written by Jim Kent at UC Santa Cruz. Excellent software already existed for assembling the fragments of data produced by sequencing instruments into large blocks, but it could not handle the scale of the task that the Human Genome Project had created. Rather than try to modify existing code, it made sense for Kent to start from scratch and build something, in very short order, that was tailored to the task at hand. More than a quick hack, but a lot less than a complete, polished product, Jim's software assembled the human genome.
Refined, mature software packages usually emerge from research groups with a direct bioinformatics focus, as opposed to playing a support role in, say, a genome center. Of all of the software out there, the "killer app" in bioinformatics has to be BLAST, the suite of sequence comparison tools from NCBI, the National Center for Biotechnology Information at the NIH. The BLAST team built a very fast sequence-comparison engine that could search the entire contents of GenBank in seconds. Over the years, they have improved performance and extended their algorithms, but have always retained their focus on what they do well. As a result, every molecular biologist that has ever looked at a sequence has used the NCBI BLAST server.
What Role Is There for Mac OS X?
Molecular biologists have had a long history of involvement with the Mac, in part from a natural gravitation to the platform but undoubtedly helped by Applied Biosystems' choice of the Mac as the front end to its DNA-sequencing instruments. That changed in the mid-90s as the Windows interface improved and the price/performance ratio shifted in favor of the PC platform. Computer scientists and developers coming into bioinformatics, on the other hand, were used to using Unix. The rise of Linux has locked that preference firmly in place.
Mac OS X has the potential to be the ideal platform for bioinformatics development, with Unix under the hood, a great desktop, productivity applications, and integration with Windows systems. Porting existing bioinformatics packages from Linux is usually straightforward, and many are already available from the Fink project. Being able to expose command-line tools to the desktop user in a simple way will broaden their user base dramatically, and Mac OS X provides several ways in which to build this kind of interface.
Getting Involved in the Field
Bioinformatics is a very rewarding area for software developers to work in. There is something for everyone, whether you're into the minutiae of database design, complex user interface design, advanced statistical algorithms, or good old Perl script hacking. The technologies that produce the data you work on are amazing. The data and the biology behind them are fascinating. On top of that, biologists tend to be nice people to be around!
One topic that we will cover in the coming months is how you get started in bioinformatics: what programming skills you need, how much biology you should know, and how to build a lifelong career in the field.
Bioinformatics and the Mac DevCenter
Over the next few months we will cover these topics and others in more detail. Where possible, we'll also include short tutorials that introduce you some of the key software tools used in bioinformatics. These will guide you through analyses of real datasets from the Human Genome Project and elsewhere. They won't make you an expert, but I hope they will spur you in to further explorations of your own. Stay tuned!
Robert Jones runs Craic Computing, a small bioinformatics company in Seattle that provides advanced software and data analysis services to the biotechnology industry. He was a bench molecular biologist for many years before programming got the better of him.
Return to the Mac DevCenter