Genomics and Proteomics

When we discussed DNA chips, I mentioned that a tremendous amount of information is now available, and more will soon become available, in the form of complete DNA sequences for a number of different organisms. The human genome should be available in a fairly final form within a couple of years now. Quite a number of other genomes have also been sequenced. What do we do with all this information, and where do we go from here?

One of the most fundamental questions in biochemistry today is how one can predict, from the primary sequence of a gene, the structure and function of the protein encoded by that gene. There are lots of ways of going about this. There are those who try to calculate protein folding directly. You can imagine that this is a horribly complex problem (think of the number of degrees of freedom in a protein with 100 amino acids). There are others who correlate sequences with folds (given a library of known structures) and folds with functions (given a library of known functions). The goal is to get to a point where we can design an enzyme de novo at the computer. This is a very tough problem.
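To get a feel for the degrees-of-freedom problem, here is a rough back-of-the-envelope count in the spirit of the classic Levinthal argument. It is only a sketch under loud assumptions that are not part of the discussion above: each residue is given two rotatable backbone angles (phi and psi), each angle is restricted to just three discrete settings, and side chains are ignored entirely. The function name and its parameters are made up for illustration.

```python
import math


def backbone_conformations(n_residues: int, states_per_angle: int = 3) -> int:
    """Count backbone conformations under a crude discrete model.

    Assumptions (illustrative only): two rotatable backbone angles
    (phi and psi) per residue, each limited to `states_per_angle`
    settings; side chains and end effects are ignored.
    """
    n_angles = 2 * n_residues
    return states_per_angle ** n_angles


count = backbone_conformations(100)
# Report the order of magnitude rather than the full integer.
print(f"roughly 10^{math.log10(count):.0f} backbone conformations")
```

Even with these generous simplifications, a 100-residue protein comes out at about 10^95 conformations, which is why nobody tries to fold a protein by simply enumerating every possibility.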

To keep making progress in that area, we need a larger and larger number of known structures and functions to look at. Thus, there is a large movement afoot to complement the DNA sequence of an entire organism with a library of protein structures and functions for the entire organism. What is being done initially is to overexpress some of the smaller genes from the organism and solve the structures of the proteins they encode. From the structures, one makes predictions about the biochemistry and then does the experiments required to determine whether those predictions are correct. The problem is that we are talking about tens of thousands of proteins here. Structure solution by NMR or X-ray crystallography traditionally takes months for one protein. We need to get faster. Much faster. The present approach is to automate most of the sample preparation, data collection, and analysis routines to the point where one can feed samples to the machines and let the computers take over, generating structures in days. The new NMR machines are allowing fairly good-sized proteins to be analyzed -- closing in on 100,000 Daltons. That is still small by most standards, but it encompasses a significant fraction of the proteins in the genome. How we will deal with larger proteins, protein complexes, or membrane proteins at this pace is not known. It is possible to solve these structures now (it was not 20 years ago), but it is still very slow and involves a great deal of trial and error.
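The arithmetic behind "much faster" is easy to sketch. The numbers below are placeholders chosen only to match the rough scales mentioned above (tens of thousands of proteins, months versus days per structure); the figure of 30,000 proteins, the 90-day and 3-day solution times, and the ten parallel pipelines are all assumptions for illustration, not measurements.

```python
def years_to_finish(n_proteins: int,
                    days_per_structure: float,
                    parallel_pipelines: int = 1) -> float:
    """Estimate the years needed to solve every structure.

    Assumes every protein takes the same time and that the pipelines
    run continuously and independently (both are simplifications).
    """
    total_days = n_proteins * days_per_structure / parallel_pipelines
    return total_days / 365.0


# Hypothetical scenario: 30,000 proteins spread over 10 pipelines.
manual = years_to_finish(30_000, days_per_structure=90, parallel_pipelines=10)
automated = years_to_finish(30_000, days_per_structure=3, parallel_pipelines=10)

print(f"months per structure: about {manual:.0f} years")
print(f"days per structure:   about {automated:.0f} years")
```

Under these assumed numbers, going from months to days per structure turns a project of several centuries into one of a few decades, and further automation or more parallel pipelines shortens it proportionally.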

What we certainly can do, and what is of medical importance, is to use the human genome to correlate sequence with disease and other phenotypes. Using DNA chips, it will be possible for all of us to know in what medically important ways we differ from one another. If we look at this across large human populations, it will certainly start to point to the genetic basis for many conditions that we do not presently understand completely. Predisposition to many cancers is thought to have a genetic component. We will rapidly uncover more and more of those factors. Old age itself is thought by many to be a genetic disease -- with death being preprogrammed. People are studying the causes of aging very seriously now and are coming up with surprising answers. Aging may not be inevitable.

When you couple this knowledge with the recent advances in gene manipulation in higher animals (cloning of mammals, introduction or alteration of genes in mammals), there is the potential to recreate ourselves and the creatures in our world in different ways -- to direct the course of evolution. Pretty scary.