__The inside story of the first fully sequenced chromosome.__
We are crossing the finish line: Sometime this summer, a rough draft of the human genome, sequenced and assembled, will be completed - a map that will show 3 billion nucleotides arranged on 23 pairs of chromosomes.
This seemed like a distant goal just one year ago, but the pace of discovery has taken off, supercharged by millions of dollars in private funding and massive advances in technology, ranging from processing power to high-speed genetic sequencing machines that take in samples of DNA at one end and spit nucleotide code out the other. Public human genome databases now hold 8 billion nucleic acid sequences and are growing at a rate of 1 billion a month. In March, a consortium of public and private research groups revealed the genetic code of Drosophila melanogaster, the fruit fly. Then, at a congressional hearing in April, Celera Genomics president Craig Venter announced that his company had completed a rough, unassembled sequencing of the entire human genome. Both Celera and the international Human Genome Project are sprinting to finish the job.
But this achievement marks another beginning - in itself, the assembled sequence means very little. Researchers still don't agree on the exact number of genes it takes to make a human - anywhere from 35,000 to 120,000. In the next decade, as we uncover the subtle details of each strand, we'll discover the full meaning of the human genome, from how we live to how we die. To tease that information out of the terabytes of data, researchers use analytical tools developed by companies such as Berkeley-based Neomorphic (www.neomorphic.com), whose sophisticated interface and proprietary gene-hunting algorithms let scientists scan an entire chromosome at once or zoom in to the base-pair level. When a consortium of four Human Genome Project institutes sequenced the second smallest human chromosome last December, researchers counted 545 genes, directly associated with 35 diseases; included were 11 areas that couldn't be deciphered with the available technology. Six months later, Neomorphic recrunched that data and predicted around 1,000 genes, along with their many possible gene splice variants. See for yourself: Chromosome 22 - the first human chromosome to be fully sequenced - appears on the foldout inside, spliced and diced by Neomorphic. The most complete analysis of the best-understood chromosome to date, this map details the front lines of genome research.
__Deoxyribonucleic Acid Test__
This satellite view of chromosome 22, courtesy of Neomorphic's number-crunching tools, lays out the chromosome's estimated 34.5 million base pairs along the horizontal gray axis that separates the two strands of the double helix.
Before Neomorphic began its assault on chromosome 22, public genome data included gene predictions by the Human Genome Project and single base-pair variations known as SNPs or "snips" (which are cataloged at www.ncbi.nlm.nih.gov/SNP/) from the National Center for Biotechnology Information. Neomorphic recrunched the data, using statistical analysis to predict the genes shown in green, then layering in lab analysis for those in orange, which show the different ways that a single gene could be translated into proteins. Each of the callout boxes showcases Neomorphic tools at work, breaking out nucleotide traces and filtering out repeats, predicting protein splice variations, inspecting specific base pairs, aligning protein sequences, and annotating with outside analysis.
__Disease Hunt__ A single typo on a strand 3 billion bases long could cause serious illness. If, say, the G near 640 were to mutate into an A, the result would be thrombophilia, a treatable blood disorder. Identifying these mutations in order to understand the molecular basis of disease is - for pharmas and medical researchers, at least - the ultimate goal of human genome research.
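Spotting such a typo comes down to comparing a sample against a reference, letter by letter. The Python sketch below is an illustration only, not Neomorphic's method: the function name `find_point_mutations` and both sequences are invented, and it assumes the two strings are already aligned and equal in length.

```python
def find_point_mutations(reference, sample):
    """Return (position, ref_base, sample_base) for every mismatch.

    Positions are 1-based, the way base-pair coordinates are usually quoted.
    Assumes both strings are already aligned and the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Made-up sequences for illustration, not real chromosome data.
reference = "ACCTGGACGT"
sample = "ACCTGAACGT"
print(find_point_mutations(reference, sample))  # [(6, 'G', 'A')]
```

Real sequences rarely line up this neatly; insertions and deletions shift everything downstream, which is why alignment has to come before comparison.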
__Trace__ The sequencing machine is the workhorse of genome research. A DNA sample goes in one end, a string of As, Ts, Cs, and Gs - the raw data of bioinformatics - comes out the other. Inside the hulking machine, the sample is analyzed for the presence of adenine, thymine, cytosine, and guanine - the nucleotides that make up the base pairs. The wave pattern that illustrates the levels of each chemical is called the trace. Peaks of adenine are recorded as As, thymine peaks as Ts, and so on. In some cases, the machine can't determine the base, either because the levels of each of the four chemicals are roughly equal or because the sample is damaged. The level of certainty is indicated by the height of the colored bars at bottom.
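The peak-to-letter step can be sketched in a few lines of Python. This is a toy model rather than the software inside any real sequencer: the `min_ratio` threshold, the use of "N" for an undetermined base, and the intensity numbers are all assumptions made for the example.

```python
def call_base(signal, min_ratio=2.0):
    """Pick a base from one position's four channel intensities.

    signal: dict mapping "A", "C", "G", "T" to a non-negative intensity.
    Returns (base, confidence). The base is "N" when no channel clearly
    dominates; min_ratio is a hypothetical cutoff, not an industry value.
    """
    total = sum(signal.values())
    if total == 0:
        return "N", 0.0  # no signal at all, e.g. a damaged sample
    ranked = sorted(signal.items(), key=lambda item: item[1], reverse=True)
    (top_base, top), (_, runner_up) = ranked[0], ranked[1]
    confidence = top / total  # share of the signal held by the tallest peak
    if runner_up > 0 and top / runner_up < min_ratio:
        return "N", confidence  # peaks too close to tell apart
    return top_base, confidence


def call_sequence(trace):
    """Turn a list of per-position intensity dicts into a string of bases."""
    return "".join(call_base(signal)[0] for signal in trace)


# Invented intensities: three clean peaks and one muddled position.
trace = [
    {"A": 90, "C": 5, "G": 3, "T": 2},
    {"A": 4, "C": 6, "G": 80, "T": 10},
    {"A": 25, "C": 24, "G": 26, "T": 25},
    {"A": 1, "C": 2, "G": 2, "T": 95},
]
print(call_sequence(trace))  # AGNT
```

The confidence value plays the role of the colored bars: the larger the winning peak's share of the total signal, the taller the bar.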
__Alternative Splicing__ Once a chromosome has been sequenced, researchers edit its string of letters into genes. This is no small task: Genes have no fixed length (they may span anywhere from 100 to more than 2 million nucleotides), and it's not obvious where one gene ends and another begins. To identify individual genes, scientists use a combination of computer power and human expertise.
Once scientists identify a gene, they look for the different ways it could be translated into proteins. (It's like trying to read a page of letters without the benefit of spaces or punctuation: The string "letsgotogether" could be translated to "Let's go together" or "Let's go to get her.") This process is known as alternative splicing. In this detail of possible splice variations, the blocks represent exons (bases thought to code for proteins) and the lines represent introns (bases not involved in creating proteins).
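One simple way to picture splice variants is exon skipping: keep the exons in order, but allow some to be left out. The Python sketch below enumerates every such combination. It is a simplification under stated assumptions, since exon skipping is only one kind of alternative splicing, and the exon sequences and the choice of which exons are optional are invented for the example.

```python
from itertools import combinations


def splice_variants(exons, optional):
    """Yield every transcript obtainable by skipping optional exons.

    exons: list of exon sequences in genomic order (introns already removed).
    optional: set of indices into exons that may be skipped.
    Exon order is always preserved.
    """
    optional = sorted(optional)
    for skip_count in range(len(optional) + 1):
        for skipped in combinations(optional, skip_count):
            kept = [exon for i, exon in enumerate(exons) if i not in skipped]
            yield "".join(kept)


# Invented exons; the two middle ones are treated as skippable.
exons = ["ATGGCC", "GATTAC", "TTTGGG", "AAATGA"]
for transcript in splice_variants(exons, optional={1, 2}):
    print(transcript)
```

Two optional exons give four candidate transcripts; the count doubles with each additional optional exon, which is one reason a single gene can yield many proteins.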
__Repeats__ Only about 3 to 5 percent of the genome actually codes for proteins. The rest is referred to as junk DNA, though it would be more accurate to say that scientists don't understand what it's doing there. Included in this category are repeats - strings ranging in length from hundreds to thousands of nucleotides that reappear along the strand and make up 40 percent of chromosome 22 (and 20 percent of the genome overall). The blue lines at left are a close-up view of the indicated gene sequence. The repeats are represented by the multi-colored band (each color indicates a different kind of repeat) below the sequence. Before they start analyzing a sequence, scientists "mask" or block out these repeats to speed up the process.
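Masking itself is mechanically simple once the repeats have been located. The Python sketch below assumes the repeat positions are already known and simply overwrites them with a placeholder letter. The function names, the "N" placeholder, and the toy sequence are assumptions for illustration; finding the repeats in the first place is the hard part and is not shown.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Replace each repeat interval with mask_char.

    repeats: list of (start, end) pairs, 0-based and end-exclusive.
    Intervals may overlap; out-of-range parts are ignored.
    """
    masked = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


def masked_fraction(masked, mask_char="N"):
    """Share of the sequence that has been blocked out."""
    return masked.count(mask_char) / len(masked) if masked else 0.0


# Invented 20-base sequence with a run of Ts and an ATATAT repeat.
sequence = "ACGTTTTTTTGCATATATCC"
repeats = [(3, 10), (12, 18)]
masked = mask_repeats(sequence, repeats)
print(masked)                   # ACGNNNNNNNGCNNNNNNCC
print(masked_fraction(masked))  # 0.65
```

Downstream analysis can then skip every masked position, which is where the speedup comes from.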
__Comparative Genomics__ To determine the function of a human gene, scientists look for clues in the genetic makeup of other, better-studied species. In the example shown here, a section of a protein coded by a gene found on chromosome 22 is aligned with similar protein sequences from eight different organisms. The similarities (shown in green) indicate that this portion of the human gene contains the blueprint for a channel that allows chloride ions to enter and exit the cell. This channel varies from species to species, and these differences (shown here in red) are reflected at the genetic level.
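The green-versus-red coloring amounts to checking each column of the alignment for agreement. The Python sketch below marks a column with "*" when every species has the same amino acid and "." otherwise. The protein fragments are invented, and the sequences are assumed to be aligned already, with gaps written as "-".

```python
def conservation_line(aligned):
    """Mark each alignment column: "*" if all residues match, "." otherwise.

    aligned: list of equal-length protein strings, gaps written as "-".
    A column made up entirely of gaps counts as not conserved.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must share one length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else "."
        for column in zip(*aligned)
    )


# Invented protein fragments standing in for three species.
aligned = [
    "GKEGPLVH",
    "GKEGPMVH",
    "GREGPLVH",
]
for seq in aligned:
    print(seq)
print(conservation_line(aligned))  # *.***.**
```

Long unbroken runs of matching columns are the kind of evidence that suggests a shared function, such as the chloride channel described above.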
__Genome Annotation__ Interpreting the genome is an ongoing, iterative process. To truly understand the function of each gene, scientists combine data from computational analysis of the sequence with lab results in an interactive gene-editing application. In this example, the splice variant shown in red has been copied into the white editing window. As a scientist tweaks the length of a predicted exon or intron, the corresponding changes in the protein sequence appear in the window below. This process allows scientists to add their own logic to the mix, adjusting the computer's analysis in the same way a human translator might edit a document that had been machine-translated. This human interpretation of the genome is the next crucial step.
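To see why nudging an exon boundary changes the protein, consider the Python sketch below. It joins a set of exon intervals, reads the result three bases at a time using the standard genetic code, and stops at the first stop codon. It is not the editing application described here: the `translate` function, the coordinates, and the 24-base "gene" are invented, and the sketch assumes uppercase A, C, G, and T only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(genomic, exons):
    """Join exon intervals and translate until a stop codon.

    exons: list of (start, end) pairs, 0-based and end-exclusive.
    Leftover bases that do not fill a codon are ignored.
    """
    mrna = "".join(genomic[start:end] for start, end in exons)
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":  # stop codon ends the protein
            break
        protein.append(amino)
    return "".join(protein)


# Invented gene: exon, 9-base intron, exon.
genomic = "ATGGCC" + "GTAAGTCAG" + "GATTGGTAA"
print(translate(genomic, [(0, 6), (15, 24)]))  # MADW
print(translate(genomic, [(0, 9), (15, 24)]))  # MAVDW  (exon 1 extended by 3)
print(translate(genomic, [(0, 7), (15, 24)]))  # MAGLV  (extended by 1: frame shift)
```

Extending the first exon by three bases adds one amino acid and leaves the rest intact; extending it by a single base shifts the reading frame and rewrites everything downstream, which is why boundary calls deserve a human second look.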
*Data Sources:
The Sanger Centre, www.sanger.ac.uk/HGP/Chr22/;
Department of Molecular Biology, Keio University School of Medicine, www.dmb.med.keio.ac.jp;
Department of Chemistry and Biochemistry, University of Oklahoma, dna1.chem.ou.edu;
Genome Sequencing Center, Washington University School of Medicine, genome.wustl.edu/gsc/index.shtml;
Neomorphic.*