Some of the base pairs in a given genome are strung together into templates that code for proteins or RNA molecules. These are the classic “genes.” Other base pairs probably have little or no function. Among the DNA that is not in classic gene templates, however, there is a lot of important information, including “control regions.”

How much of each “type” of DNA exists in a particular genome varies. A recent study suggests that the currently used methods for scanning DNA for regulatory sequences may systematically miss more than half of that information.

Looking specifically at the DNA surrounding the zebrafish phox2b gene, McGaughey et al., using five different measures of “evolutionary constraint” (an indicator of the functionality of a particular DNA sequence), found that each method misses regulatory sequences. They estimate that between 29 and 61% of actual regulatory elements are missed by these various commonly used methods. They conclude:
the noncoding functional component of vertebrate genomes may far exceed estimates predicated on evolutionary constraint.
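To make the idea of a method “missing” regulatory elements concrete, here is a minimal Python sketch of how one might score a set of predicted intervals against experimentally validated elements. The coordinates, the function names, and the rule that any overlap counts as a detection are all assumptions made for illustration; none of them is taken from the paper.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates; all values below are invented.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def miss_rate(validated: List[Interval], predicted: List[Interval]) -> float:
    """Fraction of validated elements not overlapped by any prediction."""
    if not validated:
        raise ValueError("need at least one validated element")
    missed = sum(
        1 for v in validated
        if not any(overlaps(v, p) for p in predicted)
    )
    return missed / len(validated)


# Hypothetical lab-validated elements and hypothetical software predictions.
validated = [(100, 200), (500, 650), (900, 1000), (1500, 1600)]
predicted = [(150, 250), (640, 700), (2000, 2100)]

rate = miss_rate(validated, predicted)
print(f"missed {rate:.0%} of validated elements")  # missed 50% of validated elements
```

A real comparison would need a more careful definition of a “hit” (for example, a minimum overlap length), which is one reason different methods can report different miss rates for the same region.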
When searching for non-coding but still functional (that is, regulatory) sequences, genetic researchers develop “training sets” of known functional sequences that software then uses to find matches elsewhere. The training sets are developed by messing around with an organism’s DNA and observing the effects of this treatment on development. It is painstaking work, and it may be impractical to carry out this work on just any organism. Some organisms make better lab subjects than others owing to availability, the difficulties of providing the proper developmental environment, etc.

A key assumption of this method is that a given DNA sequence that is functional (a regulatory sequence) will not vary randomly across distantly related organisms because it is under selection.

The information derived from this detailed work can then be structured into statistical probes that can be applied with software to large genetic databases. In theory, a group of likely functional sequences would match homologous (similar by common descent) sequences in a wide range of organisms.

For this study, the researchers essentially tested the assumptions of this model by bringing it (the model) back to the lab bench. They found that each of the five methods missed actual on-the-ground (in the petri dish, as it were) regulatory sequences at the rates cited above.

In this study, the researchers sliced up the DNA sequence around the phox2b gene into small bits and attached each bit to a gene for a fluorescent protein. Then they inserted each bit into zebrafish embryos. If the bit being tested turned out to be a regulatory sequence, it would switch on the fluorescent protein in the growing tissues and the glow would be visible. If not, no glowing. This resulted in several cool images such as this one:

(From Figure 1. In situ hybridization (ISH) of endogenous phox2b expression. ISH was performed on wild-type zebrafish embryos from 24 to 96 hpf using a dig-labeled phox2b RNA probe.
(a) Dorsal view of 24-hpf embryo, illustrating phox2b expression in the hindbrain and anterior spinal column. (b) Dorsal view of 48-hpf embryo. hb, hindbrain; mo, medulla oblongata; ventral diencephalon (filled arrowhead), locus coeruleus (open arrowhead). (c) Lateral view of 48-hpf embryo. Rhombomeres of the hindbrain are numbered; mo, medulla oblongata; locus coeruleus (open arrowhead); cranial ganglia (black arrow). (d) Lateral view of the trunk of a 96-hpf embryo. Spinal cord (open dotted arrow) and ENS (open arrows).)

They uncovered a total of 17 discrete DNA segments that had the ability to make fish glow in the right cells. The team then analyzed the entire region around the phox2b gene using the five commonly used computer programs that compute sequence conservation; these established methods picked up only 29 percent to 61 percent of the phox2b regulators identified in the zebrafish experiments.

The situation is further complicated by the phylogenetic distances that exist among living organisms. It can be safely assumed that although a particular regulatory sequence is constrained as to what its exact composition (of base pairs) can be, there is also room for variation, some random, some adaptive. This variation can be expected to increase, on average, with phylogenetic distance between any two organisms. Applying this sort of statistical probing technique, if based on one set of organisms (say, one species of zebrafish), to closely related organisms (say, some other species of zebrafish) should work very well. But applying the same search method to more distantly related fish, such as fugu (puffer fish), would be different. These two types of fish separated about 350 million years ago, so there is a cumulative 700 million years of “evolutionary time” separating them. Using data derived from zebrafish to probe mammalian genomes would be even more extreme.

According to one of the paper’s authors:
The problem with this approach … is that it’s often throwing the baby out with the bath water. So while we believe sequence conservation is a good method to begin finding regulatory elements, to fully understand our genome we need other approaches to find the missing regulatory elements.

Our data supports the recent NIH encyclopedia of DNA elements project, which suggests that many DNA sequences that bind to regulatory proteins are in fact not conserved. I hope this pilot shows that these types of analyses can be worthwhile, especially now that they can be done quickly and easily in zebrafish.
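As a rough illustration of what scanning for sequence conservation involves, the Python sketch below slides a window along two already-aligned sequences and flags windows whose percent identity meets a threshold. It is a deliberately simplified stand-in and not one of the five programs used in the study; the toy sequences, the window size, and the threshold are all invented for the example.

```python
from typing import List, Tuple


def window_identity(seq_a: str, seq_b: str, window: int) -> List[float]:
    """Fraction of matching positions in each sliding window.

    Assumes seq_a and seq_b are already aligned: equal length, with '-'
    marking gaps. A gap never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or window > len(seq_a):
        raise ValueError("window must be between 1 and the alignment length")
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(seq_a.upper(), seq_b.upper())
    ]
    # Keep a running count so each step adds one column and drops one.
    running = sum(matches[:window])
    scores = [running / window]
    for i in range(window, len(matches)):
        running += matches[i] - matches[i - window]
        scores.append(running / window)
    return scores


def conserved_windows(
    seq_a: str, seq_b: str, window: int, threshold: float
) -> List[Tuple[int, float]]:
    """Return (start, identity) for windows at or above the threshold."""
    return [
        (start, score)
        for start, score in enumerate(window_identity(seq_a, seq_b, window))
        if score >= threshold
    ]


# Invented toy alignment standing in for two distantly related fish.
zebrafish_like = "ACGTACGTTTGACA"
fugu_like = "ACGTACGTAAGTCC"

hits = conserved_windows(zebrafish_like, fugu_like, window=8, threshold=0.9)
print(hits)  # [(0, 1.0)]
```

A functional element whose windows never reach the threshold would be invisible to a scan like this, which is the kind of miss the quote describes as throwing the baby out with the bath water.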
McGaughey, David M., Ryan M. Vinton, Jimmy Huynh, Amr Al-Saif, Michael A. Beer, and Andrew S. McCallion. Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b. Genome Res. Published December 10, 2007, doi:10.1101/gr.6929408.

Johns Hopkins Press Release