The C-value Paradox
If we examine the genomes of prokaryotes such as the eubacteria, we find that they are quite compact, with very little "unused" DNA. Eukaryotic genomes vary markedly in their DNA content (C-value), but they all contain far more DNA than they require (on theoretical grounds, at least) to carry out all of their functions.
The C-value is the total amount of DNA in the genome.
This phenomenon of apparently having an excess of DNA in the cell is known as the C-value paradox.
Looking at some haploid genome sizes (see table):
Haploid Genome Sizes for Selected Organisms
Organism | kb | bp |
E. coli | 4600 | 4.6 x 10^6 |
S. cerevisiae | 13500 | 1.3 x 10^7 |
D. melanogaster | 140000 | 1.4 x 10^8 |
H. sapiens | 3300000 | 3.3 x 10^9 |
S. American lungfish | 102000000 | 1.02 x 10^11 |
Flowering plants | | 10^7 - 10^11 |
Amphibia | | 10^9 - 10^11 |
For higher eukaryotes, an estimate of 3 kb per gene x 10^5 genes = 3 x 10^8 bp needed in the haploid genome.
Range of DNA Content of Haploid Genomes of Representative Groups of Organisms
Group | Approximate size of haploid genome (bp) |
Viruses & Bacteria | 10^5 - 5 x 10^6 |
Algae & Fungi | 5 x 10^6 - 5 x 10^7 |
Worms | 5 x 10^7 - 10^8 |
Insects | 10^8 - 5 x 10^9 |
Echinoderms | 5 x 10^8 - 5 x 10^9 |
Fish | 10^9 - 5 x 10^10 |
Amphibians | 5 x 10^9 - 5 x 10^11 |
Birds | 5 x 10^9 - 10^10 |
Reptiles | 5 x 10^9 - 5 x 10^10 |
Mammals | 5 x 10^9 - 5 x 10^10 |
Flowering plants | 5 x 10^9 - 10^12 |
- One estimate: if we take 3 kb to be a generous figure for the average size of a human gene, and assume that approximately 10^5 genes are required for all functions, then about 3 x 10^8 bp are needed. Humans have about 3 x 10^9 bp in the haploid genome, so there is a roughly ten-fold excess of DNA. This is one aspect of the C-value paradox: the 'excess' DNA in genomes.
- Note the range of C-values within the amphibia: there are large variations in C-value within taxa whose complexity does not apparently vary very much. The amphibia do not apparently show a 100-fold variation in complexity.
- Note also that flowering plants have the widest range of C-values.
- In short, there is a relationship between evolutionary increases in complexity and minimum C-values, but the maximum values for the same group can be several orders of magnitude higher.
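The arithmetic behind the 'excess DNA' estimate can be checked with a short Python sketch. It uses only figures quoted in these notes (3 kb per gene, 10^5 genes, a human haploid genome of 3,300,000 kb, and an amphibian range of 10^9 to 10^11 bp); the variable and function names are illustrative, not part of any standard library.

```python
# Figures quoted in the notes; names are illustrative only.
AVG_GENE_BP = 3_000                    # 3 kb per gene (a generous estimate)
GENE_COUNT = 10**5                     # approximate number of genes assumed
HUMAN_HAPLOID_BP = 3_300_000 * 1_000   # 3,300,000 kb from the genome size table


def required_bp(avg_gene_bp, gene_count):
    """DNA needed if every gene had the average size and nothing else was present."""
    return avg_gene_bp * gene_count


def fold_excess(genome_bp, needed_bp):
    """How many times larger the genome is than the estimated requirement."""
    return genome_bp / needed_bp


needed = required_bp(AVG_GENE_BP, GENE_COUNT)
excess = fold_excess(HUMAN_HAPLOID_BP, needed)

# Spread of C-values within the amphibia, using the 10^9 - 10^11 bp range.
amphibian_fold_range = 10**11 // 10**9

print(f"DNA required: {needed:.1e} bp")
print(f"Human genome is about {excess:.0f} times larger than required")
print(f"Amphibian C-values span a {amphibian_fold_range}-fold range")
```

The result depends entirely on the assumed gene size and gene number, so it should be read as an order-of-magnitude illustration of the paradox rather than a measurement.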
Another way of stating the C-value paradox:
What is the excess DNA? Junk? A lot of it probably is.
This takes us on to look at the ways that the genomic DNA of eukaryotes can be classified.
(1) DEGREE OF REPETITION
(a) Unique sequences - unique genes, coding sequences.
(b) Moderately repetitive DNA - 10^2 to 10^4 copies per haploid genome, e.g. SINEs (500 bp) and LINEs (5-7 kb).
(c) Highly repetitive DNA - 10^4 to 10^6 copies per haploid genome, 6 to 200 bp in tandem arrays, e.g. satellite DNA.
(2) EUCHROMATIN & HETEROCHROMATIN
Euchromatin contains most of the structural genes and transcribed DNA, whereas heterochromatin refers to regions of the genome that are not generally transcribed and in which the DNA is highly condensed.
(3) CODING & NON-CODING
Molecular biology and recombinant DNA technology are rendering this distinction less clear, and as we find out more about the nature of genomes at the molecular level we may as well classify the DNA into its actual major components (following).
Fig of rRNA gene cluster
(a) GENES
Most protein coding genes are represented once per haploid genome. These are unique or single copy genes.
(b) GENE FAMILIES
Some genes are maintained in multiple copies known as gene families. A gene family is a set of genes descended by duplication and variation from an ancestral gene. Members of a family are not necessarily identical, but they share DNA sequence homology and their gene products are functionally related, e.g. the immunoglobulin genes.
The members of a gene family may be clustered together in long tandem arrays of adjacent repeats e.g. the rRNA genes, the histone genes and the globin gene family. In the DNA containing the rRNA and histone genes, the stretch of DNA encompassing the genes is repeated several hundred times, including the non-transcribed (and non-translated) regions.
Fig 17.19 Concepts (Alpha and Beta Haemoglobin Gene Families)
e.g. several thousand rRNA genes in Zea mays are clustered on chromosome 6, whereas the 50-200 rRNA genes in man are scattered over 5 chromosomes.
Gene families clustered in tandem arrays are highly conserved.
Single copies of clustered genes may be found elsewhere in the genome and are called ORPHONS.
E.g. in the sea urchin Lytechinus pictus, the cluster of histone genes occurs in the order H1, H4, H2B, H3, H2A, repeated several hundred (300-600) times (there is an extremely rapid rate of cell division during development). However, 5-20 copies of each histone gene also occur as orphons. The difference is not totally clear, but clustered tandem arrays tend to be expressed in early development.
Other examples of gene families are:
(i) Mouse class I genes of the H-2 (histocompatibility) locus - 36 members in 13 clusters over 837 kb of DNA.
(ii) The globin gene family, e.g. the α-globin family (ζ expressed in the embryo, α1 & α2 expressed in adults). Two others are PSEUDOGENES (ψ).
The human α-globin gene family consists of 5 genes on the short arm of chromosome 16. Zeta (ζ) is expressed only in the embryo; there are two non-functional ψgenes and 2 copies of the α-globin gene (α1, α2, or αA1 and αA2) expressed during foetal and adult stages.
This family spans over 30 kb of which very little codes for polypeptides.
The beta-globin family is similar: only 5% of the 60 kb region on the short arm of chromosome 11 consists of coding sequences. There are 6 genes: one embryonic, two foetal and two expressed after birth, each gene coding for a 146 aa polypeptide. The remainder of the DNA has no known function (C-value paradox).
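As a rough consistency check on the 5% figure, the coding content of the beta-globin region can be estimated from the numbers quoted above (6 genes, 146 amino acids each, 60 kb region). This is a minimal sketch: it counts three bases per amino acid and ignores start and stop codons and untranslated regions, and the names are illustrative.

```python
REGION_BP = 60_000        # size of the beta-globin region quoted in the notes
GENE_COUNT = 6            # genes in the family, as quoted
POLYPEPTIDE_AA = 146      # amino acids per polypeptide, as quoted
BASES_PER_CODON = 3


def coding_bases(gene_count, aa_per_gene):
    """Bases needed to encode the polypeptides (start/stop codons ignored)."""
    return gene_count * aa_per_gene * BASES_PER_CODON


coding_bp = coding_bases(GENE_COUNT, POLYPEPTIDE_AA)
coding_pct = 100 * coding_bp / REGION_BP

print(f"Coding sequence: {coding_bp} bp")
print(f"Share of the 60 kb region: {coding_pct:.2f}%")
```

Under these simplifications the coding share comes out a little under 5%, in line with the statement that very little of the region codes for polypeptides.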
(c) GENE SUPERFAMILIES
Gene superfamilies are groups of gene families which do not share identical functions. Their members are not highly homologous, but are related through shared sub-domains of the proteins they encode.
The families within a superfamily may appear to be totally unrelated, i.e. to be distinct gene families, but they are related by sharing the same protein sub-domains. The final gene products may differ, with different functions within (or outside) the cell. The overall base sequences are not highly homologous, but the base sequence for a protein domain may be very similar (i.e. they probably share an exon). Such gene families are considered to be members of a gene superfamily and may have evolved from a common ancestral gene. The proteins are related in structure and their genes are related in organisation.
(i) The mammalian immunoglobulins - 2 families of Ig light chains, κ and λ, and one family of heavy chains (H); each family has both variable (V) and constant (C) genes. There are hundreds of V-genes in man.
(ii) Class I, class II histocompatibility antigens, T lymphocyte receptors all share common sub-domains.
These sets of families of lymphocyte-specific molecules are part of a superfamily that includes families of cell surface molecules involved in adhesion between cells.
(d) PSEUDOGENES
Pseudogenes (ψgenes) share strong base sequence similarities with functional genes but are NON-FUNCTIONAL. ψgenes can be found for many of the different types of genes we have mentioned. The small pol III genes (tRNA, scRNA, 5S rRNA) often have hundreds of ψgenes, as do the Ig genes.
E.g. the U1 snRNA genes have ~50-100 functional copies per genome and 500-1000 ψgenes.
- in the 7SL RNA gene family there are 4 active genes in mammals and several hundred ψgenes.
The chicken lambda light chain locus has 25 V (variable) ψgenes.
Most protein coding genes rarely have more than 20 ψgenes.
Many ψgenes are similar to the original gene from which they probably arose by tandem duplication. They show similarities in the flanking sequences, exons, introns etc. If the duplication does not confer a selective advantage, then one of the copies is free to mutate. If it mutates so as to become non-functional, it becomes a ψgene. A gene could be inactivated by mutations that prevent any or all of the stages of gene expression, e.g. abolishing transcription signals, preventing splicing or inducing early termination of translation.
ψgenes are probably selectively neutral, or nearly so, and can accumulate mutational changes rapidly, possibly leading to a different functional gene over evolutionary time. The ψgenes in the α-globin locus, ψζ1 and ψα1, probably diverged from functional genes ~45 million years ago.
Processed pseudogenes
Some ψgenes have no flanking sequences or introns. These represent a DNA copy of a processed transcript rather than a mutational deviation from a parent gene. They could have arisen by the action of the enzyme reverse transcriptase and insertion of the resulting DNA sequence back into the genome. These ψgenes cannot be transcribed because they lack functional promoters.
e.g. the mouse ψα3 globin gene lacks both introns but is still very similar to α-globin mRNA.
These are known as processed ψgenes or sometimes as RETROPOSONS (DNA sequences mobilised via an RNA form, e.g. retroviruses). A source of reverse transcriptase (RT) in eukaryotic cells is the presence of ENDOGENOUS RETROVIRUSES, and some TRANSPOSONS code for RT and integrase.
(e) SINEs (SHORT INTERSPERSED ELEMENTS)
~500 bp, up to 10^5 copies. There are several families of SINEs; some are transcribed by pol III, such that 1-5% of the hnRNA (the population of primary transcripts in the nucleus of a eukaryotic cell, with wide size variation) may be SINE transcripts. The best characterised SINEs are the Alu I family in primates and the related B1 family in rodents.
The Alu I sequences in man may number around 3-5 x 10^5 copies in the haploid genome, each ~282 nt long, totalling 3-6% of total DNA. They possess the recognition sequence for the restriction enzyme Alu I, hence the name. A recent theory based on sequence similarity is that they might be copies of a processed 7SL RNA (pol III). These Alu sequences have been inserting into the genome over a long period of time. There is an Alu sequence in the beta-globin gene cluster in exactly the same position in man and in chimpanzees. Alu sequences are also found in introns and elsewhere.
Other SINEs: e.g. the 250 bp EC1 element is found in nearly all eukaryotes and seems to act as a hotspot for recombination and gene conversion.
SINEs are similar to retroposons originating from pol III transcripts.
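The share of the genome occupied by Alu sequences can be estimated from the copy number and unit length quoted above. The sketch below assumes a human haploid genome of 3,300,000 kb (the kb figure in the genome size table); that choice, and the names used, are assumptions for illustration.

```python
ALU_LENGTH_NT = 282                      # approximate Alu length quoted in the notes
ALU_COPIES_LOW = 300_000                 # 3 x 10^5 copies
ALU_COPIES_HIGH = 500_000                # 5 x 10^5 copies
HUMAN_HAPLOID_BP = 3_300_000 * 1_000     # assumed genome size (3,300,000 kb)


def repeat_total_bp(copies, unit_length):
    """Total DNA contributed by a repeat family."""
    return copies * unit_length


def percent_of_genome(total_bp, genome_bp):
    """Repeat DNA as a percentage of the haploid genome."""
    return 100 * total_bp / genome_bp


low_bp = repeat_total_bp(ALU_COPIES_LOW, ALU_LENGTH_NT)
high_bp = repeat_total_bp(ALU_COPIES_HIGH, ALU_LENGTH_NT)
low_pct = percent_of_genome(low_bp, HUMAN_HAPLOID_BP)
high_pct = percent_of_genome(high_bp, HUMAN_HAPLOID_BP)

print(f"Alu DNA: {low_bp:.2e} to {high_bp:.2e} bp")
print(f"Share of genome: {low_pct:.1f}% to {high_pct:.1f}%")
```

With these inputs the estimate falls at the lower end of the 3-6% range quoted in the notes; a smaller assumed genome size or a higher copy number moves it upwards.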
(f) LINEs (LONG INTERSPERSED ELEMENTS)
~5-7 kb. There is only a single LINE family in primates, the Kpn family (named after a restriction enzyme site), and a single family in rodents which is related to Kpn. Both are now known as the L1 family: 6400 bp, 2-5 x 10^4 copies in mammalian genomes. Similar to retroposons. These LINEs seem to be transcribed at a low level by pol II.
Retroposons originating from pol II transcripts. A reverse transcriptase sequence has been detected, related to pol of retroviruses.
Fig of the human beta-globin locus showing Alu sequences, LINEs, ψgenes etc.
Only a very small proportion of genome in euks codes for proteins, ~10% in sea urchin, 5-10% in Drosophila, 1-2% in humans.
(g) TRANSPOSONS
SINEs and LINEs have clearly moved around the genome over evolutionary time, as have processed ψgenes. Transposons are mobile genetic elements that can move around the genome within a single generation of an organism.
There are a number of different classes, some of which are related to retroviruses and are termed RETROTRANSPOSONS.
- duplex DNA sequence in euk genome corresponding to the genome of an RNA retrovirus, no longer functional.
Two classes of retroposons are distinguished, the viral superfamily and the nonviral superfamily.
Property | Viral Superfamily | Nonviral Superfamily |
Common types | Ty (S. cerevisiae), copia (D. melanogaster), LINEs L1 (mammals) | SINEs B1 / Alu (mammals), processed ψgenes of pol III transcripts |
Termini | Long terminal repeats (LTRs) | No repeats |
Target repeats | 4-6 bp | 7-21 bp |
Reading frames | Reverse transcriptase and/or integrase | None |
Organisation | May contain introns | No introns |
VNTRs
Another type of tandem repeat is the variable number of tandem repeats (VNTR), or minisatellite, sequence. The repeating DNA sequence in a VNTR may be 15 to 100 bp long.
- moderately repetitive DNA; tandem = head to tail
- used in DNA fingerprinting (DNA typing): differences between individuals result from variation in the number of tandem repeats, and hence in the overall length of the unit at a particular locus. (Fig 15.27, Russell)
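The link between repeat number and fragment length at a VNTR locus can be sketched in a few lines of Python. The locus below is entirely hypothetical: a 30 bp repeat unit (within the 15 to 100 bp range given above), 200 bp of flanking DNA, and two alleles with 8 and 12 repeats. The function name is illustrative only.

```python
def vntr_fragment_length(repeat_unit_bp, repeat_count, flanking_bp):
    """Length of a fragment spanning a VNTR locus: flanks plus the tandem array."""
    if repeat_count < 0 or repeat_unit_bp <= 0 or flanking_bp < 0:
        raise ValueError("lengths and counts must be non-negative, unit length positive")
    return flanking_bp + repeat_unit_bp * repeat_count


# Hypothetical locus and alleles, chosen only for illustration.
REPEAT_UNIT_BP = 30
FLANKING_BP = 200
allele_repeat_counts = {"allele A": 8, "allele B": 12}

fragment_lengths = {
    name: vntr_fragment_length(REPEAT_UNIT_BP, count, FLANKING_BP)
    for name, count in allele_repeat_counts.items()
}

for name, length in fragment_lengths.items():
    print(f"{name}: {length} bp")
```

Because the two alleles differ only in repeat number, their fragments differ in length by a whole multiple of the repeat unit, which is what makes the locus informative in DNA typing.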
Satellite DNA
- highly repetitive DNA consisting of very short sequences repeated many times in tandem in large clusters. Found in heterochromatic regions of chromosomes, e.g. centromeric DNA.
- the % GC content differs from that of the majority of DNA in the genome, so it forms a satellite band (or bands) in CsCl density gradient centrifugation.
e.g. in D. virilis
Satellite | Predominant sequence | Total Length bp | % Genome |
I | ACAAACT | 1.1 x 10^7 | 25 |
II | ATAAACT | 3.6 x 10^6 | 8 |
III | ACAAATT | 3.6 x 10^6 | 8 |
Cryptic | AATATAG |
NB satellites I, II and III are related 7 bp sequences. Satellites may show unrelated sequences in other genomes.
Satellites evolve by lateral amplification from a single sequence to give a large number of tandem copies.
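How closely the D. virilis satellite sequences are related, and roughly how many tandem copies each total length implies, can be worked out directly from the table. The sketch below compares the 7 bp units position by position (a simplification, since tandem repeats could also be compared in other registers) and divides total length by unit length; the names are illustrative.

```python
from itertools import combinations

# Repeat units and total lengths as given in the D. virilis table.
SATELLITES = {"I": "ACAAACT", "II": "ATAAACT", "III": "ACAAATT", "Cryptic": "AATATAG"}
TOTAL_LENGTH_BP = {"I": 11_000_000, "II": 3_600_000, "III": 3_600_000}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def approx_copies(total_bp, unit):
    """Approximate number of tandem copies implied by a total length."""
    return total_bp // len(unit)


distances = {
    (p, q): hamming(SATELLITES[p], SATELLITES[q])
    for p, q in combinations(SATELLITES, 2)
}
copies = {
    name: approx_copies(total_bp, SATELLITES[name])
    for name, total_bp in TOTAL_LENGTH_BP.items()
}

for (p, q), d in distances.items():
    print(f"{p} vs {q}: {d} difference(s)")
for name, n in copies.items():
    print(f"Satellite {name}: ~{n:,} copies")
```

Satellites I, II and III differ from one another at only one or two positions, which fits the idea that they arose from a single sequence by amplification followed by mutation.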