            Organisation of the Genome
                       The C-value Paradox
                       Classification of Genomic DNA
                  Features of the Eukaryotic Genomes
                               Genes, Families & Superfamilies; Pseudogenes; SINEs, LINEs & Transposons; Endogenous Proviruses,
                               Satellite DNA; VNTRs

The C-value Paradox
If we examine the genomes of prokaryotes such as the eubacteria, we find that bacterial genomes are quite compact, with very little "unused" DNA. Eukaryotic genomes vary quite markedly in their DNA content (C-value), but they all contain far more DNA than they appear to require (on theoretical grounds, at least) to carry out all of their functions.
The C-value is the total amount of DNA in the genome.
This phenomenon of apparently having an excess of DNA in the cell is known as the C-value paradox.
Looking at some haploid genome sizes (see table):
Haploid Genome Sizes for Selected Organisms

Organism	kb	bp
E. coli	4600	4.6 x 10⁶
S. cerevisiae	13500	1.3 x 10⁷
D. melanogaster	140000	1.4 x 10⁸
H. sapiens	3300000	3.0 x 10⁹
S. American lungfish	102000000	1.02 x 10¹¹
Flowering plants		10⁷ - 10¹¹
Amphibia		10⁹ - 10¹¹

For higher eukaryotes, an estimate of 3 kb per gene x 10⁵ genes gives 3 x 10⁸ bp needed in the haploid genome.

Range of DNA Content of Haploid Genomes of Representative Groups of Organisms

Group	Approximate Size of Haploid Genome (bp)
Viruses & Bacteria	10⁵ - 5 x 10⁶
Algae & Fungi	5 x 10⁶ - 5 x 10⁷
Worms	5 x 10⁷ - 10⁸
Insects	10⁸ - 5 x 10⁹
Echinoderms	5 x 10⁸ - 5 x 10⁹
Fish	10⁹ - 5 x 10¹⁰
Amphibians	5 x 10⁹ - 5 x 10¹¹
Birds	5 x 10⁹ - 10¹⁰
Reptiles	5 x 10⁹ - 5 x 10¹⁰
Mammals	5 x 10⁹ - 5 x 10¹⁰
Flowering plants	5 x 10⁹ - 10¹²

- One estimate: if we take 3 kb as a generous estimate of the average size of a human gene, and assume that approximately 10⁵ genes are required for all functions, then about 3 x 10⁸ bp are needed. Since humans have about 3 x 10⁹ bp in the haploid genome, there is a ten-fold excess of DNA in the genome.
This is one aspect of the C-value paradox: the ‘excess’ DNA in genomes.
- Note the range of C-values within the amphibia: there are large variations in C-values within taxa whose complexity does not apparently vary very much. The amphibia do not apparently show a 100-fold variation in complexity.
- Note also that flowering plants have the widest range of C-values.
- In short, there is a relationship between evolutionary increases in complexity and minimum C-values, but the maximum values for the same group can be several orders of magnitude higher.
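The arithmetic behind these points can be sketched in a few lines of Python. This is only an illustration using figures quoted in these notes (3 kb per gene, 10⁵ genes, the 3.0 x 10⁹ bp human value and the group ranges from the tables); the names `required_bp` and `fold_range` are ours, not standard terms.

```python
# Figures quoted in the notes; all names here are illustrative.
AVG_GENE_BP = 3_000        # generous average gene size (3 kb)
GENES_NEEDED = 100_000     # ~10^5 genes assumed for a higher eukaryote
HUMAN_HAPLOID_BP = 3.0e9   # human haploid genome size from the table

# Approximate haploid genome size ranges (bp) for three groups.
GROUP_RANGES = {
    "Amphibians": (5e9, 5e11),
    "Mammals": (5e9, 5e10),
    "Flowering plants": (5e9, 1e12),
}

def required_bp(gene_bp=AVG_GENE_BP, n_genes=GENES_NEEDED):
    """DNA needed if the genome held nothing but genes of this size."""
    return gene_bp * n_genes

def fold_range(low, high):
    """How many times larger the largest genome in a group is than the smallest."""
    return high / low

excess = HUMAN_HAPLOID_BP / required_bp()
print(f"required: {required_bp():.1e} bp, human excess: {excess:.0f}-fold")
for group, (low, high) in GROUP_RANGES.items():
    print(f"{group}: {fold_range(low, high):.0f}-fold range")
```

On these figures the amphibia span a 100-fold range while the mammals span only 10-fold, which is the within-taxon variation noted above.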
Another way of stating the C-value paradox:
What is the excess DNA? Junk? A lot of it probably is.
This takes us on to look at the ways that the genomic DNA of eukaryotes can be classified.

Classification of Genomic DNA

(1) DEGREE OF REPETITION

(a) Unique sequences - unique genes, coding sequences
(b) Moderately repetitive DNA - 10² to 10⁴ copies per haploid genome, e.g. SINEs (500 bp) and LINEs (5-7 kb).
(c) Highly repetitive DNA - 10⁴ to 10⁶ copies per haploid genome, 6 to 200 bp in tandem arrays, e.g. satellite DNA.

(2) EUCHROMATIN & HETEROCHROMATIN

Euchromatin contains most of the structural genes and is the transcribed DNA, whereas heterochromatin refers to regions of the genome that are generally not transcribed and in which the DNA is highly condensed.

(3) CODING & NON-CODING

Molecular biology and recombinant DNA technology are rendering this distinction less clear and as we find out more about the nature of genomes at the molecular level we may as well classify the DNA into its actual major components (following).

(a) GENES
Most protein coding genes are represented once per haploid genome. These are unique or single copy genes.
(b) GENE FAMILIES
Some genes are maintained in multiple copies known as gene families. These are a set of genes descended by duplication and variation from an ancestral gene. Members of a family are not necessarily identical, but share DNA sequence homology and their gene products are functionally related, e.g. the immunoglobulin genes.
The members of a gene family may be clustered together in long tandem arrays of adjacent repeats e.g. the rRNA genes, the histone genes and the globin gene family. In the DNA containing the rRNA and histone genes, the stretch of DNA encompassing the genes is repeated several hundred times, including the non-transcribed (and non-translated) regions.

Fig of rRNA gene cluster

Fig 17.19 Concepts (Alpha and Beta Haemoglobin Gene Families)

e.g. several thousand rRNA genes in Zea mays are clustered on chromosome 6, whereas the 50-200 rRNA genes in man are scattered over 5 chromosomes.
Gene families clustered in tandem arrays are highly conserved.
Single copies of clustered genes may be found elsewhere in the genome and are called ORPHONS.

E.g. in the sea urchin Lytechinus pictus, the cluster of histone genes occurs in the order H1, H4, H2B, H3, H2A, repeated several hundred (300-600) times (reflecting the extremely rapid rate of cell division during development). However, 5-20 copies of each histone gene occur as orphons. The difference is not totally clear, but clustered tandem arrays tend to be expressed in early development.

Other examples of gene families are:

(i) Mouse class I genes of the H-2 (histocompatibility) locus - 36 members in 13 clusters over 837 kb of DNA.
(ii) The globin gene family, e.g. the α-globin family (E expressed in embryo, A1 & A2 expressed in adults). Two others are PSEUDOGENES (ψ).
The human α-globin gene family consists of 5 genes on the short arm of chromosome 16. Zeta is expressed only in the embryo; there are two non-functional pseudogenes (ψ) and 2 copies of the α-globin gene (α₁, α₂, or αA₁ and αA₂) expressed during foetal and adult stages.
This family spans over 30 kb of which very little codes for polypeptides.
The beta-globin family is similar, only 5% of the 60 kb region on the short arm of chromosome 11 consists of coding sequences. There are 6 genes: one embryonic, two foetal, two expressed after birth, each gene coding for a 146 aa polypeptide. The remainder of the DNA has no known function (C-value paradox).
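As a rough check on the "only 5%" figure, the coding content of the beta-globin region can be estimated from the numbers given above. The sketch below is an illustration only: it assumes each gene contributes just the codons for its 146 aa polypeptide (ignoring stop codons and untranslated regions), and it counts either five or all six of the genes. The function name `coding_fraction` is ours.

```python
CODON_BP = 3             # nucleotides per codon
POLYPEPTIDE_AA = 146     # length of each polypeptide (from the notes)
REGION_BP = 60_000       # size of the beta-globin region (60 kb)

def coding_fraction(n_genes, aa_per_gene=POLYPEPTIDE_AA, region_bp=REGION_BP):
    """Fraction of the region taken up by codons for the polypeptides."""
    coding_bp = n_genes * aa_per_gene * CODON_BP
    return coding_bp / region_bp

for n in (5, 6):
    print(f"{n} genes: {100 * coding_fraction(n):.1f}% of the region is coding")
```

Either way the answer comes out at roughly 4%, the same order as the 5% quoted, so well over 90% of the region is non-coding.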

(c) GENE SUPERFAMILIES
Gene superfamilies are groups of gene families which do not share identical functions. Their members are not highly homologous, but are related through shared sub-domains of the proteins they encode.
Gene superfamilies may appear to be totally unrelated, i.e. to be distinct gene families, but may be related by sharing the same sub-domains of proteins. The final gene product may be different, having a different function within (or outside) the cell. The overall base sequences are not highly homologous, but the base sequence for a protein domain may be very similar (i.e. they probably share an exon). These gene families are considered to be members of gene superfamilies. They may have evolved from a common ancestral gene. The proteins are related in structure and their genes are related in organisation.
(i) The mammalian immunoglobulins - 2 families of Ig light chains, κ and λ, and one family of heavy chains (H); each family has both variable (V) and constant (C) genes. There are hundreds of V-genes in man.
(ii) Class I, class II histocompatibility antigens, T lymphocyte receptors all share common sub-domains.
These sets of families of lymphocyte-specific molecules are part of a superfamily that includes families of cell surface molecules involved in adhesion between cells.

(d) PSEUDOGENES
Pseudogenes share strong base sequence similarities with functional genes but are NON-FUNCTIONAL. Pseudogenes can be found for many of the different types of genes we have mentioned. The small pol III genes (tRNA, scRNA, 5S rRNA) often have hundreds of pseudogenes, as do the Ig genes.
E.g. the U1 snRNA genes have ~50-100 functional copies per genome and 500-1000 pseudogenes.
- in the 7SL RNA gene family there are 4 active genes in mammals and several hundred pseudogenes.
The chicken lambda light chain locus has 25 V (variable) pseudogenes.
Most protein coding genes rarely have more than 20 pseudogenes.
Many pseudogenes are similar to the original gene from which they probably arose by tandem duplication. They show similarities in the flanking sequences, exons, introns etc. They probably arose by duplication, and if the duplication does not confer a selective advantage then one of the copies is free to mutate. If it mutates so as to become non-functional then it becomes a pseudogene. A gene could be inactivated by mutations that prevent any or all of the stages of gene expression, e.g. abolishing transcription signals, preventing splicing or inducing early termination of translation.
Pseudogenes are probably selectively neutral or nearly so and can accumulate mutational changes rapidly, possibly leading to a different functional gene over evolutionary time. The pseudogenes in the α-globin locus, ψE1 and ψA1, probably diverged from functional genes ~45 million years ago.

Processed pseudogenes
Some pseudogenes have no flanking sequences or introns. These represent a DNA copy of a processed transcript rather than a mutational deviation from a parent gene. They could have arisen by the action of the enzyme reverse transcriptase (RT) and insertion of the resulting DNA sequence back into the genome. These pseudogenes cannot be transcribed because they lack functional promoters.
e.g. the mouse ψα3 globin gene lacks both introns but is still very similar to α-globin mRNA.
Known as processed pseudogenes or sometimes as RETROPOSONS (DNA sequences mobilised via an RNA form, as in retroviruses). A source of RT in eukaryotic cells is the presence of ENDOGENOUS RETROVIRUSES, and some TRANSPOSONS code for RT and integrase.
(e) SINEs (SHORT INTERSPERSED ELEMENTS)
~500 bp, up to 10⁵ copies. There are several families of SINEs; some are transcribed by pol III, such that 1-5% of the hnRNA (the population of primary transcripts in the nucleus of a eukaryotic cell, with wide size variation) may be SINE transcripts. The best characterised SINEs are the Alu I family in primates and the related B1 family in rodents.

In man there may be around 3-5 x 10⁵ copies of the Alu I sequence in the haploid genome, each ~282 nt long, totalling 3-6% of total DNA. They possess the recognition sequence for the restriction enzyme Alu I, hence the name. A recent theory based on sequence similarity is that they might be copies of a processed 7SL RNA (pol III). These Alu sequences have been inserting into the genome over a long period of time. There is an Alu sequence in the beta-globin gene cluster in exactly the same position in man and in chimpanzees. Alu sequences are also found in introns and elsewhere.
Other SINEs, e.g. the 250 bp EC1 element, are found in nearly all eukaryotes and seem to act as hotspots for recombination and gene conversion.
SINEs are similar to retroposons originating from pol III transcripts.
(f) LINEs (LONG INTERSPERSED ELEMENTS)
~ 5-7 kb. Only a single LINE family in primates, the Kpn family (restriction enzyme site). A single family in rodents which is related to Kpn. Both now known as the L1 family. 6400 bp, 2-5 x 10⁴ copies in mammalian genomes. Similar to retroposons. These LINEs seem to be transcribed at low level by pol II.

Retroposons originating from pol II transcripts. Reverse transcriptase sequence detected, related to pol of retroviruses.
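The share of the genome taken up by these interspersed repeats follows directly from the unit lengths and copy numbers quoted above for Alu and L1. The sketch below is a back-of-envelope illustration: it assumes the 3.0 x 10⁹ bp human haploid genome size from the table, and it treats every L1 copy as full length (6400 bp), so the L1 figure is an upper bound if some copies are shorter. The function name `genome_fraction` is ours.

```python
HUMAN_HAPLOID_BP = 3.0e9  # assumption: human haploid genome size from the table

def genome_fraction(unit_bp, copies, genome_bp=HUMAN_HAPLOID_BP):
    """Fraction of the genome occupied by `copies` repeats of length `unit_bp`."""
    return unit_bp * copies / genome_bp

# (element, unit length in bp, low copy estimate, high copy estimate)
ELEMENTS = [
    ("Alu (SINE)", 282, 3e5, 5e5),
    ("L1 (LINE)", 6400, 2e4, 5e4),
]

for name, unit_bp, low, high in ELEMENTS:
    low_pct = 100 * genome_fraction(unit_bp, low)
    high_pct = 100 * genome_fraction(unit_bp, high)
    print(f"{name}: {low_pct:.1f}% to {high_pct:.1f}% of the haploid genome")
```

The Alu estimate (just under 3% to just under 5%) is broadly in line with the 3-6% of total DNA quoted for the Alu family.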
Fig of the human beta-globin locus showing Alu sequences, LINEs, pseudogenes etc.
Only a very small proportion of the genome in eukaryotes codes for proteins: ~10% in sea urchin, 5-10% in Drosophila, 1-2% in humans.

(g) TRANSPOSONS
SINEs and LINEs have clearly moved around the genome over evolutionary time, as have processed pseudogenes. Transposons are mobile genetic elements that can move around the genome within a single generation of an organism.
There are a number of different classes, some of which are related to retroviruses and are termed RETROTRANSPOSONS.
(h) ENDOGENOUS PROVIRUSES
- duplex DNA sequence in a eukaryotic genome corresponding to the genome of an RNA retrovirus, no longer functional.
Two classes of retroposons are distinguished, the viral superfamily and the nonviral superfamily.

	Viral Superfamily	Nonviral Superfamily
Common types	Ty (S. cerevisiae), copia (D. melanogaster), LINEs L1 (mammals)	SINEs B1 / Alu (mammals), processed pseudogenes of pol III transcripts
Termini	Long terminal repeats (LTRs)	No repeats
Target repeats	4-6 bp	7-21 bp
Reading frames	Reverse transcriptase and/or integrase	None
Organisation	May contain introns	No introns

VNTRs
Another type of tandem repeat is the variable number tandem repeat (VNTR), or minisatellite sequence. The repeating DNA sequence of a VNTR may be 15 to 100 bp long.
- moderately repetitive DNA; tandem = head to tail
- used in DNA fingerprinting (DNA typing): differences between individuals result from variation in the number of tandem repeats, and hence in the overall length of the unit at a particular locus. (Fig 15.27, Russell)
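How repeat number translates into a length difference between individuals can be sketched as follows. This is a toy illustration, not a typing protocol: the 15 bp repeat unit, the 4 bp flanks and the two alleles are made up, and the function names `max_tandem_repeats` and `allele_length` are ours.

```python
import re

def max_tandem_repeats(sequence, unit):
    """Longest run of head-to-tail copies of `unit` found in `sequence`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)

def allele_length(flank_bp, unit_bp, n_repeats):
    """Overall length at the locus: fixed flanking DNA plus n repeat units."""
    return flank_bp + unit_bp * n_repeats

# Hypothetical 15 bp repeat unit and two made-up alleles at one locus.
UNIT = "AGGTCCATGGACTTA"
allele_a = "CCGT" + UNIT * 4 + "TTAG"
allele_b = "CCGT" + UNIT * 9 + "TTAG"

for name, seq in (("A", allele_a), ("B", allele_b)):
    n = max_tandem_repeats(seq, UNIT)
    print(f"allele {name}: {n} repeats, {allele_length(8, len(UNIT), n)} bp")
```

The two alleles share the same flanking sequence and differ only in repeat number, so they differ in length by a whole number of repeat units, which is what shows up as a band shift in a fingerprint.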
Satellite DNA
- highly repetitive DNA consisting of very short sequences repeated many times in tandem in large clusters. Found in heterochromatic regions of chromosomes, e.g. centromeric DNA.
- % GC content differs from that of the majority of DNA in the genome, so satellite DNA forms a satellite band (or bands) in CsCl density gradient centrifugation.

e.g. in D. virilis

Satellite	Predominant sequence	Total Length bp	% Genome
I	ACAAACT	1.1 x 10⁷	25
II	ATAAACT	3.6 x 10⁶	8
III	ACAAATT	3.6 x 10⁶	8
Cryptic	AATATAG

NB satellites I, II and III are related 7 bp sequences. Satellites may show unrelated sequences in other genomes.
Satellites evolve by lateral amplification from a single sequence to give a large number of tandem copies.
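The relatedness of the three D. virilis satellites, their GC content and their approximate copy numbers can all be read off the table with a short calculation. The sketch below is illustrative only: sequences are compared in the register printed in the table, GC content is computed for the 7 bp unit itself, and the function names are ours.

```python
# Repeat unit and total array length (bp) for each satellite, from the table.
SATELLITES = {
    "I": ("ACAAACT", 1.1e7),
    "II": ("ATAAACT", 3.6e6),
    "III": ("ACAAATT", 3.6e6),
}

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def gc_percent(seq):
    """Percentage of G and C bases in the repeat unit."""
    return 100 * sum(base in "GC" for base in seq) / len(seq)

def tandem_copies(total_bp, unit):
    """Approximate number of repeat units in an array of total_bp."""
    return total_bp / len(unit)

reference = SATELLITES["I"][0]
for name, (unit, total_bp) in SATELLITES.items():
    print(f"{name}: {unit}, GC {gc_percent(unit):.0f}%, "
          f"{mismatches(reference, unit)} change(s) from satellite I, "
          f"~{tandem_copies(total_bp, unit):.1e} copies")
```

Satellites II and III each differ from satellite I by a single base, consistent with a common origin, and all three units are very AT-rich, which is the kind of base composition difference that lets them separate from bulk DNA on a CsCl gradient.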
