Assignment2

ASSIGNMENT 2 cont.

Self study at S - star

Page 6

i.e. Comparative Genomics

Part III: Sequence Alignment __Genome Scale
•	Single gene: thousand to tens of thousands of bases
•	Single protein: hundreds to thousand s of residues
•	Complete genomes:

Species	Total number of base pairs in genome
H. pyroli	1.6 million
E. coli	4.8 million
Baker’s yeast	12 million
C. elegans	97 million
Drosophila	137 million
Human	3.1 billion
Mouse	3.1 billion

Challenges
?	Large size of the DNA sequences to be aligned
-       Memory
       -       Speed
?	Occurrence of both short and long insertions and deletions
?	Large – scale changes such as tandem repeats and large – scale reversals
?	High degree of divergence in the third position of codons

A Suffix tree – based Method
?	AL Delcher, et. al., 1999, Nucleic Acids Research
?	Designed for fast alignment of large, closely – related sequences
-       Assumption: there is a mapping between large subsequences of two inputs.
?	Aligned two ~ 4 Mb genome sequences of two tuberculosis strains in less than 1 min of computation time.
There are Three Steps:
1.	Identify all Maximum Unique Matches (MUMs)
2.	Extract the longest set of matches that occurr in the same order in both genomes
3.	Close the local gaps by identifying inserts, repeats, tandem repeats, small mutated regions, and SNPs
Step 1: Identify All MUMs
?	A Maximum Unique Match is a subsequence that occurs exactly once in GenomeA and once in Genome B, and is not contained in any longer such sequence.
?	A na?ve algorithm is O (n^3), where n is the sum of the length of Genome A and B.
?	Use the suffix tree data structure for efficiency ? O(n) for both run time and space
-      A generous upper bound for space: 37 bytes per base.
-      << 8 Gb of memory for comparison of two 100 Mb sequences.
Suffix Tree
?	A suffix tree is a compact representation that stores all possible suffixes of an input sequemce S.
?	A suffix tree is a subsequence that begins at any position in the sequence and extends to the end of the sequence.