ASSIGNMENT 2 cont.
Self study at S - star
Page 6
i.e. Comparative Genomics
Part III: Sequence Alignment __Genome Scale Single gene: thousand to tens of thousands of bases Single protein: hundreds to thousand s of residues Complete genomes: Species Total number of base pairs in genome H. pyroli 1.6 million E. coli 4.8 million Bakers yeast 12 million C. elegans 97 million Drosophila 137 million Human 3.1 billion Mouse 3.1 billion Challenges ? Large size of the DNA sequences to be aligned - Memory - Speed ? Occurrence of both short and long insertions and deletions ? Large scale changes such as tandem repeats and large scale reversals ? High degree of divergence in the third position of codons A Suffix tree based Method ? AL Delcher, et. al., 1999, Nucleic Acids Research ? Designed for fast alignment of large, closely related sequences - Assumption: there is a mapping between large subsequences of two inputs. ? Aligned two ~ 4 Mb genome sequences of two tuberculosis strains in less than 1 min of computation time. There are Three Steps: 1. Identify all Maximum Unique Matches (MUMs) 2. Extract the longest set of matches that occurr in the same order in both genomes 3. Close the local gaps by identifying inserts, repeats, tandem repeats, small mutated regions, and SNPs Step 1: Identify All MUMs ? A Maximum Unique Match is a subsequence that occurs exactly once in GenomeA and once in Genome B, and is not contained in any longer such sequence. ? A na?ve algorithm is O (n^3), where n is the sum of the length of Genome A and B. ? Use the suffix tree data structure for efficiency ? O(n) for both run time and space - A generous upper bound for space: 37 bytes per base. - << 8 Gb of memory for comparison of two 100 Mb sequences. Suffix Tree ? A suffix tree is a compact representation that stores all possible suffixes of an input sequemce S. ? A suffix tree is a subsequence that begins at any position in the sequence and extends to the end of the sequence.