ASSIGNMENT 2 cont.
Self study at S - star
i.e. Comparative Genomics
Page 7
? Concatenate the two genomes into one sequence separated by a dummy character. ? Use McCreight’s algorithm to build suffix tree in linear time. - Clever use of sets of pointers ? Label each leaf node to indicate which suffix it represents in which genome. ? Identify Maximum Unique Matches in one scan - Every unique matching sequence is represented by an internal node with exactly two child leaf nodes, one from each genome. - Unique matches that are maximal can be identified by mismatches at their ends ? Identify MUMs on both DNA strands Step 2: Sorting the MUMs ? Set length of the shortest of MUM. - e.g., 50 for highly similar genomes, 20 for similar ones ? Sort the MUMs according to their position in Genome A ? Use a variation of Longest Increasing Subsequence algorithm. ? Run time O(KlogK), where K = number of MUMs
Step 3: Closing the Gap – four classes 1. Repeats ? In their study, most repeats are tendem repeats. All their tendem repeats were adjacent to unique sequence. ? Can be identified by MUMs overlapping each other 2. SNPs ? Simple cases: gap of one base between MUMs. ? SNP adjacent to repeat sequences: use repeat processing 3. Insertions ? Simple insertion: large gap in alignment in one genome but not the other ? Transposition: appear in MUM alignment out of sequence. 4. Variable/ Polymorphic regions: ? Appear as gap in Mum alignment ? If short, use Smith – Waterman dynamic programming ? If long, run MuM detection with reduced minimum length Computation time vs. size and similarity LengthSequence similarity Step 1 Step 2 Step 3 (# sec) (# sec) (# sec) M. tuberculosis H37Rv 4 Mb vs. 99% identical 5 45 5 M. tuberculosis CDC155 14 Mb M. genitalium 580 Kb 20 Kb in MUMs of vs. > 15 b; < 50 % id in 6.5 0.02 116 M. pneumonia 816 Kb gap regions Subsequences of Human 223 Kb 14 Kb in MUMs of chromosome 12p13 vs. > 15 b; 1.6 29 ? Mouse chromosome 6 228 Kb Large gaps