Locating Local Convergence of Short Motifs in Biological Sequences

Wang, Tsai-Cheng

Biological sequences are simple alphabetic representations of complex chemical compounds, so a sequence can conversely be viewed as a series of short patterns, called motifs. A non-uniform distribution of a given motif statistically highlights certain regions, such as CpG islands, that point to important biological features. Other short motifs that converge within local regions may likewise be associated with significant biological functions.

This program applies the hit-and-extension method to locate areas of local convergence of short motifs. A hash table is prepared in advance for the desired short motifs, so that when a scan of the entire sequence hits the hash table, the hit gives both the starting position from which to begin the “extension” and the exact motif being sought. The “extension” searches for contiguous qualified windows in both the left and right directions. Given scores for matched and mismatched motifs, a window is accepted when its cumulative score passes a threshold. The “extension” stops as soon as a disqualified window is found.
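The hit-and-extension idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's actual code: the function names are hypothetical, a window is scored by cutting it into consecutive motif-length chunks starting at the window's left edge, "passes a threshold" is taken to mean strictly greater than the threshold, and coordinates are 0-based and half-open.

```python
def build_hit_table(motifs, t=2):
    """Map each motif repeated t times to the motif itself (assumed table layout)."""
    return {m * t: m for m in motifs}


def window_score(seq, start, size, motif, match, mismatch):
    """Score one window chunk by chunk (assumption: chunks start at the window edge)."""
    b = len(motif)
    score = 0
    for i in range(start, start + size - b + 1, b):
        score += match if seq[i:i + b] == motif else mismatch
    return score


def extend(seq, hit, motif, match, mismatch, size, threshold):
    """Grow left and right from the hit, one whole window at a time."""
    left = hit
    while (left - size >= 0 and
           window_score(seq, left - size, size, motif, match, mismatch) > threshold):
        left -= size
    right = hit
    while (right + size <= len(seq) and
           window_score(seq, right, size, motif, match, mismatch) > threshold):
        right += size
    return left, right  # half-open interval [left, right)


def find_regions(seq, motifs, t=2, match=2, mismatch=-1,
                 size=32, threshold=0, min_windows=1):
    """Scan the sequence; on each hash-table hit, extend and record the region."""
    table = build_hit_table(motifs, t)
    if not table:
        return []
    key_len = len(next(iter(table)))  # assumes all motifs share one length
    regions = []
    i = 0
    while i + key_len <= len(seq):
        motif = table.get(seq[i:i + key_len])
        if motif is None:
            i += 1
            continue
        left, right = extend(seq, i, motif, match, mismatch, size, threshold)
        if (right - left) // size >= min_windows:
            regions.append((left, right, motif))
        i = max(right, i + 1)  # resume scanning after the reported region
    return regions


# Toy input: four copies of "gat" flanked by unrelated bases, window size 6.
toy = "cccccc" + "gat" * 4 + "cccccc"
print(find_regions(toy, ["gat"], size=6))
```

On the toy input, the first hit is the pair of consecutive motifs at index 6; two windows to the right qualify, and the windows of flanking bases on either side score below the threshold and stop the extension.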

Example 1 shows a local convergence from position 600351 to 600521 in gi|29807536|ref|NT_011525.5|Hs22_11682 Homo sapiens chromosome 22 genomic contig, where motif = gat, matched score = 2, mismatched score = -1, window size = 32 and threshold = 0. The last three lines are protein sequences translated from the found DNA subsequence, starting at offsets 0, +1, and +2 respectively. Note that the hash table is designed for occurrences of two consecutive motifs.


Example 1. (S=score, E=expected value, L=length, R=range, X=motif, I=identical ratio)

S=14 E=4.9e-33 L=171 R=[600351,600521] X=gat I=38%
A:75 C:1 G:44 T:51
gatagtagatagattagatagatgatagatttgatagatagattggatag
ataatagataggatagattagatagatggatagattagataggatagatt
agacagataggatagatgatagatagatagatagataattagattgatag
attaaatagatgatagatgat
BSR*IR*MIBLIBRLBR**IG*IR*MBRLBRIB*TBRIBBR*IBR*LB**IK*MIBB
IVBRLBR**I**IBWIBBR*BRLBRWIB*IG*IRQIG*MIBR*IBB*IBRLBR**M
**IB*IBBRFBR*IG*IIBRIB*IBG*IR*BRLBR*BR**IBR*IIRLIB*IBBR*
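
The three translated lines in Example 1 correspond to reading the DNA at offsets 0, +1, and +2. A minimal sketch of such a three-frame translation, assuming the standard genetic code, is given below; the names are hypothetical and this is not the program's own routine. Note that the sample output appears to print B where the standard one-letter code gives D (for example, for gat), so the letters produced here differ in that respect.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna from the given offset; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


# First twelve bases of the subsequence in Example 1, in all three frames.
prefix = "gatagtagatag"
for frame in range(3):
    print(frame, translate(prefix, frame))
```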


Table 1. Sample results for Homo sapiens chromosome 22 under various parameters.


         b    t    x    y    f    w    v
View1    1    2    1   -2    4   16    0
View2    2    2    1   -1    2   16    0
View3    3    2    2   -1    1   32    0
View4    4    2    2   -1    1   32    0


The detailed results for Homo sapiens chromosome 22 under various parameters can be viewed in Table 1, where
b = motif length (in this case, all possible motifs of the given length are considered),
t = number of consecutive motifs in the hash table,
x = score of matched motif,
y = score of mismatched NA,
f = minimum number of extended windows to be accepted as a local convergence,
w = window size,
v = threshold.
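
To make the note on b concrete: considering all possible motifs of a given length over the nucleotide alphabet means 4^b candidates (4, 16, 64 and 256 for the four rows of Table 1). The short sketch below enumerates them; the function name and the lowercase acgt alphabet are assumptions made for illustration.

```python
from itertools import product


def all_motifs(length, alphabet="acgt"):
    """Return every string of the given length over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=length)]


for b in (1, 2, 3, 4):
    print(b, len(all_motifs(b)))
```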

Download my executables for nucleic acids or amino acids compiled for Linux (RedHat 9.0).

View the How-to-Use for more details.