Locating Low Complexity Regions in Biological Sequences

Wang, Tsai-Cheng

Low complexity regions in a sequence are located by observing the composition of short motifs within a fixed-size window. Counting the occurrences of each motif over all positions in a window gives that window's motif distribution. This distribution can be represented as a point in a multidimensional space in which each axis corresponds to one type of motif and the coordinate along an axis is the number of occurrences of its motif. The distance from this point to the General Reference Point (GPR) therefore reflects the level of complexity. A window is of low complexity when this distance goes above (or below, depending on the chosen GPR) a threshold. For biological sequences, a run of a few consecutive low complexity windows may suggest an effective length of low complexity; the “noise” on both ends of the found subsequence is then truncated to obtain the true region.
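
The windowed scan described above can be sketched in Python. This is only an illustration under stated assumptions, not the code behind the executables offered here: the motif length k, the use of the origin as the reference point, the Euclidean distance, and all function names are hypothetical choices. With the origin as reference point, a window dominated by a few motifs lies farther away than a window with many different motifs, so low complexity corresponds to a distance above the threshold.

```python
from collections import Counter
from math import sqrt


def motif_counts(window, k):
    """Count every overlapping motif of length k in the window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))


def distance_to_origin(counts):
    """Euclidean distance from the motif-count point to the origin.

    The origin is an assumed reference point. Axes for motifs that never
    occur contribute zero, so only observed motifs need to be summed.
    """
    return sqrt(sum(c * c for c in counts.values()))


def low_complexity_windows(seq, w, k, threshold):
    """Return start positions of windows whose distance exceeds the threshold."""
    hits = []
    for start in range(len(seq) - w + 1):
        d = distance_to_origin(motif_counts(seq[start:start + w], k))
        if d > threshold:
            hits.append(start)
    return hits


def merge_windows(starts, w):
    """Merge overlapping or adjacent windows into (start, end) regions.

    `starts` must be sorted in ascending order; `end` is exclusive.
    """
    regions = []
    for s in starts:
        if regions and s <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], s + w))
        else:
            regions.append((s, s + w))
    return regions
```

For example, merge_windows(low_complexity_windows(seq, 16, 3, 6.29), 16) would list candidate regions for a window size of 16, assuming a motif length of 3. The sketch stops at merging consecutive windows; trimming the “noise” at both ends of each region is not shown.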

View the detailed results for NT_011516 on chromosome 22 of Homo sapiens for various window sizes (w) and thresholds (v).

	w	v*
View1	16	6.29
View2	32	9.37
View3	48	11.95
View4	64	14.33

*: This value is 5 standard deviations above the mean obtained on a randomly generated sequence.
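
A threshold of this kind could be estimated along the following lines. This is a rough sketch under assumptions: the motif length k, the uniform four-letter alphabet, the sequence length, the origin as reference point, and the Euclidean distance are all hypothetical, so the result is not expected to reproduce the values in the table.

```python
import random
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def window_distance(window, k):
    """Euclidean distance from the window's motif-count point to the origin."""
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    return sqrt(sum(c * c for c in counts.values()))


def estimate_threshold(w, k, length=10000, alphabet="ACGT", n_sd=5.0, seed=0):
    """Return mean + n_sd * standard deviation of the window distances
    measured on one uniformly random sequence (assumed background model)."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    seq = "".join(rng.choice(alphabet) for _ in range(length))
    distances = [window_distance(seq[i:i + w], k)
                 for i in range(len(seq) - w + 1)]
    return mean(distances) + n_sd * stdev(distances)
```

Calling estimate_threshold(16, 3) gives one such estimate for w = 16. A longer random sequence, or an average over several seeds, would make the estimate more stable.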

Download my executables for nucleic acids or amino acids compiled for Linux (RedHat 9.0).

View the How-to-Use for more details.