Prediction of Protein Secondary Structure Using GP
June - July, 2003
Summer Internship, Stockholm Bioinformatics Center,
Sweden
Under Dr. R. M. MacCallum
In this project we attempted to use Genetic Programming (GP) to predict the secondary structure of protein. Genetic Programming is basically a search/optimization algorithm, which attracted global attention through the book Genetic Programming: On the Programming of Computers by Means of Natural Selection, by Prof. John Koza. GP is a class of algorithms that try to evolve programs to solve a given problem. It primarily requires the ideal input and output dataset to calculate the performance of the evolving program. A typical GP would begin with a random population of programs and try to search for a solution using principles of natural evolution, such as survival of the fittest, crossovers, mutations, etc. GP is generally sensitive to parameter variation and there is no guarantee that GP would be able to solve a given problem. (However, there have been numerous publications, where GP has been able to solve NP-Complete problem, regression problems, circuit design problems, etc.)
We used the multi-aligned amino acid sequence as the input data to the GP and wanted as
output the correct secondary structure of the protein. Since the input data set is highly
dimensional, it becomes difficult for the GP to use it effectively. Therefore, we decided
to use Self Organizing Maps (SOM) to transform the initial higher dimensional data (the
multi-aligned amino acid sequence) to lower dimensional data (positions in the SOM). SOM
is a neural network tool, which uses unsupervised (competitive) learning to classify data
into positions in a map. One can choose the dimensions of the map (analogous to number of
classification parameters) and size of the map (analogous to total number of different
classes). We used GP and some simpler statistical techniques to find the optimal mapping
between SOM co-ordinates and the three
secondary structure states.