Protein structure prediction method based on secondary structure similarity
Technical Field
The invention relates to the fields of bioinformatics, molecular dynamics simulation, statistical learning, combinatorial optimization, and computer applications, and in particular to a protein structure prediction method based on secondary structure similarity.
Background
Since the end of the twentieth century, the life sciences have developed rapidly, and more and more researchers and research institutes have taken part in life science research. A protein is formed when 20 different kinds of amino acids are joined by dehydration condensation, with mRNA as a template, into an amino acid sequence, which then folds into a three-dimensional structure with a specific function. The secondary structure of a protein refers to regularly repeating conformations in its polypeptide chain, such as alpha helices and beta sheets. Understanding the three-dimensional structure of a protein is the basis for studying its biological function and mechanism of activity. Protein structure prediction is therefore one of the current research hotspots in bioinformatics and computer applications, and it has important guiding significance for the design of new proteins and of drug target proteins. At present, the three-dimensional structure of proteins is determined experimentally mainly by X-ray diffraction and nuclear magnetic resonance (NMR). X-ray diffraction can yield high-precision protein structures, but it is not suitable for proteins for which diffraction-quality crystals cannot be prepared. NMR does not require protein crystals, but it can only measure small proteins of fewer than 300 amino acids, and it is time-consuming and costly.
The rate of protein structure determination is much slower than that of sequence determination; in fact, only 0.2% of protein sequences have an experimentally determined structure. Predicting structure from protein sequence by computational methods is therefore very meaningful work. The Anfinsen experiments show that the structural information of a protein is contained in its sequence, which indicates that predicting structure from sequence is feasible. The mainstream protein structure prediction methods at present are homology modeling, threading, and de novo prediction. When sequence identity is high (> 50%), homology modeling and threading are the first choice; when sequence similarity is low (< 30%), these two methods are no longer applicable, and only de novo prediction can be used.
De novo protein structure prediction currently faces two major bottlenecks. The first is the deceptiveness of the energy landscape: the energy function is inaccurate, so the low-energy conformations obtained are not necessarily native conformations, and good conformations cannot be reliably selected. The second is the limited ability of the prior art to sample the conformational space, which shows up as a lack of diversity among the sampled conformations.
Therefore, current protein structure prediction methods fall short in both prediction accuracy and sampling capability and need to be improved.
Disclosure of Invention
To overcome the insufficient sampling capability and prediction accuracy of existing protein structure prediction methods, the invention provides a population-based protein structure prediction method based on secondary structure similarity, which has better sampling capability and higher prediction accuracy. The method defines a secondary structure similarity index and selects individuals with better energy and structure under the dual constraints of this index and an energy function, thereby effectively addressing the low prediction accuracy caused by the inaccuracy of the energy function.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A method for predicting protein structure based on secondary structure similarity, the method comprising the following steps:
1) setting parameters, and the process is as follows:
reading the sequence information of the target protein, the fragment library information, and the energy function, and setting the population of protein conformations $\{x_1, x_2, \ldots, x_i, \ldots, x_{NP}\}$, where NP is the population size and $x_i$ denotes the i-th individual of the population; the iteration counter is Gen, the maximum number of iterations is $G_{max}$, the crossover probability is $p_1$, the mutation probability is $p_2$, and the sequence length is L;
2) calculating the similarity of the secondary structure by the following process:
for an individual $x_i$ of the protein with sequence length L, the predicted secondary structure obtained from the PSIPRED online server is $S^{P}_{k}$, $k \in [1, L]$, and the secondary structure of the conformation obtained by the DSSP algorithm is $S^{D}_{i,k}$, $k \in [1, L]$; the secondary structure similarity of the conformation is calculated according to the formula
$$SS_i = \frac{1}{L}\sum_{k=1}^{L}\delta_k, \qquad \delta_k = \begin{cases} 1, & S^{P}_{k} = S^{D}_{i,k} \\ 0, & \text{otherwise} \end{cases}$$
that is, a score of 1 is recorded when the predicted secondary structure and the DSSP secondary structure are the same at the k-th position and a score of 0 otherwise; the scores are summed over the whole sequence length, and the sum is divided by the sequence length to obtain the secondary structure similarity $SS_i$;
3) Population initialization, the process is as follows:
performing fragment assembly on each individual of the population until the residues at all positions have been replaced once, which completes the initialization;
4) population crossover, and the process is as follows:
4.1) pairing the NP individuals two by two to form NP/2 pairs, and numbering the pairs $a_1, a_2, \ldots, a_j, \ldots, a_{NP/2}$, $j \in [1, NP/2]$;
4.2) randomly selecting one of the pairs $a_j$ and deciding according to the probability $p_1$ whether to perform crossover; if so, randomly selecting a position in the individuals, setting a random crossover length $crosslength \in [3, 10]$, and exchanging the corresponding fragments of the two individuals to obtain two new individuals;
5) population mutation, and the process is as follows:
5.1) for each individual $x_i$, $i \in [1, NP]$, performing fragment assembly according to the mutation probability $p_2$: if $random(0,1) > p_2$, fragment assembly with fragments of length 3 is performed; if $random(0,1) < p_2$, fragment assembly with fragments of length 9 is performed; the individual obtained after fragment assembly is $x_i'$;
5.2) using the energy function to calculate the energies of the individuals $x_i$ and $x_i'$ before and after assembly, obtaining $E_i$ and $E_i'$; if $E_i' < E_i$, the individual $x_i'$ is retained; if $E_i' > E_i$, the Boltzmann probability $p = \exp\{-(E_i' - E_i)/KT\}$ is used to decide whether to accept the assembled individual: if $random(0,1) < p$, the individual $x_i'$ is retained; otherwise, the individual $x_i$ is retained;
6) selecting the population by using the secondary structure similarity, and the process is as follows:
first, merging the initial population and the mutated population into a new population of size 2 × NP; then calculating the secondary structure similarity of each individual of the new population and sorting the merged population by secondary structure similarity; selecting the NP individuals with the highest secondary structure similarity as the selected population; and finally setting Gen = Gen + 1;
7) judging whether the maximum number of iterations $G_{max}$ has been reached; if so, stopping the iteration and outputting the information of the individuals of the last generation; otherwise, returning to step 4).
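Steps 2) and 6) above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of the invention: each conformation is represented only by its secondary-structure string, the DSSP assignment is assumed to have been reduced to the same alphabet as the PSIPRED prediction (for example H, E, C), and the function names `ss_similarity` and `select_by_similarity` are hypothetical.

```python
from typing import List, Sequence


def ss_similarity(predicted: str, observed: str) -> float:
    """Step 2: fraction of positions whose secondary-structure labels agree.

    `predicted` stands for the PSIPRED string, `observed` for the DSSP string
    of one conformation (assumed to use the same label alphabet).
    """
    if len(predicted) != len(observed):
        raise ValueError("both strings must cover the same L residues")
    if not predicted:
        raise ValueError("sequence length L must be positive")
    # Score 1 for each matching position, 0 otherwise, then divide by L.
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(predicted)


def select_by_similarity(merged: Sequence[str], predicted: str, np_size: int) -> List[str]:
    """Step 6: sort the merged population by similarity and keep the top NP."""
    ranked = sorted(merged, key=lambda ss: ss_similarity(predicted, ss), reverse=True)
    return ranked[:np_size]
```

In a full pipeline the merged list would hold conformations rather than strings, with the DSSP string computed per conformation; the ranking logic would stay the same.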
The technical concept of the invention is as follows: the invention provides a protein structure prediction method based on secondary structure similarity within the framework of a population algorithm. First, the crossover probability in the population algorithm controls the speed of population convergence, avoids premature convergence, and enables information exchange among individuals of the population; then, mutation increases the diversity of conformations so that better conformations can be obtained; finally, in the selection process, the population is ranked by secondary structure similarity, individuals with low secondary structure similarity are eliminated, and better individuals are kept, which reduces the impact of the inaccuracy of the energy function.
The beneficial effects of the invention are as follows: on the one hand, a population algorithm is used and information is exchanged among individuals of the population, which broadens the search of the conformational space; on the other hand, the population is selected by secondary structure similarity, which greatly increases the probability of retaining high-quality individuals, reduces the errors caused by the inaccuracy of the energy function, and improves the prediction accuracy.
Drawings
FIG. 1 is a flow chart of a method for predicting protein structure based on secondary structure similarity.
FIG. 2 is a conformational distribution diagram obtained when protein 1AIL is subjected to structure prediction by a protein structure prediction method based on secondary structure similarity.
FIG. 3 is a three-dimensional structural diagram obtained by predicting the structure of protein 1AIL by a protein structure prediction method based on secondary structure similarity.
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 3, a method for predicting a protein structure based on secondary structure similarity, the method comprising the steps of:
1) setting parameters, and the process is as follows:
reading the sequence information of the target protein, the fragment library information, and the energy function, and setting the population of protein conformations $\{x_1, x_2, \ldots, x_i, \ldots, x_{NP}\}$, where NP is the population size and $x_i$ denotes the i-th individual of the population; the iteration counter is Gen, the maximum number of iterations is $G_{max}$, the crossover probability is $p_1$, the mutation probability is $p_2$, and the sequence length is L;
2) calculating the similarity of the secondary structure by the following process:
for an individual $x_i$ of the protein with sequence length L, the predicted secondary structure obtained from the PSIPRED online server is $S^{P}_{k}$, $k \in [1, L]$, and the secondary structure of the conformation obtained by the DSSP algorithm is $S^{D}_{i,k}$, $k \in [1, L]$; the secondary structure similarity of the conformation is calculated according to the formula
$$SS_i = \frac{1}{L}\sum_{k=1}^{L}\delta_k, \qquad \delta_k = \begin{cases} 1, & S^{P}_{k} = S^{D}_{i,k} \\ 0, & \text{otherwise} \end{cases}$$
that is, a score of 1 is recorded when the predicted secondary structure and the DSSP secondary structure are the same at the k-th position and a score of 0 otherwise; the scores are summed over the whole sequence length, and the sum is divided by the sequence length to obtain the secondary structure similarity $SS_i$;
3) Population initialization, the process is as follows:
performing fragment assembly on each individual of the population until the residues at all positions have been replaced once, which completes the initialization;
4) population crossover, and the process is as follows:
4.1) pairing the NP individuals two by two to form NP/2 pairs, and numbering the pairs $a_1, a_2, \ldots, a_j, \ldots, a_{NP/2}$, $j \in [1, NP/2]$;
4.2) randomly selecting one of the pairs $a_j$ and deciding according to the probability $p_1$ whether to perform crossover; if so, randomly selecting a position in the individuals, setting a random crossover length $crosslength \in [3, 10]$, and exchanging the corresponding fragments of the two individuals to obtain two new individuals;
5) population mutation, and the process is as follows:
5.1) for each individual $x_i$, $i \in [1, NP]$, performing fragment assembly according to the mutation probability $p_2$: if $random(0,1) > p_2$, fragment assembly with fragments of length 3 is performed; if $random(0,1) < p_2$, fragment assembly with fragments of length 9 is performed; the individual obtained after fragment assembly is $x_i'$;
5.2) using the energy function to calculate the energies of the individuals $x_i$ and $x_i'$ before and after assembly, obtaining $E_i$ and $E_i'$; if $E_i' < E_i$, the individual $x_i'$ is retained; if $E_i' > E_i$, the Boltzmann probability $p = \exp\{-(E_i' - E_i)/KT\}$ is used to decide whether to accept the assembled individual: if $random(0,1) < p$, the individual $x_i'$ is retained; otherwise, the individual $x_i$ is retained;
6) selecting the population by using the secondary structure similarity, and the process is as follows:
first, merging the initial population and the mutated population into a new population of size 2 × NP; then calculating the secondary structure similarity of each individual of the new population and sorting the merged population by secondary structure similarity; selecting the NP individuals with the highest secondary structure similarity as the selected population; and finally setting Gen = Gen + 1;
7) judging whether the maximum number of iterations $G_{max}$ has been reached; if so, stopping the iteration and outputting the information of the individuals of the last generation; otherwise, returning to step 4).
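The crossover of step 4.2) and the mutation acceptance of step 5) above can be illustrated as follows. This is a hedged sketch, not the implementation of the invention: a conformation is assumed to be a list of (phi, psi) torsion pairs, `insert_fragment` and `energy` are hypothetical callables standing in for the fragment library and the energy function, and `kt` stands for the constant KT in the Boltzmann probability.

```python
import math
import random
from typing import Callable, List, Tuple

Conformation = List[Tuple[float, float]]  # assumed: one (phi, psi) pair per residue


def crossover(a: Conformation, b: Conformation, rng: random.Random,
              min_len: int = 3, max_len: int = 10) -> Tuple[Conformation, Conformation]:
    """Step 4.2: swap one randomly placed segment of 3..10 residues.

    Assumes both conformations have the same length, at least min_len.
    """
    length = rng.randint(min_len, min(max_len, len(a)))
    start = rng.randint(0, len(a) - length)
    end = start + length
    child_a = a[:start] + b[start:end] + a[end:]
    child_b = b[:start] + a[start:end] + b[end:]
    return child_a, child_b


def metropolis_accept(e_old: float, e_new: float, kt: float, rng: random.Random) -> bool:
    """Step 5.2: accept lower energy always, otherwise with exp(-(E'-E)/KT)."""
    if e_new < e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kt)


def mutate(x: Conformation,
           insert_fragment: Callable[[Conformation, int], Conformation],
           energy: Callable[[Conformation], float],
           p2: float, kt: float, rng: random.Random) -> Conformation:
    """Step 5: fragment length 3 if random > p2, else 9, then the acceptance test."""
    frag_len = 3 if rng.random() > p2 else 9
    candidate = insert_fragment(x, frag_len)
    accepted = metropolis_accept(energy(x), energy(candidate), kt, rng)
    return candidate if accepted else x
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing sampling behaviour across parameter settings.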
In this embodiment, the α-helical protein 1AIL with a sequence length of 73 is taken as an example, and the method for predicting protein structure based on secondary structure similarity comprises the following steps:
1) setting parameters, and the process is as follows:
reading the sequence information of the target protein, the fragment library information, and the energy function "score3", and setting the population of protein conformations $\{x_1, x_2, \ldots, x_i, \ldots, x_{NP}\}$, where NP = 100 is the population size and $x_i$ denotes the i-th individual of the population; the iteration counter is Gen, the maximum number of iterations is $G_{max} = 100$, the crossover probability is $p_1 = 0.1$, the mutation probability is $p_2 = 0.5$, and the sequence length is L = 73;
2) calculating the similarity of the secondary structure by the following process:
for an individual $x_i$ of the protein with sequence length L, the predicted secondary structure obtained from the PSIPRED online server is $S^{P}_{k}$, $k \in [1, L]$, and the secondary structure of the conformation obtained by the DSSP algorithm is $S^{D}_{i,k}$, $k \in [1, L]$; the secondary structure similarity of the conformation is calculated according to the formula
$$SS_i = \frac{1}{L}\sum_{k=1}^{L}\delta_k, \qquad \delta_k = \begin{cases} 1, & S^{P}_{k} = S^{D}_{i,k} \\ 0, & \text{otherwise} \end{cases}$$
that is, a score of 1 is recorded when the predicted secondary structure and the DSSP secondary structure are the same at the k-th position and a score of 0 otherwise; the scores are summed over the whole sequence length, and the sum is divided by the sequence length to obtain the secondary structure similarity $SS_i$;
3) Population initialization, the process is as follows:
performing fragment assembly on each individual of the population until the residues at all positions have been replaced once, which completes the initialization;
4) population crossover, and the process is as follows:
4.1) pairing the NP individuals two by two to form NP/2 pairs, and numbering the pairs $a_1, a_2, \ldots, a_j, \ldots, a_{NP/2}$, $j \in [1, NP/2]$;
4.2) randomly selecting one of the pairs $a_j$ and deciding according to the probability $p_1$ whether to perform crossover; if so, randomly selecting a position in the individuals, setting a random crossover length $crosslength \in [3, 10]$, and exchanging the corresponding fragments of the two individuals to obtain two new individuals;
5) population mutation, and the process is as follows:
5.1) for each individual $x_i$, $i \in [1, NP]$, performing fragment assembly according to the mutation probability $p_2$: if $random(0,1) > p_2$, fragment assembly with fragments of length 3 is performed; if $random(0,1) < p_2$, fragment assembly with fragments of length 9 is performed; the individual obtained after fragment assembly is $x_i'$;
5.2) using the energy function to calculate the energies of the individuals $x_i$ and $x_i'$ before and after assembly, obtaining $E_i$ and $E_i'$; if $E_i' < E_i$, the individual $x_i'$ is retained; if $E_i' > E_i$, the Boltzmann probability $p = \exp\{-(E_i' - E_i)/KT\}$ is used to decide whether to accept the assembled individual: if $random(0,1) < p$, the individual $x_i'$ is retained; otherwise, the individual $x_i$ is retained;
6) selecting the population by using the secondary structure similarity, and the process is as follows:
first, merging the initial population and the mutated population into a new population of size 2 × NP; then calculating the secondary structure similarity of each individual of the new population and sorting the merged population by secondary structure similarity; selecting the NP individuals with the highest secondary structure similarity as the selected population; and finally setting Gen = Gen + 1;
7) judging whether the maximum number of iterations $G_{max}$ has been reached; if so, stopping the iteration and outputting the information of the individuals of the last generation; otherwise, returning to step 4).
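The parameter values of this embodiment and the loop formed by steps 4) to 7) can be tied together as in the sketch below. It is illustrative only and rests on several assumptions: the names `Settings` and `evolve` are hypothetical, pairs are formed by random shuffling, a single randomly chosen pair is crossed per generation (one reading of step 4.2), the "initial population" merged in step 6) is taken to be the population at the start of the generation, and the crossover, mutation, and similarity operators are supplied by the caller.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Settings:
    """Parameter values quoted in the 1AIL embodiment."""
    np_size: int = 100   # NP, population size
    g_max: int = 100     # G_max, maximum number of iterations
    p1: float = 0.1      # crossover probability
    p2: float = 0.5      # mutation probability
    length: int = 73     # L, sequence length of 1AIL


def evolve(population: Sequence[T],
           crossover: Callable[[T, T], Tuple[T, T]],
           mutate: Callable[[T, float], T],
           similarity: Callable[[T], float],
           cfg: Settings,
           rng: random.Random) -> List[T]:
    """Run steps 4) to 7) for cfg.g_max generations; needs at least 2 individuals."""
    pop = list(population)
    for _gen in range(cfg.g_max):  # step 7: stop after G_max generations
        trial = list(pop)
        # Step 4.1: pair individuals (random pairing is an assumption).
        order = list(range(len(trial)))
        rng.shuffle(order)
        pairs = [(order[k], order[k + 1]) for k in range(0, len(order) - 1, 2)]
        # Step 4.2: pick one pair and cross it with probability p1.
        i, j = rng.choice(pairs)
        if rng.random() < cfg.p1:
            trial[i], trial[j] = crossover(trial[i], trial[j])
        # Step 5: mutate every individual (acceptance logic lives in `mutate`).
        trial = [mutate(x, cfg.p2) for x in trial]
        # Step 6: merge to 2 * NP individuals and keep the NP most similar ones.
        merged = pop + trial
        merged.sort(key=similarity, reverse=True)
        pop = merged[:cfg.np_size]
    return pop
```

Keeping the operators as parameters makes it possible to swap in a real energy function and fragment library without changing the loop itself.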
Using the method described above, a near-native conformation of the α-helical protein 1AIL with a sequence length of 73 was obtained; the minimum root mean square deviation is … and the mean root mean square deviation is …
The prediction structure is shown in fig. 3.
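For reference, the root mean square deviation used as the accuracy measure here is conventionally defined as below. The text does not state which atoms are compared or how the structures are superimposed, so the use of N matched atoms (commonly the Cα atoms) after optimal superposition onto the native structure is an assumption.

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left\lVert \mathbf{r}_k - \mathbf{r}_k^{\mathrm{native}} \right\rVert^{2}}
```

Here $\mathbf{r}_k$ is the position of the k-th atom in the predicted conformation and $\mathbf{r}_k^{\mathrm{native}}$ is the position of the corresponding atom in the native structure.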
The above describes the prediction results of the invention with protein 1AIL as an example and is not intended to limit the scope of the invention; various modifications and improvements can be made without departing from the scope of the invention.