CN108334746B

CN108334746B - A protein structure prediction method based on secondary structure similarity

Info

Publication number: CN108334746B
Application number: CN201810034686.4A
Authority: CN
Inventors: 李章维; 孙科; 余宝昆; 马来发; 周晓根; 张贵军
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Guangzhou Zhaoji Biotechnology Co ltd; Shenzhen Xinrui Gene Technology Co ltd
Priority date: 2018-01-15
Filing date: 2018-01-15
Publication date: 2021-06-18
Anticipated expiration: 2038-01-15
Also published as: CN108334746A

Abstract

A protein conformation space search method based on secondary structure similarity. First, the crossover probability set in the population-based algorithm controls the speed of population convergence and avoids premature convergence, while also enabling information exchange within the population. Next, mutation increases the diversity of conformations so that better conformations can be obtained. Finally, in the selection step the population is ranked by secondary structure similarity: individuals with low secondary structure similarity are eliminated and better individuals are retained, thereby avoiding the problem caused by the inaccuracy of the energy function. The invention has better sampling capability and higher prediction accuracy.

Description

Protein structure prediction method based on secondary structure similarity

Technical Field

The invention relates to the fields of bioinformatics, molecular dynamics simulation, statistical learning, combinatorial optimization and computer applications, and in particular to a protein structure prediction method based on secondary structure similarity.

Background

Since the end of the twentieth century, with the rapid development of the life sciences, more and more researchers and research institutes have taken part in life science research. A protein is formed when 20 different kinds of amino acids are joined by dehydration condensation, with mRNA as the template, into an amino acid sequence, which then folds into a three-dimensional structure with a specific function. The secondary structure of a protein refers to regularly repeating conformations in its polypeptide chain, such as alpha helices and beta sheets. Understanding the three-dimensional structure of a protein is the basis for studying its biological function and mechanism of activity. Protein structure prediction is currently one of the research hotspots in bioinformatics and computer applications, and it has important guiding significance for the design of new proteins and of drug target proteins. At present, the three-dimensional structure of proteins is determined experimentally mainly by X-ray diffraction and nuclear magnetic resonance (NMR). X-ray diffraction can yield protein structures with high precision, but it is not applicable to proteins for which diffraction-quality crystals cannot be prepared. NMR does not require protein crystals, but it can only measure small proteins of fewer than 300 amino acids, and it is time-consuming and costly.

Since protein structures are determined far more slowly than protein sequences (in fact, only 0.2% of protein sequences have an experimentally determined structure), predicting structure from protein sequence by computational methods is very meaningful work. The Anfinsen experiments showed that the structural information of a protein is contained in its sequence, which indicates that predicting structure from sequence is feasible. The current mainstream protein structure prediction methods are homology modeling, threading and de novo prediction. When sequence identity is high (> 50%), homology modeling and threading are the first choice; when sequence similarity is low (< 30%), these two methods are no longer applicable and only de novo prediction can be used.

De novo protein structure prediction currently faces two major bottlenecks. One is the deceptiveness of the energy landscape: the low-energy conformation obtained is not necessarily the native conformation, which manifests as an inaccurate energy function that cannot select good conformations. The other is the insufficient ability of existing techniques to sample the conformational space, which manifests as a lack of diversity among the conformations.

Therefore, current protein structure prediction methods are deficient in prediction accuracy and sampling capability and need to be improved.

Disclosure of Invention

To overcome the insufficient sampling capability and prediction accuracy of existing protein structure prediction methods, the invention provides a population-based protein structure prediction method based on secondary structure similarity that has better sampling capability and higher prediction accuracy. The method designs a secondary structure similarity index and selects individuals with better energy and structure under the double constraint of this index and the energy function, which effectively solves the problem of low prediction accuracy caused by the inaccuracy of the energy function.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A protein structure prediction method based on secondary structure similarity, the method comprising the following steps:

1) setting parameters, the process is as follows:

reading the sequence information of the target protein, the fragment library information and the energy function, and setting the population of protein conformations $\{x_1, x_2, \ldots, x_i, \ldots, x_{NP}\}$, where $NP$ is the population size and $x_i$ denotes the $i$-th individual of the population, the iteration counter $Gen$, the maximum number of iterations $G_{max}$, the crossover probability $p_1$, the mutation probability $p_2$ and the sequence length $L$;

2) calculating the similarity of the secondary structure by the following process:

for a protein conformation $x_i$ of sequence length $L$, the predicted secondary structure is obtained from the PSIPRED online server at each residue position $k \in [1, L]$, and the secondary structure of the conformation is obtained by the DSSP algorithm at each residue position $k \in [1, L]$. The secondary structure similarity of the conformation is calculated according to the formula

$$SS_i = \frac{1}{L} \sum_{k=1}^{L} \delta_k$$

where $\delta_k = 1$ when the predicted secondary structure and the DSSP secondary structure at the $k$-th position are the same, and $\delta_k = 0$ otherwise; that is, the scores are summed over the whole sequence length and the total is divided by the sequence length to give the secondary structure similarity $SS_i$;
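The per-position comparison in step 2) can be illustrated with a minimal Python sketch. It assumes that both secondary structures are already available as strings of equal length over the same alphabet (for example the three states H, E and C); obtaining them from PSIPRED and DSSP, and mapping DSSP's larger alphabet onto the same states, is outside this sketch. The function name and the example strings are hypothetical.

```python
def secondary_structure_similarity(predicted: str, observed: str) -> float:
    """Fraction of residue positions whose secondary-structure labels agree.

    predicted: labels assumed to come from a predictor such as PSIPRED.
    observed:  labels assumed to come from DSSP run on the conformation.
    """
    if len(predicted) != len(observed):
        raise ValueError("both label strings must cover the same L residues")
    if not predicted:
        raise ValueError("sequence length L must be positive")
    # Score 1 for every position k where the two labels are identical, else 0.
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    # Divide the total score by the sequence length L.
    return matches / len(predicted)


# Hypothetical 10-residue example: 8 of the 10 positions agree.
psipred_ss = "CHHHHHHCCC"
dssp_ss = "CCHHHHHHCC"
print(secondary_structure_similarity(psipred_ss, dssp_ss))  # 0.8
```

A value of 1.0 means the conformation reproduces the predicted secondary structure at every residue, and 0.0 means it agrees nowhere.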

3) Population initialization, the process is as follows:

performing fragment assembly on each individual of the population until the residues at all positions have been replaced once, which completes the initialization;
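The initialization in step 3) can be sketched as follows. This is only an illustration under stated assumptions: a conformation is represented as a list of (phi, psi) torsion pairs, the fragment library is a plain dictionary mapping a start index to candidate fragments, and each individual starts from an extended chain. None of these representations is specified in the text, and all names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Torsion = Tuple[float, float]                     # hypothetical (phi, psi) pair for one residue
FragmentLibrary = Dict[int, List[List[Torsion]]]  # hypothetical: start index -> candidate fragments


def initialize_individual(length: int, library: FragmentLibrary,
                          rng: random.Random) -> List[Torsion]:
    """Insert random fragments until every residue position has been replaced."""
    # Refuse libraries that cannot cover every residue; otherwise the loop would never end.
    covered = {k for start, frags in library.items()
               for frag in frags for k in range(start, start + len(frag))}
    if covered != set(range(length)):
        raise ValueError("fragment library must cover exactly positions 0..length-1")

    conformation: List[Torsion] = [(180.0, 180.0)] * length  # extended chain (assumption)
    replaced = [False] * length
    starts = sorted(s for s, frags in library.items() if frags)
    while not all(replaced):
        start = rng.choice(starts)
        fragment = rng.choice(library[start])
        # Overwrite the window with the fragment's torsion angles.
        conformation[start:start + len(fragment)] = fragment
        for k in range(start, start + len(fragment)):
            replaced[k] = True
    return conformation


# Toy library: one 3-residue helix-like fragment for each start position 0..3.
toy_library = {s: [[(-60.0, -45.0)] * 3] for s in range(4)}
individual = initialize_individual(6, toy_library, random.Random(0))
print(len(individual))  # 6
```

Calling the function once per individual, NP times, would give the initial population.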

4) population crossover, the process is as follows:

4.1) pairing the $NP$ individuals to form $NP/2$ pairs, numbered $a_1, a_2, \ldots, a_j, \ldots, a_{NP/2}$, $j \in [1, NP/2]$;

4.2) randomly selecting one of the pairs $a_j$ and deciding, according to the probability $p_1$, whether to perform crossover; if so, randomly selecting a position in the individuals, setting a random crossover length $crosslength \in [3, 10]$, and exchanging the corresponding fragments of the two individuals to obtain two new individuals;
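The fragment exchange in step 4) can be sketched as below. The sketch assumes that a conformation is a list of per-residue (phi, psi) pairs and that the crossover test is applied to every pair formed by a random pairing; the text itself speaks of randomly selecting a pair, so this looping choice is an interpretation. Function and type names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Conformation = List[Tuple[float, float]]  # hypothetical: one (phi, psi) pair per residue


def crossover_pair(a: Sequence, b: Sequence, p1: float, rng: random.Random,
                   min_len: int = 3, max_len: int = 10) -> Tuple[list, list]:
    """With probability p1, swap one window of 3 to 10 residues between a and b."""
    if len(a) != len(b) or len(a) < min_len:
        raise ValueError("individuals must have equal length of at least min_len")
    a, b = list(a), list(b)              # work on copies, leave the parents untouched
    if rng.random() >= p1:
        return a, b                      # no crossover for this pair
    length = rng.randint(min_len, min(max_len, len(a)))   # crosslength in [3, 10]
    start = rng.randint(0, len(a) - length)               # random crossover position
    a[start:start + length], b[start:start + length] = \
        b[start:start + length], a[start:start + length]
    return a, b


def crossover_population(population: Sequence[Sequence], p1: float,
                         rng: random.Random) -> List[list]:
    """Pair the individuals at random (NP/2 pairs) and apply crossover_pair to each pair."""
    order = list(range(len(population)))
    rng.shuffle(order)
    offspring = [list(ind) for ind in population]
    for j in range(0, len(order) - 1, 2):
        i1, i2 = order[j], order[j + 1]
        offspring[i1], offspring[i2] = crossover_pair(population[i1], population[i2], p1, rng)
    return offspring
```

Because each call works on copies, the population before crossover remains available for the later selection step.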

5) population mutation, the process is as follows:

5.1) for each individual $x_i$, $i \in [1, NP]$, performing fragment assembly according to the mutation probability $p_2$: if random(0,1) > $p_2$, fragment assembly with 3-residue fragments is performed; if random(0,1) < $p_2$, fragment assembly with 9-residue fragments is performed; the individual obtained after fragment assembly is $x_i'$;

5.2) using the energy function to calculate the energies $E_i$ and $E_i'$ of the individuals $x_i$ and $x_i'$ before and after assembly, respectively; if $E_i' < E_i$, the individual $x_i'$ is retained; if $E_i' > E_i$, whether to accept the assembled individual is decided according to the Boltzmann probability $p = \exp\{-(E_i' - E_i)/kT\}$: if random(0,1) < $p$, the individual $x_i'$ is retained; otherwise, the individual $x_i$ is retained;
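The mutation and acceptance rule of step 5) can be sketched as follows. The energy function and the fragment insertion are passed in as callables because the text does not fix their implementation; `energy`, `insert_fragment` and the value of `kT` are therefore placeholders, and the toy stand-ins at the end are purely illustrative.

```python
import math
import random
from typing import Callable, List


def mutate_individual(x: List[float],
                      energy: Callable[[List[float]], float],
                      insert_fragment: Callable[[List[float], int, random.Random], List[float]],
                      p2: float, kT: float, rng: random.Random) -> List[float]:
    """Fragment insertion followed by a Boltzmann (Metropolis-style) acceptance test."""
    # random(0,1) > p2 -> 3-residue fragment, otherwise 9-residue fragment.
    # (The text leaves the tie random(0,1) == p2 unspecified; it falls to 9 here.)
    frag_len = 3 if rng.random() > p2 else 9
    x_new = insert_fragment(x, frag_len, rng)

    e_old, e_new = energy(x), energy(x_new)
    if e_new < e_old:
        return x_new                                   # lower energy: always keep x'
    # Higher energy: accept with probability exp(-(E' - E) / kT).
    # Equal energies give probability 1 here, a case the text does not cover.
    p_accept = math.exp(-(e_new - e_old) / kT)
    return x_new if rng.random() < p_accept else x


# Toy stand-ins (assumptions): energy is the sum of squares, and "fragment
# insertion" sets a window of the chosen length to zero.
def toy_energy(x: List[float]) -> float:
    return sum(v * v for v in x)


def toy_insert(x: List[float], frag_len: int, rng: random.Random) -> List[float]:
    start = rng.randint(0, len(x) - frag_len)
    return x[:start] + [0.0] * frag_len + x[start + frag_len:]


rng = random.Random(42)
x = [1.0] * 12
x_next = mutate_individual(x, toy_energy, toy_insert, p2=0.5, kT=1.0, rng=rng)
print(toy_energy(x_next) < toy_energy(x))  # True: the toy move always lowers the energy
```

A larger `kT` accepts uphill moves more often, which trades convergence speed for conformational diversity.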

6) selecting the population using the secondary structure similarity, the process is as follows:

first, merging the initial population and the mutated population into a new population of size $2 \times NP$; then calculating the secondary structure similarity of each individual of the new population, sorting the merged population by secondary structure similarity, and selecting the top $NP$ individuals with the highest secondary structure similarity as the selected population; finally, setting $Gen = Gen + 1$;
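The selection in step 6) amounts to a sort and a truncation. The sketch below assumes that a similarity score can be computed for each individual by a supplied callable; for brevity the toy example represents each individual directly by its secondary-structure string, which is an illustrative simplification, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

Individual = TypeVar("Individual")


def select_by_similarity(parents: Sequence[Individual],
                         offspring: Sequence[Individual],
                         similarity: Callable[[Individual], float],
                         np_size: int) -> List[Individual]:
    """Merge both populations (size 2*NP) and keep the NP most similar individuals."""
    merged = list(parents) + list(offspring)
    ranked = sorted(merged, key=similarity, reverse=True)  # highest similarity first
    return ranked[:np_size]


# Toy example: individuals are secondary-structure strings compared to a prediction.
predicted = "HHHHCC"


def similarity_to_prediction(ss: str) -> float:
    return sum(p == o for p, o in zip(predicted, ss)) / len(predicted)


parents = ["HHHHCC", "CCCCCC"]
offspring = ["HHHCCC", "EEEEEE"]
survivors = select_by_similarity(parents, offspring, similarity_to_prediction, np_size=2)
print(survivors)  # ['HHHHCC', 'HHHCCC']
```

Note that energy plays no part in this ranking; in the method it enters only through the acceptance test of the mutation step.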

7) judging whether the maximum number of iterations $G_{max}$ has been reached; if so, stopping the iteration and outputting the information of the individuals of the last-generation population; otherwise, returning to step 4).
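Steps 4) to 7) together form a generational loop, which can be outlined as below. The operators are supplied as callables, so the skeleton says nothing about how crossover, mutation or similarity are implemented. It also interprets the "initial population" merged in step 6) as the population at the start of the current generation, which is an assumption. All names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

Individual = TypeVar("Individual")


def evolve(initial_population: Sequence[Individual],
           crossover: Callable[[List[Individual]], List[Individual]],
           mutate: Callable[[Individual], Individual],
           similarity: Callable[[Individual], float],
           g_max: int) -> List[Individual]:
    """Run crossover, mutation and similarity-based selection for g_max generations."""
    population = list(initial_population)
    np_size = len(population)
    gen = 0
    while gen < g_max:                                   # step 7: stop at G_max
        crossed = crossover(population)                  # step 4: crossover
        mutated = [mutate(x) for x in crossed]           # step 5: mutation
        merged = population + mutated                    # step 6: 2*NP candidates
        merged.sort(key=similarity, reverse=True)
        population = merged[:np_size]                    # keep the NP most similar
        gen += 1                                         # Gen = Gen + 1
    return population


# Toy run with integers standing in for conformations: crossover is the
# identity, mutation adds 1, and the "similarity" is the value itself.
final = evolve([0, 1, 2], crossover=lambda pop: list(pop),
               mutate=lambda v: v + 1, similarity=lambda v: float(v), g_max=3)
print(final)  # [5, 4, 4]
```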

The technical concept of the invention is as follows: the invention provides a protein structure prediction method based on secondary structure similarity within the framework of a population-based algorithm. First, the crossover probability set in the population-based algorithm controls the speed of population convergence and avoids premature convergence, while also enabling information exchange within the population. Next, mutation increases the diversity of conformations so that better conformations can be obtained. Finally, in the selection step the population is ranked by secondary structure similarity: individuals with low secondary structure similarity are eliminated and better individuals are retained, thereby avoiding the problem caused by the inaccuracy of the energy function.

The beneficial effects of the invention are as follows: on the one hand, a population-based algorithm is used, so that information is exchanged within the population and the search of the conformational space is broadened; on the other hand, the population is selected by secondary structure similarity, which greatly increases the probability of retaining high-quality individuals, reduces errors caused by the inaccuracy of the energy function, and improves prediction accuracy.

Drawings

FIG. 1 is a flow chart of a method for predicting protein structure based on secondary structure similarity.

FIG. 2 is a conformational distribution diagram obtained when protein 1AIL is subjected to structure prediction by a protein structure prediction method based on secondary structure similarity.

FIG. 3 is a three-dimensional structural diagram obtained by predicting the structure of protein 1AIL by a protein structure prediction method based on secondary structure similarity.

The invention is further described below with reference to the accompanying drawings.

Referring to fig. 1 to 3, a method for predicting a protein structure based on secondary structure similarity, the method comprising the steps of:

1) setting parameters, and the process is as follows:

reading the sequence information of the target protein, the fragment library information and the energy function, and setting the population of protein conformations $\{x_1, x_2, \ldots, x_i, \ldots, x_{NP}\}$, where $NP$ is the population size and $x_i$ denotes the $i$-th individual of the population, the iteration counter $Gen$, the maximum number of iterations $G_{max}$, the crossover probability $p_1$, the mutation probability $p_2$ and the sequence length $L$;

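2) calculating the secondary structure similarity over the residue positions $k \in [1, L]$ according to the formula given in step 2) of the Disclosure of Invention;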

3) Population initialization, the process is as follows:

4) population crossover, the process is as follows:

4.1) pairing the $NP$ individuals to form $NP/2$ pairs, numbered $a_1, a_2, \ldots, a_j, \ldots, a_{NP/2}$, $j \in [1, NP/2]$;

5) population mutation, the process is as follows:

7) judging whether the maximum number of iterations $G_{max}$ has been reached; if so, stopping the iteration and outputting the information of the individuals of the last-generation population; otherwise, returning to step 4).

In this embodiment, the α-folded protein 1AIL with a sequence length of 73 is taken as an example; a protein structure prediction method based on secondary structure similarity comprises the following steps:

1) setting parameters, and the process is as follows:

reading the sequence information of the target protein, the fragment library information and the energy function "score3", and setting the population of protein conformations $\{x_1, x_2, \ldots, x_i, \ldots, x_{NP}\}$, where $NP = 100$ is the population size and $x_i$ denotes the $i$-th individual of the population, the iteration counter $Gen$, the maximum number of iterations $G_{max} = 100$, the crossover probability $p_1 = 0.1$, the mutation probability $p_2 = 0.5$ and the sequence length $L = 73$;
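The parameter values of this embodiment can be collected in a small configuration object, sketched below. Only the numeric values, the protein identifier and the energy function name are taken from the text; the class name, the field names and the validation rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParameters:
    """Parameter set of the 1AIL embodiment (field names are illustrative)."""
    target: str = "1AIL"
    sequence_length: int = 73            # L
    population_size: int = 100           # NP
    max_generations: int = 100           # G_max
    crossover_probability: float = 0.1   # p1
    mutation_probability: float = 0.5    # p2
    energy_function: str = "score3"

    def __post_init__(self) -> None:
        # Pairing the individuals into NP/2 pairs needs an even population size.
        if self.population_size <= 0 or self.population_size % 2 != 0:
            raise ValueError("population_size must be a positive even number")
        for name in ("crossover_probability", "mutation_probability"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")


params = RunParameters()
print(params.population_size // 2)  # 50 pairs per generation
```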

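2) calculating the secondary structure similarity over the residue positions $k \in [1, L]$ according to the formula given in step 2) of the Disclosure of Invention;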

3) Population initialization, the process is as follows:

4) population crossover, the process is as follows:

5) population mutation, the process is as follows:

Using the method described above on the α-folded protein 1AIL with a sequence length of 73, a near-native conformation of the protein was obtained, the minimum root mean square deviation being

Mean root mean square deviation of

The predicted structure is shown in FIG. 3.

The above describes the results obtained by the invention using the protein 1AIL as an example and is not intended to limit the scope of the invention; various modifications and improvements can be made without departing from the scope of the invention.

Claims

1. A protein structure prediction method based on secondary structure similarity is characterized by comprising the following steps:

1) setting parameters, and the process is as follows:

The secondary structure of the conformation obtained by the Dssp algorithm is

According to the formula

3) Population initialization, the process is as follows:

4) population crossover, the process is as follows:

5) population mutation, the process is as follows:

7) judging whether the maximum number of iterations $G_{max}$ has been reached; if so, stopping the iteration and outputting the information of the individuals of the last-generation population; otherwise, returning to step 4).