CN110556161B

CN110556161B - Protein structure prediction method based on conformational diversity sampling

Info

Publication number: CN110556161B
Application number: CN201910743293.5A
Authority: CN
Inventors: 张贵军; 赵凯龙; 饶亮; 夏瑜豪; 刘俊; 彭春翔; 周晓根
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Guangzhou Zhaoji Biotechnology Co ltd; Shenzhen Xinrui Gene Technology Co ltd
Priority date: 2019-08-13
Filing date: 2019-08-13
Publication date: 2022-04-05
Anticipated expiration: 2039-08-13
Also published as: CN110556161A

Abstract

A protein structure prediction method based on conformational diversity sampling predicts protein structure by combining a genetic algorithm with a local search strategy. In the third and fourth stages of Rosetta, population conformations are sampled using both an energy function and the diversity calculation method of the invention. Conformations whose local structures differ more strongly are sampled preferentially, which improves prediction efficiency and accuracy and avoids blind sampling to a certain extent. The invention thus provides a protein structure prediction method based on conformational diversity sampling with high prediction accuracy.

Description

Protein structure prediction method based on conformational diversity sampling

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a protein structure prediction method based on conformational diversity sampling.

Background

The protein structure prediction problem is also known as the protein folding problem. The folded shape of a protein largely determines its biological function, so understanding the structure of a protein is of great significance for studying its function. Predicting the three-dimensional structure of proteins has become one of the important research problems in bioinformatics.

De novo protein structure prediction is a commonly used approach and, in principle, an ideal one, because it uses only primary sequence information and does not depend on a known protein structure template. Its theoretical basis is that the three-dimensional structure of a native protein in a given environment is the structure with the lowest free energy of the whole system. De novo prediction therefore has two key requirements: first, a reasonable potential function whose global minimum corresponds to the native structure of the protein; second, an efficient conformational space search algorithm that finds the global minimum of the potential function within a practical computation time. In practice, the inaccuracy of the energy function and insufficient sampling ability lead to unsatisfactory prediction results.

Over the past decades, researchers have proposed many algorithms for finding the globally optimal solution in protein three-dimensional structure prediction. Genetic algorithms have long been used in protein structure prediction because they can find optimal solutions simply and efficiently in large and complex search spaces. However, genetic algorithms are prone to becoming trapped in local optima, to premature convergence, and to slow convergence, so most methods combine a genetic algorithm with a local search strategy to predict protein structure. For example, combining a genetic algorithm with simulated annealing can effectively avoid becoming trapped in a local optimum, and tabu search has been applied within a genetic algorithm to search for protein structures quickly and accurately. However, because these methods combine multiple algorithms, they have long running times and low efficiency, which limits their usefulness.

Disclosure of Invention

To overcome the low sampling efficiency of existing protein structure prediction methods, the invention provides a novel diversity calculation method. The invention predicts protein structure by combining a genetic algorithm with a local search strategy and, after crossover and mutation of the population individuals, selects individuals using an energy function combined with the diversity calculation method of the invention. The diversity calculation method avoids blind sampling of the protein conformational space. After the first-stage and second-stage fragment assembly of the Rosetta protocol, the general structure of the protein has been predicted; on this basis, conformations whose local structures differ more strongly are sampled preferentially before further fragment assembly, which prevents the algorithm from becoming trapped in local optima and improves prediction efficiency and accuracy.

The technical solution adopted by the invention to solve the technical problem is as follows:

A protein structure prediction method based on conformational diversity sampling, the method comprising the following steps:

1) Input the sequence information of the protein to be predicted and read the sequence length L; set the parameters: population size N, number of iterations G, and crossover probability P_c;

2) Based on the sequence information of the target protein, construct a fragment library using Robetta (http://robetta.bakerlab.org), and predict the secondary structure of the target sequence using PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred);

3) Iterate the first and second stages of Rosetta to generate an initial population of N individuals;

4) Set g = 0, where g ∈ {0, 1, 2, ..., G};

5) If g = 0, perform fragment assembly on the individuals in the population to generate a new population P = {P₁, P₂, ..., P_N};

6) Execute steps 7) to 11) in each of the third and fourth stages of the Rosetta protocol;

7) Randomly pair the individuals in the population to form N/2 parent pairs;

8) Crossover operation, as follows:

8.1) Let P₁* and P₂* be the two parent individuals of a pair, and randomly select a loop region;

8.2) Generate a random decimal r₁ ∈ [0, 1]; if r₁ < P_c, exchange the dihedral angles of all residues in the selected loop region between P₁* and P₂*, generating two new individuals P′₁ and P′₂;

8.3) Repeat steps 8.1) and 8.2) until all parent pairs have undergone crossover, generating a new population P′ = {P′₁, P′₂, ..., P′_N};

9) Mutation operation, as follows:

9.1) For an individual P′_i in the population P′, generate a random integer r₂ ∈ [0, L-3], and randomly select a fragment from the corresponding 3-residue fragment library for replacement;

9.2) Repeat step 9.1) until all individuals have been mutated, generating a new population P″ = {P″₁, P″₂, ..., P″_N};

10) Selection operation, as follows:

10.1) Generate a random decimal r_b ∈ [0, 1]; if r_b < 0.5, score the individuals in the parent population P and the offspring population P″ with the energy function, sort them by energy from low to high, and select the N lowest-energy individuals as the next-generation population; otherwise, execute step 10.2);

10.2) Calculate the diversity of every individual in the parent population P and the offspring population P″ as follows:

diversity(C_i) = max{RMSE_dif(C_i, C_j) | C_j ∈ {P ∪ P″}, C_i ≠ C_j}

where

[formula images not reproduced]

RMSE′_ato denotes the similarity between the structures of individual C_i and individual C_j over positions 0 to L/2 of the amino acid sequence, and RMSE″_ato denotes the similarity between their structures over positions L/2 to L;

C′_i and C′_j denote the structures of individual C_i and individual C_j corresponding to positions 0 to L/2 of the amino acid sequence;

C″_i and C″_j denote the structures of individual C_i and individual C_j corresponding to positions L/2 to L of the amino acid sequence;

the coordinate symbols in the formulas denote, respectively, the three-dimensional coordinates of the i-th atom in C′_i and C′_j, and L is the sequence length of the structure. Finally, the individuals are sorted by diversity from high to low, and the N individuals with the greatest diversity are selected as the next-generation population;

11) Set g = g + 1; if g ≤ G, return to step 7); otherwise, end the loop;

12) Output the prediction result.
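Steps 7) to 9) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: a conformation is assumed to be a list of (phi, psi, omega) tuples, one per residue; loop regions are assumed to be given as inclusive (start, end) index pairs taken from the predicted secondary structure; and the fragment library is assumed to map a start position to a list of candidate 3-residue fragments. All function names are hypothetical.

```python
import random

# ASSUMPTION: a conformation is a list of (phi, psi, omega) tuples, one per
# residue. This is an illustrative representation, not Rosetta's data model.

def pair_parents(population, rng):
    """Step 7: shuffle the population and form N/2 parent pairs."""
    shuffled = population[:]
    rng.shuffle(shuffled)
    return [(shuffled[k], shuffled[k + 1]) for k in range(0, len(shuffled) - 1, 2)]

def crossover(p1, p2, loops, p_c, rng):
    """Step 8: with probability p_c, swap the dihedral angles of one loop region."""
    c1, c2 = list(p1), list(p2)
    start, end = rng.choice(loops)      # 8.1: random loop region, inclusive bounds
    if rng.random() < p_c:              # 8.2: r1 < P_c
        c1[start:end + 1], c2[start:end + 1] = p2[start:end + 1], p1[start:end + 1]
    return c1, c2

def mutate(individual, fragment_library, rng):
    """Step 9: replace the 3-residue window starting at r2 with a library fragment."""
    seq_len = len(individual)
    r2 = rng.randint(0, seq_len - 3)    # r2 in [0, L-3], both ends inclusive
    fragment = rng.choice(fragment_library[r2])
    child = list(individual)
    child[r2:r2 + 3] = fragment
    return child
```

Passing the random generator in explicitly (for example `random.Random(0)`) keeps a run reproducible, which is convenient when comparing sampling strategies.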

The beneficial effects of the invention are as follows: protein structure is predicted by combining a genetic algorithm with a local search strategy, and after crossover and mutation of the population individuals, sampling combines an energy function with the diversity calculation method. Individuals whose local structures differ more strongly are sampled preferentially, which improves prediction efficiency and accuracy and avoids blind sampling to a certain extent.
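The alternation between energy-based and diversity-based selection in step 10) can be sketched as follows. The sketch assumes that `energy` and `diversity` are caller-supplied callables; both are hypothetical placeholders standing in for the Rosetta energy function and the diversity measure of the invention, neither of which is implemented here.

```python
import random

def select_next_generation(parents, offspring, n, energy, diversity, rng):
    """Step 10: choose n survivors from the union of parents and offspring.

    energy(ind) and diversity(ind, pool) are hypothetical callables supplied
    by the caller; they are not part of any real library API.
    """
    pool = parents + offspring
    if rng.random() < 0.5:
        # 10.1: rank by energy, lowest first
        ranked = sorted(pool, key=energy)
    else:
        # 10.2: rank by diversity, highest first
        ranked = sorted(pool, key=lambda ind: diversity(ind, pool), reverse=True)
    return ranked[:n]
```

Because each generation picks one criterion at random, roughly half of the generations favor low energy and the other half favor structural diversity.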

Drawings

FIG. 1 is a schematic diagram of the structure prediction of protein 1ELWA by the protein structure prediction method based on conformational diversity sampling.

FIG. 2 is the three-dimensional structure of protein 1ELWA obtained by the protein structure prediction method based on conformational diversity sampling.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, a protein conformational space optimization method based on conformational diversity sampling comprises the following steps:

1) Input the sequence information of the protein to be predicted and read the sequence length L; set the parameters: population size N, number of iterations G, and crossover probability P_c;

4) Set g = 0, where g ∈ {0, 1, 2, ..., G};

8) Crossover operation, as follows:

8.3) Repeat steps 8.1) and 8.2) until all parent pairs have undergone crossover, generating a new population P′ = {P′₁, P′₂, ..., P′_N};

9) Mutation operation, as follows:

10) Selection operation, as follows:

diversity(C_i) = max{RMSE_dif(C_i, C_j) | C_j ∈ {P ∪ P″}, C_i ≠ C_j}

where

[formula images not reproduced]

RMSE′_ato denotes the similarity between the structures of individual C_i and individual C_j over positions 0 to L/2 of the amino acid sequence, and RMSE″_ato denotes the similarity between their structures over positions L/2 to L;

the coordinate symbols in the formulas denote, respectively, the three-dimensional coordinates of the i-th atom in C′_i and C′_j, and L is the sequence length of the structure. Finally, the individuals are sorted by diversity from high to low, and the N individuals with the greatest diversity are selected as the next-generation population;

12) Output the prediction result.
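The diversity measure can be sketched as follows. Because the defining formulas are not reproduced in the text, the sketch rests on two loudly stated assumptions: RMSE_ato is taken to be a plain coordinate root-mean-square error over one atom per residue without superposition, and RMSE_dif is taken to be the absolute difference between the first-half and second-half values. Both choices are guesses for illustration, not the definitions of the invention, and the function names are hypothetical.

```python
import math

def rmse(a, b):
    """Root-mean-square error between two equal-length lists of (x, y, z).

    ASSUMPTION: one atom per residue, no structural superposition.
    """
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))

def rmse_dif(ci, cj):
    """ASSUMPTION: |RMSE over positions 0..L/2 - RMSE over positions L/2..L|."""
    half = len(ci) // 2
    first = rmse(ci[:half], cj[:half])    # structures C'_i and C'_j
    second = rmse(ci[half:], cj[half:])   # structures C''_i and C''_j
    return abs(first - second)

def diversity(ci, pool):
    """diversity(C_i) = max of RMSE_dif(C_i, C_j) over every other C_j in pool."""
    return max(rmse_dif(ci, cj) for cj in pool if cj is not ci)
```

Under these assumptions, a pair of conformations that agree in one half of the chain but differ in the other receives a high score, which matches the stated aim of favoring individuals with large local structural differences.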

In this embodiment, taking protein 1ELWA with a sequence length of 117 as an example, a protein conformational space optimization method based on conformational diversity sampling comprises the following steps:

1) Input the sequence information of the protein to be predicted and read the sequence length L = 117; set the parameters: population size N = 100, number of iterations G = 10, and crossover probability P_c = 0.5;

4) Set g = 0;

8) Crossover operation, as follows:

8.1) Let P₁* and P₂* be the two parent individuals of a pair, and randomly select a loop region;

9) Mutation operation, as follows:

9.2) Repeat step 9.1) until all individuals have been mutated, generating a new population P″ = {P″₁, P″₂, ..., P″_N};

10) Selection operation, as follows:

10.1) Generate a random decimal r_b ∈ [0, 1]; if r_b < 0.5, score the individuals in the parent population P and the offspring population P″ with the energy function, sort them by energy from low to high, and select the N lowest-energy individuals as the next-generation population; otherwise, execute step 10.2);

diversity(C_i) = max{RMSE_dif(C_i, C_j) | C_j ∈ {P ∪ P″}, C_i ≠ C_j}

where

[formula images not reproduced]

RMSE′_ato denotes the similarity between the structures of individual C_i and individual C_j over positions 0 to L/2 of the amino acid sequence, and RMSE″_ato denotes the similarity between their structures over positions L/2 to L;

12) Output the prediction result.
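The quantities implied by the embodiment's parameters can be worked out with a short sketch. The class and property names are hypothetical, and the pass count follows a literal reading of step 11) of the method (increment g, repeat while g ≤ G, starting from g = 0); it is an illustration of that reading, not a statement about the actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GAParams:
    """Parameters of the 1ELWA embodiment (values taken from step 1)."""
    seq_len: int = 117      # L
    pop_size: int = 100     # N
    max_gen: int = 10       # G
    p_cross: float = 0.5    # P_c

    @property
    def parent_pairs(self):
        # Step 7: N/2 parent pairs
        return self.pop_size // 2

    @property
    def max_mutation_start(self):
        # Step 9.1: r2 is drawn from [0, L-3]
        return self.seq_len - 3

    @property
    def loop_passes(self):
        # Step 11, read literally: g starts at 0, is incremented after each
        # pass, and the loop repeats while g <= G.
        g, passes = 0, 0
        while True:
            passes += 1
            g += 1
            if g > self.max_gen:
                return passes

params = GAParams()
print(params.parent_pairs, params.max_mutation_start, params.loop_passes)
# prints: 50 114 11
```

With these values each generation forms 50 parent pairs, the mutation window can start at any position from 0 to 114, and steps 7) to 10) run 11 times per Rosetta stage under the literal reading of step 11).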

Taking protein 1ELWA with an amino acid sequence length of 117 as an example, a near-native conformation of the protein was obtained by the above method. The root-mean-square deviation of the prediction was [value not reproduced], as shown in FIG. 1, and the predicted structure is shown in FIG. 2.

The foregoing describes the prediction results of one embodiment of the invention. The invention is not limited to the above embodiment and may be implemented with various modifications that do not depart from its basic idea or go beyond its substance.

Claims

1. A protein conformational space optimization method based on conformational diversity sampling, the method comprising the following steps:

2) Based on the sequence information of the target protein, construct a fragment library using Robetta, and predict the secondary structure of the target sequence using PSIPRED;

4) Set g = 0, where g ∈ {0, 1, 2, ..., G};

8) Crossover operation, as follows:

8.3) Repeat steps 8.1) and 8.2) until all parent pairs have undergone crossover, generating a new population P′ = {P′₁, P′₂, ..., P′_N};

9) Mutation operation, as follows:

9.1) For an individual P′_i in the population P′, generate a random integer r₂ ∈ [0, L-3], and randomly select a fragment from the corresponding 3-residue fragment library for replacement;

10) Selection operation, as follows:

diversity(C_i) = max{RMSE_dif(C_i, C_j) | C_j ∈ {P ∪ P″}, C_i ≠ C_j}

where

[formula images not reproduced]

12) Output the prediction result.