Disclosure of Invention
To overcome the low sampling efficiency of conventional protein structure prediction methods, the invention provides a novel diversity calculation method. The invention predicts protein structure by combining a genetic algorithm with a local search strategy: after crossover and mutation of the population individuals, sampling is performed using an energy function in combination with the diversity calculation method of the invention. The diversity calculation method avoids blind sampling of the protein conformational space. The first and second stages of fragment assembly in the Rosetta protocol predict the general structure of the protein; on this basis, conformations whose local structures differ more are preferentially sampled and then subjected to further fragment assembly, which keeps the algorithm from becoming trapped in a local optimum and improves prediction efficiency and precision.
The technical solution adopted by the invention to solve the technical problem is as follows:
A protein structure prediction method based on conformational diversity sampling, comprising the following steps:
1) Input the sequence information of the protein to be predicted and read the sequence length L; set the parameters: population size N, number of iterations G, crossover probability Pc;
2) According to the sequence information of the target protein, construct a fragment library using Robetta (http://robetta.bakerlab.org), and predict the secondary structure of the target sequence using PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred);
3) Iterate the first and second stages of the Rosetta protocol to generate an initial population of N individuals;
4) Set g = 0, where g ∈ {0, 1, 2, ..., G};
5) If g = 0, perform fragment assembly on the individuals in the population to generate a new population P = {P1, P2, ..., PN};
6) Perform steps 7) to 11) based on the third and fourth stages of the Rosetta protocol, respectively;
7) Randomly pair the individuals in the population to form N/2 parent pairs;
8) Crossover operation, the process is as follows:
8.1) Let P1*, P2* be two parent individuals, and randomly select a loop region;
8.2) Generate a random decimal r1 ∈ [0, 1]; if r1 < Pc, exchange the dihedral angles of all residues in the selected loop region between P1* and P2* to generate two new individuals P′1, P′2;
8.3) Repeat steps 8.1) and 8.2) until all parent pairs have been crossed, generating a new population P′ = {P′1, P′2, ..., P′N};
9) Mutation operation, the process is as follows:
9.1) For each individual P′i in the population P′, generate a random integer r2 ∈ [0, L-3], and randomly select a fragment from the corresponding 3-residue fragment library for replacement;
9.2) Repeat step 9.1) until all individuals have been mutated, generating a new population P″ = {P″1, P″2, ..., P″N};
10) Selection operation, the process is as follows:
10.1) Generate a random decimal rb ∈ [0, 1]; if rb < 0.5, score the individuals in the parent population P and the offspring population P″ with the energy function, sort them by energy from low to high, and select the N lowest-energy individuals as the next-generation population; otherwise, perform step 10.2);
10.2) Calculate the diversity of every individual in the parent population P and the offspring population P″ as follows:
$$\mathrm{diversity}(C_i)=\max\{\mathrm{RMSE}_{dif}(C_i,C_j)\mid C_j\in P\cup P'',\ C_i\neq C_j\}$$
where $\mathrm{RMSE}'$ denotes the similarity between the structures of individuals $C_i$ and $C_j$ corresponding to residues 0 to L/2 of the amino acid sequence, and $\mathrm{RMSE}''$ denotes the similarity between their structures corresponding to residues L/2 to L; $C'_i$ and $C'_j$ denote the structures of individuals $C_i$ and $C_j$ corresponding to residues 0 to L/2; $C''_i$ and $C''_j$ denote the structures of individuals $C_i$ and $C_j$ corresponding to residues L/2 to L; the two coordinate symbols in the RMSE expression denote, respectively, the three-dimensional coordinates of the i-th atom in $C'_i$ and $C'_j$; and L is the sequence length of the structure. Finally, the individuals are sorted by diversity from high to low, and the N individuals with the greatest diversity are selected as the next-generation population;
11) g = g + 1; if g ≤ G, go to step 7); otherwise, end the loop;
12) Output the prediction result.
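The crossover and mutation operations of steps 8) and 9) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: a conformation is modelled as a list of (phi, psi, omega) dihedral tuples, one per residue; `loop_regions` is a hypothetical list of residue ranges derived from the predicted secondary structure; and `fragment_library` is a hypothetical mapping from a start position to candidate 3-residue fragments.

```python
import random

# Assumption: a conformation is a list of (phi, psi, omega) tuples, one per residue.
# All function and parameter names below are hypothetical and chosen for illustration.


def crossover(parent_a, parent_b, loop_regions, pc, rng=random):
    """Steps 8.1)-8.2): swap the dihedral angles of one randomly chosen loop region."""
    child_a, child_b = list(parent_a), list(parent_b)
    start, end = rng.choice(loop_regions)  # residue range [start, end)
    if rng.random() < pc:  # r1 < Pc
        child_a[start:end], child_b[start:end] = (
            parent_b[start:end],
            parent_a[start:end],
        )
    return child_a, child_b


def mutate(individual, fragment_library, rng=random):
    """Step 9.1): replace a 3-residue window with a fragment from the library."""
    length = len(individual)
    pos = rng.randint(0, length - 3)  # r2 in [0, L-3], both bounds inclusive
    fragment = rng.choice(fragment_library[pos])
    mutant = list(individual)
    mutant[pos:pos + 3] = fragment
    return mutant
```

Both functions return new lists and leave the parents untouched, so the parent population P remains available for the selection step.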
The beneficial effects of the invention are as follows: the protein structure is predicted by combining a genetic algorithm with a local search strategy, and after crossover and mutation of the population individuals, sampling is performed by combining an energy function with the diversity calculation method. Individuals with large local structural differences are preferentially sampled, which improves prediction efficiency and precision and avoids blind sampling to a certain extent.
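The way the energy function and the diversity measure alternate during selection (step 10) can be illustrated with a short sketch. It is a hedged example only: `energy` and `diversity` are hypothetical callables supplied by the caller, standing in for the energy function and the diversity calculation, whose actual implementations the text does not give.

```python
import random


def select_next_generation(parents, offspring, n, energy, diversity, rng=random):
    """Step 10): with probability 0.5 keep the n lowest-energy individuals,
    otherwise keep the n individuals with the greatest diversity.

    energy(c) and diversity(c, pool) are hypothetical callables (assumptions)."""
    pool = parents + offspring  # parent population P together with offspring P''
    if rng.random() < 0.5:  # rb < 0.5: energy-based selection
        ranked = sorted(pool, key=energy)  # low to high energy
    else:  # diversity-based selection
        scores = [diversity(c, pool) for c in pool]
        order = sorted(range(len(pool)), key=lambda k: scores[k], reverse=True)
        ranked = [pool[k] for k in order]
    return ranked[:n]
```

Sorting indices rather than the individuals themselves avoids comparing conformations directly when two diversity scores tie.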
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 and fig. 2, a protein conformational space optimization method based on conformational diversity sampling comprises the following steps:
1) Input the sequence information of the protein to be predicted and read the sequence length L; set the parameters: population size N, number of iterations G, crossover probability Pc;
2) According to the sequence information of the target protein, construct a fragment library using Robetta (http://robetta.bakerlab.org), and predict the secondary structure of the target sequence using PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred);
3) Iterate the first and second stages of the Rosetta protocol to generate an initial population of N individuals;
4) Set g = 0, where g ∈ {0, 1, 2, ..., G};
5) If g = 0, perform fragment assembly on the individuals in the population to generate a new population P = {P1, P2, ..., PN};
6) Perform steps 7) to 11) based on the third and fourth stages of the Rosetta protocol, respectively;
7) Randomly pair the individuals in the population to form N/2 parent pairs;
8) Crossover operation, the process is as follows:
8.1) Let P1*, P2* be two parent individuals, and randomly select a loop region;
8.2) Generate a random decimal r1 ∈ [0, 1]; if r1 < Pc, exchange the dihedral angles of all residues in the selected loop region between P1* and P2* to generate two new individuals P′1, P′2;
8.3) Repeat steps 8.1) and 8.2) until all parent pairs have been crossed, generating a new population P′ = {P′1, P′2, ..., P′N};
9) Mutation operation, the process is as follows:
9.1) For each individual P′i in the population P′, generate a random integer r2 ∈ [0, L-3], and randomly select a fragment from the corresponding 3-residue fragment library for replacement;
9.2) Repeat step 9.1) until all individuals have been mutated, generating a new population P″ = {P″1, P″2, ..., P″N};
10) Selection operation, the process is as follows:
10.1) Generate a random decimal rb ∈ [0, 1]; if rb < 0.5, score the individuals in the parent population P and the offspring population P″ with the energy function, sort them by energy from low to high, and select the N lowest-energy individuals as the next-generation population; otherwise, perform step 10.2);
10.2) Calculate the diversity of every individual in the parent population P and the offspring population P″ as follows:
$$\mathrm{diversity}(C_i)=\max\{\mathrm{RMSE}_{dif}(C_i,C_j)\mid C_j\in P\cup P'',\ C_i\neq C_j\}$$
where $\mathrm{RMSE}'$ denotes the similarity between the structures of individuals $C_i$ and $C_j$ corresponding to residues 0 to L/2 of the amino acid sequence, and $\mathrm{RMSE}''$ denotes the similarity between their structures corresponding to residues L/2 to L; $C'_i$ and $C'_j$ denote the structures of individuals $C_i$ and $C_j$ corresponding to residues 0 to L/2; $C''_i$ and $C''_j$ denote the structures of individuals $C_i$ and $C_j$ corresponding to residues L/2 to L; the two coordinate symbols in the RMSE expression denote, respectively, the three-dimensional coordinates of the i-th atom in $C'_i$ and $C'_j$; and L is the sequence length of the structure. Finally, the individuals are sorted by diversity from high to low, and the N individuals with the greatest diversity are selected as the next-generation population;
11) g = g + 1; if g ≤ G, go to step 7); otherwise, end the loop;
12) Output the prediction result.
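The diversity calculation of step 10.2) can be sketched as below. Several points are assumptions and not taken from the text: each structure is reduced to one (x, y, z) point per residue, the structures are taken to be already superimposed, and, because the expression for RMSE_dif is not legible in the source, the way RMSE′ and RMSE″ are combined is left as a caller-supplied `combine` function whose default (absolute difference) is purely a placeholder.

```python
import math


def rmse(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def rmse_dif(ci, cj, combine=lambda first, second: abs(first - second)):
    """Combine the half-chain RMSE values of two structures.

    ASSUMPTION: the default absolute difference is a placeholder; the source
    does not show how RMSE' and RMSE'' enter RMSE_dif."""
    half = len(ci) // 2
    rmse_first = rmse(ci[:half], cj[:half])   # RMSE'  : residues 0 to L/2
    rmse_second = rmse(ci[half:], cj[half:])  # RMSE'' : residues L/2 to L
    return combine(rmse_first, rmse_second)


def diversity(ci, pool):
    """diversity(C_i): maximum RMSE_dif against every other individual in the pool."""
    return max(rmse_dif(ci, cj) for cj in pool if cj is not ci)
```

Here `pool` corresponds to the union of P and P″, and it must contain at least one individual other than `ci`, otherwise `max` has nothing to compare.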
In this embodiment, taking protein 1ELWA with a sequence length of 117 as an example, a protein conformational space optimization method based on conformational diversity sampling comprises the following steps:
1) Input the sequence information of the protein to be predicted and read the sequence length L = 117; set the parameters: population size N = 100, number of iterations G = 10, crossover probability Pc = 0.5;
2) According to the sequence information of the target protein, construct a fragment library using Robetta (http://robetta.bakerlab.org), and predict the secondary structure of the target sequence using PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred);
3) Iterate the first and second stages of the Rosetta protocol to generate an initial population of N individuals;
4) Set g = 0;
5) If g = 0, perform fragment assembly on the individuals in the population to generate a new population P = {P1, P2, ..., PN};
6) Perform steps 7) to 11) based on the third and fourth stages of the Rosetta protocol, respectively;
7) Randomly pair the individuals in the population to form N/2 parent pairs;
8) Crossover operation, the process is as follows:
8.1) Let P1*, P2* be two parent individuals, and randomly select a loop region;
8.2) Generate a random decimal r1 ∈ [0, 1]; if r1 < Pc, exchange the dihedral angles of all residues in the selected loop region between P1* and P2* to generate two new individuals P′1, P′2;
8.3) Repeat steps 8.1) and 8.2) until all parent pairs have been crossed, generating a new population P′ = {P′1, P′2, ..., P′N};
9) Mutation operation, the process is as follows:
9.1) For each individual P′i in the population P′, generate a random integer r2 ∈ [0, L-3], and randomly select a fragment from the corresponding 3-residue fragment library for replacement;
9.2) Repeat step 9.1) until all individuals have been mutated, generating a new population P″ = {P″1, P″2, ..., P″N};
10) Selection operation, the process is as follows:
10.1) Generate a random decimal rb ∈ [0, 1]; if rb < 0.5, score the individuals in the parent population P and the offspring population P″ with the energy function, sort them by energy from low to high, and select the N lowest-energy individuals as the next-generation population; otherwise, perform step 10.2);
10.2) Calculate the diversity of every individual in the parent population P and the offspring population P″ as follows:
$$\mathrm{diversity}(C_i)=\max\{\mathrm{RMSE}_{dif}(C_i,C_j)\mid C_j\in P\cup P'',\ C_i\neq C_j\}$$
where $\mathrm{RMSE}'$ denotes the similarity between the structures of individuals $C_i$ and $C_j$ corresponding to residues 0 to L/2 of the amino acid sequence, and $\mathrm{RMSE}''$ denotes the similarity between their structures corresponding to residues L/2 to L; $C'_i$ and $C'_j$ denote the structures of individuals $C_i$ and $C_j$ corresponding to residues 0 to L/2; $C''_i$ and $C''_j$ denote the structures of individuals $C_i$ and $C_j$ corresponding to residues L/2 to L; the two coordinate symbols in the RMSE expression denote, respectively, the three-dimensional coordinates of the i-th atom in $C'_i$ and $C'_j$; and L is the sequence length of the structure. Finally, the individuals are sorted by diversity from high to low, and the N individuals with the greatest diversity are selected as the next-generation population;
11) g = g + 1; if g ≤ G, go to step 7); otherwise, end the loop;
12) Output the prediction result.
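The control flow of steps 7) to 11) with the parameter values of this embodiment can be sketched as follows. The operator arguments `crossover`, `mutate` and `select` are hypothetical callables assumed for illustration, and the sketch assumes an even population size so that every individual finds a partner. Read literally, the loop test in step 11) (g ≤ G after incrementing from g = 0) gives G + 1 passes, which is what the sketch reproduces.

```python
import random

# Parameter values stated in the embodiment; the dictionary itself is illustrative.
PARAMS = {"L": 117, "N": 100, "G": 10, "Pc": 0.5}


def evolve(population, generations, crossover, mutate, select, rng=random):
    """Steps 7) to 11): pair, cross, mutate, select, and repeat.

    crossover(a, b) -> (child_a, child_b), mutate(x) -> x', and
    select(parents, offspring, n) -> next population are assumed callables."""
    g = 0  # step 4)
    while True:
        shuffled = list(population)
        rng.shuffle(shuffled)  # step 7): random pairing
        crossed = []
        for k in range(0, len(shuffled) - 1, 2):  # step 8): one pair at a time
            crossed.extend(crossover(shuffled[k], shuffled[k + 1]))
        mutated = [mutate(ind) for ind in crossed]  # step 9)
        population = select(population, mutated, len(population))  # step 10)
        g += 1  # step 11)
        if g > generations:
            return population
```

With G = 10 as in `PARAMS`, `evolve(initial_population, PARAMS["G"], ...)` would perform eleven passes over steps 7) to 10) before returning the final population.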
Taking protein 1ELWA, with an amino acid sequence length of 117, as an example, a near-native conformation of the protein was obtained by the above method, and the root mean square deviation of the prediction was
As shown in fig. 1, the prediction structure is shown in fig. 2.
The foregoing describes the prediction results of one embodiment of the invention. The invention is not limited to the above embodiment and may be implemented with various modifications without departing from its basic idea or exceeding its gist.