CN109509510B - Protein structure prediction method based on multi-population ensemble variation strategy

Info

Publication number: CN109509510B
Application number: CN201810762915.4A
Authority: CN
Inventors: 张贵军; 彭春祥; 周晓根; 刘俊; 王柳静; 胡俊
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-07-12
Filing date: 2018-07-12
Publication date: 2021-06-18
Anticipated expiration: 2038-07-12
Also published as: CN109509510A

Abstract

A protein structure prediction method based on a multi-population ensemble mutation strategy. Under the framework of an evolutionary algorithm, the population is first divided equally into four sub-populations, and a different mutation strategy is designed for each sub-population through cooperation among the conformations of the sub-populations. Secondly, conformations are selected according to the Rosetta score3 energy function, a distance error coefficient and a Monte Carlo probabilistic acceptance criterion to guide the conformation update process. This alleviates the inaccuracy of the energy function, guides the sampling toward conformations with lower energy and a more reasonable structure, and improves sampling efficiency. The invention provides a protein structure prediction method based on a multi-population ensemble mutation strategy with high sampling efficiency and high prediction accuracy.

Description

Protein structure prediction method based on multi-population ensemble variation strategy

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a protein structure prediction method based on a multi-population ensemble variation strategy.

Background

The rapid development of computer hardware and software has provided a solid platform for the development of de novo prediction methods. Progress and breakthroughs in de novo protein structure prediction have in turn drawn in researchers from computer science and evolutionary computation, making it one of the most active multidisciplinary research topics in protein structure prediction in recent years. In a review article published in Science in 2012, Professor Dill, a member of the U.S. National Academy of Sciences, reviewed 50 years of progress in the field and pointed out that the search for an answer to this problem has greatly promoted the development of supercomputers, new materials and drug discovery, and has helped people understand the basic processes of life. Nevertheless, de novo prediction methods currently face a number of difficulties and challenges.

The de novo prediction method is based directly on a physics-based or knowledge-based protein energy model and uses an optimization algorithm to search the conformational space for the global minimum-energy conformation. The conformational space optimization method is currently one of the most critical factors limiting the accuracy of de novo protein structure prediction. Applying an optimization algorithm to the de novo sampling process must first address three problems: (1) the complexity of the energy model; (2) the high dimensionality of the energy model; (3) the inaccuracy of the energy model. At present we are still far from constructing a force field that is accurate enough to guide the target sequence to fold in the correct direction, so the mathematically optimal solution does not necessarily correspond to the native structure of the target protein; furthermore, the inaccuracy of the model also makes it impossible to analyze the performance of an optimization algorithm objectively.

The inherent complexity of protein conformational space optimization makes it a very challenging research topic in de novo protein structure prediction. To find the unique native structure of a protein in a huge sampling space by computer, an efficient conformational space optimization algorithm must be designed so that the search becomes a practically solvable computational problem.

The differential evolution (DE) algorithm has been successfully applied to protein structure prediction owing to its simple structure, ease of implementation, strong robustness and fast convergence. However, as the amino acid sequence grows longer, the number of degrees of freedom of the protein molecular system increases, and obtaining the global optimum of a large-scale protein conformational space by sampling with a traditional population-based algorithm becomes challenging; moreover, although a coarse-grained model reduces the conformational search space, it also loses information about the interaction forces, which directly affects prediction accuracy.

Therefore, the conventional protein structure prediction method has disadvantages in sampling efficiency and prediction accuracy, and needs to be improved.

Disclosure of Invention

In order to overcome the defects of low sampling efficiency, poor population diversity and low prediction precision of the conventional protein structure prediction method, the invention introduces a multi-population mutation strategy to guide conformational space optimization under the framework of a basic differential evolution algorithm, and provides the protein structure prediction method based on the multi-population ensemble mutation strategy, which has high sampling efficiency and high prediction precision.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method of protein structure prediction based on a multi-population ensemble mutation strategy, the prediction method comprising the steps of:

1) providing the sequence information of the target protein;

2) obtaining fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence, the fragment library files comprising a 3-residue fragment library file and a 9-residue fragment library file;

3) obtaining a distance spectrum file from the QUARK server (https://zhanglab.ccmb.med.umich.edu/QUARK/) according to the sequence information;

4) setting parameters: the population size NP, the maximum number of generations G of the algorithm, the crossover factor CR and the temperature factor β, and setting the generation counter g = 0;

5) population initialization: generating NP initial conformations C_i, i ∈ {1, 2, …, NP}, by random fragment assembly, and dividing the NP individuals equally into four sub-populations whose members are indexed by j, k, m and n respectively, where j ∈ {1, 2, …, NP/4}, k ∈ {NP/4+1, …, NP/2}, m ∈ {NP/2+1, …, 3NP/4} and n ∈ {3NP/4+1, …, NP};

6) for each individual in the first sub-population, indexed by j, performing the following operations:

6.1) setting the j-th individual as the target individual; randomly selecting a conformation from the first sub-population; randomly selecting two of the remaining three sub-populations and randomly taking one individual from each, denoted C_a and C_b; randomly selecting from each of C_a and C_b a 9-residue fragment, the two fragments being at different positions, and substituting them into the corresponding positions of the conformation selected from the first sub-population to generate a mutated conformation; and performing one fragment assembly on the mutated conformation to generate a new conformation;

6.2) randomly generating a uniformly distributed number R between 0 and 1; if R > CR, substituting a randomly selected 9-residue fragment into the corresponding position of the conformation generated in step 6.1); otherwise keeping that conformation unchanged; the conformation resulting from this operation is recorded as the test conformation;

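6.3) calculating the energies of the target conformation and the test conformation separately with the Rosetta score3 energy function;

6.4) if the energy of the test conformation is lower than that of the target conformation, replacing the target conformation with the test conformation, adding 1 to the acceptance count count1, and going to step 6.8); otherwise continuing with step 6.5);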

6.5) for the residue pairs in the distance spectrum, calculating the corresponding inter-residue distances in the test conformation and in the target conformation, and then calculating the distance error coefficients D_trial and D_target of the test conformation and the target conformation according to formulas (1) and (2) respectively, where T is the number of residue pairs in the distance spectrum, the inter-residue distance of the t-th residue pair in each conformation is the distance between the C_α atoms of the two residues, d_N is the mean value of the distance spectrum in the N-th distance interval, and PD_N is the number of distance-spectrum distances falling within interval N; the distance range of the distance spectrum is (0, 9) and the interval width is 0.5, i.e. the distance intervals are (0, 0.5], (0.5, 1], …, (8.5, 9);

6.6) if D_trial < D_target, replacing the target conformation with the test conformation and adding 1 to the acceptance count count1; otherwise performing step 6.7);

6.7) calculating the difference between the distance error coefficients of the target and test conformations, and accepting the test conformation with a probability determined by the Monte Carlo criterion, where β is the temperature factor;

6.8) setting j = j + 1 and iterating steps 6.1) to 6.8) until j = NP/4;

7) for each conformation in the second sub-population, indexed by k, performing the following operations:

7.1) recording the k-th conformation as the target individual; selecting the lowest-energy conformation in the second sub-population; randomly selecting two of the other three sub-populations and randomly selecting one conformation from each, denoted C_c and C_d; randomly selecting from each of C_c and C_d a 9-residue fragment, the two fragments being at different positions, and substituting them into the corresponding positions of the lowest-energy conformation to generate a mutated conformation; and performing one fragment assembly on the mutated conformation to generate a new conformation;

7.2) operating on the target conformation and the newly generated conformation according to steps 6.2) to 6.7), the number of times the test conformation is accepted being recorded as count2;

7.3) setting k = k + 1 and iterating steps 7.1) to 7.2) until k = NP/2;

8) for each conformation in the third sub-population, indexed by m, performing the following operations:

8.1) recording the m-th conformation as the target individual; sorting the third sub-population by energy in ascending order and randomly selecting one individual from the lower-energy half of the conformations; then randomly selecting two of the other three sub-populations and randomly selecting one conformation from each, denoted C_e and C_f; randomly selecting from each of C_e and C_f a 9-residue fragment, the two fragments being at different positions, and substituting them into the corresponding positions of the selected individual to generate a mutated conformation; and performing one fragment assembly on the mutated conformation to generate a new conformation;

8.2) operating on the target conformation and the newly generated conformation according to steps 6.2) to 6.7), the number of times the test conformation is accepted being recorded as count3;

8.3) setting m = m + 1 and iterating steps 8.1) to 8.2) until m = 3NP/4;

9) performing Rosetta fragment assembly on all conformations in the fourth sub-population, indexed by n;

10) iterating steps 6) to 9); every 20 generations, comparing count1, count2 and count3, mutating the fourth sub-population with the sub-population mutation strategy corresponding to the largest of the three, then operating according to steps 6.2) to 6.8), and resetting count1, count2 and count3 to zero;

11) setting g = g + 1 and iterating steps 6) to 10) until g > G;

12) outputting the result.
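
The bookkeeping in steps 5) and 10) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, indices are 0-based rather than the 1-based j, k, m, n used above, and the tie-breaking rule and the choice of which generations count as "every 20 generations" are assumptions, since the text does not specify them.

```python
from typing import Dict, List


def split_population(np_size: int) -> List[range]:
    """Return 0-based index ranges for the four equal sub-populations (step 5)."""
    if np_size <= 0 or np_size % 4 != 0:
        raise ValueError("this sketch assumes NP is a positive multiple of 4")
    quarter = np_size // 4
    return [range(s * quarter, (s + 1) * quarter) for s in range(4)]


def pick_strategy(counts: Dict[int, int]) -> int:
    """Return the strategy (1, 2 or 3) with the largest acceptance count (step 10)."""
    # Ties go to the lowest-numbered strategy: an assumption, not from the source.
    return max(sorted(counts), key=lambda s: counts[s])


def is_review_generation(g: int, period: int = 20) -> bool:
    """True when the counts are compared and reset, here taken as g = 20, 40, ..."""
    # Excluding g = 0 is an assumption of this sketch.
    return g > 0 and g % period == 0
```

With NP = 100, `split_population(100)` yields the ranges 0-24, 25-49, 50-74 and 75-99, and `pick_strategy({1: 12, 2: 30, 3: 7})` selects strategy 2 for the fourth sub-population.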

The technical conception of the invention is as follows: under the framework of an evolutionary algorithm, the population is first divided equally into four sub-populations, and a different mutation strategy is designed for each sub-population through cooperation among the conformations of the sub-populations; secondly, conformations are selected according to the Rosetta score3 energy function, a distance error coefficient and a Monte Carlo probabilistic acceptance criterion to guide the conformation update process, which alleviates the inaccuracy of the energy function, guides the sampling toward conformations with lower energy and a more reasonable structure, and improves sampling efficiency. The invention provides a protein structure prediction method based on a multi-population ensemble mutation strategy with high sampling efficiency and high prediction accuracy.
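
The three-stage selection described here (energy, then distance error coefficient, then Monte Carlo acceptance) can be sketched as follows. This is an illustrative sketch only: the function name `accept_trial` is hypothetical, the energies and distance error coefficients are assumed to have been computed elsewhere, and the acceptance probability exp(-ΔD/β) with ΔD = D_trial - D_target is the usual Metropolis form, assumed here because the text gives only the temperature factor β.

```python
import math
import random
from typing import Optional


def accept_trial(e_trial: float, e_target: float,
                 d_trial: float, d_target: float,
                 beta: float,
                 rng: Optional[random.Random] = None) -> bool:
    """Decide whether the test conformation replaces the target conformation."""
    source = rng if rng is not None else random
    # Stage 1: a lower score3 energy is accepted outright.
    if e_trial < e_target:
        return True
    # Stage 2: otherwise a smaller distance error coefficient is accepted.
    if d_trial < d_target:
        return True
    # Stage 3: otherwise accept with a Metropolis-style probability.
    # The exponential form below is an assumption of this sketch.
    delta_d = d_trial - d_target
    return source.random() < math.exp(-delta_d / beta)
```

The second stage is what allows a conformation with a higher energy but better agreement with the distance spectrum to survive, which is how the method compensates for an inaccurate energy function.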

The beneficial effects of the invention are as follows: guiding mutation through cooperation among multiple populations improves sampling efficiency while maintaining population diversity; using the distance spectrum to assist conformation selection retains conformations that have a higher energy but a reasonable structure, which mitigates the prediction errors caused by the inaccuracy of the energy function and improves prediction accuracy.
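
The cross-population mutation that underlies this cooperation can be sketched on a toy representation. The sketch below is an assumption-laden illustration, not the Rosetta-based implementation: a conformation is reduced to a plain list with one value per residue, the name `mutate` is hypothetical, the two fragment windows are only required to start at different positions (they may overlap), and the fragment assembly that follows mutation in the method is omitted.

```python
import random
from typing import List, Sequence

Conformation = List[float]  # toy stand-in: one value per residue
FRAG_LEN = 9                # 9-residue fragments, as in the method


def mutate(base: Conformation,
           other_subpops: Sequence[Sequence[Conformation]],
           rng: random.Random) -> Conformation:
    """Copy two 9-residue windows from donors in two other sub-populations."""
    # Pick two distinct sub-populations out of the other three.
    chosen = rng.sample(list(other_subpops), 2)
    # Take one random donor conformation from each of them.
    donors = [rng.choice(list(pop)) for pop in chosen]
    # Two different window start positions (windows may overlap in this sketch).
    starts = rng.sample(range(len(base) - FRAG_LEN + 1), 2)
    mutant = list(base)  # leave the base conformation untouched
    for donor, start in zip(donors, starts):
        mutant[start:start + FRAG_LEN] = donor[start:start + FRAG_LEN]
    return mutant
```

Only the choice of `base` differs between the three strategies: a random member of the first sub-population, the lowest-energy member of the second, or a random member of the lower-energy half of the third.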

Drawings

FIG. 1 is a conformational profile of protein 2EZK sampled by a protein structure prediction method based on a multi-population ensemble mutation strategy.

FIG. 2 is a schematic diagram of the conformational update when protein 2EZK is sampled by a protein structure prediction method based on a multi-population ensemble mutation strategy.

FIG. 3 is the three-dimensional structure of protein 2EZK predicted by the protein structure prediction method based on a multi-population ensemble mutation strategy.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIGS. 1 to 3, a protein structure prediction method based on a multi-population ensemble mutation strategy comprises steps 1) to 12) exactly as set out in the Disclosure of Invention above.

Taking the α protein 2EZK, with a sequence length of 99, as an example, the protein structure prediction method based on a multi-population ensemble mutation strategy follows steps 1) to 12) above, with the parameters in step 4) set as follows: population size NP = 100, maximum number of generations G = 1000, crossover factor CR = 0.3, temperature factor β = 2, and generation counter g = 0.
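
For reference, the parameter values of this 2EZK example can be collected in a small configuration object. The class and field names below are hypothetical; the values are those given in step 4) of the example, and the 20-generation review period comes from step 10).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    np_size: int = 100       # population size NP
    max_gen: int = 1000      # maximum number of generations G
    cr: float = 0.3          # crossover factor CR
    beta: float = 2.0        # temperature factor
    review_period: int = 20  # generations between strategy reviews, step 10)

    @property
    def subpop_size(self) -> int:
        """Size of each of the four equal sub-populations."""
        return self.np_size // 4

    @property
    def n_reviews(self) -> int:
        """Number of full review periods that fit into the run."""
        return self.max_gen // self.review_period
```

With these values each sub-population holds 25 conformations and the run contains 50 full review periods.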

Taking alpha protein 2EZK with sequence length 99 as an example, the near-native conformation of the protein is obtained by the above method, and the mean root mean square deviation between the structure obtained by running 1000 generations and the native structure is

Minimum root mean square deviation of

The predicted three-dimensional structure is shown in fig. 3.
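
The root mean square deviation used as the accuracy measure can be sketched as below. This is a simplified illustration under stated assumptions: the function name `rmsd` is hypothetical, both structures are given as equal-length lists of C_α coordinates, and they are assumed to have been optimally superimposed beforehand; the text does not state how the superposition is performed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(model: Sequence[Point], native: Sequence[Point]) -> float:
    """Root mean square deviation between two pre-superimposed coordinate sets."""
    if len(model) != len(native) or len(model) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(model, native)
    )
    return math.sqrt(squared / len(model))
```

The mean and minimum values quoted for a run would then be the average and the smallest of this quantity over the final population.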

The foregoing describes one embodiment of the invention. The invention is clearly not limited to the above embodiment and may be practiced with various modifications without departing from its essential spirit.

Claims

1. A protein structure prediction method based on multi-population ensemble mutation strategy is characterized in that: the method comprises the following steps:

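1) providing the sequence information of the target protein;

2) obtaining fragment library files from the ROBETTA server according to the target protein sequence, the fragment library files comprising a 3-residue fragment library file and a 9-residue fragment library file;

3) obtaining a distance spectrum file from the QUARK server according to the sequence information;

4) setting parameters: the population size NP, the maximum number of generations G of the algorithm, the crossover factor CR and the temperature factor β, and setting the generation counter g = 0;

5) population initialization: generating NP initial conformations C_i, i ∈ {1, 2, …, NP}, by random fragment assembly, and dividing the NP individuals equally into four sub-populations whose members are indexed by j, k, m and n respectively, where j ∈ {1, 2, …, NP/4}, k ∈ {NP/4+1, …, NP/2}, m ∈ {NP/2+1, …, 3NP/4} and n ∈ {3NP/4+1, …, NP};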

6) for each individual in the first sub-population, indexed by j, performing the following operations:

6.1) setting the j-th individual as the target individual; randomly selecting a conformation from the first sub-population; randomly selecting two of the remaining three sub-populations and randomly taking one individual from each, denoted C_a and C_b; randomly selecting from each of C_a and C_b a 9-residue fragment, the two fragments being at different positions, and respectively substituting them into