CN113257338A

CN113257338A - Protein structure prediction method based on residue contact map information game mechanism

Info

Publication number: CN113257338A
Application number: CN202110440653.1A
Authority: CN
Inventors: 张贵军; 侯铭桦; 魏源; 彭春祥; 杨涛; 郭赛赛; 周晓根
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2021-08-13

Abstract

A protein structure prediction method based on a residue contact map information game mechanism. First, multiple residue contact maps are obtained from four protein residue contact prediction servers, RaptorX, ResPRE, NeBcon and DeepMetaPSICOV, selected according to the Jaccard indexes from the CASP competition, and are used to construct multiple energy functions. Second, the population is initialized using the first and second stages of Rosetta, and new test conformations are generated by applying mutation and crossover to the target conformations. Finally, a Pareto-based multi-objective optimization algorithm updates the conformations according to the energy functions constructed from the four residue contact maps, guiding the sampling toward conformations whose structures are closer to the native state.

Description

Protein structure prediction method based on residue contact map information game mechanism

Technical Field

The invention relates to the fields of bioinformatics and computational intelligence, and in particular to a protein structure prediction method based on a residue contact map information game mechanism.

Background

Life and health is a leading direction for future industrial development worldwide, and a fundamental field for raising public health and enhancing people's sense of well-being. All life processes, including reproduction, are closely tied to the synthesis, breakdown and modification of proteins. The three-dimensional structure of a protein determines its specific biological function and is the material basis of life activities. A misfolded protein may fail to function properly. For example, the brains of patients with senile dementia contain numerous disordered aggregates formed by misfolded proteins. Obtaining the three-dimensional structure of proteins is therefore a prerequisite for breakthroughs in the life and health field, for a deeper understanding of life phenomena and processes, and for targeted drug development.

At present, conventional wet-lab methods, including X-ray crystallography, nuclear magnetic resonance and cryo-electron microscopy, can determine the three-dimensional structure of proteins, but they place high demands on materials, instruments and personnel and are extremely time-consuming. There is therefore an urgent need to model structures from sequences and to explore protein structure prediction using computational techniques.

Protein structure prediction is a major research problem in bioinformatics, and two main approaches currently exist. The first constructs an energy function model from physicochemical knowledge of biomolecules; it has led the field since the early CASP competitions and still holds an important position. It is represented by Rosetta from the Baker laboratory at the University of Washington and I-TASSER from the Yang Zhang laboratory at the University of Michigan. As a structure prediction tool, Rosetta can predict, design and analyze a variety of biomolecular systems, including proteins, RNA, DNA, peptides, small molecules, and non-canonical or derivatized amino acids. I-TASSER is a method for predicting protein structure and function; it predicts the function of a target using the multi-threading method LOMETS, the protein function database BioLiP, and the like. While physicochemical modeling has achieved abundant results, it also shows shortcomings such as insufficiently accurate representation and incomplete features. The second approach mainly uses deep learning to predict contacts, distances and other information in order to construct a knowledge-based model. In the recently announced CASP14 results, AlphaFold, proposed by Google, ranked first in the human group, far ahead of second place, and Tencent's tFold, competing for the first time, also performed well, ranking first in the contact group.

In terms of contact prediction in the CASP competition, although the precision of contact prediction keeps improving, erroneous information still exists, and the Jaccard distance graph shows that different prediction servers capture different sets of information. In addition, although deep learning has made great progress in protein structure prediction, especially in residue contact prediction, several different sets of residue contact information are usually integrated by simple weighted superposition when a protein structure is folded, so part of the predicted residue contact information is lost and prediction accuracy inevitably suffers. On the other hand, computational protein structure prediction is usually evaluated with a single energy function, which limits sampling ability, and the resulting conformation may be optimal in energy without being optimal in structure; that is, a low-energy conformation is not necessarily the one closest to the native conformation.
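By way of illustration only, the degree to which two servers capture different contact information can be quantified with the Jaccard index of their predicted contact sets. The following Python sketch assumes that each prediction has already been reduced to a list of residue index pairs; the function name jaccard_index and the toy predictions are hypothetical and are not taken from any of the servers.

```python
from typing import Iterable, Tuple

Contact = Tuple[int, int]


def jaccard_index(a: Iterable[Contact], b: Iterable[Contact]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of predicted contacts."""
    # Store each pair in sorted order so that (i, j) and (j, i) are the same contact.
    set_a = {tuple(sorted(pair)) for pair in a}
    set_b = {tuple(sorted(pair)) for pair in b}
    union = set_a | set_b
    if not union:
        return 1.0  # two empty predictions are treated as identical
    return len(set_a & set_b) / len(union)


# Toy predictions from two hypothetical servers.
server_a = [(1, 10), (2, 15), (3, 20), (5, 30)]
server_b = [(10, 1), (2, 15), (7, 40)]

similarity = jaccard_index(server_a, server_b)
distance = 1.0 - similarity  # Jaccard distance: larger means less shared information
```

A Jaccard distance well above zero between two servers indicates that each predicts contacts the other does not, which is the situation that motivates keeping the contact maps separate.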

Therefore, existing protein structure prediction methods fall short in data utilization efficiency and in conformation selection and evaluation, and need improvement.

Summary of the Invention

To overcome the low data utilization efficiency and low prediction accuracy of existing protein structure prediction methods, the invention provides a protein structure prediction method based on a residue contact map information game mechanism. Based on the four protein residue contact servers RaptorX, ResPRE, NeBcon and DeepMetaPSICOV and on the Rosetta platform, multiple energy functions are constructed from multiple residue contact maps, and a multi-objective optimization method guides the optimization of the conformational space.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A protein structure prediction method based on a residue contact map information game mechanism, the method comprising the following steps:

1) Input the sequence information of the target protein;

2) according to the sequence information of the given target protein, the following four contact map prediction servers are utilized:

RaptorX(http://raptorx.uchicago.edu/ContactMap/)；

ResPRE(https://zhanglab.ccmb.med.umich.edu/ResPRE/)；

NeBcon(https://zhanglab.ccmb.med.umich.edu/NeBcon/)；

DeepMetaPSICOV(http://bioinf.cs.ucl.ac.uk/psipred/)；

Acquire four residue contact information files and process the data to generate four contact map files, namely contactMapRaptorX, contactMapRespre, contactMapNeBcon and contactMapDeepMetaPSICOV;

3) Construct four energy functions, EnergyRaptorX(C_n), EnergyResPRE(C_n), EnergyNeBcon(C_n) and EnergyPSICOV(C_n), from the four contact map files contactMapRaptorX, contactMapRespre, contactMapNeBcon and contactMapDeepMetaPSICOV, respectively. The energy functions are defined in terms of the following quantities:

the contact confidence of the k-th residue pair (i, j) in each of the residue contact maps contactMapRaptorX, contactMapRespre, contactMapNeBcon and contactMapDeepMetaPSICOV;

the actual distance between the residues of the k-th residue pair (i, j);

the threshold d_con = 8 Å, the maximum distance at which two residues are considered to be in contact; and

the contact scores of conformation C_n under the four energy functions EnergyRaptorX(C_n), EnergyResPRE(C_n), EnergyNeBcon(C_n) and EnergyPSICOV(C_n);
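The following Python sketch shows one way a contact-based energy of the kind described in step 3) might be evaluated. The exact formula is not reproduced in this text, so the scoring rule used here (minus the summed confidence of every predicted pair whose distance is within d_con) is only an assumed stand-in, and the names contact_energy and energy_array as well as the toy coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
ContactMap = Dict[Tuple[int, int], float]  # (i, j) -> predicted contact confidence

D_CON = 8.0  # contact threshold of step 3), assumed to be in angstroms


def contact_energy(coords: List[Coord], contact_map: ContactMap,
                   d_con: float = D_CON) -> float:
    """Assumed stand-in score: minus the summed confidence of satisfied contacts."""
    energy = 0.0
    for (i, j), confidence in contact_map.items():
        # A predicted contact is satisfied when the two residues lie within d_con.
        if math.dist(coords[i], coords[j]) <= d_con:
            energy -= confidence
    return energy


def energy_array(coords: List[Coord], contact_maps: List[ContactMap]) -> List[float]:
    """One energy value per contact map, in the order the maps are given."""
    return [contact_energy(coords, cmap) for cmap in contact_maps]


# Toy conformation: four residues on a line, 3.8 units apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
maps = [{(0, 2): 0.9, (0, 3): 0.5}, {(1, 3): 0.7}]
energies = energy_array(coords, maps)
```

Under this assumed rule a lower value means that more high-confidence contacts are satisfied, so each contact map contributes one objective to be minimized.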

4) Acquire the fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence, comprising a 3-residue fragment library file and a 9-residue fragment library file;

5) Set the parameters: the population size NP, the crossover factor CR and the maximum number of iterations G, and set the initial generation counter g = 0;

6) Population initialization: generate NP initial conformations C_n, n ∈ {1, 2, …, NP}, by random fragment assembly;

7) Substitute each conformation C_n into the four energy functions EnergyRaptorX(C_n), EnergyResPRE(C_n), EnergyNeBcon(C_n) and EnergyPSICOV(C_n) to obtain four energy values, which are assembled into the energy array of that conformation;

8) Construct the first conformation pool from the energy arrays as follows:

8.1) Set the initial number of conformations N = 0;

8.2) Traverse the population and compare the energy array of each conformation C_n with those of all other conformations. If there is no conformation C_m, where C_m is any conformation other than the current conformation C_n, whose four energy values are all better than those of C_n, record C_n as a Pareto-efficient solution;

8.3) Put the Pareto-efficient conformations into the first conformation pool, record the current number of conformations as N, and remove the remaining conformations;
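The Pareto screening of step 8) may be sketched in Python as follows. The sketch assumes that a lower energy value is better and that one conformation beats another only when it is strictly lower on all four energies; the function names dominates and pareto_indices and the toy energy arrays are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if energy array `a` is lower (assumed better) than `b` on every objective."""
    return all(x < y for x, y in zip(a, b))


def pareto_indices(energies: List[Sequence[float]]) -> List[int]:
    """Indices of the conformations that no other conformation dominates."""
    return [
        n for n, e_n in enumerate(energies)
        if not any(dominates(e_m, e_n)
                   for m, e_m in enumerate(energies) if m != n)
    ]


# Toy energy arrays, one row per conformation and one column per energy function.
population = [
    (-3.0, -2.0, -1.0, -4.0),
    (-2.0, -3.0, -2.0, -1.0),
    (-1.0, -1.0, -0.5, -0.5),  # worse than conformation 0 on all four energies
    (-3.5, -1.0, -1.5, -4.5),
]
first_pool = pareto_indices(population)
n_first = len(first_pool)  # plays the role of N in step 8.3)
```

Because no single energy is allowed to overrule the others, a conformation survives as long as at least one contact map still favors it over every competitor.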

9) Loop: set g = g + 1; if g > G, go to step 15);

10) Treat each conformation C_n, n ∈ {1, 2, 3, …, N}, in the first conformation pool as the target conformation, and generate a mutant conformation as follows:

10.1) Randomly generate positive integers n1, n2 and n3 in the range 1 to N, with n1 ≠ n2 ≠ n3 ≠ n;

10.2) Randomly select a position in conformation C_n1 and use the 9-residue fragment at that position to replace the fragment at the same position in conformation C_n3; then randomly select a position in conformation C_n2, different from the position selected in C_n1, and use the 9-residue fragment at that position to replace the fragment at the same position in C_n3; then perform 3-residue fragment assembly on C_n3 to generate the mutant conformation;

11) Perform a crossover operation on each mutant conformation, n ∈ {1, 2, 3, …, N}, to generate a test conformation, as follows:

11.1) Generate a random number rand1 ∈ (0, 1);

11.2) If rand1 ≤ CR, randomly select a 3-residue fragment from the target conformation and substitute it into the mutant conformation; otherwise leave the mutant conformation unchanged;

11.3) Put the resulting test conformation into the second conformation pool;
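The mutation of step 10) and the crossover of step 11) may be sketched together in Python as follows. The sketch assumes that a conformation is represented as a list of per-residue (phi, psi) torsion pairs and that a fragment is copied by overwriting a window of that list; the 3-residue fragment assembly from the fragment library is omitted, and the names copy_window, mutate and crossover are hypothetical.

```python
import random
from typing import List, Tuple

Torsion = Tuple[float, float]      # (phi, psi) of one residue, an assumed representation
Conformation = List[Torsion]


def copy_window(src: Conformation, dst: Conformation, start: int, length: int) -> None:
    """Overwrite dst[start:start+length] with the same window of src."""
    dst[start:start + length] = src[start:start + length]


def mutate(pool: List[Conformation], n: int, rng: random.Random) -> Conformation:
    """Step 10): build a mutant from three distinct pool members other than target n."""
    n1, n2, n3 = rng.sample([i for i in range(len(pool)) if i != n], 3)
    mutant = list(pool[n3])
    # Two different start positions for the 9-residue windows.
    p1, p2 = rng.sample(range(len(mutant) - 9 + 1), 2)
    copy_window(pool[n1], mutant, p1, 9)
    copy_window(pool[n2], mutant, p2, 9)
    return mutant


def crossover(target: Conformation, mutant: Conformation,
              cr: float, rng: random.Random) -> Conformation:
    """Step 11): with probability CR copy one 3-residue window from the target."""
    trial = list(mutant)
    if rng.random() <= cr:
        start = rng.randrange(len(trial) - 3 + 1)
        copy_window(target, trial, start, 3)
    return trial


rng = random.Random(0)
# Five toy conformations of 20 residues; conformation k holds the constant angle k.
pool = [[(float(k), float(k))] * 20 for k in range(5)]
mutant = mutate(pool, 0, rng)
trial = crossover(pool[0], mutant, 1.0, rng)
```

With CR = 1.0 the test conformation always receives one 3-residue window from the target; with smaller CR a corresponding share of mutants pass through unchanged.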

12) Substitute each test conformation in the second conformation pool into the four energy functions EnergyRaptorX(C_n), EnergyResPRE(C_n), EnergyNeBcon(C_n) and EnergyPSICOV(C_n) to obtain four energy values, which are assembled into the energy array of that test conformation;

13) Traverse the second conformation pool and retain the conformations that are Pareto-efficient solutions of the full population, as follows:

13.1) Compare the conformations in the second conformation pool with one another, retain those that are Pareto-efficient solutions, and record their number as N_TBA;

13.2) Compare each test conformation in the second conformation pool, indexed by m ∈ {1, 2, 3, …, N_TBA}, with each conformation in the first conformation pool, indexed by n ∈ {1, 2, 3, …, N}:

13.2.1) If there exists a conformation in the first conformation pool whose four energy values are all better than those of the test conformation, delete the test conformation from the second conformation pool;

13.2.2) If there exists a conformation in the first conformation pool whose four energy values are all worse than those of the test conformation, and there is no conformation in the first conformation pool whose four energy values are all better than those of the test conformation, replace that conformation in the first conformation pool with the test conformation and delete the test conformation from the second conformation pool;

13.2.3) If, for every conformation in the first conformation pool, there exists k ∈ {1, 2, 3, 4} such that the k-th energy value of the test conformation is better than that of the first-pool conformation, retain the test conformation in the second conformation pool;

13.2.4) Update N_TBA to the number of conformations currently in the second conformation pool;
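The comparison between the two pools in steps 13.2.1) to 13.2.4) may be sketched in Python as follows. The sketch works on energy arrays only and assumes that "better" means strictly lower on all four energies; when a test conformation beats several first-pool members it replaces the first of them and the others are dropped, which is a simplifying assumption. The names dominates and update_pools are hypothetical.

```python
from typing import List, Sequence, Tuple

Energies = Sequence[float]


def dominates(a: Energies, b: Energies) -> bool:
    """True if `a` is lower (assumed better) than `b` on every energy."""
    return all(x < y for x, y in zip(a, b))


def update_pools(first: List[Energies],
                 second: List[Energies]) -> Tuple[List[Energies], List[Energies]]:
    """Return the updated first pool and the test conformations kept in the second."""
    first = list(first)
    kept: List[Energies] = []
    for trial in second:
        if any(dominates(c, trial) for c in first):
            continue  # 13.2.1): the test conformation is beaten, so it is deleted
        beaten = [i for i, c in enumerate(first) if dominates(trial, c)]
        if beaten:
            # 13.2.2): the test conformation takes the place of a beaten member.
            first[beaten[0]] = trial
            for i in reversed(beaten[1:]):
                del first[i]
        else:
            kept.append(trial)  # 13.2.3): neither side beats the other
    return first, kept


first_pool = [(-3.0, -2.0, -1.0, -4.0), (-2.0, -3.0, -2.0, -1.0)]
second_pool = [
    (-1.0, -1.0, -0.5, -0.5),   # beaten by the first member of the first pool
    (-4.0, -3.0, -2.0, -5.0),   # beats the first member of the first pool
    (-2.5, -2.5, -3.0, -0.5),   # incomparable with every first-pool member
]
new_first, new_second = update_pools(first_pool, second_pool)
n_tba = len(new_second)  # 13.2.4): updated N_TBA
```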

14) Perform a selection operation on the conformations in the first conformation pool and the conformations in the second conformation pool, as follows:

14.1) If the total number of conformations in the first and second conformation pools is not less than the set population size, i.e. N + N_TBA ≥ NP, continue to step 14.2); otherwise put the conformations of the second conformation pool into the first conformation pool, empty the second conformation pool and jump to step 9);

14.2) Introduce the conformational similarity index RMSD by computing the RMSD value between each conformation and all the remaining conformations in the two conformation pools, as shown in formula (5):

$$\mathrm{RMSD}(C_i, C_j) = \sqrt{\frac{1}{M}\sum_{a=1}^{M}\left[(x_a^{i}-x_a^{j})^2+(y_a^{i}-y_a^{j})^2+(z_a^{i}-z_a^{j})^2\right]} \qquad (5)$$

where (x_a^i, y_a^i, z_a^i) are the spatial coordinates of atom a in conformation C_i, (x_a^j, y_a^j, z_a^j) are the spatial coordinates of the corresponding atom in any remaining conformation C_j, and M is the number of atoms compared;

14.3) Judge the conformational similarity according to the RMSD values, select the NP conformations with the richest diversity, put them into the first conformation pool, empty the second conformation pool, and go to step 9);
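The RMSD-based selection of steps 14.2) and 14.3) may be sketched in Python as follows. The text does not state how the most diverse conformations are chosen, so the greedy farthest-point rule used here is only an assumption, as is the premise that the conformations are already superimposed; the names rmsd and select_diverse are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Coordinate RMSD of two equally long, already superimposed atom lists."""
    if len(a) != len(b) or not a:
        raise ValueError("conformations must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def select_diverse(confs: List[Sequence[Coord]], np_size: int) -> List[int]:
    """Greedy farthest-point choice of np_size mutually dissimilar conformations."""
    count = len(confs)
    if np_size >= count:
        return list(range(count))
    dist = [[rmsd(a, b) for b in confs] for a in confs]
    # Start from the conformation with the largest summed RMSD to all others.
    chosen = [max(range(count), key=lambda i: sum(dist[i]))]
    while len(chosen) < np_size:
        rest = [i for i in range(count) if i not in chosen]
        # Add the conformation whose nearest already-chosen neighbour is farthest away.
        chosen.append(max(rest, key=lambda i: min(dist[i][j] for j in chosen)))
    return sorted(chosen)


# Toy two-atom conformations shifted along x by different offsets.
offsets = [0.0, 0.1, 5.0, 10.0]
confs = [[(x, 0.0, 0.0), (x + 1.0, 0.0, 0.0)] for x in offsets]
picked = select_diverse(confs, 3)
```

In this toy case the two nearly identical conformations (offsets 0.0 and 0.1) are not both kept, which is the behavior the diversity criterion of step 14.3) is meant to achieve.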

15) Output the result.

The technical concept of the invention is as follows: first, multiple residue contact maps are obtained from the four protein residue contact servers RaptorX, ResPRE, NeBcon and DeepMetaPSICOV in order to construct multiple energy functions; second, the population is initialized using the first and second stages of Rosetta, and new test conformations are generated by applying mutation and crossover to the target conformations; finally, a Pareto-based multi-objective optimization algorithm updates the conformations according to the energy functions constructed from the four residue contact maps, guiding the sampling toward conformations whose structures are closer to the native state.

The beneficial effects of the invention are as follows: first, the residue contact information is predicted by different servers, which increases the diversity of the sources of contact information and reduces the effect on structure prediction of the missing or erroneous information that a single contact map may contain; second, a conformation selection method based on a residue contact information game mechanism is designed in combination with a multi-objective optimization algorithm, which avoids the misguided conformation sampling caused by the inaccuracy of a traditional energy model.

Drawings

FIG. 1 shows the processed information of the four predicted residue contact maps.

FIG. 2 is the distribution of conformations obtained when protein 1ELW is sampled by the protein structure prediction method based on a residue contact map information game mechanism.

FIG. 3 is the three-dimensional structure of protein 1ELW predicted by the protein structure prediction method based on a residue contact map information game mechanism.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIGS. 1 to 3, a protein structure prediction method based on a residue contact map information game mechanism follows steps 1) to 15) set out above.


Taking protein 1ELW, with a sequence length of 117, as an implementation case, the protein structure prediction method based on a residue contact map information game mechanism is carried out according to steps 1) to 15) set out above, with the parameters of step 5) set as follows: population size NP = 200, crossover factor CR = 0.5, maximum number of iterations G = 500, and initial generation counter g = 0.

Taking protein 1ELW, with a sequence length of 117, as an example, the above method yields protein conformations close to the native state. After running 500 generations, the average root mean square deviation between the obtained structures and the native structure is 2.34 Å and the minimum root mean square deviation is 1.65 Å; the predicted three-dimensional structure is shown in FIG. 3.

The foregoing illustrates one embodiment of the invention. The invention is evidently not limited to the embodiment described above and may be practiced with various modifications without departing from its essential spirit.

Claims

1. A protein structure prediction method based on a residue contact map information game mechanism, characterized in that the method comprises the following steps:

1) inputting the sequence information of the target protein;

2) according to the sequence information of the target protein, using the four contact map prediction servers RaptorX, ResPRE, NeBcon and DeepMetaPSICOV to acquire four residue contact information files, and performing data processing to generate four contact map files named contactMapRaptorX, contactMapRespre, contactMapNeBcon and contactMapDeepMetaPSICOV respectively;

3) constructing four energy functions EnergyRaptorX(C_n), EnergyResPRE(C_n), EnergyNeBcon(C_n) and EnergyPSICOV(C_n) from the four contact map files respectively, each giving the contact score of conformation C_n, based on the contact confidence of each residue pair (i, j), the actual distance between the residues of the pair, and the contact threshold d_con = 8 Å;

4) acquiring fragment library files from the ROBETTA server according to the target protein sequence, comprising a 3-residue fragment library file and a 9-residue fragment library file;

5) setting the population size NP, the crossover factor CR and the maximum number of iterations G, and setting the initial generation counter g = 0;

6) population initialization: generating NP initial conformations C_n, n ∈ {1, 2, …, NP}, by random fragment assembly;

7) substituting each conformation C_n into the four energy functions to obtain four energy values, which are assembled into an energy array;

8) constructing a first conformation pool from the energy arrays: 8.1) setting the initial number of conformations N = 0; 8.2) traversing the population and recording a conformation C_n as a Pareto-efficient solution if no other conformation C_m has four energy values that are all better than those of C_n; 8.3) putting the Pareto-efficient conformations into the first conformation pool, recording their number as N, and removing the remaining conformations;

9) loop: g = g + 1; if g > G, going to step 15);

10) treating each conformation in the first conformation pool as a target conformation and generating a mutant conformation: 10.1) randomly generating positive integers n1, n2 and n3 in the range 1 to N, wherein n1 ≠ n2 ≠ n3 ≠ n; 10.2) replacing the fragment of conformation C_n3 at a randomly selected position with the 9-residue fragment of conformation C_n1 at the same position, replacing the fragment of C_n3 at a different randomly selected position with the 9-residue fragment of conformation C_n2 at that position, and then performing 3-residue fragment assembly on C_n3 to generate the mutant conformation;

11) performing a crossover operation on the mutant conformation to generate a test conformation: 11.1) generating a random number rand1 ∈ (0, 1); 11.2) if rand1 ≤ CR, randomly selecting a 3-residue fragment from the target conformation and substituting it into the mutant conformation, otherwise leaving the mutant conformation unchanged; 11.3) putting the resulting test conformation into a second conformation pool;

12) substituting each test conformation in the second conformation pool into the four energy functions to obtain four energy values, which are assembled into an energy array;

13) traversing the second conformation pool and retaining the conformations that are Pareto-efficient solutions of the full population: 13.1) comparing the conformations in the second conformation pool with one another, retaining the Pareto-efficient solutions and recording their number as N_TBA; 13.2) comparing each conformation in the second conformation pool with the conformations in the first conformation pool, deleting it from the second conformation pool if a first-pool conformation is better on all four energy values, using it to replace a first-pool conformation that it betters on all four energy values and deleting it from the second conformation pool, and otherwise, if for every first-pool conformation there exists k ∈ {1, 2, 3, 4} on which it is better, retaining it in the second conformation pool; and then updating N_TBA;

14) performing a selection operation on the conformations of the first conformation pool and of the second conformation pool: 14.1) if N + N_TBA ≥ NP, continuing to step 14.2), otherwise putting the conformations of the second conformation pool into the first conformation pool, emptying the second conformation pool and jumping to step 9); 14.2) introducing the conformational similarity index RMSD by calculating, from the (x, y, z) spatial coordinates of the atoms, the RMSD value between each conformation and all the remaining conformations in the two conformation pools; 14.3) judging the conformational similarity according to the RMSD values, selecting the NP conformations with the richest diversity, putting them into the first conformation pool, emptying the second conformation pool, and going to step 9);

15) outputting the result.