CN111815036A - Protein structure prediction method based on multi-residue contact map cooperative constraint

Info

Publication number: CN111815036A
Application number: CN202010578257.0A
Authority: CN
Inventors: 张贵军; 彭春祥; 刘俊; 周晓根; 夏瑜豪; 赵凯龙
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Guangzhou Zhaoji Biotechnology Co ltd; Shenzhen Xinrui Gene Technology Co ltd
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2020-10-23
Anticipated expiration: 2040-06-23
Also published as: CN111815036B

Abstract

A protein structure prediction method based on multi-residue contact map cooperative constraint, built on the Rosetta framework. First, a population is initialized using the first and second stages of Rosetta, and new test conformations are generated by applying mutation and crossover to the target conformations. Second, from the residue contact maps predicted by four contact prediction servers, a cosine similarity index based on the residue contact maps is designed to assist the Rosetta energy function score3 in updating the conformations, guiding the sampling toward conformations with lower energy and more compact structure. The invention provides a protein structure prediction method based on multi-residue contact map cooperative constraint with high prediction accuracy.

Description

Protein structure prediction method based on multi-residue contact map cooperative constraint

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a protein structure prediction method based on multi-residue contact map cooperative constraint.

Background

Protein structure prediction is a central research topic in structural bioinformatics and an important basic scientific problem left unresolved by the central dogma of molecular biology. In the global protein structure prediction competition CASP13, held in Cancún, Mexico, in early December 2018, the AlphaFold group from DeepMind, a Google subsidiary, ranked first overall. AlphaFold brought protein structure prediction, a frontier problem of basic research, out of the academic sphere and into public view, made it a current hot topic, and is expected to become an important milestone in the development of structural bioinformatics. The work also shows that deep integration of computer technology, information technology and the life sciences can effectively drive and accelerate new scientific discoveries.

The importance of protein structure prediction stems from the limitations of current experimental determination techniques. X-ray crystallography is currently the most effective method for determining protein structures and achieves a precision unmatched by other methods; its main drawbacks are that protein crystals are difficult to grow and that determining a crystal structure takes a long time. Multidimensional nuclear magnetic resonance (NMR) can directly determine the conformation of a protein in solution, but it requires large amounts of high-purity sample and can currently be applied only to small proteins. For membrane proteins, which are important drug targets, the three-dimensional structure is extremely difficult to obtain with existing experimental techniques.

A protein can perform its specific biological function only by folding into a specific three-dimensional structure. Understanding the three-dimensional structure (native-state structure) of a protein is therefore key to understanding its biological function. The three-dimensional structure can be obtained by experimental methods such as nuclear magnetic resonance and X-ray crystallography, but these methods are time-consuming and extremely expensive, and they are unsuitable for proteins that do not crystallize easily. For this reason, many computational algorithms for protein structure prediction have been proposed on the basis of Anfinsen's thermodynamic hypothesis, which holds that the conformation with the lowest energy is the native-state structure.

Driven by both theoretical exploration and application needs, and building on Anfinsen's principle, techniques for predicting protein structures by computer developed rapidly at the end of the 20th century. The CASP competition, initiated in 1994 by Moult, a scientist at the University of Maryland, is a worldwide protein structure prediction and assessment activity. It objectively reflects the state of the art in the protein structure prediction field and is known as the Olympic Games of protein structure prediction. The competition aims to attract experts from different fields, such as computer science and biophysics, to the highly challenging bioinformatics problem of predicting the three-dimensional structure of proteins, and to jointly assess the current state of development and discuss future trends.

Computational protein structure prediction usually evaluates conformations with a very complex energy function. The energy surface has thousands of degrees of freedom and a large number of local optima, and the conformational search space is extremely large. To search the conformational space, a de novo prediction method typically first obtains a global minimum of the conformational space using a knowledge-based coarse-grained energy model, and then refines the corresponding conformation to obtain the predicted structure. A de novo prediction method therefore has to solve two problems: 1. establishing a suitable energy function to evaluate how reasonable a conformation is; 2. devising an effective conformational space search method to find the globally optimal solution. The first is essentially a molecular mechanics problem, whose main purpose is to compute the energy value of each protein structure. The second is essentially a global optimization problem: a suitable optimization method is chosen to search the conformational space quickly and obtain the conformation corresponding to the global energy minimum.

The differential evolution (DE) algorithm has been successfully applied to protein structure prediction thanks to its simple structure, ease of implementation, strong robustness and fast convergence. However, as the amino acid sequence grows longer, the number of degrees of freedom of the protein molecular system increases, and obtaining the global optimum of a large-scale protein conformational space by sampling with a traditional population-based algorithm becomes challenging. In addition, the coarse-grained model reduces the conformational search space but also loses information about interaction forces, which directly affects prediction accuracy.
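For orientation, the classic DE scheme referred to here can be sketched as follows. This is a minimal, generic DE/rand/1/bin minimiser on a toy continuous function, not the conformational variant of the invention; the function name `differential_evolution`, the scale factor `f`, the `sphere` objective and all default values are illustrative assumptions.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.5,
                           generations=100, seed=0):
    """Classic DE/rand/1/bin minimiser over a box-bounded continuous space."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            n1, n2, n3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            # Mutation: base vector plus scaled difference, clipped to the bounds.
            mutant = []
            for d in range(dim):
                value = pop[n1][d] + f * (pop[n2][d] - pop[n3][d])
                mutant.append(min(max(value, bounds[d][0]), bounds[d][1]))
            # Binomial crossover; j_rand guarantees at least one mutant component.
            j_rand = rng.randrange(dim)
            trial = [
                mutant[d] if (rng.random() <= cr or d == j_rand) else pop[i][d]
                for d in range(dim)
            ]
            # Greedy selection: keep the trial only if it is not worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def sphere(x):
    """Toy objective with its global minimum 0 at the origin."""
    return sum(v * v for v in x)

best_x, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
```

In a three-dimensional toy problem this converges quickly; the difficulty described above arises because a protein conformation has far more degrees of freedom and a much more rugged energy surface.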

Therefore, conventional protein structure prediction methods fall short in sampling efficiency and prediction accuracy and need to be improved.

Disclosure of Invention

In order to overcome the defects of low sampling efficiency and low prediction accuracy of the conventional protein structure prediction method, the invention introduces a plurality of residue contact maps to guide conformational space optimization based on Rosetta, and provides a protein structure prediction method based on multi-residue contact map cooperative constraint with high efficiency and high prediction accuracy.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A protein structure prediction method based on multi-residue contact map cooperative constraint, the method comprising the following steps:

1) providing the sequence information of the target protein;

2) obtaining fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence, the fragment library files comprising a 3-fragment library file and a 9-fragment library file;

3) according to the target protein sequence, obtaining four contact maps, namely contact_RaptorX, contact_ResPRE, contact_DeepMetaPSICOV and contact_NeBcon, using the RaptorX-Contact server (http://raptorx.uchicago.edu/ContactMap/), the ResPRE server (https://zhanglab.ccmb.med.umich.edu/ResPRE/), the DeepMetaPSICOV server (http://bioinf.cs.ucl.ac.uk/psipred/) and the NeBcon server (https://zhanglab.ccmb.med.umich.edu/NeBcon/);

4) setting parameters: the population size NP, the maximum number of iterations G, the crossover factor CR, and the generation counter g = 0;

5) population initialization: generating NP initial conformations $C_i$, $i \in \{1, 2, \ldots, NP\}$, by random fragment assembly;

6) each conformation $C_i$, $i \in \{1, 2, 3, \ldots, NP\}$, in the population is taken as the target conformation $C_i^{target}$, and the following operations are performed to generate the mutated conformation $C_i^{mutant}$:

6.1) randomly generating positive integers n1, n2, n3 in the range 1 to NP, wherein n1 ≠ n2 ≠ n3 ≠ i;

6.2) randomly selecting a 9-fragment at a position in conformation $C_{n1}$ to replace the fragment at the same position in conformation $C_{n3}$; randomly selecting from conformation $C_{n2}$ a 9-fragment at a position different from the one selected in $C_{n1}$ to replace the fragment at the same position in conformation $C_{n3}$; then performing 3-fragment assembly on conformation $C_{n3}$ to generate the mutated conformation $C_i^{mutant}$;

7) for each mutated conformation $C_i^{mutant}$, $i \in \{1, 2, 3, \ldots, NP\}$, performing a crossover operation to generate the test conformation $C_i^{trial}$:

7.1) generating a random number rand1, wherein rand1 ∈ (0, 1);

7.2) if rand1 ≤ CR, randomly selecting a 3-fragment from the target conformation $C_i^{target}$ and substituting it into the mutated conformation $C_i^{mutant}$; otherwise the mutated conformation $C_i^{mutant}$ remains unchanged;

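8) for each target conformation $C_i^{target}$ and test conformation $C_i^{trial}$, performing a selection operation:

8.1) calculating the energies of $C_i^{target}$ and $C_i^{trial}$ with the Rosetta score3 energy function: $E(C_i^{target})$ and $E(C_i^{trial})$;

8.2) if $E(C_i^{trial}) > E(C_i^{target})$, the conformation $C_i^{trial}$ is rejected; otherwise, proceeding to step 8.3);

8.3) first, converting $C_i^{target}$ and $C_i^{trial}$ into one-dimensional vectors of length L × L, $V_i^{target}$ and $V_i^{trial}$, and converting contact_RaptorX, contact_ResPRE, contact_DeepMetaPSICOV and contact_NeBcon into four one-dimensional vectors of length L × L, $V^{RaptorX}$, $V^{ResPRE}$, $V^{DeepMetaPSICOV}$ and $V^{NeBcon}$, wherein L is the length of the protein sequence; then calculating the cosine similarity between $V_i^{target}$ and each of the four contact map vectors, and between $V_i^{trial}$ and each of the four contact map vectors, and summing to obtain $CS_i^{target}$ and $CS_i^{trial}$, calculated as follows:

$$CS_i^{target} = \sum_{k} \frac{V_i^{target} \cdot V^{k}}{\lVert V_i^{target} \rVert \, \lVert V^{k} \rVert}, \qquad CS_i^{trial} = \sum_{k} \frac{V_i^{trial} \cdot V^{k}}{\lVert V_i^{trial} \rVert \, \lVert V^{k} \rVert}, \qquad k \in \{RaptorX, ResPRE, DeepMetaPSICOV, NeBcon\};$$

8.4) if $CS_i^{trial} > CS_i^{target}$, the conformation $C_i^{trial}$ replaces the conformation $C_i^{target}$; then going to step 9);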

9) g = g + 1; iteratively executing steps 6) to 8) until g > G;

10) outputting the result.
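The selection rule of step 8) can be illustrated with a small sketch. It assumes that each conformation is available as an L × L contact matrix, that a test conformation is rejected when its energy is higher than that of the target, and that it replaces the target only when its summed cosine similarity is larger. The names `flatten`, `cosine_similarity`, `contact_score` and `select` are hypothetical, and the `energy` callable merely stands in for Rosetta score3, which is not called here.

```python
import math

def flatten(matrix):
    """Flatten an L x L matrix (list of rows) into a vector of length L * L."""
    return [value for row in matrix for value in row]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero map contributes no similarity
    return dot / (norm_u * norm_v)

def contact_score(conformation_map, predicted_maps):
    """Sum of cosine similarities between one conformation and all predicted maps."""
    v = flatten(conformation_map)
    return sum(cosine_similarity(v, flatten(p)) for p in predicted_maps)

def select(target, trial, energy, contact_map_of, predicted_maps):
    """Return the conformation kept for the next generation."""
    if energy(trial) > energy(target):
        return target  # energy filter: the trial is rejected
    cs_target = contact_score(contact_map_of(target), predicted_maps)
    cs_trial = contact_score(contact_map_of(trial), predicted_maps)
    return trial if cs_trial > cs_target else target

# Toy data for L = 3: four predicted maps that all favour the contact (1, 3).
predicted_maps = [
    [[0.0, 0.0, p], [0.0, 0.0, 0.0], [p, 0.0, 0.0]]
    for p in (0.9, 0.8, 0.7, 0.6)
]
target = {"energy": -10.0, "map": [[0, 1, 0], [1, 0, 0], [0, 0, 0]]}
trial = {"energy": -12.0, "map": [[0, 0, 1], [0, 0, 0], [1, 0, 0]]}

def toy_energy(conformation):
    return conformation["energy"]  # stand-in for Rosetta score3

def toy_map(conformation):
    return conformation["map"]

kept = select(target, trial, toy_energy, toy_map, predicted_maps)
```

Because cosine similarity is insensitive to the overall scale of a vector, predicted maps given as probabilities and conformation maps given as 0/1 matrices can be compared directly, which is one plausible reason for choosing this index.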

The technical concept of the invention is as follows: based on the Rosetta framework, a population is first initialized using the first and second stages of Rosetta, and new test conformations are generated by applying mutation and crossover to the target conformations; second, from the residue contact maps predicted by four contact prediction servers, a cosine similarity index based on the residue contact maps is designed to assist the Rosetta energy function score3 in updating the conformations, guiding the sampling toward conformations with lower energy and more compact structure. The invention thus provides a protein structure prediction method based on multi-residue contact map cooperative constraint.
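The overall flow of mutation, crossover and two-stage selection can be sketched on a toy representation. Here a conformation is simply a list of numbers, one per residue; fragments are copied between population members rather than taken from fragment libraries, the 3-fragment assembly is omitted, and `energy` and `contact_score` are caller-supplied stand-ins for Rosetta score3 and the summed cosine similarity. All function names, the copy-to-same-position crossover and the comparison directions in the selection are assumptions made for illustration.

```python
import random

def mutate(pop, i, rng, frag_len=9):
    """Copy one window from C_n1 and one from C_n2 into a copy of C_n3."""
    n1, n2, n3 = rng.sample([k for k in range(len(pop)) if k != i], 3)
    last_start = len(pop[i]) - frag_len
    p1 = rng.randint(0, last_start)
    # Second window starts at a different position (the windows may overlap).
    p2 = rng.choice([p for p in range(last_start + 1) if p != p1])
    mutant = list(pop[n3])
    mutant[p1:p1 + frag_len] = pop[n1][p1:p1 + frag_len]
    mutant[p2:p2 + frag_len] = pop[n2][p2:p2 + frag_len]
    return mutant

def crossover(target, mutant, cr, rng, frag_len=3):
    """With probability CR, copy a 3-residue window from the target."""
    trial = list(mutant)
    if rng.random() <= cr:
        p = rng.randint(0, len(target) - frag_len)
        trial[p:p + frag_len] = target[p:p + frag_len]
    return trial

def evolve(pop, energy, contact_score, cr=0.5, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [list(c) for c in pop]  # work on a copy of the population
    for _ in range(generations):
        for i in range(len(pop)):
            trial = crossover(pop[i], mutate(pop, i, rng), cr, rng)
            # Selection: energy filter first, then contact-map similarity.
            if energy(trial) > energy(pop[i]):
                continue
            if contact_score(trial) > contact_score(pop[i]):
                pop[i] = trial
    return pop

def toy_energy(conformation):
    return sum(x * x for x in conformation)

def toy_contact_score(conformation):
    return -sum(abs(x) for x in conformation)

init_rng = random.Random(1)
initial = [[init_rng.uniform(-3.0, 3.0) for _ in range(12)] for _ in range(6)]
final = evolve(initial, toy_energy, toy_contact_score, cr=0.5, generations=20)
```

With this acceptance rule, the energy of each population slot can never increase and its contact score can never decrease, which is the sense in which the similarity index assists the energy function.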

The beneficial effects of the invention are as follows: first, combining the residue contact map information predicted by different servers mitigates the insufficient recall and accuracy of any single residue contact map; second, a cosine similarity index based on the residue contact maps is designed to assist the Rosetta energy function score3 in updating the conformations, guiding the sampling toward conformations with lower energy and more compact structure.

Drawings

FIG. 1 shows the four predicted residue contact maps.

FIG. 2 is the conformational distribution map obtained by sampling protein 1TEN with the protein structure prediction method based on multi-residue contact map cooperative constraint.

FIG. 3 is the three-dimensional structure of protein 1TEN predicted by the protein structure prediction method based on multi-residue contact map cooperative constraint.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 to FIG. 3, a protein structure prediction method based on multi-residue contact map cooperative constraint comprises steps 1) to 10) as set out in the Disclosure of Invention above.

Taking protein 1TEN, with a sequence length of 87, as an example, the protein structure prediction method based on multi-residue contact map cooperative constraint is carried out according to steps 1) to 10) above, with the parameters in step 4) set as follows: population size NP = 100, maximum number of iterations G = 300, crossover factor CR = 0.5, and generation counter g = 0.

Taking protein 1TEN, with a sequence length of 87, as an example, the above method was used to obtain near-native conformations of the protein. After running 300 generations, the average root mean square deviation (RMSD) between the obtained structures and the native structure is 2.86 Å, the minimum RMSD is 2.01 Å, and the predicted three-dimensional structure is shown in FIG. 3.
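As a reminder of how such figures are typically derived, the sketch below computes the root mean square deviation between two coordinate sets and then the average and minimum over several models. It assumes the structures are already superimposed (an optimal superposition step, such as the Kabsch algorithm, is omitted), and the coordinates are made-up toy values, not data from the 1TEN experiment.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points, no superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

# Toy reference structure and two toy models (illustrative values only).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
models = [
    [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)],
    [(0.0, 0.0, 2.0), (3.8, 0.0, 2.0), (7.6, 0.0, 2.0)],
]
values = [rmsd(model, native) for model in models]
mean_rmsd = sum(values) / len(values)
min_rmsd = min(values)
```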

The foregoing describes one embodiment of the invention. The invention is evidently not limited to the above embodiment and may be practiced with various modifications without departing from its essential spirit.

Claims

1. A protein structure prediction method based on multi-residue contact map cooperative constraint, characterized in that the method comprises the following steps:

1) providing the sequence information of the target protein;

2) obtaining fragment library files from the ROBETTA server according to the target protein sequence, the fragment library files comprising a 3-fragment library file and a 9-fragment library file;

3) according to the target protein sequence, obtaining four contact maps, namely contact_RaptorX, contact_ResPRE, contact_DeepMetaPSICOV and contact_NeBcon, using the RaptorX-Contact server, the ResPRE server, the DeepMetaPSICOV server and the NeBcon server;