CN111180004A

CN111180004A - Multi-contact information sub-population strategy protein structure prediction method

Info

Publication number: CN111180004A
Application number: CN201911197621.2A
Authority: CN
Inventors: 张贵军; 彭春祥; 刘俊; 周晓根; 李亭
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-05-19
Anticipated expiration: 2039-11-29
Also published as: CN111180004B

Abstract

A protein structure prediction method using a multi-contact-information sub-population strategy. Under an evolutionary algorithm framework, the population is first initialized by fragment assembly. The population is then divided into several sub-populations, and each individual in the sub-populations undergoes mutation and crossover to generate a new conformation. In the selection stage, new conformations are first selected using the Rosetta energy function score3; the spatial constraint score S_con(C) is then used to further screen the new low-energy conformations, while a Monte Carlo probabilistic acceptance criterion preserves conformational diversity during selection. By using sub-populations together with contact information predicted by several contact servers to assist structure prediction, the method mitigates the inaccuracy of the energy function and improves population diversity. The invention provides a multi-contact-information sub-population strategy protein structure prediction method with good diversity and high prediction accuracy.

Description

Multi-contact information sub-population strategy protein structure prediction method

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a method for predicting a protein structure of a multi-contact information sub-population strategy.

Background

Protein structure prediction is a major research topic in structural bioinformatics. In the global protein structure prediction competition CASP13, held in Cancún, Mexico, in December 2018, AlphaFold, developed by Google's DeepMind team, ranked first overall. The most innovative breakthrough of AlphaFold is that it predicts the spatial distance relationships within a protein structure using machine learning and uses these distance constraints as an energy function to guide protein folding, which greatly improves prediction accuracy. This work also shows that deep integration of computer technology, information technology, and the life sciences can effectively drive and accelerate new scientific discoveries. However, de novo prediction methods still face a number of difficulties and challenges.

First, because energy models are inaccurate, the accuracy of inter-residue contact information is one of the key factors currently limiting the accuracy of de novo protein structure prediction. Although the precision of inter-residue contact prediction has reached an unprecedented level, the accuracy of the predicted contacts is still limited, and the quality of the contact information predicted by different contact prediction servers is uneven, so that contact prediction accuracy and structure prediction accuracy do not correspond well.

Second, the inherent complexity of protein conformational space optimization makes it a very challenging research topic in de novo protein structure prediction. To find the unique native structure of a protein in a huge sampling space by computer, the search must be formulated as a tractable computational problem, and an efficient conformational space optimization algorithm must be designed. The differential evolution (DE) algorithm has a simple structure, is easy to implement, is robust, and converges quickly, and it is widely applied in protein conformational space optimization. However, as the amino acid sequence grows longer, the number of degrees of freedom of the protein molecular system also increases, and obtaining the global optimum of a large-scale protein conformational space with a traditional population-based algorithm while maintaining population diversity becomes challenging.

Therefore, the conventional protein structure prediction methods are insufficient in diversity and prediction accuracy, and improvement is required.

Disclosure of Invention

To overcome the poor diversity and low prediction accuracy of existing protein structure prediction methods during sampling, the invention first uses contact information predicted by several contact prediction servers to construct a high-confidence contact set. In addition, using the concept of sub-populations, different spatial constraint models are applied to different sub-populations to assist the Rosetta score3 energy function in guiding conformation selection. The invention provides a multi-contact-information sub-population strategy protein structure prediction method with good diversity and high prediction accuracy.
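As a rough illustration of how a high-confidence contact set might be assembled from several servers' predictions (top L/5 contacts per server, then merging with confidence averaging for repeated residue pairs, as the method steps specify), the following Python sketch may help. The function names, the dictionary representation of a contact map, the rounding of L/5 downward, and the treatment of (m, n) and (n, m) as the same pair are all assumptions made for this example and are not stated in the source.

```python
from collections import defaultdict


def top_contacts(contact_map, seq_len):
    """Keep the L/5 highest-confidence contacts of one server's prediction.

    contact_map maps a residue pair (m, n) to its predicted confidence.
    """
    k = seq_len // 5  # assumption: L/5 is rounded down
    ranked = sorted(contact_map.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def merge_contacts(*contact_sets):
    """Merge per-server sets; repeated residue pairs get their mean confidence."""
    collected = defaultdict(list)
    for contacts in contact_sets:
        for (m, n), conf in contacts.items():
            pair = (min(m, n), max(m, n))  # assumption: pairs are unordered
            collected[pair].append(conf)
    merged = {pair: sum(vals) / len(vals) for pair, vals in collected.items()}
    # Sort by confidence, highest first (dicts preserve insertion order).
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))
```

Here `len(merge_contacts(contact1, contact2, contact3))` would play the role of Num, the number of contacts in the merged set.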

The technical scheme adopted by the invention for solving the technical problems is as follows:

A multi-contact-information sub-population strategy protein structure prediction method, comprising the following steps:

1) given the sequence information of the target protein;

2) obtaining fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence, the fragment library files comprising a 3-residue fragment library file and a 9-residue fragment library file;

3) predicting three contact maps for the target protein sequence with the RaptorX server (raptorx.uchicago.edu/ContactMap), the ResTriplet server (zhanglab.ccmb.med.umich.edu/ResTriplet) and the DNCON2 server (sysbio.rnet.missouri.edu/DNCON2), respectively, and, from each contact map, selecting the L/5 contacts with the highest confidence to form the high-confidence contact sets contact1, contact2 and contact3, respectively, where L is the length of the target protein sequence;

4) constructing a high-confidence contact set contact4 from the contact information in contact1, contact2 and contact3, according to the following rules:

4.1) for contact information in contact1, contact2 and contact3 whose residue pair is not repeated across the sets, adding each contact to contact4 directly;

4.2) for contact information in contact1, contact2 and contact3 whose residue pair is repeated, first averaging the confidences of the repeated contact and then adding the contact with the averaged confidence to contact4;

4.3) sorting the contacts in contact4 by confidence in descending order, and counting the number Num of contacts in contact4;

5) setting the parameters: population size NP, maximum number of generations G, crossover factor CR and temperature factor β, and setting the generation counter g = 0;

6) population initialization: generating NP initial conformations C_i, i = 1, 2, …, NP, by random fragment assembly, and dividing the NP initial conformations equally into 4 sub-populations;

7) for each individual C_i in the population, performing the following operations:

7.1) setting C_i as the target individual C_target, randomly selecting two different individuals C_a and C_b from the population, with C_target ≠ C_a ≠ C_b, randomly selecting one 9-residue fragment at a different position from each of C_a and C_b, and using the two fragments to replace the fragments at the corresponding positions of C_target to generate the mutant conformation C_mutant;

7.2) performing one fragment assembly on C_mutant to generate a new conformation C_mutant′;

7.3) generating a random number pCR ∈ (0, 1); if pCR < CR, randomly selecting one 3-residue fragment from C_target and using it to replace the fragment at the corresponding position of C_mutant′ to generate the trial conformation C_trial; otherwise, taking C_mutant′ directly as C_trial;

7.4) computing the energies score3(C_target) and score3(C_trial) of C_target and C_trial with the Rosetta score3 energy function;

7.5) if score3(C_trial) > score3(C_target), retaining C_target;

7.6) if score3(C_trial) < score3(C_target), computing the spatial constraint scores S_con(C_trial) and S_con(C_target) according to formula (1), where S_con(C) is defined as follows:

[Formula (1) is missing from the source text]

where m and n are the indices of the two residues of the k-th contact in the high-confidence contact set, K is the number of contacts in the high-confidence contact set, d_{m,n} is the Euclidean distance between the m-th and n-th residues in conformation C, and U_{m,n} is the confidence of the contact corresponding to the residue pair (m, n) in the high-confidence contact set. The high-confidence contact set is chosen by four conditions, whose expressions are also missing from the source text: under the first condition contact1 is selected, under the second contact2, under the third contact3, and under the fourth contact4;

7.7) if S_con(C_trial) < S_con(C_target), C_trial replaces C_target and enters the population;

7.8) if S_con(C_trial) > S_con(C_target), C_trial replaces C_target and enters the population with probability P_accept; if the replacement is unsuccessful, C_target is retained, where P_accept is defined as follows:

[Formula for P_accept is missing from the source text]

8) g = g + 1; iteratively executing steps 5) to 8) until g > G;

9) outputting the conformation with the lowest Rosetta score3 energy as the final result.
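The mutation and crossover operations of steps 7.1) and 7.3) can be sketched in Python as follows. This is only an illustration under stated assumptions: a conformation is represented as a list with one entry per residue (for example a tuple of backbone torsion angles), the two 9-residue fragments are copied into a copy of the target individual, and "different positions" is read as different start positions. The function names are hypothetical. Step 7.2), the fragment assembly move, is omitted because it depends on the fragment library.

```python
import random


def mutate(population, target_idx, frag_len=9, rng=random):
    """Step 7.1 sketch: copy one fragment from each of two other individuals
    into a copy of the target, at the same residue positions."""
    target = population[target_idx]
    others = [i for i in range(len(population)) if i != target_idx]
    a, b = rng.sample(others, 2)  # two distinct individuals, both != target
    n_windows = len(target) - frag_len + 1
    pos_a, pos_b = rng.sample(range(n_windows), 2)  # two different start positions
    mutant = list(target)
    mutant[pos_a:pos_a + frag_len] = population[a][pos_a:pos_a + frag_len]
    mutant[pos_b:pos_b + frag_len] = population[b][pos_b:pos_b + frag_len]
    return mutant


def crossover(target, mutant, cr, frag_len=3, rng=random):
    """Step 7.3 sketch: with probability CR, copy one 3-residue fragment of the
    target into the mutant at the same position; otherwise keep the mutant."""
    trial = list(mutant)
    if rng.random() < cr:
        pos = rng.randrange(len(target) - frag_len + 1)
        trial[pos:pos + frag_len] = target[pos:pos + frag_len]
    return trial
```

If the two 9-residue windows overlap, the second copy overwrites part of the first in this sketch; the source does not say how overlaps are handled.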

The technical concept of the invention is as follows: under an evolutionary algorithm framework, the population is first initialized by fragment assembly. The population is then divided into several sub-populations, and each individual in the sub-populations undergoes mutation and crossover to generate a new conformation. In the selection stage, new conformations are first selected using the Rosetta energy function score3; the spatial constraint score S_con(C) is then used to further screen the new low-energy conformations, while a Monte Carlo probabilistic acceptance criterion preserves conformational diversity during selection. By using sub-populations together with contact information predicted by several contact servers to assist structure prediction, the method mitigates the inaccuracy of the energy function and improves population diversity. The invention provides a multi-contact-information sub-population strategy protein structure prediction method with good diversity and high prediction accuracy.
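The two-stage selection described above can be sketched in Python as follows. The energy function and the spatial constraint score are passed in as callables, since neither the Rosetta API nor formula (1) is reproduced in this text. The acceptance probability exp(-ΔS/β) is a common Metropolis-style choice and is only an assumption here, as are the function name and the handling of equal scores.

```python
import math
import random


def select(target, trial, score3, s_con, beta, rng=random):
    """Two-stage selection sketch (steps 7.4 to 7.8). Returns the survivor."""
    # Stage 1: the trial must have lower energy than the target.
    # Equal energies are not covered by the source; the target is kept here.
    if score3(trial) >= score3(target):
        return target
    # Stage 2: compare spatial constraint scores (lower is treated as better).
    delta = s_con(trial) - s_con(target)
    if delta < 0:
        return trial
    # Assumed Metropolis-style acceptance; the source's P_accept formula is
    # not available, so exp(-delta / beta) is only a stand-in.
    p_accept = math.exp(-delta / beta)
    return trial if rng.random() < p_accept else target
```

Accepting some trials with a worse constraint score is what lets the population keep conformations that the contact information alone would reject.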

The beneficial effects of the invention are as follows: different spatial constraint scores are constructed for different sub-populations to assist the Rosetta energy function score3 in selecting conformations, which mitigates prediction errors caused by the inaccuracy of the energy function and improves prediction accuracy.
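One way to tie each individual to the contact set of its sub-population is sketched below in Python. The source states only that the population is divided equally into 4 sub-populations; the contiguous index blocks, the 0-based indexing, and the function names used here are assumptions for illustration.

```python
def subpopulation_index(i, np_size, n_sub=4):
    """Map individual index i (0-based) to a sub-population index 0..n_sub-1.

    Assumption: sub-populations are contiguous, equally sized index blocks.
    """
    if np_size % n_sub != 0:
        raise ValueError("NP must be divisible by the number of sub-populations")
    return i // (np_size // n_sub)


def contact_set_for(i, np_size, contact_sets):
    """Return the contact set that guides individual i, one set per sub-population."""
    return contact_sets[subpopulation_index(i, np_size, len(contact_sets))]
```

With NP = 200, for example, indices 0 to 49 would use the first contact set and indices 150 to 199 the fourth.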

Drawings

FIG. 1 is the distribution of conformations obtained when the multi-contact-information sub-population strategy protein structure prediction method samples protein 4UEX.

FIG. 2 is a schematic diagram of the conformation update for protein 4UEX under the multi-contact-information sub-population strategy protein structure prediction method.

FIG. 3 is the three-dimensional structure of protein 4UEX predicted by the multi-contact-information sub-population strategy protein structure prediction method.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIGS. 1 to 3, a multi-contact-information sub-population strategy protein structure prediction method comprises steps 1) to 9) as set out in the Disclosure of Invention above.

Taking the α protein 4UEX, with a sequence length of 82, as an example, the multi-contact-information sub-population strategy protein structure prediction method is carried out through steps 1) to 9) above, with the parameters of step 5) set as follows: population size NP = 200, maximum number of generations G = 6000, crossover factor CR = 0.5, temperature factor β = 4, and generation counter g = 0.

Taking the α protein 4UEX, with a sequence length of 82, as an example, a near-native conformation of the protein is obtained using the above method. The average root-mean-square deviation (RMSD) between the structures obtained after running 6000 generations and the native structure is [value missing from the source text], and the minimum RMSD is [value missing from the source text]. The predicted three-dimensional structure is shown in FIG. 3.
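For reference, the root-mean-square deviation between a predicted structure and the native structure can be computed as in the minimal Python sketch below. It assumes that both structures are given as equally long lists of (x, y, z) coordinates, for example of Cα atoms, and that they have already been optimally superimposed; the superposition step itself is not shown, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumption: the two structures are already optimally superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    sq = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(sq / len(coords_a))
```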

The foregoing describes one embodiment of the invention. The invention is evidently not limited to this embodiment and may be practiced with various modifications without departing from its essential spirit.

Claims

1. A multi-contact-information sub-population strategy protein structure prediction method, characterized in that the method comprises the following steps:

1) given the sequence information of the target protein;

2) obtaining fragment library files from the ROBETTA server according to the target protein sequence, the fragment library files comprising a 3-residue fragment library file and a 9-residue fragment library file;

3) predicting three contact maps for the target protein sequence with the RaptorX server, the ResTriplet server and the DNCON2 server, respectively, and, from each contact map, selecting the L/5 contacts with the highest confidence to form the high-confidence contact sets contact1, contact2 and contact3, respectively, where L is the length of the target protein sequence;

4) to 9) as in steps 4) to 9) of the Disclosure of Invention above.