CN113205855A

CN113205855A - Knowledge energy function optimization-based membrane protein three-dimensional structure prediction method

Info

Publication number: CN113205855A
Application number: CN202110636292.8A
Authority: CN
Inventors: 柳源; 沈红斌; 冯世豪; 张沛东
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-06-08
Filing date: 2021-06-08
Publication date: 2021-08-03
Anticipated expiration: 2041-06-08
Also published as: CN113205855B

Abstract

A membrane protein three-dimensional structure prediction method based on knowledge energy function optimization, in which constraints on residue distances are obtained from the multiple sequence alignment result of an input sequence combined with statistical knowledge, a structure fragment query library is constructed from the secondary structure prediction result of the input sequence and the known structures in the protein structure database PDB, and a knowledge-based energy function is calculated from the residue contact prediction result of the input sequence. Fragment replacement is then carried out iteratively on an initial structure under the energy function and the residue distance constraints to obtain a plurality of candidate structures, and finally the candidate structures are screened to obtain the final predicted three-dimensional structure of the membrane protein. The invention is a de novo prediction method that combines several techniques, including multiple sequence alignment (MSA), secondary structure prediction and residue contact prediction, and has the advantages of convenient operation and high accuracy.

Description

Knowledge energy function optimization-based membrane protein three-dimensional structure prediction method

Technical Field

The invention relates to a technology in the field of bioengineering, in particular to a membrane protein three-dimensional structure prediction method based on knowledge energy function optimization.

Background

Accurate protein structure information is obtained through experimental determination. The most common experimental methods at present are X-ray diffraction, nuclear magnetic resonance and cryo-electron microscopy, and the protein structures obtained by these methods are stored in the Protein Data Bank (PDB). Few high-resolution membrane protein structures have been solved: only 1267 are in the PDB, about 2% of all protein structures it holds, so computational methods for predicting the three-dimensional structure of membrane proteins are very important. Current computational methods for protein structure prediction follow two main directions: template-based modeling and de novo prediction. Because membrane protein structures in the PDB are rare and a suitable template generally cannot be found, most membrane protein structure prediction methods are de novo. However, membrane proteins tend to have long amino acid sequences, which sharply reduces the time efficiency of most de novo prediction methods, and some methods cannot complete the prediction task at all.

Disclosure of Invention

To address the above defects in the prior art, the invention provides a membrane protein three-dimensional structure prediction method based on knowledge energy function optimization. It is a de novo prediction method that combines several techniques, including multiple sequence alignment (MSA), secondary structure prediction and residue contact prediction, and has the advantages of convenient operation and high accuracy.

The invention is realized by the following technical scheme:

The invention relates to a membrane protein three-dimensional structure prediction method based on knowledge energy function optimization, which comprises: combining statistical knowledge with the multiple sequence alignment result of an input sequence to obtain constraints on residue distances; combining the known structures in the PDB with the secondary structure prediction result of the input sequence to construct a structure fragment query library; and calculating a knowledge-based energy function from the residue contact prediction result of the input sequence. Fragment replacement is then carried out iteratively on an initial structure under the energy function and the residue distance constraints to obtain a plurality of candidate structures, and finally the candidate structures are screened to obtain the final predicted three-dimensional structure of the membrane protein.

The input sequence is the amino acid sequence of a membrane protein. Its length is not limited; it contains a plurality of transmembrane helices, and the transmembrane portion makes up most of the protein.

The multiple sequence alignment result is obtained by aligning the input sequence against the sequences in a sequence database and selecting a plurality of sequences with high homology.

The constraint on the residue distance is as follows: according to statistics over the protein structures in the PDB, the Cβ-Cβ distance between two residues of given types that are in contact is limited to a range bounded by a minimum value and a maximum value.

The statistics are obtained as follows: a large number of real protein structures are analyzed, the distances between pairs of residues that are in contact are calculated for each pair of residue types, and the range of values taken by those distances is recorded.
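The statistics-based distance constraint can be illustrated with a minimal sketch. It assumes the contacts extracted from known structures are available as simple (residue type, residue type, distance) records; the function names, the sample values and the rule that unseen residue-type pairs are unconstrained are all hypothetical and are not taken from the invention.

```python
def contact_distance_ranges(observations):
    """Map each unordered residue-type pair to the (min, max) observed contact distance."""
    ranges = {}
    for res_a, res_b, dist in observations:
        key = tuple(sorted((res_a, res_b)))  # (LEU, ILE) and (ILE, LEU) share one entry
        low, high = ranges.get(key, (dist, dist))
        ranges[key] = (min(low, dist), max(high, dist))
    return ranges


def satisfies_constraint(ranges, res_a, res_b, dist):
    """Check whether a Cβ-Cβ distance lies inside the statistical range for this pair type."""
    key = tuple(sorted((res_a, res_b)))
    if key not in ranges:
        return True  # assumption: pair types without statistics are left unconstrained
    low, high = ranges[key]
    return low <= dist <= high


# Hypothetical observations: (residue type, residue type, Cβ-Cβ distance in angstroms)
observed = [
    ("LEU", "ILE", 5.1),
    ("ILE", "LEU", 7.4),
    ("LEU", "ILE", 6.0),
    ("GLY", "ALA", 4.2),
]
ranges = contact_distance_ranges(observed)
print(ranges[("ILE", "LEU")])
print(satisfies_constraint(ranges, "LEU", "ILE", 9.0))
```

In a real setting the observations would come from a large set of PDB structures rather than four hand-written records.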

The secondary structure prediction result is obtained through a membrane protein transmembrane helix (TMH) prediction model (Membrain) based on multi-scale deep learning. The specific steps are: a protein sequence is input, a large number of similar sequences are obtained through multiple sequence alignment, and, in combination with co-evolution information, the transmembrane regions and the transmembrane orientation are predicted using a deep learning model and a support vector machine classifier.

The membrane protein transmembrane helix prediction model comprises a transmembrane region prediction module and an orientation prediction module, wherein: the transmembrane region prediction module comprises a multi-scale deep learning model and a binarization module; the deep learning model consists of a small-scale, residue-based residual neural network and a large-scale, full-sequence-based residual neural network; the binarization module binarizes the raw prediction scores according to a dynamic threshold and resolves the problem of under-segmentation; and the orientation prediction module uses a support vector machine (SVM) classifier.

The structure fragment query library comprises three libraries: a secondary structure fragment query library built from fragments with a specific secondary structure (alpha-helix or beta-sheet), with a minimum fragment length of 5 residues; a fixed-length fragment query library built from protein fragments of fixed length, with a minimum length of 9 residues and a maximum length of 16 residues; and a short fragment query library built from short fragments 3 residues long.

Each fragment in the fixed-length fragment query library has the same secondary structure as the corresponding positions of some segment of the same length in the query sequence.
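The matching rule for the fixed-length library can be sketched as follows, assuming secondary structure is encoded as a string with one letter per residue (H for helix, E for strand, C for coil). The encoding, the function name and the fragment identifiers are hypothetical; the sketch only shows the idea of keeping known fragments whose secondary structure is identical to a same-length window of the query.

```python
def build_fixed_length_library(query_ss, known_fragments, min_len=9, max_len=16):
    """Return {(start, length): [fragment ids]} for query windows with an exact SS match."""
    # Index the known fragments by their secondary structure string.
    by_ss = {}
    for frag_id, frag_ss in known_fragments:
        by_ss.setdefault(frag_ss, []).append(frag_id)

    library = {}
    for length in range(min_len, max_len + 1):
        for start in range(len(query_ss) - length + 1):
            matches = by_ss.get(query_ss[start:start + length])
            if matches:
                library[(start, length)] = list(matches)
    return library


# Hypothetical predicted secondary structure: 2 coil, 10 helix, 2 coil residues.
query_ss = "CC" + "H" * 10 + "CC"
known = [
    ("frag_a", "H" * 9),
    ("frag_b", "C" + "H" * 9),
    ("frag_c", "E" * 9),
]
library = build_fixed_length_library(query_ss, known)
for window, frag_ids in sorted(library.items()):
    print(window, frag_ids)
```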

The residue contact prediction result is obtained by a deep learning-based protein residue contact prediction model (shen-CDeep). Specifically, the Cβ-Cβ distance between residues is divided into 10 intervals, which are respectively:

and, for each pair of residues, the model predicts the probability that the pair falls in each of these distance intervals.

The protein residue contact prediction model comprises 29 improved ResNet residual modules arranged in five groups of 3, 4, 6, 8 and 8 modules respectively; a dilated convolution mechanism is introduced into the first three groups of modules, and a channel-based attention mechanism is introduced into the last two groups of modules.

The knowledge-based energy function is as follows: a score is calculated for each residue pair from that pair's score-d function, and the sum of all scores is taken as the energy value of the whole structure. The set of score-d functions is obtained as follows: for a protein sequence of L residues, select the top L residue pairs ranked by the predicted probability that the Cβ-Cβ distance lies within

and the top L/5 residue pairs ranked by the predicted probability that the distance lies within

After duplicate residue pairs are removed, the score of each remaining residue pair in each interval is calculated as

where n = 9; i = 1, 2, …, 9 is the index of the interval; d_i is the midpoint of the i-th interval; p_i is the predicted probability for the i-th interval; and α is a normalization constant, α = 1.57. Cubic spline interpolation is then applied to the scores of each residue pair separately, giving a score-d function for each pair over the range

Together these functions form the set of score-d functions.
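The pair-selection step (top L pairs for one distance range, top L/5 pairs for another, duplicates removed) can be sketched as below. The two probability tables, their values and the function name are hypothetical, and the scoring formula and spline interpolation are deliberately left out because the distance ranges and the formula are not reproduced in this text.

```python
def select_pairs(prob_range_a, prob_range_b, seq_len):
    """Pick the top L pairs from one table and the top L/5 from the other, without repeats."""
    top_a = sorted(prob_range_a, key=prob_range_a.get, reverse=True)[:seq_len]
    top_b = sorted(prob_range_b, key=prob_range_b.get, reverse=True)[:seq_len // 5]

    selected, seen = [], set()
    for pair in top_a + top_b:
        if pair not in seen:  # drop pairs that appear in both selections
            seen.add(pair)
            selected.append(pair)
    return selected


# Hypothetical predicted probabilities for a 5-residue sequence (residue indices 0..4).
prob_range_a = {(0, 2): 0.9, (0, 3): 0.8, (0, 4): 0.7, (1, 3): 0.6, (1, 4): 0.5, (2, 4): 0.4}
prob_range_b = {(0, 2): 0.95, (2, 4): 0.2}

selected = select_pairs(prob_range_a, prob_range_b, seq_len=5)
print(selected)
```

Here the pair (0, 2) is chosen by both rankings and is kept only once.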

The initial structure is a fully extended straight chain in which the peptide bond between adjacent residues is parallel to the residue backbone; it serves as the starting point for iterative fragment replacement.

The fragment replacement comprises the following steps:

i. generating a random number R1 to determine which of the three kinds of fragment replacement (secondary structure fragment replacement, fixed-length fragment replacement or short fragment replacement) is performed;

ii. generating a random number R2 to determine the starting position of the fragment replacement;

iii. generating a random number R3 to select a specific fragment of the chosen type;

iv. carrying out a coordinate transformation to complete one round of fragment replacement;

v. judging whether the structure after replacement satisfies the constraint conditions; if so, the structure is retained, otherwise it is discarded.
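Steps i to v can be sketched as a single function. This is only an illustration under simplifying assumptions: the structure is represented as a list of hypothetical (phi, psi) backbone angles instead of atomic coordinates, so the coordinate transformation of step iv reduces to splicing the fragment into the list, and the constraint check is passed in as a callable. None of these names come from the invention.

```python
import random


def replace_fragment(structure, libraries, satisfies_constraints, rng):
    """Run one fragment-replacement round; return the new structure or the old one."""
    kind = rng.choice(sorted(libraries))    # R1: which kind of replacement
    start = rng.randrange(len(structure))   # R2: starting position
    fitting = [f for f in libraries[kind] if start + len(f) <= len(structure)]
    if not fitting:
        return structure                    # no fragment of this kind fits at this position
    fragment = rng.choice(fitting)          # R3: which fragment of that kind
    # Step iv stand-in: splice the fragment's angles into the chain.
    candidate = structure[:start] + list(fragment) + structure[start + len(fragment):]
    # Step v: keep the new structure only if it satisfies the constraints.
    return candidate if satisfies_constraints(candidate) else structure


rng = random.Random(0)
start_structure = [(180.0, 180.0)] * 12     # extended chain as the starting point
libraries = {
    "secondary": [[(-60.0, -45.0)] * 5],
    "fixed": [[(-60.0, -45.0)] * 9],
    "short": [[(-120.0, 130.0)] * 3],
}
moved = replace_fragment(start_structure, libraries, lambda s: True, rng)
print(len(moved))
```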

The candidate structures are obtained by repeatedly iterating from the initial structure with a simulated annealing algorithm; each time a structure with a lower energy value is generated, it replaces a candidate structure with a higher energy value.

The number of iterations is preferably 20 million or more, and the corresponding number of candidate structures is preferably 100 or more.
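A minimal sketch of simulated annealing with a bounded pool of low-energy candidates follows. The Metropolis acceptance rule, the geometric cooling schedule, the temperature values and the toy one-dimensional "structure" are all assumptions chosen for illustration; the text above does not specify them.

```python
import math
import random


def anneal(initial, propose, energy, n_iter, pool_size, t_start=1.0, t_end=0.01, seed=0):
    """Return up to pool_size (energy, structure) pairs, sorted by increasing energy."""
    rng = random.Random(seed)
    current, e_cur = initial, energy(initial)
    pool = [(e_cur, initial)]
    for step in range(n_iter):
        # Geometric cooling from t_start to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(1, n_iter - 1))
        cand = propose(current, rng)
        e_cand = energy(cand)
        # Metropolis rule: always accept downhill moves, sometimes accept uphill ones.
        if e_cand <= e_cur or rng.random() < math.exp((e_cur - e_cand) / temp):
            current, e_cur = cand, e_cand
        # A lower-energy structure replaces the highest-energy candidate in the pool.
        if len(pool) < pool_size:
            pool.append((e_cand, cand))
        else:
            worst = max(range(len(pool)), key=lambda k: pool[k][0])
            if e_cand < pool[worst][0]:
                pool[worst] = (e_cand, cand)
    return sorted(pool, key=lambda item: item[0])


# Toy problem: the "structure" is a number x and the energy is x squared.
pool = anneal(
    initial=3.0,
    propose=lambda x, rng: x + rng.uniform(-0.5, 0.5),
    energy=lambda x: x * x,
    n_iter=2000,
    pool_size=5,
)
print(len(pool))
```

In the method itself the proposal would be a fragment replacement and the energy would be the knowledge-based energy function, with far more iterations than this toy run.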

The screening is as follows: another energy function based on statistical knowledge,

is applied to every residue pair whose predicted probability of having a contact distance within

is greater than 0.3, and the sum of all resulting energy values is taken as the final energy value. This energy value and the knowledge-based energy function are then used together to evaluate the candidate structures: the candidates are sorted in ascending order under each of the two energy functions, the two rank numbers of each candidate are added, and the structure with the smallest rank sum is selected. Side chain optimization is then performed on the selected structure: based on statistics of the side chain rotamers of each amino acid type found in nature, the side chains are replaced so as to eliminate possible positional overlaps between side chain atoms and to bring the side chain conformations closer to the real structure. In the energy function, p is the predicted probability that the contact distance of the residue pair lies within

and d_max is the theoretical maximum of the Cβ-Cβ distance at which a contact relationship exists between the two residue types.
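The rank-sum selection can be shown with a short sketch. The two lists hold hypothetical energy values of four candidates under the two energy functions; the function name and the tie-breaking behaviour (the earliest candidate wins) are assumptions.

```python
def select_by_rank_sum(energies_a, energies_b):
    """Return the index of the candidate with the smallest sum of its two ascending ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda k: values[k])
        rank = [0] * len(values)
        for position, idx in enumerate(order):
            rank[idx] = position  # rank 0 is the lowest energy
        return rank

    rank_sums = [a + b for a, b in zip(ranks(energies_a), ranks(energies_b))]
    return min(range(len(rank_sums)), key=lambda k: rank_sums[k])


# Hypothetical energies of four candidate structures under the two energy functions.
energies_a = [-10.0, -12.0, -8.0, -11.0]
energies_b = [-3.0, -2.5, -1.0, -4.0]
best = select_by_rank_sum(energies_a, energies_b)
print(best)  # candidate 3: ranks 1 and 0, rank sum 1
```

Candidate 1 has the lowest energy under the first function but only the third lowest under the second, so candidate 3 wins on the combined ranking.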

The invention also relates to a system for implementing the method, comprising: a multiple sequence alignment module, a transmembrane region prediction module, a residue contact prediction module and a three-dimensional structure prediction module, wherein: the multiple sequence alignment module receives the input and obtains homologous sequences; the transmembrane region prediction module is connected to the multiple sequence alignment module and the three-dimensional structure prediction module, predicts the transmembrane region using the multiple sequence alignment result, and passes its result to the three-dimensional structure prediction module; the residue contact prediction module is likewise connected to the multiple sequence alignment module and the three-dimensional structure prediction module, predicts residue contacts using the multiple sequence alignment result, and passes its result to the three-dimensional structure prediction module; and the three-dimensional structure prediction module combines the information from the multiple sequence alignment module, the transmembrane region prediction module and the residue contact prediction module to complete the three-dimensional structure prediction.

Technical effects

The invention addresses, as a whole, the problems of insufficient specificity, insufficient accuracy and insufficient speed in the prior art.

Compared with the prior art, the method can also output the transmembrane region prediction and the residue contact prediction generated during the process; its prediction accuracy is 10% to 20% higher than that of current membrane protein structure prediction methods; it needs only tens of minutes to several hours; and it is simple to operate and convenient to use. On some proteins, the RMSD of the predicted structure relative to the real structure can reach

for most proteins the RMSD is below

and the accuracy of the portion inside the membrane is higher.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

As shown in FIG. 1, this example relates to a membrane protein three-dimensional structure prediction method based on knowledge energy function optimization. The input is a membrane protein sequence, for example chain A of PDB entry 2D57, shown in Seq ID No. 1.

In this embodiment, the number of iterations is set to 20 million and the algorithm is executed in three stages:

S1, preprocessing;

S2, iterative optimization;

S3, post-processing.

Further, the preprocessing stage S1 includes the following steps:

S11, obtaining the multiple sequence alignment result, the secondary structure prediction result and the residue contact prediction result;

S12, deriving the constraints on the Cβ-Cβ distance between residues from the multiple sequence alignment result combined with statistical knowledge; constructing the structure fragment query library from the secondary structure prediction result combined with the known structures in the protein structure database PDB; and obtaining the knowledge-based energy function from the residue contact prediction result.

Further, the iterative optimization S2 stage includes the following steps:

S21, randomly selecting fragments from the structure fragment query library and using them to replace parts of the initial structure, under the constraints of the energy function and the residue distances;

S22, repeating the process of S21 20 million times, and taking the final 100 structures as candidate structures.

Further, the post-processing stage S3 includes the following steps:

S31, evaluating the 100 structures with another knowledge-based energy function in combination with the first, and selecting the best structure;

S32, performing side chain optimization on the selected structure, and outputting the final result.

The final output of the algorithm is a file in PDB format.

The evaluation index TM-score used in this example is

$$\mathrm{TM\text{-}score} = \max\left[\frac{1}{L_N}\sum_{i=1}^{L_T}\frac{1}{1+\left(d_i/d_0\right)^2}\right]$$

where L_N is the length of the template structure (generally the real protein structure), L_T is the number of residues aligned with the template structure, d_i is the distance between the i-th pair of aligned residues, and d_0 is a normalization scale with a fixed value.

TM-score takes values between 0 and 1; the larger the value, the more similar the two structures. The TM-score calculated between the predicted structure and the real structure can therefore be used to evaluate a prediction: a larger TM-score means the predicted structure is closer to the real structure, and a smaller one means a larger difference. GDT-TS is interpreted in the same way and takes values between 0 and 100. RMSD is the root mean square deviation between the predicted atomic coordinates and the true atomic coordinates.
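The two distance-based indices can be computed with a few lines of code, sketched here under stated assumptions: the structures are taken to be already superimposed and aligned (a full TM-score calculation also searches for the superposition that maximizes the score), and the scale d0 is passed in as a parameter because its value is not given in the text. The function names and sample coordinates are hypothetical.

```python
import math


def tm_score(distances, template_length, d0):
    """TM-score for one fixed superposition: distances are the d_i of the aligned residues."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / template_length


def rmsd(coords_pred, coords_true):
    """Root mean square deviation between two equally long lists of (x, y, z) coordinates."""
    squared = sum(
        (p - t) ** 2
        for atom_pred, atom_true in zip(coords_pred, coords_true)
        for p, t in zip(atom_pred, atom_true)
    )
    return math.sqrt(squared / len(coords_pred))


# Two of four template residues are aligned, each at a distance equal to d0.
print(tm_score([5.0, 5.0], template_length=4, d0=5.0))  # 0.25

# One atom matches exactly, the other is displaced by 2 along z.
print(rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]))
```

A perfect match of every template residue gives a TM-score of 1, and unaligned residues contribute nothing to the sum, which is why the score is normalized by the template length rather than by the number of aligned residues.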

In this example, experiments were carried out on several membrane proteins, giving the results shown in Table 2. Compared with the existing membrane protein structure prediction method FILM3, the present method improves every index to varying degrees, and on some proteins the improvement exceeds 20%.

TABLE 2 prediction and comparison with FILM3

Compared with the prior art, the method greatly improves the prediction accuracy of membrane protein three-dimensional structures; in particular, the error of the transmembrane helical region relative to the real structure is very low. The prediction time is short: the three-dimensional structure of a membrane protein several hundred residues long can be predicted within a few hours. The transmembrane region prediction and the residue contact prediction generated during the process can also be output.

The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Sequence listing

<110> Shanghai Jiao Tong University

<120> knowledge energy function optimization-based membrane protein three-dimensional structure prediction method

<130> fnc482e

<141> 2021-06-08

<160> 1

<170> SIPOSequenceListing 1.0

<210> 1

<211> 224

<212> PRT

<213> Artificial Sequence (Artificial Sequence)

<400> 1

Thr Gln Ala Phe Trp Lys Ala Val Thr Ala Glu Phe Leu Ala Met Leu

1 5 10 15

Ile Phe Val Leu Leu Ser Val Gly Ser Thr Ile Asn Trp Gly Gly Ser

20 25 30

Glu Asn Pro Leu Pro Val Asp Met Val Leu Ile Ser Leu Cys Phe Gly

35 40 45

Leu Ser Ile Ala Thr Met Val Gln Cys Phe Gly His Ile Ser Gly Gly

50 55 60

His Ile Asn Pro Ala Val Thr Val Ala Met Val Cys Thr Arg Lys Ile

65 70 75 80

Ser Ile Ala Lys Ser Val Phe Tyr Ile Thr Ala Gln Cys Leu Gly Ala

85 90 95

Ile Ile Gly Ala Gly Ile Leu Tyr Leu Val Thr Pro Pro Ser Val Val

100 105 110

Gly Gly Leu Gly Val Thr Thr Val His Gly Asn Leu Thr Ala Gly His

115 120 125

Gly Leu Leu Val Glu Leu Ile Ile Thr Phe Gln Leu Val Phe Thr Ile

130 135 140

Phe Ala Ser Cys Asp Ser Lys Arg Thr Asp Val Thr Gly Ser Val Ala

145 150 155 160

Leu Ala Ile Gly Phe Ser Val Ala Ile Gly His Leu Phe Ala Ile Asn

165 170 175

Tyr Thr Gly Ala Ser Met Asn Pro Ala Arg Ser Phe Gly Pro Ala Val

180 185 190

Ile Met Gly Asn Trp Glu Asn His Trp Ile Tyr Trp Val Gly Pro Ile

195 200 205

Ile Gly Ala Val Leu Ala Gly Ala Leu Tyr Glu Tyr Val Phe Cys Pro

210 215 220

Claims

1. A membrane protein three-dimensional structure prediction method based on knowledge energy function optimization, characterized in that constraints on residue distances are obtained from the multiple sequence alignment result of an input sequence combined with statistical knowledge, a structure fragment query library is constructed from the secondary structure prediction result of the input sequence and the known structures in the protein structure database PDB, and a knowledge-based energy function is calculated from the residue contact prediction result of the input sequence; fragment replacement is then carried out iteratively on an initial structure under the energy function and the residue distance constraints to obtain a plurality of candidate structures; and finally the candidate structures are screened to obtain the final predicted three-dimensional structure of the membrane protein.

2. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the constraint on the residue distance is as follows: according to statistics over the protein structures in the PDB, the Cβ-Cβ distance between two residues of given types that are in contact is limited to a range bounded by a minimum value and a maximum value.

3. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the secondary structure prediction result is obtained by a membrane protein transmembrane helix prediction model based on multi-scale deep learning, and the specific steps are: a protein sequence is input, a large number of similar sequences are obtained through multiple sequence alignment, and, in combination with co-evolution information, the transmembrane regions and the transmembrane orientation are predicted using a deep learning model and a support vector machine classifier;

the membrane protein transmembrane helix prediction model comprises a transmembrane region prediction module and an orientation prediction module, wherein: the transmembrane region prediction module comprises a multi-scale deep learning model and a binarization module; the deep learning model consists of a small-scale, residue-based residual neural network and a large-scale, full-sequence-based residual neural network; the binarization module binarizes the raw prediction scores according to a dynamic threshold and resolves the problem of under-segmentation; and the orientation prediction module uses a support vector machine classifier.

4. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the structure fragment query library comprises: a secondary structure fragment query library built from fragments with a specific secondary structure (alpha-helix or beta-sheet), with a minimum fragment length of 5 residues; a fixed-length fragment query library built from protein fragments of fixed length, with a minimum length of 9 residues and a maximum length of 16 residues; and a short fragment query library built from short fragments 3 residues long.

5. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the residue contact prediction result is obtained by a deep learning-based protein residue contact prediction model, specifically: the Cβ-Cβ distance between residues is divided into 10 intervals, which are respectively:

and, for each pair of residues, the model predicts the probability that the pair falls in each of these distance intervals;

the protein residue contact prediction model comprises 29 improved ResNet residual modules arranged in five groups of 3, 4, 6, 8 and 8 modules respectively, wherein a dilated convolution mechanism is introduced into the first three groups of modules and a channel-based attention mechanism is introduced into the last two groups of modules.

6. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the knowledge-based energy function is as follows: a score is calculated for each residue pair from that pair's score-d function, and the sum of all scores is taken as the energy value of the whole structure, wherein the set of score-d functions is obtained as follows: for a protein sequence of L residues, select the top L residue pairs ranked by the predicted probability that the Cβ-Cβ distance lies within

and the top L/5 residue pairs ranked by the predicted probability that the distance lies within

After duplicate residue pairs are removed, the score of each remaining residue pair in each interval is calculated, and cubic spline interpolation is applied to the scores of each residue pair separately, giving a set of score-d functions over the range

7. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the initial structure is a fully extended straight chain in which the peptide bond between adjacent residues is parallel to the residue backbone, and serves as the starting point for iterative fragment replacement;

the fragment replacement comprises the following steps:

i. generating a random number R1 to determine which of secondary structure fragment replacement, fixed-length fragment replacement or short fragment replacement is performed;

ii. generating a random number R2 to determine the starting position of the fragment replacement;

iii. generating a random number R3 to select a specific fragment of the chosen type;

iv. carrying out a coordinate transformation to complete one round of fragment replacement;

v. judging whether the structure after replacement satisfies the constraint conditions; if so, the structure is retained, otherwise it is discarded.

8. The method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to claim 1, wherein the screening is as follows: another energy function based on statistical knowledge,

is applied to every residue pair whose predicted probability of having a contact distance within

is greater than 0.3, and the sum of all resulting energy values is taken as the final energy value; this energy value and the knowledge-based energy function are then used together to evaluate the candidate structures: the candidates are sorted in ascending order under each of the two energy functions, the two rank numbers of each candidate are added, and the structure with the smallest rank sum is selected; side chain optimization is then performed on the selected structure: based on statistics of the side chain rotamers of each amino acid type found in nature, the side chains are replaced so as to eliminate possible positional overlaps between side chain atoms and to bring the side chain conformations closer to the real structure, wherein: p is the predicted probability that the contact distance of the residue pair lies within the above range, and d_max is the theoretical maximum of the Cβ-Cβ distance at which a contact relationship exists between the two residue types.

9. A system for implementing the method for predicting the three-dimensional structure of the membrane protein based on knowledge energy function optimization according to any one of claims 1 to 8, comprising: a multiple sequence alignment module, a transmembrane region prediction module, a residue contact prediction module and a three-dimensional structure prediction module, wherein: the multiple sequence alignment module receives the input and obtains homologous sequences; the transmembrane region prediction module is connected to the multiple sequence alignment module and the three-dimensional structure prediction module, predicts the transmembrane region using the multiple sequence alignment result, and passes its result to the three-dimensional structure prediction module; the residue contact prediction module is likewise connected to the multiple sequence alignment module and the three-dimensional structure prediction module, predicts residue contacts using the multiple sequence alignment result, and passes its result to the three-dimensional structure prediction module; and the three-dimensional structure prediction module combines the information from the multiple sequence alignment module, the transmembrane region prediction module and the residue contact prediction module to complete the three-dimensional structure prediction.