CN115273968B - Quality evaluation method and device for protein prediction three-dimensional structure

Info

Publication number: CN115273968B
Application number: CN202210754951.2A
Authority: CN
Inventors: 管佳威; 张闻瀚; 金慧玲; 王浩博
Original assignee: Hangzhou Liwen Institute Biotechnology Co ltd
Current assignee: Hangzhou Liwen Institute Biotechnology Co ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2023-05-12
Anticipated expiration: 2042-06-30
Also published as: CN115273968A

Abstract

The invention discloses a quality evaluation method and device for predicted three-dimensional protein structures. The key finding is that the esmif cross entropy loss, which describes the degree of sequence recovery, is linearly correlated with TMscore, a structure quality evaluation function that describes agreement with the real structure. The quality of a predicted structure is therefore judged by calculating the cross entropy between the probability distribution of the predicted sequence and the reference sequence. The method comprises the following steps: the reference sequence is input into various protein structure prediction models to obtain thousands or tens of thousands of predicted three-dimensional structures. A predicted three-dimensional structure can also be obtained by manually folding the amino acid chain, or by manually fine-tuning a structure output by a prediction model. Sequences are then inferred back from the plurality of predicted three-dimensional structures, and the accuracy of each predicted structure is judged from the difference between the inferred sequence and the reference sequence, so as to obtain the three-dimensional structure closest to the real protein.

Description

Quality evaluation method and device for protein prediction three-dimensional structure

Technical Field

The invention relates to the field of protein three-dimensional structure prediction, in particular to a quality evaluation method and device for protein three-dimensional structure prediction.

Background

Proteins are very important biomolecules in nature. Predicting the three-dimensional structure of a protein directly from its amino acid sequence is a challenging problem with significant impact on modern biology and medicine. Whether the three-dimensional structure of a protein can be predicted accurately plays a key role in understanding protein function, designing proteins with new biological functions, developing new medicines, and the like. With the completion of the Human Genome Project, a large number of protein amino acid sequences have become known through genome sequencing, and the number of new amino acid sequences obtained by sequencing is still increasing explosively, while the growth in the number of experimentally determined three-dimensional structures lags far behind. The main experimental methods at present are X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). These methods often require a significant amount of time and expensive resources.

One major challenge in structure prediction is selecting the best three-dimensional structure from a generated pool of candidate structures. Protein structure prediction models such as Rosetta, RoseTTAFold, and AlphaFold2 can predict a large number of three-dimensional structures from one amino acid sequence, but it is difficult to tell which structure is closest to the native structure. It is therefore desirable to find a method that obtains a highly accurate predicted three-dimensional structure of a protein from the amino acid sequence alone.

Disclosure of Invention

Aiming at the defects of the prior art, one of the purposes of the invention is to provide a quality evaluation method for predicting a three-dimensional structure of a protein, which can obtain the three-dimensional structure of the protein with high accuracy without MSA.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A quality evaluation method for predicted three-dimensional protein structures comprises the following steps:

S1, predicting a plurality of predicted structures from a reference sequence, wherein the reference sequence reflects the true distribution of the known protein amino acid sequence, and the predicted structures reflect the three-dimensional structure of the predicted protein. The predicted structures may include three-dimensional structures that differ considerably from the real structure of the protein; the quality requirement on the initially input predicted structures is low;

S2, sequentially inputting the plurality of predicted structures into an ESM-IF1 model to obtain predicted sequences in one-to-one correspondence with the predicted structures, wherein each predicted sequence reflects the probability distribution of amino acids at each site of the predicted protein amino acid sequence;

S3, sequentially calculating the multi-class cross entropy (CCE) of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss, and selecting the predicted structure corresponding to the minimum esmif cross entropy loss as the optimal three-dimensional structure.
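Steps S1 to S3 amount to scoring every candidate structure and keeping the one with the smallest loss. The following is a minimal sketch of that selection logic only, not the patented implementation: `loss_fn` is a hypothetical callable that stands in for steps S2 and S3 (running ESM-IF1 on one structure and computing its cross entropy against the reference sequence), and the toy losses in the usage lines are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # whatever object represents one predicted structure


def select_best_structure(
    reference: str,
    structures: Sequence[S],
    loss_fn: Callable[[S, str], float],
) -> Tuple[S, List[float]]:
    """Return the structure with the minimum loss, plus all losses.

    loss_fn(structure, reference) is a hypothetical stand-in for
    "inverse-fold the structure, then take the cross entropy against
    the reference sequence".
    """
    if not structures:
        raise ValueError("at least one predicted structure is required")
    losses = [loss_fn(structure, reference) for structure in structures]
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return structures[best_index], losses


# Toy usage: structures are just labels and the losses are invented numbers.
toy_losses = {"decoy_a": 2.4, "decoy_b": 1.1, "decoy_c": 1.9}
best, all_losses = select_best_structure(
    "MKLVS", list(toy_losses), lambda s, ref: toy_losses[s]
)
```

Passing the scorer in as a callable keeps the selection step independent of how the predicted structures were produced, which matches the statement that they may come from any prediction model or from manual folding.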

Preferably, the reference sequence and the predicted sequence are each presented as a matrix, the first dimension of the matrix representing sequence position information and the second dimension representing amino acid identity information.

The multi-class cross entropy of the predicted sequence and the reference sequence is calculated as follows:

$$\mathrm{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} p_{ij}\,\log q_{ij}$$

wherein CCE is the multi-class cross entropy, N is the length of the protein amino acid sequence, p is the probability distribution of each amino acid in the reference sequence expressed with one-hot encoding, q is the probability distribution of the amino acid at each position in the predicted sequence, i is the first-dimension position information, and j is the second-dimension amino acid identity information. One-hot encoding is a binary encoding in which, of the bits used to encode a value, exactly one bit is 1 and all remaining bits are 0.
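The definitions above can be illustrated with a small calculation. This is a sketch under stated assumptions: it takes the multi-class cross entropy to be the standard averaged form, CCE = -(1/N) Σ_i Σ_j p_ij log q_ij with the natural logarithm, and the function name, the `eps` floor, and the 2 x 3 toy matrices are hypothetical choices made for illustration.

```python
import math


def categorical_cross_entropy(p, q, eps=1e-12):
    """Averaged multi-class cross entropy between two N x K matrices.

    p: reference distribution (one-hot rows), q: predicted distribution.
    eps is an assumed floor that avoids log(0) for zero probabilities.
    """
    if len(p) != len(q) or any(len(pr) != len(qr) for pr, qr in zip(p, q)):
        raise ValueError("p and q must have the same shape")
    if not p:
        raise ValueError("matrices must contain at least one position")
    total = 0.0
    for p_row, q_row in zip(p, q):
        for p_ij, q_ij in zip(p_row, q_row):
            if p_ij:  # zero entries of a one-hot row contribute nothing
                total -= p_ij * math.log(max(q_ij, eps))
    return total / len(p)


# Toy example: N = 2 positions, K = 3 residue classes.
p = [[1, 0, 0], [0, 1, 0]]
q = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(p, q)
```

With one-hot rows in `p`, only the probability that `q` assigns to the reference residue at each position affects the result, so a prediction identical to the reference gives a loss of zero.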

Preferably, the predicted structure is obtained by one of the following: inputting the reference sequence into a protein structure prediction model, manually folding the amino acid chain, or manually adjusting a predicted structure output by the protein structure prediction model.

In view of the shortcomings of the prior art, a second object of the present invention is to provide a quality evaluation device for predicting three-dimensional structure of protein, comprising:

a predicted structure acquisition module, used for outputting a plurality of predicted structures from a reference sequence, wherein the reference sequence reflects the true distribution of the known protein amino acid sequence, and the predicted structures reflect the three-dimensional structure of the predicted protein;

a predicted sequence acquisition module, used for sequentially inputting the plurality of predicted structures into an ESM-IF1 model to obtain predicted sequences in one-to-one correspondence with the predicted structures, wherein each predicted sequence reflects the probability distribution of amino acids at each site of the predicted protein amino acid sequence;

and a structure screening module, used for sequentially calculating the multi-class cross entropy of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss, and selecting the predicted structure corresponding to the minimum esmif cross entropy loss as the optimal three-dimensional structure.

In view of the shortcomings of the prior art, a third object of the present invention is to provide an electronic device, including:

a processor; and

a memory storing executable code that, when executed by the processor, causes the processor to perform the quality evaluation method for predicted three-dimensional protein structures described above.

Compared with the prior art, the invention has the following advantages: the esmif cross entropy loss, which describes the degree of sequence recovery, is found to be linearly correlated with TMscore, the structure quality evaluation function that describes agreement with the real structure, so the quality of a predicted structure is judged by calculating the cross entropy between the probability distribution of the predicted sequence and the reference sequence. The method can obtain a highly accurate three-dimensional protein structure without homologous multiple sequence alignment (MSA) data.

Drawings

FIG. 1 is a matrix diagram of a reference sequence;

FIG. 2 is a matrix diagram of a predicted sequence;

FIG. 3 is a diagram of a predicted structure;

FIG. 4 shows nine scatter plots of the structure quality evaluation function TMscore against the esmif cross entropy loss.

Detailed Description

The invention will now be described in further detail with reference to the drawings and examples.

Example 1

AlphaFold has made remarkable progress in protein structure prediction using deep learning and co-evolution information (MSA) from related protein sequences. Given a single input amino acid sequence, it searches for hundreds of homologous sequences to form an MSA, learns reliable co-evolution information from the MSA in accordance with biological and physical laws, and finally folds the amino acid chain into a low-potential-energy state to obtain the predicted three-dimensional protein structure. However, obtaining an MSA is difficult, and excessive reliance on the MSA leaves the physical properties of protein folding unexplored, so the effect of new mutations on protein structure and stability may not be predicted accurately. From Anfinsen's work, it is known that a protein structure minimizes its potential energy by folding. Thus, if the potential energy function can be modeled with high accuracy, the protein structure can be predicted by optimizing that function. However, this approach faces a difficulty: how to construct the potential energy function accurately. We call this the scoring problem, also known as structure quality assessment (Quality Assessment, QA).

Besides Google's AlphaFold, Meta (United States) has opened up a new approach with big-data-driven pretrained models, trying to solve protein structure prediction and design problems from another perspective. Among these, ESM-IF1 is a large-scale pretrained model from Meta [Hsu C, Verkuil R, Liu J, et al. Learning inverse folding from millions of predicted structures[J]. bioRxiv, 2022.]. It attempts to predict the amino acid sequence of a protein from the backbone structure of the protein given as input. The method takes the structures of 12 million protein sequences predicted by AlphaFold2 as a training set, and uses GVP, a model with geometry-invariant input processing, to recover the sequences. Its recovery rate for the original sequence reaches 51% and its recovery rate for buried residues reaches 72%, exceeding the best existing algorithm by 10%. The method was developed for protein design (the protein inverse folding problem) and is a relatively advanced protein pretrained model.

In the invention, a reference sequence (the original sequence, which may be a wild-type protein sequence) can be input into various protein structure prediction models to obtain thousands or tens of thousands of predicted three-dimensional structures. A predicted three-dimensional structure can also be obtained by manually folding the amino acid chain, or by manually fine-tuning a structure output by a prediction model. Sequences are then inferred back from the plurality of predicted three-dimensional structures, and the accuracy of each predicted structure is judged from the difference between the inferred sequence and the reference sequence, so as to obtain the three-dimensional structure closest to the real protein. The specific steps are as follows:

S1, predicting a plurality of predicted structures from a reference sequence, wherein the reference sequence reflects the true distribution of the known protein amino acid sequence. As shown in FIG. 1, the first dimension of the matrix represents the sequence site and the second dimension represents the amino acid type. FIG. 1 represents the probability distribution of the amino acid sequence with one-hot encoding (every entry is either 0 or 1). The brightness of each small square in the figure corresponds to the probability value (the brighter the square, the larger the value), and the brightness of the small squares in FIG. 1 represents a probability value of 1. Specifically, the square in the fifth column of FIG. 1 indicates that the probability of serine (S) at the 5th site of the protein is 1 (since the sequence is known, the type of amino acid at the first site is necessarily determined), namely $p_{ij} = p_{17} = 1$, where i represents the position of the corresponding amino acid in the sequence and j represents the amino acid type. In addition to the 20 amino acids, the five rare amino acid codes X, B, U, Z and O are included, "." represents a period and "-" represents a deletion, so j takes 27 different values (e.g. 3-29), each number representing a different amino acid type. Specifically, the amino acids represented by 3 to 29 are, in order, 'L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', 'X', 'B', 'U', 'Z', 'O', '.', and '-'. As shown in FIG. 3, the predicted structure reflects the three-dimensional structure of the predicted protein. The predicted structures may include three-dimensional structures that differ considerably from the real structure of the protein; the quality requirement on the initially input predicted structures is low;
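The reference matrix of FIG. 1 can be sketched as a one-hot encoder. This is an illustrative sketch, not the tokenizer shipped with ESM-IF1: the residue order and the starting index of 3 are taken from the list in the step above, while the names `RESIDUES`, `OFFSET`, `TOKEN_INDEX`, and `one_hot`, the choice of a 27-column matrix, and the example sequence are assumptions.

```python
# Residue order and the offset of 3 follow the list given in step S1;
# the real ESM-IF1 tokenizer may order or number its tokens differently.
RESIDUES = "LAGVSERTIDPKQNFYMHWC" + "XBUZO" + ".-"
OFFSET = 3
TOKEN_INDEX = {res: OFFSET + k for k, res in enumerate(RESIDUES)}


def one_hot(sequence):
    """Encode a sequence as an N x 27 matrix with exactly one 1 per row.

    Row i is the sequence position; column (token index - OFFSET) is the
    amino acid type.
    """
    matrix = []
    for res in sequence.upper():
        if res not in TOKEN_INDEX:
            raise ValueError(f"unknown residue symbol: {res!r}")
        row = [0] * len(RESIDUES)
        row[TOKEN_INDEX[res] - OFFSET] = 1
        matrix.append(row)
    return matrix


# Hypothetical five-residue sequence with serine at the fifth site.
p = one_hot("MKLVS")
```

Each row of `p` sums to 1, which is what makes the reference sequence usable as the distribution p in the cross entropy.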

S2, sequentially inputting the plurality of predicted structures into an ESM-IF1 model to obtain predicted sequences in one-to-one correspondence with the predicted structures, wherein each predicted sequence reflects the probability distribution of amino acids at each site of the predicted protein amino acid sequence. As shown in FIG. 2, the first dimension of the matrix represents the position information of the sequence site and the second dimension represents the amino acid identity information. The brightness of each small square in FIG. 2 corresponds to the probability value, and a brighter square corresponds to a larger value (if one column has several probability values, i.e. several small squares, the amino acid at the position with the largest value is selected as the prediction result). Specifically, the square in the fifth column of FIG. 2 shows that the probability of serine (S) at the 5th site of the protein is 0.8, i.e. $q_{ij} = q_{17} = 0.8$, and the amino acid at position 5 is accordingly predicted to be serine. Likewise, other forms of representation can be used for the probability values, such as three-dimensional columns of different heights, without limitation;
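The rule of taking the largest value in each column of FIG. 2 is an argmax over the predicted distribution. The sketch below shows that decoding step and a simple comparison with the reference sequence. The three-letter alphabet, the toy probabilities, and the function names are hypothetical, and a real ESM-IF1 output would cover the full token set.

```python
def decode_argmax(q, alphabet):
    """Pick, for every position, the residue with the highest probability."""
    decoded = []
    for row in q:
        if len(row) != len(alphabet):
            raise ValueError("row width must match the alphabet size")
        best = max(range(len(row)), key=row.__getitem__)
        decoded.append(alphabet[best])
    return "".join(decoded)


def sequence_recovery(predicted, reference):
    """Fraction of positions where the decoded residue equals the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(predicted, reference))
    return matches / len(reference)


# Toy example: two positions over a made-up alphabet of three residues.
alphabet = "AGS"
q = [[0.1, 0.1, 0.8],   # serine is the most probable residue
     [0.6, 0.3, 0.1]]   # alanine is the most probable residue
decoded = decode_argmax(q, alphabet)
recovery = sequence_recovery(decoded, "SG")
```

The argmax discards how confident the model was, which is one reason the method scores structures with the full probability distribution rather than with the decoded sequence alone.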

S3, sequentially calculating the multi-class cross entropy (CCE) of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss, and selecting the predicted structure corresponding to the minimum esmif cross entropy loss as the optimal three-dimensional structure.

The multi-class cross entropy is calculated as follows:

$$\mathrm{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} p_{ij}\,\log q_{ij}$$

Multi-class cross entropy comes from Shannon's information theory. Here it is mainly used to measure the discrepancy between the predicted result and the true sequence: the lower the CCE, the higher the degree of sequence recovery, which also indicates a higher quality of the corresponding input structure.
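Because the reference distribution is one-hot, the double sum collapses to a single term per site. The following worked step is a sketch that assumes the averaged form of the cross entropy and the natural logarithm; the symbol $a_i$ (the reference residue at site $i$) is introduced here only for illustration.

```latex
\mathrm{CCE}
  = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} p_{ij}\,\log q_{ij}
  = -\frac{1}{N}\sum_{i=1}^{N} \log q_{i,a_i},
\qquad
p_{ij} =
\begin{cases}
  1 & \text{if } j = a_i \\
  0 & \text{otherwise}
\end{cases}
```

For example, a site where the model assigns probability 0.8 to the reference residue contributes $-\ln 0.8 \approx 0.223$ to the sum, while a site where it assigns probability 1 contributes 0. The loss therefore grows as the model becomes less able to recover the reference residues from the input structure.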

The operation basis of the quality assessment method of the protein prediction three-dimensional structure is as follows:

S1.1, obtaining a plurality of reference sequences and the experimentally determined reference structures corresponding to them, wherein the reference structures reflect the real three-dimensional structures of the proteins;

S1.2, obtaining a plurality of predicted structures from each reference sequence;

S1.3, sequentially calculating the difference between each predicted structure and the reference structure to obtain the corresponding index TMscore, which describes the quality of the structure;

S1.4, sequentially inputting the plurality of predicted structures into ESM-IF1 to obtain the probability distributions of the predicted sequences in one-to-one correspondence with the predicted structures;

S1.5, sequentially calculating the multi-class cross entropy between the probability distribution of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss;

S1.6, plotting a scatter diagram with the esmif cross entropy loss as the abscissa and TMscore as the ordinate. As shown in FIG. 4, each scatter plot of TMscore against esmif cross entropy loss is the data for a single protein, where each dot represents one decoy. The abscissa is the distance from the true structure: the closer to 1 (right), the closer to the true structure. The ordinate is the esmif cross entropy loss of the recovered sequence: the lower the value, the better the sequence recovery. The 9 proteins in FIG. 4 are 1fzy, 1l3k, 1opd, 1t3y, 1z2u, 1zma, 2cxd, 2dfb and 2z0t, and the data source is the Rosetta decoy set. We found that the esmif cross entropy loss and TMscore are linearly related, so the index TMscore describing the quality of a structure can be estimated simply by calculating the multi-class cross entropy (CCE) of the predicted sequence and the reference sequence. Experiments show that when the low-quality structure (decoy) dataset generated by Rosetta is fed into the model constructed by this method, TMscore, the index measuring how close a decoy is to the real structure, exhibits a strong negative correlation with the multi-class cross entropy. That is, when a low-quality structure is input, it is difficult for the model to produce a probability distribution close to the original sequence. The model can therefore also be used for high-quality structure quality assessment.
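The relationship examined in steps S1.1 to S1.6 can be quantified with a correlation coefficient. The sketch below computes the Pearson correlation between TMscore values and losses for one protein. The numbers are synthetic values invented for illustration and are not data from the Rosetta decoy set or from FIG. 4, and the choice of the Pearson coefficient is an assumption, since the text does not name a specific statistic.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Synthetic decoys for a single hypothetical protein (made-up numbers).
tm_scores = [0.2, 0.4, 0.6, 0.8]
ce_losses = [2.6, 2.2, 1.8, 1.4]
r = pearson_r(tm_scores, ce_losses)
```

A value of `r` close to -1 corresponds to the strong negative correlation described above, in which decoys nearer the true structure receive a lower cross entropy loss.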

Example 2

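A quality assessment device for predicted three-dimensional protein structures, comprising:

a predicted structure acquisition module, used for outputting a plurality of predicted structures from a reference sequence, wherein the reference sequence reflects the true distribution of the known protein amino acid sequence, and the predicted structures reflect the three-dimensional structure of the predicted protein;

a predicted sequence acquisition module, used for sequentially inputting the plurality of predicted structures into an ESM-IF1 model to obtain predicted sequences in one-to-one correspondence with the predicted structures, wherein each predicted sequence reflects the probability distribution of amino acids at each site of the predicted protein amino acid sequence;

and a structure screening module, used for sequentially calculating the multi-class cross entropy of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss, and selecting the predicted structure corresponding to the minimum esmif cross entropy loss as the optimal three-dimensional structure.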

Example 3

An electronic device, comprising:

a processor; and

a memory storing executable code that, when executed by the processor, causes the processor to perform the quality assessment method for predicted three-dimensional protein structures described in Example 1.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention can be made by one of ordinary skill in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. A quality evaluation method for predicted three-dimensional protein structures, characterized by comprising the following steps:

S1, predicting a plurality of predicted structures from a reference sequence, wherein the reference sequence reflects the true distribution of the known protein amino acid sequence, and the predicted structures reflect the three-dimensional structure of the predicted protein;

S2, sequentially inputting the plurality of predicted structures into an ESM-IF1 model to obtain predicted sequences in one-to-one correspondence with the predicted structures, wherein each predicted sequence reflects the probability distribution of amino acids at each site of the predicted protein amino acid sequence;

S3, sequentially calculating the multi-class cross entropy of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss, and selecting the predicted structure corresponding to the minimum esmif cross entropy loss as the optimal three-dimensional structure;

the reference sequence and the predicted sequence are each presented as a matrix, a first dimension of the matrix representing the sequence position and a second dimension of the matrix representing the type of amino acid;

the multi-class cross entropy of the predicted sequence and the reference sequence is calculated as follows:

$$\mathrm{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} p_{ij}\,\log q_{ij}$$

wherein CCE is the multi-class cross entropy, N is the length of the protein amino acid sequence, p is the probability distribution of each amino acid in the reference sequence expressed with one-hot encoding, q is the probability distribution of the amino acid at each position in the predicted sequence, i is the first-dimension position information, and j is the second-dimension amino acid identity information.

2. The quality evaluation method for predicted three-dimensional protein structures according to claim 1, wherein the predicted structure is obtained by one of the following: inputting the reference sequence into a protein structure prediction model, manually folding the amino acid chain, or manually adjusting a predicted structure output by the protein structure prediction model.

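3. A quality assessment device for predicted three-dimensional protein structures, comprising:

a predicted structure acquisition module, used for outputting a plurality of predicted structures from a reference sequence, wherein the reference sequence reflects the true distribution of the known protein amino acid sequence, and the predicted structures reflect the three-dimensional structure of the predicted protein;

a predicted sequence acquisition module, used for sequentially inputting the plurality of predicted structures into an ESM-IF1 model to obtain predicted sequences in one-to-one correspondence with the predicted structures, wherein each predicted sequence reflects the probability distribution of amino acids at each site of the predicted protein amino acid sequence;

a structure screening module, used for sequentially calculating the multi-class cross entropy of each predicted sequence and the reference sequence to obtain the esmif cross entropy loss, and selecting the predicted structure corresponding to the minimum esmif cross entropy loss as the optimal three-dimensional structure;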

4. An electronic device, comprising:

a processor; and

a memory storing executable code that, when executed by the processor, causes the processor to perform the quality evaluation method for predicted three-dimensional protein structures according to any one of claims 1-2.