CN116994642A - Discrete graph probability denoising diffusion model for protein sequence generation


Info

Publication number
CN116994642A
Authority
CN
China
Prior art keywords
amino acid
diffusion
model
protein
denoising
Prior art date
Legal status
Pending
Application number
CN202310995978.5A
Other languages
Chinese (zh)
Inventor
周冰心 (Bingxin Zhou)
郑力荣 (Lirong Zheng)
吴邦昊 (Banghao Wu)
洪亮 (Liang Hong)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202310995978.5A
Publication of CN116994642A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B 5/20: Probabilistic models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/042: Knowledge-based neural networks; Logical representations of neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The invention discloses a discrete graph probabilistic denoising diffusion model for protein sequence generation, in which a given protein backbone guides the diffusion process of the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the biophysical properties of each node and its local environment. In addition, the invention uses an amino acid substitution matrix in the forward diffusion process to encode prior knowledge of the biological meaning of each amino acid, including its spatial and sequential neighbors and its own characteristics, thereby reducing the sampling space during generation. The model achieves leading performance in sequence recovery and shows great potential for generating diverse protein sequences for a given protein backbone structure.

Description

Discrete graph probability denoising diffusion model for protein sequence generation
Technical Field
The invention belongs to the technical field of protein sequence generation, and relates to a discrete graph probability denoising diffusion model for protein sequence generation.
Background
Protein sequence generation, or de novo protein design, aims to predict viable amino acid sequences that can fold into a specific 3D protein structure or possess a specific function. Developing methods for protein sequence generation facilitates the design of novel proteins with desirable structural and functional properties. Such proteins can be applied in various fields, such as targeted drug delivery and enzyme design, with broad prospects for both academic research and industrial applications.
Protein sequence generation is a challenging task because both structure-to-sequence and function-to-sequence are one-to-many mappings: many amino acid sequences may fold into one and the same protein backbone and perform the same function. In addition, protein sequence generation remains very challenging in itself due to the large sequence space, the complexity of protein folding, and the intricate mechanisms underlying protein function. Besides determining the folding state of proteins by physics-based energy reasoning, recent advances in deep learning have made significant progress in directly learning the mapping from protein structure to amino acid sequence. For example, discriminative models cast the problem as predicting the most likely sequence for a given structure using a Transformer-based model. However, such models typically give only a single fixed answer and are therefore unsatisfactory at accurately capturing the one-to-many mapping from protein structures to non-unique amino acid sequences. The diffusion probabilistic model, as an emerging class of generative methods, offers the potential to generate a diverse set of sequence candidates for a given protein backbone.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a discrete graph probabilistic denoising diffusion model for protein sequence generation.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a discrete map probabilistic denoising diffusion model for protein sequence generation, a given protein backbone guiding a corresponding amino acid residue type, comprising in particular:
(1) Discrete probability back diffusion process based on deep neural network training;
(2) Generating conditional back diffusion distribution based on protein priori knowledge guidance;
(3) An isomorphic map neural network (EGNN) for use in a back-diffusion parametric denoising process;
(4) The denoising process is sampled and accelerated based on a Denoising Diffusion Implicit Model (DDIM).
As one of the preferred technical solutions, the model conditions on a given graph $\mathcal{G} = \{X, E\}$, where the node features are X and the edge features are E; the node features $X = [X_{pos}, X_{aa}, X_{prop}]$ comprise the amino acid positions $X_{pos}$, the amino acid types $X_{aa}$, and spatial and biochemical properties $X_{prop}$.
As one of the preferred technical solutions, step (1) specifically comprises a diffusion process, training a denoising network, and a parameterized generation process.
As a further preferred technical solution, the diffusion process is as follows: noise is added independently to each amino acid node of a protein; for any given node, the transition probabilities are defined by a matrix $Q_t$; with a predefined transition matrix, the forward diffusion kernel is defined as $q(x_t \mid x_{t-1}) = x_{t-1} Q_t$ and $q(x_t \mid x_0) = x_0 \bar{Q}_t$, where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition probability matrix at time step t.
As a further preferred technical solution, the training of the denoising network is as follows: a denoising neural network $f_\theta$, parameterized by θ, is constructed; the network takes the noisy input $X_t$, a concatenation of the noisy amino acid types and the other amino acid attributes: the 20 one-hot encoded amino acid types and 15 geometric attributes of the protein backbone. The training goal of the denoising model is to predict the noise-free amino acid types $X_{aa}$, thereby modeling the diverse sequence combinations that may correspond to one protein structure while maintaining its inherent structural constraints.
As a further preferred technical solution, the parameterized generation process is as follows: the new amino acid sequence is generated by reverse-diffusion iteration over the amino acid type $x_{aa}$ of each node at time t (1 < t < T). The corresponding generation distribution $p_\theta(x_{t-1} \mid x_t)$ is estimated from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the neural network. The invention marginalizes over the network predictions to compute the generation distribution at each iteration step:
$$p_\theta(x_{t-1} \mid x_t) = \sum_{x_{aa}} q(x_{t-1} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$ is computed from the transition matrices, the node features $x_t$ at time t, and the amino acid type $x_{aa}$, and $x_{aa}$ is obtained by sampling from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the denoising network.
As one of the preferred technical solutions, step (2) specifically comprises two parts, a Markov transition matrix and secondary-structure conditioning. The Markov transition matrix is constructed as follows: the blocks substitution matrix (BLOSUM) is incorporated into the diffusion and generation processes and normalized into probabilities using a softmax function; the normalized blocks substitution matrix B is then adjusted with different probability temperatures to control the noise scale during diffusion; the transition matrix at time t is given by $Q_t = B^t$. The secondary-structure conditioning is as follows: DSSP is used to assign the secondary structure of each amino acid, represented by one-hot encoding; the neural network takes the one-hot codes as input when denoising the amino acids, so that the protein secondary structure is added as a guiding condition during denoising sampling.
As one of the preferred technical solutions, the specific method of step (3) is as follows:
For a protein graph of n amino acids, at layer l, the equivariant graph convolution (EGC) takes node embeddings $H^{(l)} = \{h_1^{(l)}, \ldots, h_n^{(l)}\}$ (implicit vector representations of amino acid type and geometry), edge embeddings $M^{(l)} = \{m_{ij}^{(l)}\}$ (hidden-layer features on connected nodes i and j), and the three-dimensional node coordinates $X_{pos}$. The modified EGC layer updates the hidden representations of nodes, $H^{(l+1)}$, and of edges, $M^{(l+1)}$; in short,
$$H^{(l+1)}, M^{(l+1)} = \mathrm{EGC}[H^{(l)}, X_{pos}, M^{(l)}].$$
As a further preferred solution, the EGC layer defines the following operations:
- for edge embeddings: $m_{ij} = \phi_e\big(h_i^{(l)}, h_j^{(l)}, \lVert x_i - x_j \rVert^2, m_{ij}^{(l)}\big)$;
- for node coordinates: $x_i^{(l+1)} = x_i^{(l)} + \sum_{j \ne i} \big(x_i^{(l)} - x_j^{(l)}\big)\, \phi_x(m_{ij})$;
- for hidden node representations: $h_i^{(l+1)} = \phi_h\big(h_i^{(l)}, \sum_{j \ne i} m_{ij}\big)$,
where $\phi_e$ and $\phi_h$ are edge and node propagation operations, and $\phi_x$ is an additional operation that projects the edge embedding vector $m_{ij}$ to a scalar. The modified EGC layer maintains rotation and translation equivariance of the node coordinates $X_{pos}$ and permutation invariance over the set of nodes.
As one of the preferred technical solutions, the specific method of step (4) is as follows: multi-step denoising is performed with the generation distribution
$$p_\theta(x_{t-k} \mid x_t) = \sum_{x_{aa}} q(x_{t-k} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where a temperature T regulates the degree of determinism or randomness of the estimated probability distribution, and the posterior distribution after multi-step (k-step) denoising is
$$q(x_{t-k} \mid x_t, x_{aa}) \propto \big(x_t (Q_{t-k+1} \cdots Q_t)^\top\big) \odot \big(x_{aa} \bar{Q}_{t-k}\big).$$
the invention has the beneficial effects that:
the present invention proposes a novel pattern denoising diffusion model for protein sequence generation, wherein a given protein backbone directs the diffusion process of the corresponding amino acid residue type. For the target protein, the model extrapolates the joint distribution of amino acid sequences subject to the biophysical properties of the amino acid nodes and the local environment. In addition, the invention utilizes the amino acid substitution matrix to carry out the diffusion forward process, and encodes the prior knowledge of the biological meaning of the amino acid, including the space and sequence neighbors and the self characteristics thereof, thereby reducing the sampling space in the generation process. The model of the invention achieves the leading performance in the aspect of sequence recovery and has great potential in the aspect of generating various protein sequences for determining the protein skeleton structure.
Diffusion probability models are attracting increasing interest because of their strong learning ability. Owing to their inherent randomness, they are able to generate a variety of diverse molecular outputs from a fixed set of conditions. For example, Torsional Diffusion learns the distribution of heavy-atom torsion angles, thereby modeling the conformations of small molecules. Meanwhile, SMCDiff addresses protein-folding tasks by learning the distribution of stable backbones that support a target motif. Similarly, DiffDock adopts a generative approach to protein–ligand docking, producing a range of possible ligand binding poses for a target pocket structure.
Although diffusion models are widely used, their full potential remains relatively unexplored in the context of protein inverse folding. Current sequence design methods are based primarily on language models, including masked language models and autoregressive generative models. By tokenizing amino acids, the masked language model treats sequence generation as a masked-token recovery process: such models typically mask selected tokens in a given context and then learn to predict these masked tokens. When trained with an appropriately parameterized target, this process can be regarded as a discrete denoising diffusion probabilistic model. By contrast, the autoregressive model can be regarded as a deterministic diffusion process: it introduces a conditional distribution for each token, but the overall dependence on the whole amino acid sequence is reconstructed through an independently performed diffusion process.
In contrast, the diffusion probability model employs an iterative prediction procedure that can generate progressively less noisy samples and exhibits the potential to capture the inherent diversity of the real data distribution. This unique property further underscores the promising role diffusion models may play in advancing protein sequence design. To bridge this gap, the applicant makes a first attempt at protein sequence generation using a discrete diffusion probabilistic model. The present invention casts sequence generation as a denoising problem, i.e., restoring randomly assigned amino acid types in a protein (backbone) graph to the wild type. The protein graph, containing all spatial and biochemical information of the amino acids, is represented by an equivariant graph neural network, with the diffusion process taking place at the graph nodes. On real sequence generation tasks, the proposed model achieves recovery of up to 70%, especially in biologically significant conserved regions. In addition, the structures predicted by AlphaFold2 for the generated sequences possess high pLDDT confidence and differ from the native protein structures by less than the experimental measurement error (3 Å), i.e., the structures remain essentially identical.
To preserve the required functions, the invention innovatively conditions the model on the secondary and tertiary structures, presented in the form of a residue graph with corresponding node features. The main contributions of the invention are threefold. First, the invention proposes GRADE-IF, an inverse folding diffusion model supported by a roto-translation equivariant graph neural network; compared with other models, it can produce a diverse set of sequence candidates. Second, unlike the uniform noise in traditional discrete diffusion models, the invention encodes prior knowledge of how amino acids respond to evolutionary pressure by using a blocks substitution matrix as the transition kernel. Furthermore, to accelerate sampling, the invention adapts the denoising diffusion implicit model (DDIM) from its original continuous form to a sampling method suitable for the discrete case, accelerating model training and inference.
Drawings
In order to make the objects, technical solutions and advantageous effects of the present invention more clear, the present invention is illustrated in the following drawings.
Fig. 1 is a diagram of the model structure.
FIG. 2 is a diagram of amino acids generated based on a protein backbone.
Fig. 3 is an example of amino acid probability distributions generated from a uniform random distribution (left) and from the blocks substitution matrix (right).
FIG. 4 is a diagram of the equivariant graph neural network used to fit the amino acid probability distribution during denoising.
FIG. 5 shows that GRADE-IF achieves higher recovery at internal conserved amino acids.
FIG. 6 shows that GRADE-IF achieves both quality and diversity of the generated proteins, compared with PiFold and ProteinMPNN.
Fig. 7 shows that DDIM can greatly increase model speed with little loss of model performance.
FIG. 8 shows a comparison between the native protein structure (PDB ID: 3FKF) and the structures obtained by AlphaFold2 folding of protein sequences generated by GRADE-IF at different recovery rates.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
1.1 Amino acid graph of the protein backbone
The residue graph is represented as $\mathcal{G} = \{X, E\}$ and describes the geometry of the protein (as shown in Figure 2). Specifically, each node represents an amino acid in the protein. Accordingly, each node is assigned a carefully selected set of node attributes X reflecting its biophysical and topological properties. The local environment of a given node is defined by its spatial neighbors, as determined by the k-nearest-neighbor (kNN) algorithm. Thus, each amino acid node is connected to at most k other nodes in the graph, specifically the k nodes with the smallest Euclidean distance within the contact region. The edge attributes, denoted $E \in \mathbb{R}^{93}$, describe the relationships between connected nodes. These relationships are determined by parameters such as inter-atomic distances, local N–C positions, and a sequence-position encoding scheme.
1.2 Protein sequence generation defined as a denoising problem
The goal of protein sequence generation is to design sequences that can fold into a pre-specified desired structure. The invention uses the coordinates of the Cα atoms to represent the three-dimensional positions of the amino acids in Euclidean space, and thereby the protein backbone. Building on naturally occurring protein structures, the model of the invention is constructed to generate the native sequence of a protein from its backbone atomic coordinates. Formally, the invention regards this problem as learning the conditional distribution $p(X_{aa} \mid X_{pos})$. Given a protein of length n with backbone Cα coordinates $X_{pos} = \{x_1, \ldots, x_n\}$, the goal is to predict $X_{aa}$, i.e., the original amino acid sequence. The density of each amino acid is modeled jointly with the other amino acids along the whole chain. The model of the invention is trained by minimizing the negative log-likelihood between the generated amino acid sequence and the native wild-type sequence. Sequences can then be designed by sampling, or by identifying the sequence that maximizes the conditional probability given the desired secondary and tertiary structures.
1.3 Denoising diffusion probabilistic model
The diffusion model belongs to the class of generative models, and its training stage comprises a diffusion process and a denoising process. The diffusion process $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ converts the original data $x_0 \sim q(x)$ over T steps into a series of latent variables $\{x_1, \ldots, x_T\}$, each carrying increasing noise. Conversely, the denoising process $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$ gradually reduces the noise in these latent variables, guiding them back to the original data distribution. This iterative denoising process is driven by differentiable operators, such as trainable neural networks.
In theory, $q(x_t \mid x_{t-1})$ has no prescribed form, but for efficient sampling, $p_\theta$ needs to satisfy several conditions:
(i) The diffusion kernel $q(x_t \mid x_0)$ should have a closed form, so that noisy data for different time steps can be sampled in parallel during training;
(ii) The kernel should have a tractable posterior $q(x_{t-1} \mid x_t, x_0)$, so that the generation distribution $p_\theta(x_{t-1} \mid x_t) = \int q(x_{t-1} \mid x_t, x_0) \, dp_\theta(x_0 \mid x_t)$ can serve as the target of a trainable neural network, where θ denotes the network parameters; the original input $x_0$ can thus be used as the prediction target;
(iii) The marginal distribution $q(x_T)$ should be independent of $x_0$, which allows the invention to use $q(x_T)$ as the prior distribution.
The above criteria are critical to developing a suitable noising module and training procedure. To meet these prerequisites, the invention follows the settings of previous studies. For K-class categorical data $x_t \in \{1, \ldots, K\}$ (in the protein sequence generation task, K = 20, representing the 20 amino acids), transition probabilities are given by the matrix $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$. Using the transition matrices and the one-hot encoded categorical feature $x_t$, the invention defines the transition kernel of the diffusion process by
$$q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\, x_{t-1} Q_t), \qquad q(x_t \mid x_0) = \mathrm{Cat}(x_t;\, x_0 \bar{Q}_t),$$
where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the product of the successive transition matrices. The posterior derived from Bayes' rule can be computed as
$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}.$$
Thus, the transition kernel, the model output at time t, and the process state $x_t$ can be used to determine the generation probability; iterative sampling then produces the final output $x_0$.
The probability distribution of the final output, $p(x_T)$, should be independent of the observation $x_0$. The construction of the transition matrix therefore requires a noise schedule. The simplest and most common choice is the uniform transition, which can be parameterized as
$$Q_t = \alpha_t I + (1 - \alpha_t)\, \mathbb{1}\mathbb{1}^\top / K,$$
where I is the identity matrix and $\mathbb{1}\mathbb{1}^\top$ is the all-ones matrix. As t approaches infinity, $\alpha_t$ gradually decays until it reaches 0. Thus, the distribution $q(x_T)$ asymptotically tends to the uniform distribution and becomes essentially independent of $x_0$.
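As an illustration of this construction, the following sketch builds uniform transition matrices under a cosine schedule (the schedule named in the training settings of Section 3.1) and accumulates them into $\bar{Q}_t$. The helper names and the exact schedule formula are assumptions for exposition.

```python
# Sketch: uniform transition matrices with a cosine noise schedule,
# following the D3PM-style construction described above.
import numpy as np

K = 20   # number of amino acid classes
T = 500  # total diffusion steps, as configured in the experiments

def cosine_alpha(t: int, T: int, s: float = 0.008) -> float:
    # Per-step alpha_t derived from a cosine alpha-bar schedule.
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(t - 1)

def uniform_Q(alpha: float, K: int) -> np.ndarray:
    # Q_t = alpha * I + (1 - alpha) * ones / K
    return alpha * np.eye(K) + (1.0 - alpha) * np.ones((K, K)) / K

Qs = [uniform_Q(cosine_alpha(t, T), K) for t in range(1, T + 1)]
Q_bar = [Qs[0]]
for Q in Qs[1:]:                     # cumulative products Q_bar_t = Q_1 ... Q_t
    Q_bar.append(Q_bar[-1] @ Q)
```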
A discrete graph probabilistic denoising diffusion model for protein sequence generation (FIG. 1): the diffusion process applies the transition matrix to the one-hot encoding vector of the amino acid type at each amino acid node for T steps, until the probability distribution of each amino acid is approximately uniform. The denoising (reverse diffusion) process denoises the amino acid types for T steps through a trainable deep neural network, finally yielding the joint distribution of amino acid sequences from which samples are generated. Conditions, including the secondary and tertiary structures of the protein, are added to the denoising process to guide the recovery of the probability distribution.
The model conditions on a given graph $\mathcal{G} = \{X, E\}$, where the node features are X and the edge features are E. Specifically, the node features comprise the amino acid positions, amino acid types, and spatial and biochemical properties, $X = [X_{pos}, X_{aa}, X_{prop}]$. The invention defines the diffusion process over the amino acid types $X_{aa}$ and performs denoising on the graph structure E encoded by an equivariant graph neural network. In addition, the invention incorporates protein-specific prior knowledge, including amino acid substitution scoring matrices and protein secondary structures. The invention also introduces a new acceleration algorithm for the transition-matrix-based discrete diffusion generation process.
2.1 Discrete probabilistic reverse diffusion process based on deep neural network training
Diffusion process: to capture the distribution of amino acid types, the invention adds noise independently to each amino acid node of a protein. For any given node, the transition probabilities are defined by the matrix $Q_t$. With predefined transition matrices, the invention defines the forward diffusion kernel as
$$q(x_t \mid x_{t-1}) = x_{t-1} Q_t \quad \text{and} \quad q(x_t \mid x_0) = x_0 \bar{Q}_t,$$
where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition probability matrix at time step t.
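Sampling the forward kernel then reduces to drawing from the categorical distribution $x_0 \bar{Q}_t$. A minimal sketch follows, reusing the Q_bar list from the previous sketch; the function name is an assumption.

```python
# Sketch: sample noisy amino acid types x_t ~ q(x_t | x_0) = Cat(x_0 Q_bar_t).
import numpy as np

def q_sample(x0: np.ndarray, t: int, Q_bar: list,
             rng: np.random.Generator) -> np.ndarray:
    """x0: (n,) integer amino acid types in [0, K). Returns noisy types at step t."""
    probs = Q_bar[t - 1][x0]            # row of Q_bar_t for each residue: (n, K)
    cum = probs.cumsum(axis=1)
    u = rng.random((x0.shape[0], 1))
    return (u < cum).argmax(axis=1)     # inverse-CDF categorical sample per residue

rng = np.random.default_rng(0)
x0 = rng.integers(0, 20, size=128)      # a mock 128-residue sequence
xt = q_sample(x0, t=250, Q_bar=Q_bar, rng=rng)
```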
Training the denoising network: the second component of the diffusion model is a denoising neural network $f_\theta$, parameterized by θ. The network takes the noisy input $X_t$, a concatenation of the noisy amino acid types and the other amino acid attributes, including the 20 one-hot encoded amino acid types and 15 geometric attributes related to the protein backbone, such as the solvent-accessible surface area (SASA), normalized surface-aware node features, backbone dihedral angles, and three-dimensional positions. Its objective is to predict the noise-free amino acid types $X_{aa}$, thereby modeling the potentially diverse amino acid sequences consistent with the protein structure while maintaining its inherent structural constraints. To train $f_\theta$, the invention optimizes the cross-entropy loss L between each node's predicted amino acid type probabilities $\tilde{p}_\theta(x_{aa} \mid x_t)$ and the true types.
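A hedged sketch of one training step follows; `model` stands in for the EGNN-based denoiser of Section 2.3, and its call signature is an assumption for illustration.

```python
# Sketch of one training step: the denoising network predicts the clean
# amino acid types from the noisy graph, trained with cross-entropy.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, node_feats, edge_index, t):
    """x0: (n,) true amino acid types; node_feats: noisy types concatenated
    with backbone attributes at step t. Returns the scalar loss value."""
    logits = model(node_feats, edge_index, t)   # (n, 20) per-residue scores
    loss = F.cross_entropy(logits, x0)          # compare with noise-free types
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```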
Parameterized generation process: the new amino acid sequence is generated by reverse-diffusion iteration over each node x. The generation distribution $p_\theta(x_{t-1} \mid x_t)$ is estimated from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the neural network. The invention marginalizes over the network predictions to compute the generation distribution at each iteration step:
$$p_\theta(x_{t-1} \mid x_t) = \sum_{x_{aa}} q(x_{t-1} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$ is computed from the transition matrices, the node feature state at time t, and the amino acid type $x_{aa}$, and $x_{aa}$ is sampled from the prediction probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ of the denoising network.
2.2 Conditional reverse diffusion distribution generation guided by protein prior knowledge
2.2.1 Markov transition matrices
The transition matrix serves as the guide of the discrete diffusion model and plays a key role in providing the transition probabilities from the current time step to the next. Since it reflects the likelihood of one amino acid type changing into another, this matrix is crucial in both the diffusion and the generation stages. In the diffusion stage, the transition matrix is applied repeatedly to the observed data, which evolve over time under the inherent noise. As the diffusion time increases, the probability of the original amino acid type gradually decays, eventually converging to a uniform distribution over all amino acid types. In the generation stage, the conditional probability $p_\theta(x_{t-1} \mid x_t)$ is influenced by the model predictions and the properties of the transition matrix Q.
In view of the biological specificity of amino acid substitutions, the transition probabilities between amino acids are not evenly distributed, so defining random transition directions during generation or sampling is not reasonable. Instead, the diffusion process can reflect evolutionary pressure by using amino acid substitution scoring matrices, which preserve protein function, structure, or stability. In this work, the invention uses the blocks substitution matrix (BLOSUM), which identifies conserved regions within proteins that are believed to have greater functional relevance. Based on empirical observations of protein evolution, BLOSUM provides an estimate of the likelihood of substitution between different amino acids. The invention therefore incorporates BLOSUM into the diffusion and generation processes. First, the matrix is normalized into probabilities using the softmax function. The invention then adjusts the normalized matrix B with different probability temperatures to control the noise scale during diffusion. The transition matrix at time t is thus given by $Q_t = B^t$. Using this matrix to shape the transition probabilities effectively reduces the generation space that needs to be sampled, concentrating the model's predictions in a meaningful subspace. Fig. 3 compares the transition matrices over time for the uniform and BLOSUM cases.
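A minimal sketch of this construction follows; the softmax normalization and the matrix power follow the description above, while the function names and the source of the raw BLOSUM scores are assumptions.

```python
# Sketch: build BLOSUM-based transition matrices for the diffusion process.
import numpy as np
from scipy.special import softmax

def blosum_transition(blosum: np.ndarray, temperature: float) -> np.ndarray:
    """blosum: (20, 20) raw substitution scores (e.g. BLOSUM62).
    Returns a row-stochastic matrix B via a temperature-scaled softmax."""
    return softmax(blosum / temperature, axis=1)

def Q_at_t(B: np.ndarray, t: int) -> np.ndarray:
    # Transition matrix at time t as the t-th matrix power of B.
    return np.linalg.matrix_power(B, t)
```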
2.2.2 Secondary structure
The secondary structure of a protein refers to the local spatial arrangement of the amino acid residues in the protein chain. Two common types of secondary structure are the α-helix and the β-sheet, both stabilized by hydrogen bonds between backbone atoms. The secondary structure is a critical intermediary, bridging the amino acid sequence and the overall three-dimensional conformation of the protein. In this work, the invention incorporates eight different types of secondary structure as conditions into the sampling process of the amino acid nodes. This strategy effectively reduces the exploration space of potential amino acid sequences. Specifically, the invention uses DSSP (Define Secondary Structure of Proteins) to assign the secondary structure of each amino acid and represents these structures with one-hot encoding. The neural network takes the one-hot codes as input when denoising the amino acids, so that the generated amino acid sequence is sampled with the secondary structure as a guiding condition.
By conditioning the search over amino acid sequences on secondary-structure elements such as α-helices and β-sheets, the sampling space of potential sequences can be markedly reduced and the generated protein sequences endowed with biological meaning. By conditioning each amino acid type during sampling on its corresponding secondary-structure type, the invention guides the generated protein sequences to attain an appropriate three-dimensional structure with viable thermal stability while retaining the ability to perform the intended function.
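A small sketch of the secondary-structure conditioning input; the eight-letter DSSP alphabet is standard, while the helper name is illustrative.

```python
# Sketch: one-hot encode the eight DSSP secondary-structure classes so they
# can be concatenated to the node features as a denoising condition.
import numpy as np

DSSP_CLASSES = "HBEGITS-"   # 8-class DSSP alphabet; '-' denotes loop/irregular

def ss_one_hot(ss_string: str) -> np.ndarray:
    """ss_string: per-residue DSSP codes, e.g. 'HHHH--EE'. Returns (n, 8)."""
    index = {c: i for i, c in enumerate(DSSP_CLASSES)}
    out = np.zeros((len(ss_string), len(DSSP_CLASSES)))
    for i, c in enumerate(ss_string):
        out[i, index[c]] = 1.0
    return out
```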
2.3 Equivariant graph neural network (FIG. 4)
Biomolecules such as proteins and small-molecule compounds are three-dimensional structures, and the model's predictions should remain consistent regardless of the position and orientation of the input protein while encoding strong and expressive hidden representations. This property can be ensured by using a rotation-equivariant neural network. One typical example is the equivariant graph neural network (EGNN). The invention modifies its SE(3)-equivariant neural layers, which update the node and edge representations, to maintain SO(3) rotation equivariance and E(3) translation invariance.
At layer l, the equivariant graph convolution (EGC) takes as input a set of n hidden node embeddings $H^{(l)} = \{h_1^{(l)}, \ldots, h_n^{(l)}\}$ describing amino acid type and geometry, edge embeddings $M^{(l)} = \{m_{ij}^{(l)}\}$ associated with connected nodes i and j, and the node coordinates $X_{pos}$. The goal of the modified EGC layer is to update the hidden representations of the nodes, $H^{(l+1)}$, and of the edges, $M^{(l+1)}$. In short, $H^{(l+1)}, M^{(l+1)} = \mathrm{EGC}[H^{(l)}, X_{pos}, M^{(l)}]$.
To achieve this goal, the EGC layer defines the following operations:
- for edge embeddings: $m_{ij} = \phi_e\big(h_i^{(l)}, h_j^{(l)}, \lVert x_i - x_j \rVert^2, m_{ij}^{(l)}\big)$;
- for node coordinates: $x_i^{(l+1)} = x_i^{(l)} + \sum_{j \ne i} \big(x_i^{(l)} - x_j^{(l)}\big)\, \phi_x(m_{ij})$;
- for hidden node representations: $h_i^{(l+1)} = \phi_h\big(h_i^{(l)}, \sum_{j \ne i} m_{ij}\big)$.
Here, $\phi_e$ and $\phi_h$ are edge and node propagation operations, and $\phi_x$ is an additional operation that projects the edge embedding vector $m_{ij}$ to a scalar. The modified EGC layer maintains rotation and translation equivariance of the node coordinates $X_{pos}$ and, like other graph neural networks, maintains permutation invariance over the set of nodes.
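The three operations can be sketched as one PyTorch layer; the hidden size follows the 128-dimensional setting reported in Section 3.1, while the MLP depths and activation choice are assumptions.

```python
# Sketch of an EGC layer matching the operations above.
import torch
import torch.nn as nn

class EGCLayer(nn.Module):
    def __init__(self, h_dim: int = 128, m_dim: int = 128):
        super().__init__()
        # phi_e, phi_h: edge/node propagation; phi_x projects edges to scalars.
        self.phi_e = nn.Sequential(nn.Linear(2 * h_dim + 1 + m_dim, m_dim), nn.SiLU())
        self.phi_x = nn.Linear(m_dim, 1)
        self.phi_h = nn.Sequential(nn.Linear(h_dim + m_dim, h_dim), nn.SiLU())

    def forward(self, h, x, m, edge_index):
        """h: (n, h_dim) node states; x: (n, 3) coordinates;
        m: (e, m_dim) edge states; edge_index: (2, e) node pairs (i, j)."""
        i, j = edge_index
        d2 = ((x[i] - x[j]) ** 2).sum(-1, keepdim=True)   # ||x_i - x_j||^2
        m_new = self.phi_e(torch.cat([h[i], h[j], d2, m], dim=-1))
        agg = torch.zeros(h.shape[0], m_new.shape[-1], device=h.device)
        agg.index_add_(0, i, m_new)                       # sum_j m_ij per node
        dx = torch.zeros_like(x)
        dx.index_add_(0, i, (x[i] - x[j]) * self.phi_x(m_new))
        h_new = self.phi_h(torch.cat([h, agg], dim=-1))   # node update
        return h_new, x + dx, m_new                       # equivariant update
```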
2.4 Sampling and accelerating the denoising process based on the denoising diffusion implicit model (DDIM)
One significant drawback of diffusion models is that the generation process is typically very slow, as it involves many incremental steps. To address this problem, the denoising diffusion implicit model (DDIM) is commonly used in continuous-variable diffusion generative models. DDIM is based on a non-Markovian forward diffusion process that is always conditioned on the input rather than on the previous step. By setting the noise variance of each step to 0, the reverse generation process becomes fully deterministic given an initial prior sample.
Similarly, since the generation probability $p_\theta(x_{t-1} \mid x_t)$ of the invention is computed from the predicted $x_{aa}$ and the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$, the invention can likewise make the generative model deterministic by controlling the sampling temperature of $\tilde{p}_\theta(x_{aa} \mid x_t)$. The invention therefore defines the multi-step generation process as
$$p_\theta(x_{t-k} \mid x_t) = \sum_{x_{aa}} q(x_{t-k} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the temperature T controls whether the process is deterministic or stochastic, and the multi-step posterior distribution is
$$q(x_{t-k} \mid x_t, x_{aa}) \propto \big(x_t (Q_{t-k+1} \cdots Q_t)^\top\big) \odot \big(x_{aa} \bar{Q}_{t-k}\big).$$
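A NumPy sketch of the resulting skip sampler follows; it assumes `model` returns per-node probabilities over the 20 types and omits temperature handling for brevity.

```python
# Sketch: DDIM-style skip sampling over the discrete chain, stepping the
# reverse process k steps at a time using the multi-step posterior above.
import numpy as np

def skip_sample(model, node_feats, edge_index, Qs, Q_bar, T=500, k=10,
                rng=np.random.default_rng()):
    n, K = node_feats.shape[0], 20
    xt = rng.integers(0, K, size=n)          # start from the uniform prior
    for t in range(T, 0, -k):
        p_x0 = model(node_feats, edge_index, xt, t)       # (n, K) prediction
        t_prev = max(t - k, 0)
        if t_prev == 0:
            return p_x0.argmax(axis=1)       # final deterministic readout
        # Multi-step posterior q(x_{t-k} | x_t, x0), marginalized over p_x0.
        Q_span = np.linalg.multi_dot(Qs[t_prev:t]) if t - t_prev > 1 else Qs[t - 1]
        fact1 = Q_span[:, xt].T[:, None, :]  # (n, 1, K)
        fact2 = Q_bar[t_prev - 1][None, :, :]  # (1, K, K)
        post = fact1 * fact2
        post /= post.sum(-1, keepdims=True) + 1e-12
        probs = np.einsum('nk,nkj->nj', p_x0, post)
        xt = np.array([rng.choice(K, p=p / p.sum()) for p in probs])
    return xt
```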
3.1 Experimental protocol
Training setting: the present invention employs CATH v4.3.0 dataset partitioning based on GRAPHTRANS (John ingham, vikas Garg, region Barzilay, and Tommi jaakkola. Generated models for graph-based protein design. Advanced in Neural Information Processing Systems,32,2019.) and GVP (Bowen jin, stephan Eismann, patricia Suriana, raphael John Lamarre Townshend, and Ron driver. Learning from protein structure with geometric vector superns. In International Conference on Learning Representations, 2021.). Proteins were classified according to the CATH topology classification, and were divided into 18,024 proteins for training, 608 for validation, and 1,120 for testing. To evaluate the quality of production of different proteins, the present invention tested the model of the present invention on three different categories: short sequence, single chain and total protein. Wherein the short sequence comprises a protein having a length of less than 100; the single chain class comprises proteins consisting of a single chain. Furthermore, the total time step number of the diffusion model is configured to be 500, following the noise setting of cosine scheduling. For a denoising network, the present invention implements six stacked EGNN blocks, each block having 128 hidden dimensions. The model exercises 200 epochs by default and uses Adam optimizers. During training, the invention employs a batch size of 64 and a learning rate of 0.0005. In addition, to prevent overfitting, the invention introduces a dropout rate of 0.1 in the architecture of the model.
Evaluation metrics: the invention measures the quality of recovered protein sequences by perplexity and recovery rate. Perplexity measures how well the amino acid probabilities predicted by the model match the actual amino acid at each position in the sequence; lower perplexity indicates a better fit of the model to the data. The recovery rate assesses the model's ability to recover the correct amino acid sequence from the three-dimensional structure of the protein; it is typically computed as the proportion of amino acids in the predicted sequence that match the original sequence. A higher recovery rate indicates a stronger ability to predict the original sequence from the structure.
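Both metrics are straightforward to compute; a minimal sketch (names are illustrative):

```python
# Sketch: recovery rate and perplexity as described above. `logits` are the
# model's per-residue scores over the 20 types; `target` the native types.
import torch
import torch.nn.functional as F

def recovery_rate(logits: torch.Tensor, target: torch.Tensor) -> float:
    """Fraction of positions where the argmax prediction matches the native type."""
    return (logits.argmax(dim=-1) == target).float().mean().item()

def perplexity(logits: torch.Tensor, target: torch.Tensor) -> float:
    """exp of the mean per-residue cross-entropy (lower is better)."""
    return torch.exp(F.cross_entropy(logits, target)).item()
```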
TABLE 1 Performance of GRADE-IF in recovering proteins in CATH
TABLE 2 Performance of GRADE-IF in recovering proteins in TS50 and TS500
3.2 Generation of completely new sequences
Table 1 compares the performance of GRADE-IF in recovering proteins in CATH. To generate high-confidence sequences, GRADE-IF eliminates uncertainty in the prior by taking the most probable type under the approximate probability $\tilde{p}_\theta(x_{aa} \mid x_t)$. Notably, the invention finds that recovery on single-chain proteins and short sequences improves by 4.2% and 5.4%, respectively. The invention also evaluates on the TS50 and TS500 datasets; the results are shown in Table 2.
After subdividing recovery performance into buried and surface amino acids, the invention finds that the more conserved core residues show higher native-sequence recovery, whereas amino acids on the active surface show lower sequence recovery. FIG. 5 examines amino acid conservation via the solvent-accessible surface area (SASA) (where SASA < 0.25 indicates internal amino acids) and the number of contacts (the number of amino acids adjacent in 3D space) [10]. In all three protein categories, the recovery of internal residues is significantly higher than that of external residues, and recovery increases with the number of contacts. The invention also reports recovery rates for different secondary structures: high recovery is obtained for most secondary structures, and only the rare five-turn (π) helix shows a lower recovery rate.
The invention also compares GRADE-IF with PiFold (Zhangyang Gao, Cheng Tan, and Stan Z. Li. PiFold: Toward effective and efficient protein inverse folding. In International Conference on Learning Representations, 2023.) and ProteinMPNN (Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Alexis Courbet, Rob J. de Haas, Neville Bethel, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615): 49–56, 2022.) in FIG. 6. For a given backbone, the invention generates 100 sequences with less than 50% self-similarity and projects them into two-dimensional space using t-SNE. At the same level of diversity, the GRADE-IF samples enclose the wild-type sequence, while the other two methods fail to include the wild type within their sample regions. Furthermore, at the 45% recovery threshold for this protein, GRADE-IF still generates a large number of samples, whereas the other two methods degenerate to deterministic outputs. This further demonstrates the superiority of the model in achieving both sequence diversity and high recovery.
The invention also evaluates the accelerated sampling algorithm on this dataset, as shown in FIG. 7. With DDIM, the invention can skip k steps in the sampling phase. The invention selects a series of step sizes and evaluates performance by the recovery rate and the time required to sample 1,200 sequences. The recovery rate decreases slightly with increasing step size, reaching 48.13% at a step size of 100. However, at a step size of 100, sampling is 100 times faster than at step size 1, showing significant acceleration.
3.3 Folding prediction for generated sequences
The invention extends the study to the foldability of sequences generated at different sequence recovery rates. FIG. 8 compares the crystal structure of a native protein (PDB ID: 3FKF) with the AlphaFold2-predicted structures of three sequences generated by GRADE-IF at different recovery rates. The folded structures of all generated sequences are nearly identical to the native structure: the root-mean-square deviation over the 139 residues remains below the experimental measurement error (3 Å) noted above, and the average pLDDT score is 0.835, indicating reliable folding, compared with a pLDDT of 0.91 for the native protein. Combined with the evidence in FIG. 7, this shows the advantage of the method in producing structurally consistent results, and the invention is confident that GRADE-IF can generate biologically plausible new sequences for a given protein structure.
Finally, it is noted that the above preferred embodiments are intended to illustrate rather than limit the invention. Although the invention has been described in detail by means of the above preferred embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention as defined by the appended claims.

Claims (9)

1. A discrete graph probabilistic denoising diffusion model for protein sequence generation, characterized in that a given protein backbone guides the corresponding amino acid residue types, specifically comprising:
(1) A discrete probabilistic reverse diffusion process based on deep neural network training;
(2) Conditional reverse diffusion distribution generation guided by protein prior knowledge;
(3) An equivariant graph neural network used to parameterize the reverse-diffusion denoising process;
(4) Sampling and acceleration of the denoising process based on a denoising diffusion implicit model.
2. The model according to claim 1, characterized in that the model conditions on a given graph $\mathcal{G} = \{X, E\}$, where the node features are X and the edge features are E; the node features $X = [X_{pos}, X_{aa}, X_{prop}]$ comprise the amino acid positions $X_{pos}$, the amino acid types $X_{aa}$, and spatial and biochemical properties $X_{prop}$.
3. The model according to claim 1, wherein step (1) specifically comprises a diffusion process, training a denoising network, and a parameterized generation process.
4. The model according to claim 2, wherein in step (1), the diffusion process is as follows: noise is added independently to each amino acid node of a protein; for any given node, the transition probabilities are defined by the matrix $Q_t$; with a predefined transition matrix, the forward diffusion kernel is defined as $q(x_t \mid x_{t-1}) = x_{t-1} Q_t$ and $q(x_t \mid x_0) = x_0 \bar{Q}_t$, where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition probability matrix at time step t.
5. The model according to claim 2, wherein in step (1), the training of the denoising network is as follows: a denoising neural network $f_\theta$, parameterized by θ, is constructed; the network takes the noisy input $X_t$, a concatenation of the noisy amino acid types and the other amino acid attributes, including the 20 one-hot encoded amino acid types and 15 geometric attributes of the protein backbone; the training goal of the denoising model is to predict the noise-free amino acid types $X_{aa}$, thereby modeling the diverse sequence combinations that may correspond to one protein structure while maintaining its inherent structural constraints.
6. The model according to claim 2, wherein in step (1), the parameterized generation process is as follows: the new amino acid sequence is generated by reverse-diffusion iteration over the amino acid type $x_{aa}$ of each node at time t; the corresponding generation distribution $p_\theta(x_{t-1} \mid x_t)$ is estimated from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the neural network; the generation distribution at each iteration step is computed by marginalizing the deep neural network's predicted probabilities:
$$p_\theta(x_{t-1} \mid x_t) = \sum_{x_{aa}} q(x_{t-1} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$ is computed from the transition matrices, the node features $x_t$ at time t, and the amino acid type $x_{aa}$, and $x_{aa}$ is obtained by sampling from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the denoising network.
7. The model according to claim 1, wherein step (2) specifically comprises two parts, a Markov transition matrix and secondary-structure conditioning, wherein the Markov transition matrix is constructed as follows: first, the blocks substitution matrix is incorporated into the diffusion and generation processes and normalized into probabilities using a softmax function; then the normalized blocks substitution matrix B is adjusted with different probability temperatures to control the noise scale in the diffusion process; the transition matrix at time t is given by $Q_t = B^t$; the secondary-structure conditioning is as follows: DSSP is used to assign the secondary structure of each amino acid, and one-hot encoding is used to represent these structures; the neural network takes the one-hot codes as input when denoising the amino acids, so that the protein secondary structure is added as a guiding condition in the denoising sampling process.
8. The model according to claim 1, wherein the specific method of step (3) is as follows:
For a protein graph of n amino acids, at layer l, the equivariant graph convolution takes node embeddings $H^{(l)} = \{h_1^{(l)}, \ldots, h_n^{(l)}\}$ (implicit vector representations of amino acid type and geometry), edge embeddings $M^{(l)} = \{m_{ij}^{(l)}\}$ (hidden-layer features on connected nodes i and j), and the three-dimensional node coordinates $X_{pos}$; the modified EGC layer updates the hidden representations of the layer-(l+1) nodes, $H^{(l+1)}$, and of the edges, $M^{(l+1)}$; i.e.,
$$H^{(l+1)}, M^{(l+1)} = \mathrm{EGC}[H^{(l)}, X_{pos}, M^{(l)}].$$
9. The model according to claim 1, wherein the specific method of step (4) is: multi-step denoising is performed with the generation distribution
$$p_\theta(x_{t-k} \mid x_t) = \sum_{x_{aa}} q(x_{t-k} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where a temperature T regulates the degree of determinism or randomness of the estimated probability distribution, and the posterior distribution after multi-step denoising is
$$q(x_{t-k} \mid x_t, x_{aa}) \propto \big(x_t (Q_{t-k+1} \cdots Q_t)^\top\big) \odot \big(x_{aa} \bar{Q}_{t-k}\big).$$
Application CN202310995978.5A (priority date 2023-08-09, filing date 2023-08-09): Discrete graph probability denoising diffusion model for protein sequence generation. Status: Pending. Publication: CN116994642A.

Priority Applications (1)

Application Number: CN202310995978.5A
Publication: CN116994642A (en)
Title: Discrete graph probability denoising diffusion model for protein sequence generation


Publications (1)

Publication Number: CN116994642A
Publication Date: 2023-11-03

Family

ID=88531833

Family Applications (1)

Application Number: CN202310995978.5A (Pending)
Publication: CN116994642A (en)

Country Status (1)

Country: CN
Publication: CN116994642A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423396A (en) * 2023-12-18 2024-01-19 烟台国工智能科技有限公司 Crystal structure generation method and device based on diffusion model
CN117423396B (en) * 2023-12-18 2024-03-08 烟台国工智能科技有限公司 Crystal structure generation method and device based on diffusion model


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination