CN116994642A - Discrete graph probability denoising diffusion model for protein sequence generation


Info

Publication number
CN116994642A
Authority
CN
China
Prior art keywords
amino acid
diffusion
model
protein
denoising
Prior art date
Legal status
Pending
Application number
CN202310995978.5A
Other languages
Chinese (zh)
Inventor
周冰心 (Bingxin Zhou)
郑力荣 (Lirong Zheng)
吴邦昊 (Banghao Wu)
洪亮 (Liang Hong)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202310995978.5A
Publication of CN116994642A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B 5/20: Probabilistic models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/042: Knowledge-based neural networks; Logical representations of neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The invention discloses a discrete graph probabilistic denoising diffusion model for protein sequence generation, in which a given protein backbone guides the diffusion process of the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the biophysical properties of each node and its local environment. In addition, the invention uses an amino acid substitution matrix in the forward diffusion process to encode prior knowledge of the biological meaning of each amino acid, including its spatial and sequential neighbors and its own characteristics, thereby reducing the sampling space during generation. The model achieves leading performance in sequence recovery and shows great potential for generating diverse protein sequences for a given protein backbone structure.

Description

Discrete graph probability denoising diffusion model for protein sequence generation
Technical Field
The invention belongs to the technical field of protein sequence generation, and relates to a discrete graph probability denoising diffusion model for protein sequence generation.
Background
Protein sequence generation, or de novo protein design, aims to predict viable amino acid sequences that can fold into a specific 3D protein structure or possess a specific function. Developing methods for protein sequence generation facilitates the design of novel proteins with desirable structural and functional properties. Such proteins can be applied in various fields, such as targeted drug delivery and enzyme design, with broad prospects for both academic research and industrial applications.
Protein sequence generation is a challenging task because both structure-to-sequence and function-to-sequence are one-to-many mappings: many amino acid sequences may fold into one and the same protein backbone and perform the same function. In addition, protein sequence generation remains very challenging in itself due to the large sequence space, the complexity of protein folding, and the intricate mechanisms underlying protein function. Besides determining the folding state of proteins by physics-based energy reasoning, recent advances in deep learning have made significant progress in directly learning the mapping from protein structure to amino acid sequence. For example, discriminative models cast the problem as predicting the most likely sequence for a given structure using a Transformer-based model. However, such models typically give only a single fixed answer and are therefore unsatisfactory at accurately capturing the one-to-many mapping from protein structures to non-unique amino acid sequences. The diffusion probabilistic model, as an emerging class of generative methods, offers the potential to generate a diverse set of sequence candidates for a given protein backbone.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a discrete graph probabilistic denoising diffusion model for protein sequence generation.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a discrete map probabilistic denoising diffusion model for protein sequence generation, a given protein backbone guiding a corresponding amino acid residue type, comprising in particular:
(1) Discrete probability back diffusion process based on deep neural network training;
(2) Generating conditional back diffusion distribution based on protein priori knowledge guidance;
(3) An isomorphic map neural network (EGNN) for use in a back-diffusion parametric denoising process;
(4) The denoising process is sampled and accelerated based on a Denoising Diffusion Implicit Model (DDIM).
As one of the preferred technical solutions, the model conditions on a given graph $\mathcal{G} = \{X, E\}$, where the node features are X and the edge features are E; the node features $X = [X_{pos}, X_{aa}, X_{prop}]$ comprise the amino acid positions $X_{pos}$, the amino acid types $X_{aa}$, and spatial and biochemical properties $X_{prop}$.
As one of the preferred technical solutions, step (1) specifically comprises a diffusion process, training a denoising network, and a parameterized generation process.
As a further preferred technical solution, the diffusion process is as follows: noise is added independently to each amino acid node of a protein; for any given node, the transition probabilities are defined by a matrix $Q_t$; with a predefined transition matrix, the forward diffusion kernel is defined as $q(x_t \mid x_{t-1}) = x_{t-1} Q_t$ and $q(x_t \mid x_0) = x_0 \bar{Q}_t$, where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition probability matrix at time step t.
As a further preferred technical solution, the training of the denoising network is as follows: a denoising neural network $f_\theta$, parameterized by θ, is constructed; the network takes the noisy input $X_t$, a concatenation of the noisy amino acid types and the other amino acid attributes: the 20 one-hot encoded amino acid types and 15 geometric attributes of the protein backbone. The training goal of the denoising model is to predict the noise-free amino acid types $X_{aa}$, thereby modeling the diverse sequence combinations that may correspond to one protein structure while maintaining its inherent structural constraints.
As a further preferred technical solution, the parameterized generation process is as follows: the new amino acid sequence is generated by reverse-diffusion iteration over the amino acid type $x_{aa}$ of each node at time t (1 < t < T). The corresponding generation distribution $p_\theta(x_{t-1} \mid x_t)$ is estimated from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the neural network. The invention marginalizes over the network predictions to compute the generation distribution at each iteration step:
$$p_\theta(x_{t-1} \mid x_t) = \sum_{x_{aa}} q(x_{t-1} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$ is computed from the transition matrices, the node features $x_t$ at time t, and the amino acid type $x_{aa}$, and $x_{aa}$ is obtained by sampling from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the denoising network.
As one of the preferred technical solutions, step (2) specifically comprises two parts, a Markov transition matrix and secondary-structure conditioning. The Markov transition matrix is constructed as follows: the blocks substitution matrix (BLOSUM) is incorporated into the diffusion and generation processes and normalized into probabilities using a softmax function; the normalized blocks substitution matrix B is then adjusted with different probability temperatures to control the noise scale during diffusion; the transition matrix at time t is given by $Q_t = B^t$. The secondary-structure conditioning is as follows: DSSP is used to assign the secondary structure of each amino acid, represented by one-hot encoding; the neural network takes the one-hot codes as input when denoising the amino acids, so that the protein secondary structure is added as a guiding condition during denoising sampling.
As one of the preferred technical solutions, the specific method of step (3) is as follows:
For a protein graph of n amino acids, at layer l, the equivariant graph convolution (EGC) takes node embeddings $H^{(l)} = \{h_1^{(l)}, \ldots, h_n^{(l)}\}$ (implicit vector representations of amino acid type and geometry), edge embeddings $M^{(l)} = \{m_{ij}^{(l)}\}$ (hidden-layer features on connected nodes i and j), and the three-dimensional node coordinates $X_{pos}$. The modified EGC layer updates the hidden representations of nodes, $H^{(l+1)}$, and of edges, $M^{(l+1)}$; in short,
$$H^{(l+1)}, M^{(l+1)} = \mathrm{EGC}[H^{(l)}, X_{pos}, M^{(l)}].$$
As a further preferred solution, the EGC layer defines the following operations:
- for edge embeddings: $m_{ij} = \phi_e\big(h_i^{(l)}, h_j^{(l)}, \lVert x_i - x_j \rVert^2, m_{ij}^{(l)}\big)$;
- for node coordinates: $x_i^{(l+1)} = x_i^{(l)} + \sum_{j \ne i} \big(x_i^{(l)} - x_j^{(l)}\big)\, \phi_x(m_{ij})$;
- for hidden node representations: $h_i^{(l+1)} = \phi_h\big(h_i^{(l)}, \sum_{j \ne i} m_{ij}\big)$,
where $\phi_e$ and $\phi_h$ are edge and node propagation operations, and $\phi_x$ is an additional operation that projects the edge embedding vector $m_{ij}$ to a scalar. The modified EGC layer maintains rotation and translation equivariance of the node coordinates $X_{pos}$ and permutation invariance over the set of nodes.
As one of the preferred technical solutions, the specific method of step (4) is as follows: multi-step denoising is performed with the generation distribution
$$p_\theta(x_{t-k} \mid x_t) = \sum_{x_{aa}} q(x_{t-k} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where a temperature T regulates the degree of determinism or randomness of the estimated probability distribution, and the posterior distribution after multi-step (k-step) denoising is
$$q(x_{t-k} \mid x_t, x_{aa}) \propto \big(x_t (Q_{t-k+1} \cdots Q_t)^\top\big) \odot \big(x_{aa} \bar{Q}_{t-k}\big).$$
the invention has the beneficial effects that:
the present invention proposes a novel pattern denoising diffusion model for protein sequence generation, wherein a given protein backbone directs the diffusion process of the corresponding amino acid residue type. For the target protein, the model extrapolates the joint distribution of amino acid sequences subject to the biophysical properties of the amino acid nodes and the local environment. In addition, the invention utilizes the amino acid substitution matrix to carry out the diffusion forward process, and encodes the prior knowledge of the biological meaning of the amino acid, including the space and sequence neighbors and the self characteristics thereof, thereby reducing the sampling space in the generation process. The model of the invention achieves the leading performance in the aspect of sequence recovery and has great potential in the aspect of generating various protein sequences for determining the protein skeleton structure.
Diffusion probability models are attracting increasing interest because of their strong learning ability. Owing to their inherent randomness, they are able to generate a variety of diverse molecular outputs from a fixed set of conditions. For example, Torsional Diffusion learns the distribution of heavy-atom torsion angles, thereby modeling the conformations of small molecules. Meanwhile, SMCDiff addresses protein-folding tasks by learning the distribution of stable backbones that support a target motif. Similarly, DiffDock adopts a generative approach to protein–ligand docking, producing a range of possible ligand binding poses for a target pocket structure.
Although diffusion models are widely used, their full potential remains relatively unexplored in the context of protein inverse folding. Current sequence design methods are based primarily on language models, including masked language models and autoregressive generative models. By tokenizing amino acids, the masked language model treats sequence generation as a masked-token recovery process: such models typically mask selected tokens in a given context and then learn to predict these masked tokens. When trained with an appropriately parameterized target, this process can be regarded as a discrete denoising diffusion probabilistic model. By contrast, the autoregressive model can be regarded as a deterministic diffusion process: it introduces a conditional distribution for each token, but the overall dependence on the whole amino acid sequence is reconstructed through an independently performed diffusion process.
In contrast, the diffusion probability model employs an iterative prediction procedure that can generate progressively less noisy samples and exhibits the potential to capture the inherent diversity of the real data distribution. This unique property further underscores the promising role diffusion models may play in advancing protein sequence design. To bridge this gap, the applicant makes a first attempt at protein sequence generation using a discrete diffusion probabilistic model. The present invention casts sequence generation as a denoising problem, i.e., restoring randomly assigned amino acid types in a protein (backbone) graph to the wild type. The protein graph, containing all spatial and biochemical information of the amino acids, is represented by an equivariant graph neural network, with the diffusion process taking place at the graph nodes. On real sequence generation tasks, the proposed model achieves recovery of up to 70%, especially in biologically significant conserved regions. In addition, the structures predicted by AlphaFold2 for the generated sequences possess high pLDDT confidence and differ from the native protein structures by less than the experimental measurement error (3 Å), i.e., the structures remain essentially identical.
To preserve the required functions, the invention innovatively conditions the model on the secondary and tertiary structures, presented in the form of a residue graph with corresponding node features. The main contributions of the invention are threefold. First, the invention proposes GRADE-IF, an inverse folding diffusion model supported by a roto-translation equivariant graph neural network; compared with other models, it can produce a diverse set of sequence candidates. Second, unlike the uniform noise in traditional discrete diffusion models, the invention encodes prior knowledge of how amino acids respond to evolutionary pressure by using a blocks substitution matrix as the transition kernel. Furthermore, to accelerate sampling, the invention adapts the denoising diffusion implicit model (DDIM) from its original continuous form to a sampling method suitable for the discrete case, accelerating model training and inference.
Drawings
In order to make the objects, technical solutions and advantageous effects of the present invention more clear, the present invention is illustrated in the following drawings.
Fig. 1 is a diagram of the model structure.
FIG. 2 is a diagram of amino acids generated based on a protein backbone.
Fig. 3 is an example of amino acid probability distributions generated from a uniform random distribution (left) and from the blocks substitution matrix (right).
FIG. 4 is a diagram of the equivariant graph neural network used to fit the amino acid probability distribution during denoising.
FIG. 5 shows that GRADE-IF achieves higher recovery at internal conserved amino acids.
FIG. 6 shows that GRADE-IF achieves both quality and diversity of the generated proteins, compared with PiFold and ProteinMPNN.
Fig. 7 shows that DDIM can greatly increase model speed with little loss of model performance.
FIG. 8 shows a comparison between the native protein structure (PDB ID: 3FKF) and the structures obtained by AlphaFold2 folding of protein sequences generated by GRADE-IF at different recovery rates.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
1.1 Amino acid graph of the protein backbone
The residue graph is represented as $\mathcal{G} = \{X, E\}$ and describes the geometry of the protein (as shown in Figure 2). Specifically, each node represents an amino acid in the protein. Accordingly, each node is assigned a carefully selected set of node attributes X reflecting its biophysical and topological properties. The local environment of a given node is defined by its spatial neighbors, as determined by the k-nearest-neighbor (kNN) algorithm. Thus, each amino acid node is connected to at most k other nodes in the graph, specifically the k nodes with the smallest Euclidean distance within the contact region. The edge attributes, denoted $E \in \mathbb{R}^{93}$, describe the relationships between connected nodes. These relationships are determined by parameters such as inter-atomic distances, local N–C positions, and a sequence-position encoding scheme.
1.2 Protein sequence generation defined as a denoising problem
The goal of protein sequence generation is to design sequences that can fold into a pre-specified desired structure. The invention uses the coordinates of the Cα atoms to represent the three-dimensional positions of the amino acids in Euclidean space, and thereby the protein backbone. Building on naturally occurring protein structures, the model of the invention is constructed to generate the native sequence of a protein from its backbone atomic coordinates. Formally, the invention regards this problem as learning the conditional distribution $p(X_{aa} \mid X_{pos})$. Given a protein of length n with backbone Cα coordinates $X_{pos} = \{x_1, \ldots, x_n\}$, the goal is to predict $X_{aa}$, i.e., the original amino acid sequence. The density of each amino acid is modeled jointly with the other amino acids along the whole chain. The model of the invention is trained by minimizing the negative log-likelihood between the generated amino acid sequence and the native wild-type sequence. Sequences can then be designed by sampling, or by identifying the sequence that maximizes the conditional probability given the desired secondary and tertiary structures.
1.3 Denoising diffusion probabilistic model
The diffusion model belongs to the class of generative models, and its training stage comprises a diffusion process and a denoising process. The diffusion process $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ converts the original data $x_0 \sim q(x)$ over T steps into a series of latent variables $\{x_1, \ldots, x_T\}$, each carrying increasing noise. Conversely, the denoising process $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$ gradually reduces the noise in these latent variables, guiding them back to the original data distribution. This iterative denoising process is driven by differentiable operators, such as trainable neural networks.
In theory, $q(x_t \mid x_{t-1})$ has no prescribed form, but for efficient sampling, $p_\theta$ needs to satisfy several conditions:
(i) The diffusion kernel $q(x_t \mid x_0)$ should have a closed form, so that noisy data for different time steps can be sampled in parallel during training;
(ii) The kernel should have a tractable posterior $q(x_{t-1} \mid x_t, x_0)$, so that the generation distribution $p_\theta(x_{t-1} \mid x_t) = \int q(x_{t-1} \mid x_t, x_0) \, dp_\theta(x_0 \mid x_t)$ can serve as the target of a trainable neural network, where θ denotes the network parameters; the original input $x_0$ can thus be used as the prediction target;
(iii) The marginal distribution $q(x_T)$ should be independent of $x_0$, which allows the invention to use $q(x_T)$ as the prior distribution.
The above criteria are critical to developing a suitable noising module and training procedure. To meet these prerequisites, the invention follows the settings of previous studies. For K-class categorical data $x_t \in \{1, \ldots, K\}$ (in the protein sequence generation task, K = 20, representing the 20 amino acids), transition probabilities are given by the matrix $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$. Using the transition matrices and the one-hot encoded categorical feature $x_t$, the invention defines the transition kernel of the diffusion process by
$$q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\, x_{t-1} Q_t), \qquad q(x_t \mid x_0) = \mathrm{Cat}(x_t;\, x_0 \bar{Q}_t),$$
where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the product of the successive transition matrices. The posterior derived from Bayes' rule can be computed as
$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}.$$
Thus, the transition kernel, the model output at time t, and the process state $x_t$ can be used to determine the generation probability; iterative sampling then produces the final output $x_0$.
The probability distribution of the final output, $p(x_T)$, should be independent of the observation $x_0$. The construction of the transition matrix therefore requires a noise schedule. The simplest and most common choice is the uniform transition, which can be parameterized as
$$Q_t = \alpha_t I + (1 - \alpha_t)\, \mathbb{1}\mathbb{1}^\top / K,$$
where I is the identity matrix and $\mathbb{1}\mathbb{1}^\top$ is the all-ones matrix. As t approaches infinity, $\alpha_t$ gradually decays until it reaches 0. Thus, the distribution $q(x_T)$ asymptotically tends to the uniform distribution and becomes essentially independent of $x_0$.
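As an illustration of this construction, the following sketch builds uniform transition matrices under a cosine schedule (the schedule named in the training settings of Section 3.1) and accumulates them into $\bar{Q}_t$. The helper names and the exact schedule formula are assumptions for exposition.

```python
# Sketch: uniform transition matrices with a cosine noise schedule,
# following the D3PM-style construction described above.
import numpy as np

K = 20   # number of amino acid classes
T = 500  # total diffusion steps, as configured in the experiments

def cosine_alpha(t: int, T: int, s: float = 0.008) -> float:
    # Per-step alpha_t derived from a cosine alpha-bar schedule.
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(t - 1)

def uniform_Q(alpha: float, K: int) -> np.ndarray:
    # Q_t = alpha * I + (1 - alpha) * ones / K
    return alpha * np.eye(K) + (1.0 - alpha) * np.ones((K, K)) / K

Qs = [uniform_Q(cosine_alpha(t, T), K) for t in range(1, T + 1)]
Q_bar = [Qs[0]]
for Q in Qs[1:]:                     # cumulative products Q_bar_t = Q_1 ... Q_t
    Q_bar.append(Q_bar[-1] @ Q)
```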
A discrete graph probabilistic denoising diffusion model for protein sequence generation (FIG. 1): the diffusion process applies the transition matrix to the one-hot encoding vector of the amino acid type at each amino acid node for T steps, until the probability distribution of each amino acid is approximately uniform. The denoising (reverse diffusion) process denoises the amino acid types for T steps through a trainable deep neural network, finally yielding the joint distribution of amino acid sequences from which samples are generated. Conditions, including the secondary and tertiary structures of the protein, are added to the denoising process to guide the recovery of the probability distribution.
The model conditions on a given graph $\mathcal{G} = \{X, E\}$, where the node features are X and the edge features are E. Specifically, the node features comprise the amino acid positions, amino acid types, and spatial and biochemical properties, $X = [X_{pos}, X_{aa}, X_{prop}]$. The invention defines the diffusion process over the amino acid types $X_{aa}$ and performs denoising on the graph structure E encoded by an equivariant graph neural network. In addition, the invention incorporates protein-specific prior knowledge, including amino acid substitution scoring matrices and protein secondary structures. The invention also introduces a new acceleration algorithm for the transition-matrix-based discrete diffusion generation process.
2.1 Discrete probabilistic reverse diffusion process based on deep neural network training
Diffusion process: to capture the distribution of amino acid types, the invention adds noise independently to each amino acid node of a protein. For any given node, the transition probabilities are defined by the matrix $Q_t$. With predefined transition matrices, the invention defines the forward diffusion kernel as
$$q(x_t \mid x_{t-1}) = x_{t-1} Q_t \quad \text{and} \quad q(x_t \mid x_0) = x_0 \bar{Q}_t,$$
where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition probability matrix at time step t.
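Sampling the forward kernel then reduces to drawing from the categorical distribution $x_0 \bar{Q}_t$. A minimal sketch follows, reusing the Q_bar list from the previous sketch; the function name is an assumption.

```python
# Sketch: sample noisy amino acid types x_t ~ q(x_t | x_0) = Cat(x_0 Q_bar_t).
import numpy as np

def q_sample(x0: np.ndarray, t: int, Q_bar: list,
             rng: np.random.Generator) -> np.ndarray:
    """x0: (n,) integer amino acid types in [0, K). Returns noisy types at step t."""
    probs = Q_bar[t - 1][x0]            # row of Q_bar_t for each residue: (n, K)
    cum = probs.cumsum(axis=1)
    u = rng.random((x0.shape[0], 1))
    return (u < cum).argmax(axis=1)     # inverse-CDF categorical sample per residue

rng = np.random.default_rng(0)
x0 = rng.integers(0, 20, size=128)      # a mock 128-residue sequence
xt = q_sample(x0, t=250, Q_bar=Q_bar, rng=rng)
```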
Training the denoising network: the second component of the diffusion model is a denoising neural network $f_\theta$, parameterized by θ. The network takes the noisy input $X_t$, a concatenation of the noisy amino acid types and the other amino acid attributes, including the 20 one-hot encoded amino acid types and 15 geometric attributes related to the protein backbone, such as the solvent-accessible surface area (SASA), normalized surface-aware node features, backbone dihedral angles, and three-dimensional positions. Its objective is to predict the noise-free amino acid types $X_{aa}$, thereby modeling the potentially diverse amino acid sequences consistent with the protein structure while maintaining its inherent structural constraints. To train $f_\theta$, the invention optimizes the cross-entropy loss L between each node's predicted amino acid type probabilities $\tilde{p}_\theta(x_{aa} \mid x_t)$ and the true types.
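A hedged sketch of one training step follows; `model` stands in for the EGNN-based denoiser of Section 2.3, and its call signature is an assumption for illustration.

```python
# Sketch of one training step: the denoising network predicts the clean
# amino acid types from the noisy graph, trained with cross-entropy.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, node_feats, edge_index, t):
    """x0: (n,) true amino acid types; node_feats: noisy types concatenated
    with backbone attributes at step t. Returns the scalar loss value."""
    logits = model(node_feats, edge_index, t)   # (n, 20) per-residue scores
    loss = F.cross_entropy(logits, x0)          # compare with noise-free types
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```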
Parameterized generation process: the new amino acid sequence is generated by reverse-diffusion iteration over each node x. The generation distribution $p_\theta(x_{t-1} \mid x_t)$ is estimated from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the neural network. The invention marginalizes over the network predictions to compute the generation distribution at each iteration step:
$$p_\theta(x_{t-1} \mid x_t) = \sum_{x_{aa}} q(x_{t-1} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$ is computed from the transition matrices, the node feature state at time t, and the amino acid type $x_{aa}$, and $x_{aa}$ is sampled from the prediction probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ of the denoising network.
2.2 Conditional reverse diffusion distribution generation guided by protein prior knowledge
2.2.1 Markov transition matrices
The transition matrix serves as the guide of the discrete diffusion model and plays a key role in providing the transition probabilities from the current time step to the next. Since it reflects the likelihood of one amino acid type changing into another, this matrix is crucial in both the diffusion and the generation stages. In the diffusion stage, the transition matrix is applied repeatedly to the observed data, which evolve over time under the inherent noise. As the diffusion time increases, the probability of the original amino acid type gradually decays, eventually converging to a uniform distribution over all amino acid types. In the generation stage, the conditional probability $p_\theta(x_{t-1} \mid x_t)$ is influenced by the model predictions and the properties of the transition matrix Q.
In view of the biological specificity of amino acid substitutions, the transition probabilities between amino acids are not evenly distributed, so defining random transition directions during generation or sampling is not reasonable. Instead, the diffusion process can reflect evolutionary pressure by using amino acid substitution scoring matrices, which preserve protein function, structure, or stability. In this work, the invention uses the blocks substitution matrix (BLOSUM), which identifies conserved regions within proteins that are believed to have greater functional relevance. Based on empirical observations of protein evolution, BLOSUM provides an estimate of the likelihood of substitution between different amino acids. The invention therefore incorporates BLOSUM into the diffusion and generation processes. First, the matrix is normalized into probabilities using the softmax function. The invention then adjusts the normalized matrix B with different probability temperatures to control the noise scale during diffusion. The transition matrix at time t is thus given by $Q_t = B^t$. Using this matrix to shape the transition probabilities effectively reduces the generation space that needs to be sampled, concentrating the model's predictions in a meaningful subspace. Fig. 3 compares the transition matrices over time for the uniform and BLOSUM cases.
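A minimal sketch of this construction follows; the softmax normalization and the matrix power follow the description above, while the function names and the source of the raw BLOSUM scores are assumptions.

```python
# Sketch: build BLOSUM-based transition matrices for the diffusion process.
import numpy as np
from scipy.special import softmax

def blosum_transition(blosum: np.ndarray, temperature: float) -> np.ndarray:
    """blosum: (20, 20) raw substitution scores (e.g. BLOSUM62).
    Returns a row-stochastic matrix B via a temperature-scaled softmax."""
    return softmax(blosum / temperature, axis=1)

def Q_at_t(B: np.ndarray, t: int) -> np.ndarray:
    # Transition matrix at time t as the t-th matrix power of B.
    return np.linalg.matrix_power(B, t)
```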
2.2.2 Secondary structure
The secondary structure of a protein refers to the local spatial arrangement of the amino acid residues in the protein chain. Two common types of secondary structure are the α-helix and the β-sheet, both stabilized by hydrogen bonds between backbone atoms. The secondary structure is a critical intermediary, bridging the amino acid sequence and the overall three-dimensional conformation of the protein. In this work, the invention incorporates eight different types of secondary structure as conditions into the sampling process of the amino acid nodes. This strategy effectively reduces the exploration space of potential amino acid sequences. Specifically, the invention uses DSSP (Define Secondary Structure of Proteins) to assign the secondary structure of each amino acid and represents these structures with one-hot encoding. The neural network takes the one-hot codes as input when denoising the amino acids, so that the generated amino acid sequence is sampled with the secondary structure as a guiding condition.
By conditioning the search over amino acid sequences on secondary-structure elements such as α-helices and β-sheets, the sampling space of potential sequences can be markedly reduced and the generated protein sequences endowed with biological meaning. By conditioning each amino acid type during sampling on its corresponding secondary-structure type, the invention guides the generated protein sequences to attain an appropriate three-dimensional structure with viable thermal stability while retaining the ability to perform the intended function.
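A small sketch of the secondary-structure conditioning input; the eight-letter DSSP alphabet is standard, while the helper name is illustrative.

```python
# Sketch: one-hot encode the eight DSSP secondary-structure classes so they
# can be concatenated to the node features as a denoising condition.
import numpy as np

DSSP_CLASSES = "HBEGITS-"   # 8-class DSSP alphabet; '-' denotes loop/irregular

def ss_one_hot(ss_string: str) -> np.ndarray:
    """ss_string: per-residue DSSP codes, e.g. 'HHHH--EE'. Returns (n, 8)."""
    index = {c: i for i, c in enumerate(DSSP_CLASSES)}
    out = np.zeros((len(ss_string), len(DSSP_CLASSES)))
    for i, c in enumerate(ss_string):
        out[i, index[c]] = 1.0
    return out
```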
2.3 Equivariant graph neural network (FIG. 4)
Biomolecules such as proteins and small-molecule compounds are three-dimensional structures, and the model's predictions should remain consistent regardless of the position and orientation of the input protein while encoding strong and expressive hidden representations. This property can be ensured by using a rotation-equivariant neural network. One typical example is the equivariant graph neural network (EGNN). The invention modifies its SE(3)-equivariant neural layers, which update the node and edge representations, to maintain SO(3) rotation equivariance and E(3) translation invariance.
At layer l, the equivariant graph convolution (EGC) takes as input a set of n hidden node embeddings $H^{(l)} = \{h_1^{(l)}, \ldots, h_n^{(l)}\}$ describing amino acid type and geometry, edge embeddings $M^{(l)} = \{m_{ij}^{(l)}\}$ associated with connected nodes i and j, and the node coordinates $X_{pos}$. The goal of the modified EGC layer is to update the hidden representations of the nodes, $H^{(l+1)}$, and of the edges, $M^{(l+1)}$. In short, $H^{(l+1)}, M^{(l+1)} = \mathrm{EGC}[H^{(l)}, X_{pos}, M^{(l)}]$.
To achieve this goal, the EGC layer defines the following operations:
- for edge embeddings: $m_{ij} = \phi_e\big(h_i^{(l)}, h_j^{(l)}, \lVert x_i - x_j \rVert^2, m_{ij}^{(l)}\big)$;
- for node coordinates: $x_i^{(l+1)} = x_i^{(l)} + \sum_{j \ne i} \big(x_i^{(l)} - x_j^{(l)}\big)\, \phi_x(m_{ij})$;
- for hidden node representations: $h_i^{(l+1)} = \phi_h\big(h_i^{(l)}, \sum_{j \ne i} m_{ij}\big)$.
Here, $\phi_e$ and $\phi_h$ are edge and node propagation operations, and $\phi_x$ is an additional operation that projects the edge embedding vector $m_{ij}$ to a scalar. The modified EGC layer maintains rotation and translation equivariance of the node coordinates $X_{pos}$ and, like other graph neural networks, maintains permutation invariance over the set of nodes.
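The three operations can be sketched as one PyTorch layer; the hidden size follows the 128-dimensional setting reported in Section 3.1, while the MLP depths and activation choice are assumptions.

```python
# Sketch of an EGC layer matching the operations above.
import torch
import torch.nn as nn

class EGCLayer(nn.Module):
    def __init__(self, h_dim: int = 128, m_dim: int = 128):
        super().__init__()
        # phi_e, phi_h: edge/node propagation; phi_x projects edges to scalars.
        self.phi_e = nn.Sequential(nn.Linear(2 * h_dim + 1 + m_dim, m_dim), nn.SiLU())
        self.phi_x = nn.Linear(m_dim, 1)
        self.phi_h = nn.Sequential(nn.Linear(h_dim + m_dim, h_dim), nn.SiLU())

    def forward(self, h, x, m, edge_index):
        """h: (n, h_dim) node states; x: (n, 3) coordinates;
        m: (e, m_dim) edge states; edge_index: (2, e) node pairs (i, j)."""
        i, j = edge_index
        d2 = ((x[i] - x[j]) ** 2).sum(-1, keepdim=True)   # ||x_i - x_j||^2
        m_new = self.phi_e(torch.cat([h[i], h[j], d2, m], dim=-1))
        agg = torch.zeros(h.shape[0], m_new.shape[-1], device=h.device)
        agg.index_add_(0, i, m_new)                       # sum_j m_ij per node
        dx = torch.zeros_like(x)
        dx.index_add_(0, i, (x[i] - x[j]) * self.phi_x(m_new))
        h_new = self.phi_h(torch.cat([h, agg], dim=-1))   # node update
        return h_new, x + dx, m_new                       # equivariant update
```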
2.4 Sampling and accelerating the denoising process based on the denoising diffusion implicit model (DDIM)
One significant drawback of diffusion models is that the generation process is typically very slow, as it involves many incremental steps. To address this problem, the denoising diffusion implicit model (DDIM) is commonly used in continuous-variable diffusion generative models. DDIM is based on a non-Markovian forward diffusion process that is always conditioned on the input rather than on the previous step. By setting the noise variance of each step to 0, the reverse generation process becomes fully deterministic given an initial prior sample.
Similarly, since the generation probability $p_\theta(x_{t-1} \mid x_t)$ of the invention is computed from the predicted $x_{aa}$ and the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$, the invention can likewise make the generative model deterministic by controlling the sampling temperature of $\tilde{p}_\theta(x_{aa} \mid x_t)$. The invention therefore defines the multi-step generation process as
$$p_\theta(x_{t-k} \mid x_t) = \sum_{x_{aa}} q(x_{t-k} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the temperature T controls whether the process is deterministic or stochastic, and the multi-step posterior distribution is
$$q(x_{t-k} \mid x_t, x_{aa}) \propto \big(x_t (Q_{t-k+1} \cdots Q_t)^\top\big) \odot \big(x_{aa} \bar{Q}_{t-k}\big).$$
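A NumPy sketch of the resulting skip sampler follows; it assumes `model` returns per-node probabilities over the 20 types and omits temperature handling for brevity.

```python
# Sketch: DDIM-style skip sampling over the discrete chain, stepping the
# reverse process k steps at a time using the multi-step posterior above.
import numpy as np

def skip_sample(model, node_feats, edge_index, Qs, Q_bar, T=500, k=10,
                rng=np.random.default_rng()):
    n, K = node_feats.shape[0], 20
    xt = rng.integers(0, K, size=n)          # start from the uniform prior
    for t in range(T, 0, -k):
        p_x0 = model(node_feats, edge_index, xt, t)       # (n, K) prediction
        t_prev = max(t - k, 0)
        if t_prev == 0:
            return p_x0.argmax(axis=1)       # final deterministic readout
        # Multi-step posterior q(x_{t-k} | x_t, x0), marginalized over p_x0.
        Q_span = np.linalg.multi_dot(Qs[t_prev:t]) if t - t_prev > 1 else Qs[t - 1]
        fact1 = Q_span[:, xt].T[:, None, :]  # (n, 1, K)
        fact2 = Q_bar[t_prev - 1][None, :, :]  # (1, K, K)
        post = fact1 * fact2
        post /= post.sum(-1, keepdims=True) + 1e-12
        probs = np.einsum('nk,nkj->nj', p_x0, post)
        xt = np.array([rng.choice(K, p=p / p.sum()) for p in probs])
    return xt
```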
3.1 Experimental protocol
Training setting: the present invention employs CATH v4.3.0 dataset partitioning based on GRAPHTRANS (John ingham, vikas Garg, region Barzilay, and Tommi jaakkola. Generated models for graph-based protein design. Advanced in Neural Information Processing Systems,32,2019.) and GVP (Bowen jin, stephan Eismann, patricia Suriana, raphael John Lamarre Townshend, and Ron driver. Learning from protein structure with geometric vector superns. In International Conference on Learning Representations, 2021.). Proteins were classified according to the CATH topology classification, and were divided into 18,024 proteins for training, 608 for validation, and 1,120 for testing. To evaluate the quality of production of different proteins, the present invention tested the model of the present invention on three different categories: short sequence, single chain and total protein. Wherein the short sequence comprises a protein having a length of less than 100; the single chain class comprises proteins consisting of a single chain. Furthermore, the total time step number of the diffusion model is configured to be 500, following the noise setting of cosine scheduling. For a denoising network, the present invention implements six stacked EGNN blocks, each block having 128 hidden dimensions. The model exercises 200 epochs by default and uses Adam optimizers. During training, the invention employs a batch size of 64 and a learning rate of 0.0005. In addition, to prevent overfitting, the invention introduces a dropout rate of 0.1 in the architecture of the model.
Evaluation metrics: the invention measures the quality of recovered protein sequences by perplexity and recovery rate. Perplexity measures how well the amino acid probabilities predicted by the model match the actual amino acid at each position in the sequence; lower perplexity indicates a better fit of the model to the data. The recovery rate assesses the model's ability to recover the correct amino acid sequence from the three-dimensional structure of the protein; it is typically computed as the proportion of amino acids in the predicted sequence that match the original sequence. A higher recovery rate indicates a stronger ability to predict the original sequence from the structure.
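Both metrics are straightforward to compute; a minimal sketch (names are illustrative):

```python
# Sketch: recovery rate and perplexity as described above. `logits` are the
# model's per-residue scores over the 20 types; `target` the native types.
import torch
import torch.nn.functional as F

def recovery_rate(logits: torch.Tensor, target: torch.Tensor) -> float:
    """Fraction of positions where the argmax prediction matches the native type."""
    return (logits.argmax(dim=-1) == target).float().mean().item()

def perplexity(logits: torch.Tensor, target: torch.Tensor) -> float:
    """exp of the mean per-residue cross-entropy (lower is better)."""
    return torch.exp(F.cross_entropy(logits, target)).item()
```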
TABLE 1 Performance of GRADE-IF in recovering proteins in CATH
TABLE 2 Performance of GRADE-IF in recovering proteins in TS50 and TS500
3.2 Generation of completely new sequences
Table 1 compares the performance of GRADE-IF in recovering proteins in CATH. To generate high-confidence sequences, GRADE-IF eliminates uncertainty in the prior by taking the most probable type under the approximate probability $\tilde{p}_\theta(x_{aa} \mid x_t)$. Notably, the invention finds that recovery on single-chain proteins and short sequences improves by 4.2% and 5.4%, respectively. The invention also evaluates on the TS50 and TS500 datasets; the results are shown in Table 2.
After subdividing recovery performance into buried and surface amino acids, the invention finds that the more conserved core residues show higher native-sequence recovery, whereas amino acids on the active surface show lower sequence recovery. FIG. 5 examines amino acid conservation via the solvent-accessible surface area (SASA) (where SASA < 0.25 indicates internal amino acids) and the number of contacts (the number of amino acids adjacent in 3D space) [10]. In all three protein categories, the recovery of internal residues is significantly higher than that of external residues, and recovery increases with the number of contacts. The invention also reports recovery rates for different secondary structures: high recovery is obtained for most secondary structures, and only the rare five-turn (π) helix shows a lower recovery rate.
The invention also compares GRADE-IF with PiFold (Zhangyang Gao, Cheng Tan, and Stan Z. Li. PiFold: Toward effective and efficient protein inverse folding. In International Conference on Learning Representations, 2023.) and ProteinMPNN (Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Alexis Courbet, Rob J. de Haas, Neville Bethel, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615): 49–56, 2022.) in FIG. 6. For a given backbone, the invention generates 100 sequences with less than 50% self-similarity and projects them into two-dimensional space using t-SNE. At the same level of diversity, the GRADE-IF samples enclose the wild-type sequence, while the other two methods fail to include the wild type within their sample regions. Furthermore, at the 45% recovery threshold for this protein, GRADE-IF still generates a large number of samples, whereas the other two methods degenerate to deterministic outputs. This further demonstrates the superiority of the model in achieving both sequence diversity and high recovery.
The invention also evaluates the accelerated sampling algorithm on this dataset, as shown in FIG. 7. With DDIM, the invention can skip k steps in the sampling phase. The invention selects a series of step sizes and evaluates performance by the recovery rate and the time required to sample 1,200 sequences. The recovery rate decreases slightly with increasing step size, reaching 48.13% at a step size of 100. However, at a step size of 100, sampling is 100 times faster than at step size 1, showing significant acceleration.
3.3 Folding prediction for generated sequences
The invention extends the study to the foldability of sequences generated at different sequence recovery rates. FIG. 8 compares the crystal structure of a native protein (PDB ID: 3FKF) with the AlphaFold2-predicted structures of three sequences generated by GRADE-IF at different recovery rates. The folded structures of all generated sequences are nearly identical to the native structure: the root-mean-square deviation over the 139 residues remains below the experimental measurement error (3 Å) noted above, and the average pLDDT score is 0.835, indicating reliable folding, compared with a pLDDT of 0.91 for the native protein. Combined with the evidence in FIG. 7, this shows the advantage of the method in producing structurally consistent results, and the invention is confident that GRADE-IF can generate biologically plausible new sequences for a given protein structure.
Finally, it is noted that the above preferred embodiments are intended to illustrate rather than limit the invention. Although the invention has been described in detail by means of the above preferred embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention as defined by the appended claims.

Claims (9)

1. A discrete graph probabilistic denoising diffusion model for protein sequence generation, characterized in that a given protein backbone guides the corresponding amino acid residue types, specifically comprising:
(1) A discrete probabilistic reverse diffusion process based on deep neural network training;
(2) Conditional reverse diffusion distribution generation guided by protein prior knowledge;
(3) An equivariant graph neural network used to parameterize the reverse-diffusion denoising process;
(4) Sampling and acceleration of the denoising process based on a denoising diffusion implicit model.
2. The model according to claim 1, characterized in that the model conditions on a given graph $\mathcal{G} = \{X, E\}$, where the node features are X and the edge features are E; the node features $X = [X_{pos}, X_{aa}, X_{prop}]$ comprise the amino acid positions $X_{pos}$, the amino acid types $X_{aa}$, and spatial and biochemical properties $X_{prop}$.
3. The model according to claim 1, wherein step (1) specifically comprises a diffusion process, training a denoising network, and a parameterized generation process.
4. The model according to claim 2, wherein in step (1), the diffusion process is as follows: noise is added independently to each amino acid node of a protein; for any given node, the transition probabilities are defined by the matrix $Q_t$; with a predefined transition matrix, the forward diffusion kernel is defined as $q(x_t \mid x_{t-1}) = x_{t-1} Q_t$ and $q(x_t \mid x_0) = x_0 \bar{Q}_t$, where $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition probability matrix at time step t.
5. The model according to claim 2, wherein in step (1), the training of the denoising network is as follows: a denoising neural network $f_\theta$, parameterized by θ, is constructed; the network takes the noisy input $X_t$, a concatenation of the noisy amino acid types and the other amino acid attributes, including the 20 one-hot encoded amino acid types and 15 geometric attributes of the protein backbone; the training goal of the denoising model is to predict the noise-free amino acid types $X_{aa}$, thereby modeling the diverse sequence combinations that may correspond to one protein structure while maintaining its inherent structural constraints.
6. The model according to claim 2, wherein in step (1), the parameterized generation process is as follows: the new amino acid sequence is generated by reverse-diffusion iteration over the amino acid type $x_{aa}$ of each node at time t; the corresponding generation distribution $p_\theta(x_{t-1} \mid x_t)$ is estimated from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the neural network; the generation distribution at each iteration step is computed by marginalizing the deep neural network's predicted probabilities:
$$p_\theta(x_{t-1} \mid x_t) = \sum_{x_{aa}} q(x_{t-1} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where the posterior distribution $q(x_{t-1} \mid x_t, x_{aa})$ is computed from the transition matrices, the node features $x_t$ at time t, and the amino acid type $x_{aa}$, and $x_{aa}$ is obtained by sampling from the probability $\tilde{p}_\theta(x_{aa} \mid x_t)$ predicted by the denoising network.
7. The model according to claim 1, wherein step (2) specifically comprises two parts, a Markov transition matrix and secondary-structure conditioning, wherein the Markov transition matrix is constructed as follows: first, the blocks substitution matrix is incorporated into the diffusion and generation processes and normalized into probabilities using a softmax function; then the normalized blocks substitution matrix B is adjusted with different probability temperatures to control the noise scale in the diffusion process; the transition matrix at time t is given by $Q_t = B^t$; the secondary-structure conditioning is as follows: DSSP is used to assign the secondary structure of each amino acid, and one-hot encoding is used to represent these structures; the neural network takes the one-hot codes as input when denoising the amino acids, so that the protein secondary structure is added as a guiding condition in the denoising sampling process.
8. The model according to claim 1, wherein the specific method of step (3) is as follows:
For a protein graph of n amino acids, at layer l, the equivariant graph convolution takes node embeddings $H^{(l)} = \{h_1^{(l)}, \ldots, h_n^{(l)}\}$ (implicit vector representations of amino acid type and geometry), edge embeddings $M^{(l)} = \{m_{ij}^{(l)}\}$ (hidden-layer features on connected nodes i and j), and the three-dimensional node coordinates $X_{pos}$; the modified EGC layer updates the hidden representations of the layer-(l+1) nodes, $H^{(l+1)}$, and of the edges, $M^{(l+1)}$; i.e.,
$$H^{(l+1)}, M^{(l+1)} = \mathrm{EGC}[H^{(l)}, X_{pos}, M^{(l)}].$$
9. The model according to claim 1, wherein the specific method of step (4) is: multi-step denoising is performed with the generation distribution
$$p_\theta(x_{t-k} \mid x_t) = \sum_{x_{aa}} q(x_{t-k} \mid x_t, x_{aa}) \, \tilde{p}_\theta(x_{aa} \mid x_t),$$
where a temperature T regulates the degree of determinism or randomness of the estimated probability distribution, and the posterior distribution after multi-step denoising is
$$q(x_{t-k} \mid x_t, x_{aa}) \propto \big(x_t (Q_{t-k+1} \cdots Q_t)^\top\big) \odot \big(x_{aa} \bar{Q}_{t-k}\big).$$
Application CN202310995978.5A (priority date 2023-08-09, filing date 2023-08-09): Discrete graph probability denoising diffusion model for protein sequence generation. Status: Pending. Publication: CN116994642A.

Priority Applications (1)

Application Number: CN202310995978.5A
Publication: CN116994642A (en)
Title: Discrete graph probability denoising diffusion model for protein sequence generation


Publications (1)

Publication Number: CN116994642A
Publication Date: 2023-11-03

Family

ID=88531833

Family Applications (1)

Application Number: CN202310995978.5A (Pending)
Publication: CN116994642A (en)

Country Status (1)

Country: CN
Publication: CN116994642A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423396A (en) * 2023-12-18 2024-01-19 烟台国工智能科技有限公司 Crystal structure generation method and device based on diffusion model
CN117423396B (en) * 2023-12-18 2024-03-08 烟台国工智能科技有限公司 Crystal structure generation method and device based on diffusion model


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination