CN116312864B - System and method for predicting protein-ligand binding affinity based on filtration curvature - Google Patents

Info

Publication number
CN116312864B
Authority
CN
China
Prior art keywords
edge
layer
node
graph
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310122433.3A
Other languages
Chinese (zh)
Other versions
CN116312864A (en
Inventor
吴剑秋
陈红阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/80 Data visualisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Medicinal Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system for predicting the binding affinity of a molecule and a protein based on filtration curvature includes a data store, a data collator, and a model predictor. The data collator uses the protein-ligand data stored in the data store to generate a graph and the structure of the graph, and passes them to the model predictor to predict binding affinity. The graph module of the data collator uses the affinity, atomic coordinates, and atomic attributes that the data preprocessing module extracts from the protein-ligand data to generate the graph, and the structure module uses the graph to generate the structure of the graph. The filtration curvature module of the model predictor fuses the interatomic distance information and the curvature information of the graph into the representation of the edges. The SIHN module, which is based on angles and an adaptive graph attention mechanism, uses a personalized attention mechanism to incorporate the angle information of the graph into the graph representation in order to predict binding affinity. The invention also includes a method of predicting the binding affinity of a molecule to a protein based on filtration curvature.

Description

System and method for predicting protein-ligand binding affinity based on filtration curvature
Technical Field
The invention belongs to the technical field of drug research and development, and particularly relates to a system and a method for predicting binding affinity of a drug molecule and a target protein.
Background Art
Accurate prediction of the binding affinity between drug molecules and proteins helps to screen suitable candidate drugs for testing, accelerates the drug screening process, and is a key stage in discovering new drugs. The three-dimensional structure of the protein-ligand complex has been shown to play an effective role in affinity prediction. Currently, methods for predicting affinity from three-dimensional structures are either machine learning-based or deep learning-based.
Machine learning-based methods require specialized knowledge and rely heavily on feature engineering. As a result, they generalize poorly to larger data sets.
Most deep learning-based methods treat the complex as a 3D grid and then apply three-dimensional convolutional neural networks (3D CNNs), such as the binding affinity prediction systems and methods proposed by A. S. Heifets et al. (A. S. Heifets, I. Wallach, M. Dzamba. Binding affinity prediction systems and methods: China, 2012015136059.9 [P], 2019-03-26). However, the 3D grid changes when a complex is rotated in space, which makes the affinity predictions of 3D CNNs for the same complex sensitive to the spatial pose of the complex.
Graph neural networks overcome this difficulty by modeling the spatial structure of the protein-ligand complex in 3D to predict its affinity. Regarding the spatial structure of the complex, Li, Shuangli et al. proposed SIGN, a graph neural network that incorporates elementary geometric information (distance and angle) into the 3D modeling and takes long-range intermolecular information into account (Li, Shuangli, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong. Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021. 975-985.). However, the affinity prediction error of this graph neural network is still not ideal.
There are three reasons for this shortfall. First, higher-order geometric information, such as persistent homology from algebraic topology and curvature from differential geometry, is not incorporated into the 3D modeling. Such geometric information has already been used as molecular fingerprints and, combined with machine learning, has achieved good results in affinity prediction.
Second, the graph formed from a protein-ligand complex is a heterogeneous graph. SIGN is not suitable for heterogeneous graphs because the graph attention mechanism (GAT) on which it is based relies too heavily on low-pass filtering. In fact, graph neural networks based on low-frequency information are not well suited to protein-ligand complexes.
Third, like other graph neural networks, SIGN overemphasizes the graph structure, that is, the dependence of the central node on its neighbor nodes when updating node information, and ignores how this dependence differs between individual attributes of the central node and individual attributes of the neighbor nodes.
Disclosure of Invention
The present invention overcomes the above deficiencies of the prior art by providing a system and method for predicting the binding affinity of a molecule to a protein based on filtration curvature. The invention learns the 3D spatial structure of the protein-ligand complex with the SIHN graph neural network, which is based on an adaptive graph attention mechanism and incorporates the spatial information of distance, angle, and curvature, and thereby greatly improves prediction accuracy.
The invention provides a system for predicting the binding affinity of a molecule and a protein, comprising a data store, a data collator, and a model predictor. The data collator generates a graph and the structure of the graph from the data held in the data store and provides them to the model predictor for predicting affinity, wherein:
The data store is used for storing protein-ligand information. The protein-ligand information contains the coordinates of atoms, the attributes of atoms, and affinity information. The atomic attributes include the atom type, Pybel atom attributes, and SMARTS attributes.
The data collator comprises a data preprocessing module, a graph module, and a structure module, configured as follows. The data preprocessing module uses the data stored in the data store to generate the binding affinity, the atomic coordinates, and the co-occurrence frequencies of intermolecular atom pairs. An intermolecular atom pair is a pair in which one atom comes from the ligand, the other comes from the protein, and the Euclidean distance between the two atoms does not exceed a preset threshold.
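The co-occurrence statistic described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cooccurrence_frequency`, the tuple layout of an atom, and the choice to normalize counts by the total number of pairs are all assumptions, since the text only speaks of a "co-occurrence frequency".

```python
import math
from collections import Counter

def cooccurrence_frequency(protein_atoms, ligand_atoms, threshold):
    """Normalized counts of (protein element, ligand element) pairs within `threshold`.

    Each atom is a tuple (element, (x, y, z)). Normalizing by the total pair
    count is an assumption made for this sketch.
    """
    counts = Counter()
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            # An intermolecular pair qualifies if its Euclidean distance is within the threshold.
            if math.dist(p_xyz, l_xyz) <= threshold:
                counts[(p_elem, l_elem)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

# Toy coordinates (hypothetical, in arbitrary length units).
protein = [("C", (0.0, 0.0, 0.0)), ("N", (1.0, 0.0, 0.0)), ("O", (9.0, 9.0, 9.0))]
ligand = [("C", (0.0, 1.0, 0.0)), ("F", (1.0, 1.0, 0.0))]
freq = cooccurrence_frequency(protein, ligand, threshold=1.2)
```

In this toy case only the C-C and N-F pairs lie within the threshold, so each receives half of the frequency mass.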
The graph module uses the binding affinity, atomic coordinates, and atomic attributes generated by the data preprocessing module to generate a graph. Specifically, a short-distance threshold is set, and the atoms and their attributes are taken as the nodes of the graph and the attributes of the nodes. If the Euclidean distance between two atoms does not exceed the short-distance threshold, the corresponding nodes are connected by an edge. The weight of the edge is the Euclidean distance, and the label of the graph is the affinity.
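The graph construction rule can be illustrated with a short sketch. The names `build_graph` and `short_threshold`, as well as the dictionary-based edge storage, are assumptions chosen for readability; only the rule itself (connect two atoms when their distance does not exceed the threshold, and use the distance as the edge weight) comes from the text.

```python
import math
from itertools import combinations

def build_graph(atoms, short_threshold):
    """Return (node_attrs, edges); edges maps (i, j) with i < j to the Euclidean distance."""
    node_attrs = [attrs for attrs, _ in atoms]
    edges = {}
    for (i, (_, xi)), (j, (_, xj)) in combinations(enumerate(atoms), 2):
        d = math.dist(xi, xj)
        if d <= short_threshold:
            edges[(i, j)] = d  # edge weight is the interatomic distance
    return node_attrs, edges

# Each atom is (attribute dict, coordinates); the values are hypothetical.
atoms = [
    ({"element": "C"}, (0.0, 0.0, 0.0)),
    ({"element": "N"}, (1.5, 0.0, 0.0)),
    ({"element": "O"}, (0.0, 4.0, 0.0)),
]
nodes, edges = build_graph(atoms, short_threshold=2.0)
```

The all-pairs loop is quadratic in the number of atoms; a spatial index would be the usual choice for full complexes, but that is outside what the text specifies.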
The structure module uses the graph to generate the structure of the graph: the types of the intermolecular edges, the number of edges of each type, and the filtration curvature of the edges. An intermolecular edge is an edge of the graph that connects a protein atom and a ligand atom, and the type of an edge is the atom-pair type of its two endpoints.
The structure module comprises a long-distance feature layer and a curvature feature layer. The long-distance feature layer uses the graph to generate the types of the intermolecular edges and the number of edges of each type;
the curvature feature layer uses the graph to generate the filtration curvature of the edges.
The model predictor uses the graph generated by the data collator to generate a prediction of the binding affinity of the molecule and the target protein, together with approximations of the co-occurrence frequencies of intermolecular atom pairs. Wherein,
the model predictor is composed of a filtration curvature module and a SIHN module. The filtration curvature module uses the filtration curvature and the weight of each edge generated by the data collator to output the initial representation of the edge.
The SIHN module uses the graph and the structure of the graph generated by the data collator, together with the initial representations of the edges generated by the filtration curvature module, to produce the affinity prediction and the approximations of the co-occurrence frequencies of intermolecular atom pairs.
Further, the curvature feature layer generates the filtration curvature of the edges of the graph from the graph generated by the graph module. The specific process is as follows:
A sequence of filter values is set. For each filter value, the edges whose weights exceed the filter value are deleted from the graph to obtain a filtration subgraph, and the curvature of each edge of the subgraph is calculated. The curvatures of an edge in the successive filtration subgraphs are concatenated in ascending order of the filter values; if the edge does not appear in a given filtration subgraph, its curvature in that subgraph is set to zero. The resulting vector is the filtration curvature of the edge. Wherein,
the curvature is a discretized form of Ricci curvature, such as the Ollivier-Ricci curvature or the Forman-Ricci curvature.
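The filtration procedure can be sketched as below. As a stand-in for the discrete Ricci curvature, the sketch uses the simplest unweighted Forman-Ricci form, 4 - deg(u) - deg(v); this choice, and the names `forman_curvature` and `filtration_curvature`, are assumptions for illustration. The text equally allows Ollivier-Ricci curvature or weighted variants.

```python
def forman_curvature(edges):
    """Unweighted Forman-Ricci curvature 4 - deg(u) - deg(v) for each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

def filtration_curvature(weighted_edges, filter_values):
    """Map each edge to its curvature vector across the filtration subgraphs."""
    result = {edge: [] for edge in weighted_edges}
    for t in sorted(filter_values):  # ascending order of filter values
        # Keep only edges whose weight does not exceed the current filter value.
        kept = [e for e, w in weighted_edges.items() if w <= t]
        curv = forman_curvature(kept)
        for edge in weighted_edges:
            # An edge that is absent from this subgraph contributes zero.
            result[edge].append(curv.get(edge, 0))
    return result

# Toy triangle with hypothetical edge weights (distances).
weighted_edges = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 3.0}
fc = filtration_curvature(weighted_edges, filter_values=[1.0, 2.0, 3.0])
```

Each edge ends up with one curvature value per filter value, so the vector length equals the number of filter values.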
Further, the filtration curvature module uses a dense layer to embed the filtration curvature of each edge generated by the data collator into a high-dimensional vector space and applies a softmax operation to the embedded vector to obtain the curvature embedding of the edge. The edge weight is rounded up and then embedded into the high-dimensional space to obtain the weight embedding of the edge. The curvature embedding and the weight embedding of the edge are concatenated and passed through another dense layer to generate the initial representation of the edge.
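A forward pass through this module might look like the following sketch in plain Python. The dimensions `DIM`, `MAX_DIST`, and `N_FILTERS`, the random untrained weights, the lookup-table form of the weight embedding, and the ReLU on the final dense layer are all assumptions; the text fixes only the order of operations.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

DIM = 8        # embedding size (hypothetical)
MAX_DIST = 6   # largest rounded-up edge weight (hypothetical)
N_FILTERS = 3  # length of the filtration curvature vector (hypothetical)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def dense(x, W, relu=True):
    """y = W x, optionally followed by ReLU (bias omitted for brevity)."""
    y = [sum(w * v for w, v in zip(row, x)) for row in W]
    return [max(0.0, v) for v in y] if relu else y

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

W_curv = rand_matrix(DIM, N_FILTERS)           # dense layer for the curvature vector
weight_table = rand_matrix(MAX_DIST + 1, DIM)  # one embedding row per integer distance
W_out = rand_matrix(DIM, 2 * DIM)              # dense layer after concatenation

def initial_edge_representation(curvature_vec, weight):
    curv_emb = softmax(dense(curvature_vec, W_curv, relu=False))
    weight_emb = weight_table[math.ceil(weight)]  # round the weight up, then look it up
    return dense(curv_emb + weight_emb, W_out)    # list "+" concatenates the embeddings

rep = initial_edge_representation([2, 1, 0], weight=1.5)
```

Rounding the weight up turns a continuous distance into a small integer index, which is what makes a table-style embedding possible.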
Further, the SIHN module consists of a PHAL group and a pooling group. The PHAL group uses the graph and the graph structure generated by the data collator, together with the initial representations of the edges generated by the filtration curvature module, to produce representations of the edges and representations of the nodes. The pooling group uses the edge and node representations generated by the PHAL group and the edge types generated by the data collator to produce the approximations of the atom-pair co-occurrence frequencies and the affinity prediction.
The pooling group comprises an edge pooling and an atom pooling. The edge pooling uses the edge representations generated by the PHAL group and the edge types generated by the data collator to output the co-occurrence frequency approximations of the various types of atom pairs, and the atom pooling uses the node representations generated by the PHAL group to output the affinity prediction.
The edge pooling includes a dense layer, a class pooling layer, a linear layer, and a softmax layer, where the activation function is ReLU. The dense layer embeds the edge representations generated by the PHAL group into a 128-dimensional space. The class pooling layer selects 36 classes of edges from the edge types generated by the data collator and sums the dense-layer representations of the edges in each class to form a matrix with 36 rows. The 36 selected classes are the edges (a, b), where a is a C (carbon), N (nitrogen), O (oxygen), or S (sulfur) atom in the protein, and b is a C, N, O, S, Cl (chlorine), F (fluorine), P (phosphorus), I (iodine), or Br (bromine) atom in the ligand. The linear layer converts the matrix generated by the class pooling layer into a 36-dimensional vector, and the softmax layer converts this vector into the co-occurrence frequency approximation of the intermolecular atom pairs.
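The edge pooling path can be sketched as follows. The 36 classes and the 128-dimensional dense layer follow the text; the input width `EDGE_DIM`, the random weights, and the assumption that the linear layer maps each 128-dimensional row to one scalar with shared weights are illustrative choices only.

```python
import math
import random
from itertools import product

random.seed(0)

PROTEIN = ["C", "N", "O", "S"]
LIGAND = ["C", "N", "O", "S", "Cl", "F", "P", "I", "Br"]
EDGE_CLASSES = list(product(PROTEIN, LIGAND))  # 4 x 9 = 36 (protein, ligand) classes
HIDDEN = 128

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def dense_relu(x, W):
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def edge_pooling(edge_reps, edge_types, W_dense, w_linear):
    # Class pooling: sum the dense embeddings of all edges in each class.
    rows = {c: [0.0] * HIDDEN for c in EDGE_CLASSES}
    for rep, etype in zip(edge_reps, edge_types):
        if etype in rows:  # edges outside the 36 classes are ignored
            hidden = dense_relu(rep, W_dense)
            rows[etype] = [a + b for a, b in zip(rows[etype], hidden)]
    # Linear layer (assumed shared across rows): one score per class.
    scores = [sum(w * v for w, v in zip(w_linear, rows[c])) for c in EDGE_CLASSES]
    return softmax(scores)

EDGE_DIM = 16  # hypothetical width of the PHAL edge representation
W_dense = rand_matrix(HIDDEN, EDGE_DIM)
w_linear = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
edge_reps = [[random.random() for _ in range(EDGE_DIM)] for _ in range(5)]
edge_types = [("C", "C"), ("N", "O"), ("C", "C"), ("O", "Cl"), ("H", "C")]
freq_approx = edge_pooling(edge_reps, edge_types, W_dense, w_linear)
```

Because the output passes through softmax, it is a probability vector over the 36 classes and can be compared directly with a normalized co-occurrence frequency.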
The atom pooling includes a graph pooling layer, three dense layers, and one linear layer, where the activation functions are all ReLU. The graph pooling layer sums the node representations generated by the PHAL group to obtain a vector. The three dense layers are arranged in sequence, with the output of each dense layer serving as the input of the next, and embed the vector obtained by the graph pooling layer successively into vector spaces of dimension 128 x 4, 128 x 2, and 128. The linear layer maps the resulting 128-dimensional vector to a real number, the predicted affinity.
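A minimal sketch of the atom pooling path is given below. The layer widths 128 x 4, 128 x 2, and 128 come from the text; the node representation width `NODE_DIM`, the random weights, and the omission of biases are assumptions.

```python
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]

def dense_relu(x, W):
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]

NODE_DIM = 16                    # hypothetical width of the PHAL node representation
SIZES = [128 * 4, 128 * 2, 128]  # widths of the three dense layers

layers = []
in_dim = NODE_DIM
for out_dim in SIZES:
    layers.append(rand_matrix(out_dim, in_dim))
    in_dim = out_dim
w_out = [random.uniform(-0.05, 0.05) for _ in range(SIZES[-1])]  # final linear layer

def atom_pooling(node_reps):
    # Graph pooling: sum the node representations component-wise.
    pooled = [sum(col) for col in zip(*node_reps)]
    for W in layers:
        pooled = dense_relu(pooled, W)
    # Linear layer: map the 128-dimensional vector to a single real number.
    return sum(w * v for w, v in zip(w_out, pooled))

node_reps = [[random.random() for _ in range(NODE_DIM)] for _ in range(10)]
affinity = atom_pooling(node_reps)
```

Sum pooling makes the result independent of the order in which atoms are listed, which is the usual reason for choosing it as a graph readout.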
Further, the PHAL group is composed of a plurality of PHALs. The first PHAL takes the graph generated by the data collator and the initial representations of the edges generated by the filtration curvature module as input and generates representations of the edges and of the nodes. Each remaining PHAL takes as input the edge and node representations generated by the previous PHAL and the initial representations of the edges generated by the filtration curvature module, and generates new representations of the edges and of the nodes. Wherein,
each PHAL is composed of a node-to-edge layer, an edge-to-edge layer, and an edge-to-node layer. The node-to-edge layer concatenates the initial representation of an edge in the PHAL input with the representations (attributes) of the nodes at the two ends of the edge and produces an intermediate representation of the edge through a dense layer. The edge-to-edge layer generates a representation of the edge from the intermediate representation generated by the node-to-edge layer. The edge-to-node layer generates a node representation from the node representation (attribute) in the PHAL input, the initial representation of the edge, and the representation of the edge generated by the edge-to-edge layer.
Further, the edge-to-edge layer comprises a directed line graph element, a graph classification element, a plurality of personalized graph attention elements, and a concatenation element. Wherein,
the directed line graph element constructs a directed line graph of the graph from the graph generated by the data collator and the intermediate representations of the edges generated by the node-to-edge layer. Specifically, each edge of the graph is assigned two directions to form two directed edges. The directed edges are taken as the nodes of the directed line graph, and the intermediate representations are taken as the attributes of the corresponding nodes. Whenever the head of one directed edge in the graph is the tail of another directed edge, an edge is constructed in the directed line graph from the node of the former to the node of the latter, and the angle between the two directed edges is taken as the weight of that edge.
The graph classification element divides the directed line graph generated by the directed line graph element into a plurality of subgraphs. Specifically, the angle interval (0°, 180°] is divided into a plurality of equally sized intervals. For each interval, the edges of the directed line graph whose weights are not in the interval are deleted to obtain a subgraph.
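The directed line graph and its angular partition can be sketched as follows. Two points are assumptions rather than statements of the text: the angle is measured at the shared atom between the two bonds, and an edge followed by its own reversal is skipped because its angle would be 0°, which lies outside (0°, 180°]. The function names are hypothetical.

```python
import math

def angle_deg(a, b, c):
    """Angle at point b between the rays b->a and b->c, in degrees."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def directed_line_graph(coords, edges):
    """Nodes are directed edges (u, v); (u, v) links to (v, w) when head meets tail."""
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    line_edges = {}
    for u, v in directed:
        for s, w in directed:
            if s == v and w != u:  # skip the immediate reversal (assumed)
                line_edges[((u, v), (v, w))] = angle_deg(coords[u], coords[v], coords[w])
    return directed, line_edges

def split_by_angle(line_edges, n_bins):
    """Partition line-graph edges into n_bins equal sub-intervals of (0, 180]."""
    width = 180.0 / n_bins
    subgraphs = [dict() for _ in range(n_bins)]
    for edge, angle in line_edges.items():
        if angle > 0:
            k = min(n_bins - 1, math.ceil(angle / width) - 1)
            subgraphs[k][edge] = angle
    return subgraphs

# Three atoms forming a right angle at atom 1 (hypothetical coordinates).
coords = {0: (1.0, 0.0, 0.0), 1: (0.0, 0.0, 0.0), 2: (0.0, 1.0, 0.0)}
edges = [(0, 1), (1, 2)]
line_nodes, line_edges = directed_line_graph(coords, edges)
subgraphs = split_by_angle(line_edges, n_bins=5)
```

With five bins, each subgraph covers 36°, so the two 90° line-graph edges of this toy molecule fall into the third subgraph, which covers (72°, 108°].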
The personalized graph attention element consists of a concatenation layer, a dense layer whose activation function is tanh(), a multiplication layer, and an addition layer. The concatenation layer concatenates the attribute of a node and the attribute of one of its neighbor nodes into a vector. The dense layer uses this vector to generate a vector whose dimension matches the attribute dimension of the neighbor node. The multiplication layer takes the element-wise product of the vector generated by the dense layer and the neighbor node attribute to obtain the transfer vector of the neighbor. The addition layer adds the transfer vector of each neighbor to the attribute of the node to obtain a local representation of the node.
The concatenation element concatenates the local representations of each node of the directed line graph to obtain the representation of the corresponding edge.
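The four layers of this element can be sketched for a single node as below. The attribute width `DIM`, the random weight matrix `W`, and the function names are assumptions; the essential point from the text is that the gate is a vector, so each attribute of the neighbor receives its own coefficient in [-1, 1] instead of one scalar attention weight per neighbor.

```python
import math
import random

random.seed(0)

DIM = 4  # attribute width of a line-graph node (hypothetical)
# Dense layer weights: input is [center ; neighbor], output matches the neighbor width.
W = [[random.uniform(-0.5, 0.5) for _ in range(2 * DIM)] for _ in range(DIM)]

def gate(h_center, h_neighbor):
    """Concatenation layer followed by the tanh dense layer: one gate per attribute."""
    x = h_center + h_neighbor  # list "+" concatenates
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def local_representation(h_center, neighbor_attrs):
    out = list(h_center)
    for h_n in neighbor_attrs:
        g = gate(h_center, h_n)
        # Multiplication layer: element-wise product gives the neighbor's transfer vector.
        # Addition layer: accumulate the transfer vectors onto the center attribute.
        out = [o + gi * hi for o, gi, hi in zip(out, g, h_n)]
    return out

h = [0.2, -0.1, 0.4, 0.3]
neighbors = [[1.0, 0.0, -1.0, 0.5], [0.3, 0.3, 0.3, 0.3]]
local = local_representation(h, neighbors)
```

Because tanh can be negative, a neighbor attribute may be subtracted as well as added, which fits the stated motivation of not relying only on low-pass aggregation.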
Further, the edge-to-node layer is a multi-head adaptive graph attention mechanism comprising a plurality of head elements and a mean element. Each head element generates a representation of a node from the node representation (attribute) and the edge representations in the input of the edge-to-node layer. The mean element averages the representations generated for each node by the head elements to obtain the representation of that node.
The head element is composed of an edge-initial-representation linear layer, an edge-representation linear layer, a node-representation linear layer, a concatenation layer, a dense layer whose activation function is tanh(), a multiplication layer, and an addition layer. The three linear layers embed the initial representation of the edge, the representation of the edge, and the representation (attribute) of the central node from the input of the edge-to-node layer into the same vector space. The concatenation layer concatenates the embedding of the central node, the embedding of the representation of one edge, and the embedding of the initial representation of that edge. The dense layer maps the concatenation to a real number in the interval [-1, 1]. The multiplication layer multiplies this real number by the embedding of the edge representation to obtain the transfer vector of the edge, and the addition layer adds the transfer vector of each neighboring edge to the embedding of the central node.
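One way to read the head element and the mean element is sketched below for a single central node. All dimensions, the random weights, and the names `make_head`, `head_forward`, and `edge_to_node` are assumptions; biases are omitted.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(x, W):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def make_head(node_dim, edge_dim, init_dim, hidden):
    """Weights of one head: three linear layers plus the scalar tanh gate."""
    return {
        "node": rand_matrix(hidden, node_dim),
        "edge": rand_matrix(hidden, edge_dim),
        "init": rand_matrix(hidden, init_dim),
        "gate": rand_matrix(1, 3 * hidden),
    }

def head_forward(head, h_node, incident_edges):
    z_node = linear(h_node, head["node"])
    out = list(z_node)
    for e_rep, e_init in incident_edges:
        z_edge = linear(e_rep, head["edge"])
        z_init = linear(e_init, head["init"])
        # Concatenate the three embeddings and squash to a real number in [-1, 1].
        a = math.tanh(linear(z_node + z_edge + z_init, head["gate"])[0])
        # Transfer vector of the edge, added to the central node embedding.
        out = [o + a * v for o, v in zip(out, z_edge)]
    return out

def edge_to_node(heads, h_node, incident_edges):
    outputs = [head_forward(h, h_node, incident_edges) for h in heads]
    return [sum(col) / len(outputs) for col in zip(*outputs)]  # mean element

heads = [make_head(node_dim=4, edge_dim=6, init_dim=5, hidden=8) for _ in range(3)]
h_node = [0.1, -0.2, 0.3, 0.4]
# Each incident edge is (edge representation, initial edge representation).
incident = [([0.5] * 6, [0.2] * 5), ([-0.1] * 6, [0.7] * 5)]
new_rep = edge_to_node(heads, h_node, incident)
```

Unlike softmax attention, the coefficients here are not normalized across neighbors and may be negative, so an incident edge can either reinforce or counteract the central node embedding.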
In another aspect, the invention provides a method of predicting the binding affinity of a molecule to a protein based on filtration curvature, comprising:
Step one: protein-ligand data is stored in a data store. The protein-ligand data contains the coordinates of atoms, the attributes of atoms, and binding affinity information. The atomic attributes include the atom type, Pybel atom attributes, and SMARTS attributes.
Step two: a data collator is constructed and used to generate the graph and the structure of the graph from the data stored in the data store. The data collator includes a data preprocessing module, a graph module, and a structure module, wherein:
The data preprocessing module uses the data stored in the data store to generate the binding affinity, the atomic coordinates, and the co-occurrence frequencies of intermolecular atom pairs. An intermolecular atom pair is a pair in which one atom comes from the ligand, the other comes from the protein, and the Euclidean distance between the two atoms does not exceed a preset threshold.
The graph module constructs a graph from the binding affinity, atomic coordinates, and atomic attributes generated by the data preprocessing module. The construction is as follows: a short-distance threshold is set, the atoms and their attributes are taken as the nodes of the graph and the attributes of the nodes, and the corresponding nodes are connected by an edge if the Euclidean distance between two atoms does not exceed the short-distance threshold. The weight of the edge is the Euclidean distance, and the label of the graph is the affinity.
The structure module uses the graph constructed by the graph module to generate the structure of the graph: the types of the intermolecular edges, the number of edges of each type, and the filtration curvature of the edges. An intermolecular edge is an edge of the graph that connects a protein atom and a ligand atom, and the type of an edge is the atom-pair type of its two endpoints.
The structure module comprises a long-distance feature layer and a curvature feature layer. The long-distance feature layer uses the graph to generate the types of the intermolecular edges and the number of edges of each type.
The curvature feature layer uses the graph to generate the filtration curvature of the edges.
Step three: a prediction model is constructed and applied to the graph and graph structure generated by the data collator to produce a prediction of the binding affinity of the molecule and the target protein, together with approximations of the co-occurrence frequencies of intermolecular atom pairs, wherein:
The prediction model consists of a filtration curvature module and a SIHN module. The filtration curvature module generates the initial representation of each edge from the filtration curvature and the weight of the edge generated by the data collator.
The SIHN module uses the graph and the structure of the graph generated by the data collator, together with the initial representations of the edges generated by the filtration curvature module, to produce the affinity prediction and the approximations of the co-occurrence frequencies of intermolecular atom pairs.
Step four: the prediction model is trained. The data in the data store is processed by step two and used as input to the prediction model. The model is trained with back-propagation and the Adam optimizer by minimizing a loss function that is a weighted sum of two terms: the loss between the co-occurrence frequencies of the intermolecular atom pairs and their approximations, and the loss between the binding affinity and its prediction.
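The structure of this loss can be written down in a few lines. The text does not state which individual losses or weights are used, so the absolute-error terms, the weight `lam`, and the function name `combined_loss` below are assumptions for illustration only.

```python
def combined_loss(pred_affinity, true_affinity, pred_freq, true_freq, lam=0.5):
    """Weighted sum of an affinity loss and a co-occurrence frequency loss.

    Both terms use absolute error here, which is an assumption; `lam` is a
    hypothetical weighting hyperparameter.
    """
    affinity_loss = abs(pred_affinity - true_affinity)
    freq_loss = sum(abs(p - t) for p, t in zip(pred_freq, true_freq)) / len(true_freq)
    return affinity_loss + lam * freq_loss

loss = combined_loss(6.0, 5.0, [0.5, 0.5], [1.0, 0.0], lam=0.5)
```

The frequency term acts as an auxiliary objective: it pushes the edge representations to encode which atom-pair types are in contact, while the affinity term drives the main prediction.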
Step five: the protein-ligand data of interest is processed by steps one and two and then input into the trained prediction model to obtain its predicted affinity value.
Further, in step two, the curvature feature layer generates the filtration curvature of the edges of the graph from the graph generated by the graph module. The specific process is as follows:
A sequence of filter values is set. For each filter value, the edges whose weights exceed the filter value are deleted from the graph to obtain a filtration subgraph, and the curvature of each edge of the subgraph is calculated. The curvatures of an edge in the successive filtration subgraphs are concatenated in ascending order of the filter values; if the edge does not appear in a given filtration subgraph, its curvature in that subgraph is set to zero. The resulting vector is the filtration curvature of the edge. Wherein,
the curvature is a discretized form of Ricci curvature, such as the Ollivier-Ricci curvature or the Forman-Ricci curvature.
Further, in step three, the filtration curvature module uses a dense layer to embed the filtration curvature of each edge generated by the data collator into a high-dimensional vector space and applies a softmax operation to the embedded vector to obtain the curvature embedding of the edge. The edge weight is rounded up and then embedded into the high-dimensional space to obtain the weight embedding of the edge. The curvature embedding and the weight embedding of the edge are concatenated and passed through another dense layer to generate the initial representation of the edge.
Further, the SIHN module in step three consists of a PHAL group and a pooling group. The PHAL group uses the graph and the graph structure generated by the data collator, together with the initial representations of the edges generated by the filtration curvature module, to produce representations of the edges and representations of the nodes; the pooling group uses the edge and node representations generated by the PHAL group and the edge types generated by the data collator to produce the approximations of the atom-pair co-occurrence frequencies and the affinity prediction.
The pooling group comprises an edge pooling and an atom pooling. The edge pooling uses the edge representations generated by the PHAL group and the edge types generated by the data collator to generate the co-occurrence frequency approximations of the various types of atom pairs; the atom pooling uses the node representations generated by the PHAL group to produce the affinity prediction.
The edge pooling includes a dense layer, a class pooling layer, a linear layer, and a softmax layer, where the activation function is ReLU. The dense layer embeds the edge representations generated by the PHAL group into a 128-dimensional space. The class pooling layer selects 36 classes of edges from the edge types generated by the data collator and sums the dense-layer representations of the edges in each class to form a matrix with 36 rows. The 36 selected classes are the edges (a, b), where a is a C, N, O, or S atom in the protein and b is a C, N, O, S, Cl, F, P, I, or Br atom in the ligand. The linear layer converts the matrix generated by the class pooling layer into a 36-dimensional vector, and the softmax layer converts this vector into the co-occurrence frequency approximation of the intermolecular atom pairs.
The atom pooling includes a graph pooling layer, three dense layers, and one linear layer, where the activation functions are all ReLU. The graph pooling layer sums the node representations generated by the PHAL group to obtain a vector. The three dense layers are arranged in sequence, with the output of each dense layer serving as the input of the next, and embed the vector obtained by the graph pooling layer successively into vector spaces of dimension 128 x 4, 128 x 2, and 128. The linear layer maps the resulting 128-dimensional vector to a real number, the predicted affinity.
Further, the PHAL group is composed of a plurality of PHALs. The first PHAL takes the graph generated by the data collator and the initial representations of the edges generated by the filtration curvature module as input and generates representations of the edges and of the nodes. Each remaining PHAL takes as input the edge and node representations generated by the previous PHAL and the initial representations of the edges generated by the filtration curvature module, and generates new representations of the edges and of the nodes. Wherein,
each PHAL is composed of a node-to-edge layer, an edge-to-edge layer, and an edge-to-node layer. The node-to-edge layer concatenates the initial representation of an edge in the PHAL input with the representations (attributes) of the nodes at the two ends of the edge and produces an intermediate representation of the edge through a dense layer. The edge-to-edge layer generates a representation of the edge from the intermediate representation generated by the node-to-edge layer. The edge-to-node layer generates a node representation from the node representation (attribute) in the PHAL input, the initial representation of the edge, and the representation of the edge generated by the edge-to-edge layer.
Further, the edge-to-edge layer comprises a directed line graph element, a graph classification element, a plurality of personalized graph attention elements, and a concatenation element. Wherein,
the directed line graph element constructs a directed line graph of the graph from the graph generated by the data collator and the intermediate representations of the edges generated by the node-to-edge layer. Specifically, each edge of the graph is assigned two directions to form two directed edges. The directed edges are taken as the nodes of the directed line graph, and the intermediate representations are taken as the attributes of the corresponding nodes. Whenever the head of one directed edge in the graph is the tail of another directed edge, an edge is constructed in the directed line graph from the node of the former to the node of the latter, and the angle between the two directed edges is taken as the weight of that edge.
The graph classification element divides the directed line graph generated by the directed line graph element into a plurality of subgraphs. Specifically, the angle interval (0°, 180°] is divided into a plurality of equally sized intervals. For each interval, the edges of the directed line graph whose weights are not in the interval are deleted to obtain a subgraph.
The personality graph attention element consists of a splicing layer, a dense layer, a multiplication layer and an addition layer, wherein the dense layer is an activation function of tanh (), the splicing layer splices the attribute of a node and the attribute of a neighbor node of the node into a vector, the dense layer uses the vector to generate a vector, and the dimension of the vector is consistent with the dimension of the attribute of the neighbor node. The multiplication layer takes the vector generated by the dense layer and the attribute of the neighbor node as element products to obtain the transfer vector of the neighbor, and the addition layer adds the transfer vector of each neighbor to the attribute of the node to obtain the local representation of the node.
The concatenation element concatenates the local representations of each node in the line graph to obtain one representation of the edge.
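The personalized graph attention element described above can be sketched in NumPy as follows. This is only an illustrative sketch: the function name, the parameter names `W` and `b`, the random initialization and the neighbor dictionary are assumptions made for the example, not part of the disclosure.

```python
import numpy as np

def personalized_attention(attrs, neighbors, W, b):
    """attrs: (n, d) node attributes; neighbors: {node: [neighbor ids]};
    W: (2d, d) and b: (d,) are hypothetical dense-layer parameters."""
    out = attrs.copy()
    for i, nbrs in neighbors.items():
        for j in nbrs:
            pair = np.concatenate([attrs[i], attrs[j]])  # concatenation layer
            gate = np.tanh(pair @ W + b)                 # dense layer, tanh activation
            out[i] += gate * attrs[j]                    # element-wise product, then addition
    return out

rng = np.random.default_rng(0)
d = 4
attrs = rng.normal(size=(3, d))
W = rng.normal(scale=0.1, size=(2 * d, d))
b = np.zeros(d)
local = personalized_attention(attrs, {0: [1, 2], 1: [0], 2: []}, W, b)
```

Because the gate depends on both the node and the neighbor, each neighbor receives its own per-dimension weighting; a node without neighbors keeps its attribute unchanged.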
Further, the edge-to-node layer is a multi-head adaptive graph attention mechanism comprising a plurality of head elements and a mean element. Each head element generates a representation of each node using the node representations (attributes) and the edge representations in the input of the edge-to-node layer. The mean element averages, for each node, the representations generated by the plurality of head elements to obtain the representation of that node.
The head element is composed of an edge-initial-representation linear layer, an edge-representation linear layer, a node-representation linear layer, a concatenation layer, a dense layer with tanh activation, a multiplication layer and an addition layer. The three linear layers respectively embed the initial representation of the edge, the representation of the edge and the representation (attribute) of the center node in the input of the edge-to-node layer into the same vector space. The concatenation layer concatenates the embedding of the center node, the embedding of the representation of an edge and the embedding of the initial representation of that edge, and the dense layer maps the concatenation to a real number in the interval [-1, 1]. The multiplication layer multiplies the embedding of the edge representation by this real number to obtain the transfer vector of the edge, and the addition layer adds the transfer vector of each neighboring edge to the embedding of the center node.
The working principle of the invention is to construct a graph neural network based on spatial distance, angle and curvature information and an adaptive graph attention mechanism, which learns to predict binding affinity from the 3D structure of protein-ligand complexes.
The invention has the advantages that: the accuracy of the protein-ligand binding affinity prediction is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only one embodiment of the present invention.
FIG. 1 is an overall frame diagram of the system and method of the present invention.
FIG. 2 is a flow chart of the present invention for predicting affinity.
Fig. 3 is a flow chart of curvature feature layer proposed by the present invention.
Fig. 4 is a frame diagram of a prediction module according to the present invention.
Fig. 5 is a construction scheme of the filtering curvature module proposed by the present invention.
Fig. 6 and 7 illustrate edge-to-edge layer construction schemes proposed by the present invention.
Fig. 8 and 9 illustrate edge-to-node layer construction schemes according to the present invention.
Detailed description of the preferred embodiments
The invention is further elucidated with reference to the drawings.
The system is shown in fig. 1 and comprises a data store 1, a data collator 3 and a model predictor 4. The data store 1 comprises a protein database 1a, a molecular database 1b and an affinity database 1c. The protein database and the molecular database contain the relative three-dimensional coordinates of atoms and atomic attribute information, and the affinity database contains protein-ligand affinity values. The atomic attribute information includes the atom type, pybel atomic attributes and SMARTS attributes.
The data collator 3 is connected to the data store 1. The data collator 3 receives input from the data store 1, generates a graph and the structure of the graph, and passes them to the model predictor 4, which uses the graph and its structure to predict the affinity and an approximation of the co-occurrence frequencies of atomic pairs. Wherein,
the data collator 3 comprises a data preprocessing module 3a, a graph module 3b and a structure module 3c. The data collator 3 passes the received input to the data preprocessing module 3a, which extracts the relative atomic coordinates of the protein-ligand complex, the pybel atomic attributes and the SMARTS attributes from the data and provides them to the graph module 3b. The graph module 3b uses this input to generate a graph and passes the graph to the structure module 3c, which uses the graph to generate the curvature structure and the classification of the edges of the graph.
The model predictor 4 comprises a filtering curvature module 4a and a SIHN module 4b. The filtering curvature module 4a receives the input of the model predictor 4, generates the initial representation of the edges and provides it to the SIHN module 4b. The SIHN module 4b uses the initial representation of the edges and the input of the model predictor 4 to generate the affinity prediction and the approximation of the co-occurrence frequencies of atomic pairs.
Figure 2 provides a more elaborate workflow for predicting binding affinity. The workflow comprises the following steps: data storage, data preprocessing, generating a graph and a structure of the graph, constructing a prediction model, predicting affinity and training the prediction model. Each step is described in detail below.
Step one: data storage. The PDBbind v2016 data is downloaded from the PDBbind website into the data store, and the protein information, ligand information and affinity value of each protein-ligand complex are stored in the protein database 1a, the molecular database 1b and the affinity database 1c, respectively.
Step two: data preprocessing. In the data collator 3, the data preprocessing module 3a applies the Open Babel tool to the stored data to produce the atomic attributes, the three-dimensional coordinates, the affinities and the co-occurrence frequencies of intermolecular atomic pairs. The atomic attribute is a 36-dimensional vector, specifically:
(2.1) The atom type is one-hot encoded to produce a 9-dimensional vector. The atom types are: B, C, N, O, P, S, Se, halogen (F, Cl, Br, I) and metal.
(2.2) A 4-dimensional vector is generated from the pybel atomic attributes. The 4 components of this vector are: an integer giving the hybridization of the atom (hyb), an integer counting the bonds with other heavy atoms (heavyvalence), an integer counting the bonds with other heteroatoms (heterovalence), and a floating point number representing the partial charge.
(2.3) The SMARTS attributes are encoded as a 5-dimensional vector of flags: hydrophobic, aromatic, acceptor, donor and ring.
(2.4) The above vectors are concatenated to obtain an 18-dimensional vector. To distinguish protein atoms from ligand atoms, an 18-dimensional zero vector is concatenated on the left of each protein atom attribute vector and on the right of each ligand atom attribute vector, so that a 36-dimensional vector is obtained.
(2.5) The co-occurrence frequency of a class of intermolecular atomic pairs is the proportion of that class among all intermolecular atomic pairs. An intermolecular atomic pair is a pair of atoms whose Euclidean distance does not exceed a preset threshold, with one atom from the protein atom set {C, N, O, S} and the other from the ligand atom set {C, N, O, S, Cl, F, P, I, Br}.
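Sub-steps (2.1) to (2.5) can be illustrated with the following NumPy sketch. It assumes that the zero block is placed on the left for protein atoms and on the right for ligand atoms, that any element outside the listed classes is treated as metal, and that the proportion in (2.5) is taken over the counted pairs only. The function names and the `cutoff` parameter are hypothetical; the distance threshold is not filled in here.

```python
import numpy as np
from collections import Counter

ATOM_CLASSES = ["B", "C", "N", "O", "P", "S", "Se", "halogen", "metal"]
HALOGENS = {"F", "Cl", "Br", "I"}
PROTEIN_ELEMS = {"C", "N", "O", "S"}
LIGAND_ELEMS = {"C", "N", "O", "S", "Cl", "F", "P", "I", "Br"}

def atom_features(element, pybel_props, smarts_flags, is_ligand):
    """Build the 36-dim attribute from 9 one-hot + 4 pybel + 5 SMARTS entries."""
    cls = "halogen" if element in HALOGENS else element
    if cls not in ATOM_CLASSES:
        cls = "metal"  # assumption: every other element counts as metal
    one_hot = np.zeros(9)
    one_hot[ATOM_CLASSES.index(cls)] = 1.0
    base = np.concatenate([one_hot, pybel_props, smarts_flags])  # 18-dim
    zeros = np.zeros(18)
    # ligand: zeros on the right; protein: zeros on the left
    return np.concatenate([base, zeros] if is_ligand else [zeros, base])

def cooccurrence_frequency(prot_xyz, prot_elem, lig_xyz, lig_elem, cutoff):
    """Fraction of each (protein element, ligand element) class among counted pairs."""
    dist = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    counts = Counter()
    for i, j in zip(*np.nonzero(dist <= cutoff)):
        if prot_elem[i] in PROTEIN_ELEMS and lig_elem[j] in LIGAND_ELEMS:
            counts[(prot_elem[i], lig_elem[j])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

For example, `atom_features("Cl", [3, 1, 1, -0.1], [1, 0, 0, 0, 0], True)` sets the halogen entry of the left 18-dimensional block and leaves the right block at zero.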
Step three: generating the graph and the structure of the graph, divided into the following sub-steps:
(3.1) The graph module 3b uses the information generated by the data preprocessing module 3a to generate a graph, including the nodes of the graph, the node attributes, the edges of the graph, the edge weights, the proportion of each type of intermolecular atomic pair, and the label of the graph. The generation proceeds as follows.
(3.1.1) Hydrogen atoms are removed from the protein-ligand complex; the proteins, ligands and protein-ligand complexes mentioned hereafter all have their hydrogen atoms removed.
(3.1.2) The nodes of the graph are all atoms of the ligand, together with the atoms of the protein whose Euclidean distance to at least one atom of the ligand does not exceed a preset distance threshold.
(3.1.3) The attribute of a node is the attribute of the corresponding atom.
(3.1.4) The edges of the graph are determined by the Euclidean distances between nodes: when the distance between two atoms does not exceed the short-distance threshold, the two corresponding nodes are connected by an edge.
(3.1.5) The weight of an edge is that distance.
(3.1.7) The label of the graph is the affinity.
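A minimal sketch of the graph construction in sub-step (3.1) is given below, assuming hydrogen atoms have already been removed. The names `build_graph`, `pocket_cutoff` and `edge_cutoff` are hypothetical, and the two thresholds are left as parameters because their values are not stated here.

```python
import numpy as np

def build_graph(lig_xyz, prot_xyz, pocket_cutoff, edge_cutoff):
    # keep protein atoms within pocket_cutoff of at least one ligand atom
    d_pl = np.linalg.norm(prot_xyz[:, None] - lig_xyz[None], axis=-1)
    keep = np.nonzero((d_pl <= pocket_cutoff).any(axis=1))[0]
    xyz = np.vstack([lig_xyz, prot_xyz[keep]])      # ligand nodes first
    is_ligand = np.arange(len(xyz)) < len(lig_xyz)
    d = np.linalg.norm(xyz[:, None] - xyz[None], axis=-1)
    # connect node pairs whose distance does not exceed edge_cutoff;
    # the edge weight is the Euclidean distance
    edges = {(i, j): d[i, j]
             for i in range(len(xyz)) for j in range(i + 1, len(xyz))
             if d[i, j] <= edge_cutoff}
    return xyz, is_ligand, edges
```

The returned `edges` dictionary maps each undirected edge `(i, j)` with `i < j` to its weight, which is the form used as input by the later sketches of the structure of the graph.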
(3.2) The structure module 3c uses the graph to construct the structure of the graph; it includes the long-distance feature layer and the curvature feature layer.
(3.2.1) The long-distance feature layer classifies the intermolecular edges and counts the number of intermolecular edges of each type. An intermolecular edge is an edge whose atomic pair has one atom from the protein and the other from the ligand. The type of an intermolecular edge is the pair of types of its two end atoms. Taking the protein atom from the 4 atom types {C, N, O, S} and the ligand atom from the 9 atom types {C, N, O, S, Cl, F, P, I, Br} gives 36 types of intermolecular edges. Intermolecular edges of the graph that do not belong to any of the 36 types are all classified into one common additional type. The number of intermolecular edges of each type refers to the count of edges of each of the 36 types.
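The classification performed by the long-distance feature layer can be sketched as follows. The ordering of the 36 types (protein element major, ligand element minor) and the use of index 36 for the common remaining type are assumptions chosen for the example; the function name is hypothetical.

```python
from collections import Counter
from itertools import product

PROTEIN_ELEMS = ["C", "N", "O", "S"]
LIGAND_ELEMS = ["C", "N", "O", "S", "Cl", "F", "P", "I", "Br"]
EDGE_TYPES = {pair: k for k, pair in enumerate(product(PROTEIN_ELEMS, LIGAND_ELEMS))}
OTHER = len(EDGE_TYPES)  # index 36: shared type for all remaining intermolecular edges

def classify_edges(edges, elem, is_ligand):
    """edges: iterable of (i, j); elem: element symbol per node; is_ligand: flag per node."""
    types = {}
    for i, j in edges:
        if is_ligand[i] == is_ligand[j]:
            continue  # intramolecular edge, not typed by this layer
        p, l = (j, i) if is_ligand[i] else (i, j)
        types[(i, j)] = EDGE_TYPES.get((elem[p], elem[l]), OTHER)
    counts = Counter(t for t in types.values() if t != OTHER)  # counts over the 36 types
    return types, counts
```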
(3.2.2) The curvature feature layer uses the graph to calculate the filtering Forman curvature of the edges of the graph; the calculation flow is shown in fig. 3. Fifty filtering values are selected: 0.1, 0.2, ..., 4.9, 5.0. For each filtering value, the edges whose weights exceed the filtering value are deleted from the original graph to obtain a filtering subgraph, and the Forman curvature of each edge of the subgraph is calculated, namely
$$4 - \deg(v_i) - \deg(v_j) + 3\Delta_{ij}$$
where $v_i$ and $v_j$ are the end nodes of the edge, $\deg(\cdot)$ returns the degree of a node, and $\Delta_{ij}$ is the number of triangles that contain the edge. The curvatures of an edge in the filtering subgraphs are concatenated in order of increasing filtering value; if the edge does not appear in a filtering subgraph, its curvature in that subgraph is set to zero. The resulting vector is the filtering curvature of the edge.
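The filtering curvature computation can be sketched as below, using the edge curvature 4 - deg(v_i) - deg(v_j) + 3Δ_ij stated above. The function names and the dictionary-based edge format are assumptions made for the example.

```python
import numpy as np

FILTER_VALUES = np.arange(1, 51) / 10  # 0.1, 0.2, ..., 5.0

def edge_curvature(edges):
    """edges: list of (i, j). Returns {(i, j): 4 - deg(i) - deg(j) + 3 * triangles}."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    # common neighbors of i and j are exactly the triangles containing edge (i, j)
    return {(i, j): 4 - len(adj[i]) - len(adj[j]) + 3 * len(adj[i] & adj[j])
            for i, j in edges}

def filtering_curvature(weighted_edges, thresholds=FILTER_VALUES):
    """weighted_edges: {(i, j): weight}. Returns {(i, j): vector, one entry per threshold}."""
    out = {e: np.zeros(len(thresholds)) for e in weighted_edges}
    for k, t in enumerate(thresholds):
        sub = [e for e, w in weighted_edges.items() if w <= t]  # filtering subgraph
        for e, c in edge_curvature(sub).items():
            out[e][k] = c  # edges absent from the subgraph keep the value zero
    return out

fc = filtering_curvature({(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}, [0.5, 1.0, 2.0])
```

In this toy triangle, the two short edges first appear at threshold 1.0 as a path, and all three edges form a triangle at threshold 2.0, so the curvature of each edge changes as the filtering value grows.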
Step four: constructing the prediction model. The prediction model is composed of a filtering curvature module 4a and a SIHN module 4b, as shown in fig. 4. The filtering curvature module 4a generates the initial representation of the edges using the graph generated in step three and the filtering curvature of the graph; the SIHN module 4b uses the graph generated in step three, the classification of the edges of the graph and the initial representation of the edges to generate predictions of the co-occurrence frequency of intermolecular atomic pairs and of the protein-ligand binding affinity.
As shown in fig. 5, the filtering curvature module 4a performs the following steps:
(4.1) A dense layer with LeakyReLU activation (negative slope 0.2) embeds the filtering curvature of the edge into a 128-dimensional vector space, and a softmax operation is then applied to the vector to obtain the curvature embedding of the edge.
(4.2) An Embedding layer embeds the weight of the edge, rounded up to an integer, into the 128-dimensional space to obtain the weight embedding of the edge.
(4.3) The curvature embedding of the edge from (4.1) and the weight embedding from (4.2) are concatenated to obtain a vector of the edge.
(4.4) A dense layer with ReLU activation takes the vector from (4.3) as input and outputs a 128-dimensional vector, which is the initial representation of the edge.
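The forward pass of steps (4.1) to (4.4) can be sketched in NumPy as follows. The parameter names, the random initialization and the size of the embedding table (`max_weight`) are assumptions made for the example; in a trained model these parameters would be learned.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(rng, n_filters=50, max_weight=5, dim=128):
    # hypothetical parameter container; max_weight bounds ceil(edge weight)
    return {
        "W1": rng.normal(scale=0.1, size=(n_filters, dim)), "b1": np.zeros(dim),
        "E": rng.normal(scale=0.1, size=(max_weight + 1, dim)),
        "W2": rng.normal(scale=0.1, size=(2 * dim, dim)), "b2": np.zeros(dim),
    }

def initial_edge_representation(params, curvature, weights):
    curv_emb = softmax(leaky_relu(curvature @ params["W1"] + params["b1"]))  # (4.1)
    weight_emb = params["E"][np.ceil(weights).astype(int)]                   # (4.2)
    x = np.concatenate([curv_emb, weight_emb], axis=-1)                      # (4.3)
    return np.maximum(x @ params["W2"] + params["b2"], 0.0)                  # (4.4)

rng = np.random.default_rng(0)
params = init_params(rng)
init_rep = initial_edge_representation(
    params, rng.normal(size=(3, 50)), np.array([0.3, 1.0, 4.7]))
```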
As shown in fig. 3, the SIHN module 4b is composed of a PHAL group and a pooling group. The PHAL group uses the graph and the structure of the graph generated by the data collator 3, together with the initial representation of the edges generated by the filtering curvature module 4a, to yield representations of the edges and of the nodes; the pooling group uses the edge and node representations generated by the PHAL group and the edge types generated by the data collator 3 to yield predictions of the co-occurrence frequency of atomic pairs and of the affinity.
(4.5) The PHAL group is composed of a plurality of PHALs. The first PHAL takes as input the graph generated by the data collator 3 and the initial representation of the edges generated by the filtering curvature module 4a, and generates representations of the edges and of the nodes. Each remaining PHAL takes as input the edge and node representations generated by the previous PHAL and the initial representation of the edges generated by the filtering curvature module 4a, and generates new representations of the edges and of the nodes.
(4.6) Each PHAL is composed of a node-to-edge layer, an edge-to-edge layer and an edge-to-node layer. The node-to-edge layer generates an intermediate representation of each edge from the initial representation of the edge in the PHAL input and the representations (attributes) of the nodes at both ends of the edge. The edge-to-edge layer generates a representation of the edge using the intermediate representation of the edge generated by the node-to-edge layer. The edge-to-node layer generates a node representation using the node representation (attribute) in the PHAL input, the initial representation of the edge, and the representation of the edge generated by the edge-to-edge layer.
(4.7) The node-to-edge layer concatenates the initial representation of the edge in the PHAL input with the representations (attributes) of the nodes at both ends of the edge, and embeds the concatenation through a dense layer into a 128-dimensional vector space to form the intermediate representation of the edge. The activation function of the dense layer is ReLU.
(4.8) The edge-to-edge layer, as shown in figs. 6 and 7, comprises a directed line graph element, a graph classification element, six personalized graph attention elements and a concatenation element. The graph classification element divides the directed line graph generated by the directed line graph element into six subgraphs; applying a personalized graph attention element to each subgraph yields six representations of each edge, and the concatenation element concatenates the six representations of each edge to form the representation of the edge. Specifically:
(4.8.1) The directed line graph element constructs a directed line graph of the graph from the graph generated by the data collator 3 and the intermediate representations of the edges generated by the node-to-edge layer. Each edge of the graph is assigned two directions to form two directed edges; the directed edges serve as the nodes of the directed line graph, and the intermediate representations serve as the attributes of the corresponding nodes. An edge is constructed from one node of the directed line graph to another whenever the head of the first directed edge in the graph is the tail of the second, and the angle between the two directed edges is the weight of that edge of the directed line graph.
(4.8.2) The graph classification element divides the directed line graph generated by the directed line graph element into a plurality of subgraphs. Specifically, the angle interval (0°, 180°] is divided into six equally sized intervals, and for each interval the edges of the directed line graph whose weights fall outside that interval are deleted to obtain a subgraph.
(4.8.3) The personalized graph attention element is composed of a concatenation layer, a dense layer with tanh activation, a multiplication layer and an addition layer. The concatenation layer concatenates the attribute of a node and the attribute of one of its neighbor nodes into a vector. The dense layer maps this vector to a 128-dimensional vector. The multiplication layer takes the element-wise product of the 128-dimensional vector and the neighbor node's attribute to obtain the transfer vector of the neighbor. The addition layer adds the transfer vector of each neighbor to the attribute of the node to obtain the local representation of the node.
(4.8.4) The concatenation element concatenates the local representations of each node of the line graph to obtain a 128 x 6 dimensional representation of the edge.
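The directed line graph construction of (4.8.1) and the angle partition of (4.8.2) can be sketched as follows. The angle is assumed here to be measured at the shared atom j between the directions towards i and towards k, so that a directed edge paired with its own reverse would have angle 0 and fall outside (0°, 180°]; such pairs are skipped. This convention and the function names are assumptions.

```python
import numpy as np

def directed_line_graph(xyz, edges):
    """edges: undirected (i, j) pairs. Returns {((i, j), (j, k)): angle in degrees}."""
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    out_edges = {}
    for a in directed:
        out_edges.setdefault(a[0], []).append(a)
    line_edges = {}
    for i, j in directed:
        for _, k in out_edges.get(j, []):  # directed edges whose tail is the head j
            if k == i:
                continue  # assumption: the reverse edge (angle 0) is excluded
            u, v = xyz[i] - xyz[j], xyz[k] - xyz[j]
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            line_edges[((i, j), (j, k))] = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return line_edges

def split_by_angle(line_edges, n_bins=6):
    """Partition line-graph edges into n_bins subgraphs over (0, 180] degrees."""
    width = 180.0 / n_bins
    subgraphs = [dict() for _ in range(n_bins)]
    for e, angle in line_edges.items():
        if angle > 0:
            b = min(int(np.ceil(angle / width)) - 1, n_bins - 1)
            subgraphs[b][e] = angle
    return subgraphs

xyz = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
line_edges = directed_line_graph(xyz, [(0, 1), (1, 2)])
subgraphs = split_by_angle(line_edges)
```

With six bins, each subgraph covers a 30° range, so each personalized graph attention element only aggregates over neighbors in one angular range.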
(4.9) The edge-to-node layer comprises four head elements and a mean element. Each head element generates a representation of each node using the node representations (attributes) and the edge representations in the input of the edge-to-node layer, so that four representations are obtained per node. The mean element averages the four representations of each node to obtain the representation of that node, as shown in fig. 8 and 9.
(4.9.1) The head element is composed of an edge-initial-representation linear layer, an edge-representation linear layer, a node-representation linear layer, a concatenation layer, a dense layer with tanh activation, a multiplication layer and an addition layer. The three linear layers respectively embed the initial representation of the edge, the representation of the edge and the representation (attribute) of the center node into a 128-dimensional vector space. The concatenation layer concatenates the embedding of the center node, the embedding of the representation of an edge and the embedding of the initial representation of that edge, and the dense layer maps the concatenation to a real number in the interval [-1, 1]. The multiplication layer multiplies the embedding of the edge representation by this real number to obtain the transfer vector of the edge, and the addition layer adds the transfer vector of each neighboring edge to the embedding of the center node.
(4.9.2) The mean element takes the mean of the representations of the same node obtained from each head element to obtain the representation of the node.
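The head element and mean element of (4.9) can be sketched as follows. The parameter names (`Wn`, `Wi`, `We`, `a`), the omission of bias terms, the `incident` dictionary and the small dimensions in the usage lines are assumptions for illustration only.

```python
import numpy as np

def head(p, node_attr, edge_init, edge_rep, incident):
    """incident: {node index: list of indices of edges incident to that node}."""
    h_node = node_attr @ p["Wn"]  # node-representation linear layer
    h_init = edge_init @ p["Wi"]  # edge-initial-representation linear layer
    h_edge = edge_rep @ p["We"]   # edge-representation linear layer
    out = h_node.copy()
    for v, edge_ids in incident.items():
        for e in edge_ids:
            z = np.concatenate([h_node[v], h_edge[e], h_init[e]])  # concatenation layer
            alpha = np.tanh(z @ p["a"])                            # real number in [-1, 1]
            out[v] += alpha * h_edge[e]                            # transfer vector, added
    return out

def edge_to_node(heads, node_attr, edge_init, edge_rep, incident):
    # mean element: average the per-head node representations
    return np.mean([head(p, node_attr, edge_init, edge_rep, incident) for p in heads], axis=0)

rng = np.random.default_rng(0)
d = 8  # the description uses 128; a small value keeps the example light
heads = [{"Wn": rng.normal(size=(5, d)), "Wi": rng.normal(size=(6, d)),
          "We": rng.normal(size=(7, d)), "a": rng.normal(size=3 * d)} for _ in range(4)]
node_attr = rng.normal(size=(3, 5))
edge_init = rng.normal(size=(2, 6))
edge_rep = rng.normal(size=(2, 7))
node_rep = edge_to_node(heads, node_attr, edge_init, edge_rep, {0: [0], 1: [0, 1], 2: []})
```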
The pooling group comprises an edge pooling and an atomic pooling. The edge pooling uses the edge representations generated by the PHAL group and the edge types generated by the data collator 3 to produce approximations of the co-occurrence frequencies of the various types of atomic pairs; the atomic pooling uses the node representations generated by the PHAL group to produce the affinity prediction.
(4.10) The edge pooling comprises a dense layer with ReLU activation, a classification pooling layer, a linear layer and a softmax layer. The dense layer embeds the edge representations generated by the PHAL group into a 128-dimensional space. The classification pooling layer selects the 36 classes of edges from the edge types generated by the data collator 3 and sums, for each class, the dense-layer representations of the edges of that class, forming 36 vectors that can be viewed as a matrix with 36 rows. The linear layer converts the matrix into a 36-dimensional vector, and the softmax layer converts the vector into an approximation of the co-occurrence frequencies of the intermolecular atomic pairs.
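The edge pooling of (4.10) can be sketched as below. The sketch assumes that the linear layer maps each 128-dimensional row to one scalar, which is one plausible way to turn the 36-row matrix into a 36-dimensional vector; the text does not state this detail. Parameter names and the 768-dimensional input are likewise assumptions.

```python
import numpy as np

def edge_pooling(p, edge_rep, edge_type, n_types=36):
    """edge_rep: (m, d_in) edge representations; edge_type: (m,) integer type per edge,
    where values >= n_types (the shared remaining type) are ignored."""
    h = np.maximum(edge_rep @ p["W"] + p["b"], 0.0)  # dense layer, ReLU
    pooled = np.zeros((n_types, h.shape[1]))
    for t in range(n_types):
        pooled[t] = h[edge_type == t].sum(axis=0)    # classification pooling: sum per class
    logits = pooled @ p["w_out"] + p["b_out"]        # linear layer: one score per row
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # softmax

rng = np.random.default_rng(0)
p = {"W": rng.normal(scale=0.01, size=(768, 128)), "b": np.zeros(128),
     "w_out": rng.normal(scale=0.1, size=128), "b_out": 0.0}
freq_pred = edge_pooling(p, rng.normal(size=(10, 768)), rng.integers(0, 37, size=10))
```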
(4.11) The atomic pooling comprises a graph pooling layer, three dense layers with ReLU activation, and a linear layer. The graph pooling layer sums the node representations to obtain a vector. The three dense layers are arranged in sequence, the output of each serving as the input of the next, and embed the vector successively into 128 x 4, 128 x 2 and 128-dimensional vector spaces. The linear layer maps the resulting 128-dimensional vector to a real number, which is the affinity prediction.
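The atomic pooling of (4.11) can be sketched as follows, assuming 128-dimensional node representations as input; the parameter container and the initialization are hypothetical.

```python
import numpy as np

def atomic_pooling(p, node_rep):
    x = node_rep.sum(axis=0)                   # graph pooling layer: sum over nodes
    for W, b in p["dense"]:                    # widths 128*4, 128*2, 128
        x = np.maximum(x @ W + b, 0.0)         # dense layer, ReLU
    return float(x @ p["w_out"] + p["b_out"])  # linear layer: scalar affinity prediction

rng = np.random.default_rng(0)
sizes = [128, 512, 256, 128]  # assumption: input node representations are 128-dimensional
p = {"dense": [(rng.normal(scale=0.05, size=(a, b)), np.zeros(b))
               for a, b in zip(sizes, sizes[1:])],
     "w_out": rng.normal(scale=0.05, size=128), "b_out": 0.0}
nodes = rng.normal(size=(7, 128))
affinity = atomic_pooling(p, nodes)
```

Because the pooling layer is a sum, the prediction does not depend on the order in which the atoms are listed.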
Step five: predicting affinity. The data collator 3 receives data from the data store 1, generates the graph and the structural information of the graph, and provides them to the model predictor 4, which uses them to generate the affinity prediction and the approximation of the co-occurrence frequencies of intermolecular atomic pairs.
Step six: training the prediction model. During training, an Adam optimizer backpropagates the gradient of the loss function to update the parameters. The loss function is calculated by a loss module. The loss module compares the binding affinity and the co-occurrence frequencies of atomic pairs generated by the data preprocessing module 3a with the corresponding approximations generated by the model predictor 4, obtaining the $L_1$ losses $f_1$ and $f_2$ respectively, and combines them with weights, specifically $f := f_1 + 1.75 f_2$.
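The combined loss of step six can be written as a short function. The mean reduction used for each L1 loss is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def combined_loss(aff_pred, aff_true, freq_pred, freq_true, weight=1.75):
    # f1: L1 loss on the affinity; f2: L1 loss on the co-occurrence frequencies
    f1 = np.mean(np.abs(np.asarray(aff_pred) - np.asarray(aff_true)))
    f2 = np.mean(np.abs(np.asarray(freq_pred) - np.asarray(freq_true)))
    return f1 + weight * f2  # f := f1 + 1.75 * f2

loss = combined_loss([5.0], [6.0], [0.5, 0.5], [1.0, 0.0])
```

In this toy call, f1 is 1.0 and f2 is 0.5, so the combined loss is 1.0 + 1.75 x 0.5.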
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (12)

1. A system for predicting binding affinity of a molecule and a protein based on filtering curvature, said system comprising a data store, a data collator and a model predictor, said data collator using data stored in said data store to generate a graph and a structure of the graph and providing said graph and structure to the model predictor for predicting affinity, wherein,
A data store for storing protein-ligand information, the protein-ligand information comprising coordinates of atoms, attributes of atoms, and binding affinity information, the attributes of atoms comprising: atom type, pybel atom attributes, and SMARTS attributes;
a data collator comprising: a data preprocessing module, a graph module, and a structure module configured to: applying the data preprocessing module to the data stored in the one data memory to generate binding affinity, atomic coordinates and co-occurrence frequency of intermolecular atomic pairs, wherein the intermolecular atomic pairs are that one atom of the atomic pairs is from a ligand, the other atom is from a protein, and Euclidean distance between the two atoms does not exceed a preset threshold value, and
the graph module is used for combining the affinity, the atomic coordinates and the atomic attributes generated by the data preprocessing module to construct a graph, specifically, a short-distance threshold is set, atoms and the atomic attributes are taken as nodes of the graph and the attributes of the nodes, if the Euclidean distance between two atoms does not exceed the short-distance threshold, an edge is added between the corresponding nodes, the weight of the edge is the Euclidean distance, the label of the graph is the affinity, and
The structural module is used in the constructed graph to generate the structure of the graph: the type of the intermolecular edge, the number of the types of the edges, and the filtering curvature of the edges, wherein the intermolecular edge refers to the edge of the graph connecting the protein atom and the ligand atom, the type of the edge refers to the type of the pair of atoms at both end points of the edge, and
the structural module comprises a long-distance feature layer and a curvature feature layer, wherein the long-distance feature layer generates types of intermolecular edges and numbers of various types of edges using a constructed graph, and
the curvature feature layer generates a filtered curvature of the edge using the constructed graph; the specific process is as follows:
setting a string of filtering values, deleting edges with a weight exceeding the filtering value from the graph for each filtering value to generate a filtering sub-graph, calculating the curvature of each edge of the sub-graph, splicing the curvatures of one edge in each filtering sub-graph according to the sequence from the small filtering value to the large filtering value, and setting the curvature of the edge in a filtering sub-graph to be zero if the edge does not appear in a certain filtering sub-graph, wherein the obtained vector is the filtering curvature of the edge, and the curvature is a discretization mode of Ricci curvature;
A model predictor using the graph and structure generated by the data collator to generate a prediction of the binding affinity of the molecule and a target protein and an approximation of the co-occurrence frequency of intermolecular atomic pairs, the model predictor consisting of a filtering curvature module and a SIHN module, wherein,
the filtering curvature module generates an initial representation of the edge using the filtering curvature and weights of the edge generated by the data collator, and
the SIHN module uses the graph generated by the data collator and the structure of the graph and the initial representation of the edges generated by the filtered curvature module to yield an affinity prediction and co-occurrence frequency approximation of intermolecular atomic pairs.
2. The system of claim 1, wherein the filtered curvature module uses one dense layer to embed the filtered curvature of the edge generated by the data collator into the high-dimensional vector space, and performs softmax operation on the embedded vector to obtain an edge curvature embedding, and the edge weight is embedded into the high-dimensional space after rounding up to obtain an edge weight embedding, and the edge curvature embedding and the weight embedding are spliced together and passed through another dense layer to generate the initial representation of the edge.
3. The system of claim 1, wherein said SIHN module is comprised of a PHAL group and a pooling group, the PHAL group using said graph and structure of the graph generated by said data collator and said initial representations of edges generated by said filtering curvature module to yield representations of edges and representations of nodes, and the pooling group using said representations of edges and representations of nodes generated by said PHAL group and the types of edges generated by said data collator to yield predictions of the co-occurrence frequency of atomic pairs and of the affinity;
The pooling group comprises an edge pooling and an atomic pooling, wherein the edge pooling uses the representations of the edges generated by the PHAL group and the types of the edges generated by the data collator to produce approximations of the co-occurrence frequencies of the various types of atomic pairs, and the atomic pooling uses the representations of the nodes generated by the PHAL group to produce the affinity prediction;
the edge pooling comprises a dense layer with an activation function of ReLU, a classification pooling layer, a linear layer and a softmax layer, wherein the dense layer embeds the edge representations generated by the PHAL group into a 128-dimensional space, the classification pooling layer selects 36 classes of edges from the types of edges generated by the data collator and sums the dense-layer representations of the edges of each class to form a 36-row matrix, the selected 36 classes of edges are edges (a, b), wherein a is a C (carbon), N (nitrogen), O (oxygen) or S (sulfur) atom from the protein and b is a C, N, O, S, Cl (chlorine), F (fluorine), P (phosphorus), I (iodine) or Br (bromine) atom from the ligand, the linear layer converts the matrix generated by the classification pooling layer into a 36-dimensional vector, and the softmax layer converts the vector into a co-occurrence frequency approximation of the intermolecular atomic pairs;
the atomic pooling comprises a graph pooling layer, three dense layers with an activation function of ReLU, and a linear layer, wherein the graph pooling layer sums the node representations generated by the PHAL group to obtain a vector, the three dense layers are sequentially arranged, the output of the former dense layer is used as the input of the latter dense layer, the vector obtained by the graph pooling layer is sequentially embedded into 128 x 4, 128 x 2 and 128-dimensional vector spaces, and the linear layer maps the obtained 128-dimensional vector into a real number, namely the affinity predicted value.
4. The system of claim 3, wherein the PHAL group consists of a plurality of PHALs, wherein a first PHAL takes as input the graph generated by the data collator and the initial representation of the edges generated by the filtering curvature module and generates representations of the edges and representations of the nodes, and each remaining PHAL takes as input the representations of the edges and of the nodes generated by the previous PHAL and the initial representation of the edges generated by the filtering curvature module and generates representations of the edges and representations of the nodes; wherein,
the PHAL consists of a node-to-edge layer, an edge-to-edge layer and an edge-to-node layer, wherein the node-to-edge layer concatenates the initial representation of an edge in the PHAL input and the representations (attributes) of the nodes at the two ends of the edge and passes the concatenation through a dense layer to form an intermediate representation of the edge, the edge-to-edge layer generates an edge representation by using the intermediate representation of the edge generated by the node-to-edge layer, and the edge-to-node layer generates a node representation by using the node representation (attributes) in the PHAL input, the initial representation of the edge and the edge representation generated by the edge-to-edge layer.
5. The system of claim 4, wherein the edge-to-edge layer comprises a directed line graph element, a graph classification element, a plurality of personalized graph attention elements, and a concatenation element; wherein,
The directed line graph element uses the graph generated by the data collator and the intermediate representations of the edges generated by the node-to-edge layer to construct a directed line graph of the graph, specifically, each edge of the graph is assigned two directions to form two directed edges, the directed edges are used as the nodes of the directed line graph, the intermediate representations are used as the attributes of the corresponding nodes in the directed line graph, an edge is constructed from one node of the directed line graph to another on the basis that the head of the one directed edge in the graph is the tail of the other directed edge, and the included angle of the two directed edges is used as the weight of the edge of the directed line graph;
the graph classification element divides the directed line graph generated by the directed line graph element into a plurality of subgraphs, specifically, the angle interval (0°, 180°] is divided into a plurality of intervals of the same size, and for each interval the edges of the directed line graph whose weights are not in the interval are deleted to obtain a subgraph;
the personalized graph attention element consists of a concatenation layer, a dense layer with an activation function of tanh (), a multiplication layer and an addition layer, wherein the concatenation layer concatenates the attribute of a node and the attribute of a neighbor node of the node into a vector, the dense layer uses the vector to generate a vector whose dimension is consistent with the dimension of the attribute of the neighbor node, the multiplication layer takes the element-wise product of the vector generated by the dense layer and the attribute of the neighbor node to obtain the transfer vector of the neighbor, and the addition layer adds the transfer vector of each neighbor to the attribute of the node to obtain the local representation of the node;
The concatenation element concatenates the local representations of each node in the line graph to obtain one representation of the edge.
6. The system of claim 4, wherein the edge-to-node layer is a multi-headed adaptive graph attention mechanism comprising a plurality of head elements and a mean element, wherein each head element generates a representation of a node using a node representation (attribute) and an edge representation in the input of the edge-to-node layer, and wherein the mean element averages the generated representations of each node over the plurality of head elements to obtain a representation of each node;
each head element consists of an initial-edge linear layer, an edge linear layer, a node linear layer, a concatenation layer, a dense layer whose activation function is tanh(), a multiplication layer and an addition layer, wherein the initial-edge linear layer, the edge linear layer and the node linear layer respectively embed the initial edge representation, the edge representation and the center node representation (attribute) in the input of the edge-to-node layer into the same vector space; the concatenation layer concatenates the embedding of the center node, the embedding of the edge representation and the embedding of the initial edge representation; the dense layer maps the concatenation to a real number in the interval [-1, 1]; the multiplication layer performs scalar multiplication of that real number with the embedding of the edge representation to obtain the transfer vector of the edge; and the addition layer adds the transfer vector of each neighboring edge to the embedding of the center node.
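A head element and the mean over heads might look like the sketch below. The parameter names (`W_node`, `W_edge`, `W_edge0`, `w_gate`, `b_gate`) and the `incidence` array of (center node, edge) pairs are assumptions introduced for illustration; the claim does not fix dimensions or how neighboring edges are indexed.

```python
import numpy as np

def head_element(node_x, edge_h, edge_h0, incidence, p):
    """One attention head of the edge-to-node layer (illustrative sketch).

    node_x:    (n, dn) node representations.
    edge_h:    (m, de) edge representations.
    edge_h0:   (m, d0) initial edge representations.
    incidence: (k, 2) int array; each row is (center node, neighboring edge).
    p:         dict of hypothetical parameters.
    """
    z_n = node_x @ p["W_node"]      # node linear layer
    z_e = edge_h @ p["W_edge"]      # edge linear layer
    z_0 = edge_h0 @ p["W_edge0"]    # initial-edge linear layer
    nodes, edge_ids = incidence[:, 0], incidence[:, 1]
    # Concatenation layer: center node, edge, initial edge embeddings.
    cat = np.concatenate([z_n[nodes], z_e[edge_ids], z_0[edge_ids]], axis=1)
    # Dense layer with tanh: one real number in [-1, 1] per (node, edge) pair.
    gate = np.tanh(cat @ p["w_gate"] + p["b_gate"])
    # Multiplication layer: scalar times the edge embedding.
    transfer = gate[:, None] * z_e[edge_ids]
    # Addition layer: accumulate transfer vectors onto the center node embedding.
    out = z_n.copy()
    np.add.at(out, nodes, transfer)
    return out

def edge_to_node(node_x, edge_h, edge_h0, incidence, heads):
    """Mean element: average the node representations produced by all heads."""
    return np.mean([head_element(node_x, edge_h, edge_h0, incidence, p)
                    for p in heads], axis=0)
```

`np.add.at` is used instead of fancy-index assignment so that a node with several neighboring edges receives the sum of all their transfer vectors.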
7. A method of predicting binding affinity of a molecule to a protein based on filtration curvature, comprising:
step one: storing protein-ligand data in a data memory, the data comprising coordinates of atoms, attributes of atoms, and binding affinity information, the attributes of atoms comprising: atom type, pybel atom attributes, and SMARTS attributes;
step two: constructing and using a data collator that generates a graph and a structure of the graph from the data stored in the data memory, the data collator comprising: a data preprocessing module, a graph module and a structure module, wherein
the data preprocessing module generates the binding affinity, the atomic coordinates and the co-occurrence frequency of intermolecular atom pairs from the data stored in the data memory, wherein an intermolecular atom pair is a pair of atoms in which one atom comes from the ligand, the other comes from the protein, and the Euclidean distance between the two atoms does not exceed a preset threshold, and
the graph module constructs a graph using the binding affinity, atomic coordinates and atomic attributes generated by the data preprocessing module, the construction being: setting a short-distance threshold, taking the atoms and their attributes as the nodes of the graph and the attributes of the nodes, adding an edge between two nodes if the Euclidean distance between the corresponding atoms does not exceed the short-distance threshold, the weight of the edge being that Euclidean distance, and the label of the graph being the binding affinity, and
the structure module uses the graph constructed by the graph module to generate the structure of the graph: the types of the intermolecular edges, the number of edges of each type, and the filtration curvature of the edges, wherein an intermolecular edge is an edge of the graph connecting a protein atom and a ligand atom, and the type of an edge is the type of the atom pair at its two endpoints, and
the structure module comprises a long-distance feature layer and a curvature feature layer, wherein the long-distance feature layer uses the constructed graph to generate the types of the intermolecular edges and the number of edges of each type, and
the curvature feature layer uses the constructed graph to generate the filtration curvature of the edges; the specific process is as follows:
setting a sequence of filtration values; for each filtration value, deleting from the graph the edges whose weight exceeds the filtration value to generate a filtration subgraph, and computing the curvature of each edge of that subgraph; concatenating the curvatures of an edge across the filtration subgraphs in order of increasing filtration value, the curvature of an edge being set to zero in any filtration subgraph in which the edge does not appear; the resulting vector is the filtration curvature of the edge, the curvature being a discretization of Ricci curvature;
step three: constructing a prediction model that uses the graph and the structure of the graph generated by the data collator to produce a prediction of the binding affinity of the molecule to the target protein and an approximation of the co-occurrence frequency of intermolecular atom pairs, wherein
the prediction model consists of a filtration curvature module and a SIHN module, the filtration curvature module producing an initial representation of each edge from the filtration curvature and weight of the edge generated by the data collator, and
the SIHN module uses the graph and the structure of the graph generated by the data collator and the initial edge representations generated by the filtration curvature module to generate the affinity prediction and the co-occurrence frequency approximation of the intermolecular atom pairs;
step four: training the prediction model: the data in the data memory are processed by step two and fed to the prediction model as input, and the model is trained using back propagation and the Adam optimizer to minimize a loss function, wherein the loss function is a weighted sum of the loss between the co-occurrence frequency of intermolecular atom pairs and its approximation and the loss between the binding affinity and its prediction;
step five: processing the protein-ligand data of interest through step one and step two and inputting them into the trained prediction model to obtain the predicted affinity of the protein-ligand data.
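The graph construction and filtration curvature of step two can be illustrated with a short sketch. The claim only says the curvature is a discretization of Ricci curvature; the simplified Forman curvature for unweighted graphs, F(u, v) = 4 - deg(u) - deg(v), is used here purely as an assumed stand-in, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def build_graph(coords, threshold):
    """Add an edge (i, j) with weight = Euclidean distance when it does not exceed threshold."""
    edges = {}
    for i, j in combinations(range(len(coords)), 2):
        d = float(np.linalg.norm(coords[i] - coords[j]))
        if d <= threshold:
            edges[(i, j)] = d
    return edges

def forman_curvature(edge_list, n_nodes):
    """Simplified Forman-Ricci curvature of each edge (assumed choice of discretization)."""
    deg = np.zeros(n_nodes, dtype=int)
    for i, j in edge_list:
        deg[i] += 1
        deg[j] += 1
    return {(i, j): 4 - int(deg[i]) - int(deg[j]) for i, j in edge_list}

def filtration_curvature(edges, n_nodes, filtration_values):
    """Concatenate per-edge curvatures over filtration subgraphs, smallest value first."""
    vectors = {e: [] for e in edges}
    for t in sorted(filtration_values):
        # Filtration subgraph: drop edges whose weight exceeds the filtration value.
        sub = [e for e, w in edges.items() if w <= t]
        curv = forman_curvature(sub, n_nodes)
        for e in edges:
            # An edge absent from this subgraph contributes zero.
            vectors[e].append(float(curv.get(e, 0.0)))
    return {e: np.array(v) for e, v in vectors.items()}
```

Each edge ends up with one curvature entry per filtration value, so the length of the filtration curvature vector equals the number of filtration values.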
8. The method of claim 7, wherein the filtration curvature module embeds the filtration curvature of each edge generated by the data collator into a high-dimensional vector space using one dense layer and applies a softmax operation to the embedded vector to obtain the curvature embedding of the edge; rounds the weight of the edge up and embeds it into a high-dimensional space to obtain the weight embedding of the edge; and concatenates the curvature embedding and the weight embedding of the edge and passes the result through another dense layer to generate the initial representation of the edge.
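The data flow of the filtration curvature module can be sketched as below. Several details are assumptions not stated in the claim: the first dense layer is treated as an affine map followed by softmax, the rounded-up weight is used as an index into a lookup table `E_weight`, and the second dense layer uses ReLU. All parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def initial_edge_representation(curv, weight, p):
    """Combine filtration curvature and edge weight into an initial edge representation.

    curv:   (m, f) filtration curvature vectors, one row per edge.
    weight: (m,) edge weights (Euclidean distances).
    p:      dict of hypothetical parameters.
    """
    # Dense layer followed by softmax gives the curvature embedding.
    curv_emb = softmax(curv @ p["W_curv"] + p["b_curv"])
    # Round the weight up and look up its embedding (assumed embedding table).
    idx = np.clip(np.ceil(weight).astype(int), 0, len(p["E_weight"]) - 1)
    weight_emb = p["E_weight"][idx]
    # Concatenate both embeddings and apply the second dense layer (ReLU assumed).
    cat = np.concatenate([curv_emb, weight_emb], axis=1)
    return np.maximum(cat @ p["W_out"] + p["b_out"], 0.0)
```

Rounding up turns the continuous distance into a small set of integer bins, which is what makes a table lookup a plausible reading of "embeds the weight".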
9. The method of claim 7, wherein the SIHN module in step three consists of a PHAL group and a pooling group, the PHAL group using the graph and the structure of the graph generated by the data collator and the initial edge representations generated by the filtration curvature module to produce representations of the edges and representations of the nodes, and the pooling group using the edge and node representations generated by the PHAL group and the edge types generated by the data collator to produce the co-occurrence frequency approximation of the atom pairs and the affinity prediction;
the pooling group comprises an edge pooling and an atomic pooling, wherein the edge pooling uses the edge representations generated by the PHAL group and the edge types generated by the data collator to produce the co-occurrence frequency approximation for the various types of atom pairs, and the atomic pooling uses the node representations generated by the PHAL group to produce the affinity prediction;
the edge pooling comprises a dense layer whose activation function is ReLU, a classification pooling layer, a linear layer and a softmax layer, wherein the dense layer embeds the edge representations generated by the PHAL group into a 128-dimensional space; the classification pooling layer selects 36 classes of edges from the edge types generated by the data collator and sums the dense-layer representations of the edges of each class to form a 36-row matrix, the 36 selected classes being the edges (a, b) in which a is a C, N, O or S atom from the protein and b is a C, N, O, S, Cl, F, P, I or Br atom from the ligand; the linear layer converts the matrix generated by the classification pooling layer into a 36-dimensional vector; and the softmax layer converts that vector into the co-occurrence frequency approximation of the intermolecular atom pairs;
the atomic pooling comprises a graph pooling layer, three dense layers with an activation function, and a linear layer, wherein the graph pooling layer sums the node representations generated by the PHAL group to obtain a vector; the three dense layers are arranged in sequence, the output of each serving as the input of the next, and successively embed the vector obtained by the graph pooling layer into 128 × 4-, 128 × 2- and 128-dimensional vector spaces; and the linear layer maps the resulting 128-dimensional vector to a real number, namely the predicted affinity.
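The edge pooling branch can be sketched as follows. The 36 classes follow the claim (4 protein atom types times 9 ligand atom types); the assumption that the linear layer maps each 128-dimensional row to one scalar, and all function and parameter names, are illustrative only.

```python
import numpy as np

PROTEIN_ATOMS = ["C", "N", "O", "S"]
LIGAND_ATOMS = ["C", "N", "O", "S", "Cl", "F", "P", "I", "Br"]
CLASSES = [(a, b) for a in PROTEIN_ATOMS for b in LIGAND_ATOMS]   # 4 * 9 = 36 classes
CLASS_ID = {c: k for k, c in enumerate(CLASSES)}

def class_pool(z, edge_types):
    """Classification pooling: sum edge embeddings per (protein atom, ligand atom) class."""
    pooled = np.zeros((len(CLASSES), z.shape[1]))
    for row, t in zip(z, edge_types):
        k = CLASS_ID.get(t)
        if k is not None:             # edges outside the 36 classes are ignored
            pooled[k] += row
    return pooled

def edge_pooling(edge_h, edge_types, p):
    """Dense (ReLU) -> classification pooling -> linear -> softmax."""
    z = np.maximum(edge_h @ p["W_dense"] + p["b_dense"], 0.0)     # (m, 128)
    pooled = class_pool(z, edge_types)                             # (36, 128)
    logits = pooled @ p["w_lin"] + p["b_lin"]                      # (36,), one score per class
    e = np.exp(logits - logits.max())
    return e / e.sum()                                             # approximated frequencies
```

Because the output is a softmax over 36 classes, it can be compared directly with the normalized co-occurrence frequencies produced by the data preprocessing module.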
10. The method of claim 9, wherein the PHAL group consists of a plurality of PHALs, wherein the first PHAL takes as input the graph generated by the data collator and the initial edge representations generated by the filtration curvature module and generates representations of the edges and of the nodes, and each remaining PHAL takes as input the edge and node representations generated by the previous PHAL together with the initial edge representations generated by the filtration curvature module and generates representations of the edges and of the nodes; wherein
each PHAL consists of a node-to-edge layer, an edge-to-edge layer and an edge-to-node layer, wherein the node-to-edge layer concatenates the initial representation of an edge in the PHAL input and the representations (attributes) of the nodes at the two ends of the edge, the edge-to-edge layer generates an edge representation from the intermediate representation of the edge generated by the node-to-edge layer through a dense layer, and the edge-to-node layer generates a node representation using the node representations (attributes) in the PHAL input, the initial edge representations and the edge representations generated by the edge-to-edge layer.
11. The method of claim 10, wherein the edge-to-edge layer comprises a directed line graph element, a graph classification element, a plurality of personality graph attention elements and a concatenation element; wherein
the directed line graph element uses the graph generated by the data collator and the intermediate edge representations generated by the node-to-edge layer to construct a directed line graph of the graph; specifically, each edge of the graph gives rise to two directed edges, the directed edges serve as the nodes of the directed line graph, the intermediate representation of an edge serves as the attribute of the corresponding nodes of the directed line graph, an edge is constructed from one node of the directed line graph to another when the head of the first directed edge coincides with the tail of the second in the graph, and the angle between the two directed edges serves as the weight of that edge of the directed line graph;
the graph classification element divides the directed line graph generated by the directed line graph element into a plurality of subgraphs; specifically, the angle interval (0°, 180°] is divided into a plurality of equal-sized intervals, and for each interval the edges of the directed line graph whose weights fall outside that interval are deleted to obtain one subgraph;
the personality graph attention element consists of a concatenation layer, a dense layer whose activation function is tanh(), a multiplication layer and an addition layer, wherein the concatenation layer concatenates the attribute of a node and the attribute of one of its neighbor nodes into a vector; the dense layer maps that vector to a vector whose dimension matches the dimension of the neighbor node's attribute; the multiplication layer takes the element-wise product of the vector generated by the dense layer and the attribute of the neighbor node to obtain the transfer vector of the neighbor; and the addition layer adds the transfer vector of each neighbor to the attribute of the node to obtain the local representation of the node;
the concatenation element concatenates the local representations of each node of the directed line graph to obtain a representation of the edge.
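One personality graph attention element, applied to a single angle subgraph of the directed line graph, might be sketched as below. The activation is assumed to be tanh, matching the head elements of the edge-to-node layer; the neighbor dictionary and the parameters `W` and `b` are hypothetical.

```python
import numpy as np

def personality_attention(x, neighbors, W, b):
    """Compute local node representations on one subgraph (illustrative sketch).

    x:         (n, d) attributes of the line-graph nodes.
    neighbors: dict mapping a node id to the list of its neighbor ids in this subgraph.
    W, b:      dense-layer parameters of shape (2d, d) and (d,).
    """
    out = x.copy()
    for v, nbrs in neighbors.items():
        for u in nbrs:
            # Concatenation layer: node attribute followed by neighbor attribute.
            cat = np.concatenate([x[v], x[u]])
            # Dense layer with tanh: a gate vector with the neighbor's dimension.
            gate = np.tanh(cat @ W + b)
            # Multiplication layer (element-wise) and addition layer.
            out[v] += gate * x[u]
    return out

def concat_local(local_reps):
    """Concatenation element: join the per-subgraph local representations of each node."""
    return np.concatenate(local_reps, axis=1)
```

Unlike a scalar attention weight, the gate here has one entry per feature, so each feature of a neighbor can be passed, suppressed, or sign-flipped independently.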
12. The method of claim 10, wherein the edge-to-node layer is a multi-head adaptive graph attention mechanism comprising a plurality of head elements and a mean element, wherein each head element generates a representation of each node using the node representations (attributes) and the edge representations in the input of the edge-to-node layer, and the mean element averages the representations generated for each node over the plurality of head elements to obtain the representation of that node;
each head element consists of an initial-edge linear layer, an edge linear layer, a node linear layer, a concatenation layer, a dense layer whose activation function is tanh(), a multiplication layer and an addition layer, wherein the initial-edge linear layer, the edge linear layer and the node linear layer respectively embed the initial edge representation, the edge representation and the center node representation (attribute) in the input of the edge-to-node layer into the same vector space; the concatenation layer concatenates the embedding of the center node, the embedding of the edge representation and the embedding of the initial edge representation; the dense layer maps the concatenation to a real number in the interval [-1, 1]; the multiplication layer performs scalar multiplication of that real number with the embedding of the edge representation to obtain the transfer vector of the edge; and the addition layer adds the transfer vector of each neighboring edge to the embedding of the center node.
CN202310122433.3A 2023-01-19 2023-01-19 System and method for predicting protein-ligand binding affinity based on filtration curvature Active CN116312864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310122433.3A CN116312864B (en) 2023-01-19 2023-01-19 System and method for predicting protein-ligand binding affinity based on filtration curvature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310122433.3A CN116312864B (en) 2023-01-19 2023-01-19 System and method for predicting protein-ligand binding affinity based on filtration curvature

Publications (2)

Publication Number Publication Date
CN116312864A CN116312864A (en) 2023-06-23
CN116312864B CN116312864B (en) 2023-10-27

Family

ID=86784313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310122433.3A Active CN116312864B (en) 2023-01-19 2023-01-19 System and method for predicting protein-ligand binding affinity based on filtration curvature

Country Status (1)

Country Link
CN (1) CN116312864B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021061638A1 (en) * 2019-09-23 2021-04-01 Chan Zuckerberg Biohub, Inc. Methods related to a structure of high-affinity human pd-1/pd-l2 complex
CN113728390A (en) * 2019-01-04 2021-11-30 思科利康有限公司 Methods and systems for predicting drug binding using synthetic data
CN113744799A (en) * 2021-09-06 2021-12-03 中南大学 End-to-end learning-based compound and protein interaction and affinity prediction method
CN115116538A (en) * 2022-04-07 2022-09-27 腾讯科技(深圳)有限公司 Protein ligand affinity prediction method, related device and equipment
CN115274006A (en) * 2022-08-05 2022-11-01 石家庄鲜虞数字生物科技有限公司 Affinity prediction method based on protein-ligand complex structure information
CN115512785A (en) * 2022-09-01 2022-12-23 中国海洋大学 Attention mechanism-based three-dimensional protein-ligand activity prediction method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019191777A1 (en) * 2018-03-30 2019-10-03 Board Of Trustees Of Michigan State University Systems and methods for drug design and discovery comprising applications of machine learning with differential geometric modeling
US20220383992A1 (en) * 2018-07-17 2022-12-01 Kuano Ltd. Machine learning based methods of analysing drug-like molecules

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CurvAGN: Curvature-based Adaptive Graph Neural Networks for Predicting Protein-Ligand Binding Affinity;Jianqiu Wu et al.;《https://www.researchsquare.com/article/rs-3141023/v1》;1-21 *
Improved protein-ligand binding affinity prediction by using a curvature-dependent surface-area model;Yang Cao et al.;《Bioinformatics》;Vol. 30 (No. 12);1-7 *
Improved Protein–Ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference;Derek Jones et al.;《Journal of Chemical Information and Modeling》;Vol. 61 (No. 4);1583-1592 *
Exploring the effect of amino acid mutation or ligand binding on enzyme activity by combining molecular dynamics simulation with deep learning;Zhu Jingxuan;《China Doctoral Dissertations Full-text Database, Basic Sciences》;A006-121 *

Also Published As

Publication number Publication date
CN116312864A (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant