CN116959555A - Method and system for compound-protein affinity prediction based on protein three-dimensional structure - Google Patents


Info

Publication number
CN116959555A
CN116959555A (application CN202210828457.6A)
Authority
CN
China
Prior art keywords
protein
compound
sequence
affinity
features
Prior art date
Legal status
Pending
Application number
CN202210828457.6A
Other languages
Chinese (zh)
Inventor
王绪化
郭滨杰
郑涵予
江昊翰
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Publication of CN116959555A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 - ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention provides a method for predicting compound-protein affinity based on the three-dimensional structure of a protein, comprising the following steps: S1, a compound feature extraction step, in which updated atom features and aggregation-node features are obtained using a deep graph convolutional network and a multi-head attention algorithm; S2, a protein feature extraction step, in which a feature aggregation algorithm and a co-evolution strategy make the sequence features and structural features of the protein represent more complete protein information; S3, a compound-protein affinity prediction step, in which a predicted affinity value is obtained from the atom features, the aggregation-node features, and the sequence features. The invention also discloses a corresponding system comprising a compound extractor, a protein extractor, and an affinity predictor. The invention uses a discretized distance matrix and a torsion-angle matrix to represent the protein three-dimensional structure, introduces a co-evolution update mechanism to update features between the protein three-dimensional structure and its sequence, and uses the aggregation-node features to improve the accuracy of affinity prediction.

Description

Method and system for compound-protein affinity prediction based on protein three-dimensional structure
Technical Field
The invention relates to the field of drug development, in particular to a method and a system for predicting compound-protein affinity based on a protein three-dimensional structure.
Background
Accurate prediction of the binding affinity between compounds and proteins is a major challenge in virtual screening of drug candidates. Using computational methods for effective high-throughput virtual screening of lead compounds can reduce development time and experimental workload, thereby greatly accelerating the drug development process. In recent years, with advances in technology, more and more data-sharing projects are being pursued, and a large number of new biomedical data sources have become available. For example, PubChem currently contains more than 111 million compound structures and 271 million bioactivity data points (see https://doi.org/10.1093/nar/gkaa971), which greatly increases the potential for compound-protein binding affinity (CPA) prediction. However, further improving the accuracy of CPA prediction remains a major challenge before virtual drug screening techniques can be applied in practice. To increase the accuracy of CPA prediction, computational methods have developed essentially along two paths: structure-free and structure-based. Although both approaches help improve CPA prediction accuracy, both also face challenges in improving it further.
The structure-free model treats a compound (drug) as a combination of atoms and a protein (target) as a combination of residues, and computes the pairwise atom-residue distances to form a [number of atoms] x [number of residues] pairwise interaction matrix; below a certain threshold, a compound atom is considered to be in contact (i.e., to have a non-covalent interaction) with the corresponding protein residue. In the pairwise interaction matrix, each element is a binary value indicating whether the corresponding atom-residue pair interacts. These structure-free models rely on the sequences of the drug and the target to learn interactions from the pairwise matrix in order to predict the affinity between them. For example, multi-layer one-dimensional convolutional neural networks (1D-CNNs) are used to extract features from protein sequences, and these matrix representations are fed as input to another downstream deep learning (DL) model that ultimately learns the drug-target affinity. However, because the structure of a protein is dynamic, this interaction changes constantly with time and environment, and the acquired pairing information reflects only a particular state captured by electron microscopy. Thus, a structure-free model, which ignores the binding-site information in the three-dimensional spatial structure and extracts only a single type of protein feature (sequence features alone), is likely to be severely limited for CPA prediction.
Previous studies have shown that predicted pairwise non-covalent interaction results can be fed into a structure-free model to further improve its accuracy in CPA prediction. However, simply taking the area under the curve (AUC) as the evaluation index for the pairwise-matrix prediction can make the interpretation of the results misleading. The AUC is the area under the receiver operating characteristic (ROC) curve, whose Y-axis and X-axis measure the true positive rate and the false positive rate, used to evaluate the model's ability to predict true positives or true negatives; the curve is drawn by considering different thresholds that distinguish positive and negative values in a binary classification. As shown in FIG. 16, the compound-protein pairing matrix contains very many negative values and only a few positive values (i.e., compound-protein interactions), for example a dataset with 5 positives versus 11595 negatives. Even if the model predicts an entirely negative matrix, the AUC remains high (over 0.95) in this case. The predicted pairwise interaction matrix is an intermediate result generated during model training, and feeding it back into the original CPA prediction model is inappropriate. Mathematically, any intermediate result produced during the training of a machine learning model is merely a representation of the input features (i.e., compound sequences, protein sequences, and pairwise interaction matrices). Theoretically, when this representation is fed back together with its upstream features to predict the final response (i.e., CPA), it remains a representation of the original input features. In our experiments, we found that feeding intermediate results into the structure-free model together with the original features did not benefit CPA prediction, indicating that a structure-free model cannot improve CPA prediction accuracy without the input of new information. Thus, models built on this approach may find it difficult to further improve accuracy.
Regarding structure-based methods of predicting CPA, one of them is molecular docking. Docking is a bioinformatic modeling process between target protein fragments and ligands that takes into account the potential binding sites and the three-dimensional structure of protein-drug complexes during prediction. However, only a small number of protein 3D structures are currently available, which greatly limits the application of this approach. To overcome this limitation, methods based on machine learning (ML) algorithms are increasingly being proposed; these rely not only on the sequence data of the drug and the target but also on spatial three-dimensional information (e.g., three-dimensional coordinates, the distances between the drug and the target within the complex, and torsion-angle information between protein residues) to obtain more accurate CPA predictions. Although attractive in theory, developing a practical paradigm for CPA prediction using the three-dimensional structural information of proteins is very difficult.
At present, the main obstacle to applying protein three-dimensional information in computational models is the lack of a suitable method or model that can exploit it effectively. Some studies directly introduce three-dimensional structural information into traditional models but fail to obtain good prediction results, and most existing computational methods for predicting compound-protein affinity still use only sequence information, with no obvious improvement on some datasets. Developing a practical multi-modal model that reasonably exploits both the three-dimensional structural information and the sequence information of proteins to further improve the accuracy of CPA prediction therefore remains an urgent need.
Disclosure of Invention
To overcome the deficiencies of the prior art, the invention aims to provide a reasonable characterization of the three-dimensional structure of a protein and to develop a multi-modal end-to-end model capable of simultaneously aggregating the sequence information and structural information of the protein: the structural information and sequence information of the protein are characterized and aggregated, and a co-evolution attention algorithm is then applied to optimally extract the structural features and sequence features of the protein; a graph convolution layer with special residual connections is introduced to avoid the over-smoothing problem encountered by deep graph convolution, so that compound features are extracted more comprehensively; and an interactive attention algorithm is introduced to realize feature interaction between the protein and the compound in the affinity prediction module, so that potential interaction information between the protein and the compound is learned and the accuracy of affinity prediction is improved.
The invention provides a method for predicting compound-protein affinity based on the three-dimensional structure of a protein. Compound features, comprising atom features and aggregation-node features, are extracted from the compound; protein features implicitly containing protein three-dimensional information are extracted from the protein; and the compound features and protein features are passed through an affinity prediction algorithm to predict the compound-protein affinity. The method specifically comprises the following steps:
S1, a compound composite feature extraction step:
according to the compound characterization graph, initial atom features and aggregation-node features are obtained; the atom features and aggregation-node features are cyclically updated using a deep graph convolutional network and a multi-head attention mechanism, and the final atom features and aggregation-node features of the compound are output;
S2, a protein feature extraction step:
protein sequence information and protein structural features are acquired; according to the protein sequence information and protein structural features, a protein feature aggregation algorithm makes the embedded sequence features of the protein carry its structural features, giving embedded sequence features and embedded structural features; the embedded sequence features and embedded structural features are cyclically updated with a co-evolution strategy to finally obtain updated protein sequence features and protein structural features;
S3, a compound-protein affinity prediction step:
according to the atom features and aggregation-node features of the compound obtained in step S1 and the protein sequence features from step S2, a predicted affinity value is obtained through an affinity learning unit algorithm; the larger the affinity value, the greater the probability of affinity between the compound and the protein.
Preferably, the step S1 specifically includes:
S11, obtaining atom features and aggregation-node features according to the compound characterization graph;
S12, inputting the atom features into the deep graph convolutional network, adding the output of the deep graph convolutional network to the aggregation-node features according to a first compound-feature weight, and combining with the atom features through a first gated recurrent unit (GRU) to obtain updated atom features;
S13, inputting the atom features into the multi-head attention characterization algorithm, adding the output of the multi-head attention characterization algorithm to the aggregation-node features according to a second compound-feature weight, and combining with the aggregation-node features through a second gated recurrent unit (GRU) to obtain updated aggregation-node features;
S14, taking the updated atom features and updated aggregation-node features as the atom features and aggregation-node features, repeating steps S12-S13 K times, and outputting the atom features and aggregation-node features.
Preferably, the deep graph convolutional network in step S1 has a residual connection loop for aggregating the neighbor information of each atom onto that atom while avoiding the over-smoothing problem caused by deepening the network; the multi-head attention characterization algorithm acquires diversified compound features using the inter-layer information before and after the GRU aggregation, thereby improving the accuracy of affinity prediction in the affinity prediction step.
Preferably, the step S2 specifically includes:
S21, acquiring protein sequence information and protein structural features, the protein structural features comprising at least a discretized distance matrix;
S22, a protein feature aggregation algorithm, specifically: the protein sequence information passes through a first embedding layer to obtain a sequence vector, and the discretized distance matrix passes through a second embedding layer to obtain a discrete-distance-matrix feature vector, which serves as the embedded distance matrix after feature embedding;
S23, realizing the co-evolution strategy with N layers of protein encoding algorithms, each layer being identical; one layer of the protein encoding algorithm specifically comprises:
the two results of the embedded distance matrix after row-wise and column-wise summation are fused through a gated recurrent unit to obtain spliced embedded structure information; the spliced embedded structure information enters a diversity convolution layer to obtain diversified protein structural features, so as to learn protein features with more diverse characteristics;
the embedded sequence features pass in turn through a diversity convolution layer and an ordinary convolution layer to obtain diversified protein sequence features; the diversified protein structural features and the diversified protein sequence features are added through gating logic and then pass successively through a gated recurrent unit and a convolution layer to obtain a structural feature vector carrying both structural information and sequence information;
the structural feature vector outputs the protein structural features through a self-combination transformation (outer sum);
the embedded and updated sequence features pass through a diversity convolution layer to obtain diversified embedded-updated sequence features; the diversified embedded-updated sequence features and the diversified protein structural features are added through gating logic to obtain a sequence feature vector fusing structural information and sequence information; the diversified protein sequence features are further fused with this sequence feature vector through a gated recurrent unit to output the protein sequence features;
S24, the new embedded distance matrix output by the protein encoding algorithm is the required embedded distance matrix; the obtained embedded distance matrix and embedded sequence features are used as the input of the next protein encoding algorithm, and the embedded distance matrix and embedded sequence features are continuously updated until all N protein encoding algorithms have been executed, at which point the output embedded distance matrix constitutes the protein structural features and the embedded sequence features constitute the protein sequence features.
Preferably, the protein structural features in step S21 further comprise a feature auxiliary matrix; the embedded sequence features obtained in step S22 are further added to the feature auxiliary matrix through gating logic to obtain embedded and updated sequence features. The diversity convolution layer in step S23 divides the input feature vector equally into four parts along the feature dimension, passes them through four parallel ordinary convolution layers, sums the outputs, and at the same time aggregates the initial features of the input sequence information through a residual connection to obtain the diversified feature vector.
Preferably, the feature auxiliary matrix is a torsion-angle matrix constructed from the sine and cosine values of the dihedral angles φ and ψ formed by the α-carbon atoms and the amino and carboxyl groups of the protein backbone chain.
Preferably, the discretized distance matrix in step S21 is obtained by dividing the inter-residue distances of the stabilized protein structure into M equal-width mapping intervals that essentially follow a normal distribution statistically, so as to realize discretized encoding of the distance matrix, where M is a positive integer.
Preferably, the step S3 specifically includes:
S31, receiving the atom features and aggregation-node features of the compound from step S1 and passing them as compound information into the affinity learning unit, where the aggregation-node features are updated through a linear layer to obtain updated aggregation-node features; the compound (atom) features are updated through a linear layer, averaged over the atom dimension, and concatenated with the updated aggregation-node features to obtain the compound comprehensive features;
S32, receiving the protein sequence features from step S2 and passing them as protein information into the affinity learning unit, where the protein sequence features are updated through a convolution layer and averaged over the residue dimension to obtain the protein comprehensive features;
the protein comprehensive features and the compound comprehensive features are combined by matrix multiplication (matmul) and fused by a linear layer to obtain the predicted affinity value.
The invention also discloses a system for predicting compound-protein affinity based on the three-dimensional structure of a protein, comprising: a compound extractor, a protein extractor, and an affinity predictor;
the compound extractor updates the atom features and aggregation-node features of the compound using a deep graph convolutional network and a multi-head attention algorithm according to the input atom features and aggregation-node features, cyclically outputs the final atom features and aggregation-node features, and transmits them to the affinity predictor;
the protein extractor obtains embedded sequence features and embedded structural features through a protein feature aggregation algorithm according to the input protein sequence features and protein structural features, and updates them with a co-evolution strategy to obtain updated protein sequence features and protein structural features;
the affinity predictor receives the compound features, the protein sequence features, and the aggregation-node features, and obtains a predicted affinity value through the affinity learning unit algorithm.
Preferably, the compound extractor comprises a deep graph convolution unit for implementing the deep graph convolutional network and a multi-head attention characterization unit for implementing the multi-head attention algorithm; the protein extractor comprises a protein information aggregation unit for implementing the protein feature aggregation algorithm and a co-evolution update unit for implementing the co-evolution strategy; and the affinity predictor comprises an affinity learning unit.
Compared with the prior art, the invention has the following beneficial effects:
1. When a distance matrix is used to record the distances between protein residues, a discretized distance matrix is used, so that the dimensionality of the data characterization vector can be reduced in the word-embedding stage; and the torsion angles φ and ψ between the α-carbon atoms and the amino and carboxyl groups of the protein backbone are used to construct a torsion matrix, so that the sequence orientation is accurately represented, the three-dimensional structure of the protein is fully characterized, and the problem of high-dimensional protein representation is solved.
2. A co-evolution strategy is introduced to cooperatively update the sequence information and structural information of the protein and strengthen the association between them, so that more comprehensive and diverse feature information is extracted from the final protein characterization vector to characterize the protein.
3. A residual structure carrying the initial graph-node information is introduced to avoid the over-smoothing problem; on this basis an aggregation node is introduced to aggregate the global information of the compound, and, assisted by a gated update unit and the message-passing mechanism of the graph information during graph convolution, interactive information updating between every atom node and the aggregation node in the graph structure is realized.
4. An affinity learning unit is introduced into the affinity prediction module to aggregate the features of the protein and the compound, so as to learn the potential interaction information between them more comprehensively and improve the accuracy of affinity prediction.
Drawings
FIG. 1 is a schematic flow chart of a method for predicting compound-protein affinity based on the three-dimensional structure of a protein;
FIG. 2 is a flowchart showing specific steps for compound feature extraction;
FIG. 3 is a schematic flow diagram of the deep graph convolutional network;
FIG. 4 is a flowchart of specific steps of a deep graph convolutional network;
FIG. 5 is a schematic diagram of an extraction process of atomic features;
FIG. 6 is a schematic diagram of a protein feature extraction process;
FIG. 7 is a flowchart of specific steps of a protein feature aggregation algorithm;
FIG. 8 is a flowchart showing the specific steps of a single layer protein encoding algorithm;
FIG. 9 is a schematic flow diagram of a diversity convolution layer;
FIG. 10 is a schematic flow chart of a single layer protein encoding algorithm;
FIG. 11 is a flowchart showing specific steps of an affinity learning unit algorithm;
FIG. 12 is a comparison of model performance over a dataset (small dataset) based on molecular fingerprints after similar clustering of compounds;
FIG. 13 is a comparison of model performance on a dataset (small dataset) after homologous clustering of proteins based on a multiple-sequence-alignment strategy;
FIG. 14 is a comparison of the convergence of models during training (model convergence rate versus final convergence position);
FIG. 15 shows the effect of deepening the number of graph convolution layers on each performance evaluation index;
FIG. 16 is a diagram illustrating the problem of a conventional structure-free model in performance evaluation.
Detailed Description
For a better understanding of the technical solution of the present invention, the following detailed description of the specific embodiments of the present invention refers to the accompanying drawings and examples. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The invention provides a method for predicting compound-protein affinity based on the three-dimensional structure of a protein. To realize the method, an end-to-end model based on deep learning and the latest techniques is adopted; the model is named the fast evolutionary attention and deep graph neural network (FeatNN). The method for predicting compound-protein affinity based on the protein three-dimensional structure realized by FeatNN is shown in FIG. 1, and the specific implementation steps are as follows:
S1, compound composite feature extraction step
The compound composite feature extraction step is performed by the compound extractor.
The compound features include atom features and bond features; aggregation-node features are also introduced to better characterize the compound. The compound features and the aggregation-node features are collectively referred to as the compound composite features.
According to the compound characterization graph G = {atom, bond}, initial atom features and aggregation-node features are obtained; the atom features and aggregation-node features are updated through an attention mechanism using a deep graph convolutional network and a multi-head attention algorithm, and the final atom features and aggregation-node features are cyclically output. Because the atom features already incorporate the bond features, the output atom features can represent the compound features. The specific steps are shown in FIG. 2.
The inventive point of this step lies in the use of a special residual connection loop in the deep graph convolutional network to aggregate the neighbor information of each atom onto that atom while avoiding the over-smoothing problem caused by deepening the network, the use of GRUs to aggregate information between successive layers of the network, the use of a multi-head attention characterization algorithm to acquire more diversified compound features, the introduction of aggregation-node features, and the use of an interactive evolution mechanism to update the aggregation-node features and atom features.
To clearly describe the compound composite feature extraction procedure, the following definitions are given first. According to its molecular formula, the compound can be converted into a compound characterization graph G = {atom, bond}, where atom denotes an atom and bond denotes a bond. Atoms carry features of element name, aromaticity, vertex degree, and valence; bonds carry bond-type and shape features; the values of atom and bond can each be feature-encoded with a one-hot encoding strategy. From the compound characterization graph G, the features of each atom and bond in the compound are obtained as follows: for each atom a_i (i = 1, 2, ..., N_a) and each bond b_j (j = 1, 2, ..., N_b), the original atom features are defined as v_i^(0), and the sum of the atom features is called the aggregation-node feature, i.e., u^(0) = Σ_i v_i^(0). Considering the graph convolution layer index l_c and the number of attention heads k_c, the atom feature and aggregation-node feature at layer l_c are written v_i^(l_c) and u^(l_c), and the variable V with k_c heads is written V^(k_c), where l_c = 1, 2, ..., l_comp and k_c = 1, 2, ..., k_comp.
More specifically, from the graphical representation G = {V, E} of the compound, each node (i.e., atom) v_i ∈ V is initially represented by a feature vector of length 82, which is the concatenation of word embeddings representing the corresponding atomic symbol, degree, explicit valence, implicit valence, and aromaticity. Each edge (i.e., bond) e_i ∈ E initially consists of a feature vector of length 6, which is the concatenation of word embeddings representing the bond type, namely single, double, triple, aromatic, conjugated, and in-ring.
First, the atom features of the compound are passed through a linear layer and then fed into the atom aggregation function, the deep graph convolutional network, the multi-head attention characterization algorithm, and a first gated recurrent unit (GRU); the atom aggregation function yields the initial aggregation-node features. The aggregation-node features and the output of the deep graph convolutional network are each passed through a linear layer, added, and fed through a first activation function (Sigmoid) to obtain the first compound-feature weight; likewise, the aggregation-node features and the output of the multi-head attention characterization algorithm are each passed through a linear layer, added, and fed through the first activation function (Sigmoid) to obtain the second compound-feature weight. The output of the deep graph convolutional network and the aggregation-node features are then combined as a weighted sum using the first compound-feature weight and its complement (1 minus the first compound-feature weight); the result is fed into the first GRU, which outputs the atom features, which are thereby updated. Similarly, the aggregation-node features, after a linear layer, are added to the output of the multi-head attention characterization algorithm as a weighted sum using the second compound-feature weight and its complement (1 minus the second compound-feature weight); the result is fed into a second GRU, which outputs the aggregation-node features, which are thereby updated.
The updated aggregation-node features and atom features are then taken as input again for the next cycle; the aggregation-node features and atom features obtained after K cycles are the ones to be output, where K is a positive integer greater than 2 whose specific value is set as needed.
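The cyclic update just described can be illustrated with a minimal PyTorch sketch (this is not the patent's Algorithm 1; the layer sizes, the use of nn.MultiheadAttention as the multi-head attention branch, a plain linear layer standing in for the deep graph convolutional network, and the exact gate wiring are all assumptions):

```python
import torch
import torch.nn as nn

class CompoundUpdateLoop(nn.Module):
    """Sketch of the K-cycle atom / aggregation-node update (hypothetical sizes and wiring)."""
    def __init__(self, hidden=128, heads=4, K=3):
        super().__init__()
        self.K = K
        self.gcn = nn.Linear(hidden, hidden)        # stand-in for the deep graph convolutional network
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.gate1 = nn.Linear(hidden, hidden)      # produces the first compound-feature weight
        self.gate2 = nn.Linear(hidden, hidden)      # produces the second compound-feature weight
        self.gru_atom = nn.GRUCell(hidden, hidden)
        self.gru_node = nn.GRUCell(hidden, hidden)

    def forward(self, atom_feat, node_feat):
        # atom_feat: (N_atom, hidden); node_feat: (1, hidden) aggregation-node feature
        for _ in range(self.K):
            g_out = torch.relu(self.gcn(atom_feat))                        # graph-convolution branch
            a_out, _ = self.attn(node_feat.unsqueeze(0),                   # aggregation node attends to atoms
                                 atom_feat.unsqueeze(0), atom_feat.unsqueeze(0))
            a_out = a_out.squeeze(0)                                       # (1, hidden)
            w1 = torch.sigmoid(self.gate1(g_out) + self.gate1(node_feat))  # first compound-feature weight
            w2 = torch.sigmoid(self.gate2(a_out) + self.gate2(node_feat))  # second compound-feature weight
            atom_in = w1 * g_out + (1 - w1) * node_feat                    # weighted sum, broadcast over atoms
            node_in = w2 * a_out + (1 - w2) * node_feat
            atom_feat = self.gru_atom(atom_in, atom_feat)                  # updated atom features
            node_feat = self.gru_node(node_in, node_feat)                  # updated aggregation-node feature
        return atom_feat, node_feat

atoms = torch.randn(12, 128)                    # toy compound with 12 atoms
node = atoms.sum(dim=0, keepdim=True)           # initial aggregation node = sum of atom features
atoms_out, node_out = CompoundUpdateLoop()(atoms, node)
```

In the real model the `gcn` stand-in would be the deep graph convolutional network of FIG. 4 and the attention branch the multi-head attention characterization algorithm.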
Graph convolutional networks (GCNs) can be used in biological studies to predict drug-target affinity, but this approach always suffers from an over-smoothing problem that limits its depth: as the number of layers increases, the performance of a GCN may deteriorate because the node representations converge to similar values. Here, a special residual connection loop is adopted in the deep graph convolutional network to aggregate the neighbor information of each atom onto that atom, which avoids the over-smoothing problem caused by deepening the network and allows more information to be extracted by increasing the number of GCN layers.
The purpose of the deep graph convolutional network in the invention is to aggregate the neighbor information of atoms onto each atom in the manner shown in FIG. 3: for example, for the circled carbon atom C, the features of the surrounding nitrogen atom N, oxygen atom O, and bromine atom Br, together with the bond features between them, are all aggregated into the features of the carbon atom. The specific implementation steps are shown in FIG. 4: through a message-passing mechanism, the atom node features and the adjacency matrix aggregate and update the neighbor-node features of each atom onto that atom, assisted by a special residual connection loop that prevents over-smoothing as the network deepens, where α and θ are the hyperparameters of the residual connection loop. The atom feature extraction cycle is shown in FIG. 5, where M represents the multi-head attention aggregation operation and R represents the number of cycles of the atom diversity-feature construction loop.
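A minimal sketch of one such residual graph-convolution layer is given below; the exact update rule is an assumption modeled on initial-residual, identity-mapping graph convolutions, with α and θ playing the role of the hyperparameters mentioned above:

```python
import math
import torch
import torch.nn as nn

class ResidualGraphConv(nn.Module):
    """One graph-convolution layer with a residual connection to the initial atom features (sketch)."""
    def __init__(self, hidden, layer_idx, alpha=0.1, theta=0.5):
        super().__init__()
        self.alpha = alpha
        self.beta = math.log(theta / layer_idx + 1.0)    # layer-dependent identity-mapping weight
        self.weight = nn.Linear(hidden, hidden, bias=False)

    def forward(self, h, h0, adj):
        # h: (N_atom, hidden) current atom features; h0: initial atom features; adj: normalized adjacency
        support = (1 - self.alpha) * (adj @ h) + self.alpha * h0   # mix neighbour messages with the initial features
        out = (1 - self.beta) * support + self.beta * self.weight(support)
        return torch.relu(out)

# toy usage: 5 atoms, self-loop-only adjacency
h0 = torch.randn(5, 64)
layer = ResidualGraphConv(64, layer_idx=1)
h1 = layer(h0, h0, torch.eye(5))
```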
The multi-head attention characterization algorithm implements a multi-head attention mechanism (realized as in the prior art) and is used to increase the diversity of the compound atom-node features so that the compound information is aggregated more comprehensively, a more accurate compound characterization can be learned, and the accuracy of the affinity prediction is improved.
Thus, after repeated processing by the deep graph convolutional network and the multi-head attention, the atom features of the compound can already represent the compound features.
The role of a linear layer (Linear) is to map the input vector to a specified dimension for subsequent vector or matrix operations. All the linear layers mentioned in this application are distinct linear layers, set according to actual needs.
The compound composite feature extraction step is implemented in the computer using Algorithm 1 below. The number of layers of the deep graph convolutional network is set as required, and the graph convolution is implemented in the computer using Algorithm 2 below.
S2, protein feature extraction step;
The embedded sequence features and embedded structural features are obtained through the protein feature aggregation algorithm according to the protein sequence features and protein structural features, and the updated protein sequence features and protein structural features are obtained through the protein encoding algorithm adopting the co-evolution strategy, as shown in FIG. 6. The protein feature extraction step is implemented in the computer using Algorithm 3 below.
The inventive point of this step is that, using the protein structural features together with the protein feature aggregation algorithm and protein encoding algorithm provided by the invention, the structural features of the protein are embedded implicitly into the protein sequence features.
S21, acquiring the sequence features and structural features of the protein;
Protein sequence information can be obtained directly from the relevant databases.
The structural features are three-dimensional structural features, comprising at least a discretized distance matrix and optionally a structural auxiliary feature; in this embodiment the structural features comprise a discretized distance matrix and a torsion-angle matrix as the structural auxiliary feature.
The discretized distance matrix of the protein records the distances between residues when the protein is unfolded into a peptide chain, and is a symmetric matrix. Because the inter-residue distance is usually a Euclidean distance taking values in a continuous interval, a single value cannot represent the structural features well and is inconvenient for a machine learning model to understand; a discrete distance matrix is therefore proposed in which the distance matrix is mapped into equal-width, low-dimensional bins, i.e., the continuous interval is divided into several equal sub-intervals and values falling in the same sub-interval are set to the same value. This greatly reduces the data volume of the distance matrix and better reflects the inter-residue distance features. Compared with a traditional distance matrix, in which the distance values vary over a small interval with no obvious distinction and feature extraction is inconvenient when data are limited, the invention divides these poorly distinguished continuous values into equal-width mapping intervals that essentially follow a normal distribution statistically, thereby realizing discretized encoding of the distance matrix. In this embodiment, the distance matrix is mapped into 38 equal-width low-dimensional bins between 3.25 Å and 50.75 Å (Å being the angstrom, 1 Å = 10⁻¹⁰ m), with one additional bin to store any larger distance and one bin to store any smaller distance, giving 40 bins in total.
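A minimal sketch of this discretization is given below (the 3.25-50.75 Å range and the 38 + 2 bins follow the text; the exact placement of the bin edges is an assumption):

```python
import numpy as np

def discretize_distance_matrix(dist, d_min=3.25, d_max=50.75, n_bins=38):
    """Map a continuous residue-residue distance matrix (in angstroms) to integer bin indices.

    Distances below d_min fall into bin 0 and distances above d_max into bin n_bins + 1,
    giving n_bins + 2 = 40 bins in total.
    """
    edges = np.linspace(d_min, d_max, n_bins + 1)   # 38 equal-width intervals
    return np.digitize(dist, edges)                 # integer codes 0 .. n_bins + 1

# toy example: symmetric distance matrix for 6 residues
coords = np.random.rand(6, 3) * 30.0
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
bins = discretize_distance_matrix(dist)             # same shape, values in 0..39
```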
The torsion-angle matrix is constructed using the prior art. Considering that the three-dimensional structure of a protein may not be accurately described by the distance information between residues alone, the torsion-angle matrix is used as an aid so that the orientation of the protein sequence can be accurately represented. The torsion-angle matrix is constructed from the sine and cosine values of the dihedral angles φ and ψ between the α-carbon atoms of the protein backbone chain and the amino and carboxyl groups.
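A minimal sketch of building such a torsion-angle feature matrix from backbone φ/ψ angles (the per-residue layout [sin φ, cos φ, sin ψ, cos ψ] is an assumption):

```python
import numpy as np

def torsion_matrix(phi, psi):
    """Torsion-angle features from backbone dihedral angles given in radians.

    phi, psi: arrays of shape (N_res,).  Returns an (N_res, 4) matrix with
    [sin(phi), cos(phi), sin(psi), cos(psi)] per residue.
    """
    return np.stack([np.sin(phi), np.cos(phi), np.sin(psi), np.cos(psi)], axis=-1)

# toy example for 5 residues
phi = np.random.uniform(-np.pi, np.pi, 5)
psi = np.random.uniform(-np.pi, np.pi, 5)
tors = torsion_matrix(phi, psi)   # shape (5, 4)
```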
S22, obtaining the embedded sequence features and the embedded distance matrix from the protein sequence features and the three-dimensional structural features through the protein feature aggregation algorithm, as shown in FIG. 7, specifically as follows:
the protein sequence information is word-embedded by the first embedding layer to obtain the sequence features;
the discrete distance matrix is word-embedded by the second embedding layer to obtain the discrete-distance-matrix features, which are output as sequence features carrying the discrete-distance-matrix information.
The torsion-angle matrix is passed through a convolution layer for feature extraction to obtain the torsion-angle features; the torsion-angle features are passed through a linear layer and a first activation function (Sigmoid) to obtain the torsion-angle weight; and the torsion-angle features and the sequence features, with the torsion-angle weight as the gating weight, are added through gating logic to output the embedded and updated sequence features.
The protein feature aggregation algorithm is implemented in the computer using algorithm 5, which follows.
S23, obtaining the updated sequence features and updated structural features from the embedded sequence features and the embedded distance matrix through the co-evolution strategy;
the co-evolution strategy is implemented with N layers of protein encoding algorithms, each layer being identical; the specific steps of one layer of the protein encoding algorithm are shown in FIG. 8.
The two results of the embedded distance matrix after row-wise and column-wise summation are fused through a gated recurrent unit to obtain the spliced embedded structure information; the spliced embedded structure information enters a diversity convolution layer to obtain diversified protein structural features, so as to learn protein features with more diverse characteristics; meanwhile, the spliced embedded structure information passes through a linear layer and a second activation function (Sigmoid) to obtain the structure-information weight.
The embedded sequence features pass in turn through a diversity convolution layer and an ordinary convolution layer to obtain diversified protein sequence features; the diversified protein sequence features pass through a linear layer and a third activation function (Sigmoid) to obtain the sequence-information weight.
The diversified protein structural features and the diversified protein sequence features are added through a gating unit with the sequence-information weight as the gating weight, giving a structural feature vector of preliminarily aggregated sequence and structural information; in this addition, the diversified protein sequence features take the sequence-information weight and the diversified protein structural features take 1 minus the structure-information weight.
The structural feature vector of preliminarily aggregated sequence and structural information and the diversified protein structural features are updated along the sequence dimension through a gated recurrent unit, and the dimensional mapping is changed through a convolution layer, giving the structural feature vector that carries both structural and sequence information. After this structural feature vector is mapped to the specified dimension by a convolution layer, a combination transformation (outer sum) is performed on it with itself to obtain the protein structural features; this is realized with an outer-sum function, i.e., each element of the first input is combined in turn with the elements of the second input to form one row, until all elements of the first input have been traversed.
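A minimal sketch of the outer-sum combination, assuming the usual realization as an addition broadcast over all residue pairs:

```python
import torch

def outer_sum(x):
    """Combine a per-residue feature vector with itself over all residue pairs.

    x: (N_res, h) -> (N_res, N_res, h), where out[i, j] = x[i] + x[j].
    """
    return x.unsqueeze(1) + x.unsqueeze(0)

pairwise = outer_sum(torch.randn(10, 32))   # toy example: 10 residues, hidden size 32
print(pairwise.shape)                       # torch.Size([10, 10, 32])
```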
The embedded and updated sequence features pass through a diversity convolution layer to obtain diversified embedded-updated sequence features; with the structure-information weight as the gating weight, the diversified embedded-updated sequence features and the diversified protein structural features are added through gating logic to obtain a sequence feature vector fusing structural information and sequence information; the diversified embedded-updated sequence features are further fused with this sequence feature vector through a gated recurrent unit, and the protein sequence features are output. In this addition, the diversified protein structural features take the sequence-information weight and the diversified embedded-updated sequence features take 1 minus the sequence-information weight.
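A simplified sketch of one such gated co-evolution step is shown below (this is not the patent's Algorithm 4; the gate placement and layer sizes are assumptions, and the diversity convolutions and row/column summations are omitted):

```python
import torch
import torch.nn as nn

class GatedCoEvolutionStep(nn.Module):
    """Gate-mix diversified structure and sequence features, then fuse with a GRU cell (sketch)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.seq_gate = nn.Linear(hidden, hidden)      # yields the sequence-information weight
        self.struct_gate = nn.Linear(hidden, hidden)   # yields the structure-information weight
        self.fuse_seq = nn.GRUCell(hidden, hidden)

    def forward(self, seq_feat, struct_feat):
        # seq_feat, struct_feat: (N_res, hidden) diversified sequence / structure features
        w_seq = torch.sigmoid(self.seq_gate(seq_feat))
        w_struct = torch.sigmoid(self.struct_gate(struct_feat))

        # structure branch: sequence information gated in
        struct_vec = w_seq * seq_feat + (1 - w_seq) * struct_feat
        # sequence branch: structure information gated in
        seq_vec = w_struct * struct_feat + (1 - w_struct) * seq_feat

        # fuse the mixed sequence vector back into the sequence features
        seq_out = self.fuse_seq(seq_vec, seq_feat)
        return seq_out, struct_vec

seq, struct = torch.randn(30, 128), torch.randn(30, 128)   # toy protein with 30 residues
seq_out, struct_out = GatedCoEvolutionStep()(seq, struct)
```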
In this embodiment, the ordinary convolution layer performs a conventional one-dimensional convolution operation. The diversity convolution layer performs convolution operations in parallel and enhances the diversity of the learned features: it first divides the input feature vector equally into four parts along the feature dimension using an equal-division strategy, passes them into four parallel ordinary convolution layers, sums the outputs, and at the same time aggregates the initial features of the input sequence information through a residual connection to finally obtain the diversified feature vector, as shown in FIG. 9. The equal-division strategy divides the vector into four parts directly along the feature dimension, which requires that the number of features be divisible by four; it is implemented in the computer using Algorithm 9 below.
The diversity convolution layer is implemented in the computer using Algorithm 7 below.
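A minimal sketch of the diversity convolution layer is given below as a stand-in (it is not the patent's Algorithm 7; the channel count and kernel size are assumptions, and the feature dimension must be divisible by four):

```python
import torch
import torch.nn as nn

class DiversityConv(nn.Module):
    """Split the feature dimension into four parts, convolve them in parallel, sum, and add a residual."""
    def __init__(self, channels=128, kernel_size=3):
        super().__init__()
        assert channels % 4 == 0, "the feature dimension must be divisible by four"
        part = channels // 4
        self.branches = nn.ModuleList(
            nn.Conv1d(part, channels, kernel_size, padding=kernel_size // 2) for _ in range(4)
        )

    def forward(self, x):
        # x: (batch, channels, length)
        parts = torch.chunk(x, 4, dim=1)                  # equal split along the feature dimension
        out = sum(conv(p) for conv, p in zip(self.branches, parts))
        return out + x                                    # residual connection to the input features

feat = torch.randn(2, 128, 50)        # toy input: batch of 2, 128 features, 50 residues
out = DiversityConv(128)(feat)        # shape (2, 128, 50)
```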
At this point, the distance-matrix features output by the protein encoding algorithm are the required protein distance matrix, as shown in FIG. 10. The obtained distance matrix and sequence features are taken as the input of the next layer of the protein encoding algorithm, and the structural features and sequence features of the protein are continuously updated until all N layers of the protein encoding algorithm have been executed, at which point the structural features of the protein are already implicit in the protein sequence features. The protein sequence features at this point can therefore represent the protein features.
Protein encoding is implemented in the computer using algorithm 4, later.
The following describes "S3, compound and protein affinity prediction step" with reference to FIG. 11;
An affinity learning unit algorithm is introduced in the compound-protein affinity prediction step to learn the potential features of the interaction between the protein and the compound and to aggregate their respective features, finally predicting the affinity.
The atom features and aggregation-node features of the compound are received from step S1 as compound information and passed into the affinity learning unit; the aggregation-node features are updated through a linear layer to obtain updated aggregation-node features; the compound (atom) features are updated through a linear layer, averaged over the atom dimension, and concatenated with the updated aggregation-node features to obtain the compound comprehensive features.
The protein sequence features are received from step S2 as protein information and passed into the affinity learning unit; the protein sequence features are updated through a convolution layer and averaged over the residue dimension to obtain the protein comprehensive features.
The protein comprehensive features and the compound comprehensive features are combined by matrix multiplication (matmul) and fused by a linear layer to obtain the predicted affinity value. The basic flow is shown in FIG. 11 and is implemented in the computer using Algorithm 8 below.
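A minimal sketch of such an affinity learning unit is given below (it is not the patent's Algorithm 8; the layer sizes and the exact form of the matmul fusion are assumptions):

```python
import torch
import torch.nn as nn

class AffinityLearner(nn.Module):
    """Fuse compound and protein representations into a single predicted affinity value (sketch)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.node_proj = nn.Linear(hidden, hidden)
        self.atom_proj = nn.Linear(hidden, hidden)
        self.prot_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, atom_feat, node_feat, seq_feat):
        # atom_feat: (N_atom, hidden); node_feat: (hidden,); seq_feat: (N_res, hidden)
        node = self.node_proj(node_feat)                    # updated aggregation-node feature
        comp = self.atom_proj(atom_feat).mean(dim=0)        # average over the atom dimension
        comp_all = torch.cat([comp, node], dim=-1)          # compound comprehensive feature, (2h,)

        prot = self.prot_conv(seq_feat.t().unsqueeze(0))    # (1, hidden, N_res)
        prot = prot.mean(dim=-1).squeeze(0)                 # average over the residue dimension, (h,)

        fused = torch.matmul(comp_all.unsqueeze(-1), prot.unsqueeze(0))   # (2h, h) pairwise fusion
        return self.out(fused.mean(dim=-1))                 # scalar predicted affinity

affinity = AffinityLearner()(torch.randn(12, 128), torch.randn(128), torch.randn(30, 128))
```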
The inventive point of this step is to fuse the protein sequence features, which implicitly contain the protein structural features, with the atom features representing the compound features using the affinity learning unit algorithm provided by the invention, and, combined with the aggregation-node features, to finally predict the affinity value.
The compound and protein affinity prediction step is implemented in the computer using Algorithm 8 below.
The invention also provides a compound-protein affinity prediction system based on a protein three-dimensional structure, which is used for predicting the affinity of the compound-protein, and specifically comprises the following modules:
a first input module, a second input module, a compound extractor, a protein extractor, and an affinity predictor;
the first input module receives a compound characterization graph and obtains the compound atom features and aggregation-node features according to it;
the compound extractor receives the compound atom features and aggregation-node features, obtains updated compound atom features and aggregation-node features from this input, and transmits them to the affinity predictor;
the second input module receives the protein sequence features and protein structural features, where the protein structural features comprise the discretized distance matrix and the torsion-angle matrix, and sends the three received features to the protein extractor;
the protein extractor receives the protein sequence features and protein structural features, updates the protein sequence features, and sends them to the affinity predictor;
the affinity predictor receives the compound features, protein sequence features, and aggregation-node features, performs affinity prediction, and gives the predicted affinity value of the compound and protein.
Further, the compound extractor comprises a deep graph convolution unit and a multi-head attention characterization unit. The deep graph convolution unit implements the deep graph convolutional network and is used to aggregate the neighbor information of atoms onto the atoms; the multi-head attention characterization unit is used to extract more diverse compound features to improve the accuracy of affinity prediction.
The protein extractor comprises a protein information aggregation unit and a co-evolution update unit. The protein information aggregation unit implements the protein feature aggregation algorithm, so that the protein sequence features carry the structural information of the protein; the co-evolution update unit implements the co-evolution strategy, so that the extracted protein sequence features can represent the global information of the protein.
The affinity predictor consists of an affinity learning unit.
In the following, the model implementing the invention, named the fast evolutionary attention and deep graph neural network (FeatNN), is compared with the prior art using the method of the invention.
1. Construction of test data sets
In the study, two datasets were constructed based on PDBbind (version 2020, 23496 entries in total) and BindingDB (version of 6 February 2022, 41296 entries in total). The PDBbind database provides experimental binding affinity data for a total of 23496 biomolecular complexes in the PDB. In this example, the data-cleaned protein-ligand affinity data (12628 entries) with the different measurement types K_i, K_d, and IC50 were mainly used, split as training set : validation set : test set = 7 : 1 : 2, as the small-dataset training samples. BindingDB contains 241268 binding data for 8661 proteins and 1039940 small molecules; in this example, the data-cleaned protein-ligand affinity data of the IC50 measurement type (218615 entries) were mainly used, split as training set : validation set : test set = 7 : 1 : 2, as the large-dataset training samples.
The related data were checked for normality, since extreme data may affect the entire model. The BindingDB dataset contains some related data that are too extreme, so data outside (μ − 3σ, μ + 3σ) were eliminated.
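A minimal sketch of this data preparation is shown below (the 7:1:2 split and the (μ − 3σ, μ + 3σ) filter come from the text; the DataFrame layout, the column name, and the random split are assumptions):

```python
import numpy as np
import pandas as pd

def filter_and_split(df, label_col="affinity", seed=0):
    """Drop affinity values outside mu +/- 3*sigma, then split 7:1:2 into train/validation/test."""
    mu, sigma = df[label_col].mean(), df[label_col].std()
    kept = df[(df[label_col] >= mu - 3 * sigma) & (df[label_col] <= mu + 3 * sigma)]

    idx = np.random.default_rng(seed).permutation(len(kept))
    n_train, n_valid = int(0.7 * len(kept)), int(0.1 * len(kept))
    train = kept.iloc[idx[:n_train]]
    valid = kept.iloc[idx[n_train:n_train + n_valid]]
    test = kept.iloc[idx[n_train + n_valid:]]
    return train, valid, test
```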
For the distance matrix and the torsion-angle matrix, protein structure data were acquired by searching the RCSB PDB (Protein Data Bank) database using the PDB IDs provided by the two databases. The RCSB PDB files provide sufficient 3D structural data (spatial positions), from which the distance matrix used in the model can be extracted and the torsion matrix calculated.
For the SMILES sequences, data were obtained from the RCSB PDB based on the ligand ID.
2. An evaluation index is determined and a determination is made,
the following indices are used to evaluate the predictive performance of the model: r is R 2 RMSE (root mean square error), MAE (absolute average error), medAE, pearson correlation coefficient, spearman rank correlation coefficient; wherein R is 2 RMSE (root mean square error), MAE (absolute average error) describe the distance between the predicted value and the true value; the pearson product moment correlation coefficient and the spearman rank correlation coefficient describe the correlation between the predicted value and the true value.
The specific description of each index is as follows:
R² is a dimensionless score describing the validity of the model; it compares the predictions with a random guess based on the average of the true values;
RMSE is the square root of MSE and is commonly used as the training loss in machine learning studies;
MAE describes the mean absolute error between the predicted values and the true values;
MedAE is the median absolute error, i.e. the median of the absolute residuals;
the Pearson correlation coefficient describes the linear correlation between two variables;
the Spearman rank correlation coefficient is the rank-based form of the Pearson correlation coefficient; it describes the correlation of two variables (e.g., when one variable increases, the other also increases) and is related to the monotonicity of the relation between them.
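For clarity, the six indices above can be computed with NumPy, SciPy and scikit-learn as in the following sketch (illustrative only; not the exact evaluation code of the invention).

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the six evaluation indices used in this study."""
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MedAE": median_absolute_error(y_true, y_pred),
        "Pearson": pearsonr(y_true, y_pred)[0],
        "Spearman": spearmanr(y_true, y_pred)[0],
    }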
3. Evaluation of FeatNN, BACPI, GraphDTA (GATNet, GAT-GCN, GCNNet, GINConvNet) and MONN
FIG. 12 shows the results of FeatNN of the present invention tested on datasets obtained after clustering the compounds, under the IC50 and KIKD measurement criteria respectively; as can be seen from FIG. 12, FeatNN is superior to the other models on all three indices (i.e., RMSE, Pearson and R²).
FIG. 13 shows the results of FeatNN of the present invention tested on novel protein data; as can be seen from FIG. 13, FeatNN is also superior to the other models on the three indices (i.e., RMSE, Pearson and R²).
FIG. 14 compares the convergence of the models during training (convergence rate and final convergence position); it can be seen from the figure that FeatNN of the present invention converges best.
FIG. 15: to show the advantage of deep graph convolution, the present invention also performs graph convolution at different depths, and it can be seen how each performance evaluation index changes with the depth of the graph convolution layers.
Table 1 shows the performance evaluation of MONN, GraphDTA (GAT-GCN, GCNNet, GINConvNet, GATNet), BACPI and the FeatNN model of the invention, trained on BindingDB (IC50 measurement standard, 218615 pieces of data);
TABLE 1
Tables 2 and 3 show the performance evaluation of FeatNN of the invention and the prior-art models after homology-based clustering of the compounds (Table 2) and of the proteins (Table 3), respectively, on the small PDBbind dataset (containing the KIKD and IC50 measurement criteria, 12628 pieces of data), under the different measurement criteria (KIKD or IC50).
TABLE 2
TABLE 3
To further illustrate the present invention, pseudo code implementing its main modules is provided below for a better understanding of the methods presented herein.
First, the main variable definitions in the code are shown in Table 4.
Table 4 common variable definitions
Each symbol definition is labeled with a name, and the subscripts carry the vector dimension information.
In addition, N_res denotes the number of residues in the input main sequence (clipped during training), N_tors denotes the torsion size, N_ebd_s denotes the embedding size, h denotes the hidden size, N_atom denotes the number of atoms, N_bond denotes the number of bonds, F_atom denotes the atom features, F_bond denotes the bond features, and nbs is the neighbor dimension. Capitalized operator names are used when the operators encapsulate learnable parameters; for example, Linear denotes a linear transformation with a weight matrix W and a bias vector b.
Layer normalization operating over the channel dimension is denoted by LayerNorm, whose per-channel gain and bias can be learned. Capitalized names are also used for stochastic operators. For functions without parameters, the operator name is written in lower case. The operations and variables are defined as follows: ⊗ denotes multiplication by a scalar, ⊕ denotes addition, ⊙ denotes the Hadamard (element-wise) product, a^T b denotes the dot product of two vectors, and [·,·]_h denotes concatenation along the hidden dimension. Let h denote the hidden size, k the number of attention heads and k_h the size of each attention head, where k_h = h/k. softmax(x_i) = exp(x_i)/Σ_i exp(x_i) denotes the softmax activation function; σ(·) denotes the sigmoid activation function, σ(x) = 1/(1+e^(−x)); tanh(·) denotes the hyperbolic tangent activation function, tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)); GELU denotes the Gaussian error linear unit, with f(x) = 0.5·x·(1 + erf(x/√2)), where erf(x) = (2/√π)·∫₀^x e^(−t²) dt.
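The parameter-free activation functions above can be restated compactly in Python/NumPy as in the following sketch (tanh is available directly as np.tanh); this only rewrites the standard formulas and is not code of the invention.

import numpy as np
from scipy.special import erf

def softmax(x):
    # softmax(x_i) = exp(x_i) / sum_i exp(x_i); shifted by max(x) for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # f(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))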
The algorithm of each module is as follows:
algorithm 1-graph convolution Graph Convolution
con_neighbor=concat_neighbor(ver_neighbor,edge_neighbor)
neighbor_label=gelu(Linear(con_neighbor))
output=theta⊙Linear(support)+(1-theta)⊙support
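A minimal PyTorch-style sketch of this gated graph-convolution step is given below. The tensor shapes (atom features [N_atom, h], bond features [N_atom, nbs, h], neighbor indices [N_atom, nbs]) and the definition of "support" as the atom features aggregated with their neighbor messages are assumptions made only for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvLayer(nn.Module):
    """One gated graph-convolution step, sketched after Algorithm 1."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # linear layer over the concatenated neighbor atom and bond features
        self.neighbor_linear = nn.Linear(2 * hidden_size, hidden_size)
        self.support_linear = nn.Linear(hidden_size, hidden_size)
        self.gate = nn.Linear(hidden_size, hidden_size)  # produces theta

    def forward(self, atom_feat, bond_feat, neighbor_idx):
        # atom_feat: [N_atom, h]; bond_feat: [N_atom, nbs, h]; neighbor_idx: [N_atom, nbs]
        ver_neighbor = atom_feat[neighbor_idx]                      # [N_atom, nbs, h]
        con_neighbor = torch.cat([ver_neighbor, bond_feat], dim=-1)
        neighbor_label = F.gelu(self.neighbor_linear(con_neighbor))
        # aggregate neighbor messages onto each atom ("support", an assumption here)
        support = atom_feat + neighbor_label.sum(dim=1)             # [N_atom, h]
        theta = torch.sigmoid(self.gate(support))                   # gating weights
        # gated residual combination, as in the last line of Algorithm 1
        return theta * self.support_linear(support) + (1.0 - theta) * support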
Algorithm 2 compound composite feature extraction CompoundExtractor
Algorithm 3 protein feature extraction ProteinExtractor
Algorithm 4 protein encoding ProteinEncoder
Algorithm 5 protein feature aggregation Protein Aggregating
seq_embed=Embedding(Seq_init)
torsion_embed=CNN_{Nts→h1}(TorMat_init)
torsion_vector=CNN_{h1→h}(torsion_embed)
gate=Sigmoid(Linear(torsion_vector))
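A PyTorch-style sketch of this aggregation step follows. The vocabulary size, channel sizes and the final gated fusion of the sequence embedding with the torsion vector are assumptions; the algorithm above is shown only up to the gate computation.

import torch
import torch.nn as nn

class ProteinAggregation(nn.Module):
    """Sketch of Algorithm 5: embed the residue sequence, encode the torsion matrix
    with two 1-D convolutions, and gate the sequence embedding with that structural signal."""
    def __init__(self, vocab_size: int = 26, n_tors: int = 4, h1: int = 64, h: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, h)
        self.cnn1 = nn.Conv1d(n_tors, h1, kernel_size=3, padding=1)
        self.cnn2 = nn.Conv1d(h1, h, kernel_size=3, padding=1)
        self.gate_linear = nn.Linear(h, h)

    def forward(self, seq_ids, torsion_matrix):
        # seq_ids: [N_res] integer residue codes; torsion_matrix: [N_res, N_tors]
        seq_embed = self.embedding(seq_ids)                                  # [N_res, h]
        t = torsion_matrix.transpose(0, 1).unsqueeze(0)                      # [1, N_tors, N_res]
        torsion_vector = self.cnn2(self.cnn1(t)).squeeze(0).transpose(0, 1)  # [N_res, h]
        gate = torch.sigmoid(self.gate_linear(torsion_vector))
        # assumed gated fusion: the sequence embedding absorbs the torsion (structure) information
        return gate * seq_embed + (1.0 - gate) * torsion_vector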
Algorithm 6 Single layer protein encoding EvoUpdating
MixKey=GRU(PairKey1,PairKey2)
Struct_Features=DivCNN(MixKey)
SeqGate=Sigmoid(Linear(Seq2Struct))
StructGate=Sigmoid(Linear(MixKey))
Seq2Struct_Vector=SeqGate⊙Seq2Struct+(1-SeqGate)⊙Struct_Features
Struct_Vector=GRU(Seq2Struct_Vector,Struct_Features)
Struct2Seq_Mapping=DeepSparseCNN(Struct_Vector)
Seq_Vector=StructGate⊙Struct_Features+(1-StructGate)⊙Seq_Features
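The single-layer co-evolution update can be sketched in PyTorch as below. Here PairKey1/PairKey2 are assumed to be the row-wise and column-wise reductions of the embedded distance matrix, Seq2Struct the sequence features mapped into the structure space, DivCNN the diversity convolution layer (a sketch of which follows Algorithm 7), and DeepSparseCNN is approximated by a dilated 1-D convolution; all of these are illustrative assumptions, not the exact implementation.

import torch
import torch.nn as nn

class EvoUpdating(nn.Module):
    """Sketch of one co-evolution layer (Algorithm 6)."""
    def __init__(self, h: int = 128, k: int = 4):
        super().__init__()
        self.mix_gru = nn.GRUCell(h, h)
        self.struct_gru = nn.GRUCell(h, h)
        self.div_cnn = DivCNN(h, k)                 # diversity convolution, sketched below
        self.seq_gate = nn.Linear(h, h)
        self.struct_gate = nn.Linear(h, h)
        self.deep_sparse_cnn = nn.Conv1d(h, h, kernel_size=3, padding=2, dilation=2)

    def forward(self, pair_key1, pair_key2, seq2struct, seq_features):
        # all inputs: [N_res, h]
        mix_key = self.mix_gru(pair_key1, pair_key2)
        struct_features = self.div_cnn(mix_key)
        seq_gate = torch.sigmoid(self.seq_gate(seq2struct))
        struct_gate = torch.sigmoid(self.struct_gate(mix_key))
        seq2struct_vec = seq_gate * seq2struct + (1 - seq_gate) * struct_features
        struct_vector = self.struct_gru(seq2struct_vec, struct_features)
        struct2seq = self.deep_sparse_cnn(struct_vector.T.unsqueeze(0)).squeeze(0).T
        seq_vector = struct_gate * struct_features + (1 - struct_gate) * seq_features
        return struct_vector, seq_vector, struct2seq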
Algorithm 7 diversity convolution layer DivCNN
def DivCNN(x):
x0=CNN_{kernel=1}(x)
# x(seq,hidden_size) → x_k-head(seq,k,head_size)
# where k is the number of heads, and head_size=hidden_size/k
x_k-head=TransposeForScores(x)
return→{x_total}
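Based on the description of the diversity convolution layer in claim 5 below (split the feature dimension into equal parts, apply parallel convolutions, sum the results and add a residual connection), a PyTorch-style sketch could look as follows; the kernel size and the way the heads are recombined are assumptions.

import torch
import torch.nn as nn

class DivCNN(nn.Module):
    """Sketch of the diversity convolution layer (Algorithm 7): split the hidden
    dimension into k heads, convolve each head in parallel, sum the outputs and
    add a residual connection to the input features."""
    def __init__(self, h: int = 128, k: int = 4):
        super().__init__()
        assert h % k == 0
        self.k, self.head_size = k, h // k
        self.head_convs = nn.ModuleList(
            nn.Conv1d(self.head_size, h, kernel_size=3, padding=1) for _ in range(k)
        )

    def forward(self, x):
        # x: [N_res, h] -> k heads of shape [N_res, head_size]
        heads = x.view(-1, self.k, self.head_size).unbind(dim=1)
        out = torch.zeros_like(x)
        for conv, head in zip(self.head_convs, heads):
            out = out + conv(head.T.unsqueeze(0)).squeeze(0).T  # each head back to [N_res, h]
        return out + x                                          # residual connection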
Algorithm 8 affinity prediction AffinityLearning
# input projections
# output projection
return→{Affinity_Prediction}
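Algorithm 8 is only outlined above; following the affinity prediction step described in claim 8 below (linear updates of the compound features, averaging over the atom and residue dimensions, and matrix multiplication followed by linear fusion), a hedged PyTorch-style sketch could look as follows. The projection sizes and the exact fusion of the interaction matrix are assumptions.

import torch
import torch.nn as nn

class AffinityLearning(nn.Module):
    """Sketch of the affinity learning unit: pool the compound and protein
    features and fuse them into a single predicted affinity value."""
    def __init__(self, h: int = 128):
        super().__init__()
        self.super_linear = nn.Linear(h, h)
        self.atom_linear = nn.Linear(h, h)
        self.prot_conv = nn.Conv1d(h, h, kernel_size=3, padding=1)
        self.out = nn.Linear(2 * h * h, 1)  # fuses the flattened interaction matrix

    def forward(self, atom_feat, super_node, prot_seq_feat):
        # atom_feat: [N_atom, h]; super_node: [1, h]; prot_seq_feat: [N_res, h]
        super_upd = self.super_linear(super_node).squeeze(0)            # [h]
        comp = self.atom_linear(atom_feat).mean(dim=0)                  # mean over the atom dimension
        comp_all = torch.cat([comp, super_upd], dim=-1)                 # [2h] compound summary
        prot = self.prot_conv(prot_seq_feat.T.unsqueeze(0)).squeeze(0).T.mean(dim=0)  # mean over residues
        # matmul-style interaction between the compound and protein summaries, then linear fusion
        inter = torch.matmul(comp_all.unsqueeze(1), prot.unsqueeze(0))  # [2h, h]
        return self.out(inter.flatten())                                # predicted affinity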
Algorithm 9 TPS function TransposeForScores
def TransposeForScores({input}):
# input dimension: (N_res/N_atom, h)
# output dimension: (k, N_res/N_atom, k_h)
output=DimensionReshape(input)
return{output}
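This reshape is the standard multi-head split; a minimal PyTorch sketch follows (the function name and argument order are chosen here only for illustration).

import torch

def transpose_for_scores(x: torch.Tensor, k: int) -> torch.Tensor:
    """Reshape [N, h] features into k attention heads of size k_h = h // k,
    returning a tensor of shape [k, N, k_h]."""
    n, h = x.shape
    return x.view(n, k, h // k).permute(1, 0, 2)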
Finally, it should be noted that the embodiments described above are only intended to illustrate the technical solution of the present invention and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced with equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A method for compound-protein affinity prediction based on the three-dimensional structure of a protein, used for predicting compound-protein affinity, wherein compound features are extracted from the compound, the compound features comprising atomic features and aggregation node features; protein features containing three-dimensional information of the protein are extracted from the protein; and the compound-protein affinity is predicted from the compound features and the protein features by an affinity prediction algorithm, the method specifically comprising the following steps:
S1, a compound composite feature extraction step:
according to the compound characterization graph, initial atomic features and aggregation node features are obtained; the atomic features and the aggregation node features are cyclically updated using a deep graph convolution network and a multi-head attention mechanism, and the final atomic features and aggregation node features of the compound are output;
S2, a protein feature extraction step:
protein sequence information and protein structural features are acquired; according to the protein sequence information and the protein structural features, a protein feature aggregation algorithm is applied so that the embedded protein sequence features carry the protein structural features, thereby obtaining embedded sequence features and embedded structure features; the embedded sequence features and the embedded structure features are cyclically updated using a co-evolution strategy to finally obtain the updated protein sequence features and protein structure features;
S3, a compound-protein affinity prediction step:
according to the atomic features and aggregation node features of the compound obtained in step S1 and the protein sequence features obtained in step S2, a predicted affinity value is obtained through the affinity learning unit algorithm; the larger the affinity value, the greater the probability that the compound and the protein bind.
2. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 1, wherein: the step S1 specifically comprises the following steps:
S11, obtaining the atomic features and aggregation node features according to the compound characterization graph;
S12, inputting the atomic features into the deep graph convolution network, adding the output of the deep graph convolution network to the aggregation node features according to a first weight of the compound features, and combining the result with the atomic features through a first gated recurrent unit (GRU) to obtain updated atomic features;
S13, inputting the atomic features into the multi-head attention characterization algorithm, adding the output of the multi-head attention characterization algorithm to the aggregation node features according to a second weight of the compound features, and combining the result with the aggregation node features through a second gated recurrent unit (GRU) to obtain updated aggregation node features;
S14, taking the updated atomic features and the updated aggregation node features as the atomic features and aggregation node features, repeating steps S12-S13 K times, and outputting the atomic features and aggregation node features.
3. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 2, wherein: the deep graph convolution network in step S1 is provided with residual connections, which are used to aggregate the neighbor information of atoms onto the atoms and to avoid the over-smoothing problem that arises as the number of network layers increases; a GRU aggregates the information between the front and rear layers of the network, and the multi-head attention characterization algorithm is used to acquire diversified compound features, thereby improving the accuracy of affinity prediction in the affinity prediction step.
4. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 1, wherein: the step S2 specifically comprises the following steps:
S21, acquiring protein sequence information and protein structural features, wherein the protein structural features at least comprise a discretized distance matrix;
S22, a protein feature aggregation algorithm, specifically comprising: passing the protein sequence information through a first embedding layer to obtain a sequence vector, and passing the discretized distance matrix through a second embedding layer to obtain a discrete distance matrix feature vector; the discrete distance matrix feature vector serves as the embedded distance matrix after feature embedding and updating;
S23, realizing the co-evolution strategy with N layers of protein coding algorithms, wherein the protein coding algorithm of each layer is the same, and one layer of the protein coding algorithm specifically comprises:
fusing, through a gated recurrent unit, the two results obtained by row-wise addition and column-wise addition of the embedded distance matrix to obtain spliced embedded structure information, which then enters a diversity convolution layer to obtain diversified protein structural features;
passing the embedded sequence features sequentially through a diversity convolution layer and an ordinary convolution layer to obtain diversified protein sequence features; adding the diversified protein structural features and the diversified protein sequence features through a gating logic, and passing the result sequentially through a gated recurrent unit and a convolution layer to obtain a structural feature vector carrying both structural information and sequence information;
outputting the structural features of the protein from the structural feature vector through a self-combination transformation (outer sum);
passing the embedded and updated sequence features through a diversity convolution layer to obtain diversified embedded and updated sequence features; adding the diversified embedded and updated sequence features and the diversified protein structural features through a gating logic to obtain a sequence feature vector fusing the structural information and the sequence information; further fusing the diversified protein sequence features with the sequence feature vector through a gated recurrent unit to output the protein sequence features;
S24, the new embedded distance matrix output by the protein coding algorithm is the required embedded distance matrix; the obtained embedded distance matrix and embedded sequence features serve as the input of the next protein coding algorithm, and the embedded distance matrix and embedded sequence features are continuously updated until all N protein coding algorithms have been completed, at which point the output embedded distance matrix is the structural feature of the protein and the embedded sequence feature is the sequence feature of the protein.
5. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 4, wherein:
The protein structural features in the step S21 also comprise a feature auxiliary matrix;
the embedded sequence features obtained in step S22 are added to the feature auxiliary matrix through the gating logic to obtain the embedded and updated sequence features;
the diversity convolution layer in step S23 equally divides the input feature vector into four parts along the feature dimension, which respectively enter four parallel ordinary convolution layers; the output results are summed and, at the same time, the initial features of the input sequence information are aggregated through a residual connection to obtain the diversified feature vector.
6. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 5, wherein: the feature auxiliary matrix is a torsion angle matrix, and the torsion angle matrix is constructed from the sine and cosine values of the dihedral angles φ and ψ between the α-carbon atoms of the protein backbone chain and the amino and carboxyl groups.
7. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 4, wherein:
the discretized distance matrix in step S21 is obtained by dividing the inter-residue distances of the stable protein structure into M equidistant mapping intervals that approximately follow a normal distribution, thereby realizing the discretized encoding of the distance matrix, wherein M is a positive integer.
8. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 1, wherein: the step S3 specifically comprises the following steps:
S31, receiving the atomic features and aggregation node features of the compound from step S1 and passing them into the affinity learning unit as the compound information; the aggregation node features are updated through a linear layer to obtain updated aggregation node features; the compound features are updated through a linear layer, averaged over the atomic dimension, and concatenated with the updated aggregation node features to obtain the compound comprehensive features;
S32, receiving the protein sequence features from step S2 and passing them into the affinity learning unit as the protein information; the protein sequence features are updated through a convolution layer and, at the same time, averaged over the residue dimension to obtain the protein comprehensive features;
the protein comprehensive features and the compound comprehensive features are combined through matrix multiplication (matmul) and linear-layer fusion to obtain the predicted affinity value.
9. A system for compound-protein affinity prediction based on a three-dimensional structure of a protein, comprising: a compound extractor, a protein extractor, and an affinity predictor;
The compound extractor cyclically updates the atomic features and the aggregation node features, using a deep graph convolution network and a multi-head attention algorithm, according to the input atomic features and aggregation node features of the compound, and outputs the final atomic features and aggregation node features; the final compound features and aggregation node features are transmitted to the affinity predictor;
the protein extractor obtains embedded sequence features and embedded structure features through a protein feature aggregation algorithm according to the input protein sequence features and protein structural features, and updates the embedded sequence features and embedded structure features using a co-evolution strategy to obtain the updated protein sequence features and protein structure features;
and the affinity predictor is used for receiving the compound features, the protein sequence features and the aggregation node features, and obtaining the predicted affinity value through the affinity learning unit algorithm.
10. A system for compound-protein affinity prediction based on protein three-dimensional structure according to claim 9, characterized by comprising:
the compound extractor comprises a deep map convolution unit and a multi-head attention characterization unit, wherein the deep map convolution unit is used for realizing a deep map convolution network, and the multi-head attention characterization unit is used for realizing a multi-head attention algorithm;
The protein extractor comprises a protein information aggregation unit and a co-evolution updating unit, wherein the protein information aggregation unit is used for realizing a protein characteristic aggregation algorithm, and the co-evolution updating unit is used for realizing a co-evolution strategy;
the affinity predictor includes an affinity learning unit.
CN202210828457.6A 2022-03-29 2022-07-13 Method and system for compound-protein affinity prediction based on protein three-dimensional structure Pending CN116959555A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022103254945 2022-03-29
CN202210325494 2022-03-29

Publications (1)

Publication Number Publication Date
CN116959555A true CN116959555A (en) 2023-10-27

Family

ID=88441520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210828457.6A Pending CN116959555A (en) 2022-03-29 2022-07-13 Method and system for compound-protein affinity prediction based on protein three-dimensional structure

Country Status (1)

Country Link
CN (1) CN116959555A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117393036A (en) * 2023-11-09 2024-01-12 中国海洋大学 Protein multi-level semantic polymerization characterization method for drug-target affinity prediction


Similar Documents

Publication Publication Date Title
Ruffolo et al. Geometric potentials from deep learning improve prediction of CDR H3 loop structures
CN116580848A (en) Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
Santana et al. GRaSP: a graph-based residue neighborhood strategy to predict binding sites
CN116959555A (en) Method and system for compound-protein affinity prediction based on protein three-dimensional structure
Chen et al. Identification of self-interacting proteins by integrating random projection classifier and finite impulse response filter
Gui et al. DNN-PPI: a large-scale prediction of protein–protein interactions based on deep neural networks
Liu et al. Bi-fidelity evolutionary multiobjective search for adversarially robust deep neural architectures
Feng et al. MGMAE: molecular representation learning by reconstructing heterogeneous graphs with A high mask ratio
Yuan et al. Protein-ligand binding affinity prediction model based on graph attention network
Zhang et al. Efficient and accurate physics-aware multiplex graph neural networks for 3d small molecules and macromolecule complexes
CN112466410B (en) Method and device for predicting binding free energy of protein and ligand molecule
CN116758978A (en) Controllable attribute totally new active small molecule design method based on protein structure
Newton et al. Secondary structure specific simpler prediction models for protein backbone angles
CN116978450A (en) Protein data processing method, device, electronic equipment and storage medium
CN114783507B (en) Drug-protein affinity prediction method and device based on secondary structural feature coding
Zhao et al. General and species-specific lysine acetylation site prediction using a bi-modal deep architecture
Grisci et al. NEAT-FLEX: Predicting the conformational flexibility of amino acids using neuroevolution of augmenting topologies
Talebitooti et al. Identification of tire force characteristics using a Hybrid method
Zhang et al. GANs for molecule generation in drug design and discovery
Wang et al. DPLA: prediction of protein-ligand binding affinity by integrating multi-level information
Zhou et al. Few-shot multi-view object classification via dual augmentation network
Hu et al. Cerebra: a computationally efficient framework for accurate protein structure prediction
Ren et al. Highly accurate and robust protein sequence design with CarbonDesign
Khandelwal et al. DeepPRMS: advanced deep learning model to predict protein arginine methylation sites
CN118335202B (en) Method for designing antibody structure and sequence based on generated neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination