CN115713965A - Computing method for predicting compound-protein affinity based on GECo model - Google Patents

Computing method for predicting compound-protein affinity based on GECo model

Info

Publication number
CN115713965A
Authority
CN
China
Prior art keywords
protein
compound
molecular
model
molecular characteristics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211332124.0A
Other languages
Chinese (zh)
Other versions
CN115713965B (en)
Inventor
Yuan Yongna (袁永娜)
Wang Xin (王欣)
Liu Zhenyu (刘振宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou University
Original Assignee
Lanzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou University filed Critical Lanzhou University
Priority to CN202211332124.0A priority Critical patent/CN115713965B/en
Publication of CN115713965A publication Critical patent/CN115713965A/en
Application granted granted Critical
Publication of CN115713965B publication Critical patent/CN115713965B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The application discloses a computing method for predicting compound-protein affinity based on a GECo model, which comprises the following steps: obtaining compound-protein pair data in a dataset; respectively processing the compound-protein pairs to respectively obtain the molecular characteristics of the compound and the molecular characteristics of the protein; inputting the molecular characteristics of the compound and the molecular characteristics of the protein into a GECo model for training to obtain the binding affinity of the compound and the protein. Compared with the prior CPA calculation methods, the method has the advantages of rapidness, accuracy, high efficiency and low cost, has the double functions of predicting the affinity and inferring the potential interaction position, and provides a basic tool and a high-efficiency way for calculating the compound-protein affinity in the drug research and development process.

Description

Computing method for predicting compound-protein affinity based on GECo model
Technical Field
The application belongs to the interdisciplinary field of computational chemistry, deep learning algorithms, computer science and artificial intelligence, and particularly relates to a computing method for predicting compound-protein affinity based on a GECo model.
Background
Drug repositioning is highly valued in the current field of drug discovery, because developing a new drug costs billions of dollars and approval for market after trials can take up to a decade. Testing whether existing marketed drugs can be used for other diseases is therefore a good choice, as it saves a great deal of time and money and ensures safety. For diseases that lack therapeutic drugs, such as Alzheimer's disease or novel coronavirus pneumonia, the scope of drug discovery can thus be expanded: starting from the study of small-molecule compounds, the interactions between these compounds and potentially related protein targets in the disease can be investigated, and potentially effective drug-like molecules can be further searched for. Recognition of Compound-Protein Interactions (CPI) is a key step in drug discovery and drug repurposing. At present, drug-target interaction is mainly expressed as the interaction between a small-molecule compound acting as the drug and a protein acting as the target. The calculation of CPI is mainly treated as a binary classification problem, i.e., the final output is whether there is an interaction between the compound and the protein (0 means no interaction, 1 means interaction). Compound-Protein Affinity (CPA) expresses the magnitude of CPI and is therefore closer to reality than measuring CPI with a binary classification value; moreover, the first step in the drug discovery process is to identify binding molecules with high affinity for the target, which can be further developed into drug-like molecules (lead compounds). CPA is thus an important reference for the study of compound-protein interactions.
Traditional CPA methods include experimental measurement and molecular modeling. However, measuring the binding affinity of a compound-protein pair experimentally is time-consuming and complex, which is one of the major bottlenecks of the drug discovery process. Molecular modeling builds an atomic-level model of the compound-protein pair with physical and chemical methods, using the co-crystal structure of the complex or the three-dimensional structure of the molecules, and then estimates the affinity of the compound-protein pair with quantum mechanics, molecular mechanics and molecular dynamics. Molecular modeling and molecular docking simulation can identify the interaction between a drug and a target by dynamic simulation, but they cannot be applied when the three-dimensional structure of the protein is unavailable. There is therefore a pressing need to develop efficient computational techniques to accurately predict CPA.
Data-driven artificial intelligence algorithms, popular in recent years, have also achieved some success in the area of predicting CPA. These algorithms include traditional Machine Learning (ML) algorithms and Deep Learning (DL) algorithms. Traditional machine learning algorithms depend heavily on hand-crafted feature engineering, i.e., the quality of the data feature processing strongly affects the final prediction, so the prediction accuracy obtained does not reach the ideal level. After the appearance of more advanced deep learning algorithms, the level of artificial intelligence algorithms in the field of predicting CPA improved. In the currently popular deep learning methods, most works use a SMILES (Simplified Molecular Input Line Entry System) character sequence to characterize a compound molecule, simply convert the SMILES character sequence and the protein sequence into one-hot codes, and then extract features from the one-hot codes with 1D CNNs (Convolutional Neural Networks) or RNNs (Recurrent Neural Networks). However, characterizing a compound molecule with a SMILES character sequence does not adequately capture the physicochemical properties of the molecule or its geometry. A 1D CNN can extract local features of a sequence, and stacking multiple 1D CNN layers compresses the feature dimensions to extract global information, but information is lost in the process, so richer global information is ignored. RNNs may suffer from vanishing gradients when extracting long-range sequence features and thus cannot account for the correlation between distant amino acid residues in the sequence. Building features from 3D structural information and then extracting them with 3D CNNs or graph convolution leads to excessive model parameters and high computational cost, and the 3D structure of a protein is difficult to obtain. Moreover, most current deep learning algorithms have only the single objective of predicting the affinity value and cannot predict the specific positions of the amino acid residues and the atoms in the compound molecule that interact when the compound molecule binds to the protein molecule, yet finding the binding positions is also an important part of studying CPI.
In the process of CPA prediction, in order to solve the time and labor cost of the traditional experimental and molecular modeling methods, to overcome the dependence of traditional machine learning algorithms on manual feature engineering and their low accuracy, and to improve on the inability of existing deep learning algorithms to fully express molecular features in the molecular characterization, the invention discloses a GECo (Computing CPA based on GIN, ESM-1b and Co-Attention) model. The model first constructs physicochemical property features for each atom in the compound molecule, then inputs the whole molecule into a Graph Neural Network (GNN) as a graph structure, and the GNN fuses the features among the atoms in the molecule to finally obtain the features of the whole molecule. For protein characterization, we use the amino acid sequence of the protein and input it into a protein pre-training model to obtain an excellent overall protein representation. Then, on the basis of predicting CPA with high accuracy, the Co-Attention mechanism is used to further develop the ability to predict the specific binding positions between the compound molecule and the protein target.
Disclosure of Invention
The application provides a computing method for predicting compound-protein affinity based on a GECo model, which is based on chemical information and topological information of fully characterized molecules, can predict CPA with high efficiency, high accuracy and low cost, and can also predict the interaction position (binding site) of a compound molecule and a protein molecule.
In order to achieve the above purpose, the present application provides the following solutions:
a computing method for predicting compound-protein affinity based on a GECo model comprises the following steps:
obtaining compound-protein pair data in a dataset;
respectively processing the compound-protein pairs to respectively obtain the molecular characteristics of the compound and the molecular characteristics of the protein;
inputting the molecular characteristics of the compound and the molecular characteristics of the protein into a GECo model for training to obtain the binding affinity of the compound and the protein.
Preferably, the method of treating the compound-protein pair separately comprises:
the method for treating the compound molecule comprises the following steps:
using the RDKit toolkit, constructing atomic chemical features for the atoms in the compound molecule to form overall atomic features, and combining the overall atomic features to generate the compound molecular feature;
the method for processing the protein molecule comprises the following steps:
protein molecular signatures were generated for protein molecules by using a pre-trained protein model, ESM-1b model.
Preferably, the method of inputting the molecular characteristics of the compound and the molecular characteristics of the protein into a GECo model for training comprises:
processing the molecular characteristics of the compound through a graph isomorphic network in a GECo model to obtain the molecular characteristics of the compound which are fused and updated;
reducing the dimensionality of the protein molecules to obtain the protein molecule characteristics after the dimensionality reduction;
obtaining a compound molecular characteristic and a protein molecular characteristic with contribution degrees based on the fusion updated compound molecular characteristic and the protein molecular characteristic after dimensionality reduction;
and obtaining an affinity predicted value based on the molecular characteristics of the compound with the contribution degree and the molecular characteristics of the protein.
Preferably, the method for obtaining the fusion-updated compound molecular features comprises:
x_i^(k) = MLP((1 + ε^(k)) · x_i^(k-1) + Σ_{j∈N(i)} x_j^(k-1))
wherein MLP is a multilayer perceptron, ε^(k) is a learnable parameter, i is an atom, x_i^(k-1) is the feature of atom i after fusion and update by the (k-1)-th GIN layer, N(i) is the set of atoms chemically bonded to atom i in the molecule, x_j^(k-1) is the feature of a neighbour atom j after fusion and update by the (k-1)-th GIN layer, and the molecular feature x_i^(k) of atom i is finally obtained after k layers of iteration; when k = 1, x_i^(k-1) is the original input to the GIN.
Preferably, the method for obtaining the protein molecular characteristics after dimensionality reduction comprises the following steps:
P_L = FCNN(FCNN(P_L'))
wherein FCNN represents a fully connected neural network, P_L' represents the protein molecular features output by ESM-1b, and P_L represents the dimensionality-reduced protein molecular features, P_L ∈ R^(L×128).
Preferably, obtaining the molecular characteristics of the compound with the contribution degree and the molecular characteristics of the protein comprises:
obtaining a correlation matrix based on the fusion updated compound molecular characteristics and the dimensionality reduced protein molecular characteristics;
obtaining an attention weight score based on the correlation matrix;
and obtaining the compound molecular features and protein molecular features with contribution degrees based on the attention scores.
Preferably, the method for obtaining the correlation matrix comprises:
for the fusion-updated compound molecular feature X ∈ R^(d×N) and the dimensionality-reduced protein molecular feature P_L ∈ R^(L×128), the correlation matrix C ∈ R^(L×N) can be calculated by the following formula:
C = tanh(P^T W_b X)
wherein W_b ∈ R^(d×d) represents the correlation weight matrix.
Preferably, the method of obtaining the attention weight scores comprises:
H_x = tanh(W_x X + (W_p P) C)
H_p = tanh(W_p P + (W_x X) C^T)
a_x = softmax(w_hx^T H_x)
a_p = softmax(w_hp^T H_p)
wherein C represents the correlation matrix and C^T its transpose; the correlation matrix C transforms the protein attention space into the compound-molecule attention space, and its transpose C^T does the reverse; H_x is the compound molecular feature matrix that attends to and fuses the protein molecular features on the basis of the correlation matrix, and H_p is the protein molecular feature matrix that attends to and fuses the compound molecular features on the basis of the transpose of the correlation matrix; W_x, W_p ∈ R^(k×d) and w_hx, w_hp ∈ R^k represent weight parameters; a_x ∈ R^N and a_p ∈ R^L denote the attention scores of each atom of the compound and each amino acid residue of the protein, respectively.
Preferably, the method for obtaining the molecular characteristics of the compound with the contribution degree and the molecular characteristics of the protein comprises the following steps:
based on the attention weight score, the formula for obtaining the molecular characteristics of the compound and the molecular characteristics of the protein with the contribution degree is as follows:
x̂ = Σ_{n=1}^{N} a_n^x · x_n
p̂ = Σ_{l=1}^{L} a_l^p · p_l
wherein a_n^x denotes the attention score of the n-th atom in the compound molecule, x_n is the feature of the n-th atom in the compound molecule, a_l^p denotes the attention score of the l-th amino acid in the protein molecule, p_l denotes the feature of the l-th amino acid in the protein molecule, x̂ denotes the compound molecular feature, and p̂ denotes the protein molecular feature;
and taking the attention scores as the contribution degrees of the compound molecule and the protein molecule in the interaction, the compound molecular features and protein molecular features with contribution degrees are finally obtained.
Preferably, the method for obtaining the predicted value of affinity comprises:
ŷ = FCNN(Concat(x̂, p̂))
wherein Concat represents the concatenation operation.
The beneficial effect of this application does:
the method utilizes the nature that the compound molecule naturally has a graph structure, overcomes the defect that the molecular representation which cannot fully contain chemical characteristics and geometric structure information is used in the prior work as the construction characteristics of the compound molecule, constructs rich chemical characteristics for atoms in the compound molecule, and fuses the characteristics of neighbor atoms for each atom by utilizing an advanced graph neural network to obtain the molecular representation containing more rich information;
for protein molecule characterization, the method abandons the construction of the harsh protein characterization by using an amino acid character sequence one-hot code in the work of other people in the past, uses the most advanced protein pre-training model based on a self-attention mechanism to generate characteristics for protein molecules, enables the characterization construction of the protein molecules to be separated from the general field, and uses a professional model pre-trained by a large amount of protein data to construct the protein molecule characterization;
the method uses a Co-Attention mechanism to enable a model to automatically learn the contribution degree of atoms in compound molecules and amino acid residues in protein molecules to the binding affinity, not only can predict the value of the binding affinity of the compound-protein pairs, but also can reasonably conjecture the interaction sites of the compound-protein pairs;
by adopting the GECo model, the high-precision prediction result can be ensured, the binding affinity of the compound-protein pair can be predicted through the trained GECo model, the time is short, the cost is low, the prediction precision is high, the method is low in cost and simple, a large amount of manpower, material resources and financial resources can be saved, a suitable candidate compound is searched for a protein target, and a corresponding basic tool and a quick way are provided for improving the efficiency of a drug research and development process.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings needed to be used in the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for a person skilled in the art to obtain other drawings without any inventive exercise.
FIG. 1 is a schematic flow chart of a calculation method for predicting compound-protein affinity based on a GECo model according to an embodiment of the present application;
FIG. 2 is a schematic overall flow chart of the GECo model according to the embodiment of the present application;
FIG. 3 is a schematic view of an ESM-1b model according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of GIN according to an embodiment of the present application;
FIG. 5 is a diagram illustrating the predicted results of the GECo model in the Davis dataset according to the embodiment of the present application;
FIG. 6 is a diagram illustrating the predicted result of the GECo model on the KIBA data set according to the embodiment of the present application;
FIG. 7 is a diagram showing the hydrogen-bond interactions between 3HDM and its ligand in the example of the present application;
FIG. 8 is a diagram showing the Pi-bond interactions between 3HDM and its ligand in the example of the present application;
FIG. 9 is a schematic diagram showing coloring of high Attention atomic positions by analyzing the molecular Attention scores of compounds in a Co-Attention module in an example of the present application;
FIG. 10 is a schematic representation of the staining of high Attention amino acid residue positions by analysis of protein molecule Attention scores in the Co-Attention module in the examples of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
FIG. 1 is a schematic flow chart of a calculation method for predicting compound-protein affinity based on a GECo model according to the embodiment of the present application;
obtaining compound-protein molecular data in a dataset; respectively processing the compound-protein molecules to respectively obtain the molecular characteristics of the compound and the molecular characteristics of the protein;
the method for treating the compound molecule comprises the following steps:
Using the RDKit toolkit, overall atomic features are formed for the atoms in the compound molecule from their atomic chemical features, and the overall atomic features are combined to generate the compound molecular feature. As shown in FIG. 2, a graph is composed of nodes and edges; nodes are connected by edges, which establishes the relations between different nodes. Each node has a feature, and a GNN (graph neural network) can fuse and update the feature of a node in the graph with the features of its adjacent nodes, thereby automatically learning the relations between the nodes in a graph. Because a compound molecule naturally has the properties of a graph, the atoms in the molecule are taken as the nodes of the graph and the chemical bonds between atoms as its edges. Specifically, for a molecule M, an undirected graph G(V, E) is constructed, where the atom set corresponds to V and the edge set corresponds to E. Next, physicochemical features are constructed for each atom in terms of five aspects, i.e., atom type, number of adjacent atoms, number of adjacent hydrogen atoms, valence of the atom, and whether the atom is aromatic; specific information on the atomic features is shown in Table 1.
TABLE 1
[Table 1: details of the five atomic feature groups; presented as an image in the original publication.]
For one atom, the features of the five aspects are concatenated to obtain the overall atomic feature atom_i ∈ R^78, where i ∈ [1, N] and N is the number of atoms in the molecule. Then all the atomic features in the molecule are combined into the feature X of the whole molecule:
X = [atom_1, atom_2, ..., atom_N]
wherein X ∈ R^(N×78). An adjacency matrix E describing the adjacency relations between atoms is also constructed for each molecule; this matrix defines the source node and destination node of every edge.
Specifically, each atom in a molecule is numbered from 1 to N, and if atom i and atom j are chemically bonded, [i, j] and [j, i] are added to E. The construction of the molecular feature X and the adjacency matrix E is completed with the RDKit toolkit and the PyTorch Geometric library, and X and E are the inputs to the GNN.
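By way of illustration only, the following is a minimal sketch of the atom featurization and graph construction described above, using the RDKit toolkit; the helper names and the abbreviated atom-type vocabulary are assumptions (the full vocabulary in the text yields the 78-dimensional feature), not the exact implementation of the invention.

```python
from rdkit import Chem
import numpy as np

def one_hot(value, choices):
    # One-hot encode `value` over `choices`; values outside the list map to all zeros.
    return [int(value == c) for c in choices]

def atom_features(atom):
    # Five feature groups: atom type, number of adjacent atoms, number of adjacent
    # hydrogen atoms, implicit valence, and aromaticity. The symbol vocabulary here
    # is abbreviated for illustration; the full vocabulary gives 78 dimensions.
    symbols = ['C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Br', 'I', 'Other']
    symbol = atom.GetSymbol() if atom.GetSymbol() in symbols else 'Other'
    feats = one_hot(symbol, symbols)
    feats += one_hot(atom.GetDegree(), list(range(11)))
    feats += one_hot(atom.GetTotalNumHs(), list(range(11)))
    feats += one_hot(atom.GetImplicitValence(), list(range(11)))
    feats += [int(atom.GetIsAromatic())]
    return feats

def mol_to_graph(smiles):
    # Build the node feature matrix X and the edge list E for one compound molecule.
    mol = Chem.MolFromSmiles(smiles)
    X = np.array([atom_features(a) for a in mol.GetAtoms()], dtype=np.float32)
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [[i, j], [j, i]]  # undirected graph: add both directions
    edge_index = np.array(edges, dtype=np.int64).T  # COO layout expected by PyTorch Geometric
    return X, edge_index
```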
The method for processing protein molecules comprises the following steps:
Protein molecular features are generated for the protein molecules by using the pre-trained protein model ESM-1b: the amino acid sequence of the protein molecule is input into the ESM-1b pre-training model, and the resulting protein molecular features are taken as the input of the model. For an input amino acid sequence of length L, ESM-1b generates a feature vector of dimension 1280 for each amino acid character, so the protein molecular feature P_L' ∈ R^(L×1280) is obtained, as shown in FIG. 3.
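As a hedged illustration, the snippet below shows one way to obtain the per-residue L×1280 features from ESM-1b using the publicly available fair-esm package; the example sequence is a placeholder and the layer index follows that package's documented usage rather than the patent itself.

```python
import torch
import esm

# Load the pre-trained ESM-1b model (33 layers).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder protein sequence
_, _, tokens = batch_converter([("protein", sequence)])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Drop the BOS/EOS tokens to get one 1280-dim vector per residue: shape (L, 1280).
P_prime = out["representations"][33][0, 1:len(sequence) + 1]
```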
Inputting the molecular characteristics of the compound and the molecular characteristics of the protein into a GECo model for training to obtain the binding affinity of the compound and the protein molecules;
as shown in fig. 4, the molecular characteristics of the compound are processed through a graph isomorphic network in a GECo model, so as to obtain the molecular characteristics of the compound which is fused and updated; among the many graphical neural network algorithms, a novel graphical neural network-GIN was used to manipulate the characteristics of compound molecules. GIN suggested that previous MPNN-based graph neural network approaches were unable to distinguish different graph structures based on the graph embedding they produced. To compensate for this drawback, GIN relies on a learnable parameter e (k) The method has the advantages that the weight of the central node is adjusted, the method achieves very good accuracy in the graph classification tasks of four bioinformatics data sets, 5 layers of GINs are used in the model, and each atom fuses feature information of the atom and feature information of adjacent atoms in each layer of GIN to update the feature of the atom. In the k-th layer GIN, for an atom i, the GIN performs fusion and updating of atom characteristics according to the following formula:
x i (k) =MLP((1+∈ (k) )·X i (k-1) +∑ j∈N(i) X j (k-1) )
wherein MLP is a multilayer perceptron, ε is a learnable parameter, i is an atom, X i (k-1) Is a characteristic of an atom i after fusion renewal through a k-1 th layer GIN, N (i) Is a chemically bound, contiguous subset of atoms of the molecule, X j (k-1) The characteristic of the atom i adjacent to the atom after fusion updating of the GIN of the k-1 layer is obtained, and the molecular characteristic x of the atom i after the k-layer iteration is finally obtained i (k) When k is 1, x i (k) The original input for GIN, also the molecular signature X obtained above.
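A minimal sketch of the five-layer GIN stack is given below using PyTorch Geometric's GINConv, whose learnable epsilon corresponds to ε^(k) in the formula above; the hidden width of 128 and the ReLU activations are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv

class CompoundGIN(nn.Module):
    def __init__(self, in_dim=78, hidden=128, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = in_dim
        for _ in range(num_layers):
            mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            # train_eps=True makes epsilon a learnable parameter, as in the formula above.
            self.layers.append(GINConv(mlp, train_eps=True))
            dim = hidden

    def forward(self, x, edge_index):
        # x: (N, in_dim) atom features; edge_index: (2, num_edges) chemical bonds.
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        return x  # per-atom fusion-updated features, kept for the Co-Attention module
```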
Reducing the dimensionality of the protein molecular features to obtain the dimensionality-reduced protein molecular features: from the protein molecular features obtained in the previous step, the feature dimension of each amino acid residue is 1280; two layers of fully connected neural networks are used to reduce the dimension of the amino acid residue features, the first layer from 1280 to 512 and the second layer from 512 to 128. The formula is as follows:
P_L = FCNN(FCNN(P_L'))
wherein the two FCNNs represent two fully connected neural network layers used to reduce the dimension of the protein molecular features, reduce the complexity of the neural network computation and reduce the number of model parameters; the first layer reduces the feature dimension from 1280 to 512 and the second layer from 512 to 128; P_L' represents the protein molecular features output by ESM-1b, and P_L represents the dimensionality-reduced protein molecular features, P_L ∈ R^(L×128).
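A sketch of the two dimensionality-reduction layers (1280 → 512 → 128); the layer widths follow the text, while the ReLU activations are an assumption.

```python
import torch.nn as nn

# Reduce each 1280-dim ESM-1b residue feature to 128 dims: (L, 1280) -> (L, 128).
protein_reducer = nn.Sequential(
    nn.Linear(1280, 512), nn.ReLU(),  # first layer: 1280 -> 512
    nn.Linear(512, 128), nn.ReLU(),   # second layer: 512 -> 128
)
```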
Obtaining the compound molecular features and protein molecular features with contribution degrees based on the fusion-updated compound molecular features and the dimensionality-reduced protein molecular features: the Co-Attention module attends to the compound molecule and the protein molecule simultaneously; it connects the two by computing the similarity between the compound and protein features at all pairs of compound atom positions and protein amino acid residue positions, so that the mutual attention between the feature of each atom in the compound molecule and each amino acid residue in the protein molecule can be calculated.
For the compound molecular feature X ∈ R^(d×N) output by the GIN module and the dimensionality-reduced protein molecular feature P_L ∈ R^(L×128) derived from the ESM-1b protein pre-training model, the correlation matrix C ∈ R^(L×N) between the compound and protein molecular features can be calculated by the following formula:
C = tanh(P^T W_b X)
wherein W_b ∈ R^(d×d) is the correlation weight matrix.
Obtaining the attention weight scores based on the correlation matrix: the correlation matrix C is used as a feature to learn and calculate the attention weight scores of the compound molecules and the protein molecules by the following formulas:
H_x = tanh(W_x X + (W_p P) C)
H_p = tanh(W_p P + (W_x X) C^T)
a_x = softmax(w_hx^T H_x)
a_p = softmax(w_hp^T H_p)
wherein the correlation matrix C transforms the protein attention space into the compound-molecule attention space, and its transpose C^T does the reverse. H_x is the compound molecular feature matrix that attends to and fuses the protein molecular features on the basis of the correlation matrix C, and H_p is the protein molecular feature matrix that attends to and fuses the compound molecular features on the basis of the transpose C^T. W_x, W_p ∈ R^(k×d) and w_hx, w_hp ∈ R^k are weight parameters. a_x ∈ R^N and a_p ∈ R^L are the attention scores of each atom of the compound and each amino acid residue of the protein, respectively.
The correlation matrix and its transpose transform the compound-molecule attention space and the protein-molecule attention space into each other.
Obtaining the compound molecular features and protein molecular features with contribution degrees as weighted sums based on the attention scores: the attention vectors of the compound and protein molecules are calculated as weighted sums of the atomic features and residue features, i.e.
x̂ = Σ_{n=1}^{N} a_n^x · x_n
p̂ = Σ_{l=1}^{L} a_l^p · p_l
wherein a_n^x denotes the attention score of the n-th atom in the compound molecule, x_n is the feature of the n-th atom in the compound molecule, a_l^p denotes the attention score of the l-th amino acid in the protein molecule, p_l denotes the feature of the l-th amino acid in the protein molecule, x̂ denotes the compound molecular feature, and p̂ denotes the protein molecular feature.
The attention scores are taken as the contribution degrees of the compound molecule and the protein molecule in the interaction, and the compound molecular feature x̂ and the protein molecular feature p̂ with contribution degrees are finally obtained.
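The following is a minimal sketch of the Co-Attention block described above: the correlation matrix C, the attention maps H_x and H_p, the softmax attention scores a_x and a_p, and the weighted-sum features x̂ and p̂. Dimension names follow the text (d = feature dimension, N = number of atoms, L = number of residues, k = attention dimension); the concrete values, the random initialization, and the choice to handle P in transposed (d × L) form are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, d=128, k=64):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.01)  # correlation weight matrix
        self.W_x = nn.Parameter(torch.randn(k, d) * 0.01)
        self.W_p = nn.Parameter(torch.randn(k, d) * 0.01)
        self.w_hx = nn.Parameter(torch.randn(k) * 0.01)
        self.w_hp = nn.Parameter(torch.randn(k) * 0.01)

    def forward(self, X, P):
        # X: (d, N) compound atom features; P: (d, L) protein residue features.
        C = torch.tanh(P.t() @ self.W_b @ X)                      # (L, N) correlation matrix
        H_x = torch.tanh(self.W_x @ X + (self.W_p @ P) @ C)      # (k, N)
        H_p = torch.tanh(self.W_p @ P + (self.W_x @ X) @ C.t())  # (k, L)
        a_x = torch.softmax(self.w_hx @ H_x, dim=-1)             # (N,) per-atom attention
        a_p = torch.softmax(self.w_hp @ H_p, dim=-1)             # (L,) per-residue attention
        x_hat = X @ a_x                                          # (d,) attended compound feature
        p_hat = P @ a_p                                          # (d,) attended protein feature
        return x_hat, p_hat, a_x, a_p
```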
Concatenating the compound molecular features and protein molecular features with contribution degrees and feeding them into an FCNN (fully connected neural network) to obtain the predicted affinity value: based on the compound molecular feature and protein molecular feature with contribution degrees, the two feature parts are concatenated and the joint feature of the compound molecule and the protein molecule is extracted through three FCNN layers. The compound molecular feature is 128-dimensional and the protein molecular feature is 128-dimensional, so the concatenated feature is 256-dimensional; the first FCNN layer transforms the joint feature dimension 256 → 256, the second layer 256 → 128, and the third layer 128 → 1, and the final single value is the affinity value predicted by the model, as shown in the following formula:
ŷ = FCNN(Concat(x̂, p̂))
wherein Concat represents the concatenation operation that maps the compound molecular feature x̂ and the protein molecular feature p̂ with contribution degrees to the 256-dimensional joint feature vector, and FCNN is the three-layer fully connected neural network described above whose final single output value is the affinity predicted by the GECo model.
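A sketch of the prediction head under the dimensions stated above: the two 128-dimensional attended features are concatenated into a 256-dimensional vector and passed through three fully connected layers (256 → 256 → 128 → 1); the ReLU activations are an assumption.

```python
import torch
import torch.nn as nn

prediction_head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

def predict_affinity(x_hat, p_hat):
    # Concat the attended compound and protein features, then regress the affinity.
    joint = torch.cat([x_hat, p_hat], dim=-1)  # (256,)
    return prediction_head(joint)              # single predicted affinity value
```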
Performing verification based on the trained GECo model;
the model was trained and evaluated on the two data sets Davis and KIBA described above, and each data set was split into a training set, a validation set, and a test set at a ratio of 70%,15%, and 15% for each data set. In the experimental process, an NVIDIA GeForce RTX 3090 GPU running code is used, a training set is used for training a model, 2000 epochs are respectively trained on a Davis data set and a KIBA data set, the effect of a verification set is used for evaluating the hyper-parameters in the model, the optimal hyper-parameter selection is further determined, and the model with the best effect on the verification set is selected to be used for obtaining a prediction result on a test set. In order to make the result more stable, the model is finally tested 5 times by using the test set, the final result is the average value of the five test results, and the selection of the optimal hyper-parameter is shown in table 2.
TABLE 2
[Table 2: optimal hyper-parameter settings; presented as an image in the original publication.]
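A sketch of the 70%/15%/15% split mentioned above, using scikit-learn; the placeholder list `pairs`, the variable names, and the random seed are assumptions.

```python
from sklearn.model_selection import train_test_split

pairs = list(range(1000))  # stand-in for the compound-protein samples of Davis or KIBA

train, rest = train_test_split(pairs, test_size=0.30, random_state=42)  # 70% train
valid, test = train_test_split(rest, test_size=0.50, random_state=42)   # 15% / 15%
```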
To evaluate the model and for ease of comparison with other models, two of the most common evaluation indices were used: Mean Squared Error (MSE) and Concordance Index (CI). MSE is a common index measuring the error between predicted and true results, and the smaller its value the better. The MSE formula is as follows:
MSE = (1/n) Σ_{i=1}^{n} (y_i − y_i*)^2
wherein n denotes that the data sample contains n compound-protein pairs, y_i represents the actual affinity value of the i-th data sample, and y_i* represents its predicted affinity value.
CI is an index evaluating how well the model's predictions preserve the ordering of the true values. It is computed over pairs of data points formed at random: if the data point with the higher true affinity also receives the higher predicted value, the prediction is consistent with reality, which is called concordance, and the larger the value the better. The formula of CI is as follows:
CI = (1/Z) Σ_{δ_i > δ_j} f(b_i − b_j)
wherein δ_i and δ_j represent the actual affinity values of a pair of data samples, b_i denotes the predicted value for the larger affinity δ_i, b_j denotes the predicted value for the smaller affinity δ_j, Z is a normalization constant equal to the number of pairs with δ_i > δ_j, and f(x) represents a step function given by:
f(x) = 1 if x > 0; 0.5 if x = 0; 0 if x < 0.
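A sketch of the two evaluation indices defined above; the pairwise CI loop is quadratic in the number of samples, which is acceptable for test sets of this scale.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error between predicted and true affinity values.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def concordance_index(y_true, y_pred):
    # Over all pairs whose true affinities differ, count how often the predicted
    # ordering agrees with the true ordering; prediction ties count as 0.5.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    num, den = 0.0, 0.0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                den += 1.0
                if y_pred[i] > y_pred[j]:
                    num += 1.0
                elif y_pred[i] == y_pred[j]:
                    num += 0.5
    return num / den if den > 0 else 0.0
```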
as shown in fig. 5 and 6, the prediction results of the GECo model on the Davis dataset and the KIBA dataset are shown in the test set, respectively. In the two scatter diagrams, the abscissa is the true value of the data, the ordinate is the predicted value of the data, the oblique line in the diagrams is the correlation regression line of the predicted value and the true value, and the distribution histograms above and on the right of the diagrams are the distribution conditions of the predicted value and the true value. Firstly, a regression line about a predicted value and a true value is obtained from the prediction results of the two test sets, so that data points are uniformly and intensively distributed on two sides of the regression line. Where the regression line for the KIBA test set is closer to y = x by one point, and the closer the regression line is to y = x, indicating that the smaller the error of the predicted value from the true value, the final model yielded an MSE of 0.196, a ci of 0.905 on the Davis dataset, an MSE of 0.135 on the KIBA dataset, and a ci of 0.892. The MSE for the KIBA dataset was smaller than the Davis dataset, indicating that the model performs better in the KIBA dataset. From the distribution histograms above and on the right side of the two graphs, respectively, the distributions of the predicted values and the true values are consistent, and the reliability of the effect of the GECo model can also be reflected.
Six reference models, KronRLS, SimBoost, DeepDTA, GraphDTA, DeepAffinity+ and MATT_DTI, were downloaded from GitHub; the parameter settings of the reference models follow their original optimal settings, and the datasets used and their splits are the same as for our model. Regarding the experimental equipment, the operating system was Linux (CentOS 7) and the graphics card was an NVIDIA GeForce RTX 3090 in all cases. Likewise, each model was run 5 times on the test set, and the average test-set results of each model were recorded.
Table 3 compares the results of the GECo model with the reference models. As the table shows, on the Davis dataset the MSE of our model improves by 13.6% over the best-performing MATT_DTI model, and the CI improves by 0.5% over the best-performing DeepAffinity+ model. On the KIBA dataset, the MSE of our model improves by 2.9% over the best-performing GraphDTA model, and the CI also exceeds GraphDTA. The two traditional machine learning models, KronRLS and SimBoost, are built on computing the similarity between ligands and protein targets; they treat the molecule as a whole in the similarity matrix, ignore the specific chemical information of the atoms in the molecule, and do not perform well. DeepDTA and MATT_DTI characterize the compound molecule with the SMILES character sequence and extract features with a 1D CNN. The molecular chemical and topological information contained in the SMILES character sequence is not rich enough, so the model cannot learn more comprehensive features; and although a 1D CNN can extract local features, it is not good at capturing the global information of the sequence. Our model is superior in that it exploits the graph-structured nature of the compound molecule, taking atoms as nodes and chemical bonds as edges to characterize the compound, constructing rich chemical features for the atoms, and using a graph neural network so that both the chemical information and the geometric features of the compound molecule are taken into account. Although GraphDTA also uses a graph neural network, for the characterization and processing of protein molecules it uses one-hot encoding of the protein sequence and a 1D CNN, the same as DeepDTA and MATT_DTI; our model instead uses the excellent protein molecular features generated by the pre-trained protein model ESM-1b in place of general-purpose one-hot features. The ESM-1b model uses a self-attention mechanism to capture the mutual attention among all amino acid residues over the global sequence of the whole protein molecule and builds feature vectors for the protein molecule accordingly, which can overcome the shortcoming of 1D CNNs that extract only local information. The DeepAffinity+ model also uses an attention mechanism, but its protein molecular features are based on a hierarchical attention mechanism: roughly, the protein sequence is divided into several fragments by length, attention is applied among the amino acids within a fragment and among the fragments, and protein contact-map features are then fused. This method ignores the fact that the protein molecule forms secondary structures, whose internal fragments are not always of equal length.
Summarizing the comparison with the reference models, it can be seen that our model, by comprehensively using a graph neural network together with protein features generated by a pre-trained model, overcomes the various shortcomings of the reference models, and its binding affinity prediction performance is superior to that of all the reference models.
TABLE 3
[Table 3: comparison of the GECo model with the reference models on the Davis and KIBA datasets; presented as an image in the original publication.]
Verifying the Co-Attention mechanism in the GECo model, the interaction sites of 3HDM and its ligand are computed and analysed with Discovery Studio software. Verifying the validity of the attention mechanism in our model means confirming that, in the interaction between a protein molecule and a ligand compound molecule, the Co-Attention module really does learn the atoms in the compound molecule and the amino acid residues in the protein molecule that contribute strongly to the compound-protein interaction, i.e., the potential binding sites. We selected the protein molecule with PDB ID 3HDM and its ligand 4-(5-phenyl-1H-pyrrolo[2,3-b]pyridin-3-yl)benzoic acid as the verification object. 3HDM is a serum- and glucocorticoid-regulated kinase, and 4-(5-phenyl-1H-pyrrolo[2,3-b]pyridin-3-yl)benzoic acid is an inhibitor of 3HDM, hereinafter referred to as the ligand. The PDB file, downloaded from the PDB, contains the complex of 3HDM and its ligand together with the three-dimensional coordinates of all their atoms.
We displayed the true interactions between 3HDM and the ligand in Discovery Studio software, as shown in FIG. 7 and FIG. 8; to display and analyse the interactions more clearly, the hydrogen-bond interactions and the Pi-bond interactions are shown separately. FIG. 7 shows the hydrogen-bond interactions between 3HDM and the chosen ligand, and FIG. 8 shows the Pi-bond interactions, including Pi-Sigma, Pi-Pi and Pi-Alkyl interactions. It should also be explained that the protein molecule chain appears broken because Discovery Studio only displays the amino acid residues at positions where there is an interaction.
Visualizing high Attention score interaction sites in a Co-Attention mechanism;
the protein sequence of 3HDM and the molecular diagram information of the ligand are input into a model, the attention scores of the protein sequence and the ligand molecule are calculated and analyzed, the amino acid residue with higher attention score and the atom in the compound molecule are used as the predicted binding position, and finally the predicted result and the actual binding position of the predicted result in a pdb file are compared and analyzed. We marked the atom positions with higher attention scores in the ligand red by analyzing the attention vector of the ligand molecules in the Co-attention module, and if the attention scores of the atoms on a ring are all higher, the ring will be stained with the average attention of all the atoms on the ring, with deeper colors indicating higher attention scores.
The attention vector of the protein molecule in the Co-Attention module was also analysed. The attention scores over all amino acid positions of the 3HDM protein sequence range from 0.001 to 0.007, the length of the 3HDM amino acid sequence is 373, and the regions or positions with attention scores above 0.004 are residues 101-105, 111-112, 149, 160, 162, 166, 177-178 and 182-187. We then used PyMol software to display 3HDM as a surface and to colour the high-attention amino acid regions or positions, as shown in FIG. 10.
The interaction positions obtained from the Discovery Studio analysis are compared with the high-attention positions in the Co-Attention module for verification. As shown in FIG. 9, the positions with the highest attention are clearly the hydroxyl position and the N-atom position of the left ring of the 7-azaindole; in FIG. 7 it can be seen that the hydroxyl group of the ligand forms hydrogen bonds with PHE109 and GLU226 respectively, while the N atom of the left ring of the 7-azaindole forms a hydrogen bond with ILE179, and the C atom adjacent to this N atom also forms a carbon-hydrogen bond with ILE179. The hydrogen bond is one of the stronger intermolecular interactions, and the two atomic positions forming hydrogen bonds correspond exactly to the two positions with the highest attention scores, showing that the Co-Attention module in the model assigns high attention to these two positions. Similarly, two ring structures with higher attention scores can be observed in FIG. 9, with the 7-azaindole scoring relatively higher. From the real interactions shown in FIG. 8, the left ring of the 7-azaindole forms a Pi-Pi stacked interaction with TYR178, a Pi-Sigma interaction with ILE104, and Pi-Alkyl interactions with ALA125 and LEU229, four interactions in total; the right ring of the 7-azaindole forms a Pi-Alkyl interaction with ALA125, and the six-membered ring near the hydroxyl group forms a Pi-Sigma interaction with VAL112. The positions of these interacting rings in the ligand also correspond to the attention-score analysis in FIG. 9: the ring with 4 interactions scores higher and is coloured darker than the ring with 1 interaction, indicating that the Co-Attention module also focuses on the rings that contribute strongly to the interactions. Overall, the Co-Attention mechanism of the GECo model can indeed effectively extract the sites in the ligand molecule that contribute strongly to the binding affinity.
The ligand in FIG. 10 is shown at the position where it actually binds to 3HDM, and it is apparent from the figure that the coloured amino acid positions are distributed roughly around the binding pocket of 3HDM and its ligand, which achieves the desired effect. Although the high-attention positions in the protein sequence do not match the binding positions as closely as in FIG. 9, the Co-Attention module still captures the amino acid sequence near the binding pocket of 3HDM and its ligand with acceptable effectiveness.
Combining the analyses of the high-attention atom positions in the compound molecule and the high-attention amino acid residue positions in the protein molecule in this step, the Co-Attention mechanism in the GECo model effectively infers the compound-protein interaction positions.
The above-described embodiments are only intended to describe the preferred embodiments of the present application, and not to limit the scope of the present application, and various modifications and improvements made to the technical solutions of the present application by those skilled in the art without departing from the design spirit of the present application should fall within the protection scope defined by the claims of the present application.

Claims (10)

1. The computing method for predicting the compound-protein affinity based on the GECo model is characterized by comprising the following steps of:
obtaining compound-protein pair data in a dataset;
respectively processing the compound-protein pairs to respectively obtain the molecular characteristics of the compound and the molecular characteristics of the protein;
inputting the molecular characteristics of the compound and the molecular characteristics of the protein into a GECo model for training to obtain the binding affinity of the compound and the protein.
2. The method of claim 1, wherein the step of separately treating the pairs of compound-protein pairs comprises:
the method for treating the compound molecule comprises the following steps:
using the RDKit toolkit, constructing atomic chemical features for the atoms in the compound molecule to form overall atomic features, and combining the overall atomic features to generate the compound molecular feature;
the method for processing the protein molecule comprises the following steps:
protein molecular signatures were generated for protein molecules by using a pre-trained protein model, ESM-1b model.
3. The method for predicting compound-protein affinity based on the GECo model of claim 1, wherein the method of inputting the molecular characteristics of the compound and the molecular characteristics of the protein into the GECo model for training comprises:
processing the molecular characteristics of the compound through a graph isomorphic network in a GECo model to obtain the molecular characteristics of the compound which are fused and updated;
reducing the dimensionality of the protein molecules to obtain the protein molecule characteristics after the dimensionality reduction;
obtaining a compound molecular characteristic and a protein molecular characteristic with contribution degrees based on the fusion updated compound molecular characteristic and the protein molecular characteristic after dimensionality reduction;
and obtaining an affinity predicted value based on the molecular characteristics of the compound with the contribution degree and the molecular characteristics of the protein.
4. The method for predicting compound-protein affinity based on the GECo model of claim 3, wherein the method for obtaining the fusion-updated compound molecular features comprises:
x_i^(k) = MLP((1 + ε^(k)) · x_i^(k-1) + Σ_{j∈N(i)} x_j^(k-1))
wherein MLP is a multilayer perceptron, ε^(k) is a learnable parameter, i is an atom, x_i^(k-1) is the feature of atom i after fusion and update by the (k-1)-th GIN layer, N(i) is the set of atoms chemically bonded to atom i in the molecule, x_j^(k-1) is the feature of a neighbour atom j after fusion and update by the (k-1)-th GIN layer, and the molecular feature x_i^(k) of atom i is finally obtained after k layers of iteration; when k = 1, x_i^(k-1) is the original input to the GIN.
5. The method for predicting compound-protein affinity based on the GECo model of claim 3, wherein the method for obtaining the molecular characteristics of the protein after dimension reduction comprises:
P_L = FCNN(FCNN(P_L'))
wherein FCNN represents a fully connected neural network, P_L' represents the protein molecular features output by ESM-1b, and P_L represents the dimensionality-reduced protein molecular features, P_L ∈ R^(L×128).
6. The method for predicting compound-protein affinity based on the GECo model of claim 3, wherein obtaining the contributing compound molecular characteristics and the protein molecular characteristics comprises:
obtaining a correlation matrix based on the fusion updated compound molecular characteristics and the dimensionality reduced protein molecular characteristics;
obtaining an attention weight score based on the correlation matrix;
and obtaining the molecular characteristics of the compound with the contribution degree and the molecular characteristics of the protein based on the attention score.
7. The method for predicting compound-protein affinity based on the GECo model of claim 6, wherein the method for obtaining the correlation matrix comprises:
for the fusion-updated compound molecular feature X ∈ R^(d×N) and the dimensionality-reduced protein molecular feature P_L ∈ R^(L×128), the correlation matrix C ∈ R^(L×N) can be calculated by the following formula:
C = tanh(P^T W_b X)
wherein W_b ∈ R^(d×d) represents the correlation weight matrix.
8. The method of claim 6, wherein the method of obtaining the attention weight scores comprises:
H_x = tanh(W_x X + (W_p P) C)
H_p = tanh(W_p P + (W_x X) C^T)
a_x = softmax(w_hx^T H_x)
a_p = softmax(w_hp^T H_p)
wherein C represents the correlation matrix and C^T its transpose; the correlation matrix C transforms the protein attention space into the compound-molecule attention space, and its transpose C^T does the reverse; H_x is the compound molecular feature matrix that attends to and fuses the protein molecular features on the basis of the correlation matrix, and H_p is the protein molecular feature matrix that attends to and fuses the compound molecular features on the basis of the transpose of the correlation matrix; W_x, W_p ∈ R^(k×d) and w_hx, w_hp ∈ R^k represent weight parameters; a_x ∈ R^N and a_p ∈ R^L denote the attention scores of each atom of the compound and each amino acid residue of the protein, respectively.
9. The method for predicting compound-protein affinity based on the GECo model of claim 6, wherein the method for obtaining the compound molecular features and protein molecular features with contribution degrees comprises:
based on the attention weight scores, the formulas for obtaining the compound molecular features and protein molecular features with contribution degrees are:
x̂ = Σ_{n=1}^{N} a_n^x · x_n
p̂ = Σ_{l=1}^{L} a_l^p · p_l
wherein a_n^x denotes the attention score of the n-th atom in the compound molecule, x_n is the feature of the n-th atom in the compound molecule, a_l^p denotes the attention score of the l-th amino acid in the protein molecule, p_l denotes the feature of the l-th amino acid in the protein molecule, x̂ denotes the compound molecular feature, and p̂ denotes the protein molecular feature;
and taking the attention scores as the contribution degrees of the compound molecule and the protein molecule in the interaction, the compound molecular features and protein molecular features with contribution degrees are finally obtained.
10. The method for predicting compound-protein affinity based on the GECo model of claim 3, wherein the method for obtaining the predicted affinity value comprises:
ŷ = FCNN(Concat(x̂, p̂))
wherein Concat represents the concatenation operation.
CN202211332124.0A 2022-10-28 2022-10-28 Calculation method for predicting compound-protein affinity based on GECo model Active CN115713965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211332124.0A CN115713965B (en) 2022-10-28 2022-10-28 Calculation method for predicting compound-protein affinity based on GECo model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211332124.0A CN115713965B (en) 2022-10-28 2022-10-28 Calculation method for predicting compound-protein affinity based on GECo model

Publications (2)

Publication Number Publication Date
CN115713965A true CN115713965A (en) 2023-02-24
CN115713965B CN115713965B (en) 2023-05-23

Family

ID=85231477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211332124.0A Active CN115713965B (en) 2022-10-28 2022-10-28 Calculation method for predicting compound-protein affinity based on GECo model

Country Status (1)

Country Link
CN (1) CN115713965B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion
CN116612835A (en) * 2023-07-18 2023-08-18 微观纪元(合肥)量子科技有限公司 Training method for compound property prediction model and prediction method for compound property

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1817908A (en) * 2005-02-03 2006-08-16 上海长海医院 Recombinant solvent protein derivative, its production and use
CN112582020A (en) * 2020-12-18 2021-03-30 中国石油大学(华东) Method for predicting compound protein affinity based on edge attention mechanism, computer device and storage medium
US20210330744A1 (en) * 2017-02-10 2021-10-28 Temple University-Of The Commonwealth System Of Higher Education Methods and Compositions for Treating Neurodegeneration and Fibrosis
CN114743590A (en) * 2021-01-07 2022-07-12 中国石油大学(华东) Drug-target affinity prediction system based on graph convolution neural network, computer device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1817908A (en) * 2005-02-03 2006-08-16 上海长海医院 Recombinant solvent protein derivative, its production and use
US20210330744A1 (en) * 2017-02-10 2021-10-28 Temple University-Of The Commonwealth System Of Higher Education Methods and Compositions for Treating Neurodegeneration and Fibrosis
CN112582020A (en) * 2020-12-18 2021-03-30 中国石油大学(华东) Method for predicting compound protein affinity based on edge attention mechanism, computer device and storage medium
CN114743590A (en) * 2021-01-07 2022-07-12 中国石油大学(华东) Drug-target affinity prediction system based on graph convolution neural network, computer device and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion
CN116486900B (en) * 2023-04-25 2024-05-03 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion
CN116612835A (en) * 2023-07-18 2023-08-18 微观纪元(合肥)量子科技有限公司 Training method for compound property prediction model and prediction method for compound property
CN116612835B (en) * 2023-07-18 2023-10-10 微观纪元(合肥)量子科技有限公司 Training method for compound property prediction model and prediction method for compound property

Also Published As

Publication number Publication date
CN115713965B (en) 2023-05-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant