CN113421658B

CN113421658B - Drug-target interaction prediction method based on neighbor attention network

Info

Publication number: CN113421658B
Application number: CN202110759813.9A
Authority: CN
Inventors: 施建宇; 赵鹏程; 徐意; 朱蓓
Original assignee: Northwestern Polytechnical University
Current assignee: Shaanxi Exquisite Technology Development Co ltd
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2023-06-16
Anticipated expiration: 2041-07-06
Also published as: CN113421658A

Abstract

The invention provides a drug-target interaction (DTI) prediction method based on a neighbor attention network. The prediction model is a neighbor attention network (NNAttNet), which addresses the poor interpretability, the sensitivity to missing labels, and the difficulty with new drugs or targets found in existing methods by constructing neighbor-based embedded representations of drug-target pairs (DTPs). In addition, NNAttNet provides attention-based selection of key features so as to predict DTIs accurately, and its evaluation on a benchmark dataset shows that NNAttNet achieves better DTI prediction performance.

Description

Drug-target interaction prediction method based on neighbor attention network

Technical Field

The invention belongs to the technical field of computer-aided drug research and development, and particularly relates to a drug-target interaction prediction method based on a neighbor attention network.

Background

The shift in the drug discovery paradigm from "one drug, one target" to "multiple drugs, multiple targets" reveals the links between drugs and targets and facilitates the discovery of potential drug-target interactions (DTIs), which is a fundamental task in drug development. However, determining DTIs through biological experiments is time-consuming and laborious.

In recent years, as more and more DTI data have been generated, various databases have been developed, and this accumulation has prompted the application of computational methods, particularly machine-learning-based methods, which show good predictive performance in finding potential DTIs. However, although researchers have made great efforts and achieved significant results in DTI prediction, challenges remain in practice, mainly the following:

1) Lack of interpretability: the embedded representations give an insufficient account of the DTI prediction mechanism;

2) The prediction model is very sensitive to missing labels;

3) Predicting interactions for new compound molecules or new proteins is difficult.

In view of this, there is a need to develop a new approach to predicting "drug-target" interactions.

Disclosure of Invention

The invention aims to solve the defects existing in the prior art and provides a drug-target interaction prediction method based on a neighbor attention network.

To this end, the inventors first studied and analyzed in depth the problems existing in the prior art, and found that:

1. Lack of interpretability: the embedded representations give an insufficient account of the DTI prediction mechanism. This is mainly because the drug/target embedded representations learned by existing deep learning (DL) or matrix factorization (MF) methods are difficult to interpret; the resulting latent space offers no easy way to indicate how its properties affect interactions, and this black-box nature prevents direct guidance of drug design.

2. The prediction model is very sensitive to missing labels. This is mainly because, in practice, the labels collected for drug-target pairs are incomplete; existing methods rarely take the missing interaction labels between drug-target pairs into account and do not consider whether the missing interactions are helpful for DTI prediction.

3. Predicting interactions for new compound molecules or new proteins is difficult. At present, two prediction modes are mainly adopted, namely transductive prediction and inductive prediction.

The task of transductive prediction is to construct a function mapping $F: D \times T \to [0, 1]$ to infer potential interactions between unlabeled drug-target pairs, where the features or similarities of the drugs and targets are used to learn the function $F$. Inductive learning corresponds to the well-known cold-start problem in recommender systems, and the task of inductive prediction is likewise to learn a function mapping $F: D \times T \to [0, 1]$. However, it can infer potential interactions between a new drug molecule $d_x$ and a new target protein $t_y$, or infer the interactions between $D$ and $t_y$, with the features or similarities of $d_x$ and $t_y$ being learned in $F$. Almost all current similarity-based DTI prediction approaches, however, are transductive: they extract topological embedded features from the DTI network or the similarity matrices, and their training phase uses both labeled training samples and unlabeled test samples. Consequently, whenever the labels of new samples need to be determined in practice, the model must be trained again, which cannot meet the current requirements of drug development.

Therefore, in order to achieve the above object, the technical solution provided by the present invention is:

the drug-target interaction prediction method based on the neighbor attention network is characterized by comprising the following steps of:

1) Construction of "drug-target" pair interaction prediction model

The drug-target pair interaction prediction model consists of a neighbor attention module and a deep neural network module;

2) Collecting sample data, and training the drug-target interaction prediction model constructed in the step 1) to obtain a trained drug-target interaction prediction model;

the sample data includes relevant data of the drug and the target and actual interactions of the drug and the target; the specific training process is as follows:

2.1) Using the relevant data, calculating the pairwise similarity among all drug molecules and the pairwise similarity among all target proteins, and constructing the interaction matrix A between the drug molecules and the target proteins;

wherein the related data comprises structural information of the drug molecules, sequence information of the target proteins and interaction relation information of the drug molecules and the target proteins;

2.2) Constructing a TsDNA module by using the interaction matrix A between the drug molecules and the target proteins obtained in step 2.1) and the similarity data between the drug molecules, and extracting the embedded representation, i.e., the feature vector, between a target protein and all the drug molecules, which represents whether they are connected;

and/or

constructing a DsTNA module by using the interaction matrix A between the drug molecules and the target proteins obtained in step 2.1) and the similarity data between the target proteins, and extracting the embedded representation, i.e., the feature vector, between a drug molecule and all the target proteins, which represents whether they are connected;

wherein a drug d is extracted by a TsDNA module _x And a target t _p The extraction process is as follows:

a1. according to all drugs and said drug d _x Sequencing all medicines in the sequence from high similarity to low similarity to obtain K ₁ 、K ₂ 、…K _m ；

a2. Obtaining all drugs and targets t _p Removing non-interacting drugs;

a3. obtaining the drug d _x And the target t _p The formula is as follows:

wherein ,

is assigned key, which is aMedicine->

Is a series of assigned keys, v _i Is->

S (·, ·) is d _x and />

Similarity of (2);

extraction of a target t by DsTNA Module _p With a drug d _x The extraction process is as follows:

b1. according to all targets and said target t _p Sequencing all targets in the sequence of similarity from high to low to obtain H ₁ 、H ₂ 、…H _m ；

b2. Obtaining all targets and drug d _x Removing non-interacting targets;

b3. acquiring the target t _p With the drug d _x Is embedded in the representation of (1) as follows

wherein ,

is assigned bond, which is a drug +.>

Is a series of assigned keys, u _i Is->

S (·, ·) is t _p and />

Similarity of (2);

the final representation of the drug $d_x$ and the target $t_p$ is generated by concatenating the bidirectional representations:

$e(d_x, t_p) = [a(d_x, t_p) \,\|\, a(t_p, d_x)]$;

for extracting the embedded representation of a new drug, since DsTNA cannot be constructed for it (new data), only TsDNA is constructed when building the test set and the training set, in order to keep the data balanced;

for extracting the embedded representation of a new target, since TsDNA cannot be constructed for it (new data), only DsTNA is constructed when building the test set and the training set, in order to keep the data balanced;

that is, $e(d_x, t_p) = [a(d_x, t_p)]$ or $e(d_x, t_p) = [a(t_p, d_x)]$;

2.3) Processing the embedded representations obtained in step 2.2) with a feature importance network:

S1. performing step 2.2) for all drugs and targets, and stacking the resulting embedded representations of the drug-target pairs together to obtain a matrix E;

S2. constructing a mapping attention matrix M from the matrix E obtained in step S1 by means of a deep neural network;

S3. constructing, from the matrix M obtained in step S2, an attention-enhanced representation matrix that facilitates identification;

2.4) inputting the attention-enhanced representation matrix obtained in step S3 into a deep neural network model as the input layer to obtain the predicted drug-target interactions;

2.5) comparing the predicted drug-target interactions obtained in step 2.4) with the actual interactions between the drugs and the targets, and updating the weights of the model through back propagation to obtain a trained drug-target pair interaction prediction model.

That is, the prediction model is an interpretable deep-learning-based model, namely NNAttNet, which comprises three modules: a neighbor attention module, a feature importance network, and a multi-layer deep neural network model. For drug-target pairs, the first module generates interpretable embedded representations that are more robust to missing labels in the training data and are applicable in both transductive and inductive prediction scenarios. In addition, the algorithm accepts not only feature inputs but also similarity inputs. The second module, the feature importance network, is built on the neighbor attention module and indicates the importance of each dimension of the embedded features, providing interpretable feature selection. The last module determines whether a drug-target pair is a potential DTI.

3) Predicting drug-target interactions by using the trained drug-target interaction prediction model from step 2).

Further, in step 2.1), the pairwise similarity among all drug molecules is calculated from the collected structural information of the drug molecules using the SIMCOMP method;

and the pairwise similarity among all target proteins is calculated from the collected sequence information of the target proteins using the Smith-Waterman algorithm.

Further, the SIMCOMP method is specifically as follows:

SIMCOMP uses a graph alignment algorithm to provide a global similarity score based on the size of the common substructure between two drug compounds, where the similarity $s(c, c')$ of compounds $c$ and $c'$ is calculated as follows:

$$s(c, c') = \frac{|c \cap c'|}{|c \cup c'|}$$
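As a minimal sketch of the scoring step only, the ratio of common-substructure size to union size can be computed as below. Finding the common substructure itself requires the graph alignment performed by SIMCOMP, which is not shown; the function name `simcomp_score` and the use of atom counts as the measure of "size" are assumptions made for illustration.

```python
def simcomp_score(size_c, size_c2, size_common):
    """Jaccard-style score |c ∩ c'| / |c ∪ c'| from substructure sizes.

    size_c, size_c2: sizes of the two compounds (assumed here: atom counts).
    size_common: size of their common substructure, assumed to come from
    a separate graph alignment step that this sketch does not implement.
    """
    union = size_c + size_c2 - size_common  # inclusion-exclusion
    return size_common / union if union else 0.0


# Two compounds with 10 and 8 atoms sharing a 6-atom substructure.
score = simcomp_score(10, 8, 6)
```

The score is 1.0 for identical compounds and 0.0 when no substructure is shared.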

Further, the Smith-Waterman algorithm is specifically as follows:

the two target sequences to be aligned are defined as $A = a_1 a_2 a_3 \ldots a_n$ and $B = b_1 b_2 b_3 \ldots b_m$, where $n$ and $m$ are the lengths of sequences $A$ and $B$, respectively;

determining the parameters:

$s(a, b)$ is the similarity score between the elements that make up the two sequences;

$W_k$ is the penalty for a gap of length $k$;

creating a scoring matrix $H$ of size $(n+1) \times (m+1)$ and initializing its first row and first column, $H_{k0} = H_{0l} = 0$ for $0 \le k \le n$ and $0 \le l \le m$;

scoring from left to right and top to bottom to fill the remainder of the scoring matrix $H$, wherein:

$$H_{ij} = \max \begin{cases} H_{i-1,j-1} + s(a_i, b_j), \\ \max_{k \ge 1} \{ H_{i-k,j} - W_k \}, \\ \max_{l \ge 1} \{ H_{i,j-l} - W_l \}, \\ 0 \end{cases} \qquad 1 \le i \le n,\ 1 \le j \le m$$

the highest-scoring entry in the scoring matrix $H$ is selected as the matching score of sequence $A$ and sequence $B$, denoted $SW(A, B)$.

The similarity of sequence $A$ and sequence $B$ is:

$$s(A, B) = \frac{SW(A, B)}{\sqrt{SW(A, A)}\,\sqrt{SW(B, B)}}$$
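The scoring procedure above can be sketched as follows. This is an illustrative simplification, not the scoring scheme of the embodiment: it assumes a fixed match/mismatch score in place of a substitution matrix and a linear gap penalty $W_k = k \cdot \text{gap}$, for which extending a gap one position at a time is equivalent to the general recurrence. The parameter values are hypothetical.

```python
import math


def smith_waterman(a, b, match=2, mismatch=-1, gap=1):
    """Best local alignment score SW(a, b) under a linear gap penalty."""
    n, m = len(a), len(b)
    # (n+1) x (m+1) scoring matrix; first row and column stay 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):          # top to bottom
        for j in range(1, m + 1):      # left to right
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a_i with b_j
                          H[i - 1][j] - gap,     # gap in b
                          H[i][j - 1] - gap)     # gap in a
            best = max(best, H[i][j])
    return best


def normalized_similarity(a, b, **params):
    """SW(a, b) / (sqrt(SW(a, a)) * sqrt(SW(b, b))); needs non-empty inputs."""
    denom = math.sqrt(smith_waterman(a, a, **params) *
                      smith_waterman(b, b, **params))
    return smith_waterman(a, b, **params) / denom


raw = smith_waterman("ACGT", "ACT")
sim = normalized_similarity("ACGT", "ACGT")
```

The normalization makes the self-similarity of every sequence equal to 1, so that scores for sequences of different lengths are comparable.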

Further, in step S2), the attention matrix M is mapped as follows:

$M(:, i) = \mathrm{DNN}_i(E)$.

Further, in step S3), the attention-enhanced representation matrix is as follows:

[Formula image not available in the source text.]

Further, in step 2.4), the deep neural network model includes an input layer, a hidden layer using ReLU as the activation function, and a two-neuron output layer using Sigmoid as the activation function; the deep neural network model acts as a binary predictor, with the output layer producing probabilities that represent the likelihood of drug-target pair interactions. The entire NNAttNet network, with its neighbor awareness weights, feature importance terms, and DNN weights, can be jointly optimized by a binary cross-entropy loss function as follows:

[Formula image not available in the source text.]

wherein $Y$ is the true label of the drug-target pair; $f(\cdot)$ is the DNN; $\theta$ is the weight parameter of the entire network; $R(\cdot)$ is the L2-norm; and $\lambda$ is the coefficient of the regularization term.

The present invention also provides a computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program realizes the steps of the above method when being executed by a processor.

An electronic device, characterized by comprising a processor and a computer-readable storage medium;

the computer readable storage medium has stored thereon a computer program which, when executed by the processor, performs the steps of the above method.

The invention has the advantages that:

The invention provides a deep-learning-based prediction method in which the prediction model is a neighbor attention network (NNAttNet); the problems described above are addressed by constructing neighbor-based embedded representations of drug-target pairs (DTPs). In addition, NNAttNet provides attention-based selection of key features so as to predict DTIs accurately, and its evaluation on a benchmark dataset shows that NNAttNet achieves better DTI prediction performance.

Drawings

FIG. 1 shows the overall architecture of the proposed method, NNAttNet;

FIG. 2 is a basic block diagram of the TsDNA module;

FIG. 3 is a schematic diagram of the construction of the feature importance matrix;

FIG. 4 shows the arrangement and distribution of the drug key embedding features;

FIG. 5 shows the arrangement and distribution of the importance of the drug features;

FIG. 6 is a graph of the predictive performance of the top-k features;

FIG. 7 shows the distribution of feature importance at different DTI missing rates.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings and specific examples.

One embodiment of the drug-target interaction prediction method based on a neighbor attention network according to the present invention is as follows.

This embodiment includes three parts:

In the first part, when the test set is constructed, a portion of the edges in the network is deleted, and these edges are then predicted.

In the second part, some drugs and all of their edges are deleted from the network, so as to simulate the scenario of new-drug prediction.

In the third part, some targets and all of their edges are deleted from the network, so as to simulate the scenario of new-target prediction.

When compiling and presenting the results, we report the performance metrics of the overall results rather than the results for any specific drug (or target).

This embodiment uses the original benchmark dataset in the predictive performance comparison experiments. The dataset is divided into 4 subsets according to the protein family of the targets: enzymes (En), ion channels (IC), G-protein-coupled receptors (GPCR), and nuclear receptors (NR). Each subset includes the known drug-target interactions, the pairwise similarities between drugs, and the pairwise similarities between targets. The pairwise similarities between drugs are calculated by the SIMCOMP algorithm, and the pairwise similarities between targets are calculated by the Smith-Waterman algorithm. The details of this dataset are shown in Table 1.

Table 1. Details of the benchmark datasets

For each data subset, a dictionary K of virtual keys is set up and assigned values from the acquired data, and the TsDNA and DsTNA modules between the drug molecules and the target proteins are constructed, finally yielding the embedded representations of all drug-target pairs.

TsDNA (FIG. 2) consists of a dictionary K of virtual keys and values v describing these virtual keys. In the dictionary, the virtual keys are ordered by neighbor rank: the first key is the nearest neighbor, the second key is the second-nearest neighbor, and the last key is the farthest neighbor. Notably, this ordering is a contribution that makes it possible to explain the distinction in classification between known DTIs and unknown DTIs.

When considering whether a drug $d_x$ interacts with a target protein $t_p$, the empty keys are assigned by the other drugs known to interact with the target $t_p$. Conversely, no operation is performed for the drugs that do not interact with $t_p$. The attention of $d_x$ on $t_p$ can be defined as:

[Formula image not available in the source text.]

Here, $K_i$ is an assigned key, which is a drug; $\{K_i\}$ is the series of assigned keys; $v_i$ is the value of $K_i$; and $S(\cdot, \cdot)$ is the similarity between $d_x$ and $K_i$. Note that virtual keys are featureless and acquire features only after being assigned. This attention gives the unidirectional representation $d_x \to t_p$.

To enhance the interpretability of the TsDNA module, we set $V = [v_1, v_2, \ldots, v_{|K|}]$ to be a diagonal matrix, in which $v_i$ is a one-hot-like vector; that is, the $i$-th element of $v_i$ is nonzero and all other elements are 0. In this way, the representation $d_x \to t_p$ is sparse. Given the widely accepted assumption that similar drugs tend to interact with the same target protein, it is hypothesized that if drug $d_x$ and target $t_p$ interact, their representation has more non-zero values in its first few feature dimensions than that of a drug and a target that do not interact. In other words, regarding the interaction between $d_x$ and $t_p$, if other drugs interact with $t_p$, they are usually among the nearest neighbors of $d_x$. This sparse attention embedding provides the evidence for the later interpretability of TsDNA.
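The construction of the sparse unidirectional representation $d_x \to t_p$ can be sketched as follows. The exact formula is not legible in this text, so the sketch assumes the simplest form consistent with the definitions: with one-hot values $v_i$, position $i$ of the embedding holds the similarity $S(d_x, K_i)$ when the $i$-th nearest neighbor interacts with $t_p$, and 0 otherwise. The function name `tsdna_embedding`, the exclusion of $d_x$ from its own neighbor list, and the toy data are all assumptions.

```python
def tsdna_embedding(x, p, drug_sim, A, num_keys=None):
    """Sketch of a(d_x, t_p): similarities to ranked, interacting neighbors.

    x, p: indices of the query drug and the target.
    drug_sim: square matrix of pairwise drug similarities.
    A: interaction matrix, A[j][p] == 1 if drug j interacts with target p.
    num_keys: optional cap on the dictionary size |K| (assumption).
    """
    n_drugs = len(drug_sim)
    # Rank every other drug by similarity to d_x, most similar first.
    neighbors = sorted((j for j in range(n_drugs) if j != x),
                       key=lambda j: drug_sim[x][j], reverse=True)
    if num_keys is not None:
        neighbors = neighbors[:num_keys]
    # A key is assigned only if that neighbor is known to interact with t_p;
    # unassigned keys stay at 0, which keeps the embedding sparse.
    return [drug_sim[x][j] if A[j][p] == 1 else 0.0 for j in neighbors]


drug_sim = [[1.0, 0.8, 0.3, 0.5],
            [0.8, 1.0, 0.4, 0.2],
            [0.3, 0.4, 1.0, 0.6],
            [0.5, 0.2, 0.6, 1.0]]
A = [[0, 1],
     [1, 0],
     [0, 1],
     [1, 1]]
embedding = tsdna_embedding(0, 0, drug_sim, A)  # [0.8, 0.5, 0.0]
```

In this toy case the two nearest neighbors of drug 0 both interact with target 0, so the non-zero entries sit in the leading dimensions, which is the pattern the hypothesis above associates with an interacting pair.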

Because the nodes of the two networks play symmetrical roles, we can similarly construct a DsTNA module that outputs the other unidirectional representation $t_p \to d_x$, expressed as $a(t_p, d_x)$. The final representation of the pair $d_x$ and $t_p$ is therefore generated by concatenating the bidirectional representations:

$e(d_x, t_p) = [a(d_x, t_p) \,\|\, a(t_p, d_x)]$
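A minimal sketch of the concatenation, including the single-direction fallback used for new drugs or new targets, is given below. It assumes the same similarity-weighted form for both directions as in the text's definitions; the helper names `neighbor_attention` and `pair_embedding` and the `mode` argument are hypothetical.

```python
def neighbor_attention(query, sim, interacts):
    """One-directional embedding over the ranked neighbors of `query`.

    sim: square similarity matrix for the query's own node type.
    interacts: interacts[j] is truthy if neighbor j interacts with the partner.
    """
    order = sorted((j for j in range(len(sim)) if j != query),
                   key=lambda j: sim[query][j], reverse=True)
    return [sim[query][j] if interacts[j] else 0.0 for j in order]


def pair_embedding(x, p, drug_sim, target_sim, A, mode="both"):
    """e(d_x, t_p) as [a(d_x, t_p) || a(t_p, d_x)], or one direction only."""
    drugs_hitting_p = [A[j][p] for j in range(len(A))]
    targets_hit_by_x = A[x]
    a_dt = neighbor_attention(x, drug_sim, drugs_hitting_p)     # TsDNA
    a_td = neighbor_attention(p, target_sim, targets_hit_by_x)  # DsTNA
    if mode == "tsdna":   # new drug: no known interactions for DsTNA
        return a_dt
    if mode == "dstna":   # new target: no known interactions for TsDNA
        return a_td
    return a_dt + a_td    # list concatenation


drug_sim = [[1.0, 0.7, 0.2], [0.7, 1.0, 0.4], [0.2, 0.4, 1.0]]
target_sim = [[1.0, 0.6], [0.6, 1.0]]
A = [[0, 1], [1, 0], [1, 1]]
e = pair_embedding(0, 0, drug_sim, target_sim, A)  # [0.7, 0.2, 0.6]
```

Using the same mode for both the training set and the test set keeps the embedding length identical across the two, which is one way to read the data-balance requirement stated for new drugs and new targets.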

The embedded representations of all drug-target pairs are stacked together and denoted as the embedding matrix E.

The matrix E is modeled by the formula $M(:, i) = \mathrm{DNN}_i(E)$ to obtain the attention matrix M.

The attention-enhanced representation matrix is then constructed from the matrix M by the following formula:

[Formula image not available in the source text.]
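The following sketch illustrates one way the feature importance step could look. The formula for the attention-enhanced matrix is not legible in this text, so the element-wise product of M and E used here is an assumption, as are the one-hidden-layer networks, their sigmoid outputs, the random weights, and all function names.

```python
import math
import random


def tiny_dnn(row, w_hidden, w_out):
    """One-hidden-layer network: ReLU hidden units, sigmoid output in (0, 1)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(ws, row))) for ws in w_hidden]
    z = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))


def attention_enhanced(E, nets):
    """Return (M, E_hat) with M[:, i] = DNN_i(E) and E_hat = M * E elementwise.

    The elementwise product is an assumed form of the enhancement step.
    """
    M = [[tiny_dnn(row, w_h, w_o) for (w_h, w_o) in nets] for row in E]
    E_hat = [[m * e for m, e in zip(m_row, e_row)]
             for m_row, e_row in zip(M, E)]
    return M, E_hat


def random_nets(n_features, n_hidden=4, seed=0):
    """Untrained placeholder weights, one small network per feature column."""
    rng = random.Random(seed)
    return [([[rng.uniform(-1, 1) for _ in range(n_features)]
              for _ in range(n_hidden)],
             [rng.uniform(-1, 1) for _ in range(n_hidden)])
            for _ in range(n_features)]


E = [[0.8, 0.5, 0.0],
     [0.0, 0.0, 0.3]]
M, E_hat = attention_enhanced(E, random_nets(3))
```

Under this assumed form, M has the same shape as E and rescales each feature dimension, so dimensions that were zero in the sparse embedding remain zero after enhancement.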

A generic DNN is used as a binary predictor to predict whether a drug-target pair interacts. The binary predictor comprises an input layer, i.e., the embedded representation of the drug-target pair, a hidden layer with ReLU as the activation function, and a two-neuron output layer with Sigmoid as the activation function. The output layer produces a probability that indicates the likelihood that the drug-target pair interacts. The entire NNAttNet network, with its neighbor awareness weights, feature importance terms, and DNN weights, can be jointly optimized by a binary cross-entropy loss function as follows:

[Formula image not available in the source text.]

wherein $Y$ is the true label of the drug-target pair; $f(\cdot)$ is the DNN; $\theta$ is the weight parameter of the entire network; $R(\cdot)$ is the L2-norm; and $\lambda$ is the coefficient of the regularization term.
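A common form of such a loss, consistent with the definitions of $Y$, $f(\cdot)$, $\theta$, $R(\cdot)$, and $\lambda$ above, is binary cross-entropy plus an L2 penalty. The sketch below is written under that assumption; averaging over samples, using the squared L2 norm, and the names `bce_l2_loss`, `lam`, and `eps` are illustrative choices rather than details taken from the source.

```python
import math


def bce_l2_loss(y_true, y_prob, weights, lam=1e-3, eps=1e-12):
    """Mean binary cross-entropy plus lam * (sum of squared weights).

    y_true: true labels Y (0 or 1) of the drug-target pairs.
    y_prob: predicted interaction probabilities f(.) from the DNN.
    weights: flat list of network parameters theta.
    """
    bce = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(y_true)
    l2 = sum(w * w for w in weights)     # squared L2 norm as R(theta)
    return bce + lam * l2


loss = bce_l2_loss([1, 0], [0.5, 0.5], [1.0, 2.0], lam=0.1)
```

In the example, both predictions are uninformative (probability 0.5), so the cross-entropy term equals $\ln 2$ and the penalty term contributes $0.1 \times 5 = 0.5$.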

In this embodiment, we evaluate the performance of the various methods by 10-fold cross-validation (CV) and use AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) as the metrics of DTI prediction performance.

In the 10-fold cross-validation, we calculate the AUROC/AUPRC score of each prediction method in each round and obtain the final AUROC/AUPRC score by averaging the scores over the 10 rounds.
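The averaging of AUROC over rounds can be sketched as below. The sketch computes AUROC directly from its rank interpretation (the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair, with ties counted as one half); AUPRC is omitted for brevity. The function names and the toy fold data are assumptions.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(folds):
    """Average the per-fold AUROC; each fold is a (labels, scores) pair."""
    return sum(auroc(y, s) for y, s in folds) / len(folds)


folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),   # AUROC 0.75
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),    # AUROC 1.0
]
final_score = mean_auroc(folds)
```

Each fold must contain at least one positive and one negative pair, otherwise the score is undefined; with the sparse interaction matrices typical of DTI data this is worth checking when the folds are built.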

In order to comprehensively evaluate the performance of the various methods, this embodiment considers the following three scenarios for the CV experiments.

Under CVS1, 90% of the DTPs (drug-target pairs) are used for training, while the remaining 10% are used for testing in each round.

Under CVS2 (or CVS3), the interactions of 90% of the drugs (or targets) are used for training, and the interactions of the remaining 10% of the drugs (or targets) are used for testing.

CVS2 (CVS3) is a cold-start DTI prediction task because there is no overlap between the training drugs (targets) and the test drugs (targets).

Notably, CVS1 is a transductive prediction task.

CVS2/CVS3 may be a transductive or an inductive prediction task, depending on the nature of the prediction method. The experimental results of NNAttNet are shown in Tables 2, 3, and 4.
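The three cross-validation scenarios can be sketched as follows: CVS1 holds out drug-target pairs, CVS2 holds out whole drugs, and CVS3 holds out whole targets. The function names, the shuffling scheme, and the fixed seed are assumptions for illustration; the source does not describe how the folds were drawn.

```python
import random


def make_folds(items, n_folds=10, seed=0):
    """Shuffle items and deal them into n_folds roughly equal folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::n_folds] for i in range(n_folds)]


def split(n_drugs, n_targets, scenario, fold, n_folds=10, seed=0):
    """Return (train_pairs, test_pairs) for one round of CVS1/CVS2/CVS3."""
    pairs = [(d, t) for d in range(n_drugs) for t in range(n_targets)]
    if scenario == "CVS1":      # hold out individual drug-target pairs
        test = set(make_folds(pairs, n_folds, seed)[fold])
    elif scenario == "CVS2":    # hold out every pair of the selected drugs
        held = set(make_folds(range(n_drugs), n_folds, seed)[fold])
        test = {pair for pair in pairs if pair[0] in held}
    elif scenario == "CVS3":    # hold out every pair of the selected targets
        held = set(make_folds(range(n_targets), n_folds, seed)[fold])
        test = {pair for pair in pairs if pair[1] in held}
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    train = [pair for pair in pairs if pair not in test]
    return train, sorted(test)


train, test = split(n_drugs=20, n_targets=10, scenario="CVS2", fold=0)
```

Under CVS2 no test drug appears in any training pair, which is what makes it a cold-start setting; the same holds for targets under CVS3.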

Table 2. DTI prediction performance under CVS1 on the 4 datasets

Note: ROC and PR are abbreviations for AUROC and AUPRC.

Table 3. DTI prediction performance under CVS2 on the 4 datasets

Note: ROC and PR are abbreviations for AUROC and AUPRC.

Table 4. DTI prediction performance under CVS3 on the 4 datasets

Note: ROC and PR are abbreviations for AUROC and AUPRC.

The interpretability of the prediction method of the present invention is described below on the basis of the experimental results of this embodiment.

Taking the GPCR dataset as an example, the distribution over the dictionary of drug keys from $K_1$ to $K_{100}$ was obtained by computing two average embedded vectors, one for the known DTIs and one for the unlabeled DTPs (see FIG. 4). The significantly higher embedded feature values that occur among the first n nearest neighbors indicate that a drug interacting with a particular target always finds its first n nearest neighbors among the drugs that interact with the same target. This observation shows that if a drug-target pair has more non-zero entries in its first n feature dimensions (keys) than non-interacting pairs do, the drug may interact with the target.

This embodiment also shows which embedded features in the matrix M drive the interactions. Since entries with larger values in M represent important feature dimensions, the importance of each feature $f_i$ can be measured by the average of the values in column $i$ of M, i.e., of $M(:, i)$ (see FIG. 5). The importance distribution over the keys in the dictionary K shows that the features of relatively high importance are typically located among the first n nearest neighbors. This observation is highly consistent with the visual observation described above: over the first 10 keys, the Spearman correlation is large (r = 0.8182).
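The two measurements used in this analysis, per-feature importance as a column mean of M and the Spearman correlation between two importance profiles, can be sketched as below. The function names and the toy vectors are assumptions; the values reported in the text come from the trained model and are not reproduced by this example.

```python
def column_means(M):
    """Importance of feature i as the mean of column i of M."""
    n_rows = len(M)
    return [sum(row[i] for row in M) / n_rows for i in range(len(M[0]))]


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


importance = column_means([[0.25, 0.75], [0.75, 0.25]])
r = spearman([0.9, 0.5, 0.1], [0.8, 0.6, 0.2])
```

Because Spearman correlation depends only on ranks, it compares the ordering of the keys by importance rather than the magnitudes, which suits a comparison between an importance profile and an embedding profile measured on different scales.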

This embodiment also investigates the predictive performance of the top-k features (see FIG. 6). The values of k are taken from {1, 5, 10, 15, …, 220}. As k increases up to 50, the predictive performance improves sharply. As k increases further, the performance improves slowly and even decreases for larger k.

One reason NNAttNet still performs well when labels are missing is that it uses embedded vectors built from neighboring nodes. We studied the distribution of feature importance at different DTI missing rates (FIG. 7). The figure reveals that the distribution over the feature keys shows a similar trend at the different missing rates. Moreover, the feature importance vectors at the 9 missing rates are highly correlated: the Spearman correlation coefficients between the feature importance vector at a 10% missing rate and those at the other missing rates (20% to 90%) are 0.9996, 0.9993, 0.9989, 0.9979, 0.9969, 0.9943, 0.9919, and 0.9770, respectively. This high degree of correlation indicates that the feature importance network can still identify the critical features when data are missing. Thus, when labels are missing, even if only a few drugs that interact with the target are found among the first n nearest neighbors of the queried drug, the ordered key dictionary in the neighbor attention module can still ensure that the interaction between the queried drug and the target is identified.

The above examples demonstrate the feasibility of NNAttNet: it offers interpretability of drug-protein interactions, greater robustness when predicting with missing DTI labels, a consistent representation for transductive and inductive DTI prediction, and attention-based selection of important features for more accurate DTI prediction.

Implementation methods and common general knowledge that are well known in the above schemes are not described in detail here. It should be noted that those skilled in the art may make modifications and improvements without departing from the invention; these shall also be regarded as falling within the protection scope of the invention, and they do not affect the practice of the invention or the utility of the patent. The protection scope of the present application shall be determined by the claims, and the detailed description and other parts of the specification may be used to interpret the claims.

Claims

1. A method for predicting drug-target interactions based on a neighbor attention network, comprising the steps of:

1) Construction of "drug-target" pair interaction prediction model

The drug-target interaction prediction model consists of a neighbor attention module and a deep neural network module;

2.2) constructing a TsDNA module by using the interaction matrix A between the drug molecules and the target proteins obtained in step 2.1) and the similarity data between the drug molecules, and extracting the embedded representation between a target protein and all the drug molecules;

constructing a DsTNA module by using the interaction matrix A between the drug molecules and the target proteins obtained in step 2.1) and the similarity data between the target proteins, and extracting the embedded representation between a drug molecule and all the target proteins;

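a2. obtaining the interactions between all drugs and the target $t_p$, and removing the non-interacting drugs;

a3. obtaining the embedded representation of the drug $d_x$ with respect to the target $t_p$, the formula being as follows:

[Formula image not available in the source text.]

wherein $K_i$ is an assigned key, which is a drug; $\{K_i\}$ is the series of assigned keys; $v_i$ is the value of $K_i$; and $S(\cdot, \cdot)$ is the similarity between $d_x$ and $K_i$;

b2. obtaining the interactions between all targets and the drug $d_x$, and removing the non-interacting targets;

wherein $H_i$ is an assigned key, which is a target; $\{H_i\}$ is the series of assigned keys; $u_i$ is the value of $H_i$; and $S(\cdot, \cdot)$ is the similarity between $t_p$ and $H_i$;

$e(d_x, t_p) = [a(d_x, t_p) \,\|\, a(t_p, d_x)]$;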

2.4) inputting the matrix representation obtained in step S3) into a deep neural network model as the input layer to obtain the predicted drug-target interaction;

2.5) comparing the predicted drug-target interaction obtained in step 2.4) with the actual interaction between the drug and the target, and updating the weights of the model through back propagation to obtain a trained drug-target pair interaction prediction model;

3) predicting drug-target interactions by using the trained drug-target interaction prediction model from step 2).

2. The method of predicting drug-target interactions of claim 1, wherein:

in step 2.1), the pairwise similarity among all drug molecules is calculated from the collected structural information of the drug molecules using the SIMCOMP method;

3. The method of predicting drug-target interactions of claim 2, wherein:

the SIMCOMP method comprises the following steps:

4. The method of predicting drug-target interactions of claim 2, wherein:

the Smith-Waterman algorithm is specifically as follows:

determining the parameters:

$W_k$ is the penalty for a gap of length $k$;

selecting the highest-scoring entry in the scoring matrix $H$ as the matching score of sequence $A$ and sequence $B$, denoted $SW(A, B)$;

the similarity of sequence $A$ and sequence $B$ is:

5. The method of predicting drug-target interactions of any one of claims 1-4, wherein:

in step S2), the attention matrix M is mapped as follows:

$M(:, i) = \mathrm{DNN}_i(E)$.

6. The method of predicting drug-target interactions of claim 5, wherein:

in step S3), the attention-enhanced representation matrix is as follows:

[Formula image not available in the source text.]

7. The method of predicting drug-target interactions of claim 6, wherein:

in step 2.4), the deep neural network model includes an input layer, a hidden layer using ReLU as the activation function, and a two-neuron output layer using Sigmoid as the activation function.

8. A computer-readable storage medium having stored thereon a computer program, characterized by: which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

9. An electronic device, characterized in that: including a processor and a computer-readable storage medium;

the computer readable storage medium has stored thereon a computer program which, when executed by the processor, performs the steps of the method of any of claims 1 to 7.