CN112863693A

CN112863693A - Drug target interaction prediction method based on multi-channel graph convolution network

Info

Publication number: CN112863693A
Application number: CN202110154690.6A
Authority: CN
Inventors: 汪国华; 李洋; 乔冠宇
Original assignee: Northeast Forestry University
Current assignee: Northeast Forestry University
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-05-28
Anticipated expiration: 2041-02-04
Also published as: CN112863693B

Abstract

A drug target interaction prediction method based on a multi-channel graph convolution network, belonging to the technical field of drug-target relationship prediction. The method addresses the poor accuracy of drug target interaction prediction caused by the inaccurate, manually extracted features used by existing methods. A drug protein pair network is constructed from an obtained drug feature matrix and protein feature matrix. A multi-channel graph convolution network extracts features from the topological relationship between drug protein pairs in the network and from the proximity relationship between drug protein pair features, yielding a topological relation embedding and a feature proximity relation embedding. These two embeddings are processed to obtain a common embedding. The three embeddings are fused by an attention mechanism, and the fusion result is input into a multilayer perceptron to predict the drug-target relationship. The method can be applied to predicting the relationship between drugs and targets.

Description

Drug target interaction prediction method based on multi-channel graph convolution network

Technical Field

The invention belongs to the technical field of drug-target relationship prediction, and particularly relates to a drug target interaction prediction method based on a multi-channel graph convolution network.

Background

Drug targets are molecules that can bind to drugs and exert specific effects inside cells, and proteins are the main molecular targets of drugs.

Thousands of compounds must be tested experimentally to find safe and effective drugs, so drug discovery is a time-consuming and laborious process with a high risk of failure. Computing the probability that a drug interacts with a target can reduce the costly losses in this process.

To this end, more and more researchers are exploring computational methods for predicting the relationship between drugs and targets. Predicting drug-target relationships not only reduces losses in drug discovery, but also guides drug repositioning, polypharmacology, drug resistance prediction, side effect prediction and the like.

Traditional approaches to predicting new targets for known drugs are based on small molecules, protein targets or phenotypic characteristics. Existing drug-protein relationship prediction methods include machine-learning-based methods, bipartite-model-based methods, structure-based methods, deep-learning-based methods, and the like.

For proteins with unknown structure, structure-based prediction methods often bring little benefit, and this applies to many proteins.

In recent years, methods based on deep learning and machine learning have made full use of drug and target features to predict their relationship. Although a growing body of research shows that deep learning can predict drug-target relationships, existing prediction methods rely on manual feature extraction, which is inevitably affected by human subjectivity. The extracted features are therefore inaccurate, which in turn reduces the accuracy of drug target interaction prediction.

Disclosure of Invention

The invention aims to solve the problem that existing methods rely on manually extracted features that are inaccurate, which results in poor accuracy of drug target interaction prediction, and provides a drug target interaction prediction method based on a multi-channel graph convolution network.

The technical scheme adopted by the invention for solving the technical problems is as follows: a drug target interaction prediction method based on a multichannel graph convolutional network specifically comprises the following steps:

step one, extracting drug information, protein information, disease information and drug side effect information from databases, and constructing a heterogeneous network from the extracted information;

processing the constructed heterogeneous network with the Jaccard similarity method and the random walk with restart method to obtain a drug diffusion state matrix and a protein diffusion state matrix;

step two, performing noise reduction and dimension reduction on the drug diffusion state matrix and the protein diffusion state matrix respectively to obtain a drug feature matrix and a protein feature matrix;

step three, splicing the drug feature matrix and the protein feature matrix obtained in step two to form drug protein pairs, wherein a drug protein pair formed by a drug and a protein that are known to be related is a correct drug protein pair, and all other drug protein pairs are incorrect;

randomly selecting a part of drug protein pairs from the correct drug protein pairs as a training set positive example, and randomly selecting a part of the rest correct drug protein pairs as a testing set positive example;

randomly selecting the drug protein pairs with the same number as the positive examples of the training set from the incorrect drug protein pairs as the negative examples of the training set, and randomly selecting the drug protein pairs with the same number as the positive examples of the test set from the rest incorrect drug protein pairs as the negative examples of the test set;

step four, regarding two drug protein pairs as related if they share a drug or share a protein, and as unrelated otherwise; constructing a first drug protein pair network from the training set positive and negative examples, and constructing a second drug protein pair network from the test set positive and negative examples;

step five, training the multi-channel graph convolution network with the first drug protein pair network, the specific process being as follows:

using graph convolution networks to extract features from the topological relationship between the drug protein pairs in the first drug protein pair network and from the proximity relationship between the drug protein pair features, respectively, to obtain a topological relation embedding Z_t and a feature proximity relation embedding Z_f;

processing Z_t and Z_f to obtain a common embedding Z_c;

processing Z_t, Z_f and Z_c with an attention mechanism to obtain a feature Z;

inputting the feature Z into a multilayer perceptron for binary classification, the multilayer perceptron outputting a prediction of the relationship between the drug and the protein;

testing the multi-channel graph convolution network with the second drug protein pair network, and stopping training once the predictions output by the multilayer perceptron for the drug-protein relationships in the second drug protein pair network meet the accuracy requirement, thereby obtaining the trained multi-channel graph convolution network;

step six, repeating the processes of step one to step three for the drug protein pairs related to the drug protein pair to be predicted, then randomly selecting a part of the drug protein pairs obtained in step three, and constructing a third drug protein pair network from the drug protein pairs related to the pair to be predicted and the randomly selected drug protein pairs;

and processing the constructed third drug protein pair network with the trained multi-channel graph convolution network and the attention mechanism, then inputting the result into the multilayer perceptron to obtain the relationship prediction for the drug protein pair to be predicted.

The beneficial effects of the invention are as follows. The invention provides a drug target interaction prediction method based on a multi-channel graph convolution network. A drug feature matrix and a protein feature matrix are first obtained, and a drug protein pair network is constructed from them. The multi-channel graph convolution network then extracts features from the topological relationship between drug protein pairs in the network and from the proximity relationship between drug protein pair features, giving a topological relation embedding and a feature proximity relation embedding, which are processed to obtain a common embedding. Finally, an attention mechanism fuses the topological relation embedding, the feature proximity relation embedding and the common embedding, and the fusion result is input into a multilayer perceptron to predict the drug-target relationship.

The method of the invention removes the reliance of existing methods on manual feature extraction, so the extracted features are more accurate. Experiments show that the method achieves an area under the ROC curve of 0.9616 and an area under the PR curve of 0.9612, which are clearly higher than those of existing methods, improving the accuracy of drug target interaction prediction.

Drawings

FIG. 1 is an overall flow chart of the drug target interaction prediction method based on a multi-channel graph convolution network according to the present invention;

In the figure, G_t = (A_t, X) is the topological graph; Z_t^(1) and Z_t^(2) are the outputs of the first and second layers of the graph convolution network used to extract features from the topological relationship; Z_f^(1) and Z_f^(2) are the outputs of the first and second layers of the graph convolution network used to extract features from the proximity relationship between features.

Detailed Description

Embodiment 1: This embodiment is described with reference to FIG. 1. The drug target interaction prediction method based on a multi-channel graph convolution network of this embodiment comprises the following steps:

using graph convolution networks to extract features from the topological relationship between the drug protein pairs in the first drug protein pair network and from the proximity relationship between the drug protein pair features, respectively, to obtain a topological relation embedding Z_t and a feature proximity relation embedding Z_f;

processing Z_t and Z_f to obtain a common embedding Z_c;

after the third drug protein pair network is processed by the trained multi-channel graph convolution network and the attention mechanism, the result is input into the multilayer perceptron to obtain the relationship prediction for the drug protein pair to be predicted (that is, a prediction of whether the drug and the protein are related).

The multi-channel graph convolution network of this embodiment comprises three graph convolution networks: one for extracting features from the topological relationship between drug protein pairs, one for extracting features from the proximity relationship between drug protein pair features, and one for processing Z_t and Z_f.

Proximity relationship between drug protein pair features

Information is extracted from the feature space by constructing a k-nearest-neighbor graph. Cosine similarity is used to measure the similarity between features. For the feature matrix X of the drug protein pairs (DPPs), if x_i and x_j denote the feature vectors of DPP_i and DPP_j, their cosine similarity S_ij can be expressed as:

$$S_{ij} = \frac{x_i \cdot x_j}{|x_i|\,|x_j|}$$

The two nodes nearest to the target node (the target DPP) are selected to construct the neighborhood graph, giving the neighborhood graph G_f = (A_f, X).
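The k-nearest-neighbor construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `knn_graph`, the toy matrix `X`, and the choice to symmetrise the adjacency matrix are assumptions made for the example.

```python
import numpy as np

def knn_graph(X, k=2):
    """Build a k-nearest-neighbour adjacency matrix from cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)      # unit-length rows
    S = Xn @ Xn.T                             # S[i, j] = cosine similarity of rows i and j
    np.fill_diagonal(S, -np.inf)              # a node is not its own neighbour
    nbrs = np.argsort(-S, axis=1)[:, :k]      # indices of the k most similar rows
    A = np.zeros(S.shape, dtype=int)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nbrs.ravel()] = 1
    return np.maximum(A, A.T)                 # symmetrise (assumption, not stated in the text)

# Toy feature matrix: rows 0 and 1 point the same way, as do rows 2 and 3.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_f = knn_graph(X, k=1)
```

With `k=1`, each toy node is linked only to its single most similar node, so the graph splits into the two obvious clusters.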

Embodiment 2: This embodiment differs from Embodiment 1 in that, in step one, drug information, protein information, disease information and drug side effect information are extracted from databases and a heterogeneous network is constructed from the extracted information; the specific process is as follows:

extracting drug information from the DrugBank database, wherein the drug information comprises drug-drug interaction information and known drug-target interaction information;

extracting protein information from an HPRD database, wherein the protein information is protein-protein interaction information;

extracting disease information from the Comparative Toxicogenomics Database, wherein the disease information comprises relationship information between diseases and drugs and relationship information between diseases and proteins;

extracting drug side effect information from a SIDER database, wherein the drug side effect information is relationship information between drugs and side effects;

obtaining M drugs, N proteins, O side effects and W diseases from the extracted information, and constructing a heterogeneous network according to the information extracted from each database;

the heterogeneous network comprises a drug and drug relationship network, a drug and disease relationship network, a drug and drug side effect relationship network, a drug and protein relationship network, a protein and protein relationship network, a protein and disease relationship network, a drug chemical property similarity network, and a protein gene sequence similarity network.

The drug and protein relationship network is used in step three to determine whether the formed drug protein pairs are correct.

Embodiment 3: This embodiment differs from Embodiment 2 in that, in step one, the constructed heterogeneous network is processed with the Jaccard similarity method and the random walk with restart method to obtain a drug diffusion state matrix and a protein diffusion state matrix; the specific process is as follows:

for the drug and drug side effect relationship network, the network is represented in the form of a matrix C:

$$C = (c_{i'j'})_{M \times O}$$

wherein c_i′j′ is 0 or 1; c_i′j′ = 1 indicates that the i′-th drug is related to the j′-th side effect, and c_i′j′ = 0 indicates that it is not; i′ = 1, 2, …, M and j′ = 1, 2, …, O;

calculating the similarity between the i-th row and the j-th row of the matrix C with the Jaccard similarity method, where i = 1, 2, …, M and j = 1, 2, …, M, and taking the calculated similarity as the element in the i-th row and j-th column of a similarity matrix H; traversing every pair of rows of the matrix C yields the similarity matrix H;

processing the similarity matrix H with the random walk with restart method to obtain the diffusion state matrix corresponding to the drug and drug side effect relationship network;

similarly, a diffusion state matrix corresponding to the drug and drug relationship network, a diffusion state matrix corresponding to the drug and disease relationship network, a diffusion state matrix corresponding to the protein and protein relationship network, a diffusion state matrix corresponding to the protein and disease relationship network, a diffusion state matrix corresponding to the drug chemical property similarity network and a diffusion state matrix corresponding to the protein gene sequence similarity network are obtained;

correspondingly, for the drug chemical property similarity network, if two drugs have similar chemical properties, the value at the corresponding position of the matrix C is 1, and otherwise it is 0;

splicing a diffusion state matrix corresponding to a drug and drug side effect relationship network, a diffusion state matrix corresponding to a drug and drug relationship network, a diffusion state matrix corresponding to a drug and disease relationship network and a diffusion state matrix corresponding to a drug chemical property similarity network into a feature matrix D, and taking the feature matrix D as a drug diffusion state matrix;

splicing a diffusion state matrix corresponding to the protein and protein relation network, a diffusion state matrix corresponding to the protein and disease relation network and a diffusion state matrix corresponding to the protein gene sequence similarity network into a feature matrix P, and taking the feature matrix P as the protein diffusion state matrix.
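The random walk with restart and the splicing of diffusion state matrices can be sketched as follows. This is a minimal sketch under stated assumptions: the restart probability of 0.5, the row normalisation of H into a transition matrix, the uniform fallback for all-zero rows, and the names `rwr`, `H_side` and `H_chem` are illustrative choices that the text does not specify.

```python
import numpy as np

def rwr(H, restart=0.5, tol=1e-8, max_iter=1000):
    """Diffusion state matrix: row i is the stationary visit distribution
    of a random walk that restarts at node i with probability `restart`."""
    H = np.asarray(H, dtype=float)
    n = H.shape[0]
    row_sums = H.sum(axis=1, keepdims=True)
    # Row-normalise H into a transition matrix; all-zero rows become uniform.
    P = np.divide(H, row_sums, out=np.full_like(H, 1.0 / n), where=row_sums > 0)
    E = np.eye(n)
    S = E.copy()
    for _ in range(max_iter):
        S_next = (1 - restart) * S @ P + restart * E
        if np.abs(S_next - S).max() < tol:
            return S_next
        S = S_next
    return S

# Two toy 3x3 similarity matrices standing in for two drug-related networks.
H_side = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]])
H_chem = np.array([[1.0, 0.0, 0.2], [0.0, 1.0, 0.0], [0.2, 0.0, 1.0]])

# Splice the per-network diffusion states column-wise into one feature matrix.
D = np.hstack([rwr(H_side), rwr(H_chem)])   # shape (3, 6)
```

Concatenating four M×M diffusion matrices in this way gives an M×4M drug matrix, and three N×N matrices give an N×3N protein matrix.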

Embodiment 4: This embodiment differs from Embodiment 3 in that the similarity between the i-th row and the j-th row of the matrix C is calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

wherein J(A, B) is the similarity between the i-th row and the j-th row of the matrix C, A is the i-th row of the matrix C, and B is the j-th row of the matrix C.
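For a binary matrix, the row-wise Jaccard similarity can be computed for all row pairs at once. The sketch below is illustrative: the function name `jaccard_matrix`, the toy matrix, and the convention that two all-zero rows have similarity 0 are assumptions.

```python
import numpy as np

def jaccard_matrix(C):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    C = np.asarray(C, dtype=float)
    inter = C @ C.T                                   # |A ∩ B| for every row pair
    sizes = C.sum(axis=1)                             # |A| for every row
    union = sizes[:, None] + sizes[None, :] - inter   # |A ∪ B|
    H = np.zeros_like(inter)
    np.divide(inter, union, out=H, where=union > 0)   # empty union -> 0 (assumption)
    return H

# Toy drug x side-effect matrix: 3 drugs, 4 side effects.
C = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1]])
H = jaccard_matrix(C)
```

Drugs 0 and 1 share one side effect out of three distinct ones, so `H[0, 1]` is 1/3, while drug 2 shares nothing with either and has similarity 0 to both.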

Embodiment 5: This embodiment differs from Embodiment 4 in that, in step two, a denoising autoencoder (DAE) is used to denoise and reduce the dimension of the drug diffusion state matrix and the protein diffusion state matrix.
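A denoising autoencoder corrupts its input, learns to reconstruct the clean input, and uses the hidden layer as the reduced representation. The sketch below is a minimal single-hidden-layer version in NumPy. The architecture, masking noise, tanh activation, learning rate and all names (`train_dae`, `encode`) are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def train_dae(X, hidden_dim, noise=0.2, lr=0.5, epochs=300, seed=0):
    """Train a one-hidden-layer denoising autoencoder with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        mask = rng.random(X.shape) > noise       # masking noise: zero out random entries
        X_noisy = X * mask
        hid = np.tanh(X_noisy @ W1 + b1)         # encoder
        X_hat = hid @ W2 + b2                    # linear decoder
        err = X_hat - X                          # compare against the clean input
        losses.append(float((err ** 2).mean()))
        g_out = 2 * err / err.size               # gradient of the mean squared error
        gW2 = hid.T @ g_out; gb2 = g_out.sum(axis=0)
        g_hid = (g_out @ W2.T) * (1 - hid ** 2)  # back-propagate through tanh
        gW1 = X_noisy.T @ g_hid; gb1 = g_hid.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    encode = lambda M: np.tanh(M @ W1 + b1)      # low-dimensional features
    return encode, losses

# Toy stand-in for a diffusion state matrix: 20 rows, 8 columns, reduced to 3.
X = np.random.default_rng(1).random((20, 8))
encode, losses = train_dae(X, hidden_dim=3)
features = encode(X)                             # shape (20, 3)
```

In the setting described here, `hidden_dim` would play the role of the reduced dimension of the drug or protein feature matrix.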

Embodiment 6: This embodiment differs from Embodiment 5 in that, after the drug and protein relationship network is expressed in matrix form, an element equal to 1 in the matrix indicates a known relationship between the corresponding drug and protein.

Embodiment 7: This embodiment differs from Embodiment 6 in that Z_t and Z_f are processed to obtain the common embedding Z_c as follows:

a graph convolution network with shared weights processes Z_t and Z_f to obtain embeddings Z_ct and Z_cf, and Z_cf and Z_ct are summed and averaged to obtain the common embedding Z_c.

Z_cf is the result of processing Z_f, and Z_ct is the result of processing Z_t.
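The weight-sharing idea can be sketched as one set of layer weights applied to both inputs, followed by averaging. This is a sketch under assumptions: that Z_t is propagated over the topological graph A_t and Z_f over the feature proximity graph A_f, that the activation is ReLU, and that the helper names (`normalize_adj`, `shared_gcn`, `common_embedding`) and toy data are illustrative only.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalise A + I, as in a standard graph convolution layer."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def shared_gcn(A, Z_in, shared_weights):
    """Propagate Z_in over graph A through layers that reuse the given weights."""
    A_hat = normalize_adj(A)
    Z = Z_in
    for W in shared_weights:
        Z = np.maximum(A_hat @ Z @ W, 0.0)    # ReLU activation (assumed)
    return Z

def common_embedding(A_t, A_f, Z_t, Z_f, shared_weights):
    Z_ct = shared_gcn(A_t, Z_t, shared_weights)   # same weights for both channels
    Z_cf = shared_gcn(A_f, Z_f, shared_weights)
    return (Z_ct + Z_cf) / 2.0, Z_ct, Z_cf        # sum and average

rng = np.random.default_rng(0)
n, d = 4, 3
A_t = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]], dtype=float)
A_f = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
Z_t = rng.normal(size=(n, d))
Z_f = rng.normal(size=(n, d))
shared_weights = [rng.normal(size=(d, d)), rng.normal(size=(d, 2))]
Z_c, Z_ct, Z_cf = common_embedding(A_t, A_f, Z_t, Z_f, shared_weights)
```

Because both channels pass through identical weights, any difference between `Z_ct` and `Z_cf` comes only from the graphs and inputs, which is what lets the average capture what the two views share.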

Embodiment 8: This embodiment differs from Embodiment 7 in that Z_t, Z_f and Z_c are processed with the attention mechanism to obtain the feature Z as follows:

Z＝α₁*Z_c+α₂*Z_f+α₃*Z_t

wherein α₁, α₂ and α₃ are the weights of the respective embeddings.
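The text gives the weighted sum but not how the weights are computed. One common choice, used here purely as an assumption, scores each embedding with a small learned projection and normalises the three scores with a softmax, giving per-pair weights. The parameters `W`, `b`, `q` and the function name `attention_fuse` are hypothetical.

```python
import numpy as np

def attention_fuse(Z_c, Z_f, Z_t, W, b, q):
    """Fuse three embeddings as Z = a1*Z_c + a2*Z_f + a3*Z_t with softmax weights."""
    stack = np.stack([Z_c, Z_f, Z_t], axis=1)          # (n, 3, d)
    scores = np.tanh(stack @ W + b) @ q                # (n, 3): one score per embedding
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # weights sum to 1 for each pair
    Z = (alpha[:, :, None] * stack).sum(axis=1)        # weighted sum, shape (n, d)
    return Z, alpha

rng = np.random.default_rng(0)
n, d, h = 5, 4, 3
Z_c, Z_f, Z_t = (rng.normal(size=(n, d)) for _ in range(3))
W, b, q = rng.normal(size=(d, h)), np.zeros(h), rng.normal(size=h)
Z, alpha = attention_fuse(Z_c, Z_f, Z_t, W, b, q)
```

The columns of `alpha` correspond to α₁, α₂ and α₃ in the formula above; an embedding with a higher score contributes more to Z.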

Examples

The method of the present invention is further discussed below in connection with the examples.

First, data preparation, feature embedding of the acquired drug and target:

Drug information is extracted from the DrugBank database, including drug-drug interactions and known drug-target interactions. Protein-protein interactions come from the HPRD database. Disease information, including disease-drug and disease-protein relationships, is obtained from the Comparative Toxicogenomics Database. Information on drug side effects is obtained from the SIDER database. From these data, 708 drugs, 1512 proteins, 4912 side effects and 5603 diseases are obtained, together with a heterogeneous network of eight relationships:

the relationship between drugs, the relationship between drugs and diseases, the relationship between drugs and drug side effects, the relationship between drugs and proteins, the relationship between proteins, the relationship between proteins and diseases, the similarity of the chemical properties of drugs, and the similarity of protein gene sequences;

The Jaccard similarity coefficient is first used to compare the similarity and difference between finite sample sets. For example, to compute the similarity matrix H of the drug and side effect network, let A and B denote the i-th and j-th rows of the network's matrix and J(A, B) the similarity between them. The similarity matrix H is symmetric; for a drug-related network, H = (J_ij)_708×708. The Jaccard similarity is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Then, the random walk with restart algorithm is applied to each similarity matrix to obtain a diffusion state matrix, and the diffusion state matrices relating to drugs are spliced into a feature matrix D, with D = (d_ij)_708×2832; the proteins are processed in the same way to obtain a protein feature matrix P, with P = (p_ij)_1512×4536. This gives high-dimensional, high-noise drug and protein diffusion state matrices;

Finally, a denoising autoencoder (DAE) is used to denoise the diffusion state matrices and reduce their dimensionality, so that the drug feature matrix is 100-dimensional and the protein feature matrix is 400-dimensional, i.e. D_DAE = (d_ij)_708×100 and P_DAE = (p_ij)_1512×400.

Secondly, constructing a drug protein pair network:

The drug and protein feature matrices obtained in the first part are spliced to form drug protein pairs. A pair whose drug and protein are known to be related is regarded as a correct drug protein pair, and every other pair is regarded as incorrect. The feature of a drug protein pair is the fusion of the corresponding drug feature and protein feature. This yields 1332 correct drug protein pairs, so 1332 incorrect drug protein pairs are randomly selected as negative examples.

If two drug protein pairs share a drug or share a protein, they are considered related to each other, and the drug protein pair network is constructed from these relationships.
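The construction of drug protein pairs and of the pair network can be sketched as follows. This is a hedged illustration: treating "fusion" as concatenation of the drug and protein feature vectors, the function names `build_pairs` and `pair_adjacency`, and the toy matrices are assumptions rather than details given in the text.

```python
import numpy as np

def build_pairs(D_feat, P_feat, known, seed=0):
    """known: binary (M, N) matrix, 1 = known drug-protein relationship."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(known == 1)                       # correct pairs
    neg_all = np.argwhere(known == 0)                   # incorrect pairs
    neg = neg_all[rng.choice(len(neg_all), size=len(pos), replace=False)]
    pairs = np.vstack([pos, neg])                       # balanced sample
    labels = np.concatenate([np.ones(len(pos), dtype=int),
                             np.zeros(len(neg), dtype=int)])
    # Pair feature = drug feature concatenated with protein feature (assumption).
    X = np.hstack([D_feat[pairs[:, 0]], P_feat[pairs[:, 1]]])
    return pairs, X, labels

def pair_adjacency(pairs):
    """Two pairs are linked if they share a drug or share a protein."""
    same_drug = pairs[:, 0][:, None] == pairs[:, 0][None, :]
    same_prot = pairs[:, 1][:, None] == pairs[:, 1][None, :]
    A = (same_drug | same_prot).astype(int)
    np.fill_diagonal(A, 0)                              # no self-links
    return A

known = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 1, 0, 1]])
D_feat = np.arange(6, dtype=float).reshape(3, 2)        # 3 drugs, 2 features each
P_feat = np.arange(12, dtype=float).reshape(4, 3)       # 4 proteins, 3 features each
pairs, X, labels = build_pairs(D_feat, P_feat, known)
A_t = pair_adjacency(pairs)
```

Here `A_t` plays the role of the topological adjacency matrix and `X` the pair feature matrix, so together they correspond to a graph of the form G_t = (A_t, X).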

Third, drug target interaction prediction using a multichannel graph convolutional network:

Considering both the topological relationship between drug protein pairs and the proximity relationship between drug protein pair features, graph convolution is used to extract features from the topological relation network and the feature proximity relation network respectively, giving the topological relation embedding Z_t and the feature proximity relation embedding Z_f.

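The graph convolution network of each channel has two hidden layers, and the l-th layer can be expressed as:

$$Z^{(l)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} Z^{(l-1)} W^{(l)}\right)$$

where $\tilde{A} = A + I$, A is the adjacency matrix of the graph, I is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $W^{(l)}$ is the weight of the l-th layer, and σ is the activation function.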

A certain commonality between the topological network and the feature proximity network is also considered. After the topological network and the feature proximity network are spliced, a shared-parameter strategy is used in the convolution module in order to extract what the two networks have in common. Graph convolution networks with shared weights produce the embeddings Z_cf and Z_ct, which are summed and averaged to obtain the common embedding Z_c; Z_cf and Z_ct can be expressed respectively as:

$$Z_{cf}^{(l)} = \sigma\left(\tilde{D}_f^{-\frac{1}{2}} \tilde{A}_f \tilde{D}_f^{-\frac{1}{2}} Z_{cf}^{(l-1)} W_c^{(l)}\right)$$

$$Z_{ct}^{(l)} = \sigma\left(\tilde{D}_t^{-\frac{1}{2}} \tilde{A}_t \tilde{D}_t^{-\frac{1}{2}} Z_{ct}^{(l-1)} W_c^{(l)}\right)$$

where $\tilde{A} = A + I$ for the corresponding graph, A is the adjacency matrix of the graph, I is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $W_c^{(l)}$ is the shared weight of the l-th layer, and σ is the activation function.

Then, an attention mechanism processes the three embeddings to obtain the feature Z, so that the more important embeddings receive larger weights; the formula is as follows:

Z＝α₁*Z_c+α₂*Z_f+α₃*Z_t

wherein α₁, α₂ and α₃ are the weights of the respective embeddings;

Finally, the feature Z is input into a multilayer perceptron for binary classification to predict whether the drug and the protein are related.
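The final classification step can be sketched as a forward pass through a small multilayer perceptron. The layer sizes, ReLU hidden activation, sigmoid output, 0.5 decision threshold and random untrained weights are all assumptions for illustration; the text does not specify the perceptron's architecture.

```python
import numpy as np

def mlp_predict(Z, W1, b1, W2, b2):
    """Return, for each drug protein pair, the predicted probability of a relationship."""
    hidden = np.maximum(Z @ W1 + b1, 0.0)     # hidden layer with ReLU (assumed)
    logits = hidden @ W2 + b2                 # one logit per pair
    prob = 1.0 / (1.0 + np.exp(-logits))      # sigmoid for binary classification
    return prob.ravel()

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))                   # fused features of 5 pairs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
prob = mlp_predict(Z, W1, b1, W2, b2)
labels = (prob >= 0.5).astype(int)            # 1 = related, 0 = not related
```

In practice the weights would be learned jointly with the graph convolution and attention parameters rather than drawn at random as in this toy example.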

Experimental performance was evaluated using the AUROC (area under the ROC curve) and AUPR (area under the PR curve) scores; the performance data are shown in Table 1:
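Both scores can be computed directly from labels and predicted scores. The sketch below uses the rank-based definition of AUROC and the average-precision estimate of the area under the PR curve, and assumes there are no tied scores in the AUPR calculation. The function names and toy data are illustrative, not the evaluation code behind the reported results.

```python
import numpy as np

def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()      # ties count as one half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def aupr(y_true, scores):
    """Average precision: mean of the precision values at each true positive."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s, kind="stable")            # highest score first
    y_sorted = y[order]
    precision = np.cumsum(y_sorted) / np.arange(1, len(y) + 1)
    return float((precision * y_sorted).sum() / y_sorted.sum())

y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
roc_value = auroc(y_true, scores)   # 3 of 4 positive/negative pairs ranked correctly
pr_value = aupr(y_true, scores)     # (1/1 + 2/3) / 2
```

On the toy data, three of the four positive/negative pairs are ordered correctly, giving an AUROC of 0.75, and the average precision is 5/6.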

TABLE 1

Experiments show that the performance of the method is obviously superior to that of the existing NRLMF, DTINet and DTI-CNN methods.

The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims

1. The method for predicting the drug target interaction based on the multichannel graph convolution network is characterized by comprising the following steps:

using graph convolution networks to extract features from the topological relationship between the drug protein pairs in the first drug protein pair network and from the proximity relationship between the drug protein pair features, respectively, to obtain a topological relation embedding Z_t and a feature proximity relation embedding Z_f;

processing Z_t and Z_f to obtain a common embedding Z_c;

2. The method for predicting drug target interaction based on the multi-channel graph convolution network as claimed in claim 1, wherein in the first step, drug information, protein information, disease information and drug side effect information are extracted from a database, and a heterogeneous network is constructed according to the extracted information; the specific process comprises the following steps:

3. The method for predicting drug target interaction based on the multi-channel graph convolution network according to claim 2, wherein in the first step, the constructed heterogeneous network is processed by a Jaccard similarity method and a random walk with restart method to obtain a drug diffusion state matrix and a protein diffusion state matrix; the specific process comprises the following steps:

4. The method for predicting drug target interaction based on the multi-channel graph convolutional network of claim 3, wherein the similarity between the ith row and the jth row of the matrix C is calculated by:

5. The method for predicting drug target interaction based on the multi-channel graph convolution network of claim 4, wherein in the second step, denoising and dimension reduction are performed on the drug diffusion state matrix and the protein diffusion state matrix by using a denoising autoencoder method.

6. The method for predicting drug target interaction based on multi-channel graph convolution network of claim 5, wherein after the drug and protein relationship network is expressed in a matrix form, element 1 in the matrix represents the known existing relationship between the drug and the protein.

7. The drug target interaction prediction method based on a multi-channel graph convolution network of claim 6, wherein Z_t and Z_f are processed to obtain the common embedding Z_c as follows:

a graph convolution network with shared weights processes Z_t and Z_f to obtain embeddings Z_ct and Z_cf, and Z_cf and Z_ct are summed and averaged to obtain the common embedding Z_c.

8. The drug target interaction prediction method based on a multi-channel graph convolution network of claim 7, wherein Z_t, Z_f and Z_c are processed with the attention mechanism to obtain the feature Z as follows:

Z＝α₁*Z_c+α₂*Z_f+α₃*Z_t