CN111564186A

CN111564186A - Knowledge graph-based graph convolutional method and system for predicting drug pair interactions

Info

Publication number: CN111564186A
Application number: CN202010216234.5A
Authority: CN
Inventors: 全哲; 林轩; 王志杰; 马腾飞; 曾湘详
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2020-03-25
Filing date: 2020-03-25
Publication date: 2020-08-21

Abstract

The invention provides a knowledge graph-based graph convolutional method for predicting drug pair interactions, which comprises the following steps: extracting drug pair data samples to generate a data set containing a training set, a validation set and a test set; constructing a knowledge graph corresponding to the data set; establishing a GCN drug pair interaction prediction model, and learning the feature information of the drugs contained in a drug pair and of their neighborhoods; inputting the drug pair data samples of the training set into the GCN drug pair interaction prediction model, and training the model; optimizing a loss function according to the training result, and then feeding the result back into the GCN drug pair interaction prediction model for further training; completing the training of the GCN drug pair interaction prediction model through iterative calculation; inputting the drug pair data samples in the test set into the GCN drug pair interaction prediction model to obtain a test result; and analyzing the test result to obtain a prediction result. The knowledge graph-based graph convolutional method and system for predicting drug pair interactions have high accuracy and short training time.

Description

Knowledge graph-based graph convolutional method and system for predicting drug pair interactions

[ technical field ]

The invention relates to the technical field of drug pair interaction prediction, and in particular to a knowledge graph-based graph convolutional method and system for predicting drug pair interactions.

[ background of the invention ]

Drug development is a high-investment, high-risk field: the average time from development to marketing of a drug is 10-15 years, the average investment is about 2.6 billion US dollars, and on average only one of every 5000-10000 molecules entering the development pipeline is successfully developed. Computational methods can reduce this cost to a great extent. Data-driven machine learning methods, which train a model on large amounts of open-source experimental data and then predict downstream tasks, have therefore become a hotspot of computer-aided drug discovery.

In the related art, machine learning algorithms are widely used for predicting drug-drug interactions. Most existing artificial intelligence models need to integrate drug information from multi-source data sets, such as the structure and properties of drugs. Some methods focus on representation learning of drug features: they obtain the structural features of drugs from SMILES molecular representations and predict interactions between drugs from structural similarity. Other methods integrate drug side effects and other related relations, and predict drug-drug interactions by constructing various relation networks and combining them with effective graph embedding methods. Because the types of labeled data available in the data sets are scarce, these methods have difficulty obtaining reasonably distributed data samples for batch training. Meanwhile, the current mainstream methods consider only a single relation between drugs, namely drug-drug interaction, and ignore the potential associations between drugs and other entities (such as targets and genes).

A knowledge graph contains rich entities and association information, which makes it possible to explore the potential associations between drugs and other entities. Meanwhile, graph neural networks have proven effective in many fields: by iteratively extracting neighborhood information from the graph, a well-designed aggregation method can effectively represent node features and neighborhood information, and can better capture the potential relations between entities in the knowledge graph.

Therefore, it is necessary to provide a knowledge graph-based graph convolutional method and system for predicting drug pair interactions to solve the above problems.

[ summary of the invention ]

To solve the above technical problems, the invention provides a knowledge graph-based graph convolutional method and system for predicting drug pair interactions, which have high accuracy and short training time.

The invention provides a knowledge graph-based graph convolutional method for predicting drug pair interactions, which comprises the following steps:

S1: extracting drug pair data samples to generate a data set, wherein the data set comprises a training set, a validation set and a test set;

S2: constructing a knowledge graph corresponding to the data set;

S3: establishing a GCN drug pair interaction prediction model, and learning the feature information of the drugs contained in the drug pairs and of their neighborhoods by using the knowledge graph;

S4: inputting the drug pair data samples in the training set into the GCN drug pair interaction prediction model, and training the GCN drug pair interaction prediction model;

S5: optimizing a loss function on the training result of the GCN drug pair interaction prediction model, and then feeding the result back into the GCN drug pair interaction prediction model for further training;

S6: completing the training of the GCN drug pair interaction prediction model through multiple iterative calculations;

S7: inputting the drug pair data samples in the test set into the GCN drug pair interaction prediction model to obtain a test result;

S8: analyzing the test result to obtain a prediction result of the drug pair interaction.

Preferably, the proportion of the training set in the data set is 80%, the proportion of the validation set is 10%, and the proportion of the test set is 10%.

Preferably, the step S2 includes the steps of:

S21: downloading a compressed file of each original data source in the data set from a website;

S22: converting the compressed file into RDF graph data using the Bio2RDF tool;

S23: uploading the RDF graph data to an RDF triple store, and extracting selected triples by executing a benchmark operation to obtain a knowledge graph of the corresponding data in the data set.

Preferably, the GCN drug pair interaction prediction model comprises an embedding layer and two GCN layers arranged in sequence.

Preferably, the knowledge graph is input to the GCN layers for neighborhood information aggregation operations to train the GCN drug pair interaction prediction model.

Preferably, the step S4 specifically includes: sequentially inputting the drug pair data samples in the training set into the embedding layer and the GCN layers for processing.

Preferably, the step S4 includes the following steps:

S41: inputting each drug entity in a drug pair data sample of the training set to the embedding layer to obtain a randomly initialized vector;

S42: sending the initialization vector output by the embedding layer into the GCN layers;

S43: obtaining a scoring result for the drug pair through the neighborhood sampling and aggregation operations of the GCN layers.
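Steps S41 to S43 can be illustrated with a minimal sketch. It assumes a lookup table that draws one random vector per entity, and it assumes that a drug pair is scored by the sigmoid of the inner product of the two drug vectors; the source does not specify the scoring function, and `EmbeddingLayer` and `score_pair` are hypothetical names. The GCN layers that would sit between the two stages are omitted here.

```python
import math
import random


class EmbeddingLayer:
    """Maps entity names to randomly initialized vectors (hypothetical helper)."""

    def __init__(self, dim=32, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}

    def lookup(self, entity):
        # Create the vector on first use so each entity keeps exactly one vector.
        if entity not in self._table:
            self._table[entity] = [
                self._rng.uniform(-0.1, 0.1) for _ in range(self.dim)
            ]
        return self._table[entity]


def score_pair(vec_a, vec_b):
    """Assumed scoring function: sigmoid of the inner product of two drug vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    return 1.0 / (1.0 + math.exp(-dot))


embedding = EmbeddingLayer(dim=32)
v1 = embedding.lookup("drug_A")  # placeholder entity names
v2 = embedding.lookup("drug_B")
pair_score = score_pair(v1, v2)
```

The vectors are trainable parameters in a real model; here they stay at their random initial values, so the score carries no meaning beyond showing the data flow.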

Preferably, the step S5 specifically includes: sending the training result of the GCN layers to a classifier, and, after the classifier optimizes a loss function, feeding the result back to the embedding layer to continue training the GCN drug pair interaction prediction model.

Preferably, in step S6, training of the GCN drug pair interaction prediction model is complete when 50 iterative calculations have been performed or when the result no longer changes over 5 consecutive iterative calculations.

The invention also provides a knowledge graph-based graph convolutional system for predicting drug pair interactions, which comprises:

a drug pair extraction module, used for extracting drug pair data samples and generating a data set, wherein the data set comprises a training set, a validation set and a test set;

a knowledge graph construction module, used for constructing a knowledge graph corresponding to the data set, wherein the knowledge graph comprises entities, relations and a set of triples;

a GCN drug pair interaction prediction model, used for learning the feature information of drugs and their neighborhoods and arranged at the output ends of the knowledge graph construction module and the drug pair extraction module, wherein the GCN drug pair interaction prediction model comprises an embedding layer and two GCN layers arranged in sequence, and the embedding layer is arranged at the output ends of the knowledge graph construction module and the drug pair extraction module; and

a classifier, used for generating output labels for task classification and arranged at the output end of the GCN layers.

Compared with the related art, the invention provides a knowledge graph-based graph convolutional method and system for predicting drug pair interactions, in which the drug pairs in the training set produced by the drug pair extraction module are input into the GCN drug pair interaction prediction model, and the knowledge graph generated by the knowledge graph construction module is input into the same model so that it learns the drugs and their neighborhood information. In addition, in a comparison with several well-known machine learning models for drug pair interaction prediction, the method and system outperform the other models on the DrugBank data set and meet the requirements of high stability and high accuracy. The method and system therefore have the advantages of high accuracy and short training time.

[ description of the drawings ]

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort, wherein:

FIG. 1 is a flow chart of the knowledge graph-based graph convolutional method for predicting drug pair interactions provided by the present invention;

FIG. 2 is a block diagram of the knowledge graph-based graph convolutional system for predicting drug pair interactions provided by the present invention;

FIG. 3 is a schematic diagram of the operation of the GCN layer shown in FIG. 2.

[ detailed description ]

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention provides a knowledge graph-based graph convolutional method for predicting drug pair interactions, comprising the following steps:

S1: extracting drug pair data samples to generate a data set, wherein the data set comprises a training set, a validation set and a test set.

The training set accounts for 80% of the data set, the validation set for 10% and the test set for 10%. Drug pairs in the data set that have an interaction are labeled 1 and used as positive samples, and drug pairs that do not have an interaction are labeled 0 and used as negative samples. 5-fold cross-validation is performed on the data set to obtain the training set, the validation set and the test set.
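The labeling and the 80/10/10 split can be sketched as follows. This is a minimal illustration under stated assumptions: the drug identifiers are toy placeholders, `split_dataset` is a hypothetical helper, and only a single shuffled split is shown rather than the full 5-fold cross-validation procedure.

```python
import random


def split_dataset(samples, train_pct=80, valid_pct=10, seed=42):
    """Shuffle labeled drug pairs and split them into train/validation/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding on the split points.
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Toy samples: (drug_a, drug_b, label); label 1 = interacting, 0 = not.
positives = [(f"D{i}", f"D{i + 1}", 1) for i in range(50)]
negatives = [(f"D{i}", f"D{i + 60}", 0) for i in range(50)]
train_set, valid_set, test_set = split_dataset(positives + negatives)
```

Whatever remains after the training and validation slices goes to the test set, so the three parts always cover the data set exactly once.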

S2: constructing a knowledge graph corresponding to the data set.

The invention uses a knowledge graph construction module to generate a knowledge graph corresponding to each data set. The step S2 specifically includes:

S21: downloading a compressed file of each original data source in the data set from a website;

S22: converting the compressed file into RDF graph data using the Bio2RDF tool;

S23: uploading the RDF graph data to an RDF triple store, and extracting selected triples by executing a benchmark operation to obtain a knowledge graph of the data corresponding to the data set.
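Once triples have been extracted, the knowledge graph can be held as a set of (head, relation, tail) triples plus a neighbor index. The sketch below assumes a simple whitespace-separated text form and invented entity and relation names; it does not reproduce Bio2RDF or any triple store interface, and `parse_triples` and `build_neighbors` are hypothetical helpers.

```python
from collections import defaultdict


def parse_triples(lines):
    """Parse whitespace-separated 'head relation tail' lines into triples."""
    triples = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            triples.append(tuple(parts))
    return triples


def build_neighbors(triples):
    """Index each entity's (relation, neighbor) pairs, treating edges as undirected."""
    neighbors = defaultdict(list)
    for head, relation, tail in triples:
        neighbors[head].append((relation, tail))
        neighbors[tail].append((relation, head))
    return neighbors


# Invented triples; real ones would come from the RDF triple store.
raw = [
    "drug_A targets protein_X",
    "drug_B targets protein_X",
    "drug_A has_enzyme enzyme_Y",
    "",
]
triples = parse_triples(raw)
neighbors = build_neighbors(triples)
entities = sorted(neighbors)
relations = sorted({relation for _, relation, _ in triples})
```

In this toy graph the two drugs never appear in the same triple, yet they share the neighbor `protein_X`, which is the kind of indirect association the knowledge graph is meant to expose.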

S3: establishing a GCN (Graph Convolutional Network) drug pair interaction prediction model, and learning the feature information of the drugs contained in the drug pairs and of their neighborhoods by using the knowledge graph.

In the invention, a drug pair extraction module is used to generate the data samples required by the model. The GCN drug pair interaction prediction model is used for learning the feature information of drugs and their neighborhoods, is arranged at the output ends of the knowledge graph construction module and the drug pair extraction module, and comprises an embedding layer and two GCN layers. The embedding layer is arranged at the output end of the drug pair extraction module, and the GCN layers are arranged at the output end of the embedding layer.

In the invention, a classifier is adopted to generate an output label of task classification, which is arranged at the output end of a GCN drug pair interaction prediction model.

S4: inputting the drug pair data samples in the training set into the GCN drug pair interaction prediction model, and training the GCN drug pair interaction prediction model.

The drug pair data samples are sequentially input to the embedding layer and the GCN layers for processing so as to train the GCN drug pair interaction prediction model.

Wherein the training process is as follows:

Specifically, the GCN layer performs neighborhood sampling and aggregation operations: each GCN layer samples a fixed-size neighborhood of a drug node and then performs an aggregation operation to obtain the final representation vector of the drug and its neighborhood. Two aggregation operations are used, sum aggregation and concat aggregation,

wherein e denotes the feature vector of the drug entity, the neighborhood vector denotes the feature vectors of the neighbor entities of the i-th drug within the neighborhood range S(e), σ denotes an activation function, W denotes a learnable weight matrix, and b is a bias.
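The following sketch illustrates fixed-size neighborhood sampling followed by the two aggregators. It assumes the forms commonly used for these aggregators, σ(W·(e + e_S(e)) + b) for sum and σ(W·concat(e, e_S(e)) + b) for concat; the patent's own formulas are not reproduced in this text. Averaging the sampled neighbor vectors and using tanh as σ are further assumptions, and all function names are hypothetical.

```python
import math
import random


def sample_neighbors(neighbor_ids, size, rng):
    """Draw a fixed-size neighborhood, sampling with replacement if too few."""
    if len(neighbor_ids) >= size:
        return rng.sample(neighbor_ids, size)
    return [rng.choice(neighbor_ids) for _ in range(size)]


def mean_vector(vectors):
    """Average neighbor vectors into one neighborhood vector (assumed pooling)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def dense(weights, vec, bias):
    """Compute sigma(W . vec + b), with tanh as the assumed activation."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]


def sum_aggregate(e, e_nbr, weights, bias):
    """Sum aggregator: add drug and neighborhood vectors, then transform."""
    return dense(weights, [a + b for a, b in zip(e, e_nbr)], bias)


def concat_aggregate(e, e_nbr, weights, bias):
    """Concat aggregator: concatenate the two vectors, then transform."""
    return dense(weights, e + e_nbr, bias)


rng = random.Random(0)
dim = 4
names = ["drug", "n1", "n2", "n3"]
vectors = {name: [rng.uniform(-1, 1) for _ in range(dim)] for name in names}
sampled = sample_neighbors(["n1", "n2", "n3"], size=2, rng=rng)
e_nbr = mean_vector([vectors[name] for name in sampled])
w_sum = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
w_cat = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim
out_sum = sum_aggregate(vectors["drug"], e_nbr, w_sum, bias)
out_cat = concat_aggregate(vectors["drug"], e_nbr, w_cat, bias)
```

The concat aggregator needs a weight matrix with twice as many columns as the sum aggregator, because its input is the concatenation of two vectors of the same dimension.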

S5: optimizing a loss function on the training result of the GCN drug pair interaction prediction model, and then feeding the result back into the GCN drug pair interaction prediction model for further training.

Specifically, the training result of the GCN layer is sent to a classifier, and the classifier optimizes the loss function and then continues to send the result to the embedding layer to continue training the GCN drug pair interaction prediction model.

Preferably, binary cross entropy is used as the loss function: the predicted probability value of the classification result is compared with the original label y_{i,j} to obtain the objective function LOSS.
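For reference, the standard binary cross entropy between a label y and a predicted probability p is -(y·log p + (1 - y)·log(1 - p)). The sketch below averages this over the samples; whether the patent sums or averages, and whether it adds a regularization term, is not stated in this text, so both choices here are assumptions.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Predictions close to the labels give a lower loss than predictions far away.
loss_good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
loss_bad = binary_cross_entropy([1, 0, 1], [0.2, 0.7, 0.3])
```

A prediction of 0.5 for a single sample yields a loss of log 2, which is a convenient sanity check when wiring the loss into a training loop.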

S6: completing the training of the GCN drug pair interaction prediction model through multiple iterative calculations.

Training of the GCN drug pair interaction prediction model is complete when 50 iterative calculations have been performed or when the result no longer changes over 5 consecutive iterative calculations.
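The stopping rule can be sketched as a small loop. It assumes that "no longer changes" means the monitored result equals the previous one (within a tolerance `tol`) five times in a row; the source does not say which quantity is monitored, and `train_until_converged` and `run_iteration` are hypothetical names.

```python
def train_until_converged(run_iteration, max_iterations=50, patience=5, tol=0.0):
    """Iterate until the cap is reached or the result is unchanged `patience` times."""
    history = []
    unchanged = 0
    for _ in range(max_iterations):
        result = run_iteration()
        # Count consecutive iterations whose result matches the previous one.
        if history and abs(result - history[-1]) <= tol:
            unchanged += 1
        else:
            unchanged = 0
        history.append(result)
        if unchanged >= patience:
            break
    return history


# Stub standing in for one training iteration that returns a validation score.
scores = iter([0.70, 0.80, 0.85] + [0.90] * 20)
history = train_until_converged(lambda: next(scores))
```

With a real model, `run_iteration` would run one pass over the training set and return the monitored metric; a small positive `tol` is usually more practical than exact equality on floating-point values.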

S7: inputting the drug pair data samples in the test set into the GCN drug pair interaction prediction model to obtain a test result.

Referring to FIG. 2 and FIG. 3, the present invention further provides a knowledge graph-based graph convolutional system 100 for predicting drug pair interactions, including a drug pair extraction module 10, a knowledge graph construction module 20, a GCN drug pair interaction prediction model 30, and a classifier 40, wherein:

the drug pair extraction module 10 is used for generating data samples required by the model;

the knowledge graph construction module 20 is configured to generate a knowledge graph corresponding to each data set;

the GCN drug pair interaction prediction model 30 is used for learning the feature information of the drugs and their neighborhoods, and is arranged at the output ends of the knowledge graph construction module 20 and the drug pair extraction module 10, and the GCN drug pair interaction prediction model 30 includes an embedding layer 310 and two GCN layers 320;

the classifier 40 is used for generating output labels of task classification, and is arranged at the output end of the GCN drug pair interaction prediction model 30.

Specifically, the embedding layer 310 is disposed at the output ends of the drug pair extraction module 10 and the knowledge graph construction module 20, and the classifier 40 is disposed at the output end of the GCN layers 320.

In the following, the performance of the knowledge graph-based graph convolutional method and system for predicting drug pair interactions is evaluated.

It should be noted that the data set used in this embodiment for the performance evaluation is the DrugBank data set (v5.1.4, released on July 2, 2019). The data set includes 2578 experimentally validated small-molecule drugs and 612388 unique drug pairs screened from 13339 experimentally validated drugs. The statistics of the data set and of the constructed knowledge graph are shown in Table 1 below.

TABLE 1 DrugBank data set and constructed knowledge graph statistics table

To evaluate the performance of the knowledge graph-based graph convolutional system for predicting drug pair interactions, this embodiment carries out drug-drug interaction prediction experiments on potential drug pairs in the DrugBank data set. In this set of experiments, ACC, AUPR, AUC-ROC and F1 scores are used as the evaluation indexes of model performance, and 5-fold cross-validation comparison experiments are performed against 9 models from 5 mainstream methods.
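Three of the four evaluation indexes can be computed with a few lines of code. The sketch below uses toy labels and scores, a classification threshold of 0.5 (an assumption), and a pairwise definition of AUC-ROC: the fraction of positive/negative pairs in which the positive sample receives the higher score. AUPR is omitted for brevity.

```python
def accuracy_and_f1(labels, predictions):
    """Return (ACC, F1) for binary labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return correct / len(pairs), f1


def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: three interacting pairs and three non-interacting pairs.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]
acc, f1 = accuracy_and_f1(labels, predictions)
auc = auc_roc(labels, scores)
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration but too slow for a data set with hundreds of thousands of drug pairs; a sort-based implementation would be used in practice.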

Among the comparison methods, Laplacian, GraRep, DeepWalk, struc2vec, LINE, SDNE and GAE were all run using BioNEV (https://github.com/xiangyue9607/BioNEV), with the experimental parameters kept at their defaults.

For the DeepDDI method, we rebuilt the DNN model and changed the multi-class classification into binary classification. The source code address is https://bitbucket.org/kaistsystemsbiology/deepddi/src/master/.

For the KG-DDI method, we re-ran RDF2Vec to generate 300-dimensional embedding vectors. The source code address is https://github.com/rezacsedu/Drug-drug-interaction-prediction.

the specific experimental parameter settings provided by the present invention are shown in table 2.

TABLE 2 Hyperparameter settings used in the experiments

Parameter	Setting
Batch size	4096
Learning rate	1e-2
L2 weight	1e-7
Vector dimension	32
Number of GCN layers	2
Neighborhood size	16
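The settings in Table 2 can be collected in a single configuration object. The sketch below is one possible arrangement; the class and field names are hypothetical. The helper function assumes that each GCN layer expands the sampled neighborhood by one hop, so that two layers with 16 sampled neighbors touch up to 16 + 16² entities per drug.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters from Table 2 (field names are illustrative)."""

    batch_size: int = 4096
    learning_rate: float = 1e-2
    l2_weight: float = 1e-7
    embedding_dim: int = 32
    num_gcn_layers: int = 2
    neighborhood_size: int = 16


def sampled_entities_per_drug(cfg):
    """Upper bound on sampled neighbors, assuming one extra hop per GCN layer."""
    return sum(
        cfg.neighborhood_size ** hop
        for hop in range(1, cfg.num_gcn_layers + 1)
    )


config = ExperimentConfig()
receptive_field = sampled_entities_per_drug(config)
```

Freezing the dataclass keeps the settings immutable during a run, which makes it harder to change a hyperparameter by accident between training and evaluation.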

The results of the performance tests of the method of the present invention on the DrugBank data set are shown in Table 3.

TABLE 3 Performance test results of the method of the present invention on the DrugBank data set

The experimental results in Table 3 show that the knowledge graph-based drug pair interaction prediction system provided by the invention outperforms the other comparison methods on the DrugBank data set. In particular, compared with all comparison methods, the method provided by the invention improves the ACC, AUPR, AUC-ROC and F1 scores by at least 11.18%, 6.99%, 6.51% and 11% respectively, thereby achieving a high standard of classification.

While the foregoing is directed to embodiments of the present invention, it will be understood by those skilled in the art that various changes may be made without departing from the spirit and scope of the invention.

Claims

1. A knowledge graph-based graph convolutional method for predicting drug pair interactions, characterized by comprising the following steps:

S2: constructing a knowledge graph corresponding to the data set;

2. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 1, wherein the training set accounts for 80% of the data set, the validation set accounts for 10% of the data set, and the test set accounts for 10% of the data set.

3. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 1, wherein the step S2 includes the steps of:

S22: converting the compressed file into RDF graph data using the Bio2RDF tool;

4. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 1, wherein the GCN drug pair interaction prediction model comprises an embedding layer and two GCN layers arranged in sequence.

5. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 4, wherein the knowledge graph is input to the GCN layers for neighborhood information aggregation operations to train the GCN drug pair interaction prediction model.

6. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 4, wherein the step S4 is specifically: sequentially inputting the drug pair data samples in the training set into the embedding layer and the GCN layers for processing.

7. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 6, wherein the step S4 includes the steps of:

8. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 1, wherein the step S5 is specifically: sending the training result of the GCN layers to a classifier, and, after the classifier optimizes a loss function, feeding the result back to the embedding layer to continue training the GCN drug pair interaction prediction model.

9. The knowledge graph-based graph convolutional method for predicting drug pair interactions of claim 1, wherein in step S6, training of the GCN drug pair interaction prediction model is complete when 50 iterative calculations have been performed or when the result no longer changes over 5 consecutive iterative calculations.

10. A knowledge graph-based graph convolutional system for predicting drug pair interactions, comprising: