CN113571125A

CN113571125A - Drug target interaction prediction method based on multilayer network and graph coding

Info

Publication number: CN113571125A
Application number: CN202110865457.9A
Authority: CN
Inventors: 刘闯; 王逸伟; 詹秀秀; 张子柯
Original assignee: Hangzhou Normal University
Current assignee: Hangzhou Normal University
Priority date: 2021-07-29
Filing date: 2021-07-29
Publication date: 2021-10-29

Abstract

The invention discloses a drug-target interaction prediction method based on a multilayer network and graph coding. The method comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module and a result evaluation module. The data preprocessing module constructs the drug and protein networks and processes the heterogeneous graphs. The feature learning module performs self-supervised learning with structural graph encoders, encodes the graphs into vectors and processes vectors of the same type, so that the topology information of each graph is represented in vector form. The model algorithm design module constructs a cross-validation set and designs a prediction model. The result evaluation module verifies the prediction performance of the model using an ROC curve based on the confusion matrix and a PR curve based on the precision-recall sequence. The method studies drugs and targets from the perspectives of data mining and graphs, and predicts drug-target interactions from the generated graph structure information and a subsequent tree model.

Description

Drug target interaction prediction method based on multilayer network and graph coding

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a drug-target interaction prediction method based on a multilayer network and graph coding.

Background

With the rapid development of machine learning, the progress of biological detection technologies such as third-generation gene sequencing, and the arrival of the big data era in this field brought about by the rapid growth of biological data, more and more researchers and companies are turning to AI-assisted drug development. The most intuitive advantage of using computer algorithms to assist target screening is that candidate drugs can be screened computationally and the candidate range narrowed, which greatly shortens the new drug discovery cycle and reduces the research materials it consumes. Practical application data indicate that AI technology can reduce drug development costs by about 35%. An analysis of the net income trends of leading international pharmaceutical companies in recent years shows that the net income of most of them increased to varying degrees after AI-assisted drug research and development was introduced. AI technology can also perform multi-target analysis of a drug to predict its multiple targets, thereby revealing the complex mechanisms of action of some diseases. In addition, AI technology can improve the accuracy and safety of drug prediction and help explore the mechanisms of drug side effects. On the whole, therefore, AI technology can greatly simplify the new drug research and development process, save research and development costs, and help pharmaceutical companies develop new drugs quickly.

Disclosure of Invention

The invention aims to provide a drug-target interaction prediction method based on a multilayer network and graph coding, which can eliminate the randomness of clinical experiments, narrow the screening range and shorten the test cycle.

The invention constructs nine drug-related networks (the drug interaction network, drug-disease association network, drug side-effect association network, drug chemical similarity network, drug therapeutic similarity network, drug action-target sequence similarity network, drug biological process similarity network, drug molecular function similarity network and drug action cellular component similarity network), six target-related networks (the target interaction network, target-disease association network, target sequence similarity network, target biological process similarity network, target cellular component similarity network and target molecular function similarity network) and a drug-target interaction network used as the label. A structural autoencoder is trained separately on each of these networks, the trained autoencoders encode the nodes into vectors, and the encoded vectors of a node in the different networks are finally concatenated to form its final feature vector. The drug-target pairs to be predicted are then fed into a trained boosting tree model (a model obtained by linearly adding a series of decision trees constructed from the training set) to obtain the final evaluation score.

The method comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module and a result evaluation module.

(1) The data acquisition module comprises:

(1-1) for drugs, collecting drug-drug interaction data, drug-disease relationship data, drug-side-effect relationship data, and six different types of data for drug-pair similarity, including: chemical fingerprint data of the drug, therapeutic data of the drug, peptide chain data of the action targets of the drug, biological process data of the drug, molecular function data of the drug and action cellular component data of the drug;

(1-2) for targets, namely proteins, collecting target-target interaction data, target-disease relationship data, and four different types of data for target-pair similarity, including: peptide chain data of the target, biological process data of the target, cellular component data of the target and molecular function data of the target;

(1-3) collecting drug-target interaction data.

(2) The data preprocessing module comprises constructing the drug- and target-related networks and generating the multilayer networks;

(2-1) constructing the drug- and target-related networks comprises:

A. for interaction data between objects of a single class, constructing homogeneous interaction networks, including the drug interaction network G_1D and the target interaction network G_1T;

B. for interaction data between objects of different classes, constructing heterogeneous interaction networks, including the drug-disease association network G_{D_DI}, the drug side-effect association network G_{D_SE} and the target-disease association network G_{T_DI};

C. collecting drug information of different dimensions and constructing drug similarity networks, including the chemical similarity network G_2D, the therapeutic similarity network G_3D, the action-target sequence similarity network G_4D, the biological process similarity network G_5D, the molecular function similarity network G_6D and the action cellular component similarity network G_7D of the drugs;

D. collecting target information of different dimensions and constructing target similarity networks, including the target sequence similarity network G_2T, the target biological process similarity network G_3T, the target cellular component similarity network G_4T and the target molecular function similarity network G_5T;

E. constructing the drug-target interaction network G_{D_T}.

(2-2) generating the multilayer networks comprises generating a drug multilayer network and generating a target multilayer network, with the following specific steps:

(2-2-1) first, the drug-disease association network G_{D_DI} is decomposed and converted into the drug disease similarity network G_8D = (V_8D, E_8D), where V_8D and E_8D respectively represent the drug node set of the network and the set of edge weights of the disease similarity between two drugs; the edge weight of the disease similarity between two drugs is the cosine similarity

\frac{x_{D_M} \cdot y_{D_M}}{\|x_{D_M}\| \, \|y_{D_M}\|}

where x_{D_M} and y_{D_M} are the row vectors corresponding to the two drugs in the adjacency matrix of G_{D_DI}, and \|\cdot\| denotes the vector norm;

the drug side-effect association network G_{D_SE} is decomposed and converted into the drug side-effect similarity network G_9D = (V_9D, E_9D), where V_9D and E_9D respectively represent the drug node set of the network and the set of edge weights of the side-effect similarity between two drugs; the edge weight of the side-effect similarity between two drugs is

\frac{x_{D_SE} \cdot y_{D_SE}}{\|x_{D_SE}\| \, \|y_{D_SE}\|}

where x_{D_SE} and y_{D_SE} are the row vectors corresponding to the two drugs in the adjacency matrix of G_{D_SE};

the target-disease association network G_{T_DI} is decomposed and converted into the target disease similarity network G_6T = (V_6T, E_6T), where V_6T and E_6T respectively represent the target node set of the network and the set of edge weights of the disease similarity between two targets; the edge weight of the disease similarity between two targets is

\frac{x_{T_DI} \cdot y_{T_DI}}{\|x_{T_DI}\| \, \|y_{T_DI}\|}

where x_{T_DI} and y_{T_DI} are the row vectors corresponding to the two targets in the adjacency matrix of G_{T_DI};

(2-2-2) then, the drug-related networks are combined into the drug multilayer network G_D = {G_iD = (V_iD, E_iD)}, where i is the drug network number, i ∈ [1, 9]; the target-related networks are combined into the target multilayer network G_T = {G_jT = (V_jT, E_jT)}, where j is the target network number, j ∈ [1, 6].

(3) The feature learning module comprises training the structural autoencoders, encoding output and processing the feature vectors of the same type;

(3-1) training the structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network G_D and the target multilayer network G_T;

(3-2) encoding output: the encoder part of each trained structural autoencoder encodes its corresponding network layer, giving multilayer vectors for all drugs and targets;

(3-3) processing the feature vectors of the same type: the multilayer vectors of a drug are concatenated to obtain the final feature vector representation of the drug; the multilayer vectors of a target are concatenated to obtain the final feature vector representation of the target.

(4) The model algorithm design module comprises constructing training samples, training and evaluating the model, and predicting drug-target interactions;

(4-1) constructing training samples: training samples are constructed with a PairWise model, the data are randomly divided into M parts and M-fold cross-validation is performed, that is, one part is selected as the validation set and the rest as the training set each time, and the model parameters are adjusted according to the overall cross-validation performance, where M is a positive integer greater than 3;

(4-2) training and evaluating the model: a light gradient boosting decision tree is adopted, and a boosting tree is built with decision trees as weak learners, that is, decision trees T(x, θ_l) are built iteratively, where x and θ_l are respectively the input feature vector and the learnable parameters of the l-th decision tree;

(4-3) predicting drug-target interactions: using the optimal prediction model obtained by the result evaluation module, the interaction probabilities of all drug-target pairs are calculated, and the pairs with high probability are screened out as candidate interacting drug-target pairs, which form the prediction result.

(5) The result evaluation module verifies the prediction performance of the model using an ROC curve and a PR curve, as follows:

(5-1) plotting the ROC curve: the false positive rate FPR is taken as the horizontal axis and the true positive rate TPR as the vertical axis; the larger the area under the ROC curve (AUROC), the better the prediction performance of the model;

the true positive rate TPR_α and the false positive rate FPR_α of the ROC curve are calculated from the confusion matrix as follows:

TPR_α = \frac{TP_α}{TP_α + FN_α}, FPR_α = \frac{FP_α}{FP_α + TN_α}

a drug-target pair with an interaction is a positive sample, and a pair without an interaction is a negative sample; TP_α is the number of positive samples in the test set predicted as positive, FP_α is the number of negative samples in the test set predicted as positive, FN_α is the number of positive samples in the test set predicted as negative, and TN_α is the number of negative samples in the test set predicted as negative; α represents the prediction confidence;

(5-2) plotting the PR curve: the precision-recall sequence is composed of precision_α and recall_α at different prediction confidences α:

precision_α = \frac{TP_α}{TP_α + FP_α}, recall_α = \frac{TP_α}{TP_α + FN_α}

the precision-recall curve, namely the PR curve, is plotted with recall as the horizontal axis and precision as the vertical axis; the area under the PR curve (AUPR) reflects the overall classification performance of the classifier, and the larger the AUPR, the better the prediction performance of the model;

(5-3) evaluating the model: based on the prediction results of step (4-3), AUROC and AUPR are calculated from the plotted ROC and PR curves, and the model parameters giving the best prediction results are sought.

The method studies the interaction of drug-target pairs from the perspectives of data mining and multilayer networks. By constructing networks, it abstracts different types of data into the same data structure, and it realizes drug-target prediction by combining the decomposition of heterogeneous networks, structural autoencoders that automatically learn network topology, and tree-based classifiers. The method can therefore effectively analyze drug and target data and predict the interactions between them, providing scientific guidance for new drug research and development, improving its efficiency and, to a certain extent, promoting independent innovation in medicine.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

Detailed Description

The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings.

The data used in this embodiment cover 732 drugs, 1915 targets (proteins), and the corresponding data on 12904 side effects and 440 diseases. They comprise interaction data between drug pairs, between drugs and diseases, between drugs and side effects, between targets, and between targets and diseases; MACCS fingerprint data of the chemical formulas of the drugs; GO annotations of the drugs and targets; protein sequence data of the targets; and half-maximal inhibitory concentration data between drugs and targets.

As shown in fig. 1, a method for predicting drug target interaction based on multilayer network and graph coding comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module, and a result evaluation module, and specifically comprises the following steps:

(1) a data acquisition module comprising:

(1-3) collecting drug-target interaction data;

the above data are downloaded from public websites.

(2) The data preprocessing module comprises constructing the drug- and target-related networks and generating the multilayer networks, which provides the data basis for drug-target prediction, specifically as follows:

(2-1) constructing the drug- and target-related networks, comprising:

(I) for the drug-drug interaction data, constructing the drug interaction network G_1D = (V_1D, E_1D), where V_1D represents the set of drug nodes in the network and E_1D represents the set of edges between pairs of drugs in the network that interact;

for the target-target interaction data, constructing the target interaction network G_1T = (V_1T, E_1T), where V_1T represents the set of target nodes in the network and E_1T represents the set of edges between pairs of targets in the network that interact;

(II) for the drug-disease relationship data, constructing the drug-disease association network G_{D_DI}, which consists of a drug node set, a disease node set and the edge set E_{D_DI} of the relationships between drugs and diseases in the network;

for the drug-side-effect relationship data, constructing the drug side-effect association network G_{D_SE}, which consists of a drug node set, a side-effect node set and the edge set E_{D_SE} of the relationships between drugs and side effects in the network;

for the target-disease relationship data, constructing the target-disease association network G_{T_DI}, which consists of a target node set, a disease node set and the edge set E_{T_DI} of the relationships between targets and diseases in the network;

(III) for the chemical fingerprint data of the drugs, constructing the drug chemical similarity network G_2D = (V_2D, E_2D), where V_2D and E_2D respectively represent the drug node set of the network and the set of edge weights of the chemical similarity between two drugs; the edge weight of the chemical similarity is calculated from a₁, b₁ and c₁, where a₁ and b₁ are the numbers of bits of the MACCS fingerprints of the two drugs respectively, and c₁ is the number of bits the two drugs have in common;

for the therapeutic data of the drugs, constructing the drug therapeutic similarity network G_3D = (V_3D, E_3D), where V_3D and E_3D respectively represent the drug node set of the network and the set of edge weights of the therapeutic similarity between two drugs; the edge weight of the therapeutic similarity is calculated from a₂, b₂ and c₂, where a₂ and b₂ are the numbers of digits of the ATC codes of the two drugs respectively, and c₂ is the number of ATC code digits the two drugs have in common;

for the peptide chain data of the action targets of the drugs, constructing the drug action-target sequence similarity network G_4D = (V_4D, E_4D), where V_4D and E_4D respectively represent the drug node set of the network and the set of edge weights of the action-target similarity between two drugs; the edge weight of the action-target similarity is the mean, mean(·), of the sequence similarities T_{T_T}(a, b), where a and b are targets of the two drugs respectively;

for the biological process data of the drugs, constructing the drug biological process similarity network G_5D = (V_5D, E_5D), where V_5D and E_5D respectively represent the drug node set of the network and the set of edge weights of the biological process similarity between two drugs; the edge weight is calculated from T_{T_P}(a, b), the biological process similarity of the respective targets a and b of the two drugs;

for the molecular function data of the drugs, constructing the drug molecular function similarity network G_6D = (V_6D, E_6D), where V_6D and E_6D respectively represent the drug node set of the network and the set of edge weights of the molecular function similarity between two drugs; the edge weight is calculated from T_{T_M}(a, b), the molecular function similarity of the respective targets a and b of the two drugs;

for the action cellular component data of the drugs, constructing the drug action cellular component similarity network G_7D = (V_7D, E_7D), where V_7D and E_7D respectively represent the drug node set of the network and the set of edge weights of the action cellular component similarity between two drugs; the edge weight is calculated from T_{T_C}(a, b), the cellular component similarity of the respective targets a and b of the two drugs;

(IV) for the peptide chain data of the targets, constructing the target sequence similarity network G_2T = (V_2T, E_2T), where V_2T and E_2T respectively represent the target node set of the network and the set of edge weights of the sequence similarity between two targets; the edge weight of the sequence similarity is calculated from a₃, b₃ and c₃, where a₃ and b₃ are the numbers of positions in the peptide chain sequences of the two targets respectively, and c₃ is the number of positions at which the two sequences are identical;

for the biological process data of the targets, constructing the target biological process similarity network G_3T = (V_3T, E_3T), where V_3T and E_3T respectively represent the target node set of the network and the set of edge weights of the biological process similarity between two targets; the edge weight T_{T_P}(a, b) of the biological process similarity is obtained from the GO semantic annotations of the biological processes of the two targets;

for the cellular component data of the targets, constructing the target cellular component similarity network G_4T = (V_4T, E_4T), where V_4T and E_4T respectively represent the target node set of the network and the set of edge weights of the cellular component similarity between two targets; the edge weight T_{T_C}(a, b) of the cellular component similarity is obtained from the GO semantic annotations of the cellular components of the two targets;

for the molecular function data of the targets, constructing the target molecular function similarity network G_5T = (V_5T, E_5T), where V_5T and E_5T respectively represent the target node set of the network and the set of edge weights of the molecular function similarity between two targets; the edge weight T_{T_M}(a, b) of the molecular function similarity is obtained from the GO semantic annotations of the molecular functions of the two targets;

(V) for the drug-target interaction data, constructing the drug-target interaction network G_{D_T}, which consists of a drug node set, a target node set and the edge set E_{D_T} of the relationships between drugs and targets in the network.
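The similarity edge weights in (III) and (IV) can be illustrated with a short sketch. The text names the quantities a, b and c (bit or digit counts of the two objects and the count they share) but the expression that combines them is not reproduced here, so the Tanimoto form c / (a + b - c) used below is an assumption, a common choice for fingerprint comparison rather than the confirmed formula. The helper names and toy data are hypothetical.

```python
def tanimoto(a_count, b_count, common_count):
    """Tanimoto coefficient c / (a + b - c).

    Assumed combination of the counts a, b, c named in the text; the
    original expression is not available.
    """
    denom = a_count + b_count - common_count
    return common_count / denom if denom > 0 else 0.0


def fingerprint_similarity(bits_a, bits_b):
    """bits_a, bits_b: sets holding the indices of the bits set in each fingerprint."""
    return tanimoto(len(bits_a), len(bits_b), len(bits_a & bits_b))


def mean_target_similarity(targets_a, targets_b, target_sim):
    """Mean of target_sim[(a, b)] over every target a of one drug and b of the other."""
    pairs = [(a, b) for a in targets_a for b in targets_b]
    if not pairs:
        return 0.0  # assumption: drugs without known targets get weight 0
    return sum(target_sim[(a, b)] for a, b in pairs) / len(pairs)


# Toy MACCS-like fingerprints given as sets of set-bit indices (made-up data).
toy_fp_1 = {1, 5, 9, 12}
toy_fp_2 = {5, 9, 20}
toy_chem_weight = fingerprint_similarity(toy_fp_1, toy_fp_2)  # 2 / (4 + 3 - 2)

# Toy target-target sequence similarities T_{T_T}(a, b) (made-up data).
toy_target_sim = {("P1", "P3"): 0.8, ("P2", "P3"): 0.4}
toy_seq_weight = mean_target_similarity(["P1", "P2"], ["P3"], toy_target_sim)
```

The same `mean_target_similarity` helper would apply to the biological process, molecular function and cellular component layers if their edge weights are also means over target pairs, which the text suggests but does not state for those layers.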

(2-2) generating the multilayer networks, including generating a drug multilayer network and generating a target multilayer network:

(2-2-1) the drug-disease association network G_{D_DI} is decomposed and converted into the drug disease similarity network G_8D = (V_8D, E_8D), where V_8D and E_8D respectively represent the drug node set of the network and the set of edge weights of the disease similarity between two drugs; the edge weight of the disease similarity between two drugs is the cosine similarity

\frac{x_{D_M} \cdot y_{D_M}}{\|x_{D_M}\| \, \|y_{D_M}\|}

where x_{D_M} and y_{D_M} are the row vectors corresponding to the two drugs in the adjacency matrix of G_{D_DI}, and \|\cdot\| denotes the vector norm;

the drug side-effect association network G_{D_SE} is decomposed and converted into the drug side-effect similarity network G_9D = (V_9D, E_9D), where V_9D and E_9D respectively represent the drug node set of the network and the set of edge weights of the side-effect similarity between two drugs; the edge weight of the side-effect similarity between two drugs is

\frac{x_{D_SE} \cdot y_{D_SE}}{\|x_{D_SE}\| \, \|y_{D_SE}\|}

where x_{D_SE} and y_{D_SE} are the row vectors corresponding to the two drugs in the adjacency matrix of G_{D_SE};

the target-disease association network G_{T_DI} is decomposed and converted into the target disease similarity network G_6T = (V_6T, E_6T), where V_6T and E_6T respectively represent the target node set of the network and the set of edge weights of the disease similarity between two targets; the edge weight of the disease similarity between two targets is

\frac{x_{T_DI} \cdot y_{T_DI}}{\|x_{T_DI}\| \, \|y_{T_DI}\|}

where x_{T_DI} and y_{T_DI} are the row vectors corresponding to the two targets in the adjacency matrix of G_{T_DI};

(2-2-2) the drug interaction network, drug disease similarity network, drug side-effect similarity network, drug chemical similarity network, drug therapeutic similarity network, drug action-target sequence similarity network, drug biological process similarity network, drug molecular function similarity network and drug action cellular component similarity network are combined into the drug multilayer network G_D = {G_iD = (V_iD, E_iD)}, where i is the drug network number, i ∈ [1, 9];

the target interaction network, target disease similarity network, target sequence similarity network, target biological process similarity network, target cellular component similarity network and target molecular function similarity network are combined into the target multilayer network G_T = {G_jT = (V_jT, E_jT)}, where j is the target network number, j ∈ [1, 6].
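The decomposition in step (2-2-1) can be sketched as follows: each row of the adjacency matrix of a heterogeneous network (for example drugs by diseases) is compared with every other row by cosine similarity, giving a homogeneous similarity network over the row objects. This is a minimal illustration under stated assumptions: the function names and the toy matrix are hypothetical, and returning 0 for a node with no associations is a choice made here, not something the text specifies.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a node without associations gets similarity 0
    return dot / (norm_x * norm_y)


def project_to_similarity(adjacency):
    """Turn a (row objects x column objects) 0/1 matrix into a symmetric
    row-object x row-object matrix of cosine edge weights."""
    n = len(adjacency)
    sim = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for g in range(p + 1, n):
            w = cosine_similarity(adjacency[p], adjacency[g])
            sim[p][g] = sim[g][p] = w
    return sim


# Toy drug-disease adjacency matrix: 3 drugs (rows) x 4 diseases (columns).
toy_d_di = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
toy_g_8d = project_to_similarity(toy_d_di)
```

In the toy matrix, the first two drugs share one disease and therefore receive a positive edge weight, while the third drug shares none and receives weight 0 with both.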

(3) A feature learning module:

in machine learning problems, the data and features determine the upper limit of the prediction result, and models and algorithms only approach that upper limit. The feature learning module of the invention addresses the first half of this statement, namely the features, so that the model algorithm can learn from better features and reach the most accurate prediction result possible. Based on the drug multilayer network G_D and the target multilayer network G_T, the module uses structural autoencoders to encode the network structure automatically, thereby ensuring the completeness of feature extraction.

(3-1) training the structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network G_D and the target multilayer network G_T, and the training process is as follows:

a. the adjacency matrix corresponding to the single-layer network is used as the input of the encoder;

b. after encoding, the output of the encoder is obtained and used as the input of the decoder;

c. after decoding, the output of the decoder is obtained, and the loss function is calculated from the adjacency matrix, the output of the encoder and the output of the decoder;

d. the gradient of each parameter of the encoder and the decoder is calculated from the loss function and the parameters are updated, the update step being a multiple of the negative gradient;

e. steps b through d are repeated until the loss function converges.

The loss function L_m consists of two parts:

the first-order similarity loss L_1st, in which N is the number of nodes, z_p and z_g are the encoder output vectors of node p and node g respectively, and T_pg is the weight of the edge connecting them; for an interaction network, T_pg can only take the values 0 and 1, representing no edge and an edge respectively; for a similarity network, T_pg can take any value between 0 and 1, inclusive. This loss is defined so that drugs or targets with high similarity are encoded into feature vectors that are as similar as possible;

the second-order similarity loss L_2nd, in which the encoder input vector b_n of node n is compared with the corresponding decoder output vector. This loss is defined so that the decoder can reconstruct the original input vector from the encoded vector as closely as possible, so that the encoded vector retains as much information of the original vector as possible.

The total loss function is L_m = L_2nd + λL_1st, where λ is a penalty coefficient, 0 < λ < 1.
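As an illustration of how the two loss terms could be combined, the sketch below evaluates L_m for given encoder codes and decoder reconstructions. The exact expressions for L_1st and L_2nd are not reproduced in the text, so the squared Euclidean distances used here (the form commonly used by structural deep network embedding methods) are an assumption, and all function names and toy values are hypothetical. Only the combination L_m = L_2nd + λL_1st is taken from the text.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def first_order_loss(weights, codes):
    """Assumed form: sum over node pairs (p, g) of T_pg * ||z_p - z_g||^2."""
    n = len(codes)
    return sum(
        weights[p][g] * squared_distance(codes[p], codes[g])
        for p in range(n)
        for g in range(n)
    )


def second_order_loss(inputs, reconstructions):
    """Assumed form: sum over nodes n of ||b_n - reconstruction_n||^2."""
    return sum(squared_distance(b, r) for b, r in zip(inputs, reconstructions))


def total_loss(weights, codes, reconstructions, lam):
    """L_m = L_2nd + lam * L_1st; the encoder input b_n is row n of the adjacency matrix."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return second_order_loss(weights, reconstructions) + lam * first_order_loss(weights, codes)


# Toy single-layer network with 3 nodes; nodes 0 and 1 are connected.
toy_adjacency = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
toy_codes = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]          # encoder outputs z_p
toy_reconstructions = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]   # decoder outputs
toy_loss = total_loss(toy_adjacency, toy_codes, toy_reconstructions, lam=0.5)
```

In a training loop following steps a through e, this value would be recomputed after every parameter update and its gradient used to adjust the encoder and decoder weights; the gradient computation itself is left out of the sketch.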

(3-2) encoding output: the encoder part of each trained structural autoencoder encodes its corresponding network layer, giving multilayer vectors for all drugs and targets.

(3-3) processing the feature vectors of the same type:

the multilayer vectors of a drug are concatenated to obtain the final feature vector representation of the drug;

the multilayer vectors of a target are concatenated to obtain the final feature vector representation of the target.
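The concatenation in step (3-3) can be sketched in a few lines. The per-layer code tables and node names below are made up for illustration. How the drug vector and the target vector are combined into the input of the classifier is not stated in the text, so the `pair_features` helper, which simply concatenates the two, is an assumption.

```python
def concat_layers(layer_codes, node):
    """Concatenate the code of `node` from every network layer, in layer order."""
    feature = []
    for codes in layer_codes:
        feature.extend(codes[node])
    return feature


def pair_features(drug_vector, target_vector):
    """Assumed pair representation: drug vector followed by target vector."""
    return list(drug_vector) + list(target_vector)


# Toy codes: two drug layers (2-dim and 3-dim) and one target layer (2-dim).
toy_drug_layers = [
    {"drugA": [0.1, 0.2]},
    {"drugA": [0.3, 0.4, 0.5]},
]
toy_target_layers = [{"protX": [0.9, 0.8]}]

toy_drug_vec = concat_layers(toy_drug_layers, "drugA")
toy_target_vec = concat_layers(toy_target_layers, "protX")
toy_pair_vec = pair_features(toy_drug_vec, toy_target_vec)
```

With nine drug layers and six target layers, the final drug and target vectors have nine and six times the per-layer code length respectively, assuming every layer uses the same code size.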

(4) A model algorithm design module comprising:

(4-1) constructing training samples: drug-target pairs comprise verified pairs and unverified pairs, and the unverified pairs include pairs that do interact but have not yet been discovered. The invention seeks to find, among the unverified drug-target pairs, those that do interact but have not been discovered. It can therefore be assumed that the probability that an unverified drug-target pair interacts is not greater than that of a verified interacting pair. Based on this assumption, a PairWise model is used to construct the training samples: a positive sample is drawn from the verified interacting drug-target pairs, a negative sample is drawn from the unverified drug-target pairs, and the corresponding positive and negative samples are paired, giving positive and negative training sample sets of equal size. The data are then randomly divided into M parts and M-fold cross-validation is performed, that is, one part is selected as the validation set and the rest as the training set each time, and the model parameters are adjusted according to the overall cross-validation performance, where M is a positive integer greater than 3.
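A minimal sketch of the sample construction and M-fold split described in (4-1), using only the standard library. The function names, the fixed random seed and the toy pair identifiers are hypothetical, and drawing the negatives uniformly at random without replacement is an assumption about how the unverified pairs are sampled.

```python
import random


def build_balanced_samples(verified_pairs, unverified_pairs, seed=0):
    """Pair every verified (positive) pair with one randomly drawn unverified
    (negative) pair, so that both classes have the same size."""
    rng = random.Random(seed)
    negatives = rng.sample(unverified_pairs, len(verified_pairs))
    samples = [(pair, 1) for pair in verified_pairs]
    samples += [(pair, 0) for pair in negatives]
    rng.shuffle(samples)
    return samples


def m_fold_splits(samples, m):
    """Yield (training, validation) lists for M-fold cross-validation."""
    if m <= 3:
        raise ValueError("M must be a positive integer greater than 3")
    folds = [samples[k::m] for k in range(m)]
    for k in range(m):
        training = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield training, folds[k]


# Toy data: 5 verified pairs and 10 unverified pairs (made-up identifiers).
toy_verified = [("drug%d" % i, "prot%d" % i) for i in range(5)]
toy_unverified = [("drug%d" % i, "prot%d" % (i + 100)) for i in range(10)]
toy_samples = build_balanced_samples(toy_verified, toy_unverified)
toy_splits = list(m_fold_splits(toy_samples, m=5))
```

Each of the five toy splits holds eight training samples and two validation samples, and every sample appears in exactly one validation set.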

(4-2) training and evaluating the model: a light gradient boosting decision tree is adopted, and a boosting tree is built with decision trees as weak learners, that is, decision trees T(x, θ_l) are built iteratively, where x and θ_l are respectively the input feature vector and the learnable parameters of the l-th decision tree. The specific process is as follows:

(4-2-1) before each decision tree is constructed, the samples are screened with the gradient-based one-side sampling (GOSS) algorithm, that is, the small fraction of samples with large gradients is retained and a part of the samples with small gradients is randomly selected to calculate the variance gain, which reduces the number of samples;

(4-2-2) before each decision tree is constructed, mutually exclusive features are merged with the exclusive feature bundling (EFB) algorithm, which reduces the feature dimension;

(4-2-3) based on the screened samples, a fitting target is constructed for the l-th decision tree from the input feature vector x and the corresponding label y of each sample: when l = 1, the fitting target is the label of the sample, where the label of a positive sample is 1 and that of a negative sample is 0; when l ≥ 2, the fitting target is the negative gradient of the loss function L, evaluated at the output of the boosting tree obtained after the (l-1)-th iteration; L is defined, for the binary classification task, on a single sample (x, y) and its predicted value;

(4-2-4) based on the screened samples and the fitting target, a binary decision tree is constructed, in which a leaf node is split as follows: a histogram is built for each screened feature over its value range, the variance gain of each split point is calculated from the histogram, the feature and split point with the maximum variance gain are selected as the splitting feature and optimal split point of the current node, and the data of the leaf node are divided into two parts at the optimal split point; the recursion continues until the maximum depth of the tree is reached. The variance gain of feature f at split point d on dataset D is computed from the following quantities: x_i, x_{i,f} and g_i, which respectively denote the i-th sample vector, the f-th feature of the i-th sample vector and its negative gradient, and the numbers of samples in D whose feature f is respectively smaller than and larger than the split point d.

(4-2-5) K rounds of iteration are performed to generate K decision trees;

(4-2-6) the K decision trees are added to generate the final light gradient boosting decision tree H(x). For the input feature vector x of a sample, the output H(x) ∈ [0, 1] can be interpreted as the probability that the input sample is a positive sample;
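Steps (4-2-1) and (4-2-4) can be illustrated with a small standard-library sketch. The variance gain expression itself is not reproduced in the text, so the form below, the squared gradient sums of the two sides divided by their sample counts and normalized by the total count, follows the published LightGBM formulation and should be read as an assumption. The amplification factor (1 - top_rate) / other_rate applied to the sampled small-gradient samples is likewise taken from that formulation. Function names, rates and toy values are hypothetical, and candidate split points are passed in directly instead of being derived from histogram bins.

```python
import random


def goss_sample(gradients, top_rate, other_rate, seed=0):
    """Gradient-based one-side sampling: keep the top_rate fraction of samples
    with the largest |gradient| and a random other_rate fraction of the rest.
    Returns (index, weight) pairs; sampled small-gradient samples are
    up-weighted so that gradient sums stay approximately unbiased."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = round(top_rate * n)
    other_n = round(other_rate * n)
    rng = random.Random(seed)
    kept_small = rng.sample(order[top_n:], other_n)
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in order[:top_n]] + [(i, amplify) for i in kept_small]


def variance_gain(values, gradients, split):
    """Assumed variance gain of splitting one feature at `split`."""
    left = [g for v, g in zip(values, gradients) if v <= split]
    right = [g for v, g in zip(values, gradients) if v > split]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(values)
    return (sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)) / n


def best_split(values, gradients, candidate_splits):
    """Candidate split point with the largest variance gain."""
    return max(candidate_splits, key=lambda d: variance_gain(values, gradients, d))


# Toy feature column and negative gradients for four samples (made-up data).
toy_values = [0.1, 0.2, 0.8, 0.9]
toy_gradients = [-1.0, -1.0, 1.0, 1.0]
toy_best = best_split(toy_values, toy_gradients, [0.15, 0.5, 0.85])

toy_all_gradients = [0.1, -0.9, 0.2, 0.05, 0.8, -0.3, 0.02, 0.4, -0.15, 0.07]
toy_picked = goss_sample(toy_all_gradients, top_rate=0.2, other_rate=0.3)
```

In the toy column, the split at 0.5 separates the negative-gradient samples from the positive-gradient ones and therefore has the largest gain.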

(5-1) plotting ROC curves: plotting the ROC curve requires generating a confusion matrix, which is also an index for evaluating the model results, is part of the model evaluation, and is represented in the form of a square matrix, displaying the accuracy of the prediction results in a confusion matrix, each column representing the prediction category, the total number of each column representing the number of data predicted as the category, each row representing the true attribution category of data, and the total number of each row representing the number of data instances of the category.

The ROC curve is a classification model performance evaluation method introduced from the field of medical analysis and is suited to binary classification problems. When the ROC curve is drawn, the false positive rate FPR is the horizontal axis and the true positive rate TPR is the vertical axis; the larger the area under the ROC curve (AUROC), i.e., the closer it is to 1, the better the prediction performance of the model.

In the context of drug target prediction, a drug-target pair with an interaction is a positive sample and one without an interaction is a negative sample. TP_α denotes the number of positive samples in the test set predicted as positive, FP_α the number of negative samples in the test set predicted as positive, FN_α the number of positive samples in the test set predicted as negative, and TN_α the number of negative samples in the test set predicted as negative; α denotes the prediction confidence;
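The four counts and the resulting ROC point at one confidence α can be sketched as follows. This is a minimal illustration that assumes the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), and assumes a pair is predicted positive when its score is at least α. The function names are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_score, alpha):
    """Return (TP, FP, FN, TN) when scores >= alpha are predicted positive."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score, dtype=float) >= alpha
    tp = int(np.sum(y_pred & y_true))    # interacting pairs predicted as interacting
    fp = int(np.sum(y_pred & ~y_true))   # non-interacting pairs predicted as interacting
    fn = int(np.sum(~y_pred & y_true))   # interacting pairs predicted as non-interacting
    tn = int(np.sum(~y_pred & ~y_true))  # non-interacting pairs predicted as non-interacting
    return tp, fp, fn, tn

def roc_point(y_true, y_score, alpha):
    """Return (FPR, TPR), i.e. one point of the ROC curve, at confidence alpha."""
    tp, fp, fn, tn = confusion_counts(y_true, y_score, alpha)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr
```

Sweeping α over the range of predicted scores and collecting the (FPR, TPR) points yields the ROC curve whose area is the AUROC.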

(5-2) plotting the PR curve: plotting the PR curve requires generating a precision-recall sequence, which consists of the precision precision_α and the recall recall_α at different prediction confidences α, calculated as follows:

the precision describes the proportion of samples classified as positive at confidence α that are truly positive, and the recall describes the proportion of all positive samples that are correctly classified as positive at confidence α; the two show opposite trends as α changes. Therefore, using the sequence of precision-recall pairs generated by different α, a precision-recall curve (PR curve) is drawn with recall on the horizontal axis and precision on the vertical axis. The area under the PR curve (AUPR) reflects the overall classification performance of the classifier: the larger the AUPR, i.e., the closer it is to 1, the better the prediction performance of the model;
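A minimal sketch of the precision-recall sequence and its area is given below. It assumes the standard definitions precision_α = TP_α / (TP_α + FP_α) and recall_α = TP_α / (TP_α + FN_α), treats precision as 1 when nothing is predicted positive (a convention chosen here, not stated in the method), and approximates AUPR with the trapezoidal rule. The function names are hypothetical.

```python
import numpy as np

def pr_sequence(y_true, y_score, alphas):
    """Return a list of (recall, precision) pairs, one per confidence alpha."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    points = []
    for alpha in alphas:
        pred = y_score >= alpha
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        # Convention (assumption): precision is 1 when no sample is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((float(recall), float(precision)))
    return points

def aupr(points):
    """Trapezoidal area under the PR curve (recall on x, precision on y)."""
    # Sort by recall; at equal recall, visit the higher precision first.
    pts = sorted(points, key=lambda rp: (rp[0], -rp[1]))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

A denser grid of α values gives a smoother curve and a closer approximation of the area.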

Screening candidate drugs is a main way in which AI assists new drug development, and the two most critical steps are the computational modeling of drugs and targets (i.e., which data structure is used to represent them) and the selection of the prediction model. The method models drugs and targets in two different ways at different stages: as network nodes and as feature vectors. The two data models are described below, using drugs as an example.

A drug network reflects the relationships between drugs well, and a multilayer network formed from different types of drug networks reflects those relationships from different angles, providing a new approach to drug screening. Specifically, a drug network represents each drug as a node and defines the interactions between drugs as the edges between nodes. The definition of an edge differs between types of drug network, so each network expresses the relationship between drug pairs from a different perspective. Taking the chemical similarity network of drugs as an example, the edge weight between a node pair represents the chemical structure similarity of the corresponding drug pair, and the absence of an edge means the similarity is 0. When a drug network is constructed, the edge weights are usually normalized so that they range from 0 to 1.
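The construction of one weighted similarity layer can be sketched as follows. This minimal example assumes that the raw pairwise similarities are non-negative and given as a symmetric matrix, and it normalizes by dividing by the largest off-diagonal value, which is only one of several possible normalizations. The function name `build_similarity_network` is hypothetical.

```python
import numpy as np

def build_similarity_network(similarity):
    """Return (weight_matrix, edges) with edge weights scaled into [0, 1]."""
    w = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(w, 0.0)  # a drug has no edge to itself
    peak = w.max()
    if peak > 0:
        w /= peak             # assumption: scale by the largest similarity
    n = w.shape[0]
    # Zero similarity means that no edge is present between the two drugs.
    edges = {
        (i, j): float(w[i, j])
        for i in range(n)
        for j in range(i + 1, n)
        if w[i, j] > 0
    }
    return w, edges
```

Each row of the returned matrix is the weighted neighborhood of one drug, which is the kind of per-node input that a graph encoder can consume.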

A feature vector is an array of real numbers, each of which is a feature value carrying specific information in the application. In the method, the drug feature vector is obtained by encoding the drug network with a structural autoencoder, so the feature values contain the topological information of the network. The autoencoder is a self-supervised representation learning method that converts nodes into feature vectors from the input alone (here, a drug network), and the dimensionality of the feature vectors is far smaller than the number of nodes. Compared with traditional one-hot encoding, this greatly reduces the complexity and sparsity of the data. The structural autoencoder adopted by the method considers both the first-order and second-order proximity of the network and therefore captures the overall structure of the network more comprehensively.
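How the two kinds of proximity can enter a training objective is sketched below. This is only an illustration under stated assumptions: the first-order term is taken as the edge-weighted squared distance between the encoded vectors of two nodes, the second-order term as the squared reconstruction error of each node's adjacency row, and the two are combined as L_m = L_2nd + λ·L_1st with 0 < λ < 1. The exact expressions used by the method are not reproduced here, and the name `structural_loss` is hypothetical.

```python
import numpy as np

def structural_loss(adj, z, recon, lam=0.5):
    """Combine second-order and first-order proximity losses for one network layer.

    adj   : (N, N) weighted adjacency matrix; row n is the encoder input of node n
    z     : (N, d) encoder outputs, one embedding per node
    recon : (N, N) decoder outputs, one reconstructed adjacency row per node
    lam   : weight of the first-order term, assumed to lie in (0, 1)
    """
    adj = np.asarray(adj, dtype=float)
    z = np.asarray(z, dtype=float)
    recon = np.asarray(recon, dtype=float)
    # First-order proximity: strongly connected nodes should be embedded close together.
    diff = z[:, None, :] - z[None, :, :]   # shape (N, N, d)
    sq_dist = np.sum(diff ** 2, axis=2)    # shape (N, N)
    l_1st = float(np.sum(adj * sq_dist))
    # Second-order proximity: the decoder should rebuild each node's neighborhood.
    l_2nd = float(np.sum((recon - adj) ** 2))
    return l_2nd + lam * l_1st
```

In training, this value would be minimized with respect to the encoder and decoder parameters; after convergence only the encoder is needed to produce the feature vectors.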

Network representation, vector encoding, and prediction model training for drugs and targets are the core of a drug target prediction algorithm. The algorithm model avoids the blindness of manual screening and greatly saves time and capital costs. By integrating information about different aspects of drugs and targets into a uniform data form, and by using several relatively independent and clearly defined modules, it provides a feasible paradigm for future drug target prediction, improving prediction accuracy while keeping the algorithm efficient, flexible, and extensible.

Claims

1. A drug target interaction prediction method based on multilayer network and graph coding, comprising a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module and a result evaluation module, characterized in that:

(1) the data acquisition module comprises:

(1-2) for the target, namely protein, collecting target-target interaction relationship data, target-disease relationship data, and four different types of target-pair similarity relationship data, including: peptide chain data of the target spot, biological process data of the target spot, cell component data of the target spot and target spot molecule function data;

(1-3) collecting interaction relation data of the medicine and the target;

(2-1) the construction of the drug and target related network comprises:

B. for the interaction relationship data between different classes of objects, constructing heterogeneous interaction networks, including the drug disease related network G_{D_DI}, the drug side effect related network G_{D_SE}, and the target disease related network G_{T_DI};

C. collecting drug information of different dimensions and constructing drug similarity networks, including the chemical similarity network G_2D of drugs, the therapeutic similarity network G_3D of drugs, the action target sequence similarity network G_4D of drugs, the biological process similarity network G_5D of drugs, the molecular function similarity network G_6D of drugs, and the acting cellular component similarity network G_7D of drugs;

E. Construction of drug target interaction network G_{D_T}；

decomposing the target disease related network G_{T_DI} and converting it into the target disease similarity network G_6T = (V_6T, E_6T), where V_6T and E_6T respectively denote the set of target nodes in the network and the set of edge weights of disease similarity between two targets; the edge weight of target disease similarity

x_{T_DI} and y_{T_DI} denote the row vectors of the two targets in the adjacency matrix of G_{T_DI};

(2-2-2) then combining the drug-related networks into a drug multilayer network G_D = {G_iD = (V_iD, E_iD)}, where i is the drug network index, i ∈ [1,9]; combining the target-related networks into a target multilayer network G_T = {G_jT = (V_jT, E_jT)}, where j is the target network index, j ∈ [1,6];

(3-2) encoding output: encoding the corresponding network layers with the encoders of the trained structural autoencoders to obtain the multilayer vectors of all drugs and targets;

(3-3) processing the homogeneous feature vectors: concatenating the multilayer vectors of a drug to obtain the final feature vector representation of the drug; concatenating the multilayer vectors of a target to obtain the final feature vector representation of the target;

(4) the model algorithm design module comprises:

(4-3) predicting drug target interaction: calculating the interaction probability of all drug target pairs according to the optimal prediction model obtained by the result evaluation module, and screening out drug target pairs with high possibility as candidate drug target pairs capable of interacting as prediction results;

a drug-target pair with an interaction is a positive sample, and one without an interaction is a negative sample; TP_α denotes the number of positive samples in the test set predicted as positive, FP_α the number of negative samples in the test set predicted as positive, FN_α the number of positive samples in the test set predicted as negative, and TN_α the number of negative samples in the test set predicted as negative; α denotes the prediction confidence;

(5-2) plotting the PR curve: the precision precision_α and the recall recall_α at different prediction confidences α form the precision-recall sequence:

plotting the precision-recall curve, i.e., the PR curve, with recall on the horizontal axis and precision on the vertical axis, wherein the area under the PR curve (AUPR) reflects the overall classification performance of the classifier, and the larger the AUPR value, the better the prediction performance of the model;

2. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: in the (2-1), A is specifically:

constructing a drug interaction network G_1D = (V_1D, E_1D) from the drug-drug interaction relationship data, where V_1D denotes the set of drug nodes in the network and E_1D denotes the set of edges between two drugs that interact;

constructing a target interaction network G_1T = (V_1T, E_1T) from the target-target interaction relationship data, where V_1T denotes the set of target nodes in the network and E_1T denotes the set of edges between two targets that interact.

3. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: in the (2-1), B is specifically:

for the relation data of the medicine and the disease, a medicine disease related network is constructed

Wherein

Wherein

Wherein

E_{T_DI} respectively denote the target node set, the disease node set, and the set of edges relating targets and diseases in the network.

4. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: c in (2-1) is specifically:

for the chemical fingerprint data of drugs, constructing the chemical similarity network G_2D = (V_2D, E_2D) of drugs, where V_2D and E_2D respectively denote the set of drug nodes in the network and the set of edge weights of chemical similarity between two drugs; the edge weight of chemical similarity

for the therapeutic data of drugs, constructing the therapeutic similarity network G_3D = (V_3D, E_3D) of drugs, where V_3D and E_3D respectively denote the set of drug nodes in the network and the set of edge weights of therapeutic similarity between two drugs; the edge weight of therapeutic similarity

T_{T_P}(a, b) indicates the similarity of biological processes of the respective targets of the two drugs;

for the molecular function data of drugs, constructing the molecular function similarity network G_6D = (V_6D, E_6D) of drugs, where V_6D and E_6D respectively denote the set of drug nodes in the network and the set of edge weights of molecular function similarity between two drugs; the edge weight of drug molecular function similarity

T_{T_M}(a, b) represents the molecular functional similarity of the respective targets of the two drugs;

T_{T_C}(a, b) shows the similarity of the acting cellular components of the respective targets of the two drugs.

5. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: in the (2-1), D is specifically:

constructing a target sequence similarity network G_2T = (V_2T, E_2T) from the peptide chain data of targets, where V_2T and E_2T respectively denote the set of target nodes in the network and the set of edge weights of sequence similarity between two targets; the edge weight of sequence similarity

constructing the cellular component similarity network G_4T = (V_4T, E_4T) of targets from the cellular component data of targets, where V_4T and E_4T respectively denote the set of target nodes in the network and the set of edge weights of cellular component similarity between two targets; the edge weight T_{T_C}(a, b) of target cellular component similarity is obtained by GO semantic annotation of the cellular components of the two targets;

constructing the molecular function similarity network G_5T = (V_5T, E_5T) of targets from the molecular function data of targets, where V_5T and E_5T respectively denote the set of target nodes in the network and the set of edge weights of molecular function similarity between two targets; the edge weight T_{T_M}(a, b) of target molecular function similarity is obtained by GO semantic annotation of the molecular functions of the two targets.

6. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: in the (2-1), E is specifically:

constructing a drug target interaction network for drug and target interaction relation data

Wherein

E_{D_T} respectively denote the drug node set, the target node set, and the set of edges relating drugs and targets in the network.

7. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: (3-1) the training process is as follows:

c. obtaining the output of a decoder after decoding, and calculating a loss function by utilizing the adjacency matrix, the output of the encoder and the output of the decoder;

e. repeating steps b to d until the loss function converges;

the calculation of said loss function L_m includes two parts:

the first-order similarity loss

N is the number of nodes, z_p and z_g respectively denote the encoder output vectors of node p and node g, and T_pg denotes the weight of the connecting edge;

the second-order similarity loss

b_n and

respectively denote the encoder input vector and the decoder output vector of node n;

the total loss function L_m = L_2nd + λL_1st, where λ is a penalty coefficient, 0 < λ < 1.

8. The method for predicting the interaction of a drug target based on multilayer network and graph coding according to claim 1, wherein: (4-2) the specific process is as follows:

(4-2-1) before each round of decision tree construction, screening the samples with the Gradient-based One-Side Sampling (GOSS) algorithm, i.e., retaining the small fraction of samples with large gradients and randomly selecting a portion of the samples with small gradients, to compute the total variance gain;

(4-2-2) before each round of decision tree construction, merging mutually exclusive features with the Exclusive Feature Bundling (EFB) algorithm;

(4-2-3) based on the screened samples, given the input feature vector x and the corresponding label y of a sample, constructing the fitting target for the l-th decision tree to be generated: when l = 1, the fitting target is the label of the sample, where the label of a positive sample is 1 and the label of a negative sample is 0; when l ≥ 2, the fitting target is

where the boosted tree obtained after the (l-1)-th iteration

the loss function is defined as:

(4-2-4) based on the screened samples, constructing a binary decision tree that fits the target, wherein a leaf node of the binary decision tree is split as follows: a histogram is built for each screened feature over its value range; the variance gain of each candidate split point is computed from the histogram; the feature and split point with the maximum variance gain are selected as the splitting feature and optimal split point of the current node; and the data of the leaf node are divided into two parts at the optimal split point; the recursion continues until the maximum depth of the tree is reached; the variance gain of feature f at split point d on dataset D is expressed as:

and

respectively denote the numbers of samples in dataset D whose feature f is smaller than and larger than the split point d;

(4-2-5) performing K rounds of iteration to generate K decision trees;

(4-2-6) summing the K decision trees to generate the final lightweight gradient boosting decision tree