CN114420310A

CN114420310A - Drug ATC code prediction method based on graph transformer network

Info

Publication number: CN114420310A
Application number: CN202210063363.4A
Authority: CN
Inventors: 罗慧敏; 索志豪; 阎朝坤; 张戈; 王建林
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2022-01-18
Filing date: 2022-01-18
Publication date: 2022-04-29

Abstract

The invention discloses a drug ATC code prediction method based on a Graph Transformer Network, named DACPGTN. First, the target proteins and diseases related to each drug are obtained; seven kinds of drug similarity are obtained from drug interaction information under different evaluation standards; similarity information for the related target proteins and diseases is retrieved or calculated; and these similarities are used as features to jointly construct a composite feature matrix. Second, a heterogeneous graph with several different edge types is constructed from the known associations among the three entities (drugs, target proteins and diseases), and the Graph Transformer Layers of the Graph Transformer Network learn from the resulting heterogeneous adjacency matrices, thereby learning the latent graph structure between drugs and their multiple target proteins and diseases. Finally, the association graph structure produced by the Graph Transformer Layers and the drug-target protein-disease composite feature matrix are input into an end-to-end prediction module, which makes the final drug ATC code prediction. The method is simple and effective; comparison with other methods and tests on a data set show that it performs better in drug ATC code prediction.

Description

Drug ATC code prediction method based on graph transformer network

Technical Field

The invention belongs to the field of bioinformatics, and in particular relates to a drug ATC code prediction method based on a Graph Transformer Network, named DACPGTN, in which the Graph Transformer Network is used to predict the ATC codes of known drugs.

Background

Drug research and development is time-consuming and expensive: bringing a new drug from research to use takes decades and costs billions of dollars. How to find new indications for already approved drugs and reduce development cost is currently a research hotspot in bioinformatics. The Anatomical Therapeutic Chemical (ATC) classification system is the official drug classification system of the World Health Organization. The introduction of standard ATC codes in the ATC system greatly facilitates the use of drugs at the treatment stage. Predicting the ATC class of a given compound makes it possible to infer its active ingredients and its therapeutic, pharmacological and chemical properties; this helps in using the drug correctly or inferring new uses for the compound, clarifies its indications and potential toxic and side effects, accelerates drug development, and is a common approach in research on new uses for old drugs. The ATC code classifies drugs at five levels: the first level is the organ or anatomical system on which the drug acts; the second level is the pharmacological action; the third and fourth levels are the chemical, pharmacological and therapeutic subgroups; the fifth level is the specific single or combined drug. The first level includes 14 categories: (1) Alimentary tract and metabolism, (2) Blood and blood forming organs, (3) Cardiovascular system, (4) Dermatologicals, (5) Genito-urinary system and sex hormones, (6) Systemic hormonal preparations, excluding sex hormones and insulins, (7) Antiinfectives for systemic use, (8) Antineoplastic and immunomodulating agents, (9) Musculo-skeletal system, (10) Nervous system, (11) Antiparasitic products, insecticides and repellents, (12) Respiratory system, (13) Sensory organs, (14) Various.

In the drug information databases widely used at present, a large number of drugs have no ATC code, and assigning ATC codes to new or existing drugs by traditional experimental methods is time-consuming and laborious. With the accumulation of drug-related data and the rapid development of various pharmacoinformatics databases, predicting drug ATC codes by existing technical means has become a research and development strategy widely adopted internationally, with higher input-output efficiency. How to design an effective drug ATC code prediction method has attracted more and more attention. In early drug ATC code studies, ATC code prediction was defined as a single-label learning task; this is considered inappropriate, because a compound can belong to more than one ATC class, reflecting the multi-label nature of biological systems.

In recent years, several multi-label classification methods for drug ATC classification have been proposed. For example, Chen et al. first proposed a classification method that predicts drug ATC codes by integrating chemical-chemical interaction information and chemical-chemical similarity information, and constructed a benchmark data set of first-level drug ATC codes. On the basis of this benchmark data set, classification methods integrating several kinds of drug-related information have been proposed to predict the first-level ATC codes of drugs. Cheng et al. proposed a multi-label Gaussian kernel regression classifier, iATC-mISF, which assigns drugs to the 14 first-level ATC classes based on chemical-chemical interactions, structural similarity and fingerprint similarity. Cheng et al. then further integrated the drug ontology-based predictor iATC-mDO and, on this basis, improved iATC-mISF to iATC-mHyb, improving the prediction performance of the classifier. Nanni and Brahnam developed EnsLIF, a multi-label classifier based on a histogram of gradients algorithm, which reshapes the one-dimensional feature vector of a drug compound into a two-dimensional matrix and improves classification performance to a certain extent. Zhou et al. constructed several drug interaction networks, extracted drug features from them with the network embedding algorithm Mashup, converted the original multi-label classification problem into several single-label classification problems with the RAndom k-labELsets (RAKEL) algorithm, and built the classifier iATC-NRAKEL using the classical machine learning algorithm Support Vector Machine (SVM) in the classification stage, obtaining a better prediction effect. On the basis of this classifier, Zhou et al. simplified the classifier input and proposed iATC-FRAKEL, a multi-label classifier that uses only drug fingerprint information (SMILES format) as feature input to identify the ATC codes of drugs, and provided it as a web service. Wang et al. proposed ATC-NLSP, a method for predicting first-level drug ATC codes that uses a machine learning framework, combines drug-drug interaction information, structural similarity and fingerprint similarity, and adopts the NLSP method to explore the correlation among labels, providing better prediction results.

With the successful application of deep learning in many fields, Nanni et al. proposed FUS3, an ensemble first-level ATC code multi-label classifier system based on deep learning, which extracts features with a Convolutional Neural Network (CNN) and a long short-term memory network, trains two general-purpose classifiers on them, and obtains better results. In the most recent research, Zhao et al. proposed CGATCPred, a new end-to-end drug ATC code prediction model, which uses CNN layers to extract composite features from seven drug association score matrices, builds an ATC label correlation graph, and learns label information through two GCN layers in combination with word embedding information. A new feature is constructed from the dot product between the composite feature and the generated label correlation matrix; the new feature and the composite feature extracted by the CNN layers are concatenated and fed into a fully connected neural network layer to predict the ATC codes of the drug.

In summary, most existing drug ATC code prediction methods make predictions from the correlation between the properties of the drug and its ATC code labels. To a certain extent they ignore the potential contribution of related information, such as the target proteins and diseases associated with the drug, to ATC code prediction, and they do not make full use of the known associations among different types of data.

Disclosure of Invention

In order to solve the above problems, the present invention provides a drug ATC code prediction method based on a Graph Transformer Network, namely DACPGTN (Drug-ATC Code Prediction method on Graph Transformer Network). The method exploits the latent association information between a drug and its related target proteins and diseases, which can provide valuable information for drug ATC code prediction. It rests on the hypothesis that two drugs may belong to the same ATC classes when they act on the same target protein or disease, or when they are linked to a target protein or disease through multiple associations. Firstly, the features of the drugs and of their related target proteins and diseases are acquired, and a composite feature matrix is constructed; secondly, a group of heterogeneous networks is constructed from the drug-target protein, drug-disease and target protein-disease association information, and the latent association information in this group of heterogeneous networks is learned by the Graph Transformer Layer of the Graph Transformer Network; finally, the composite feature matrix and the latent association information matrix are input into an end-to-end prediction module to predict the drug ATC codes. Comparison with other methods and tests on a data set show that the method performs better in drug ATC code prediction.

The technical scheme adopted by the invention is as follows:

(1) Construction of the drug-target protein-disease composite feature matrix

(2) Construction of heterogeneous networks between drugs, target proteins and diseases

(3) Acquisition of latent association information between drugs, target proteins and diseases

(4) Prediction of drug ATC code labels

The beneficial effects of the invention are as follows: the method integrates the composite feature information of the drugs and their related entities and uses the Graph Transformer Layer of the Graph Transformer Network to obtain the latent associations between the drugs and the related entities, making full use of known biological information; the experimental results show that the method can effectively predict the ATC code labels of drugs. The method is simple and effective; comparison with other methods and tests on a data set show that it performs better in drug ATC code prediction.

Drawings

FIG. 1 is a flow chart of DACPGTN according to the present invention.

FIG. 2 is a schematic diagram of the construction of the drug-target protein-disease composite features of the present invention.

FIG. 3 is a schematic diagram of the construction of the multi-source heterogeneous networks and of the learning of latent associations by the Graph Transformer Layer according to the present invention.

FIG. 4 is a schematic diagram of an end-to-end prediction module according to the present invention.

FIG. 5 is a diagram illustrating the effect of the number of output nodes of the GCN feature extractor on the result.

Detailed Description

As shown in FIGS. 1 to 5, a drug ATC code prediction method based on a Graph Transformer Network includes the following steps:

1) using the known drugs, obtaining the target proteins and diseases related to the drugs, and calculating the similarity between the obtained target proteins and the similarity between the obtained diseases, specifically as follows: firstly, the combined scores between drug-related target proteins are obtained from the STRING database as target protein similarity information; secondly, an association matrix between the drugs and the related diseases is obtained and the Pearson correlation coefficient between its columns is calculated, so that the association profile of each disease over all drugs provides the disease similarity information; the drug similarity information obtained under the different known evaluation standards is superposed along the same dimension and averaged to give the drug feature matrix; the target protein similarity and the disease similarity serve as the feature matrices of the target proteins and the diseases; the feature matrices of the three entities are reduced to the same dimension by principal component analysis (PCA) and stacked vertically to construct the composite feature matrix;

2) constructing the drug-target protein heterogeneous network, the drug-disease heterogeneous network, the target protein-disease heterogeneous network, and the transposes of these networks. The heterogeneous networks are constructed from the association information between the entities as follows: if an association exists between drug i and target protein j, the corresponding element $\text{Drug-Target}_{ij}$ is set to 1, and otherwise to 0, finally giving a sparse 0/1 matrix Drug-Target; the Drug-Disease and Target-Disease heterogeneous networks are constructed in the same way; the heterogeneous networks constructed from the association information are then transposed, finally giving the set of heterogeneous networks between the different entities

$$\mathbb{A} = \{\text{Drug-Target},\ \text{Drug-Disease},\ \text{Target-Disease},\ \text{Drug-Target}^{T},\ \text{Drug-Disease}^{T},\ \text{Target-Disease}^{T}\}$$

namely the drug-target protein heterogeneous network (Drug-Target), the drug-disease heterogeneous network (Drug-Disease), the target protein-disease heterogeneous network (Target-Disease), the target protein-drug heterogeneous network ($\text{Drug-Target}^{T}$), the disease-drug heterogeneous network ($\text{Drug-Disease}^{T}$) and the disease-target protein heterogeneous network ($\text{Target-Disease}^{T}$);

3) based on the heterogeneous network set $\mathbb{A}$ obtained in step 2), acquiring the latent association information among the three entities (drugs, target proteins and diseases) by using the Graph Transformer Layer, and constructing a new latent association information matrix; the Graph Transformer Layer is implemented as follows:

$$Q = F(\mathbb{A}; W_\phi) = \phi(\mathbb{A}; \operatorname{softmax}(W_\phi))$$

where $\phi$ is the convolution layer and $W_\phi \in \mathbb{R}^{1\times 1\times K}$ is the parameter of the convolution layer $\phi$; the Graph Transformer Layer selects adjacency matrices (heterogeneous networks of different types) from the heterogeneous network set $\mathbb{A}$ and learns a new graph structure by multiplying the two selected adjacency matrices $Q_1$ and $Q_2$; the soft selection of the adjacency matrices obtains non-negative weights from $\operatorname{softmax}(W_\phi)$ and forms a weighted sum of the candidate adjacency matrices by 1 × 1 convolution;

4) inputting the latent association information between the drugs, target proteins and diseases acquired by the Graph Transformer Layer in step 3), together with the composite feature matrix constructed in step 1), into an end-to-end prediction module, and performing ATC code prediction on the drug nodes.

In step 4), a GCN layer is used as the feature extractor in the end-to-end prediction module, dimensionality reduction is performed by several linear layers, and Dropout is added between the linear layers; the number of output nodes of the GCN layer is 150, linear layer 1 comprises 150 neurons, linear layer 2 comprises 128 neurons, linear layer 3 comprises 64 neurons, and linear layer 4 serves as the output layer and comprises 14 neurons; in the training and prediction stages of the end-to-end prediction module, the multi-label classification problem is converted into pairwise comparisons between target-class scores and non-target-class scores, using the smooth generalization to multi-label classification of the softmax activation function combined with the cross entropy loss function; an extra class 0 is introduced, and the scores of all target classes are required to be greater than S₀ while the scores of all non-target classes are required to be less than S₀, which is implemented by the following formula:

where Ω_neg and Ω_pos are the negative (non-target) and positive (target) class sets respectively; setting the threshold S₀ to 0 gives the final loss, which is the generalization of the softmax activation function and the cross entropy loss function to the multi-label classification problem:

$$\mathrm{loss}(y_{true}, y_{pred}) = \operatorname{logsumexp}(y_{pred\text{-}neg}, 0) + \operatorname{logsumexp}(y_{pred\text{-}pos}, 0)$$

by virtue of the good properties of the logsumexp function, the weights are balanced automatically and the class imbalance problem is addressed; the end-to-end prediction module is trained, and in the final prediction stage the classes whose output in the last linear layer is greater than 0 are output as the prediction result.
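The loss and the prediction rule described above can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: it assumes that the negative term runs over the raw scores of non-target classes, that the positive term runs over the sign-flipped scores of target classes (so that minimizing it pushes target scores above the threshold 0), and that the literal 0 inside each logsumexp plays the role of the extra class 0. The function names are hypothetical.

```python
import numpy as np

def logsumexp_with_zero(scores, mask):
    """Per row: log(1 + sum of exp(scores[i]) over the masked entries), computed stably."""
    # Entries outside the mask become -inf, so they contribute exp(-inf) = 0.
    s = np.where(mask, scores, -np.inf)
    # Append the score 0 of the extra class 0 (threshold S0 = 0).
    s = np.concatenate([s, np.zeros((s.shape[0], 1))], axis=1)
    m = s.max(axis=1, keepdims=True)  # always finite, because of the appended 0
    return (m + np.log(np.exp(s - m).sum(axis=1, keepdims=True)))[:, 0]

def multilabel_softmax_ce(y_true, y_pred):
    """Multi-label generalization of softmax + cross entropy (one loss per sample)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=float)
    neg_term = logsumexp_with_zero(y_pred, ~y_true)  # penalizes non-target scores above 0
    pos_term = logsumexp_with_zero(-y_pred, y_true)  # penalizes target scores below 0
    return neg_term + pos_term

def predict(y_pred):
    """Output every class whose score in the last linear layer is greater than 0."""
    return (np.asarray(y_pred) > 0).astype(int)

# Toy example with 3 classes instead of 14.
y_true = np.array([[1, 0, 0]])
y_pred = np.array([[2.0, -1.0, -3.0]])
loss = multilabel_softmax_ce(y_true, y_pred)
labels = predict(y_pred)
```

Because the number of predicted labels is decided by the sign of each score, no per-sample label count has to be known in advance.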

As shown in fig. 1, the specific implementation process of the present invention is as follows:

Firstly, construction of the drug-target protein-disease composite features

The data used by the method comprise a drug set, a target protein set and a disease set.

1. Drug and related target protein, disease data acquisition

In ATC code research, Chen et al. constructed a benchmark data set to facilitate model comparison at the first ATC code level. The benchmark data set contains 3883 compounds, each corresponding to one or more of the 14 first-level ATC code categories. The experiments of this method are carried out on a refined version of that data set. Target and disease association data for the drugs are collected from the KEGG and DrugBank databases; 1749 of the 3883 drugs have target and disease association information, and these 1749 drugs serve as the benchmark data set of the method.

TABLE 1 Entity statistics of the data set used by the method

Entity type	Number
Drug	1749
Target protein	982
Disease	355

2. Drug similarity information

First, the seven kinds of similarity information provided by Zhao et al. for all drugs in the data set of Chen et al. are used, see formula (1):

$$\{SM_{Sim}, SM_{Exp}, SM_{Dat}, SM_{Tex}, SM_{Com}, SM_{cp}, SM_{sub}\} \in \mathbb{R}^{3883\times 3883\times 7} \tag{1}$$

The first five matrices correspond to the "similarity", "experimental", "database", "text mining" and "combined score" scores, and the last two are the similarities between pairs of compounds calculated by the similarity calculation tools SIMCOMP and SUBCOMP. The entries for the 1749 drugs required by the method are extracted from the seven similarity score matrices, finally giving the drug similarity score matrices shown in formula (2), which serve as the drug similarity information of the method:

$$\{SM_{Sim}, SM_{Exp}, SM_{Dat}, SM_{Tex}, SM_{Com}, SM_{cp}, SM_{sub}\} \in \mathbb{R}^{1749\times 1749\times 7} \tag{2}$$
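The extraction of the 1749-drug sub-matrices from the full similarity tensors can be sketched as below. This is only an illustration under assumptions: the seven matrices are taken to be stacked along the last axis of one array, and `keep_idx` (the positions of the retained drugs) is a hypothetical name. A tiny random tensor stands in for the real data.

```python
import numpy as np

def subset_similarity(sim_full, keep_idx):
    """Restrict an (N, N, K) similarity tensor to the drugs listed in keep_idx."""
    keep_idx = np.asarray(keep_idx)
    # np.ix_ builds an open mesh, so rows and columns are selected together.
    return sim_full[np.ix_(keep_idx, keep_idx)]

# Toy stand-in for the 3883 x 3883 x 7 tensor: 6 drugs, 7 similarity types.
rng = np.random.default_rng(0)
sim_full = rng.random((6, 6, 7))
keep_idx = [0, 2, 5]  # hypothetical indices of drugs with target and disease data
sim_sub = subset_similarity(sim_full, keep_idx)
print(sim_sub.shape)  # (3, 3, 7)
```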

3. Target protein similarity information

For the 982 target proteins used in the method, the file '9606.protein.info.v11.0' is downloaded from the STRING database; the identifiers of the 982 proteins are looked up in the file to obtain the combined score between each pair of proteins, and a protein relation score matrix $\text{Target} \in \mathbb{R}^{982\times 982}$ is constructed. The matrix is then normalized by formula (3), finally giving the protein combined score matrix:

4. Disease similarity information calculation

A drug-disease relation matrix is constructed from all drugs in the benchmark data set of Chen et al. and the 355 known diseases that meet the requirements of the method: if a drug and a disease are associated, the corresponding entry of the matrix is 1, and otherwise 0, giving the sparse drug-disease relation matrix $\text{Drug-Disease} \in \mathbb{R}^{3883\times 355}$. The Pearson correlation coefficient between each pair of columns of this matrix is calculated to obtain the correlation matrix between diseases, using formula (4):

$$\rho_{A,B} = \frac{\sum_{i=1}^{n}(A_i-\bar{A})(B_i-\bar{B})}{\sqrt{\sum_{i=1}^{n}(A_i-\bar{A})^{2}}\ \sqrt{\sum_{i=1}^{n}(B_i-\bar{B})^{2}}} \tag{4}$$

where A and B represent two different columns of the matrix, $A_i$ and $B_i$ are their entries in the i-th row, $\bar{A}$ and $\bar{B}$ are the column means, and n is 3883.

5. Constructing a composite feature matrix

The seven drug similarity matrices are superposed along the same dimension: the seven similarity scores of each drug pair are summed and averaged, giving the final drug similarity score matrix used in the method, which serves as the drug feature matrix. The protein combined score matrix serves as the target protein feature matrix, and the calculated inter-disease Pearson correlation coefficient matrix serves as the disease feature matrix. The feature matrices of the three types of data are then reduced in dimension in turn by principal component analysis (PCA). The purpose is to let the model learn enough features while avoiding problems such as vanishing gradients caused by excessive dimensionality during training; PCA retains the characteristics of the related entities as far as possible, removes to some extent noise that is unfavorable to the experimental results, and makes the features mutually independent, so that they provide more useful information for ATC code classification. Experiments show that the optimal feature dimension is 300. After dimension reduction, the feature matrices of the three types of data are concatenated to give the node composite feature matrix of the final DACPGTN model.
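The averaging, PCA reduction and vertical stacking can be sketched as below. This is a minimal illustration under assumptions: PCA is done with a plain SVD, the blocks are stacked in the order drugs, target proteins, diseases, and small random matrices with a target dimension of 4 stand in for the real matrices and the dimension 300. The function names are hypothetical.

```python
import numpy as np

def pca_reduce(x, dim):
    """Project the rows of x onto the top `dim` principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def composite_features(drug_sims, target_sim, disease_sim, dim):
    """Average the drug similarities, reduce all three blocks to `dim`, stack them."""
    drug_feat = drug_sims.mean(axis=2)  # mean of the 7 similarity scores per drug pair
    blocks = [pca_reduce(m, dim) for m in (drug_feat, target_sim, disease_sim)]
    return np.vstack(blocks)            # rows: drugs, then targets, then diseases

rng = np.random.default_rng(0)
drug_sims = rng.random((8, 8, 7))   # stand-in for 1749 x 1749 x 7
target_sim = rng.random((6, 6))     # stand-in for 982 x 982
disease_sim = rng.random((5, 5))    # stand-in for 355 x 355
features = composite_features(drug_sims, target_sim, disease_sim, dim=4)
print(features.shape)               # (19, 4)
```

With the real data the same call with `dim=300` would give a matrix with 1749 + 982 + 355 rows and 300 columns.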

Secondly, construction of heterogeneous networks among drugs, target proteins and diseases

In constructing the experimental data, the KEGG and DrugBank databases are first searched for the 1749 drugs and 982 target proteins selected in the experimental data set, and a Drug-Target adjacency matrix is constructed: $\text{Drug-Target}_{ij}$ is 1 if drug i and target protein j are associated and 0 otherwise, finally giving the sparse 0/1 matrix $\text{Drug-Target} \in \mathbb{R}^{1749\times 982}$.

By the same principle, a Drug-Disease adjacency matrix is constructed: $\text{Drug-Disease}_{ij}$ is 1 if the drug and the disease are associated in the KEGG and DrugBank databases and 0 otherwise, finally giving the sparse matrix $\text{Drug-Disease} \in \mathbb{R}^{1749\times 355}$.

Meanwhile, the association information between the 982 target proteins and the 355 diseases used in the experiment is extracted from the existing drug information databases, and a Target-Disease relation matrix is constructed. Its entries are defined in the same way as for the drug-target protein relation matrix, finally giving the sparse matrix $\text{Target-Disease} \in \mathbb{R}^{982\times 355}$.

To better learn the latent association information, the constructed heterogeneous matrices are transposed, finally giving six adjacency matrices: D_T (Drug-Target, 1749 × 982), D_D (Drug-Disease, 1749 × 355), T_D (Target-Disease, 982 × 355), and their transposes D_T^T, D_D^T and T_D^T.
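The construction of the six edge-type matrices can be sketched as follows. The association pairs here are made up, and representing each association as an `(i, j)` index pair is an assumption about the input format; the function names are hypothetical.

```python
import numpy as np

def bipartite_adjacency(pairs, n_rows, n_cols):
    """0/1 matrix with a 1 at (i, j) for every known association pair."""
    adj = np.zeros((n_rows, n_cols), dtype=np.int8)
    for i, j in pairs:
        adj[i, j] = 1
    return adj

def edge_type_set(d_t, d_d, t_d):
    """Return the six edge-type matrices: the three originals and their transposes."""
    return {
        "D_T": d_t, "D_D": d_d, "T_D": t_d,
        "D_T^T": d_t.T, "D_D^T": d_d.T, "T_D^T": t_d.T,
    }

# Toy example: 3 drugs, 2 target proteins, 2 diseases (pairs are made up).
d_t = bipartite_adjacency([(0, 0), (1, 1), (2, 1)], 3, 2)
d_d = bipartite_adjacency([(0, 1), (2, 0)], 3, 2)
t_d = bipartite_adjacency([(0, 1), (1, 0)], 2, 2)
networks = edge_type_set(d_t, d_d, t_d)
print(networks["D_T^T"].shape)  # (2, 3)
```

At the real scale these matrices are very sparse, so a sparse representation would normally be preferred over dense arrays.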

Thirdly, acquisition of latent association information between drugs, target proteins and diseases

The latent association information between the drugs and the target proteins and diseases is acquired by the Graph Transformer Layer of the Graph Transformer Network. The Graph Transformer Layer makes a soft selection over the different edge types and composite relations; that is, it searches for a new graph structure using several candidate adjacency matrices, so that graph convolution is more effective and more powerful node representations are learned. The Graph Transformer Layer is implemented by formula (5):

$$Q = F(\mathbb{A}; W_\phi) = \phi(\mathbb{A}; \operatorname{softmax}(W_\phi)) \tag{5}$$

where $\phi$ is the convolution layer and $W_\phi \in \mathbb{R}^{1\times 1\times K}$ is the parameter of the convolution layer $\phi$. The Graph Transformer Layer selects adjacency matrices (heterogeneous networks of different types) from the set of adjacency matrices $\mathbb{A}$ and learns a new graph structure by multiplying the two selected adjacency matrices $Q_1$ and $Q_2$. The soft selection obtains non-negative weights from $\operatorname{softmax}(W_\phi)$ and forms a weighted sum of the candidate adjacency matrices by 1 × 1 convolution. In the implementation, the constructed adjacency matrices are processed by the Graph Transformer Layer according to formulas (6)-(8), and each $Q_i$ can be expressed as

$$Q_i = \sum_{t_l \in T^{e}} \alpha_{t_l}^{(l)} A_{t_l}$$

where $T^{e}$ represents the set of edge types, l denotes the l-th Graph Transformer Layer, and $\alpha_{t_l}^{(l)}$ represents the weight of the current edge-type matrix at the l-th layer. Multiplying adjacency matrices of different types realizes the transfer between nodes and gives the connection relations between different nodes. With a single Graph Transformer Layer, that layer has two convolution kernels; with multiple layers, the first layer has two convolution kernels and every other Graph Transformer Layer has one. After the new graph structures are obtained from the weights, the adjacency matrices are multiplied. To improve numerical stability, the adjacency matrix of each layer is normalized by the inverse degree matrix $D^{-1}$, giving the graph structure output $A^{(l)}$ of the current Graph Transformer Layer:

$$A^{(l)} = D^{-1} Q_1 Q_2 \tag{8}$$

Based on the group of heterogeneous networks constructed in the preceding steps, the Graph Transformer Layer learns the association information in the different heterogeneous networks, finally giving a graph information matrix that represents the latent associations between different nodes.
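A single Graph Transformer Layer can be sketched in NumPy as follows. The sketch rests on several assumptions: every edge-type matrix is embedded in a common N × N matrix over all nodes so that the products are defined, the 1 × 1 convolution is a softmax-weighted sum over the edge-type axis, and D is the row-sum degree matrix. The weights `w1` and `w2` are fixed by hand here, whereas in the model they are learned.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def soft_select(adj_stack, w):
    """1x1 convolution over edge types: convex combination of the candidate matrices."""
    return np.tensordot(softmax(w), adj_stack, axes=1)

def gt_layer(adj_stack, w1, w2):
    """One Graph Transformer Layer: A = D^-1 Q1 Q2, as in formula (8)."""
    q1 = soft_select(adj_stack, w1)
    q2 = soft_select(adj_stack, w2)
    a = q1 @ q2
    deg = a.sum(axis=1, keepdims=True)
    # Rows with no outgoing weight stay zero instead of being divided by 0.
    return np.divide(a, deg, out=np.zeros_like(a), where=deg > 0)

# Toy graph: N = 4 nodes, K = 2 edge types, each stored as an N x N matrix.
adj_stack = np.array([
    [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],  # type 0: 0->1, 2->3
    [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]],  # type 1: 1->2, 3->0
], dtype=float)
w1 = np.array([5.0, -5.0])  # first selection strongly prefers edge type 0
w2 = np.array([-5.0, 5.0])  # second selection strongly prefers edge type 1
a = gt_layer(adj_stack, w1, w2)
```

In this toy case node 0 becomes connected to node 2 through the two-step path type 0 then type 1, which is the kind of composite relation the layer is meant to discover.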

Fourthly, prediction of the drug ATC codes by the end-to-end prediction module

(1) Feature extraction from the composite features and the latent association information matrix by the GCN layer

After the new graph information matrix is obtained, a graph convolutional network (GCN) is introduced as a feature extractor to perform convolution on the graph data. In a GCN, layer-to-layer propagation follows formula (9):

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{9}$$

where $\tilde{A} = A + I$ is the current input graph structure with self-loops added, A being the new graph structure generated by the Graph Transformer Layers, i.e. the latent association information matrix; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $H^{(l)}$ is the input feature of the current GCN layer, i.e. the constructed node composite feature matrix; $W^{(l)} \in \mathbb{R}^{d\times d}$ is a trainable weight matrix; $H^{(l+1)}$ is the feature matrix output by the current GCN layer; and $\sigma$ represents the ReLU activation function.

To learn the various connection relations among different node types, the 1 × 1 convolution of the Graph Transformer Layer can be given C output channels, so that the weighted-sum adjacency matrices $Q_1$ and $Q_2$ become adjacency tensors $\mathbb{Q}_1, \mathbb{Q}_2 \in \mathbb{R}^{N\times N\times C}$. After l Graph Transformer Layers are stacked, the tensor $\mathbb{A}^{(l)} \in \mathbb{R}^{N\times N\times C}$ is obtained. One GCN layer is applied to each channel of the tensor and the channel outputs are concatenated, as in formula (10):

$$Z = \big\Vert_{i=1}^{C}\ \sigma\left(\tilde{D}_i^{-1} \tilde{A}_i^{(l)} X W\right) \tag{10}$$

where $\Vert$ represents the concatenation operator, C represents the number of output channels, $\tilde{A}_i^{(l)} = A_i^{(l)} + I$ represents the adjacency matrix of the i-th channel of the tensor $\mathbb{A}^{(l)}$, $\tilde{D}_i$ represents the degree matrix of $\tilde{A}_i^{(l)}$, $W \in \mathbb{R}^{d\times d}$ represents a trainable weight matrix shared across channels, and $X \in \mathbb{R}^{N\times d}$ represents the feature matrix. Because the graph is directed, $\tilde{D}^{-1}\tilde{A}$ is used instead of $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ to normalize the adjacency matrix.

The constructed node feature matrix and the adjacency tensor obtained from the Graph Transformer Layers are fed to the GCN layer to obtain an output of the specified dimension.
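The multi-channel GCN feature extractor can be sketched in NumPy as follows. This is an illustration under assumptions: the adjacency tensor is stored channel-first, self-loops are added before the row normalization, and the weight matrix is allowed to be rectangular (`d_in × d_out`) so that the output width can be chosen; the sizes and random values are placeholders.

```python
import numpy as np

def gcn_channel(a, x, w):
    """sigma(D^-1 (A + I) X W) for one channel, with ReLU as sigma."""
    a_tilde = a + np.eye(a.shape[0])          # add self-loops
    deg = a_tilde.sum(axis=1, keepdims=True)  # at least 1 for non-negative A
    return np.maximum((a_tilde / deg) @ x @ w, 0.0)

def multi_channel_gcn(adj_tensor, x, w):
    """Apply the shared-weight GCN to each channel and concatenate the outputs."""
    return np.concatenate([gcn_channel(a, x, w) for a in adj_tensor], axis=1)

rng = np.random.default_rng(0)
n_nodes, d_in, d_out, channels = 5, 3, 4, 2
adj_tensor = rng.random((channels, n_nodes, n_nodes))  # stand-in for the GT output
x = rng.random((n_nodes, d_in))                        # node composite features
w = rng.standard_normal((d_in, d_out))                 # shared across channels
z = multi_channel_gcn(adj_tensor, x, w)
print(z.shape)  # (5, 8)
```

The concatenation makes the output width equal to the number of channels times the per-channel output width.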

(2) Dimension reduction and prediction by multiple linear layers

The output of the GCN layer is reduced in dimension by several linear layers: the feature vector extracted by the GCN module is the input of the first fully connected layer, and the output dimension of the last linear layer equals the dimension of the drug ATC code label vector, so that its output serves as the ATC classification prediction for the drug. To address the over-fitting that arises when many network layers are stacked, a ReLU activation function is applied after the first linear layer, and a Dropout layer is added between each pair of subsequent linear layers. The Dropout layer removes neuron nodes from the network with a certain probability; under stochastic gradient descent this random removal means that a different network is trained in each iteration, so the Dropout layer effectively mitigates over-fitting and improves the generalization ability of the model.
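The stack of linear layers can be sketched as a plain NumPy forward pass. The layer widths 150, 128, 64 and 14 come from the description; the input width of 150, the dropout rate of 0.5, the weight initialization and the exact placement of Dropout (after linear layers 2 and 3) are assumptions of this sketch.

```python
import numpy as np

LAYER_SIZES = [150, 150, 128, 64, 14]  # the input width 150 is an assumption

def init_params(rng, sizes=LAYER_SIZES):
    """Random weights and zero biases for each linear layer (initialization assumed)."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dropout(h, rate, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

def forward(params, h, rng=None, rate=0.5):
    """Return raw class scores; Dropout is active only when rng is given (training)."""
    for k, (w, b) in enumerate(params):
        h = h @ w + b
        if k == 0:
            h = np.maximum(h, 0.0)     # ReLU after the first linear layer
        elif k < len(params) - 1 and rng is not None:
            h = dropout(h, rate, rng)  # Dropout between the later linear layers
    return h

rng = np.random.default_rng(0)
params = init_params(rng)
gcn_features = rng.random((4, 150))     # stand-in for GCN outputs of 4 drug nodes
scores = forward(params, gcn_features)  # evaluation mode, no Dropout
print(scores.shape)                     # (4, 14)
```

Passing a random generator as `rng` switches on Dropout, which is how the same function would be used during training.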

(3) Model optimization algorithm and loss function

In the training of the DACPGTN model, the Adam stochastic optimization algorithm is used for learning; Adam performs well in deep learning and compares favorably with other stochastic optimization algorithms.

The loss function follows Su's generalization, to the multi-label classification problem, of the softmax activation function combined with the cross entropy loss function (Cross Entropy Loss) used in single-label classification. In the original single-label classification, the cross entropy loss function is defined as formula (11):

where n represents the number of all possible classes and S_i is the score of one of these classes. An approximation in terms of the max function is then derived, as shown in formula (12):

In the multi-label classification problem, the score of each target class is likewise expected to be no less than that of each non-target class, and by the same principle the generalized loss, formula (13), is obtained:

where Ω_neg and Ω_pos are the negative (non-target) and positive (target) class sets respectively.

In multi-label prediction, the number k of labels that a sample has is not a fixed constant, so a threshold is needed to determine which classes to output. To this end, an additional class 0 is introduced, and the scores of all target classes are expected to be greater than S₀ while the scores of all non-target classes are expected to be less than S₀, which gives formula (14):

Setting the threshold S₀ to 0 simplifies formula (14) to formula (15):

Finally, the loss function of formula (16) is obtained, which is the generalization of the softmax activation function and the cross entropy loss function to the multi-label classification problem:

loss(y_true, y_pred) = logsumexp(y_pred-neg, 0) + logsumexp(y_pred-pos, 0) = logsumexp((y_pred - y_true), 0) + logsumexp((y_pred - (1 - y_true)), 0)    (16)

Here y_true is the true label vector of the drug and y_pred is the predicted label score vector; y_pred-neg and y_pred-pos are the predicted scores over the negative and positive class sets, respectively. In the prediction stage of the model, the classes whose output from the last linear layer is greater than 0 are output. Unlike the methods used in earlier ATC Code classification research, this method does not convert the multi-label problem into multiple binary classification problems. Instead, it compares target-class scores with non-target-class scores, and thanks to the properties of the logsumexp function it alleviates class imbalance and automatically balances the weight of each term.
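The loss and the threshold-at-zero prediction rule can be illustrated with a small, dependency-free sketch. It evaluates the simplified form with S₀ = 0, that is, log(1 + Σ over negative classes of exp(s)) + log(1 + Σ over positive classes of exp(-s)); the function names are hypothetical, and negating the positive-class scores before the logsumexp is an assumption taken from that simplified form rather than from the wording of formula (16).

```python
import math


def logsumexp_with_zero(scores):
    """Numerically stable log(exp(0) + sum(exp(s) for s in scores))."""
    m = max([0.0] + list(scores))
    return m + math.log(math.exp(-m) + sum(math.exp(s - m) for s in scores))


def multilabel_loss(y_true, y_pred):
    """Multi-label generalization of softmax + cross entropy for one sample.

    y_true: list of 0/1 labels; y_pred: list of raw scores (no sigmoid).
    """
    neg_scores = [s for t, s in zip(y_true, y_pred) if t == 0]
    # Positive-class scores are negated so that large scores lower the loss.
    pos_scores = [-s for t, s in zip(y_true, y_pred) if t == 1]
    return logsumexp_with_zero(neg_scores) + logsumexp_with_zero(pos_scores)


def predict_labels(y_pred):
    """Output every class whose score is greater than the threshold 0."""
    return [1 if s > 0 else 0 for s in y_pred]


if __name__ == "__main__":
    y_true = [1, 0, 0, 1]
    good = [4.0, -3.0, -5.0, 2.0]   # targets above 0, non-targets below 0
    bad = [-4.0, 3.0, 5.0, -2.0]    # the opposite ordering
    print(multilabel_loss(y_true, good), multilabel_loss(y_true, bad))
    print(predict_labels(good))
```

Scores that place the target classes above 0 and the non-target classes below 0 give a loss close to 0, while the reversed ordering is penalized heavily.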

Fifth, experimental verification

1. Evaluation index

To verify the effectiveness of the method, experiments were carried out with ten-fold cross-validation to test the prediction performance of the DACPGTN model.

(1) Ten-fold cross-validation

K-fold cross-validation is a common validation method in deep learning and is used to evaluate model performance more rigorously; here, 10-fold cross-validation is used to evaluate the model. For each fold, the drug samples in the data set are divided as (training set : validation set) : test set = (9 : 1) : 1, and the results of the 10 folds are averaged for each run of 10-fold cross-validation. Finally, 10-fold cross-validation is repeated ten times and the results are averaged, so that model performance is evaluated with as little experimental error as possible.
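The splitting scheme can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and the reading that the non-test samples of each fold are further divided 9 : 1 into training and validation sets is an assumption about the ratio quoted above.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, val_fraction=0.1, seed=0):
    """Yield (train, val, test) index lists for one run of k-fold CV.

    Each sample appears in the test set of exactly one fold. The remaining
    samples are split into training and validation parts (assumed 9 : 1).
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_val = max(1, round(len(rest) * val_fraction))
        yield rest[n_val:], rest[:n_val], test


def repeated_cv(n_samples, evaluate, n_repeats=10):
    """Average a score over n_repeats runs of 10-fold CV with different shuffles.

    `evaluate(train, val, test)` is a user-supplied callable returning a float.
    """
    scores = []
    for repeat in range(n_repeats):
        for train, val, test in ten_fold_splits(n_samples, seed=repeat):
            scores.append(evaluate(train, val, test))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Dummy evaluation: report the test-set share of each fold.
    print(repeated_cv(200, lambda tr, va, te: len(te) / 200))
```

Using a different seed for each repeat changes the shuffle, so the ten repetitions are evaluated on ten different partitions of the same data set.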

(2) Evaluation index

In the multi-label classification problem, a single sample may carry one or more labels, so traditional single-label evaluation indices are not meaningful here; evaluation criteria for multi-label problems are more complex and more fine-grained than single-label ones. Chou et al. defined 5 criteria for evaluating the performance of multi-label classifiers, and previous studies of ATC Code label classification have been compared using them. To keep the experiments fair, this method uses the same criteria. They are defined in formulae (17) to (21):

where N is the total number of samples, M is the number of labels, the operator |·| gives the number of elements in a set, ∪ and ∩ denote set union and intersection, Y_i is the true label vector of sample i, Y_i^* is the label vector predicted by the model for sample i, and K is a function that judges whether two vectors are identical, defined by formula (22):
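The five criteria can be computed as in the following sketch. It assumes the definitions commonly attributed to Chou for multi-label evaluation: Aiming averages |Y ∩ Y*| / |Y*|, Coverage averages |Y ∩ Y*| / |Y|, Accuracy averages |Y ∩ Y*| / |Y ∪ Y*|, Absolute true averages the indicator that Y and Y* are identical, and Absolute false averages (|Y ∪ Y*| - |Y ∩ Y*|) / M. The function name and the convention of scoring a ratio as 0 when its denominator is empty are hypothetical choices made for this illustration.

```python
def chou_metrics(true_sets, pred_sets, num_labels):
    """Compute the five multi-label criteria over N samples.

    true_sets / pred_sets: lists of Python sets of label indices.
    num_labels: M, the total number of labels.
    """
    n = len(true_sets)
    aiming = coverage = accuracy = absolute_true = absolute_false = 0.0
    for y, y_hat in zip(true_sets, pred_sets):
        inter = len(y & y_hat)
        union = len(y | y_hat)
        # Ratios with an empty denominator are counted as 0 (assumed convention).
        aiming += inter / len(y_hat) if y_hat else 0.0
        coverage += inter / len(y) if y else 0.0
        accuracy += inter / union if union else 0.0
        absolute_true += 1.0 if y == y_hat else 0.0
        absolute_false += (union - inter) / num_labels
    return {
        "aiming": aiming / n,
        "coverage": coverage / n,
        "accuracy": accuracy / n,
        "absolute_true": absolute_true / n,
        "absolute_false": absolute_false / n,
    }


if __name__ == "__main__":
    truth = [{0, 1}, {2}]
    preds = [{0}, {2}]
    print(chou_metrics(truth, preds, num_labels=14))
```

Higher values are better for the first four criteria, while Absolute false is an error rate and should be as low as possible.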

2. Experimental results

To evaluate the effectiveness of DACPGTN, it was compared with five other methods (CGATCPred, iATC-NRAKEL, iATC_mISF, ML-KNN, and RandomForest). CGATCPred is a drug ATC Code prediction method based on drug similarity information and label correlation information; iATC-NRAKEL is a drug ATC Code prediction method based on a drug interaction network and the RAKEL algorithm; iATC_mISF predicts drug ATC Codes from drug chemical-chemical interaction, structural similarity and fingerprint similarity, using Gaussian kernel regression as the classifier; ML-KNN and RandomForest are general-purpose methods for multi-label classification. Among the five comparison methods, the dedicated drug ATC Code prediction methods use the optimal parameters determined for them, and the general multi-label classification methods use default parameters. The parameters set for the DACPGTN method are shown in Table 2.

TABLE 2 DACPGTN method parameter settings

Parameter	Value
Number of Graph Transformer Layers	1
Number of output channels	2
Training epochs	250
Learning rate	0.005
Weight decay	0.001
Number of GCN layers	1
Input feature dimension	300
GCN layer output dimension	150
FC1 neuron number	150
FC2 neuron number	128
FC3 neuron number	64
FC4 neuron number	14
Dropout	0.2

(1) Ten-fold cross validation analysis

Comparison experiments were run on the data set. For every method, ten-fold cross-validation was repeated 10 times and the results were averaged to keep the comparison fair. The experimental results are listed in the following table:

TABLE 3 Results of the DACPGTN method vs. other methods (10 × 10-fold CV)

Classifier	Aiming	Coverage	Accuracy	Absolute true	Absolute false
DACPGTN	0.8543	0.8517	0.8320	0.7902	0.0241
CGATCPred	0.7864	0.8022	0.7711	0.7290	0.0338
iATC-NRAKEL	0.7744	0.8020	0.7550	0.6947	0.0376
iATC_mISF	0.7094	0.7127	0.7036	0.6306	0.0244
ML-KNN	0.7293	0.7071	0.6861	0.6300	0.0433
RandomForest	0.6723	0.6533	0.6471	0.6187	0.0368

As can be seen from Table 3, the DACPGTN method of the present invention achieves the best prediction results on the current data set. Compared with CGATCPred, the best existing model for the drug ATC Code classification problem, DACPGTN improves Aiming by 6.8%, Coverage by 5%, Accuracy by 5.9% and Absolute true by 5.8%. Of the five evaluation criteria, Accuracy and Absolute true are the most important, and DACPGTN improves on both. These results indicate that, when association information exists between a drug compound and target proteins and diseases, the DACPGTN method can use the Graph Transformer Layer of the graph transformation network to learn latent association information among drugs, target proteins and diseases from multiple heterogeneous graphs. By integrating this association information with the composite features of the various nodes, better performance is obtained in ATC Code classification.

(2) Influence of output dimension of GCN layer on experimental result

In the experiment, the GCN layer provides classification information for the end-to-end prediction stage by learning from the composite feature matrix and the latent association matrix obtained by the Graph Transformer Layer. To verify how the output feature dimension of the GCN layer nodes affects the results, and to ensure that the model reaches its best performance, the following experiment was carried out; the results are shown in FIG. 5. The input dimension of the GCN layer nodes is dim = 300. Four output dimensions were preset, and the performance of each on the 5 evaluation criteria was obtained through 10-fold cross-validation. As can be seen from FIG. 5, the model achieves the best prediction performance when the GCN layer output dimension is 150. Therefore, the node output dimension of the GCN layer in the prediction module is set to dim = 150, and all experiments use this setting.

(3) Ablation experiment

To examine how the drug-target protein association information and the drug-disease association information each affect drug ATC Code classification, once the Graph Transformer Layer has learned the latent association information among different nodes, each kind of information was used alone as the heterogeneous-graph input of the Graph Transformer Layer, and the node composite feature matrix was rebuilt accordingly as the input of the GCN end-to-end prediction module. The results of 10-fold cross-validation with the same parameters as in the experiments above are shown in Table 4.

TABLE 4 Ablation experiment results

Classifier	Aiming	Coverage	Accuracy	Absolute true	Absolute false
DACPGTN-Disease	0.8442	0.8437	0.8231	0.7782	0.02516
DACPGTN-Target	0.8327	0.8307	0.8051	0.7536	0.02875

As can be seen from the above table, when only drug-target protein association information or only drug-disease association information is input to the Graph Transformer Layer, the performance of the method degrades somewhat, and using drug-target protein information alone performs better than using drug-disease information alone. Because there is more drug-target protein association information than drug-disease association information, more latent association information among nodes can be captured, which provides more valuable information for classification. In the ATC Code classification problem, the DACPGTN method achieves better prediction performance by considering multi-source association information than by considering a single kind of association information. This shows that DACPGTN can extract information useful for classification from multi-source association information: it obtains a new graph structure by learning from different heterogeneous graphs and, after learning by the end-to-end prediction module, has a clear advantage on the ATC Code classification problem.

The above-described embodiments are merely preferred examples of the present invention, and not intended to limit the scope of the invention, so that equivalent changes or modifications in the structure, features and principles described in the present invention should be included in the claims of the present invention.

Claims

1. A medicine ATCCode prediction method based on a graph transformation network is characterized by comprising the following steps:

2) constructing a drug-target protein heterogeneous network, a drug-disease heterogeneous network, a target protein-disease heterogeneous network and transpositions of the heterogeneous networks: according to the association information between the entities, the specific construction process of the heterogeneous network is as follows:

if an association exists between the current drug i and the target protein j, the element at the corresponding position of the drug-target protein heterogeneous network is set to 1, and otherwise to 0, finally yielding a sparse matrix with values 0 and 1; the drug-disease heterogeneous network and the target protein-disease heterogeneous network are constructed in the same way; each heterogeneous network constructed from the association information between entities is then transposed, finally giving the set of heterogeneous networks between different entities, namely the drug-target protein, drug-disease, target protein-disease, target protein-drug, disease-drug and disease-target protein heterogeneous networks;

3) based on the heterogeneous network set obtained in step 2), the Graph Transformer layer uses a convolution layer and the parameters of that convolution layer to select adjacency matrices (heterogeneous networks of different types) from the heterogeneous network set, and learns a new graph structure by matrix multiplication of two selected adjacency matrices; the soft selection of the adjacency matrix is to be selected from

2. The graph transformation network-based drug ATC Code prediction method according to claim 1, wherein: in step 4), the GCN layer serves as the feature extractor of the end-to-end prediction module, dimensionality reduction is performed with multiple linear layers, and Dropout is added between the linear layers; the GCN layer has 150 output nodes, linear layer 1 has 150 neurons, linear layer 2 has 128 neurons, linear layer 3 has 64 neurons, and linear layer 4, serving as the output layer, has 14 neurons; in the training and prediction stages of the end-to-end prediction module, the target-class scores and non-target-class scores of the multi-label classification problem are compared, using the smooth generalization to multi-label classification of the softmax activation function combined with the cross-entropy loss function; an additional class 0 is introduced so that the scores of all target classes are greater than S₀ and the scores of all non-target classes are less than S₀, which is implemented by the following formula:

setting the threshold S₀ = 0 for the positive and negative sample sets gives the loss, i.e. the generalization of the softmax activation function and the cross-entropy loss function to the multi-label classification problem:

by means of

And in the final prediction stage, the class which is output to be more than 0 in the last layer of linear layer is the prediction result.