CN112435720A

CN112435720A - Prediction method based on self-attention mechanism and multi-drug characteristic combination

Info

Publication number: CN112435720A
Application number: CN202011403977.XA
Authority: CN
Inventors: 宋晓宁; 华阳; 於东军; 冯振华
Original assignee: Shanghai Litu Information Technology Co ltd
Current assignee: Ditu Suzhou Biotechnology Co ltd
Priority date: 2020-12-04
Filing date: 2020-12-04
Publication date: 2021-03-02
Anticipated expiration: 2040-12-04
Also published as: CN112435720B

Abstract

The invention discloses a prediction method based on a self-attention mechanism and a combination of multiple drug features. Drug molecules are encoded into two embedded features using extended-connectivity fingerprints and Mol2Vec vectors, and drug features are extracted with a bidirectional gated recurrent unit and neighborhood convolution. After the protein sequence is feature-embedded, protein features are extracted with one-dimensional convolution and enhanced by attention conditioned on the drug features. The drug features and protein features are then concatenated, and a self-attention mechanism strengthens the extraction of protein-drug interaction information. Finally, the concatenated features are fed into a bidirectional gated recurrent unit to predict the protein-drug interaction. Combining Morgan fingerprint encoding with Mol2Vec vector embedding makes the extracted drug feature information richer; combining a convolutional network with a gated recurrent unit to extract protein and drug features, together with an attention mechanism that strengthens the extraction of protein-drug relationship features, effectively improves the performance of the model.

Description

Prediction method based on self-attention mechanism and multi-drug characteristic combination

Technical Field

The invention relates to the technical field of protein-drug interaction prediction, and in particular to a prediction method based on a self-attention mechanism and a combination of multiple drug features.

Background

Predicting protein-drug interactions is crucial in early drug screening; according to statistics from the Pharmaceutical Research and Manufacturers of America, 75% of the entire pharmaceutical industry is devoted to new drug research. In addition, fewer than 5% of the compounds obtained by primary screening can proceed to clinical trials. Traditional large-scale experimental screening usually takes 2-3 years and consumes a great deal of researchers' time and energy, whereas computer-based virtual screening of drugs is fast and accurate and can effectively reduce the cost of drug screening. However, virtual drug screening presupposes that the interactions between different proteins and drugs (Protein-Drug Interaction, PDI) can be predicted.

Existing methods mostly use an MLP model to predict protein-drug interactions, but such a model cannot highlight locally important information in the drug features, nor can it bring the prediction performance of the whole model to its optimum. A method that predicts protein-drug interactions with a deep long short-term memory network (Deep LSTM) was therefore proposed, and it achieved the best results in predicting the interactions of enzymes and G protein-coupled receptors. Although that method still cannot predict protein-drug interactions on a large scale, it shows that introducing sequential information captures more discriminative features of the interaction between proteins and drugs. Further research is therefore needed to improve the effectiveness of models for large-scale protein-drug interaction prediction.

Disclosure of Invention

This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.

The present invention has been made in view of the above-mentioned problems with the prediction of the existing protein-drug interactions.

Therefore, the technical problems solved by the invention are as follows: although traditional methods effectively describe the detailed features of molecules, the lack of structural information about drug molecules often directly degrades the performance of protein-drug interaction prediction; as the number of drug types grows, the ability of existing graph convolution models to discriminate molecular structures gradually declines, lowering the overall performance of the model; and existing methods mostly use an MLP model to predict protein-drug interactions, which cannot highlight locally important information in the drug features and cannot bring the prediction performance of the whole model to its optimum.

In order to solve the above technical problems, the invention provides the following technical scheme: drug molecules are encoded into two embedded features using extended-connectivity fingerprints and Mol2Vec vectors, and drug features are extracted with a bidirectional gated recurrent unit and neighborhood convolution; after the protein sequence is feature-embedded, protein features are extracted with one-dimensional convolution and enhanced by attention conditioned on the drug features; the drug features and the protein features are concatenated, and a self-attention mechanism strengthens the extraction of protein-drug interaction information; the concatenated features are fed into a bidirectional gated recurrent unit to predict the protein-drug interaction.

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: extracting the drug features comprises embedding the drug using both extended-connectivity fingerprints and Mol2Vec vector encoding; first extracting features from each embedding with a bidirectional gated recurrent unit; concatenating the drug features obtained in the two ways; then further extracting drug features with a one-dimensional convolutional neural network; and finally feeding the resulting drug features, together with the protein features, into a classifier.

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: the extended-connectivity fingerprint is a circular fingerprint, and encoding the molecular formula of the drug with the extended-connectivity fingerprint comprises: analyzing the environment and connectivity of each atom within a given radius, hash-coding all possible structures, and finally compressing the coded information to a preset length with a hash algorithm.

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: Mol2Vec vector encoding is derived from natural language processing; it learns vector representations of molecular substructures such that chemically related substructures point in similar directions, and it finally encodes a compound as a vector by summing the vectors of its substructures.

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: the protein feature extraction comprises preprocessing the protein sequence by dividing the 22 amino acids into 6 classes according to their biochemical properties: A = {H, R, K}, B = {D, E, N, Q}, C = {C, X}, D = {S, T, P, A, G, U}, E = {M, I, L, V} and F = {F, Y, W}, so that the sequence "MSPLNQSAEGLPQEASNRSLN" is converted into "EDDEBBDDBDEDBBDDBADEB". This yields 6 × 6 × 6 = 216 possible combinations, so the dimensionality of the feature matrix is significantly reduced. Meanwhile, the protein and drug features are extracted with a one-dimensional convolutional network, and the convolution used for feature extraction is:

$$x(t) * q(t) = \int_{-\infty}^{+\infty} x(p)\, q(t - p)\, dp$$

wherein: the functions x(t) and q(t) are the variables of the convolution, p is the integration variable, t is the amount by which the function q(-p) is shifted, and * denotes convolution. The protein sequence passes through feature embedding, one-dimensional convolution, max pooling and a fully connected layer to obtain a 128-dimensional feature, which is fed into the classifier together with the drug features.
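The amino acid grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and reading the 216 combinations as overlapping 3-letter windows over the 6-class alphabet is an assumption inferred from 6 × 6 × 6 = 216.

```python
# Six biochemical classes as listed in the text (22 amino-acid symbols in total).
CLASSES = {
    "A": "HRK",
    "B": "DENQ",
    "C": "CX",
    "D": "STPAGU",
    "E": "MILV",
    "F": "FYW",
}
# Invert the table: amino-acid letter -> class letter.
AA_TO_CLASS = {aa: cls for cls, members in CLASSES.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the 6-letter class alphabet."""
    return "".join(AA_TO_CLASS[aa] for aa in seq.upper())


def trigram_ids(reduced):
    """Map each overlapping 3-letter window to an index in [0, 216).

    Assumption: the 216 combinations are overlapping 3-grams of class letters.
    """
    order = "ABCDEF"
    ids = []
    for i in range(len(reduced) - 2):
        a, b, c = (order.index(ch) for ch in reduced[i:i + 3])
        ids.append(a * 36 + b * 6 + c)
    return ids


reduced = reduce_sequence("MSPLNQSAEGLPQEASNRSLN")
ids = trigram_ids(reduced)
```

The integer indices in `ids` are the kind of input an embedding layer would consume before the one-dimensional convolution.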

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: the attention enhancement with respect to the drug features comprises denoting the drug molecule feature vector as F_drug and the protein subsequence feature vectors as P = {P_1, P_2, …, P_i}, and constructing an attention matrix from F_drug that assigns larger weights to the protein subsequences that matter more to the drug molecule, according to the following formulas:

$$W_{attention} = f(W_{inter} F_{drug} + B_{inter})$$

$$P'_i = \sigma(W_{attention} P_i)$$

wherein: f is a function that can be learned by gradient descent, W_inter and B_inter are trainable weights and biases in the model, W_attention is the attention matrix, and P'_i is the protein feature after attention learning.

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: enhancing the extraction of protein-drug interaction information with the self-attention mechanism comprises, given the concatenated PDI feature vector c_interaction, constructing a self-attention matrix W_self-atten that focuses learning on the regions carrying interaction information, expressed as follows:

$$W_{self\text{-}atten} = f(W_{inter}\, c_{interaction} + B_{inter})$$

$$c'_{interaction} = W_{self\text{-}atten}\, c_{interaction}$$

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: extracting the drug features further comprises providing additional drug features with a message passing neural network (MPNN), which is used for predicting quantum chemical properties and performs very well on small-sample models. It mainly comprises three steps: message passing, in which, for each atom, the features (atoms or bonds) of its neighboring elements are propagated into a so-called message vector based on the graph structure; update, in which the embedded atomic features are updated with the message vector; and readout aggregation, in which the atomic features in the molecule are aggregated to obtain the molecular feature vector.

As a preferred embodiment of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the invention: the specific algorithm of the message passing network comprises: first constructing a set of initial states, one for each node in the graph; then letting each node exchange information with its neighbors by message passing, so that the state of each node includes awareness of its direct neighbors; repeating this step, so that each node obtains information about its second-order neighborhood, and so on until the desired number of message rounds is reached; and finally collecting all the context-aware node states and converting them into a feature representing the whole graph. In the node update formula:

wherein: m_t is the message function, u_t is the node update function, N(v) is the set of neighbors of node v in the graph, and g_v^t is the hidden state of node v at time t. For each node, the messages passed from its neighbors are aggregated into a corresponding message vector, and the atom hidden state g_v is finally updated by that message vector.
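The message and update steps described above are commonly written in the following form. This is a sketch assuming the standard message passing neural network formulation, not a reproduction of the formula in the original filing: the superscript notation g_v^t for the hidden state at time t, the symbol \mu_v^{t+1} for the message vector, e_{vw} for the bond features between atoms v and w, and the readout function R producing a molecule-level vector F_{MPNN} are all assumptions introduced here.

```latex
\mu_v^{t+1} = \sum_{w \in N(v)} m_t\!\left(g_v^{t},\, g_w^{t},\, e_{vw}\right),
\qquad
g_v^{t+1} = u_t\!\left(g_v^{t},\, \mu_v^{t+1}\right),
\qquad
F_{MPNN} = R\!\left(\{\, g_v^{T} \mid v \in G \,\}\right)
```

Here T is the number of message rounds and G is the molecular graph; after T rounds, each atom state reflects its T-hop neighborhood.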

The beneficial effects of the invention are as follows: a composite method for extracting drug features is provided, which combines Morgan fingerprint encoding with Mol2Vec vector embedding, so that the details of a drug are expressed, its substructure information is provided in detail, and the extracted drug feature information is richer; amino acids are classified according to biological activity, which effectively reduces the sparsity of the protein features. Meanwhile, combining a convolutional network with a gated recurrent unit to extract protein and drug features, together with an attention mechanism that strengthens the extraction of protein-drug relationship features, effectively improves the performance of the model; and an easy-to-operate GUI is designed and a usage method is provided, which enhances the usability of the model in practical work.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. Wherein:

FIG. 1 is a flow chart of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the first embodiment of the present invention;

FIG. 2 is a diagram of the Morgan fingerprint encoding of a drug in the prediction method based on the self-attention mechanism and multi-drug feature combination according to the first embodiment of the present invention;

FIG. 3 is a diagram of the Mol2Vec vector encoding of a drug in the prediction method based on the self-attention mechanism and multi-drug feature combination according to the first embodiment of the present invention;

FIG. 4 is a diagram of the drug feature extraction model of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the first embodiment of the present invention;

FIG. 5 is a block diagram of the algorithm of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the first embodiment of the present invention;

FIG. 6 is a diagram of a protein-drug interaction simulation using the prediction method based on the self-attention mechanism and multi-drug feature combination according to the third embodiment of the present invention;

FIG. 7 is a diagram of the protein-drug interaction test results of the prediction method based on the self-attention mechanism and multi-drug feature combination according to the third embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.

Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

Example 1

Referring to FIGS. 1 to 5, a first embodiment of the present invention provides a prediction method based on a self-attention mechanism and a combination of multiple drug features, including:

S1: drug molecules are encoded into two embedded features using extended-connectivity fingerprints and Mol2Vec vectors, and drug features are extracted with a bidirectional gated recurrent unit and neighborhood convolution. It should be noted that:

the purpose of drug feature extraction is to obtain discriminative features of the drug, so that the classifier can better understand drug properties and distinguish between different drugs. Good drug features therefore need to be discriminative, representative and information-rich, so that the classifier can better fit a separating hyperplane and the accuracy of the model improves. The drug is embedded using both extended-connectivity fingerprints (Morgan fingerprints) and Mol2Vec vector encoding; features are first extracted from each embedding with a bidirectional gated recurrent unit; the drug features obtained in the two ways are concatenated; drug features are then further extracted with a one-dimensional convolutional neural network; and finally the resulting drug features are fed into the classifier together with the protein features.

Further, the Morgan fingerprint is a circular fingerprint. Referring to FIG. 2, encoding the molecular formula of a drug with the Morgan fingerprint includes: analyzing the environment and connectivity of each atom within a given radius, hash-coding all possible structures, and compressing the coded information to a preset length with a hash algorithm. Because this fingerprint encoding is broadly representative and can be obtained directly from databases, many protein-drug interaction prediction networks use Morgan fingerprints as the feature representation of drugs. However, Morgan fingerprints are too discrete and relatively large, which makes it difficult to represent the substructure information of a drug adequately.
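The three steps of circular fingerprinting (per-atom environments within a radius, hashing, folding to a preset length) can be illustrated with a toy pure-Python sketch. It is not the algorithm used by any cheminformatics toolkit and will not reproduce real Morgan bit positions; the function names, the choice of SHA-1 as the hash, and the atom invariants (element and degree only) are assumptions made for illustration.

```python
import hashlib


def stable_hash(obj):
    """Deterministic 32-bit hash of a printable Python object."""
    digest = hashlib.sha1(repr(obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def circular_fingerprint(atoms, bonds, radius=2, n_bits=2048):
    """Toy extended-connectivity-style fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples over atom indices
    Returns a list of n_bits zeros and ones.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Radius 0: each atom is identified by its element and its degree.
    ids = [stable_hash((atoms[i], len(neighbors[i]))) for i in range(len(atoms))]
    seen = set(ids)

    # Each round widens every atom's environment by one bond.
    for r in range(1, radius + 1):
        new_ids = []
        for i in range(len(atoms)):
            env = sorted((order, ids[j]) for order, j in neighbors[i])
            new_ids.append(stable_hash((r, ids[i], env)))
        ids = new_ids
        seen.update(ids)

    # Fold the collected identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for identifier in seen:
        bits[identifier % n_bits] = 1
    return bits


# Ethanol heavy atoms: C-C-O with single bonds.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
```

Because identifiers depend only on elements, degrees and neighbor environments, renumbering the atoms of the same molecule gives the same bit vector. The folding step also shows why such fingerprints are sparse: a small molecule sets only a handful of the 2048 bits.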

Referring to FIG. 3, Mol2Vec vector encoding is derived from Word2Vec in natural language processing (NLP). It learns vector representations of molecular substructures such that chemically related substructures point in similar directions, and it finally encodes a compound as a vector by summing the vectors of its substructures. This encoding clearly expresses the substructural features of a drug, is highly representative, and is an important complement to the Morgan features. To obtain richer and more discriminative drug features, the invention combines the two encodings to embed the drug; for the model, see the black area in FIG. 4.
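The summation step of Mol2Vec can be sketched as follows. The substructure identifiers and the 3-dimensional vectors are hypothetical placeholders; in practice the vectors are learned by Word2Vec over substructure identifiers, and the fallback to a shared "UNK" vector for unseen substructures is an assumption of this sketch.

```python
def embed_molecule(substructure_ids, vectors, dim):
    """Sum the vectors of a molecule's substructures (Mol2Vec-style).

    Identifiers missing from the table fall back to a shared "UNK" vector.
    """
    total = [0.0] * dim
    unk = vectors.get("UNK", [0.0] * dim)
    for sid in substructure_ids:
        vec = vectors.get(sid, unk)
        total = [t + v for t, v in zip(total, vec)]
    return total


# Hypothetical substructure vectors, for illustration only.
toy_vectors = {
    "sub_1": [0.5, -1.0, 0.25],
    "sub_2": [1.0, 0.5, 0.0],
    "UNK": [0.0, 0.0, 0.0],
}
molecule_vector = embed_molecule(["sub_1", "sub_2", "sub_1"], toy_vectors, dim=3)
```

Unlike the binary fingerprint, the result is a dense real-valued vector, which is why the two encodings complement each other.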

Further, extracting the drug features also includes providing additional drug features with a message passing neural network (MPNN), which is used for predicting quantum chemical properties and performs very well on small-sample models. It mainly comprises three steps: message passing, in which, for each atom, the features (atoms or bonds) of its neighboring elements are propagated into a so-called message vector based on the graph structure; update, in which the embedded atomic features are updated with the message vector; and readout aggregation, in which the atomic features in the molecule are aggregated to obtain the molecular feature vector. The specific algorithm of the message passing network is: first construct a set of initial states, one for each node in the graph; then let each node exchange information with its neighbors by message passing, so that the state of each node includes awareness of its direct neighbors; repeat this step, so that each node obtains information about its second-order neighborhood, and so on until the desired number of message rounds is reached; finally collect all the context-aware node states and convert them into a feature representing the whole graph. In the node update, g_v^t is the hidden state of node v at time t, and the atom hidden state g_v is finally updated by the message vector.
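The message, update and readout steps can be illustrated with a small pure-Python sketch. The message function (sum of neighbor states) and the update function (average of own state and message) are assumptions chosen for simplicity; in a trained MPNN both are learned neural networks, and bond features would also enter the message.

```python
def message_passing(features, edges, rounds=2):
    """Run a few rounds of sum-aggregation message passing on a graph.

    features: dict node -> list of floats (initial atom states)
    edges: list of undirected (u, v) pairs
    Returns (final node states, graph-level readout vector).
    """
    neighbors = {v: [] for v in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    states = {v: list(f) for v, f in features.items()}
    for _ in range(rounds):
        updated = {}
        for v, h in states.items():
            # Message step: sum the current states of the direct neighbors.
            msg = [sum(states[w][k] for w in neighbors[v]) for k in range(len(h))]
            # Update step (assumed form): average own state and the message.
            updated[v] = [0.5 * (a + b) for a, b in zip(h, msg)]
        states = updated

    # Readout: sum all node states into one molecule-level vector.
    dim = len(next(iter(states.values())))
    readout = [sum(s[k] for s in states.values()) for k in range(dim)]
    return states, readout


# Path graph 0 - 1 - 2 with one-dimensional features.
states, readout = message_passing({0: [1.0], 1: [0.0], 2: [0.0]}, [(0, 1), (1, 2)])
```

In this example node 2 still has state 0.0 after one round, because node 0 is two bonds away; only after the second round does information from node 0 reach it, which is the second-order neighborhood effect described above.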

S2: after the protein sequence is feature-embedded, protein features are extracted with one-dimensional convolution and are attention-enhanced with respect to the drug features. It should be noted that:

extracting the protein features includes preprocessing the protein sequence by dividing the 22 amino acids into 6 classes according to their biochemical properties: A = {H, R, K}, B = {D, E, N, Q}, C = {C, X}, D = {S, T, P, A, G, U}, E = {M, I, L, V} and F = {F, Y, W}, so that the sequence "MSPLNQSAEGLPQEASNRSLN" is converted into "EDDEBBDDBDEDBBDDBADEB". This yields 6 × 6 × 6 = 216 possible combinations, so the dimensionality of the feature matrix is significantly reduced. Meanwhile, the protein and drug features are extracted with a one-dimensional convolutional network, and the convolution used for feature extraction is:

$$x(t) * q(t) = \int_{-\infty}^{+\infty} x(p)\, q(t - p)\, dp$$

wherein: the functions x(t) and q(t) are the variables of the convolution, p is the integration variable, t is the amount by which the function q(-p) is shifted, and * denotes convolution. The protein sequence passes through feature embedding, one-dimensional convolution, max pooling and a fully connected layer to obtain a 128-dimensional feature, which is fed into the classifier together with the drug features.
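For sampled sequences, the convolution integral becomes a finite sum. The sketch below shows one filter sliding over a short signal followed by max pooling; the function names and the toy values are illustrative assumptions, and a real network would apply many learned filters to embedded sequences.

```python
def conv1d(signal, kernel, stride=1):
    """Valid-mode discrete convolution: out[t] = sum_p signal[p] * kernel[t - p].

    The kernel is flipped, matching the q(t - p) term of the integral form.
    """
    k = len(kernel)
    flipped = kernel[::-1]
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        window = signal[start:start + k]
        out.append(sum(s * w for s, w in zip(window, flipped)))
    return out


def global_max_pool(values):
    """Keep the strongest response of one filter over the whole sequence."""
    return max(values)


feature_map = conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])
pooled = global_max_pool(feature_map)
```

Deep learning frameworks usually implement the unflipped variant (cross-correlation); since the kernel weights are learned, the two are equivalent up to a reversal of the kernel.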

S3: the drug features and the protein features are concatenated, and the extraction of protein-drug interaction information is enhanced with a self-attention mechanism. It should be noted that:

the attention enhancement with respect to the drug features comprises denoting the drug molecule feature vector as F_drug and the protein subsequence feature vectors as P = {P_1, P_2, …, P_i}, and constructing an attention matrix from F_drug that assigns larger weights to the protein subsequences that matter more to the drug molecule, according to the following formulas:

$$W_{attention} = f(W_{inter} F_{drug} + B_{inter})$$

$$P'_i = \sigma(W_{attention} P_i)$$

wherein: f is a function that can be learned by gradient descent, W_inter and B_inter are trainable weights and biases in the model, W_attention is the attention matrix, and P'_i is the protein feature after attention learning.
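The two attention formulas can be traced numerically with a small sketch. Several choices here are assumptions not fixed by the text: both f and σ are taken to be the logistic sigmoid, W_attention is treated as a vector applied elementwise to each P_i, and all names and sizes are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def drug_attention(f_drug, protein_feats, w_inter, b_inter):
    """Re-weight each protein subsequence feature P_i by a drug-derived gate.

    Assumptions: f and sigma are both the logistic sigmoid, and W_attention
    is a vector applied elementwise to every P_i.
    """
    pre = matvec(w_inter, f_drug)
    w_attention = [sigmoid(p + b) for p, b in zip(pre, b_inter)]
    attended = [
        [sigmoid(w * x) for w, x in zip(w_attention, p_i)]
        for p_i in protein_feats
    ]
    return w_attention, attended


# Toy sizes: 2-dimensional drug vector, two 3-dimensional subsequence features.
W_INTER = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
B_INTER = [0.0, 0.0, 0.0]
gate, attended = drug_attention(
    [0.0, 0.0], [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]], W_INTER, B_INTER
)
```

The key point is that the gate depends only on the drug vector, so the same protein is re-weighted differently for different drugs.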

Enhancing the extraction of protein-drug interaction information with the self-attention mechanism comprises: given the concatenated PDI feature vector c_interaction, constructing a self-attention matrix W_self-atten that focuses learning on the regions carrying interaction information, expressed as follows:

$$W_{self\text{-}atten} = f(W_{inter}\, c_{interaction} + B_{inter})$$

$$c'_{interaction} = W_{self\text{-}atten}\, c_{interaction}$$

S4: the concatenated features are fed into a bidirectional gated recurrent unit to predict the protein-drug interaction. It should be noted that:

the concatenated feature c'_interaction is put into the bidirectional gated recurrent unit for training, and the features of this layer are input into a classifier to predict the final result. The invention uses binary cross-entropy as the loss function for network training; in its formula, θ denotes the weights of the entire model, y_i is the label of the i-th training sample, and ŷ_i is the network output for the i-th training sample. To prevent overfitting, the invention constrains network optimization with an L2-norm penalty term, in which W and b are the weights and biases of each layer of the model and λ is the penalty factor; dropout layers are also embedded in the last two layers of the model to mitigate overfitting. To balance training efficiency and classification results, the Adam optimizer is used to optimize the weights of the deep network.
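The training objective described above, binary cross-entropy plus an L2 penalty, can be sketched as follows. Averaging the loss over samples, penalizing the plain sum of squared parameters, and the clipping constant `eps` are assumptions of this sketch; the exact normalization used in the original formulas is not recoverable from the text.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def l2_penalty(params, lam):
    """lam times the sum of squared entries over all parameter lists."""
    return lam * sum(x * x for layer in params for x in layer)


def regularized_loss(y_true, y_pred, params, lam=0.0001):
    """Data term plus weight penalty (assumed additive combination)."""
    return binary_cross_entropy(y_true, y_pred) + l2_penalty(params, lam)


loss = regularized_loss([1, 0], [0.5, 0.5], params=[[1.0, -2.0], [0.5]], lam=0.0001)
```

A maximally uncertain prediction of 0.5 for every sample gives a data term of ln 2, roughly 0.693, which is a useful sanity baseline when monitoring training.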

Example 2

As a second embodiment of the present invention, in order to better verify and explain the technical effects of the method, three datasets are selected for testing in this embodiment, and the test results are compared to verify the real effect of the method.

Before the experiment, three datasets, BindingDB, Kinase and Human, were selected to verify the effect of the model. The BindingDB dataset is divided into a training set, a validation set and a test set as shown in Table 1 below; the validation and test sets contain PDI samples whose ligand or protein is not observed in the training set, so the BindingDB dataset can be used to evaluate how well the model generalizes to unknown drugs and proteins.

Table 1: BindingDB dataset distribution.

Dataset	Protein	Drug	Positive	Negative
Train	758	43160	28240	21915
Dev	472	5077	2831	2776
Test	466	5016	2706	2802

The Kinase dataset is constructed from the KIBA dataset and comprises 229 protein samples and 1644 drug samples. KIBA integrates various bioactivity measures, such as IC50, Ki and Kd; by combining multiple bioactivity scores, it greatly reduces the bias in the dataset. The numbers of positive and negative samples in the Kinase dataset are highly imbalanced, as shown in Table 2 below:

Table 2: Kinase dataset distribution.

Dataset	Positive	Negative
Train	19183	72282
Test	3990	15695

The Human dataset contains 852 human proteins and 1052 drug molecules, with 3369 positive samples and 2843 highly reliable negative samples. Because this dataset is not divided into a training set and a test set, the model is evaluated on it by cross-validation.
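A cross-validation split over a dataset of this size can be sketched as below. The number of folds is not stated in the text, so `k=5` is an assumption, as are the function name and the fixed random seed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 3369 positive + 2843 negative samples in the Human dataset.
splits = list(k_fold_indices(3369 + 2843, k=5))
```

Each model is then trained on `train_idx` and scored on `test_idx`, and the scores are averaged over the folds.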

In the experiments of this embodiment, the hardware is an Intel Core i7-8700K central processing unit and an NVIDIA GeForce RTX 2060s graphics card, and the operating system is Windows 10. The model is trained with the Keras deep learning framework and evaluated with the scikit-learn (sklearn) machine learning toolkit in a Python 3 environment. When training the network model, different parameters have very different effects on weight optimization, so the learning rate is determined first; on that basis, the other parameters are tuned by grid search. After multiple rounds of experiments, the hyperparameter settings shown in Table 3 below were determined:

Table 3: Parameter settings.

Name	Value
Learning rate	0.0001
Learning decay	0.001
Cnn filters	128
Cnn stride	10,15
Dropout	0.05
Regularizer	0.0001
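The tuning procedure described above (fix the learning rate first, then grid-search the remaining parameters) can be sketched as follows. The dictionary keys are a rendering of the Table 3 names chosen for this sketch, the candidate values are hypothetical, and `dummy_score` is a stand-in for training the model and returning a validation metric.

```python
from itertools import product

# Values reported in Table 3 (key names are illustrative).
BASE_CONFIG = {
    "learning_rate": 0.0001,
    "learning_decay": 0.001,
    "cnn_filters": 128,
    "cnn_stride": (10, 15),
    "dropout": 0.05,
    "regularizer": 0.0001,
}


def grid_search(fixed, grid, evaluate):
    """Try every combination in `grid` on top of `fixed`; return the best."""
    best_score, best_config = float("-inf"), None
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        config = dict(fixed, **dict(zip(names, values)))
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score


# Hypothetical candidate values; not taken from the source.
CANDIDATES = {"dropout": [0.05, 0.1, 0.2], "cnn_filters": [64, 128]}


def dummy_score(config):
    """Stand-in for 'train the model and return its validation AUC'."""
    return -abs(config["dropout"] - 0.05) - abs(config["cnn_filters"] - 128) / 1000


best_config, best_score = grid_search(
    {"learning_rate": BASE_CONFIG["learning_rate"]}, CANDIDATES, dummy_score
)
```

Fixing the learning rate before searching keeps the grid small: here 3 × 2 = 6 configurations are evaluated instead of a full product over every parameter.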

The present embodiment evaluates each model using two metrics: the area under the ROC curve (AUC) and the area under the PR curve (AUPR). Each point on the ROC curve is given by the values of two indices, the true positive rate (TPR):

$$TPR = \frac{TP}{TP + FN}$$

and the false positive rate (FPR):

$$FPR = \frac{FP}{FP + TN}$$

Each point on the PR curve is given by the values of two indices, precision (P):

$$P = \frac{TP}{TP + FP}$$

and recall (R):

$$R = \frac{TP}{TP + FN}$$

wherein: TP is the number of positive samples predicted correctly, FP is the number of negative samples incorrectly predicted as positive, TN is the number of negative samples predicted correctly, and FN is the number of positive samples incorrectly predicted as negative.
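The metrics above can be computed directly from labels and predicted scores. The sketch below is a plain-Python illustration with made-up labels and scores; the embodiment itself reports using scikit-learn for evaluation, and the trapezoidal AUC here assumes that both classes are present in the labels.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_auc(y_true, y_score):
    """Area under the ROC curve by the trapezoidal rule over all thresholds."""
    thresholds = sorted(set(y_score), reverse=True)
    points = [(0.0, 0.0)]
    for th in thresholds:
        tp, fp, tn, fn = confusion_counts(y_true, y_score, th)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
tp, fp, tn, fn = confusion_counts(labels, scores, threshold=0.5)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(labels, scores)
```

Sweeping the threshold from high to low traces the ROC curve from (0, 0) to (1, 1); the PR curve is obtained the same way from the precision and recall at each threshold.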

The experiments initially adopted the drug feature extraction method of DeepConv-DTI, that is, drug molecules were encoded using Morgan vectors only and combined with the model of the invention for training; the best test AUC was 0.954. The experiments showed that this way of extracting drug features ignores information about the overall molecular substructure, which is very important for predicting protein-drug interactions. The invention therefore proposes concatenating the Mol2Vec vector, which covers drug substructure information, with the Morgan vector and then extracting features through a convolutional network. In addition, the study by Withnall et al. on graph networks that actively learn the implicit structural features of molecules notes that a message passing neural network (MPNN) gives a model the ability to learn molecular structure. On this basis, the invention conjectured that adding features extracted by a graph network would further improve the predictive ability of the model. However, adding the drug molecular features extracted by the MPNN model to the original model did not meet expectations: when the sample size is relatively large, the advantage of the graph network is limited. As the experimental results in Table 4 below show, after adding the features encoded by the MPNN model, the recognition rate of the model dropped to 0.951, which is not ideal. The combination of Morgan fingerprint and Mol2Vec vector encoding is therefore finally used as the drug feature information.

After determining the drug feature extraction method through a number of experiments and adopting the bidirectional gated recurrent unit combined with softmax as the classifier, the invention also examined how the attention modules should be used, as shown in table 4 below, where O indicates that the module named above is used and X indicates that it is not. When no attention module is used, the AUC of the trained model on the test set is 0.954. Because the invention expects the extracted protein features to be more closely related to the corresponding drug features, an attention module was added between the protein features and the drug features, which improved the AUPR by 0.3% over the original result. On this basis, a self-attention module was further added between the combined feature layer and the bidirectional gated recurrent unit layer to optimize the result, and the AUC of the trained model on the test set reached 0.961. In a supplementary experiment in which only the self-attention module was added, the test AUC reached 0.960, which is 1% higher than the initial result, a very obvious effect.

Table 4: ablation experiments on the BindingDB dataset.

The experiment evaluated the model on the above 3 data sets and performed comparative experiments against conventional models of various types, including the k-nearest neighbor model (KNN), random forest (RF), L2 logistic regression, support vector machine (SVM) and the CPI-GNN model. Because the parameter details of the models other than CPI-GNN are not given here, the results of the former four models are discussed only for the Human data set. In addition, the GraphDTA, DeepCPI, GCN and TransformerCPI models, which are the most typical models used for PDI prediction in the last two years, are discussed as important references.

GraphDTA, GCN, CPI-GNN, TransformerCPI, DeepConvDTI and the model presented herein were compared in sequence on the BindingDB dataset. The results of the first four models are existing data, while the results of the DeepConvDTI model and of the model of the present invention are the best values obtained by parameter tuning during the experiment. The BindingDB dataset contains a large number of protein and drug samples that are not contained in the training set. The results are shown in table 5 below:

Table 5: comparative experiments on BindingDB dataset.

Method	AUC	AUPR
GraphDTA	0.929	0.917
GCN	0.927	0.913
CPI-GNN	0.603	0.543
TransformerCPI	0.951	0.949
DeepConvDTI	0.944	0.947
Ours	0.961	0.962

The table shows that the proposed model outperforms the other state-of-the-art models: its AUC is 1.5% higher than the baseline and 1% higher than the highest competing value. Because negative samples far outnumber positive samples in practical prediction applications, the model must maintain its performance on imbalanced data. To verify the effect of the model on data with imbalanced positive and negative samples, the experiment compared it with existing models on the Kinase data set. As shown in table 6 below, compared with the other four models, the model of the invention still performs well on the imbalanced data set, and its performance does not degrade as the number of negative samples increases.

Table 6: comparative experiments on Kinase data set.

Method	AUC	AUPR
GraphDTA	0.934	0.935
GCN	0.928	0.930
CPI-GNN	0.922	0.922
TransformerCPI	0.926	0.923
Ours	0.937	0.962

Finally, the performance of the model of the invention was verified again on the more commonly used Human data set. Because this data set is not divided into a training set and a test set, the model must be evaluated by cross-validation. To keep the experiment comparable, the data were split into training and test sets at the same 4:1 ratio used in previous work, and the evaluation protocol is consistent with earlier methods. The mean and variance of the best values obtained over ten different splits are shown in table 7 below; it is not hard to see that the proposed model is superior to other similar models in both precision and stability.
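The repeated 4:1 evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the seeds, the function names and the `dummy_evaluate` stand-in (which replaces training and scoring a real model) are all hypothetical.

```python
import random
import statistics


def split_4_to_1(samples, seed):
    """Shuffle with a fixed seed and split into 80% training / 20% test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5
    return shuffled[:cut], shuffled[cut:]


def repeated_evaluation(samples, evaluate, n_repeats=10):
    """Score n_repeats different 4:1 splits; return (mean, variance)."""
    scores = []
    for seed in range(n_repeats):
        train, test = split_4_to_1(samples, seed)
        scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.pvariance(scores)


def dummy_evaluate(train, test):
    """Hypothetical stand-in for 'train on train, return a test-set score'."""
    return sum(label for _, label in test) / len(test)


# Toy data set of (sample_id, label) pairs, made up for illustration
data = [(i, i % 2) for i in range(100)]
mean_score, var_score = repeated_evaluation(data, dummy_evaluate)
```

Fixing the seed per repetition keeps the ten splits reproducible, so competing models can be scored on identical partitions.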

Table 7: comparative experiments on Human data sets.

Example 3

Referring to fig. 6 to 7, in order to better verify and illustrate the practicability of the method of the present invention, this example applies the method to screening drugs for the treatment of Alzheimer's disease;

Dementia is one of the noteworthy problems in public health management, and more than 80% of dementia cases are Alzheimer's disease (AD). Currently available therapies only help to relieve symptoms temporarily; they neither cure the disease nor reverse its neuropathological course, so a new treatment that delays or arrests disease progression remains an urgent medical need. According to a well-accepted theory of AD, the loss of cholinergic neurons leads to a decrease in the neurotransmitter acetylcholine (ACh), so inhibiting acetylcholinesterase (AChE) can raise the level of ACh and thereby improve cognitive ability.

Meanwhile, studies show that the level of butyrylcholinesterase (BuChE) remains unchanged or even increases in the late stage of the disease; because BuChE can also hydrolyze ACh, it takes over the role of AChE as AChE activity declines in the late stage. Experiments in mice with the acetylcholinesterase gene knocked out support this hypothesis and further show that selective inhibition of BuChE is positively correlated with improved cognitive performance and memory. In other words, inhibiting acetylcholinesterase and butyrylcholinesterase is an important means of treating AD.

Based on the proposed model, the invention therefore designs a practical drug screening tool for screening drugs that inhibit acetylcholinesterase and butyrylcholinesterase. The protein to be tested and a sufficient number of drug molecules are fed into the system, which outputs the identifiers of the Top15 drugs and a histogram (from high to low) of the predicted interaction value between the protein and each drug. So that the model can be validated meaningfully, the test data selected in this example are not in the data set used to train the model. It is noteworthy, however, that acetylcholinesterase is present in the training data set, and that butyrylcholinesterase, which is absent from the training data, shares 65% amino acid sequence similarity with acetylcholinesterase. The principle of the deep PDI model is to infer unknown interactions from known ones, so the tested information is usually related to known information; otherwise the test result would have no basis.

In this embodiment, the drug data provided by Rajnish Kumar et al. are used as the test target. The test set consists of 35 compounds that Rajnish Kumar et al. selected from the Asinex library by manual screening, for which the two-dimensional structural formulas and Asinex numbers of the drug molecules are given. The drug molecular formulas used for PDI prediction were obtained from PubChem according to the corresponding numbers and structural formulas. Rajnish Kumar et al. also give the inhibition rate (IR) of each drug molecule against AChE and BuChE. According to the criterion defined herein, IR < 0.5 is recorded as no interaction and IR > 0.5 as an interaction, which yields the test data shown in table 8 below.

Table 8: drug test set.

CID	AChE	BuChE	CID	AChE	BuChE
1148028(A1)	0	0	135644857(B6)	0	0
1120622(A2)	0	1	6489641(B7)	1	0
709041(A3)	0	0	1292545(C1)	1	0
1153034(A4)	0	0	6498716(C2)	0	1
135411325(A5)	0	0	6498728(C3)	0	0
1453054(A6)	0	0	6498729(C4)	0	1
43817564(A7)	0	0	5305813(C5)	1	1
3228454(B1)	0	0	3123873(C6)	0	0
651119(B2)	0	0	3201566(C7)	0	0
2684623(B3)	0	0	1448624(D1)	0	0
1096744(B4)	0	0	1439318(D2)	0	0
1071391(B5)	0	0	6411211(D3)	1	0
1126267(D4)	1	0	3149085(E2)	0	0
807832(D5)	0	0	1171875(E3)	0	0
1166551(D6)	0	0	3146341(E4)	0	0
651744(D7)	0	0	72030131(E5)	0	0
715450(E1)	0	0	6496850(E6)	0	0

It should be noted that none of the tested drugs appears in the training set, although similar structures and identical functional groups cannot be ruled out. The system ranks the predicted interaction values from high to low and produces the Top15 histogram shown in FIG. 7, from which it can be seen that the interacting protein-drug combinations in table 8 mostly fall within the predicted Top15 range, which indicates the practical applicability of the model provided by the invention.
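The screening step of this example, labeling compounds by inhibition rate and ranking them by predicted interaction value, can be sketched as below. The compound identifiers, scores and inhibition rates are made-up placeholders rather than values from table 8 or FIG. 7, the function names are hypothetical, and treating an IR of exactly 0.5 as no interaction is an assumption, since only IR < 0.5 and IR > 0.5 are defined.

```python
def label_from_inhibition(ir, threshold=0.5):
    """Return 1 (interaction) if IR exceeds the threshold, else 0."""
    return 1 if ir > threshold else 0


def top_k(scores, k=15):
    """Return the k compound ids with the highest predicted values, high to low."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def hits_in_top_k(scores, labels, k=15):
    """List the experimentally interacting compounds that appear in the top k."""
    positives = {cid for cid, y in labels.items() if y == 1}
    return [cid for cid in top_k(scores, k) if cid in positives]


# Placeholder predictions and inhibition rates, for illustration only
scores = {"cmpd_a": 0.91, "cmpd_b": 0.15, "cmpd_c": 0.67, "cmpd_d": 0.42}
inhibition = {"cmpd_a": 0.8, "cmpd_b": 0.1, "cmpd_c": 0.3, "cmpd_d": 0.6}
labels = {cid: label_from_inhibition(ir) for cid, ir in inhibition.items()}

ranked = top_k(scores, k=2)
hits = hits_in_top_k(scores, labels, k=2)
```

With `k=15` the same functions reproduce the Top15 ranking described above for one target protein at a time.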

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. A prediction method based on a self-attention mechanism and multi-drug feature combination, characterized by comprising:

compiling drug molecules into two embedded features through extended connectivity fingerprints and Mol2Vec vectors, and extracting drug features through a bidirectional gated recurrent unit and neighborhood convolution;

after feature embedding of the protein sequence, extracting protein features using one-dimensional convolution and applying attention enhancement related to the drug features;

splicing the drug features and the protein features, and using an attention mechanism to enhance the extraction of protein-drug interaction information;

feeding the spliced features into a bidirectional gated recurrent unit and predicting the interaction between the protein and the drug.

2. The method of claim 1, wherein extracting the drug features comprises:

embedding the drug features by combining the two modes of extended connectivity fingerprints and Mol2Vec vector compilation; first extracting features from each embedding through a bidirectional gated recurrent unit, splicing the drug features obtained in the two modes, and then further extracting drug features using a one-dimensional convolutional neural network; and finally feeding the result into a classifier together with the protein features, thereby obtaining the drug features.

3. The method of claim 2, wherein the extended connectivity fingerprint comprises:

the extended connectivity fingerprint is a circular fingerprint, and encoding the drug molecular formula using the extended connectivity fingerprint comprises: analyzing the environment and connectivity of each atom within a given radius, then hash-coding all possible structures, and finally compressing the coded information to a preset length using a hash algorithm.
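The three steps named in this claim (atom environments within a radius, hash coding, folding to a preset length) can be illustrated with a toy circular fingerprint. This is a simplified sketch and not the ECFP algorithm as implemented in any cheminformatics toolkit: the atom invariants (element and degree only), the SHA-256 based hash, the modulo folding and all function names are assumptions chosen to keep the example self-contained.

```python
import hashlib


def _hash_int(obj):
    """Deterministic 32-bit hash of an object with a stable repr."""
    digest = hashlib.sha256(repr(obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """atoms: list of element symbols; bonds: list of (i, j, order) tuples."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Radius 0: each identifier encodes the atom's element and degree
    ids = [_hash_int((atoms[i], len(neighbors[i]))) for i in range(len(atoms))]
    features = set(ids)

    # Each layer hashes an atom's identifier with its sorted neighbor identifiers
    for layer in range(1, radius + 1):
        new_ids = []
        for i in range(len(atoms)):
            env = sorted((order, ids[j]) for order, j in neighbors[i])
            new_ids.append(_hash_int((layer, ids[i], tuple(env))))
        ids = new_ids
        features.update(ids)

    # Fold the collected identifiers into a fixed-length bit vector
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


# Heavy atoms of ethanol (C-C-O), written in two different atom orders
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_reordered = circular_fingerprint(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
```

Sorting the neighbor identifiers before hashing is what makes the result independent of atom numbering, so both atom orders above yield the same bit vector.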

4. The method of claim 2 or 3, wherein the Mol2Vec vector compilation comprises:

the Mol2Vec vector compilation derives from natural language processing; it learns vector representations of molecular substructures such that chemically related substructures point in similar directions, and finally encodes the compound as a vector by summing the vectors of its individual substructures.
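The final summation step of this claim can be sketched in a few lines. The substructure identifiers, the three-dimensional toy vectors and the `UNK` fallback entry for unseen substructures are illustrative assumptions; in practice the lookup table would come from a trained embedding model.

```python
def embed_molecule(substructure_ids, table, dim, unk="UNK"):
    """Sum the vectors of a molecule's substructures into one vector."""
    total = [0.0] * dim
    for sid in substructure_ids:
        vec = table.get(sid, table[unk])  # unseen substructures use the fallback
        total = [a + b for a, b in zip(total, vec)]
    return total


# Toy lookup table, made up for illustration only
toy_table = {
    "s1": [0.1, 0.2, 0.0],
    "s2": [0.4, -0.1, 0.3],
    "UNK": [0.0, 0.0, 0.0],
}

# A molecule containing s1 once, s2 twice and one unseen substructure s9
molecule_vector = embed_molecule(["s1", "s2", "s2", "s9"], toy_table, dim=3)
```

Because the encoding is a plain sum, a substructure that occurs twice contributes twice, and the output dimension equals the embedding dimension regardless of molecule size.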

5. The method of any of claims 1 to 3, wherein extracting the protein features comprises:

preprocessing the protein sequence by dividing the 22 amino acids into 6 classes according to their biochemical characteristics: A = {H, R, K}, B = {D, E, N, Q}, C = {C, X}, D = {S, T, P, A, G, U}, E = {M, I, L, V} and F = {F, Y, W}, so that the sequence "MSPLNQSAEGLPQEASNRSLN" is converted into "EDDEBBDDBDEDBBDDBADEB"; the method yields 6 × 6 × 6 = 216 combinations, giving a feature matrix of significantly reduced dimensionality; meanwhile, the protein and drug features are extracted using a one-dimensional convolutional network, and the formula of the convolution extracted features is as follows:
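The amino acid grouping step of this claim can be sketched as follows. The six classes are taken from the claim; reading the 216 combinations as overlapping windows of three consecutive class letters (216 = 6 × 6 × 6) is an assumption, as are the names used here.

```python
from itertools import product

# The six amino acid classes listed in the claim
CLASSES = {
    "A": "HRK",
    "B": "DENQ",
    "C": "CX",
    "D": "STPAGU",
    "E": "MILV",
    "F": "FYW",
}
RESIDUE_TO_CLASS = {res: cls for cls, members in CLASSES.items() for res in members}

# Assumed vocabulary: every ordered triplet of class letters gets an index
TRIPLET_INDEX = {"".join(t): i for i, t in enumerate(product("ABCDEF", repeat=3))}


def to_classes(sequence):
    """Replace each residue by the letter of its class."""
    return "".join(RESIDUE_TO_CLASS[res] for res in sequence)


def triplet_indices(class_sequence):
    """Map overlapping windows of three class letters to vocabulary indices."""
    return [
        TRIPLET_INDEX[class_sequence[i:i + 3]]
        for i in range(len(class_sequence) - 2)
    ]


encoded = to_classes("MSPLNQSAEGLPQEASNRSLN")  # "EDDEBBDDBDEDBBDDBADEB"
indices = triplet_indices(encoded)
```

The resulting index list is the kind of input an embedding layer followed by one-dimensional convolution would consume.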

6. The method of claim 1, wherein the attention enhancement related to the drug features comprises:

letting the drug molecular feature vector be $F_{drug}$ and the protein subsequence feature vectors be $P = \{P_1, P_2, \ldots, P_i\}$, and constructing an attention matrix with respect to $F_{drug}$, which assigns greater weight to the protein subsequences that are more important to the drug molecule, with the formulas as follows:

$$W_{attention} = f(W_{inter} F_{drug} + B_{inter})$$

$$P'_i = \sigma(W_{attention} P_i)$$
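A minimal numerical sketch of these two formulas follows. Several points are assumptions, because the claim does not fix them: f is taken to be tanh, σ to be the logistic sigmoid, and the product of the attention weights with each protein feature vector is read as an elementwise product. The function names and toy values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def drug_guided_attention(f_drug, protein_feats, w_inter, b_inter):
    """Re-weight each protein subsequence vector by drug-derived attention."""
    # W_attention = f(W_inter F_drug + B_inter), with f assumed to be tanh
    w_attention = [
        math.tanh(z + b) for z, b in zip(matvec(w_inter, f_drug), b_inter)
    ]
    # P'_i = sigma(W_attention P_i), read here as an elementwise product
    return [
        [sigmoid(w * p) for w, p in zip(w_attention, p_i)]
        for p_i in protein_feats
    ]


# Toy values: drug feature of size 2, protein features of size 3
f_drug = [1.0, 2.0]
w_inter = [[0.2, -0.1], [0.0, 0.3], [0.5, 0.5]]
b_inter = [0.0, 0.1, -0.2]
protein_feats = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]

enhanced = drug_guided_attention(f_drug, protein_feats, w_inter, b_inter)
```

Under this reading, `w_inter` must have as many rows as the protein feature dimension and as many columns as the drug feature dimension, so the attention weights line up with each protein feature vector.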

7. The method of claim 1 or 6, wherein enhancing the extraction of protein-drug interaction information using the self-attention mechanism comprises:

given the spliced PDI feature vector $c_{interaction}$, constructing a self-attention matrix $W_{self\text{-}atten}$ to focus learning on the regions carrying interaction information, with the formulas as follows:

$$W_{self\text{-}atten} = f(W_{inter} c_{interaction} + B_{inter})$$

$$c'_{interaction} = W_{self\text{-}atten} \, c_{interaction}.$$

8. The method of claim 1 or 2, wherein extracting the drug features further comprises:

additional drug features may also be provided by a message passing network used for quantum chemistry prediction, which performs very well on small-sample models and consists essentially of three steps: message passing, in which, for each atom, the features of its neighboring elements (atoms or bonds) are propagated into a so-called message vector based on the graph structure; update, in which the embedded atom features are updated by the message vectors; and readout, in which the atom features in the molecule are aggregated to obtain the molecular feature vector.

9. The method of claim 8, wherein the message passing network comprises:

the specific algorithm of the message passing network comprises: first constructing an initial state set with one state for each node in the graph, and then letting each node exchange information with its neighbors by message passing, so that the state of each node includes awareness of its direct neighbors; repeating this step so that each node obtains information on its second-order neighborhood, and so on until the desired number of "message rounds" is reached; then collecting all the context-aware node states and converting them into a feature representing the whole graph, wherein the formula of the node update weight is as follows:

is the hidden state of the node at time t,

Finally updating the atom hidden state g by the message vector_v。
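The message passing, update and readout steps described in claims 8 and 9 can be sketched on a small graph as below. This is a toy illustration under stated assumptions, not the network of the invention: messages are plain sums of neighbor states, the update is a fixed average standing in for a learned update function, bond features are ignored, and the readout is a sum over atoms.

```python
def message_passing(node_feats, edges, rounds=2):
    """node_feats: list of feature vectors; edges: list of undirected (i, j)."""
    neighbors = {i: [] for i in range(len(node_feats))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    dim = len(node_feats[0])
    hidden = [list(vec) for vec in node_feats]  # initial state of every node

    for _ in range(rounds):
        new_hidden = []
        for v in range(len(hidden)):
            # Message: sum of the current hidden states of v's neighbors
            message = [sum(hidden[w][k] for w in neighbors[v]) for k in range(dim)]
            # Update: fixed average, a placeholder for a learned update function
            new_hidden.append([0.5 * (hidden[v][k] + message[k]) for k in range(dim)])
        hidden = new_hidden

    # Readout: aggregate all node states into one graph-level vector
    return [sum(h_v[k] for h_v in hidden) for k in range(dim)]


# Three-atom chain 0-1-2 with two-dimensional toy features
chain_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
chain_edges = [(0, 1), (1, 2)]
graph_vector = message_passing(chain_feats, chain_edges, rounds=2)
```

After one round each node has seen its direct neighbors; after two rounds the end atoms of the chain carry information from each other, which is the second-order neighborhood mentioned in claim 9.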