CN116612810A - Medicine target interaction prediction method based on interaction inference network - Google Patents

Info

Publication number
CN116612810A
Authority
CN
China
Prior art keywords
target
drug
embedding
coding
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310507847.8A
Other languages
Chinese (zh)
Inventor
陈亮
梁晓敏
陈煜其
毛胜东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shantou University
Original Assignee
Shantou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shantou University
Priority to CN202310507847.8A
Publication of CN116612810A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The embodiment of the invention discloses a drug-target interaction (DTI) prediction method based on an interaction inference network. The network comprises an embedding layer, a coding layer, an interaction layer, a feature extraction layer and an output layer. The embedding layer generates sequence embeddings of the drug and target molecules; the coding layer produces encoded molecular representations of the drug and the target; the interaction layer simulates the interaction between the drug and the target; the feature extraction layer extracts interaction features from the interaction matrix; and the output layer produces the prediction result. The method can be applied to predicting relationships between drugs and targets. First, it addresses the problem that the training of conventional DTI prediction models is limited by scarce labeled data sets and cannot make full use of unlabeled data; second, it improves the interpretability of the model's DTI predictions.

Description

Medicine target interaction prediction method based on interaction inference network
Technical Field
The invention relates to the technical field of drug-target relationship prediction, and in particular to a drug-target interaction prediction method based on an interaction inference network.
Background
Developing a new drug takes a long time and is costly, and the process is generally divided into two stages: preclinical research and clinical research. Using existing drugs to treat new diseases is a viable strategy because these "old" drugs have already passed mechanistic studies and clinical trials, which reduces development cost and time.
Drug targets are biological macromolecules, such as proteins and nucleic acids, that interact with drugs. Identifying these targets in advance is important for developing drugs for specific diseases. However, traditional drug discovery methods consider only a single target for a single disease, ignoring the complex interactions between drugs and targets and the fact that many diseases involve multiple targets. Research therefore increasingly focuses on multi-target drugs and on drug combinations that act on multiple targets simultaneously, in order to improve therapeutic effect and to overcome drug resistance and toxic side effects. Although the multiple pharmacological properties of a drug may lead to unexpected side effects, they may also bring new therapeutic effects, which is the basis of so-called drug repositioning.
Drugs generally improve disease symptoms by interacting with intracellular proteins. A large number of compounds can serve as drug candidates, and most drug targets are proteins. At present only a small fraction of the matching relationships between drugs and target proteins is known, and many drug-target interactions remain to be discovered.
Correctly identifying and verifying the interactions between a drug and its targets is critical for discovering new drugs and for repositioning existing ones. However, identifying new drugs and their targets remains very difficult because of the complex relationship between chemical space and proteomic space. Many factors, such as chemical bonding and affinity, affect the interaction between a drug and its target.
Efficient computational prediction methods are therefore needed to detect complex drug-target associations, to improve the understanding of biological interactions and processes, and to provide new candidate drug-target interactions for biological experiments. The main computational prediction methods are molecular docking simulation and machine learning. Although molecular docking simulation is widely accepted in biology, the docking process is time-consuming and requires the three-dimensional structures of the target and the drug, which are difficult to obtain. In contrast, machine learning methods use known drug-target interactions to train predictive models that predict interactions between new drugs and targets. Screening candidate drug-target interactions computationally reduces time and cost; the candidates are then validated through biological experiments.
Disclosure of Invention
The embodiment of the invention provides a drug-target interaction prediction method based on an interaction inference network, which can solve the technical problems that the training of most DTI prediction models is limited by scarce labeled data sets and cannot make full use of unlabeled data, and that machine-learning-based models have poor interpretability in DTI prediction. The method comprises the following steps:
S1: acquiring drug, target and drug-target interaction data from the BIOSNAP, BindingDB and DAVIS databases respectively, to obtain positive sample pair data for the three data sets;
S2: generating negative sample pair data;
S3: organizing the three data sets, using the drug, target and drug-target interaction data (namely the positive sample pair data) together with the generated negative sample pair data, and dividing them into a training set, a validation set and a test set;
S4: generating the embeddings of the drug molecules and the target sequences in the embedding layer and passing them to the coding layer;
S5: obtaining, in the coding layer, the encoded molecular representations of the drug and the target;
S6: simulating, in the interaction layer, the interaction between the drug and the target;
S7: capturing, in the feature extraction layer, the interaction features of the interaction matrix;
S8: finally, feeding the features into a fully connected network to predict the drug-target interaction.
Wherein step S1 comprises: identifying sample pairs with a Kd value < 30 units as positive samples. Step S2 comprises: examining the drugs and targets contained in the BindingDB and DAVIS databases and regarding pairs with a Kd value >= 30 as negative sample pairs, and randomly generating negative sample pairs from the BIOSNAP database.
Wherein step S3 comprises: performing stratified sampling of the positive and negative samples of each data set, and dividing them by the hold-out method into mutually exclusive subsets in a 7:1:2 ratio, serving as the training set, validation set and test set.
Wherein step S4 comprises:
converting the molecular sequences into vector representations in three ways, namely direct encoding of the full molecular sequence, encoding of molecular subsequences, and fingerprint encoding, so that sequence information is integrated into the feature vectors; during embedding, the SMILES expression of the drug molecule and the amino acid sequence of the target are encoded separately to generate the sequence embeddings of the drug and the target, $E_d$ and $E_t$, which are input into the coding layer respectively.
Wherein the step of directly encoding the full molecular sequence comprises: directly encoding the SMILES string and the amino acid sequence, setting the maximum length of the embedding generated from the SMILES descriptor to 100 and the maximum length of the embedding generated from the amino acid sequence to 1000, truncating embeddings longer than the maximum length and padding embeddings shorter than the maximum length with 0, to generate the target and drug embeddings $E_t$ and $E_d$.
The step of encoding the molecular subsequences comprises:
applying the ESPF algorithm to extract substructures of drugs and targets from the UniProt data set and the ChEMBL database, obtaining vocabulary sets V of different sizes by setting different frequency thresholds and data sources, and decomposing the drug and target sequences into sets of subsequences using the frequent subsequences C as the segmentation criterion;
decomposing both the drug and target sequences into substructures and converting them into corresponding embedding matrices, the target content embedding $E^{cont}_{t,i}$ and the drug content embedding $E^{cont}_{d,j}$ being generated through lookup in content embedding matrices according to the following rule:
$$E^{cont}_{t,i} = W^{cont}_t\, m^{t}_i, \qquad E^{cont}_{d,j} = W^{cont}_d\, m^{d}_j$$
wherein $m^{t}_i$ and $m^{d}_j$ are the one-hot vocabulary codes of the i-th target subsequence and the j-th drug subsequence, and $W^{cont}_t$ and $W^{cont}_d$ are lookup matrices formed automatically from the vocabulary size $|V|$ of the decomposed target and drug subsequences, respectively, and the preset embedding length;
the fingerprint encoding step comprises:
using the molecular fingerprint $f$ as the fingerprint embedding $E_f$, and compressing it by flag-bit encoding.
Wherein step S5 comprises:
the coding layer adopts three encoding modes, namely multi-layer perceptron encoding, convolutional neural network encoding and self-attention encoding, and encodes the embeddings by extracting features that incorporate context information:
$$\tilde{E}_d = \mathrm{Encoder}_d(E_d), \qquad \tilde{E}_t = \mathrm{Encoder}_t(E_t)$$
wherein $E_d$ and $E_t$ are the drug and target embeddings, and $\tilde{E}_d$ and $\tilde{E}_t$ denote the encodings of the drug and the target sequence, respectively, after the coding layer.
The multi-layer perceptron encoding comprises:
stacking three fully connected layers whose hidden layers have 1024, 256 and 64 neurons respectively; in each layer the input vector $x$ is multiplied by the connection weights $W$ of the hidden units to obtain the layer output, which is passed through a nonlinear activation function; the fingerprint encoding $\tilde{E}_f$ is obtained after the three layers.
The convolutional neural network encoding comprises:
using a three-layer one-dimensional convolutional neural network as the encoder for the drug and target embeddings; the SMILES string and the amino acid sequence are directly encoded, and the resulting drug and target sequence embeddings $E_d$ and $E_t$ are input into the CNN encoder;
the CNN encoder stacks three one-dimensional convolutional layers; the drug uses convolution kernels of sizes 4, 6 and 8 in successive layers and the target uses kernels of sizes 4, 8 and 12; after the convolutions, the result undergoes one pooling operation and passes through a fully connected layer, finally yielding the encodings of the drug and the target, $\tilde{E}_d$ and $\tilde{E}_t$.
The self-attention encoding comprises:
combining sequence order information with the content embedding through position encoding, forming a sequence embedding that can learn order information; the position encodings of the target and the drug, $E^{pos}_{t,i}$ and $E^{pos}_{d,j}$, are learned: an independent vector is learned for each position and generated through lookup in a position embedding matrix according to the following rule:
$$E^{pos}_{t,i} = W^{pos}_t\, p^{t}_i, \qquad E^{pos}_{d,j} = W^{pos}_d\, p^{d}_j$$
wherein $p^{t}_i$ and $p^{d}_j$ are the one-hot codes of the i-th subsequence of the target and the j-th subsequence of the drug, respectively; $W^{pos}_t$ and $W^{pos}_d$ are two-dimensional lookup embedding matrices formed automatically from the maximum number of subsequences after decomposition of the target and the drug and the preset embedding length, which equals the content embedding length;
the inputs to the self-attention encoder, $E_t$ and $E_d$, are obtained by adding the content embedding and the position embedding.
Wherein step S6 specifically comprises:
for each target subsequence $i$ and each drug subsequence $j$, generating an interaction value:
$$I_{i,j} = F(\tilde{E}_{t,i}, \tilde{E}_{d,j})$$
wherein $\tilde{E}_{t,i}$ and $\tilde{E}_{d,j}$ are the encodings of the i-th target subsequence and the j-th drug subsequence, and $F$ is an aggregation function that measures the interaction between a drug-target pair; the interaction layer outputs the tensor matrix $I$. Using the dot product as the aggregation function yields a single scalar that measures the interaction strength between one pair of minimum drug and target units.
Wherein step S7 specifically comprises: modeling the interactions of adjacent regions among the individual substructures, and extracting features from the interaction map with a convolutional neural network.
Wherein step S8 specifically comprises: the output layer decodes the acquired features by flattening the extracted interaction features into a vector and outputting the prediction result through a linear layer.
The embodiment of the invention has the following beneficial effects. The invention uses an interaction inference network for drug-target interaction prediction, comprising an embedding layer, a coding layer, an interaction layer, a feature extraction layer and an output layer. In the embedding layer, to enrich the feature information carried by the drug and target embeddings, drug molecular fingerprints are used as a supplement, and the ESPF algorithm is used to divide the drug and target sequences into subsequences. In the coding layer, a convolutional neural network encodes the original molecular sequence to capture the features of the individual functional groups of the drug and target sequences, and a Self-Attention encoder encodes the molecular subsequences to capture the associations between functional groups. The interaction layer simulates the interaction process: it uses the dot product as the aggregation function to generate a single scalar that measures the interaction strength between one pair of minimum drug and target units, which makes the model's interaction predictions interpretable. The modular design of the embedding layer and the coding layer makes the interaction inference network highly extensible: different embedding generation modes can be paired with different encoders to obtain different predictive performance and experimental insight.
Drawings
FIG. 1 shows the network architecture of an embodiment of the present invention;
FIG. 2 is a table of statistics for the DAVIS, BindingDB and BIOSNAP data sets;
FIG. 3 is a table showing the allocation of the training, validation and test sets.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.
As shown in fig. 1, a drug target interaction prediction method based on an interaction inference network according to an embodiment of the present invention is implemented by the following steps.
1. Data acquisition and preprocessing
Drug, target and drug-target interaction data are collected from the BIOSNAP, BindingDB and DAVIS databases. The BIOSNAP data used in the invention is based mainly on the MINER DTI data set, which contains information on 4510 drugs and 2181 targets; its DTI pairs are regarded as positive samples. In addition to BIOSNAP, the DAVIS and BindingDB data sets are used. The DAVIS data set contains experimentally measured Kd values for 68 drugs and 379 targets, and BindingDB contains experimentally measured Kd values for 7165 drugs and 1254 targets. DTI pairs with Kd values < 30 units are considered positive samples.
2. Obtaining the negative sample pair data, specifically comprising the following steps:
The drugs and targets contained in the BindingDB and DAVIS databases are examined, and DTI pairs with Kd values >= 30 are regarded as negative samples; negative DTI pairs are also randomly generated for the BIOSNAP database, giving negative DTI pairs for all three data sets, as shown in FIG. 2.
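The labeling rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout `(drug, target, kd)` and both function names are hypothetical, and random negatives are drawn here by enumerating all unobserved drug-target pairs, which is only practical for small examples.

```python
import random

KD_THRESHOLD = 30  # threshold from the text; units as in the source data


def label_by_kd(records):
    """Label (drug, target, kd) records: 1 if kd < 30, otherwise 0."""
    return [(d, t, 1 if kd < KD_THRESHOLD else 0) for d, t, kd in records]


def sample_random_negatives(positives, n, seed=0):
    """Draw up to n drug-target pairs that are absent from the known positives."""
    rng = random.Random(seed)
    drugs = sorted({d for d, _ in positives})
    targets = sorted({t for _, t in positives})
    known = set(positives)
    # Every pairing of a known drug with a known target that is not a positive.
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    return rng.sample(candidates, min(n, len(candidates)))
```

The first function corresponds to the BindingDB and DAVIS rule, the second to the random generation described for BIOSNAP, where no Kd values are available.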
3. Organizing the data of the three data sets, specifically comprising the following steps:
The positive and negative samples of each data set are sampled in a stratified manner and divided by the hold-out method into mutually exclusive subsets in a 7:1:2 ratio, serving as the training, validation and test sets; the allocation is shown in FIG. 3. Each experiment is run 5 times independently.
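A stratified 7:1:2 hold-out split of this kind might look like the sketch below. The function name, the `(pair, label)` sample layout and the rounding of subset sizes are assumptions made for illustration; the source does not specify them.

```python
import random


def stratified_split(samples, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split (pair, label) samples into train/val/test, preserving label ratios."""
    rng = random.Random(seed)
    splits = ([], [], [])
    for label in sorted({y for _, y in samples}):
        # Shuffle each class separately so every subset keeps the class balance.
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_val = round(ratios[1] * len(group))
        splits[0].extend(group[:n_train])
        splits[1].extend(group[n_train:n_train + n_val])
        splits[2].extend(group[n_train + n_val:])
    return splits
```

Repeating the call with five different seeds would give five independent runs of the kind the text mentions.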
4. Generating the sequence embeddings of the drug and target molecules in the embedding layer, specifically comprising the following steps:
The molecular sequences are converted into vector representations in three ways: direct encoding of the full molecular sequence, encoding of molecular subsequences, and fingerprint encoding (Morgan fingerprint and PubChem fingerprint encoding), so that sequence information is integrated into the feature vectors. During embedding, the SMILES strings and amino acid sequences are encoded separately to generate the sequence embeddings of the drug and the target, $E_d$ and $E_t$, which are input into the coding layer respectively. The specific steps of the three encoding modes are as follows:
(1) Direct coding of the complete sequence of a molecule
Because SMILES strings and amino acid sequences are both written as strings of alphabetic symbols, they can be encoded directly. The maximum length of the embedding generated from a SMILES descriptor is set to 100 and that of the embedding generated from an amino acid sequence to 1000; embeddings longer than the maximum are truncated and shorter ones are padded with '0', giving the target and drug embeddings $E_t$ and $E_d$.
In this experiment, SMILES strings are represented with 63 numeric labels and amino acid sequences with 21. Each SMILES symbol and each amino acid is mapped to a fixed label. For example, if 'C', 'H' and 'N' are denoted by '1', '2' and '3', with 'O' denoted by '5' and '=' by '63', the SMILES string 'CN=C=O' is represented as [C N = C = O] = [1 3 63 1 63 5].
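A minimal sketch of this direct encoding is given below. The vocabulary is a hypothetical fragment containing only the five labels that appear in the example, and the encoder works character by character, which is an assumption (multi-character SMILES tokens are not handled).

```python
SMILES_MAX_LEN = 100    # maximum drug embedding length stated in the text
PROTEIN_MAX_LEN = 1000  # maximum target embedding length stated in the text

# Hypothetical partial vocabulary, chosen only to match the example in the text.
SMILES_VOCAB = {"C": 1, "H": 2, "N": 3, "O": 5, "=": 63}


def encode_sequence(seq, vocab, max_len):
    """Map each character to its integer label, then truncate or zero-pad."""
    ids = [vocab[ch] for ch in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))
```

Calling `encode_sequence("CN=C=O", SMILES_VOCAB, SMILES_MAX_LEN)` gives a list of length 100 that starts with 1, 3, 63, 1, 63, 5 and is zero elsewhere.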
(2) Molecular subsequence encoding
The invention uses the ESPF algorithm to extract substructures of drugs and targets from the UniProt data set and the ChEMBL database, and obtains vocabulary sets V of different sizes by setting different frequency thresholds (the minimum frequency with which a single decomposed subsequence occurs in the data set) and data sources. The drug and target sequences are decomposed into sets of subsequences using the frequent subsequences C as the segmentation criterion, encoded as bit vectors, and used to generate embeddings. Because the decomposed SMILES descriptors and amino acid sequences are relatively short, the smaller frequency thresholds are chosen for drugs (100) and targets (500) in order to retain more of the original information. In this experiment, the maximum length of a drug sequence is 50 subsequences and the maximum length of a target sequence is 545 subsequences. Both the target sequence and the drug sequence are decomposed into substructures, which are converted into corresponding embedding matrices; the target content embedding $E^{cont}_{t,i}$ and the drug content embedding $E^{cont}_{d,j}$ are generated through lookup in content embedding matrices. The generation rule is as follows:
$$E^{cont}_{t,i} = W^{cont}_t\, m^{t}_i, \qquad E^{cont}_{d,j} = W^{cont}_d\, m^{d}_j$$
wherein $m^{t}_i$ and $m^{d}_j$ are the one-hot vocabulary codes of the i-th target subsequence and the j-th drug subsequence, and $W^{cont}_t$ and $W^{cont}_d$ are lookup matrices formed automatically from the vocabulary size $|V|$ of the decomposed target and drug subsequences, respectively, and the preset embedding length.
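The decomposition into vocabulary subsequences can be illustrated with the sketch below. It is not the ESPF algorithm itself: a greedy longest-match split against a given vocabulary stands in for it, and the function names, the `max_sub_len` limit and the use of index 0 for padding and unknown pieces are all assumptions.

```python
def segment(seq, vocab, max_sub_len=8):
    """Greedy longest-match split of seq into subsequences found in vocab.

    Characters not covered by any vocabulary entry become single-character pieces.
    """
    pieces, i = [], 0
    while i < len(seq):
        for size in range(min(max_sub_len, len(seq) - i), 0, -1):
            piece = seq[i:i + size]
            if size == 1 or piece in vocab:
                pieces.append(piece)
                i += size
                break
    return pieces


def to_indices(pieces, vocab, max_len):
    """Look up vocabulary indices (0 = padding or unknown), truncate or pad."""
    ids = [vocab.get(p, 0) for p in pieces[:max_len]]
    return ids + [0] * (max_len - len(ids))
```

With a maximum length of 50 for drugs and 545 for targets, as in the text, the index lists returned by `to_indices` are what a content embedding matrix would then look up row by row.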
(3) Fingerprint coding
Fingerprint encoding is of two types: direct fingerprint embedding and fingerprint flag-bit encoding. Because a molecular fingerprint $f$ is a binary vector of '0's and '1's, it can be used directly as the fingerprint embedding $E_f$. However, molecular fingerprints are long and sparse in information, so they can be compressed by flag-bit encoding: all flag bits $f_i$ with value 1 in the fingerprint are identified and the flag bits with value 0 are ignored; the index $i$ of each set bit is extracted according to its original order in the fingerprint, and the indices, in extraction order, form a shorter vector that is used as the embedding $E_f$.
Because the number of set flag bits differs from drug to drug, the resulting vectors $E_f$ differ in length, so a maximum length must be set. Based on the length distributions of the vectors encoded from Morgan fingerprints and PubChem fingerprints, the maximum embedding length is set to 76 for Morgan fingerprint encoding and 250 for PubChem fingerprint encoding. Encoded embeddings shorter than the maximum length are padded with '0'.
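Flag-bit compression of a fingerprint can be sketched as below. The function name is hypothetical, and positions are numbered from 1 as an assumption so that the padding value 0 cannot be confused with a real bit position; the source does not state the indexing convention.

```python
def flag_bit_encode(fingerprint, max_len):
    """Keep the 1-based positions of set bits, in order; truncate or zero-pad."""
    positions = [i + 1 for i, bit in enumerate(fingerprint) if bit == 1]
    positions = positions[:max_len]
    return positions + [0] * (max_len - len(positions))
```

For the toy fingerprint `[0, 1, 0, 0, 1, 1, 0, 0]` and a maximum length of 5 this yields `[2, 5, 6, 0, 0]`; with the settings in the text, `max_len` would be 76 for Morgan and 250 for PubChem fingerprints.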
5. The coding layer obtains coded molecular characterization of the drug and the target, which specifically comprises the following steps:
The coding layer adopts three encoding modes, namely multi-layer perceptron (MLP) encoding, convolutional neural network (CNN) encoding and self-attention (Self-Attention) encoding, and encodes the embeddings by extracting features that incorporate context information:
$$\tilde{E}_d = \mathrm{Encoder}_d(E_d), \qquad \tilde{E}_t = \mathrm{Encoder}_t(E_t)$$
wherein $E_d$ and $E_t$ are the drug and target embeddings, and $\tilde{E}_d$ and $\tilde{E}_t$ denote the encodings of the drug and the target sequence, respectively, after the coding layer.
The three coding modes, namely multi-layer perceptron (MLP) coding, convolutional Neural Network (CNN) coding and Self-Attention mechanism (Self-Attention) coding, comprise the following specific steps:
(1) Multi-layer perceptron (MLP) coding
An embedding generated directly from a fingerprint is simple in form, context-free and nonlinear. The embedding $E_f$ generated directly from the drug molecular fingerprint is therefore input into an MLP encoder.
The MLP encoder stacks three fully connected layers whose hidden layers have 1024, 256 and 64 neurons respectively. In each layer, the input vector $x$ is multiplied by the connection weights $W$ of the hidden units, and the result is passed through a nonlinear activation function. The fingerprint encoding $\tilde{E}_f$ is obtained after the three layers.
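The forward pass of such a three-layer encoder can be sketched in plain Python as follows. Only the layer widths come from the text; the ReLU activation, the zero biases, the random weight range and the function names are assumptions, and the weights here are untrained.

```python
import random

HIDDEN_SIZES = (1024, 256, 64)  # hidden layer widths stated in the text


def init_mlp(input_dim, sizes=HIDDEN_SIZES, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    layers, fan_in = [], input_dim
    for fan_out in sizes:
        weights = [[rng.uniform(-0.05, 0.05) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
        fan_in = fan_out
    return layers


def mlp_forward(x, layers):
    """Apply each layer as relu(W x + b); ReLU is an assumed activation."""
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x
```

Whatever the length of the input fingerprint, the output is a 64-dimensional vector, which is what allows it to be concatenated with other drug encodings downstream.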
(2) Convolutional Neural Network (CNN) coding
The embedding generated by directly encoding full SMILES and amino acid sequences can reach 1000 units in length. A fully connected neural network could be used for encoding, but it tends to overfit and does not reflect the fact that drugs and targets interact through groups. A convolutional neural network connects neurons in adjacent layers through convolution kernels and shares parameters; it can extract local features of molecules, reduce the amount of data to be processed while retaining useful information, and is therefore better suited to this encoding.
The experiment uses a three-layer one-dimensional convolutional neural network as the encoder for the drug and target embeddings. The SMILES strings and amino acid sequences are directly encoded, and the resulting drug and target sequence embeddings $E_d$ and $E_t$ are input into the CNN encoder.
The CNN encoder stacks three one-dimensional convolutional layers; the drug uses convolution kernels of sizes 4, 6 and 8 in successive layers, and the target uses kernels of sizes 4, 8 and 12. After the convolutions, the result undergoes one pooling operation and passes through a fully connected layer, finally yielding the encodings of the drug and the target, $\tilde{E}_d$ and $\tilde{E}_t$.
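As a worked example of how the stacked kernels shorten the sequence, the sketch below computes the length after each convolution. Stride 1 and no padding are assumptions, since the source gives only the kernel sizes; the function name is hypothetical.

```python
DRUG_KERNELS = (4, 6, 8)     # kernel sizes per layer for drugs (from the text)
TARGET_KERNELS = (4, 8, 12)  # kernel sizes per layer for targets (from the text)


def conv_output_lengths(input_len, kernels, stride=1, padding=0):
    """Sequence length after each stacked 1-D convolution."""
    lengths, length = [], input_len
    for k in kernels:
        length = (length + 2 * padding - k) // stride + 1
        lengths.append(length)
    return lengths
```

Under these assumptions a drug embedding of length 100 shrinks to 97, 92 and then 85 positions, and a target embedding of length 1000 to 997, 990 and then 979, before the pooling operation reduces each to a fixed-size vector.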
(3) Self-Attention (Self-Attention) coding
The computation time and memory consumption of the Self-Attention encoder grow roughly quadratically with the input size, so using embeddings generated by directly encoding the full molecular structure as input is computationally infeasible. Embeddings generated by molecular subsequence encoding are relatively short, and a Self-Attention-based encoder can capture the chemical semantics and context of drug and target substructures, so the embeddings generated by subsequence encoding are used as the input of the Self-Attention encoder.
The embeddings generated from the sequentially decomposed subsequences of the drug and the target do not contain the order information of the molecular sequence. The Self-Attention encoder therefore uses position encoding (Positional Encoding) to combine order information with the content embedding, forming a sequence embedding that can learn order information. The position encodings of the target and the drug, $E^{pos}_{t,i}$ and $E^{pos}_{d,j}$, are learned: an independent vector is learned for each position and generated through lookup in a position embedding matrix. The generation rule is as follows:
$$E^{pos}_{t,i} = W^{pos}_t\, p^{t}_i, \qquad E^{pos}_{d,j} = W^{pos}_d\, p^{d}_j$$
wherein $p^{t}_i$ and $p^{d}_j$ are the one-hot codes of the i-th subsequence of the target and the j-th subsequence of the drug, respectively; $W^{pos}_t$ and $W^{pos}_d$ are two-dimensional lookup embedding matrices formed automatically from the maximum number of subsequences after decomposition of the target and the drug and the preset embedding length, which equals the length of the content embedding.
Finally, the inputs of the Self-Attention encoder, $E_t$ and $E_d$, are obtained by adding the content embedding and the position embedding; to keep the embedding generated from the fingerprint consistent with these position-aware embeddings, $E_f$ is obtained by adding a constant to its content embedding:
$$E_{t,i} = E^{cont}_{t,i} + E^{pos}_{t,i}, \qquad E_{d,j} = E^{cont}_{d,j} + E^{pos}_{d,j}, \qquad E_f = E^{cont}_f + c$$
wherein $E^{cont}_{t,i}$ and $E^{cont}_{d,j}$ denote the content embeddings of the i-th target subsequence and the j-th drug subsequence, $E^{cont}_f$ denotes the content embedding of the fingerprint, and $c$ is a constant.
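The addition of content and position embeddings can be sketched as two table lookups followed by an element-wise sum. Indexing a table row is equivalent to multiplying the table by a one-hot vector. The random tables stand in for learned matrices, and all names and sizes here are hypothetical.

```python
import random


def make_table(rows, dim, rng):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, content_table, position_table):
    """Add the content row of each token to the row for its position."""
    return [[c + p for c, p in zip(content_table[tok], position_table[pos])]
            for pos, tok in enumerate(token_ids)]
```

Because the position row differs for every index, the same subsequence appearing at two places in a molecule receives two different input vectors, which is how order information reaches the encoder.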
The generated embedding, which now contains position information, is input into the Self-Attention encoder. The encoder consists of N identical layers, each made up of two sub-layers, the first of which is multi-head self-attention. Multi-head self-attention applies several different linear transformations to the queries, keys and values $Q$, $K$ and $V$, computes attention for each, and concatenates the different attention results; using multiple heads lets the model attend to different positions. The invention uses a 12-head self-attention mechanism:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{12})\, W^{O}, \qquad \mathrm{head}_h = \mathrm{Attention}(Q W^{Q}_h, K W^{K}_h, V W^{V}_h)$$
wherein $W^{Q}_h$, $W^{K}_h$, $W^{V}_h$ and $W^{O}$ are learned projection matrices.
The output of the self-attention layer is then fed into a feed-forward neural network. Both the self-attention layer and the feed-forward layer are followed by a sub-layer that performs splicing and normalization. Finally the encodings of the drug and the target, $\tilde{E}_d$ and $\tilde{E}_t$, are obtained.
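The core of each attention head is scaled dot-product attention, sketched below for a single head. This is a deliberate simplification: the learned query, key and value projections, the 12 heads, the feed-forward sub-layer and the normalization are all omitted, so the input is used directly as queries, keys and values.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x."""
    dim = len(x[0])
    # n x n score matrix: this is where the quadratic cost in input length arises.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
              for q in x]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v[c] for w, v in zip(row, x)) for c in range(dim)]
            for row in weights]
```

The score matrix has one entry per pair of positions, so an input of 1000 positions needs a million scores per head while an input of 50 subsequences needs only 2500, which illustrates why the shorter subsequence embeddings are the ones fed to this encoder.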
6. The interaction layer carries out interaction simulation of the drug and the target, and the method specifically comprises the following steps:
To simulate the interaction between the drug and the target, an interaction value is generated for each target subsequence $i$ and each drug subsequence $j$:
$$I_{i,j} = F(\tilde{E}_{t,i}, \tilde{E}_{d,j})$$
wherein $\tilde{E}_{t,i}$ and $\tilde{E}_{d,j}$ are the encodings of the i-th target subsequence and the j-th drug subsequence, and $F$ is an aggregation function used to measure the interaction between a drug-target pair; the dot product was used in the experiments. The interaction layer outputs the tensor matrix $I$, and each entry of this interaction table accounts for the interaction between a single minimum unit of the target and a single minimum unit of the drug. Using the dot product as the aggregation function yields a single scalar that measures the interaction strength between one pair of minimum drug and target units.
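With the dot product as the aggregation function, the interaction table is simply the matrix of pairwise dot products between target and drug subsequence encodings. A minimal sketch, with a hypothetical function name and toy two-dimensional encodings:

```python
def interaction_map(target_enc, drug_enc):
    """I[i][j] = dot product of target subsequence i and drug subsequence j."""
    return [[sum(a * b for a, b in zip(t, d)) for d in drug_enc]
            for t in target_enc]
```

For three target encodings `[[1, 0], [0, 2], [1, 1]]` and two drug encodings `[[3, 4], [1, -1]]` the table is `[[3, 1], [8, -2], [7, 0]]`, one row per target subsequence and one column per drug subsequence.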
7. The feature extraction layer extracts the interaction features of the interaction matrix, and the method specifically comprises the following steps:
The interaction layer outputs a two-dimensional interaction table I, on which feature extraction is then performed in the downstream layer. The higher a value in the interaction table, the higher the likelihood of a drug-target interaction (DTI). A higher dot product at a given position in the interaction table indicates that the corresponding units do interact. Visualizing the interaction table makes it easier to see which units contribute to the final result and clarifies, as far as possible, the internal workings of the drug-target interaction prediction model, which provides strong support for biological interpretation.
Adjacent minimal units in the drug sequence and the target sequence may affect one another and trigger interactions. It is therefore desirable to model the interaction of adjacent regions, not only of individual pairs. To this end, features are extracted from the interaction table I with a convolutional neural network. By using quantitative, order-invariant convolution kernels, the convolution layer can capture and aggregate interactions between neighboring units and output a matrix O.
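To make the neighborhood aggregation concrete, the sketch below slides a small kernel over an interaction table without padding. It is only an illustration: in a trained model the kernel weights are learned, whereas here a fixed all-ones kernel is assumed, and the function name `conv2d_valid` is introduced for this example.

```python
def conv2d_valid(table, kernel):
    """Slide the kernel over the table (stride 1, no padding) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(table) - kernel_h + 1
    out_w = len(table[0]) - kernel_w + 1
    return [
        [
            sum(
                table[r + a][c + b] * kernel[a][b]
                for a in range(kernel_h)
                for b in range(kernel_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 3x3 interaction table and a 2x2 all-ones kernel.
interaction = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
features = conv2d_valid(interaction, [[1, 1], [1, 1]])
# features == [[12, 16], [24, 28]]
```

Each output value pools a 2x2 block of neighboring target-drug unit pairs, which is the sense in which the convolution captures interactions between adjacent regions.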
8. Finally, obtaining a prediction result, wherein the method specifically comprises the following steps:
The output layer decodes the acquired features: the extracted interaction features are flattened into a vector, and the prediction result is output through a linear layer.
Here the parameters are the weight matrix and the bias of the fully connected layer, and the reference value is the true label of the drug-target interaction pair.
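The decoding step can be sketched as below. The text names only a linear layer, its weight matrix and bias, and a true label; the sigmoid output and the binary cross-entropy loss used here are assumptions that are common for binary interaction prediction, not something the source confirms. All function names are illustrative.

```python
import math


def flatten(matrix):
    """Tile a feature matrix into a single vector, row by row."""
    return [value for row in matrix for value in row]


def predict(features, weights, bias):
    """Linear layer followed by a sigmoid (the sigmoid is an assumption)."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(probability, label, eps=1e-12):
    """Assumed training loss comparing the prediction with the true label."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


vector = flatten([[0.0, 0.0], [0.0, 0.0]])
probability = predict(vector, weights=[1.0, 1.0, 1.0, 1.0], bias=0.0)
loss = binary_cross_entropy(probability, label=1)
```

With an all-zero feature vector and zero bias, the logit is 0, so the predicted probability is 0.5.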
In the embodiment of the invention, the drug-target interaction prediction model based on the interaction inference network comprises five algorithms with different combinations: (1) the embedding is generated by direct encoding of the full molecular sequence and encoded with the CNN encoder for prediction; (2) the embedding is generated by subsequence encoding and encoded with the Self-Attention encoder for prediction; (3) the embedding is generated by direct encoding of the PubChem fingerprint, a drug encoding is generated with the MLP encoder, and prediction is made after concatenating it with the drug encoding generated in (1); (4) the embedding is generated by direct encoding of the Morgan fingerprint, a drug encoding is generated with the MLP encoder, and prediction is made after concatenating it with the drug encoding generated in (1); (5) the embedding is generated by flag-bit encoding of the PubChem fingerprint, a drug encoding is generated with the Self-Attention encoder, and prediction is made after concatenating it with the drug encoding generated in (2).
The above disclosure is only a preferred embodiment of the present invention and does not limit the scope of the invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope of the present invention.

Claims (10)

1. A drug-target interaction prediction method based on an interaction inference network, characterized by comprising the following steps:
S1: acquiring data on drugs, targets and drug-target interactions from the BIOSNAP, BindingDB and DAVIS databases respectively, and obtaining positive sample pair data for the three data sets;
S2: acquiring and generating negative sample pair data;
S3: integrating the three data sets according to the drug, target and drug-target interaction data and the generated negative sample pair data, to construct a training set, a validation set and a test set;
S4: generating embeddings of the drug molecules and the target sequences in the embedding layer and passing them respectively through the coding layer;
S5: obtaining, in the coding layer, the encoded molecular representations of the drug and the target;
S6: simulating the interaction of the drug and the target in the interaction layer;
S7: capturing the interaction features of the interaction matrix in the feature extraction layer;
S8: finally, feeding a fully connected network to predict the drug-target interaction.
2. The drug-target interaction prediction method based on an interaction inference network of claim 1, wherein S1 comprises the step of: identifying a sample pair having a Kd value < 30 units as a positive sample; and S2 comprises the steps of: examining the drugs and targets contained in the BindingDB and DAVIS databases and regarding pairs with a Kd value >= 30 as negative sample pairs, and randomly generating negative sample pairs from the BIOSNAP database.
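The labeling rule in claim 2 can be illustrated with the sketch below. The threshold of 30 comes from the claim; the helper names, the seed, and the choice to draw random negatives only from drug-target pairs not already known as positives are assumptions made for this example.

```python
import random

KD_THRESHOLD = 30.0  # pairs below this Kd value are positives, per claim 2


def label_by_kd(kd_value):
    """Return 1 for a positive pair (Kd < 30) and 0 for a negative pair (Kd >= 30)."""
    return 1 if kd_value < KD_THRESHOLD else 0


def sample_random_negatives(drugs, targets, positive_pairs, count, seed=0):
    """Randomly pick drug-target pairs that are not listed as positives."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    return rng.sample(candidates, min(count, len(candidates)))


# Hypothetical identifiers standing in for database entries.
positives = [("drug_a", "target_x"), ("drug_b", "target_y")]
negatives = sample_random_negatives(
    ["drug_a", "drug_b"], ["target_x", "target_y"], positives, count=2
)
```

The Kd rule applies to data sets that report affinities, while random sampling is used where only positive pairs are recorded.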
3. The drug-target interaction prediction method based on an interaction inference network of claim 2, wherein S3 comprises the step of: performing stratified sampling of the positive and negative samples of the data set and dividing them, by the hold-out method, into mutually exclusive subsets in the ratio 7:1:2 to serve as the training set, validation set and test set.
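A stratified 7:1:2 hold-out split of the kind described in claim 3 could look like the sketch below. Splitting each class separately is one straightforward way to stratify; the function name, the rounding rule and the seed are assumptions made for this example.

```python
import random


def stratified_holdout(pairs, labels, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split each class separately so every subset keeps the class proportions."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for cls in sorted(set(labels)):
        members = [(p, y) for p, y in zip(pairs, labels) if y == cls]
        rng.shuffle(members)
        n_train = round(len(members) * ratios[0])
        n_valid = round(len(members) * ratios[1])
        train += members[:n_train]
        valid += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Hypothetical data set: 10 negative and 10 positive pairs, identified by index.
train, valid, test = stratified_holdout(list(range(20)), [0] * 10 + [1] * 10)
```

Because every sample lands in exactly one slice of its class list, the three subsets are mutually exclusive by construction.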
4. The drug-target interaction prediction method based on an interaction inference network of claim 3, wherein S4 comprises the steps of:
converting the molecular sequences into vector representations using three modes (direct encoding of the full molecular sequence, encoding of molecular subsequences, and fingerprint encoding) so that the sequence information is integrated into a feature vector; during embedding, the SMILES expression of the drug molecule and the amino acid sequence of the target are encoded separately to generate the sequence embeddings of the drug and the target, which are input into the coding layer separately.
5. The drug-target interaction prediction method based on an interaction inference network of claim 4, wherein the step of directly encoding the full molecular sequence comprises: directly encoding the SMILES and the amino acid sequence, with the maximum embedding length set to 100 for the SMILES descriptor and to 1000 for the amino acid sequence; embeddings longer than the maximum length are truncated and those shorter are padded with 0, generating the target and drug embeddings;
the step of encoding the molecular subsequences comprises:
extracting the substructures of drugs and targets from the UniProt data set and the ChEMBL database, obtaining vocabulary sets V of different scales with the ESPF algorithm by setting different frequency thresholds and data sources, and decomposing the drug and target sequences into a set of subsequences using the frequent subsequences C as the segmentation standard;
decomposing both the drug and target sequences into substructures, converting them into the corresponding embedding matrices, and generating the corresponding target embedding and drug embedding through a queryable content embedding matrix;
in the generation rule, the two content embedding matrices are queryable matrices formed automatically from the vocabulary size of the subsequences obtained by decomposing the target and the drug, respectively, and from the preset embedding length;
the fingerprint encoding step comprises:
using the molecular fingerprint as the fingerprint embedding and compressing it by means of flag-bit encoding.
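The truncate-or-pad rule for direct full-sequence encoding in claim 5 can be sketched as follows. The maximum lengths 100 and 1000 are taken from the claim; the character-level tokenization (which would split multi-character SMILES atoms such as "Cl"), the mapping of unknown symbols to 0, and the helper names are simplifying assumptions.

```python
MAX_DRUG_LENGTH = 100     # maximum SMILES length, per claim 5
MAX_TARGET_LENGTH = 1000  # maximum amino acid sequence length, per claim 5


def build_vocabulary(symbols):
    """Map each symbol to an integer starting at 1; 0 is reserved for padding."""
    return {symbol: index for index, symbol in enumerate(sorted(set(symbols)), start=1)}


def encode_fixed_length(sequence, vocabulary, max_length):
    """Truncate sequences that are too long and pad short ones with 0."""
    ids = [vocabulary.get(symbol, 0) for symbol in sequence[:max_length]]
    return ids + [0] * (max_length - len(ids))


# Hypothetical three-symbol vocabulary for illustration.
vocabulary = build_vocabulary("CNO")
short_code = encode_fixed_length("CCO", vocabulary, 5)
# short_code == [1, 1, 3, 0, 0]
```

A fixed output length lets drugs and targets of different sizes be batched into the same encoder input shape.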
6. The drug-target interaction prediction method based on an interaction inference network of claim 5, wherein S5 comprises the steps of:
the coding layer adopts three encoding modes (multi-layer perceptron encoding, convolutional neural network encoding, and self-attention mechanism encoding) and encodes the embeddings by extracting features that incorporate context information,
yielding the encodings of the drug and of the target sequence after the coding layer.
7. The drug-target interaction prediction method based on an interaction inference network of claim 6, wherein
the multi-layer perceptron encoding method comprises:
stacking three fully connected neural network layers with 1024, 256 and 64 neurons as hidden layers respectively; in each layer the input vector is multiplied by the connection weights to the hidden units to obtain the layer output, and after the three layers of neurons a nonlinear activation function yields the fingerprint encoding;
the convolutional neural network encoding method comprises:
using a three-layer one-dimensional convolutional neural network as the encoder for the drug and target embeddings, directly encoding the SMILES and the amino acid sequence respectively, and inputting the generated sequence embeddings of the drug and the target into the CNN encoder;
the convolutional neural network encoder stacks three one-dimensional convolutional layers, the drug using convolution kernels of sizes 4, 6 and 8 by layer and the target using kernels of sizes 4, 8 and 12 by layer; after the convolution operations, the result undergoes one pooling operation and passes through a fully connected layer, finally yielding the encodings of the drug and the target;
the self-attention mechanism encoding method comprises:
combining order information with the content embedding through position encoding, to form a sequence embedding capable of learning order information; the position encodings of the drug and the target are learned position encodings, in which an independent vector is learned for each position and generated through a queryable position embedding matrix;
in the generation rule, the two one-hot vectors are the one-hot encodings of the i-th target subsequence and the j-th drug subsequence, respectively, and the two position embedding matrices are queryable two-dimensional embedding matrices formed automatically from the maximum subsequence length after decomposition of the drug and the target and from the preset embedding length, the preset embedding length being equal to the content embedding length;
wherein the inputs of the self-attention mechanism encoder are obtained by adding the content embedding and the position embedding.
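The multi-layer perceptron encoder of claim 7 can be sketched as three stacked fully connected layers with 1024, 256 and 64 units. The layer sizes come from the claim; the ReLU activation applied after every layer, the zero biases, the random weights and the 16-bit example fingerprint are assumptions made purely for illustration.

```python
import random


def relu(vector):
    return [max(0.0, value) for value in vector]


def make_mlp(input_dim, hidden_sizes=(1024, 256, 64), seed=0):
    """Build random stand-ins for the learned weights of each fully connected layer."""
    rng = random.Random(seed)
    layers, fan_in = [], input_dim
    for fan_out in hidden_sizes:
        weights = [[rng.gauss(0.0, 0.05) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
        fan_in = fan_out
    return layers


def mlp_encode(fingerprint, layers):
    """Multiply by each layer's weights, add the bias, and apply the activation."""
    activation = list(fingerprint)
    for weights, biases in layers:
        activation = relu(
            [sum(w * x for w, x in zip(row, activation)) + b for row, b in zip(weights, biases)]
        )
    return activation


# Hypothetical 16-bit fingerprint; real fingerprints are much longer.
fingerprint_code = mlp_encode([1.0, 0.0] * 8, make_mlp(input_dim=16))
```

The final layer fixes the fingerprint encoding at 64 values regardless of the fingerprint length.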
8. The drug-target interaction prediction method based on an interaction inference network of claim 1, wherein S6 specifically comprises the steps of:
generating an interaction value for each target subsequence i and each drug subsequence j by applying an aggregation function F,
where F, as the aggregation function, measures the drug-target pair interaction; a tensor matrix I is obtained after the interaction layer, and using the dot product as the aggregation function produces a single scalar that measures the interaction strength of one pair of drug and target minimal units.
9. The drug-target interaction prediction method based on an interaction inference network of claim 1, wherein S7 specifically comprises the steps of: modeling the interaction of adjacent regions among the individual substructures, and performing feature extraction on the interaction map using a convolutional neural network.
10. The drug-target interaction prediction method based on an interaction inference network of claim 1, wherein S8 specifically comprises the steps of: the output layer decodes the acquired features, flattening the extracted interaction features into a vector and outputting the prediction result through a linear layer.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310507847.8A CN116612810A (en) 2023-05-08 2023-05-08 Medicine target interaction prediction method based on interaction inference network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310507847.8A CN116612810A (en) 2023-05-08 2023-05-08 Medicine target interaction prediction method based on interaction inference network

Publications (1)

Publication Number Publication Date
CN116612810A true CN116612810A (en) 2023-08-18

Family

ID=87673893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310507847.8A Pending CN116612810A (en) 2023-05-08 2023-05-08 Medicine target interaction prediction method based on interaction inference network

Country Status (1)

Country Link
CN (1) CN116612810A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198426A (en) * 2023-11-06 2023-12-08 武汉纺织大学 Multi-scale medicine-medicine response interpretable prediction method and system
CN117198426B (en) * 2023-11-06 2024-01-30 武汉纺织大学 Multi-scale medicine-medicine response interpretable prediction method and system

Similar Documents

Publication Publication Date Title
Zhao et al. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism
Ahmad et al. Deep-AntiFP: Prediction of antifungal peptides using distanct multi-informative features incorporating with deep neural networks
CN111312329B (en) Transcription factor binding site prediction method based on deep convolution automatic encoder
Soleymani et al. Protein–protein interaction prediction with deep learning: A comprehensive review
CN111063393B (en) Prokaryotic acetylation site prediction method based on information fusion and deep learning
CN108062556B (en) Drug-disease relationship identification method, system and device
Wang et al. Incorporating deep learning with word embedding to identify plant ubiquitylation sites
Nguyen et al. Deep learning for metagenomic data: using 2d embeddings and convolutional neural networks
CN116072227B (en) Marine nutrient biosynthesis pathway excavation method, apparatus, device and medium
CN113571125A (en) Drug target interaction prediction method based on multilayer network and graph coding
CN116612810A (en) Medicine target interaction prediction method based on interaction inference network
CN114743600A (en) Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity
Sonsare et al. Investigation of machine learning techniques on proteomics: A comprehensive survey
CN114358169B (en) Colorectal cancer detection system based on XGBoost
Yu et al. Perturbnet predicts single-cell responses to unseen chemical and genetic perturbations
CN113450870B (en) Matching method and system of medicine and target protein
Connell et al. A single-cell gene expression language model
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
Lu et al. The application of deep learning in the prediction of HIV-1 protease cleavage site
Iraji et al. Druggable protein prediction using a multi-canal deep convolutional neural network based on autocovariance method
Thakur et al. RNN-CNN Based Cancer Prediction Model for Gene Expression
Alzubaidi et al. Deep mining from omics data
Liu et al. Incorporating FPConv-DTI deep learning network and borderline-SMOTE algorithm for predicting drug-target interactions
Vigil et al. DNA Sequencing Using Machine Learning Algorithms
Tong A Comprehensive Comparison of Neural Network-Based Feature Selection Methods in Biological Omics Datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination