CN116206775A - Multi-dimensional characteristic fusion medicine-target interaction prediction method - Google Patents
Multi-dimensional characteristic fusion medicine-target interaction prediction method
- Publication number
- CN116206775A (application number CN202310038717.4A)
- Authority
- CN
- China
- Prior art keywords
- drug
- target
- information
- interaction
- drugs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/40—ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/10—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medicinal Chemistry (AREA)
- Theoretical Computer Science (AREA)
- Epidemiology (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Toxicology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Pharmacology & Pharmacy (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a drug-target interaction prediction method that fuses multi-dimensional features. Information on interacting drugs and targets, together with related diseases and drug side effects, is extracted from medical databases and used to construct a heterogeneous network. A heterogeneous graph attention neural network extracts the topological structure features of the heterogeneous network. Each drug's SMILES molecular sequence is represented as a molecular graph, from which drug feature information is extracted; target feature information is extracted at the same time. The extracted drug and target feature information is incorporated into the message passing process of the heterogeneous graph attention neural network for training, the model is saved, and the relationships between drugs and targets are predicted. The invention makes effective use of the biological feature information of drugs and targets, achieves higher accuracy when predicting drug-target relationships, improves the efficiency and precision of drug-target relationship verification, shortens the drug development cycle, and greatly reduces the cost of new drug development.
Description
Technical Field
The present invention relates to the technical field of medical artificial intelligence, and in particular to a drug-target interaction prediction method that fuses multi-dimensional features.
Background Art
New drug development is a long and expensive process. It usually takes 10 to 17 years from the initial idea to market launch, with an investment of between US$700 million and US$2.7 billion. Finding new indications for existing drugs has the advantages of low development cost and short development time, so repurposing existing drugs to treat common and rare diseases has become increasingly attractive. Predicting drug-target interactions is an essential step in identifying new candidate compounds with potential therapeutic effects. Drugs act in the human body by interacting with various targets, enhancing or inhibiting their functions and thereby exerting a regulatory effect that treats a given disease. Identifying drug-target interactions therefore helps to explain a drug's mechanism of action and plays a vital role in the discovery of new targets and in drug repositioning.
At present, structure-based methods, ligand-similarity-based methods and network-based methods are the main approaches to drug-target interaction prediction. Structure-based methods usually require the three-dimensional structure of the protein, so they tend to perform poorly for proteins whose structure is unknown. Ligand-similarity-based methods predict from knowledge of known ligands; if the compound of interest is not recorded in the target's ligand library, they cannot produce reliable predictions. Network-based methods exploit the latent correlations between drugs and targets and have become the mainstream technique for analyzing and solving drug-target interaction prediction problems. Inspired by message passing and clustering tasks in deep learning, drug-target prediction can mine large-scale data with graph neural networks, among which methods based on graph convolutional networks perform particularly well. The large heterogeneous network formed by drugs and their related data holds a great deal of useful hidden information, and processing it with graph convolutional networks can uncover latent associations in the network, which benefits drug discovery research. However, these methods generally neglect biological knowledge, such as the structural properties encoded in compound sequences, so they fail to capture latent features in the data and leave considerable room for improvement in model performance.
Summary of the Invention
The purpose of the present invention is to provide a graph neural network model that incorporates drug molecular structure and protein biological structure information, and that automatically predicts the interaction relationships between drugs and targets, improving verification efficiency and reducing verification cost.
To achieve this purpose, the technical solution of the present application is a drug-target interaction prediction method that fuses multi-dimensional features, comprising:
Step 1: Extract information on interacting drugs and targets, as well as related diseases and drug side effects, from medical databases, preprocess it, and construct a heterogeneous network;
Step 2: Use a heterogeneous graph attention neural network to extract the network topology features of the heterogeneous network;
Step 3: Represent each drug's SMILES molecular sequence as a molecular graph, and use a molecule attention Transformer network to extract drug structural feature information;
Step 4: Embed the target sequence information and process it with a convolutional neural network and a bidirectional long short-term memory network to extract target structural feature information;
Step 5: Incorporate the extracted drug structural feature information and target structural feature information into the message passing process of the heterogeneous graph attention neural network;
Step 6: Train the prediction model by optimizing a cross-entropy loss function, then save the prediction model;
Step 7: Load the prediction model, input the drug and target information to be predicted, predict the relationship between the drug and the target, and output the prediction result.
Further, step 1 is implemented as follows:
Step 1.1: Screen the drug information and target information from the medical databases, and delete drugs and targets that have no interaction relationship;
Step 1.2: Obtain from the medical databases the SMILES molecular sequence of each drug and the sequence information of each target, which serve as the biological feature representations of the drugs and targets respectively;
Step 1.3: Extract the disease and drug side effect information related to these drugs and targets;
Step 1.4: Referring to Figure 2(a), take the extracted drugs, targets, diseases and drug side effects as nodes and the associations between them as edges, and construct a heterogeneous network G = (V, E), where V is the node set and E is the edge set;
Step 1.5: Collect the drug-target pairs that have an interaction relationship, arrange them in the form <drug ID, target ID, label>, and set the label to 1;
Step 1.6: Randomly construct unknown drug-target pairs as negative examples at a positive-to-negative ratio of 1:10, and set their label to 0.
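Steps 1.5 and 1.6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `build_labeled_pairs`, the toy identifiers and the fixed random seed are all hypothetical, and "unknown" pairs are taken to mean every drug-target combination that is not a known positive.

```python
import random


def build_labeled_pairs(positive_pairs, drug_ids, target_ids, neg_ratio=10, seed=0):
    """Return (drug, target, label) triples with neg_ratio negatives per positive.

    Assumes drug_ids and target_ids contain no duplicates and that every
    positive pair is drawn from them.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    n_neg = neg_ratio * len(positives)
    max_neg = len(drug_ids) * len(target_ids) - len(positives)
    if n_neg > max_neg:
        raise ValueError("not enough unlabeled pairs for the requested ratio")

    # Rejection sampling: draw random pairs and keep those not known to interact.
    negatives = set()
    while len(negatives) < n_neg:
        pair = (rng.choice(drug_ids), rng.choice(target_ids))
        if pair not in positives:
            negatives.add(pair)

    triples = [(d, t, 1) for d, t in sorted(positives)]
    triples += [(d, t, 0) for d, t in sorted(negatives)]
    return triples


# Toy data (hypothetical identifiers).
drugs = [f"D{i}" for i in range(20)]
targets = [f"T{i}" for i in range(30)]
known = [("D0", "T1"), ("D3", "T7"), ("D5", "T2")]
samples = build_labeled_pairs(known, drugs, targets)
```

A sampled negative is only "not known to interact", so some of them may be true interactions that have not yet been recorded; this is inherent to the sampling scheme described in step 1.6.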
Further, step 2 is implemented as follows:
For the heterogeneous network, the initial node embedding is the map $f_0: V \to \mathbb{R}^d$, where $f_0(v)$ is the d-dimensional embedding of node v. The aggregation of neighbor information for node v is defined by equation (1), which is rendered as an image in the original publication and does not survive in this text.
In equation (1), σ(·) is the nonlinear activation function applied during propagation through one neural network layer, K is the number of attention layers, $N_v$ is the set of all neighbors of node v, W is a shared weight parameter, a is the weight vector of the attention mechanism, and AconC is a novel activation function that can be learned adaptively.
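Because equation (1) is not available in this text, the following is only a sketch of a standard single-head graph attention update combined with the ACON-C activation, using the symbols W, a and $N_v$ named above. The LeakyReLU on the attention logits, the default ACON-C parameters and all numeric values are assumptions taken from the commonly published forms of these techniques; the patent's exact formula may differ.

```python
import math


def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    """ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.

    p1, p2 and beta are learnable in practice; the defaults reduce to Swish.
    """
    d = (p1 - p2) * x
    return d / (1.0 + math.exp(-beta * d)) + p2 * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_aggregate(v, features, neighbors, W, a):
    """One single-head attention update for node v (needs at least one neighbor)."""
    wh_v = matvec(W, features[v])
    wh_nbrs = [matvec(W, features[u]) for u in neighbors[v]]
    # Attention logits e_vu = LeakyReLU(a . [W h_v || W h_u]).
    logits = [
        leaky_relu(sum(ai * xi for ai, xi in zip(a, wh_v + wh_u)))
        for wh_u in wh_nbrs
    ]
    # Softmax over the neighbors of v.
    m = max(logits)
    exps = [math.exp(e - m) for e in logits]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Attention-weighted sum of transformed neighbor features, then ACON-C.
    agg = [
        sum(alpha * wh_u[k] for alpha, wh_u in zip(alphas, wh_nbrs))
        for k in range(len(wh_v))
    ]
    return [acon_c(x) for x in agg], alphas


# Toy heterogeneous neighborhood (hypothetical values).
features = {"drug": [1.0, 0.0], "target": [0.0, 1.0], "disease": [1.0, 1.0]}
neighbors = {"drug": ["target", "disease"]}
W = [[0.5, -0.2], [0.1, 0.3]]
a = [0.4, -0.1, 0.2, 0.3]
updated, alphas = gat_aggregate("drug", features, neighbors, W, a)
```

A multi-layer or multi-head variant would repeat this update K times or average several such heads; which of the two the patent intends by "number of attention layers" cannot be confirmed from the surviving text.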
Further, step 3 is implemented as follows:
Step 3.1: Using the RDKit library for Python, represent the SMILES molecular sequence of each drug as a molecular graph whose vertices and edges are the atoms and chemical bonds of the drug. Each drug molecule is represented by a feature matrix and an adjacency matrix, and each row of the feature matrix holds the attributes of one atom;
Step 3.2: Because SMILES sequences differ in length, a maximum length of 100 characters is chosen to obtain a uniform representation; this length covers at least 90% of the compounds in the dataset. Sequences longer than the maximum are truncated, and shorter sequences are padded with 0;
Step 3.3: Referring to Figure 2(b), use the molecule attention Transformer network to extract the drug feature representation $S_{drug}$. The formula for the molecule multi-head self-attention layer is rendered as an image in the original publication and does not survive in this text.
In that formula, the adjacency matrix of the molecular graph and the matrix of inter-atomic distances enter the attention computation alongside the query, key and value matrices; W denotes learnable parameters, i ∈ (1, ..., h) indexes the attention heads, and h is the number of heads. $\lambda_a$, $\lambda_d$ and $\lambda_g$ are scalar parameters that weight the self-attention, distance and adjacency matrices respectively.
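Steps 3.2 and 3.3 can be sketched together as follows. The first function pads or truncates a SMILES string to 100 positions; character-level tokenization is a simplification assumed here (two-letter atoms such as "Cl" become two tokens). The second function mixes the three terms weighted by $\lambda_a$, $\lambda_d$ and $\lambda_g$ in the way the Molecule Attention Transformer is usually published; since the patent's formula is not available, this form, the choice of a row-wise softmax of negative distances for the distance term, and all numeric values are assumptions.

```python
import math

MAX_SMILES_LEN = 100  # maximum length stated in step 3.2


def encode_smiles(smiles, vocab, max_len=MAX_SMILES_LEN):
    """Map characters to integer ids, truncate to max_len, pad with 0."""
    ids = [vocab.setdefault(ch, len(vocab) + 1) for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def molecule_attention_weights(Q, K, adjacency, distance, lam_a, lam_d, lam_g):
    """Per-atom mixing weights: lam_a * softmax(QK^T / sqrt(d_k)) + lam_d * g(D) + lam_g * A.

    The result would be multiplied by the value matrix V in a full attention head.
    """
    n, d_k = len(Q), len(Q[0])
    weights = []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            for j in range(n)
        ]
        attn = softmax(scores)
        # Assumed g(D): row-wise softmax of negative inter-atomic distances.
        dist = softmax([-distance[i][j] for j in range(n)])
        weights.append([
            lam_a * attn[j] + lam_d * dist[j] + lam_g * adjacency[i][j]
            for j in range(n)
        ])
    return weights


# Ethanol heavy atoms C-C-O (distances are illustrative, not measured).
vocab = {}
encoded = encode_smiles("CCO", vocab)
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
distance = [[0.0, 1.5, 2.4], [1.5, 0.0, 1.4], [2.4, 1.4, 0.0]]
weights = molecule_attention_weights(Q, K, adjacency, distance, 0.5, 0.3, 0.2)
```

In practice the adjacency matrix would come from the molecular graph built in step 3.1 and the distance matrix from a computed conformer; both are hard-coded here to keep the sketch free of third-party dependencies.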
Further, step 4 is implemented as follows:
Step 4.1: Randomly initialize an index table of size 26×100 that covers all amino acids occurring in the target sequences. Map each amino acid in a target sequence to its entry in the index table to build the embedding matrix of that sequence. The length of the embedding matrix is the maximum target sequence length, which is set to 1000. The embedding vectors are optimized continuously during model training, so the contents of the index table change as the model is optimized;
Step 4.2: Referring to Figure 2(c), use a convolutional neural network and a bidirectional long short-term memory network to extract feature information from the target sequence.
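The lookup in step 4.1 can be sketched as below. The patent gives only the table size (26×100) and the maximum length (1000); the split into 25 residue codes plus one padding row, the particular alphabet, the fallback of unknown characters to "X", and the initialization range are all assumptions made for illustration.

```python
import random

# Assumed alphabet: 20 standard residues plus B, Z, X, U, O (25 codes).
SYMBOLS = "ACDEFGHIKLMNPQRSTVWYBZXUO"
INDEX = {aa: i + 1 for i, aa in enumerate(SYMBOLS)}  # index 0 is reserved for padding
EMBED_DIM, MAX_LEN = 100, 1000

# Randomly initialized 26 x 100 table; a trainable embedding layer in a real model.
rng = random.Random(0)
table = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(len(SYMBOLS) + 1)
]


def encode_target(sequence, max_len=MAX_LEN):
    """Convert a protein sequence to indices, truncating or zero-padding to max_len."""
    ids = [INDEX.get(aa, INDEX["X"]) for aa in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))


def embed(ids):
    """Look up one embedding row per position, giving a MAX_LEN x EMBED_DIM matrix."""
    return [table[i] for i in ids]


ids = encode_target("MKV")
matrix = embed(ids)
```

In a deep learning framework the table would be an embedding layer whose rows receive gradients, which is what the text means when it says the index table changes during training.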
Further, step 4.2 is implemented as follows:
Step 4.2.1: Use the embedding matrix obtained in step 4.1 as the input to the convolutional neural network. Target sequences shorter than the embedding matrix length are padded automatically with empty labels. Each CNN block uses three consecutive one-dimensional convolutional layers, and the number of convolution kernels grows with depth: the second layer uses twice as many kernels as the first, and the third layer uses three times as many;
Step 4.2.2: Feed the output of the convolutional layers into a BiLSTM layer. The final output is the protein structural feature, denoted $S_{protein}$. The defining formula is rendered as an image in the original publication and does not survive in this text.
In that formula, w and m denote the weight matrix and the convolution window size respectively, h is the hidden state of the LSTM, and x is the feature representation of the protein sequence.
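To make the layer sizing in step 4.2.1 concrete, the sketch below computes the tensor shapes that three stacked one-dimensional convolutions would produce. Stride 1, no padding, a base of 32 kernels, a window size of 4 and an LSTM hidden size of 64 are all hypothetical choices; the patent specifies only the 1x, 2x, 3x kernel ratio.

```python
def cnn_block_shapes(seq_len, base_filters, kernel_size):
    """Return (length, channels) after each of three stacked 1-D convolutions.

    Assumes stride 1 and no padding, so each layer shortens the sequence
    by kernel_size - 1 positions.
    """
    shapes = []
    length = seq_len
    for multiplier in (1, 2, 3):  # kernel counts n, 2n, 3n as in step 4.2.1
        length = length - kernel_size + 1
        shapes.append((length, base_filters * multiplier))
    return shapes


def bilstm_feature_size(hidden_size):
    """A bidirectional LSTM concatenates forward and backward hidden states."""
    return 2 * hidden_size


shapes = cnn_block_shapes(seq_len=1000, base_filters=32, kernel_size=4)
feature_size = bilstm_feature_size(hidden_size=64)
```

With these assumed values the BiLSTM receives a sequence of 991 positions with 96 channels each and emits 128 features per position.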
Further, step 5 is implemented as follows:
Referring to Figure 2(d), the drug structural feature vector $S_{drug}$ obtained in step 3 and the target structural feature vector $S_{protein}$ obtained in step 4 are concatenated during the message passing stage of the heterogeneous graph attention neural network. The formula that updates the node embedding of equation (1) is rendered as an image in the original publication and does not survive in this text.
Further, step 6 is implemented as follows:
Step 6.1: Referring to Figure 2(e), once the feature representations of the drugs and targets have been obtained, the inner product is used to predict drug-target interactions. Given a drug node u and a protein node v with features $f_u$ and $f_v$, the probability that u and v interact is:
$P = \sigma\left(f_u^{\top} f_v\right)$ (5)
where σ is the sigmoid function and P is the predicted interaction score between u and v;
Step 6.2: Train the prediction model by optimizing a cross-entropy loss function, use 10-fold cross-validation to test its performance, and save the best-performing prediction model $Model_{best}$.
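Equation (5), the cross-entropy loss and the 10-fold split of step 6 can be sketched in plain Python as follows. The function names, the clipping constant `eps` and the round-robin fold assignment are illustrative assumptions; a real training loop would compute these quantities inside a deep learning framework.

```python
import math
import random


def interaction_probability(f_u, f_v):
    """Equation (5): sigmoid of the inner product of the two feature vectors."""
    z = sum(a * b for a, b in zip(f_u, f_v))
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


p = interaction_probability([0.0, 0.0], [1.0, 1.0])
loss = cross_entropy([0.5], [1])
splits = list(k_fold_indices(25, k=10))
```

Each of the ten models would be trained on its `train` indices and scored on the held-out fold, and the best-scoring model kept as $Model_{best}$.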
Further, step 7 is implemented as follows:
Load the prediction model $Model_{best}$ from step 6.2, input the drug-target information from the validation data into the prediction model, determine whether each drug and target interact, and output the corresponding evaluation metrics.
By adopting the above technical solution, the present invention achieves the following technical effects. It uses a deep learning model that draws on the drug, target, disease and drug side effect information in medical databases, combines it with the structural characteristics of drugs and targets, and predicts drug-target interactions automatically. It extracts the feature information contained in drug molecules and protein structures effectively, predicts drug-target relationships with higher accuracy and robustness, improves the efficiency and precision of drug-target relationship verification, shortens the drug development cycle, greatly reduces the cost of new drug development, and provides an important foundation for new drug development and drug repurposing.
Brief Description of the Drawings
Figure 1 is a flow chart of the drug-target interaction prediction method that fuses multi-dimensional features;
Figure 2 is a model structure diagram of the drug-target interaction prediction method that fuses multi-dimensional features.
Detailed Description
The embodiments of the present invention are implemented on the basis of its technical solution, and detailed implementations and specific operating procedures are given; the scope of protection of the present invention is not, however, limited to the following embodiments.
The present invention is described in detail below with reference to the embodiments, so that a person of ordinary skill in the art can implement it by referring to this description.
Example 1
This example uses Windows as the development environment, PyCharm as the development platform and Python as the development language, and applies the drug-target interaction prediction method of the present invention to predict drug-target interaction relationships.
In this example, the drug-target interaction prediction method that fuses multi-dimensional features includes the following steps:
Extract 708 drugs, 1512 proteins, 5603 diseases and 4192 drug side effects from the DrugBank, PubChem, HPRD, Comparative Toxicogenomics Database and SIDER databases. Mark the existing drug-target interactions as positive examples with data label 1, 1923 in total. Randomly select 19230 drug-target pairs that are not marked as positive to serve as negative examples with data label 0. Construct a heterogeneous network from these data;
Take the heterogeneous network, the drug SMILES sequences and the protein sequences as input, train and save the prediction model, and compute the evaluation metrics for the predicted drug-target interaction scores. The metrics are the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR).
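The two metrics named above can be computed as in the following sketch. AUROC is calculated from its pairwise-ranking definition, and AUPR is approximated by average precision, which is one common estimator of the area under the precision-recall curve; the patent does not say which estimator was used, and the scores and labels below are made up for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example.

    Requires at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
roc = auroc(scores, labels)            # 5 of 6 positive-negative pairs ordered correctly
ap = average_precision(scores, labels)  # (1/1 + 2/3) / 2
```

With a 1:10 class imbalance such as the one in this example, AUPR is usually the more informative of the two, because AUROC can stay high even when most top-ranked predictions are false positives.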
Following the above steps, the drug-target relationship prediction performance of the present invention was compared with the EEG-DTI, NeoDTI, DTINet, MSCMF and HNM models. As Table 1 shows, the proposed method clearly outperforms the other methods on both AUROC and AUPR.
Table 1. Comparison of drug-target relationship prediction results across models
Example 2
This example uses Windows as the development environment, PyCharm as the development platform and Python as the development language, and applies the drug-target interaction prediction method of the present invention to predict potential therapeutic drugs for COVID-19.
In this example, the drug-target interaction prediction method that fuses multi-dimensional features includes the following steps:
Extract 146 targets closely related to COVID-19 and 708 candidate drugs from the Comparative Toxicogenomics Database and the DrugBank database. Obtain the SMILES sequences of the drugs and the sequence structures of the targets from the PubChem database. Extract 1456 diseases and 4192 drug side effects related to the drugs and targets from the HPRD and SIDER databases. Construct a heterogeneous network from the acquired data;
Take the heterogeneous network and the drug and protein sequence data as input, load the saved prediction model, and obtain the predicted interaction score, Score, between each drug and each target. Sort the scores in descending order and, for each target, extract the 10 candidate drugs with the highest confidence, requiring every one of them to have a confidence score greater than 0.5. Only 15 targets meet this requirement, which yields 150 candidate drugs.
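The screening rule can be sketched as follows. The reading assumed here is that a target is kept only if all of its top 10 drugs score above 0.5, which is consistent with 15 targets yielding 150 candidates; the function name, the nested-dictionary input format and the toy scores are hypothetical.

```python
def select_candidates(scores, top_k=10, threshold=0.5):
    """Keep targets whose top_k drugs all score above threshold.

    scores maps each target to a dict of {drug: predicted score}.
    Returns {target: [drug, ...]} with drugs in descending score order.
    """
    selected = {}
    for target, drug_scores in scores.items():
        ranked = sorted(drug_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        if len(ranked) == top_k and all(s > threshold for _, s in ranked):
            selected[target] = [drug for drug, _ in ranked]
    return selected


# Toy scores with top_k=2 to keep the example small.
toy_scores = {
    "T1": {"D1": 0.9, "D2": 0.7, "D3": 0.4},
    "T2": {"D1": 0.8, "D2": 0.3, "D3": 0.2},
}
picked = select_candidates(toy_scores, top_k=2)
```

Here only T1 survives, because the second-ranked drug for T2 scores below the threshold.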
In the experimental results, 54 of the 150 screened drugs have already appeared in COVID-19 clinical studies; part of the data is shown in Table 2. This approach makes it possible to find candidate drugs for subsequent wet-lab experiments quickly and in a more targeted way.
Table 2. Therapeutic drugs related to COVID-19 screened by the present invention
The foregoing description of specific exemplary embodiments of the present invention is given for the purposes of illustration and exemplification. It is not intended to limit the invention to the precise forms disclosed, and many modifications and variations are clearly possible in light of the above teachings. The exemplary embodiments were selected and described in order to explain certain principles of the invention and their practical application, so that those skilled in the art can implement and use the various exemplary embodiments of the invention together with their alternatives and modifications. The scope of the invention is intended to be defined by the claims and their equivalents.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310038717.4A CN116206775A (en) | 2023-01-13 | 2023-01-13 | Multi-dimensional characteristic fusion medicine-target interaction prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310038717.4A CN116206775A (en) | 2023-01-13 | 2023-01-13 | Multi-dimensional characteristic fusion medicine-target interaction prediction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116206775A true CN116206775A (en) | 2023-06-02 |
Family
ID=86516594
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310038717.4A Pending CN116206775A (en) | 2023-01-13 | 2023-01-13 | Multi-dimensional characteristic fusion medicine-target interaction prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116206775A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116894180A (en) * | 2023-09-11 | 2023-10-17 | 南京航空航天大学 | Product manufacturing quality prediction method based on different composition attention network |
CN116894180B (en) * | 2023-09-11 | 2023-11-24 | 南京航空航天大学 | Product manufacturing quality prediction method based on different composition attention network |
CN117809737A (en) * | 2023-12-26 | 2024-04-02 | 国药(武汉)精准医疗科技有限公司 | Drug target protein interaction identification method, device, equipment and storage medium |
CN118197402A (en) * | 2024-04-02 | 2024-06-14 | 宁夏大学 | A method, device and apparatus for predicting drug-target relationship |
CN118197402B (en) * | 2024-04-02 | 2024-09-10 | 宁夏大学 | Method, device and equipment for predicting drug target relation |
CN118506856A (en) * | 2024-07-18 | 2024-08-16 | 中国石油大学(华东) | Medicine target interaction prediction method and system based on artificial intelligence |
CN119028486A (en) * | 2024-08-14 | 2024-11-26 | 湖北中医药大学 | A drug interaction prediction method and device based on multimodal data fusion |
CN119230128A (en) * | 2024-12-03 | 2024-12-31 | 长春中医药大学 | Medicine interaction prediction method based on multidimensional features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116206775A (en) | Multi-dimensional characteristic fusion medicine-target interaction prediction method | |
Vo et al. | A novel framework for trash classification using deep transfer learning | |
CN111914054B (en) | System and method for large-scale semantic indexing | |
CN112070277A (en) | Hypergraph neural network-based drug-target interaction prediction method | |
CN115240777B (en) | Synthetic lethal gene prediction method, device, terminal and medium based on graph neural network | |
CN112599187B (en) | Method for predicting drug and target protein binding fraction based on double-flow neural network | |
CN114420310A (en) | Prediction method of drug ATCCode based on graph conversion network | |
CN113470741B (en) | Drug target relationship prediction method, device, computer equipment and storage medium | |
CN114944192A (en) | Disease-related circular RNA recognition method based on graph attention | |
CN113793696B (en) | A similarity-based prediction method, system, terminal and readable storage medium for the frequency of side effects of new drugs | |
CN112652358A (en) | Drug recommendation system, computer equipment and storage medium for regulating and controlling disease target based on three-channel deep learning | |
CN114564596A (en) | Cross-language knowledge graph link prediction method based on graph attention machine mechanism | |
CN112562791A (en) | Drug target action depth learning prediction system based on knowledge graph, computer equipment and storage medium | |
CN109637579A (en) | A kind of key protein matter recognition methods based on tensor random walk | |
CN116386899A (en) | Drug-disease correlation prediction method and related equipment based on graph learning | |
Wang et al. | A drug target interaction prediction based on LINE-RF learning | |
CN116383401A (en) | Knowledge graph completion method integrating text description and graph convolution mechanism | |
CN116822577A (en) | Data generation system, method, medium and equipment | |
CN115376704A (en) | A Drug-Disease Interaction Prediction Method Fused with Multi-Neighborhood Association Information | |
ElAlami | Unsupervised image retrieval framework based on rule base system | |
CN117457064A (en) | Drug-drug interaction prediction method and device based on graph structure adaptation | |
CN117409983A (en) | Drug interaction prediction model based on drug sequence and substructure characteristics | |
CN114678064A (en) | Drug target interaction prediction method based on network characterization learning | |
CN115329133A (en) | Remote sensing video hash retrieval method based on key frame fusion and attention mechanism | |
CN113990396A (en) | A miRNA-disease association prediction method based on self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||