CN116206775A

CN116206775A - Multi-dimensional characteristic fusion medicine-target interaction prediction method

Info

Publication number: CN116206775A
Application number: CN202310038717.4A
Authority: CN
Inventors: 车超; 姚奎
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2023-01-13
Filing date: 2023-01-13
Publication date: 2023-06-02

Abstract

The invention provides a drug-target interaction prediction method that integrates multi-dimensional features. Information on interacting drugs and targets, together with related diseases and drug side effects, is extracted from medical databases and used to construct a heterogeneous network. A heterogeneous graph attention neural network extracts the topological structure features of the heterogeneous network. The SMILES molecular sequence of each drug is represented as a molecular graph structure from which drug feature information is extracted, and target feature information is extracted in parallel. The extracted drug and target feature information is incorporated into the message passing process of the heterogeneous graph attention neural network; the model is trained and saved, and is then used to predict drug-target relationships. The invention makes effective use of the biological feature information of drugs and targets, achieves higher accuracy in drug-target relationship prediction, improves the efficiency and accuracy of drug-target relationship verification, shortens the drug research and development cycle, and greatly reduces the cost of new drug development.

Description

A drug-target interaction prediction method integrating multi-dimensional features

Technical Field

The present invention relates to the field of medical artificial intelligence technology, and in particular to a drug-target interaction prediction method integrating multi-dimensional features.

Background Art

New drug development is a long and expensive process. It usually takes 10 to 17 years from initial concept to market launch, with an investment of between US$700 million and US$2.7 billion. Repurposing existing drugs for new indications offers low R&D cost and short development time, so reusing existing drugs to treat common and rare diseases has become increasingly attractive. Predicting drug-target interactions is an essential step in identifying new candidate compounds with potential therapeutic effects. Drugs act in the human body by interacting with various targets, enhancing or inhibiting their functions and thereby exerting a regulatory effect that treats a particular disease. Identifying drug-target interactions therefore helps to explain the mechanism of action of drugs and plays a vital role in the discovery of new targets and in drug repositioning.

At present, structure-based methods, ligand-similarity-based methods and network-based methods are the main approaches to drug-target interaction prediction. Structure-based methods usually require knowledge of the three-dimensional structure of the protein, so they tend to perform poorly on proteins whose structure is unknown. Ligand-similarity-based methods make predictions from knowledge of known ligands; if the compound of interest is not present in the target's ligand library, such methods cannot produce reliable predictions. Network-based methods make full use of the latent correlations between drugs and targets and have become the mainstream technology for drug-target interaction prediction. Inspired by message passing and clustering tasks in deep learning, drug-target prediction can mine large-scale data with graph neural networks, and methods based on graph convolutional networks perform particularly well. The large heterogeneous network formed by drugs and their related data holds a great deal of useful hidden information; processing it with graph convolutional networks can uncover latent associations in the network, which benefits drug discovery research. However, these methods generally neglect biological knowledge, such as the biological structure characteristics contained in compound sequences, and therefore fail to capture latent features in the data, leaving considerable room for improvement in model performance.

Summary of the Invention

The purpose of the present invention is to provide a graph neural network model that incorporates drug molecular structure and protein biological structure information, automatically predicts the interaction relationship between drugs and targets, improves verification efficiency and reduces verification cost.

To achieve the above purpose, the technical solution of the present application is a drug-target interaction prediction method integrating multi-dimensional features, comprising:

Step 1: Extract information on interacting drugs and targets, as well as related diseases and drug side effects, from medical databases, preprocess it, and construct a heterogeneous network;

Step 2: Use a heterogeneous graph attention neural network to extract the network topology features of the heterogeneous network;

Step 3: Represent the drug SMILES molecular sequence as a molecular graph structure and use a molecular attention Transformer network to extract drug structure feature information;

Step 4: Embed the target sequence information and process it with a convolutional neural network and a bidirectional long short-term memory network to extract target structure feature information;

Step 5: Incorporate the extracted drug structure feature information and target structure feature information into the message passing process of the heterogeneous graph attention neural network;

Step 6: Train and optimize the prediction model with the cross-entropy loss function, then save the prediction model;

Step 7: Load the prediction model, input the drug and target information to be predicted, predict the relationship between the drug and the target, and output the prediction result.

Further, step 1 is specifically implemented as follows:

Step 1.1: Screen the drug information and target information from the medical databases and delete drugs and targets that have no interaction relationship;

Step 1.2: Obtain from the medical databases the SMILES molecular sequence of each drug and the sequence information of each target, which serve as the biological feature representations of the drug and the target respectively;

Step 1.3: Extract the disease and drug side effect information related to the above drugs and targets;

Step 1.4: Referring to Figure 2(a), take the extracted drugs, targets, diseases and drug side effects as nodes and the associations between them as edges to construct a heterogeneous network G = (V, E), where V is the node set and E is the edge set;

Step 1.5: Collect the drugs and targets that have an interaction relationship into triples of the form <drug ID, target ID, label>, with the label set to 1;

Step 1.6: Randomly construct unknown drug-target relationships as negative examples at a positive-to-negative ratio of 1:10, with the label set to 0.
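Steps 1.4 to 1.6 can be illustrated with a minimal sketch, assuming that associations are undirected and that negatives are drawn uniformly from all drug-target pairs without a recorded interaction. The function names, the typed-node tuples and the toy identifiers are hypothetical and are not taken from the patent.

```python
import random

def build_hetero_graph(edges):
    """Build G = (V, E) from typed edges ((type_u, id_u), (type_v, id_v))."""
    nodes, adjacency = set(), {}
    for u, v in edges:
        nodes.update((u, v))
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)  # assumption: undirected associations
    return nodes, adjacency

def build_samples(positive_pairs, drug_ids, target_ids, neg_ratio=10, seed=0):
    """Return <drug, target, label> triples with neg_ratio negatives per positive."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    # Every drug-target pair without a recorded interaction is a candidate negative.
    candidates = [(d, t) for d in drug_ids for t in target_ids
                  if (d, t) not in positives]
    n_neg = min(len(candidates), neg_ratio * len(positives))
    negatives = rng.sample(candidates, n_neg)
    samples = [(d, t, 1) for d, t in sorted(positives)]
    samples += [(d, t, 0) for d, t in negatives]
    return samples

# Toy data (hypothetical identifiers).
drugs = [f"D{i}" for i in range(6)]
targets = [f"T{i}" for i in range(8)]
known = [("D0", "T1"), ("D2", "T5"), ("D4", "T7")]
edges = [(("drug", d), ("target", t)) for d, t in known]
edges.append((("drug", "D0"), ("disease", "DIS3")))
edges.append((("drug", "D0"), ("side_effect", "SE9")))

nodes, adjacency = build_hetero_graph(edges)
samples = build_samples(known, drugs, targets)
```

With 3 positives the sketch draws 30 negatives, matching the 1:10 ratio. Unobserved pairs are treated as negatives here only because the text does so; some of them may be true but unrecorded interactions.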

Further, step 2 is specifically implemented as follows:

In the heterogeneous network, node embeddings are initialized by a mapping f⁰: V → R^d, where f⁰(v) is the d-dimensional embedding of each node v. The aggregation of neighbor node information for node v is defined by equation (1):

where σ(·) is the nonlinear activation function applied during propagation through one neural network layer, K is the number of attention layers, N_v denotes all neighboring nodes of node v, W is a shared weight parameter, a is the weight vector of the attention mechanism, and AconC is a recently proposed activation function whose parameters are learned adaptively.
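Equation (1) itself is not reproduced in this text, so the following is only a sketch of one plausible reading: a single-head graph attention update in which AconC scores the concatenated, linearly transformed features of a node and its neighbor, a softmax over N_v gives the attention weights, and σ is applied to the weighted sum. The ACON-C form (p1 − p2)·x·sigmoid(β(p1 − p2)·x) + p2·x follows the published definition of that activation. Using a sigmoid for σ, a single head and fixed p1, p2, β values are assumptions, and all names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    """ACON-C activation; p1, p2 and beta are learnable in practice."""
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_aggregate(f, neighbors, v, W, a):
    """One single-head attention update for node v (assumed form of eq. (1))."""
    hv = matvec(W, f[v])
    hs = {u: matvec(W, f[u]) for u in neighbors[v]}
    # Unnormalized score: AconC(a^T [W f(v) || W f(u)])
    scores = {u: acon_c(sum(ai * xi for ai, xi in zip(a, hv + hu)))
              for u, hu in hs.items()}
    top = max(scores.values())
    exps = {u: math.exp(s - top) for u, s in scores.items()}  # stable softmax
    total = sum(exps.values())
    alpha = {u: e / total for u, e in exps.items()}
    aggregated = [sum(alpha[u] * hs[u][i] for u in hs) for i in range(len(hv))]
    return alpha, [sigmoid(x) for x in aggregated]  # sigmoid stands in for σ

# Toy example with d = 2 (hypothetical values).
f = {"drug": [1.0, 0.0], "target": [0.0, 1.0], "disease": [1.0, 1.0]}
neighbors = {"drug": ["target", "disease"]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 0.25, 0.25]
alpha, updated = attention_aggregate(f, neighbors, "drug", W, a)
```

With p1 = 1, p2 = 0 and β = 1 the activation reduces to x·sigmoid(x), which gives a quick sanity check of the implementation.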

Further, step 3 is specifically implemented as follows:

Step 3.1: Using the RDKit library for Python, the SMILES molecular sequence of each drug is converted into a molecular graph in which the vertices and edges represent the atoms and chemical bonds of the drug respectively. Each drug molecule is represented by a feature matrix and an adjacency matrix, and each row of the feature matrix holds the attributes of one atom;

Step 3.2: SMILES sequences differ in length, so to obtain a uniform representation a maximum length of 100 characters is chosen, which covers at least 90% of the compounds in the dataset. Sequences longer than the maximum length are truncated, and shorter sequences are padded with 0;
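The truncation and zero-padding in step 3.2 can be sketched as follows. Character-level tokenization and the incrementally built vocabulary are assumptions made for illustration (a real tokenizer would typically keep multi-character atoms such as "Cl" together); only the maximum length of 100 and the padding value 0 come from the text.

```python
MAX_SMILES_LEN = 100  # maximum length stated in step 3.2

def encode_smiles(smiles, vocab, max_len=MAX_SMILES_LEN):
    """Map a SMILES string to a fixed-length list of integer indices.

    Index 0 is reserved for padding; new characters receive the next free index.
    """
    ids = [vocab.setdefault(ch, len(vocab) + 1) for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))

vocab = {}
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
encoded_short = encode_smiles(aspirin, vocab)   # padded with zeros
encoded_long = encode_smiles("C" * 150, vocab)  # truncated to 100 characters
```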

Step 3.3: Referring to Figure 2(b), a molecular attention Transformer network is used to extract the drug feature representation S_drug; the molecular multi-head self-attention layer is computed as follows:

where the formula combines the self-attention term with the adjacency matrix of the molecular graph and the matrix of distances between atoms; Q_i = XW_i^q, K_i = XW_i^k and V_i = XW_i^v are the query, key and value matrices respectively, where W is a learnable parameter, i ∈ (1, ..., h) and h is the number of attention heads; λ_a, λ_d and λ_g are scalar parameters weighting the self-attention, distance and adjacency matrices.
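The attention formula is not reproduced in this text. For orientation, the published Molecule Attention Transformer formulation, which uses the same three weights λ_a, λ_d and λ_g, can be written as below. The symbols A (adjacency matrix), D (inter-atomic distance matrix), g (a row-wise normalization of D) and d_k (key dimension) are assumed names, and the patent's own equation may differ in detail.

```latex
\mathcal{A}^{(i)} =
  \left(
    \lambda_a \,\mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right)
    + \lambda_d \, g(D)
    + \lambda_g \, A
  \right) V_i,
\qquad
Q_i = X W_i^{q},\quad K_i = X W_i^{k},\quad V_i = X W_i^{v},
\qquad i \in \{1,\dots,h\}
```

Under this reading, setting λ_d = λ_g = 0 recovers ordinary scaled dot-product attention, so the two extra terms are what inject molecular geometry and bonding into each head.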

Further, step 4 is specifically implemented as follows:

Step 4.1: Randomly initialize an index table of size 26×100 covering all amino acids that appear in the target sequences; map each amino acid in a target sequence to its entry in the index table to build the embedding matrix of that sequence. The length of the embedding matrix is the maximum target sequence length, which is set to 1000. The embedding vectors are optimized continuously during model training, so the contents of the index table change as the model is optimized;
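Step 4.1 amounts to a learnable embedding lookup, sketched below. The 26×100 table size and the length of 1000 come from the text; reading the 26 rows as 25 sequence symbols plus one padding row, the particular alphabet, the initialization range and the mapping of unknown symbols to the padding index are all assumptions. In a real model the table would be a trainable parameter rather than a fixed list.

```python
import random

PROTEIN_ALPHABET = "ABCDEFGHIKLMNOPQRSTUVWXYZ"  # 25 symbols (assumed alphabet)
AA_INDEX = {aa: i + 1 for i, aa in enumerate(PROTEIN_ALPHABET)}  # 0 = padding
MAX_TARGET_LEN = 1000
EMBED_DIM = 100

def init_index_table(rows=len(PROTEIN_ALPHABET) + 1, dim=EMBED_DIM, seed=0):
    """Randomly initialize the 26 x 100 index table."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def encode_target(sequence, max_len=MAX_TARGET_LEN):
    """Truncate or zero-pad a protein sequence to max_len indices."""
    ids = [AA_INDEX.get(aa, 0) for aa in sequence.upper()[:max_len]]
    return ids + [0] * (max_len - len(ids))

def embed(ids, table):
    """Look up one table row per position, giving a max_len x dim matrix."""
    return [table[i] for i in ids]

table = init_index_table()
ids = encode_target("MKTAYIAKQR")  # toy sequence
matrix = embed(ids, table)
```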

Step 4.2: Referring to Figure 2(c), a convolutional neural network and a bidirectional long short-term memory network are used to extract feature information from the target sequence.

Further, step 4.2 is specifically implemented as follows:

Step 4.2.1: The embedding matrix obtained in step 4.1 is used as the input of the convolutional neural network; target sequences shorter than the embedding matrix length are automatically padded with empty labels. Each CNN block uses three consecutive one-dimensional convolutional layers, and the number of convolution kernels increases with depth: the second layer uses twice as many kernels as the first layer, and the third layer uses three times as many;
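The kernel-count rule in step 4.2.1 can be checked with a small shape calculation. This sketch assumes unpadded ("valid") convolutions with stride 1; the base kernel count of 32 and the window size of 4 are hypothetical, since the text states neither.

```python
def cnn_block_shapes(seq_len, base_filters, kernel_size, stride=1):
    """Return (channels, length) after each of the three 1-D conv layers."""
    shapes = []
    length = seq_len
    for multiplier in (1, 2, 3):  # n, 2n and 3n kernels, as in step 4.2.1
        length = (length - kernel_size) // stride + 1  # 'valid' convolution
        shapes.append((base_filters * multiplier, length))
    return shapes

shapes = cnn_block_shapes(seq_len=1000, base_filters=32, kernel_size=4)
```

For a length-1000 input this yields 32, 64 and 96 channels, and the sequence axis shrinks by kernel_size − 1 at each layer.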

Step 4.2.2: A BiLSTM layer receives the output of the convolutional layers; the final output is the protein structure feature, denoted S_protein, given by the following formula:

where w and m denote the weight matrix and the convolution window size respectively, h is the LSTM hidden state, and x is the feature representation of the protein sequence.

Further, step 5 is specifically implemented as follows:

Referring to Figure 2(d), the drug structure feature vector S_drug obtained in step 3 and the target structure feature vector S_protein obtained in step 4 are concatenated during the message passing stage of the heterogeneous graph attention neural network, and the node embedding in equation (1) is updated by the following formula:

Further, step 6 is specifically implemented as follows:

Step 6.1: Referring to Figure 2(e), after the feature representations of the drug and the target are obtained, the inner product is used to predict the drug-target interaction. Given a drug node u and a protein node v with features f_u and f_v, the probability that u and v interact is:

P = σ((f_u)^T f_v)  (5)

where σ(·) is the sigmoid function and P is the predicted interaction score between u and v;
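Equation (5) is a sigmoid applied to the inner product of the two feature vectors, which the following sketch computes directly. The function name and the toy vectors are hypothetical; the branch on the sign of the score is only there to avoid floating-point overflow.

```python
import math

def interaction_probability(f_u, f_v):
    """P = sigmoid(f_u . f_v) for a drug feature f_u and a protein feature f_v."""
    if len(f_u) != len(f_v):
        raise ValueError("feature vectors must have the same dimension")
    score = sum(a * b for a, b in zip(f_u, f_v))
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)  # avoids overflow for large negative scores
    return e / (1.0 + e)

p_orthogonal = interaction_probability([1.0, 0.0], [0.0, 1.0])  # inner product 0
p_aligned = interaction_probability([1.0, 0.0], [2.0, 0.0])     # inner product 2
```

Orthogonal features give P = 0.5, and the score rises towards 1 as the two embeddings align.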

Step 6.2: The prediction model is trained and optimized with the cross-entropy loss function, its performance is tested with 10-fold cross-validation, and the best-performing prediction model, Model_best, is saved.
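The loss and the evaluation split of step 6.2 can be sketched as below, assuming binary cross-entropy averaged over samples and a plain shuffled 10-fold split. Whether the folds were stratified by label is not stated in the text, and the function names are hypothetical.

```python
import math
import random

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

loss = binary_cross_entropy([1, 0], [0.5, 0.5])
splits = list(k_fold_indices(100))
```

An uninformative prediction of 0.5 for every pair gives a loss of ln 2, a useful baseline when monitoring training.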

Further, step 7 is specifically implemented as follows:

Load the prediction model Model_best saved in step 6.2, input the drug-target information in the validation data into the prediction model, determine whether an interaction relationship exists between the drug and the target, and output the corresponding evaluation metrics.

By adopting the above technical solution, the present invention achieves the following technical effects. The invention uses a deep learning model that draws on the drug, target, disease and drug side effect information in medical databases, combines it with the structural characteristics of drugs and targets, and automatically predicts drug-target interactions. It effectively extracts the feature information contained in drug molecules and protein structures, achieves higher accuracy and robustness in drug-target relationship prediction, improves the efficiency and accuracy of drug-target relationship verification, shortens the drug development cycle, greatly reduces the cost of new drug development, and provides an important foundation for new drug development and drug repurposing.

Brief Description of the Drawings

Figure 1 is a flow chart of a drug-target interaction prediction method integrating multi-dimensional features;

Figure 2 is a model structure diagram of a drug-target interaction prediction method integrating multi-dimensional features.

Detailed Description

The embodiments of the present invention are implemented on the basis of the technical solution of the present invention, and detailed implementations and specific operating procedures are given, but the scope of protection of the present invention is not limited to the following embodiments.

The present invention is described in detail below with reference to the embodiments so that those of ordinary skill in the art can implement it by referring to this description.

Example 1

This example uses Windows as the development environment, PyCharm as the development platform and Python as the development language, and applies the drug-target interaction prediction method integrating multi-dimensional features of the present invention to predict drug-target interaction relationships.

In this example, the drug-target interaction prediction method integrating multi-dimensional features comprises the following steps:

From the DrugBank, PubChem, HPRD, Comparative Toxicogenomics Database and SIDER databases, 708 drugs, 1512 proteins, 5603 diseases and 4192 drug side effects were extracted. Existing drug-target interactions were marked as positive examples with the label set to 1, for a total of 1923 cases. From the drug-target pairs not marked as positive, 19230 were randomly selected as negative examples with the label set to 0. A heterogeneous network was constructed from these data;

The heterogeneous network, the drug SMILES sequences and the protein sequences are used as input to train and save the prediction model, and the evaluation metrics for drug-target interaction prediction are obtained. The metrics are the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR).
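The two metrics can be computed from labels and scores as in the following sketch. AUROC is computed exactly as the fraction of positive-negative pairs ranked correctly, with ties counted as one half. For AUPR the sketch uses average precision, which is one common estimator; the text does not say which estimator was used. The labels and scores are made-up toy values, not results from the patent.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits

toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
toy_aupr = average_precision(toy_labels, toy_scores)
```

With a 1:10 class ratio as in this example, AUPR is usually the more demanding of the two metrics, because it is sensitive to false positives among the top-ranked pairs.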

Following the above steps, the drug-target relationship prediction performance of the present invention was compared with that of the EEG-DTI, NeoDTI, DTINet, MSCMF and HNM models. As Table 1 shows, the method proposed by the present invention clearly outperforms the other methods on both AUROC and AUPR.

Table 1. Comparison of drug-target relationship prediction results of different models

Example 2

This example uses Windows as the development environment, PyCharm as the development platform and Python as the development language, and applies the drug-target interaction prediction method integrating multi-dimensional features of the present invention to predict potential therapeutic drugs for COVID-19.

From the Comparative Toxicogenomics Database and the DrugBank database, 146 targets closely related to COVID-19 and 708 candidate drugs were extracted; the SMILES sequences of the drugs and the sequence structures of the targets were obtained from the PubChem database; 1456 diseases and 4192 drug side effects related to the drugs and targets were extracted from the HPRD and SIDER databases; a heterogeneous network was constructed from the acquired data;

The heterogeneous network and the drug and protein sequence data are used as input, and the saved prediction model is loaded to obtain the interaction prediction score, Score, between each drug and each target. The scores are sorted in descending order and the 10 highest-confidence candidate drugs are extracted for each target, with the requirement that the confidence scores of these drugs all exceed 0.5. After this filtering only 15 targets meet the requirement, yielding a final set of 150 candidate drugs.
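The screening rule can be sketched as follows, under the interpretation that a target is retained only when every one of its top-10 drugs scores above 0.5; this reading is consistent with 15 targets giving 150 candidates, but it is an interpretation rather than something the text states outright. The function name, the nested-dictionary layout and the toy scores are hypothetical.

```python
def screen_candidates(scores, top_k=10, threshold=0.5):
    """scores: {target: {drug: score}} -> {target: [(drug, score), ...]}."""
    selected = {}
    for target, drug_scores in scores.items():
        ranked = sorted(drug_scores.items(), key=lambda kv: kv[1], reverse=True)
        top = ranked[:top_k]
        # Keep the target only if all of its top_k drugs clear the threshold.
        if len(top) == top_k and all(s > threshold for _, s in top):
            selected[target] = top
    return selected

# Toy scores with top_k=2 for brevity.
toy_scores = {
    "T1": {"D1": 0.9, "D2": 0.7, "D3": 0.2},
    "T2": {"D1": 0.8, "D2": 0.4, "D3": 0.1},
}
result = screen_candidates(toy_scores, top_k=2)
n_candidates = sum(len(drugs) for drugs in result.values())
```

Because the same drug can rank in the top 10 for several targets, the candidate count refers to drug-target pairs, not necessarily to distinct drugs.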

In the experimental results, 54 of the 150 screened drugs have already appeared in COVID-19 clinical studies; part of the data is shown in Table 2. In this way, candidate drugs for subsequent wet-lab experiments can be found quickly and in a more targeted manner.

Table 2. Therapeutic drugs related to COVID-19 screened by the present invention

The foregoing description of specific exemplary embodiments of the present invention is given for the purposes of illustration and exemplification. It is not intended to limit the invention to the precise forms disclosed, and many modifications and variations are clearly possible in light of the above teachings. The exemplary embodiments were selected and described in order to explain the specific principles of the invention and their practical application, so that those skilled in the art can implement and use the various exemplary embodiments of the invention together with their alternatives and modifications. The scope of the invention is intended to be defined by the claims and their equivalents.

Claims

1. A drug-target interaction prediction method integrating multi-dimensional features, characterized by comprising:

Step 1: Extract information about interacting drugs and targets, as well as related diseases and drug side effects from the medical database, preprocess them, and construct a heterogeneous network;

Step 2: Using a heterogeneous graph attention neural network to extract network topology features in the heterogeneous network;

Step 3: Represent the drug SMILES molecular sequence as a molecular graph structure and use the molecular attention Transformer network to extract drug structural feature information;

Step 4: Embed the target sequence information and process it using a convolutional neural network and a bidirectional long short-term memory network to extract the target structure feature information;

Step 5: Integrate the extracted drug structure feature information and target structure feature information into the message passing process of the heterogeneous graph attention neural network;

Step 6: Use the cross entropy loss function to optimize the prediction model and then save the prediction model;

Step 7: Load the prediction model, input the drug and target information to be predicted, predict the relationship between the drug and the target, and output the prediction result.

2. The drug-target interaction prediction method integrating multi-dimensional features according to claim 1, characterized in that step 1 specifically comprises:

Step 1.1: Screen the drug information and target information from the medical database and delete the drug information and target information that have no interaction relationship;

Step 1.2: Obtain the SMILES molecular sequence corresponding to the drug and the sequence information corresponding to the target from the medical database, which are used as the biological feature representation information of the drug and the target respectively;

Step 1.3: Extract disease and drug side effect information related to the above drugs and targets;

Step 1.4: Take the extracted drugs, targets, diseases and drug side effects as nodes, and the association information between them as edges to construct a heterogeneous network G = (V, E), where V represents the node set and E represents the edge set;

Step 1.5: Integrate the drugs and targets with interaction relationships and construct them into the form of <drug number, target number, label>, and mark the label as 1;

Step 1.6: Randomly construct unknown drug-target relationships as negative examples in a certain proportion and mark them as 0.

3. The drug-target interaction prediction method integrating multi-dimensional features according to claim 1, characterized in that step 2 specifically comprises:

In the heterogeneous network, node embeddings are initialized by a mapping f⁰: V → R^d, where f⁰(v) is the d-dimensional embedding of each node v; the aggregation of neighbor node information for node v is defined as:

where σ(·) is the nonlinear activation function applied during propagation through one neural network layer, K is the number of attention layers, N_v denotes all neighboring nodes of node v, W is a shared weight parameter, a is the weight vector of the attention mechanism, and AconC is a recently proposed activation function whose parameters are learned adaptively.

4. The drug-target interaction prediction method integrating multi-dimensional features according to claim 1, characterized in that step 3 specifically comprises:

Step 3.1: By calling the RDKit function library in the Python library, the SMILES molecular sequence of each drug is represented as a molecular graph, where the vertices and edges of the graph represent the atoms and chemical bonds of the drug, respectively. Each drug molecule is represented by a feature matrix and an adjacency matrix, and each row of the feature matrix corresponds to the attributes of each atom.

Step 3.2: Select a SMILES sequence with a maximum length of 100 characters. Sequences longer than the maximum length are truncated, while sequences shorter than the maximum length are padded with 0s.

Step 3.3: Use the molecular attention Transformer network to extract the drug feature representation S_drug; the molecular multi-head self-attention layer is computed as follows:

where the formula combines the self-attention term with the adjacency matrix of the molecular graph and the matrix of distances between atoms; Q_i = XW_i^q, K_i = XW_i^k and V_i = XW_i^v are the query, key and value matrices respectively, where W is a learnable parameter, i ∈ (1, ..., h) and h is the number of attention heads; λ_a, λ_d and λ_g are scalar parameters weighting the self-attention, distance and adjacency matrices.

5. The drug-target interaction prediction method integrating multi-dimensional features according to claim 1, characterized in that step 4 specifically comprises:

Step 4.1: Randomly initialize an index table covering all amino acids that appear in the target sequences; map each amino acid in a target sequence to its entry in the index table to build the embedding matrix of that sequence; the length of the embedding matrix is the maximum target sequence length;

Step 4.2: Use convolutional neural network and bidirectional long short-term memory network to extract feature information from the target sequence.

6. The drug-target interaction prediction method integrating multi-dimensional features according to claim 5, characterized in that step 4.2 specifically comprises:

Step 4.2.1: Use the embedding matrix obtained in step 4.1 as the input of the convolutional neural network; for target sequences that are shorter than the length of the embedding matrix, empty labels are automatically filled; each CNN block uses three consecutive one-dimensional convolutional layers, and the number of convolutional kernels increases with the number of layers, that is, the second layer uses twice the convolutional kernels of the first layer, and the third layer uses three times the convolutional kernels of the first layer;

Step 4.2.2: A BiLSTM layer receives the output of the convolutional layers; the final output is the protein structure feature, denoted S_protein, given by the following formula:

where w and m denote the weight matrix and the convolution window size respectively, h is the LSTM hidden state, and x is the feature representation of the protein sequence.

7. The drug-target interaction prediction method integrating multi-dimensional features according to claim 1, characterized in that step 5 specifically comprises:

The drug structure feature vector S_drug obtained in step 3 and the target structure feature vector S_protein obtained in step 4 are concatenated during the message passing stage of the heterogeneous graph attention neural network, and the node embedding in formula (1) is updated by the following formula:

8. The drug-target interaction prediction method integrating multi-dimensional features according to claim 1, characterized in that step 6 specifically comprises:

Step 6.1: After the feature representations of the drug and the target are obtained, the inner product is used to predict the drug-target interaction; given a drug node u and a protein node v with features f_u and f_v, the probability that u and v interact is:

P = σ((f_u)^T f_v)  (5)

where σ(·) is the sigmoid function and P is the predicted interaction score between u and v;

Step 6.2: The prediction model is trained and optimized with the cross-entropy loss function, its performance is tested with 10-fold cross-validation, and the best-performing prediction model, Model_best, is saved.

9. The drug-target interaction prediction method integrating multi-dimensional features according to claim 8, characterized in that step 7 specifically comprises:

Load the prediction model Model_best saved in step 6.2, input the drug-target information in the validation data into the prediction model, determine whether an interaction relationship exists between the drug and the target, and output the corresponding evaluation metrics.