CN111916143A

CN111916143A - Molecular activity prediction method based on fusion of multiple substructure features

Info

Publication number: CN111916143A
Application number: CN202010729533.9A
Authority: CN
Inventors: 丁静怡; 宋健; 焦李成; 吴建设; 成若晖
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2020-11-10
Anticipated expiration: 2040-07-27
Also published as: CN111916143B

Abstract

The invention discloses a molecular activity prediction method based on the fusion of multiple substructure features. A neural network is trained on substructure features extracted from molecular graphs, which overcomes the prior-art problems of extracted substructures containing closed loops, poor network prediction accuracy, and difficult computation. The steps of the invention are: 1) transforming drug molecule information into molecular feature matrices; 2) selecting an initial node; 3) obtaining multiple substructures; 4) calculating the similarity of the substructures; 5) fusing the substructure feature matrices; 6) training the neural network; 7) judging whether the neural network being trained has converged; 8) obtaining the activity of the molecule to be predicted. The invention can distinguish the differences between substructures, mitigates the problem of noise in molecular graphs, and predicts molecular activity with high accuracy.

Description

Molecular activity prediction method based on fusion of multiple substructure features

Technical Field

The invention belongs to the field of biotechnology and, more specifically, relates to a molecular activity prediction method based on the fusion of multiple substructure features in the field of bioactivity technology. The invention can use the substructure information of a class of drug molecules, together with their corresponding biological activities, to predict the effect on biological activity of unknown drug molecules of the same class.

Background Art

Molecular activity prediction refers to training a neural network model on the structural information of a class of drug molecules and their corresponding effects on biological activity; the trained model can then use the structural information of unknown drug molecules of the same class to predict their effects on biological activity. Such a model can screen, from a large pool of similar molecules, the drug compounds best suited for biological activity analysis in the laboratory. To convert drug molecules into information a computer can process, the molecular structure is converted into a molecular graph, and the effect of the drug molecule on biological activity is quantified as a molecular graph label. Molecular activity prediction can simplify the drug development process, reduce safety hazards in biological experiments, and lower their cost. Current molecular activity prediction techniques, however, face the challenge posed by node label noise in molecular graphs.

In the paper "Deep Graph Kernels" (Knowledge Discovery and Data Mining conference, 2015), Pinar Yanardag proposed a method that predicts molecular activity by comparing the similarity of substructures among a class of molecular graphs. The method divides a class of molecular graphs into a training set and a test set, partitions the training-set molecular graphs into multiple substructures, and trains a neural network model on the training-set substructures and labels. Finally, the model derives the molecular graph labels of the test set from the similarity between the substructures of the test-set molecular graphs and those of the training-set molecular graphs. The shortcoming of this method is that, because a molecular graph is randomly partitioned into many different substructures, different substructures may yield different predictions, which reduces the accuracy of the predicted molecular graph labels.

In the paper "Graph classification using structural attention" (Knowledge Discovery and Data Mining conference, 2018), J. B. Lee proposed a molecular activity prediction method based on an attention neural network. The method divides a class of molecular graphs into a training set and a test set, uses an attention mechanism to find substructures of the molecular graphs in the training set, and trains an LSTM model on these substructures and the molecular graph labels. The model then uses the attention mechanism to find substructures of the test-set molecular graphs and predicts their labels. The shortcoming of this method is that the substructures found contain closed loops, so the trained network cannot distinguish the differences between substructures.

Summary of the Invention

The purpose of the invention is to address the shortcomings of the existing molecular activity prediction techniques described above by providing a molecular activity prediction method based on the fusion of multiple substructure features. The method solves the problems that structural features are difficult to extract from noisy molecular graphs during molecular activity prediction and that prediction accuracy is poor.

The idea for achieving this purpose is as follows: based on the structural features and node label features unique to the substructures of a molecular graph, extract a set of molecular graph substructures by random walk, enrich the substructure feature information of the molecular graph, and input the fused features of some of the substructures into a constructed neural network training model, so as to predict molecular activity more accurately and quickly.

The specific implementation steps of the invention are as follows:

(1) Obtain the feature matrices corresponding to the drug molecule information:

The atoms in a drug molecule are one-hot encoded on a byte basis to obtain a one-hot encoded feature matrix; the bond-value pairs between the atoms of the drug are represented as a neighborhood feature matrix; and the activity of the drug molecule is one-hot encoded on a byte basis to obtain a one-hot encoded label feature matrix;

(2) Select the initial node:

(2a) The atoms of the drug molecule are represented as nodes, the chemical bonds between atoms are represented as edges, and the activity of the drug molecule is represented as the molecular graph label; the nodes, edges, and molecular graph label together form the molecular graph;

(2b) Using the betweenness method, the centrality value of each node in the molecular graph is calculated, and the node with the highest centrality value is selected as the initial node;

(3) Extract multiple substructure features of the molecular graph:

Starting from the initial node, a random walk is used to select l non-repeated nodes from the molecular graph, where l is less than the number of nodes in the molecular graph; these nodes form one substructure of the molecular graph. The same method is repeated to select a set of substructures;

(4) Calculate the similarity of the substructures:

(4a) Each substructure in the substructure set is encoded on a node basis to obtain the feature matrix of that substructure;

(4b) Using the similarity formula, the similarity of every pair of substructures in the substructure set is calculated:

J_{m,n} = |g ∩ p| / |g ∪ p|

where J_{m,n} denotes the similarity between the m-th and n-th substructures in the substructure set, g denotes the feature matrix of the m-th substructure, p denotes the feature matrix of the n-th substructure, |·| denotes the matrix modulus operation, ∩ denotes the intersection operation, and ∪ denotes the union operation;

(4c) All substructures whose similarity is greater than or equal to a threshold are stored in the similar set, and the remaining substructures are stored in the dissimilar set; the threshold lies in the range (0.5, 1) and is chosen according to the number of nodes in the molecular graph class concerned;

(5) Fuse the substructure feature matrices:

All substructure feature matrices in the similar set are averaged to obtain one fused substructure feature matrix;

(6) Train the neural network:

(6a) The fused substructure features are input into a 4-layer multilayer perceptron neural network, which outputs a predicted molecular graph label; a cross-entropy loss function is used to calculate the loss value between the predicted molecular graph label and the corresponding true molecular graph label;

(6b) Two substructure features are selected arbitrarily from the dissimilar set and input into the 4-layer multilayer perceptron neural network, which outputs a predicted molecular graph label; a cross-entropy loss function is used to calculate the loss value between the predicted molecular graph label and the corresponding true molecular graph label;

(6c) The two loss values above are combined to obtain the loss value for training the neural network;

(7) Judge whether the loss value of the neural network being trained has converged; if so, stop training to obtain the trained multilayer perceptron neural network and proceed to step (8); otherwise, return to step (3);

(8) Input a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, output the molecular graph label, and obtain the activity type corresponding to that label.

Compared with the prior art, the invention has the following advantages:

First, the invention fuses substructure feature matrices by averaging all substructure feature matrices in the similar set into one fused substructure feature matrix, which is input into the training network to obtain the molecular graph label. This overcomes the prior-art problem that, because a molecular graph is randomly partitioned into many different substructures, different substructures may yield different predictions and reduce the accuracy of the predicted molecular graph labels. The invention therefore extracts molecular graph substructure features well and improves the accuracy of molecular graph label prediction.

Second, the invention uses a random walk to select substructures without closed loops from the molecular graph and divides them into a set of similar substructures and a set of dissimilar substructures for training the network. This overcomes the prior-art problem that the substructures found contain closed loops, which prevents the trained network from distinguishing the differences between substructures. The invention is therefore able to distinguish the differences between different substructures.

Brief Description of the Drawings

FIG. 1 is a flow chart of the invention;

FIG. 2 is a schematic diagram of the process by which the invention extracts molecular graph substructures using a random walk;

FIG. 3 is a schematic diagram of the invention fusing the substructure features of the similar set.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The specific steps of the invention are further described with reference to FIG. 1.

Step 1: Obtain the feature matrices corresponding to the drug molecule information.

The atoms in a drug molecule are one-hot encoded on a byte basis to obtain a one-hot encoded feature matrix; the bond-value pairs between the atoms of the drug are represented as a neighborhood feature matrix; and the activity of the drug molecule is one-hot encoded on a byte basis to obtain a one-hot encoded label feature matrix.
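As an illustration of Step 1, the following minimal sketch builds the three matrices for a toy molecule. The atom vocabulary, the example atoms and bonds, and the two-class activity label are assumptions made for illustration only; the source does not specify the encoding beyond "one-hot", so the "byte basis" detail is not modeled here.

```python
# Hypothetical atom vocabulary; a real dataset would define its own.
ATOM_TYPES = ["C", "N", "O"]


def one_hot(index, size):
    """Return a list of length `size` with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def atom_feature_matrix(atoms):
    """One row per atom, one column per atom type."""
    return [one_hot(ATOM_TYPES.index(a), len(ATOM_TYPES)) for a in atoms]


def neighborhood_matrix(num_atoms, bonds):
    """Symmetric 0/1 matrix built from (i, j) pairs of bonded atoms."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


# Toy molecule (illustrative only): four atoms, three bonds.
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1), (1, 2), (1, 3)]

X = atom_feature_matrix(atoms)            # one-hot encoded feature matrix
A = neighborhood_matrix(len(atoms), bonds)  # neighborhood feature matrix
y = one_hot(1, 2)                         # label: class 1 of 2 (assumed "active")
```

Here `X` has one row per atom, `A` records which atoms are bonded, and `y` is the one-hot label of the whole molecular graph.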

Step 2: Select the initial node.

The atoms of the drug molecule are represented as nodes, the atomic properties are represented as node labels, the chemical bonds between atoms are represented as edges, and the activity of the drug molecule is represented as the molecular graph label; the nodes, edges, and molecular graph label together form the molecular graph.

Using the betweenness method, the centrality value of each node in the molecular graph is calculated, and the node with the highest centrality value is selected as the initial node.
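The selection of the initial node can be sketched as follows. The source only names the betweenness method; this sketch assumes standard shortest-path betweenness centrality on an unweighted, undirected graph, computed with Brandes' algorithm. The toy graph and the tie-breaking rule (lowest node index wins) are assumptions.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    `adj` maps each node to the list of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {v: c / 2.0 for v, c in score.items()}


def pick_initial_node(adj):
    """Node with the highest betweenness; ties go to the lowest index."""
    scores = betweenness(adj)
    return max(sorted(adj), key=lambda v: scores[v])


# Toy molecular graph (illustrative): chain 0-1-2-3 with a branch 1-4.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
scores = betweenness(adj)
start = pick_initial_node(adj)
```

In this toy graph node 1 lies on the most shortest paths, so it is chosen as the initial node.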

Step 3: Extract multiple substructure features of the molecular graph.

The process of extracting molecular graph substructures by random walk is described in detail below with reference to FIG. 2.

In FIG. 2, nodes with a black border are nodes that have been selected by the random walk, and nodes without a black border have not been selected. The number in each node is the serial number of the atom in the molecule. c_{t-1} denotes the node the random walk points to at time t-1, and c_t denotes the node it points to at time t. FIG. 2(a) shows the random walk selecting node 4 at time t-1. FIG. 2(b) shows backtracking to node 9; as FIG. 2(b) shows, none of the nodes connected to the current node 9 is unselected, so to find an unselected node, c_{t-1} must keep backtracking to earlier nodes that may still have unselected neighbors. FIG. 2(c) shows c_{t-1} backtracking to node 2; as FIG. 2(c) shows, node 1, which is connected to the current node 2, has not been selected. FIG. 2(d) shows node 1 being selected: because node 1 has not been selected, the random walk selects it, sets c_t to point to node 1, and sets its border to black.

As FIG. 2 shows, starting from the initial node, a random walk is used to select l non-repeated nodes from the molecular graph, where l is less than the number of nodes in the molecular graph; these nodes form one substructure of the molecular graph. The same method is repeated to select a set of substructures.
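The random walk with backtracking described for FIG. 2 can be sketched as below. The stack used for backtracking, the function names, the fixed seed, and the toy graph are assumptions for illustration; the source does not state how the walk history is stored.

```python
import random


def random_walk_substructure(adj, start, length, rng):
    """Select `length` distinct nodes by a random walk with backtracking."""
    selected = [start]   # nodes in the order they were selected
    visited = {start}
    stack = [start]      # current walk path, used for backtracking
    while len(selected) < length and stack:
        current = stack[-1]
        candidates = [w for w in adj[current] if w not in visited]
        if candidates:
            nxt = rng.choice(candidates)
            visited.add(nxt)
            selected.append(nxt)
            stack.append(nxt)
        else:
            # No unselected neighbour: backtrack to the previous node.
            stack.pop()
    return selected


def substructure_set(adj, start, length, count, seed=0):
    """Repeat the walk `count` times to obtain a set of substructures."""
    rng = random.Random(seed)
    return [random_walk_substructure(adj, start, length, rng)
            for _ in range(count)]


# Toy molecular graph (illustrative): chain 0-1-2-3 with a branch 1-4.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
subs = substructure_set(adj, start=1, length=4, count=5)
```

Each returned substructure is a node sequence of length l = 4 that starts at the initial node and contains no repeated node, so it cannot close a loop.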

Step 4: Calculate the similarity of the substructures.

In the first sub-step, each substructure in the substructure set is encoded on a node basis to obtain the feature matrix of that substructure.

In the second sub-step, the similarity formula is used to calculate the similarity of every pair of substructures in the substructure set:

J_{m,n} = |p_m ∩ p_n| / |p_m ∪ p_n|

where J_{m,n} denotes the similarity between the m-th and n-th substructures in the substructure set, p_m denotes the feature matrix of the m-th substructure, p_n denotes the feature matrix of the n-th substructure, |·| denotes the matrix modulus operation, ∩ denotes the intersection operation, and ∪ denotes the union operation.

In the third sub-step, |p_m ∩ p_n| denotes the Jaccard similarity and Hamming distance of the node-sequence features of substructures p_m and p_n. The procedure is as follows. First, the node-label jump sequence of each substructure is derived. For example, suppose substructure p_m has the node-sequence feature [1,1,2,2,3] and p_n has the node-sequence feature [1,2,3,2,3], where each element is a node label. From [1,1,2,2,3] the jump sequence [0,1,0,1] is obtained, and from [1,2,3,2,3] the jump sequence [1,1,1,1], where a 1 indicates that the labels of adjacent nodes in the substructure sequence differ and a 0 indicates no change. Then the Jaccard similarity measure and the Hamming distance measure are applied to the node-label jump sequences of the two paths and normalized into a similarity value.

In the fourth sub-step, all substructures whose similarity is greater than or equal to a threshold are stored in the similar set, and the remaining substructures are stored in the dissimilar set; the threshold lies in the range (0.5, 1) and is chosen according to the number of nodes in the molecular graph class concerned.
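The jump-sequence comparison can be sketched as below, using the two example sequences from the text. The source does not state how the Jaccard and Hamming measures are combined or how pairwise similarities assign a substructure to a set. This sketch therefore assumes two things: the similarity is the plain average of the Jaccard similarity and one minus the length-normalized Hamming distance, and a substructure joins the similar set if it belongs to at least one pair that reaches the threshold.

```python
from itertools import combinations


def jump_sequence(labels):
    """1 where adjacent node labels differ, 0 where they are equal."""
    return [int(a != b) for a, b in zip(labels, labels[1:])]


def jaccard(a, b):
    """Jaccard similarity of two equal-length binary sequences."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0


def hamming_similarity(a, b):
    """One minus the Hamming distance normalised by sequence length."""
    return 1.0 - sum(x != y for x, y in zip(a, b)) / len(a)


def similarity(labels_m, labels_n):
    # Assumption: the two measures are combined by a plain average.
    a, b = jump_sequence(labels_m), jump_sequence(labels_n)
    return 0.5 * (jaccard(a, b) + hamming_similarity(a, b))


def split_by_similarity(sequences, threshold):
    """Assumption: an index in any pair scoring >= threshold is 'similar'."""
    similar = set()
    for m, n in combinations(range(len(sequences)), 2):
        if similarity(sequences[m], sequences[n]) >= threshold:
            similar.update((m, n))
    dissimilar = [i for i in range(len(sequences)) if i not in similar]
    return sorted(similar), dissimilar


p_m = [1, 1, 2, 2, 3]   # node-label sequence from the text
p_n = [1, 2, 3, 2, 3]   # node-label sequence from the text
score = similarity(p_m, p_n)
similar, dissimilar = split_by_similarity([p_m, p_n, list(p_m)], threshold=0.6)
```

With a threshold of 0.6, the two identical sequences land in the similar set and the remaining sequence lands in the dissimilar set.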

Step 5: Fuse the substructure feature matrices.

The process of fusing the substructures in the similar set is described in detail below with reference to FIG. 3.

In FIG. 3, each substructure is shown as one row of a node sequence, and a node sequence can be represented as a feature matrix; one row of nodes corresponds to one feature matrix. FIG. 3(a) shows the feature matrices of three substructures. FIG. 3(b) shows the fused substructure feature matrix. The three substructure feature matrices in FIG. 3(a) are averaged to obtain the fused substructure feature matrix in FIG. 3(b).

All substructure feature matrices in the similar set are averaged to obtain one fused substructure feature matrix. A substructure can be encoded as a feature matrix according to its node features, and the feature matrices of the multiple substructures in the similar set are fused by averaging into a single substructure feature matrix.
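A minimal sketch of the fusion step, assuming every feature matrix in the similar set has the same shape and that "averaging" means the element-wise mean. The three small matrices are made-up values, not data from FIG. 3.

```python
def fuse(matrices):
    """Element-wise mean of equally shaped feature matrices."""
    count = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / count for c in range(cols)]
            for r in range(rows)]


# Three hypothetical 2x2 substructure feature matrices from the similar set.
similar_set = [
    [[1, 0], [0, 1]],
    [[1, 0], [1, 0]],
    [[0, 1], [0, 1]],
]
fused = fuse(similar_set)
```

Each entry of `fused` is the fraction of substructures in the similar set that have a 1 at that position, so features shared by most substructures dominate the fused matrix.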

Step 6: Train the neural network.

In the first sub-step, the fused substructure features are input into a 4-layer multilayer perceptron neural network, which outputs a predicted molecular graph label; a cross-entropy loss function is used to calculate the loss value λ₁ between the predicted molecular graph label and the corresponding true graph label.

In the second sub-step, two substructures are selected arbitrarily from the dissimilar set, and their features are input into the 4-layer multilayer perceptron neural network, which outputs a predicted molecular graph label; a cross-entropy loss function is used to calculate the loss value λ₂ between the predicted molecular graph label and the corresponding true graph label.

In the third sub-step, the neural network loss value L is obtained according to the following formula:

L = pλ₁ + (1-p)λ₂

where p denotes the bias value, which lies in the range (0.8, 1) and is chosen according to the number of nodes in the molecular graph class concerned.
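The combined loss can be illustrated with the short sketch below. The predicted probabilities, the default p = 0.9, and the function names are assumptions for illustration; the multilayer perceptron itself is not modeled here, only the two cross-entropy terms and their weighted sum.

```python
import math


def cross_entropy(probs, one_hot_label):
    """Cross-entropy between predicted class probabilities and a one-hot label."""
    return -sum(t * math.log(q) for q, t in zip(probs, one_hot_label) if t)


def combined_loss(loss_fused, loss_dissimilar, p=0.9):
    """L = p * lambda_1 + (1 - p) * lambda_2, with p in (0.8, 1)."""
    if not 0.8 < p < 1.0:
        raise ValueError("p is expected to lie in (0.8, 1)")
    return p * loss_fused + (1.0 - p) * loss_dissimilar


# Hypothetical network outputs for one molecular graph whose true label is class 1.
lambda_1 = cross_entropy([0.2, 0.8], [0, 1])   # fused-feature branch
lambda_2 = cross_entropy([0.6, 0.4], [0, 1])   # dissimilar-set branch
L = combined_loss(lambda_1, lambda_2, p=0.9)
```

Because p is close to 1, the loss from the fused similar-set features dominates, and the dissimilar-set term acts as a smaller correction.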

Step 7: Judge whether the loss value of the neural network has converged; if so, stop training to obtain the trained multilayer perceptron neural network and proceed to Step 8; otherwise, return to Step 3.

Step 8: Input a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, output the graph label, and obtain the activity type corresponding to that label.
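The loop between Step 3 and Step 7 can be sketched as follows. The source does not define the convergence test, so the criterion used here (the last few loss values differ by less than a tolerance), the window size, the tolerance, and the `step_fn` callback are all assumptions.

```python
def has_converged(losses, window=3, tol=1e-3):
    """True when the last `window` loss values vary by less than `tol`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tol


def train(step_fn, max_rounds=100):
    """Repeat Steps 3 to 6 (wrapped in `step_fn`) until the loss converges."""
    losses = []
    for _ in range(max_rounds):
        losses.append(step_fn())   # one round returns the loss value L
        if has_converged(losses):
            break
    return losses


# Stand-in for a real training round: a loss that halves every round.
values = (0.5 ** k for k in range(100))
history = train(lambda: next(values))
```

In a real run, `step_fn` would resample substructures, recompute the similar and dissimilar sets, update the network, and return the combined loss for that round.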

The effect of the invention is further described below in conjunction with a simulation experiment.

1. Simulation experiment conditions:

The hardware platform of the simulation experiment is an Intel i5 3470 CPU with a main frequency of 3.20 GHz and 4 GB of memory.

The software platform of the simulation experiment is the Ubuntu operating system and Python 3.6.

The input molecular graph datasets used in the simulation experiment are the following four molecular compound datasets published by the U.S. National Cancer Institute in 2006 at https://www.cancer.gov/: NCI-1, NCI-33, NCI-83, and NCI-123. NCI-1 is a balanced dataset of compounds screened for activity against non-small cell lung cancer, with two activity class labels. NCI-33 is a balanced dataset of compounds screened for activity against melanoma, with two activity class labels. NCI-83 is a balanced dataset of compounds screened for activity against breast cancer, with two activity class labels. NCI-123 is a balanced dataset of compounds screened for activity against breast cancer, with two activity class labels.

2. Simulation content and analysis of results:

In the simulation experiment, the invention and three prior-art methods (the Kernel-GK method, the Deep-SP method, and the GAM method) are each used to classify the molecular graphs and obtain classification results.

The three prior-art methods used in the simulation experiment are:

The Kernel-GK method is the molecular graph prediction method proposed by Nino Shervashidze et al. in "Efficient graphlet kernels for large graph comparison", Conference on Artificial Intelligence and Statistics, 2009, referred to as the similar-substructure graph kernel method.

The Deep-SP method is the molecular graph prediction method proposed by Pinar Yanardag et al. in "Deep Graph Kernels", Knowledge Discovery and Data Mining conference, 2015, referred to as the deep shortest-path graph kernel method.

The GAM method is the molecular graph prediction method proposed by J. B. Lee et al. in "Graph classification using structural attention", Knowledge Discovery and Data Mining conference, 2018, referred to as the attention neural network graph classification method.

The classification results of the three methods are each evaluated using the average accuracy (AA) metric. Using the following formula, the classification accuracy on the datasets NCI1, NCI33, NCI83, and NCI123 is calculated, and all results are compiled in Table 1:

Table 1. Quantitative analysis of the classification results of the invention and each prior-art method in the simulation experiment

As Table 1 shows, the classification accuracy (AA) of the invention is higher than that of all three prior-art methods, which demonstrates that the invention achieves higher molecular activity prediction accuracy.

The simulation experiment above shows that the method of the invention, by using the idea of a random walk, can identify the different substructure features of a molecular graph. The requirement that extracted substructures be open (free of closed loops) overcomes the drawback of prior-art methods in which the extracted molecular graph substructures contain closed loops, so that the trained network cannot distinguish the differences between substructures. In addition, fusing the features of similar substructures reduces the variation among the features of different substructures and solves the prior-art problem that different substructures yield different predictions, which lowers molecular activity prediction accuracy. Compared with the other methods, the multilayer neural network prediction model proposed by the invention has a short training time and generalizes well, making it a very practical molecular activity prediction method.

Claims

1. A molecular activity prediction method based on the fusion of multiple substructure features, characterized in that a random walk method is used to extract multiple substructure features of a molecular graph, and the fused substructure features are input into a trained multilayer neural network to predict molecular activity, the method specifically comprising the following steps:

(1) obtaining the feature matrices corresponding to the drug molecule information:

one-hot encoding the atoms in a drug molecule on a byte basis to obtain a one-hot encoded feature matrix, representing the bond-value pairs between the atoms of the drug as a neighborhood feature matrix, and one-hot encoding the activity of the drug molecule on a byte basis to obtain a one-hot encoded label feature matrix;

(2) selecting an initial node:

(2a) representing the atoms of the drug molecule as nodes, representing the chemical bonds between the atoms as edges, representing the activity of the drug molecule as a molecular graph label, and forming a molecular graph from the nodes, the edges, and the molecular graph label;

(2b) calculating the centrality value of each node in the molecular graph by using a betweenness method, and selecting the node with the highest centrality value as the initial node;

(3) extracting multiple substructure features of the molecular graph:

starting from the initial node, selecting from the molecular graph, by a random walk method, l non-repeated nodes, l being less than the number of nodes in the molecular graph, to form a substructure of the molecular graph, and selecting a set of substructures by the same method;

(4) calculating the similarity of the substructures:

(4a) encoding each substructure in the substructure set on a node basis to obtain the feature matrix of the substructure;

(4b) calculating the similarity of every pair of substructures in the substructure set by using a similarity formula:

J_{m,n} = |g ∩ p| / |g ∪ p|

wherein J_{m,n} represents the similarity between the m-th substructure and the n-th substructure in the substructure set, g represents the feature matrix of the m-th substructure in the substructure set, p represents the feature matrix of the n-th substructure in the substructure set, |·| represents the matrix modulus operation, ∩ represents the intersection operation, and ∪ represents the union operation;

(4c) storing all substructures whose similarity is greater than or equal to a threshold in a similar set, and storing the remaining substructures in a dissimilar set, wherein the threshold lies in the range (0.5, 1) and is selected according to the number of nodes in the molecular graph class concerned;

(5) fusing the substructure feature matrices:

averaging all substructure feature matrices in the similar set to obtain a fused substructure feature matrix;

(6) training a neural network:

(6a) arbitrarily selecting two substructure features from the dissimilar set, inputting the two selected substructure features into a 4-layer multilayer perceptron neural network, outputting a predicted molecular graph label, and calculating, by using a cross-entropy loss function, the loss value between the predicted molecular graph label and the corresponding true molecular graph label;

(6b) inputting the fused substructure features into the 4-layer multilayer perceptron neural network, outputting a predicted molecular graph label, and calculating, by using a cross-entropy loss function, the loss value between the predicted molecular graph label and the corresponding true molecular graph label;

(6c) combining the two loss values to obtain the loss value for training the neural network;

(7) judging whether the loss value of the neural network being trained has converged; if so, stopping training to obtain the trained multilayer perceptron neural network and executing step (8); otherwise, executing step (3);

(8) inputting a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, and outputting the molecular graph label to obtain the activity type corresponding to the molecular graph label.

2. The molecular activity prediction method based on the fusion of multiple substructure features according to claim 1, characterized in that the random walk method in step (3) comprises: selecting, by random walk, an unselected node in the neighborhood of the current node of the molecular graph and, if during selection no unselected node exists in the neighborhood of the current node, backtracking to a previously selected node, wherein the neighborhood of a node denotes the set of all other nodes connected to that node in the molecular graph.

3. The molecular activity prediction method based on the fusion of multiple substructure features according to claim 1, characterized in that the step of combining the two loss values in step (6c) comprises:

first, using a cross-entropy loss function, inputting the fused feature matrix into the neural network to obtain a loss value λ₁, and inputting the selected substructures of the dissimilar set into the neural network to obtain a loss value λ₂;

second, obtaining the neural network loss value L according to the following formula:

L = pλ₁ + (1-p)λ₂

wherein p represents a bias value, which lies in the range (0.8, 1) and is selected according to the number of nodes in the molecular graph class concerned.