CN111916143B

CN111916143B - Molecular activity prediction method based on multi-substructural feature fusion

Info

Publication number: CN111916143B
Application number: CN202010729533.9A
Authority: CN
Inventors: 丁静怡; 宋健; 焦李成; 吴建设; 成若晖
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2023-07-28
Anticipated expiration: 2040-07-27
Also published as: CN111916143A

Abstract

The invention discloses a molecular activity prediction method based on multi-substructure feature fusion, which trains a neural network on substructure features extracted from a molecular graph and addresses the problems in the prior art that extracted substructures contain closed loops, that network prediction accuracy is poor, and that computation is difficult. The method comprises the following steps: 1) converting the information of the drug molecules into a molecular feature matrix; 2) selecting an initial node; 3) obtaining a plurality of substructures; 4) calculating the similarity of the substructures; 5) fusing the substructure feature matrices; 6) training a neural network; 7) judging whether the neural network being trained has converged; 8) obtaining the activity of the molecule to be predicted. The invention can distinguish the differences between different substructures, mitigates the problem of molecular graph noise, and predicts molecular activity with high accuracy.

Description

Molecular activity prediction method based on multi-substructural feature fusion

Technical Field

The invention belongs to the technical field of biology, and further relates to a molecular activity prediction method based on the fusion of multiple substructure features. Using the molecular substructure information of drug molecules and their corresponding biological activity, the invention can predict the effect on biological activity of unknown drug molecules of the same class.

Background

Molecular activity prediction trains a neural network model on the molecular structure information of a class of drugs and their corresponding effects on biological activity; the model can then predict the effect on biological activity from the structure information of unknown drugs of the same class. Such a model can screen, from a large pool of molecules of the same class, the drug compound molecules best suited for biological activity analysis in the laboratory. To convert drug molecules into information a computer can process, the molecular structure information is converted into a molecular graph, and the effect of the drug molecule on biological activity is quantified as the molecular graph label. Molecular activity prediction can simplify the drug development process, reduce the safety hazards of biological experiments, and save experimental cost. Current molecular activity prediction techniques are challenged by noise in the node labels of molecular graphs.

Pinar Yanardag, in the paper "Deep Graph Kernels" (Knowledge Discovery and Data Mining conference, 2015), proposes a method of predicting molecular activity by comparing the similarity of substructures between molecular graphs of one class. The method divides a class of molecular graphs into a training set and a test set, divides each training-set molecular graph into a plurality of substructures, and trains a neural network model using the training-set substructures and labels; finally, the model obtains the labels of the test-set molecular graphs from the similarity between the test-set molecular graph substructures and the training-set molecular graph substructures. The drawback of this method is that, because a molecular graph is randomly divided into a plurality of different substructures, different substructures may predict different results, which reduces the accuracy of the predicted molecular graph labels.

Lee, in the paper "Graph classification using structural attention" (Knowledge Discovery and Data Mining conference, 2018), proposes a molecular activity prediction method based on an attention neural network. The method divides a class of molecular graphs into a training set and a test set, searches for molecular graph substructures in the training set using an attention mechanism, and trains an LSTM model using the substructures and the molecular graph labels. Finally, the model searches for the substructures of the test-set molecular graphs using the attention mechanism and predicts their labels. The drawback of this method is that, because the substructures found in the molecular graph contain closed loops, the trained network cannot distinguish the differences among the substructures.

Disclosure of Invention

The invention aims to overcome the above deficiencies of existing molecular activity prediction techniques by providing a molecular activity prediction method based on multi-substructure feature fusion, which addresses the difficulty of extracting structural features from a noisy molecular graph and the poor prediction accuracy in molecular activity prediction.

The idea for realizing the purpose of the invention is as follows: based on the distinctive structural features and node label features within molecular substructures, a set of molecular substructures is extracted by random walk to enrich the substructure feature information, and the fused features of part of the substructures are input into a constructed neural network for training, so that molecular activity is predicted more accurately and quickly.

The invention is implemented by the following steps:

(1) Obtaining a feature matrix corresponding to the drug molecular information:

one-hot encoding the atoms in a drug molecule on a byte basis to obtain a one-hot encoded feature matrix, representing the bond pairs between the atoms of the drug as a neighborhood (adjacency) feature matrix, and one-hot encoding the activity of the drug molecule on a byte basis to obtain a one-hot encoded label feature matrix;

(2) Selecting an initial node:

(2a) Representing the atoms of the drug molecule as nodes, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, the nodes, connecting edges, and molecular graph label together constituting the molecular graph;

(2b) Calculating the centrality value of each node in the molecular graph using the betweenness method, and selecting the node with the highest centrality value as the initial node;

(3) Extracting a plurality of sub-structural features of a molecular diagram:

starting from the initial node, using a random walk to select from the molecular graph l non-repeated nodes, where l is smaller than the number of nodes of the molecular graph, the selected nodes forming one substructure of the molecular graph, and obtaining a substructure set by repeating the same method;

(4) Calculating the similarity of the substructures:

(4a) Encoding each substructure in the substructure set based on its nodes to obtain the feature matrix of the substructure;

(4b) Calculating the similarity of every two substructures in the substructure set using the similarity formula:

$$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$$

wherein $J_{m,n}$ represents the similarity of the m-th substructure and the n-th substructure in the substructure set, $p_m$ represents the feature matrix corresponding to the m-th substructure in the substructure set, $p_n$ represents the feature matrix of the n-th substructure in the substructure set, |·| represents the matrix modulo operation, ∩ represents the intersection operation, and ∪ represents the union operation;

(4c) Storing all the substructures whose similarity is greater than or equal to a threshold value in a similar set, and storing the remaining substructures in a dissimilar set, wherein the threshold value lies in the range (0.5, 1) and is selected according to the number of nodes in the different classes of molecules;

(5) Fusing the substructural feature matrix:

averaging all the substructure feature matrices in the similar set to obtain a fused substructure feature matrix;

(6) Training a neural network:

(6a) Inputting the fused substructure features into a 4-layer multi-layer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding real molecular graph label using a cross-entropy loss function;

(6b) Arbitrarily selecting two substructure features from different sets, inputting the two selected substructure features into the 4-layer multi-layer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding real molecular graph label using a cross-entropy loss function;

(6c) Superposing the two loss values to obtain the loss value of the neural network being trained;

(7) Judging whether the loss value of the neural network being trained has converged; if so, stopping training to obtain a trained multi-layer perceptron neural network and executing step (8); otherwise, executing step (3);

(8) Inputting the molecular graph of the same class to be predicted into the trained multi-layer perceptron neural network, and outputting a molecular graph label to obtain the activity type corresponding to that label.

Compared with the prior art, the invention has the following advantages:

First, because the invention averages all the substructure feature matrices in the similar set to obtain a fused substructure feature matrix, and inputs the fused matrix into the training network to obtain the molecular graph label, it overcomes the prior-art problem that the accuracy of the predicted molecular graph label is reduced because a molecular graph is randomly divided into a plurality of different substructures that may predict different results; the invention therefore extracts molecular substructure features well and improves the accuracy of the predicted molecular graph label.

Second, because the invention uses a random walk to select substructures without closed loops from the molecular graph and divides them into a similar substructure set and a dissimilar substructure set to train the network, it overcomes the prior-art problem that the trained network cannot distinguish the differences among substructures because the substructures found in the molecular graph contain closed loops; the invention can therefore distinguish the differences between different substructures.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the process of extracting a substructure of a molecular graph using random walk according to the present invention;

FIG. 3 is a schematic diagram of the fusion of substructure features in a similar set according to the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The specific steps of the present invention are further described with reference to fig. 1.

Step 1, obtaining a feature matrix corresponding to the drug molecular information.

The atoms in a drug molecule are one-hot encoded on a byte basis to obtain a one-hot encoded feature matrix, the bond pairs between the atoms of the drug are represented as a neighborhood (adjacency) feature matrix, and the activity of the drug molecule is one-hot encoded on a byte basis to obtain a one-hot encoded label feature matrix.
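Step 1 can be illustrated with a minimal sketch. The atom vocabulary `ATOM_TYPES`, the toy molecule, and all function names are assumptions made for illustration; the patent does not specify the atom set or the exact encoding layout.

```python
# Hypothetical atom vocabulary; the patent does not list the atom types used.
ATOM_TYPES = ["C", "N", "O"]


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    row = [0] * size
    row[index] = 1
    return row


def atom_feature_matrix(atoms, atom_types=ATOM_TYPES):
    """One row per atom, one column per atom type."""
    return [one_hot(atom_types.index(a), len(atom_types)) for a in atoms]


def adjacency_matrix(num_atoms, bonds):
    """Symmetric 0/1 matrix built from a list of (i, j) bonded atom index pairs."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def label_vector(activity, num_classes=2):
    """One-hot encoding of the activity class (assumed: 0 inactive, 1 active)."""
    return one_hot(activity, num_classes)


# Made-up 4-atom molecule with atom 1 bonded to atoms 0, 2 and 3.
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1), (1, 2), (1, 3)]
X = atom_feature_matrix(atoms)
A = adjacency_matrix(len(atoms), bonds)
y = label_vector(1)
```

Here `X` plays the role of the one-hot feature matrix, `A` the neighborhood feature matrix, and `y` the label feature matrix.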

Step 2, selecting an initial node.

The atoms of the drug molecule are represented as nodes, the atomic properties as node labels, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label; the nodes, connecting edges, and molecular graph label together constitute the molecular graph.

The centrality value of each node in the molecular graph is calculated using the betweenness method, and the node with the highest centrality value is selected as the initial node.
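The initial-node selection can be sketched as follows. The sketch assumes an unweighted, undirected molecular graph stored as a neighbor-list dictionary and computes unnormalized betweenness with Brandes' algorithm; the patent names only "betweenness", so the algorithm choice, the tie-breaking rule, and the toy graph are assumptions.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph.

    `adj` maps each node to a list of its neighbors (Brandes' algorithm).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {v: c / 2.0 for v, c in bc.items()}


def select_initial_node(adj):
    """Node with the highest betweenness; ties go to the first node encountered."""
    bc = betweenness_centrality(adj)
    return max(bc, key=bc.get)


# Hypothetical 5-atom molecule: chain 0-1-2-3 with a branch 2-4.
graph = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
centrality = betweenness_centrality(graph)
start = select_initial_node(graph)
```

In this toy graph node 2 lies on the most shortest paths, so it is chosen as the initial node.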

Step 3, extracting a plurality of substructure features of the molecular graph.

The process of extracting a molecular substructure using the random walk is described in detail below in conjunction with FIG. 2.

The black-edge nodes in FIG. 2 represent nodes selected by the random walk, and the other nodes represent unselected nodes. The number in each node is the serial number of the corresponding atom in the molecule; $c_{t-1}$ denotes the node the random walk points to at time t-1, and $c_t$ denotes the node the random walk points to at time t. FIG. 2(a) shows the random walk selecting node 4 at time t-1. FIG. 2(b) shows the backtracking to node 9: none of the nodes adjacent to the current node 9 is unselected, so to find an unselected node, $c_{t-1}$ must continue backtracking to an earlier node that may still have unselected neighbors. FIG. 2(c) shows $c_{t-1}$ backtracking to node 2, where the adjacent node 1 of the current node 2 has not been selected. FIG. 2(d) shows node 1 being selected: because node 1 has not yet been selected, the random walk selects it, sets $c_t$ to point to node 1, and marks its edge black.

As shown in FIG. 2, starting from the initial node, a random walk selects from the molecular graph l non-repeated nodes, where l is smaller than the number of nodes of the molecular graph, to form one substructure of the molecular graph; a substructure set is obtained by repeating the same method.
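The random walk with backtracking can be sketched as below. The neighbor-list graph, the seeded random generator, and the function names are illustrative assumptions; the walk keeps the current path as a stack so that it can step back when the current node has no unselected neighbor.

```python
import random


def random_walk_substructure(adj, start, length, rng):
    """Select `length` distinct nodes by a random walk with backtracking.

    `adj` maps each node to a list of neighbors; `rng` is a random.Random.
    Returns the nodes in the order in which they were selected.
    """
    if length >= len(adj):
        raise ValueError("length must be smaller than the number of nodes")
    selected = [start]
    visited = {start}
    path = [start]  # current walk, used as a backtracking stack
    while len(selected) < length and path:
        current = path[-1]
        candidates = [w for w in adj[current] if w not in visited]
        if candidates:
            nxt = rng.choice(candidates)
            visited.add(nxt)
            selected.append(nxt)
            path.append(nxt)
        else:
            path.pop()  # no unselected neighbor: backtrack one step
    return selected


def sample_substructures(adj, start, length, count, seed=0):
    """Repeat the walk `count` times to build a substructure set."""
    rng = random.Random(seed)
    return [random_walk_substructure(adj, start, length, rng) for _ in range(count)]


# Hypothetical 5-atom molecule: chain 0-1-2-3 with a branch 2-4.
graph = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
substructures = sample_substructures(graph, start=2, length=3, count=4)
```

Because every node is selected at most once, each sampled substructure is a set of l distinct, connected nodes.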

Step 4, calculating the similarity of the substructures.

First step, each substructure in the substructure set is encoded based on its nodes to obtain the feature matrix of the substructure.

Second step, the similarity of every two substructures in the substructure set is calculated using the similarity formula:

$$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$$

wherein $J_{m,n}$ represents the similarity of the m-th substructure and the n-th substructure in the substructure set, $p_m$ represents the feature matrix corresponding to the m-th substructure in the substructure set, $p_n$ represents the feature matrix of the n-th substructure in the substructure set, |·| represents the matrix modulo operation, ∩ represents the intersection operation, and ∪ represents the union operation.

Third step, $|p_m \cap p_n|$ represents the Jaccard similarity and the Hamming distance of the node sequence features of substructures $p_m$ and $p_n$. The specific process is as follows: first, the node label hopping sequence of each substructure is obtained from the substructure; for example, substructure $p_m$ has the node sequence feature [1,2,3] and $p_n$ has the node sequence feature [1,2,3,2,3], where each element is a node label. From [1,2,3] the hopping sequence [0,1,0,1] is obtained, and from $p_n$ = [1,2,3,2,3] the hopping sequence [1,1] is obtained, where a 1 in the hopping sequence indicates that the label changes between adjacent nodes in the substructure sequence and a 0 indicates no change. Then, the Jaccard similarity measure and the Hamming distance measure are applied to the node label hopping sequences of the two paths and normalized to obtain the similarity value.

Fourth step, all the substructures whose similarity is greater than or equal to the threshold value are stored in the similar set, and the remaining substructures are stored in the dissimilar set, wherein the threshold value lies in the range (0.5, 1) and is selected according to the number of nodes in the different classes of molecules.
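The pairwise similarity and the split into a similar and a dissimilar set can be sketched as follows. This is a simplified illustration under stated assumptions: each substructure is treated as a plain set of node identifiers rather than a feature matrix, the hopping-sequence and Hamming-distance processing is omitted, and a substructure is placed in the similar set as soon as one pair it belongs to reaches the threshold. None of these choices is confirmed by the patent text.

```python
from itertools import combinations


def jaccard(nodes_m, nodes_n):
    """|p_m ∩ p_n| / |p_m ∪ p_n|, computed on sets of node identifiers."""
    a, b = set(nodes_m), set(nodes_n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def partition_by_similarity(substructures, threshold):
    """Split substructure indices into (similar, dissimilar) lists.

    Assumption: an index counts as similar when at least one pair
    containing it has similarity >= threshold.
    """
    similar = set()
    for m, n in combinations(range(len(substructures)), 2):
        if jaccard(substructures[m], substructures[n]) >= threshold:
            similar.update((m, n))
    dissimilar = [i for i in range(len(substructures)) if i not in similar]
    return sorted(similar), dissimilar


# Four made-up substructures of three nodes each.
subs = [[2, 1, 0], [2, 1, 3], [2, 3, 4], [2, 0, 1]]
similar, dissimilar = partition_by_similarity(subs, threshold=0.6)
```

With the threshold 0.6 taken from the stated range (0.5, 1), only the first and last substructures, which contain the same nodes, end up in the similar set.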

Step 5, fusing the substructure feature matrices.

The process of fusing the substructures in the similar set is described in detail below in conjunction with FIG. 3.

In FIG. 3, each substructure is represented as a row of node sequences, which can be expressed as a feature matrix; one row of node sequences corresponds to one feature matrix. FIG. 3(a) shows the feature matrices of three substructures, and FIG. 3(b) shows the fused substructure feature matrix. The three substructure feature matrices in FIG. 3(a) are matrix-averaged to obtain the fused substructure feature matrix in FIG. 3(b).

All the substructure feature matrices in the similar set are averaged to obtain a fused substructure feature matrix. Each substructure is encoded as a feature matrix according to its node features, and the multiple substructure feature matrices in the similar set are averaged into a single substructure feature matrix.
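The fusion step amounts to an entry-by-entry mean. A minimal sketch, assuming all feature matrices in the similar set share one shape and are stored as lists of rows (the function name and the toy matrices are illustrative):

```python
def fuse_feature_matrices(matrices):
    """Entry-by-entry mean of equally shaped matrices given as lists of rows."""
    if not matrices:
        raise ValueError("the similar set is empty")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    for m in matrices:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all feature matrices must share one shape")
    k = len(matrices)
    return [[sum(m[i][j] for m in matrices) / k for j in range(cols)]
            for i in range(rows)]


# Two made-up one-hot feature matrices for substructures of two nodes.
a = [[1, 0], [0, 1]]
b = [[0, 1], [0, 1]]
fused = fuse_feature_matrices([a, b])
```

Entries on which the substructures agree keep their value, while entries on which they differ are smoothed toward the mean.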

Step 6, training the neural network.

First step, the fused substructure features are input into a 4-layer multi-layer perceptron neural network, the predicted molecular graph label is output, and the loss value $\lambda_1$ between the predicted molecular graph label and the corresponding real label is calculated using the cross-entropy loss function.

Second step, two substructures are arbitrarily selected from different sets, their features are input into the 4-layer multi-layer perceptron neural network, the predicted molecular graph label is output, and the loss value $\lambda_2$ between the predicted molecular graph label and the corresponding real label is calculated using the cross-entropy loss function.

Third step, the neural network loss value L is obtained according to the following formula:

$$L = p\lambda_1 + (1 - p)\lambda_2$$

where p represents a bias value selected in the range (0.8, 1) according to the number of nodes in the different classes of molecules.
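The weighted combination of the two losses can be sketched as follows. The predicted probabilities are made-up numbers standing in for network outputs, and the function names are illustrative; only the formula L = pλ₁ + (1 - p)λ₂ and the range of p come from the text.

```python
import math


def cross_entropy(probs, one_hot_label, eps=1e-12):
    """Cross entropy between predicted class probabilities and a one-hot label."""
    return -sum(t * math.log(max(q, eps)) for q, t in zip(probs, one_hot_label))


def combined_loss(lambda_1, lambda_2, p):
    """L = p * lambda_1 + (1 - p) * lambda_2, with p in the open interval (0.8, 1)."""
    if not 0.8 < p < 1.0:
        raise ValueError("p is expected to lie in (0.8, 1)")
    return p * lambda_1 + (1.0 - p) * lambda_2


# Made-up network outputs for a molecule whose true class is 1.
lambda_1 = cross_entropy([0.1, 0.9], [0, 1])  # loss on the fused features
lambda_2 = cross_entropy([0.4, 0.6], [0, 1])  # loss on substructures from different sets
loss = combined_loss(lambda_1, lambda_2, p=0.9)
```

Since p exceeds 0.8, the loss on the fused features dominates, and the loss on substructures drawn from different sets acts as a smaller correction term.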

Step 7, judging whether the loss value of the neural network has converged; if so, training is stopped to obtain a trained multi-layer perceptron neural network and step 8 is executed; otherwise, step 3 is executed.

Step 8, the molecular graph of the same class to be predicted is input into the trained multi-layer perceptron neural network, a graph label is output, and the activity type corresponding to the label is obtained.
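The prediction step can be illustrated with a forward pass through a small multi-layer perceptron. Everything here is an assumption for illustration: "4-layer" is read as four weight layers, the hidden width of 8, the ReLU activations, the softmax output, and the random untrained weights are not taken from the patent, and a real model would use the weights obtained in Step 6.

```python
import math
import random


def init_mlp(sizes, seed=0):
    """Random weights for consecutive layer sizes, e.g. [6, 8, 8, 8, 2]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Return class probabilities: ReLU on hidden layers, softmax on the output."""
    for k, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    top = max(x)
    exps = [math.exp(v - top) for v in x]  # shift by the maximum for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(layers, feature_matrix):
    """Flatten the feature matrix and return the index of the most probable class."""
    flat = [v for row in feature_matrix for v in row]
    probs = forward(layers, flat)
    return max(range(len(probs)), key=probs.__getitem__)


layers = init_mlp([6, 8, 8, 8, 2])        # four weight layers, two activity classes
features = [[1, 0, 0], [0, 1, 0]]         # made-up 2 x 3 fused feature matrix
probs = forward(layers, [v for row in features for v in row])
label = predict_label(layers, features)
```

The returned index is the predicted molecular graph label, which is then mapped to the corresponding activity type.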

The effects of the present invention are further described below in conjunction with simulation experiments:

1. Simulation experiment conditions:

The hardware platform of the simulation experiment of the invention is: an Intel i5 3470 CPU with a main frequency of 3.20 GHz and 4 GB of memory.

The software platform of the simulation experiment of the invention is: the Ubuntu operating system and Python 3.6.

The input molecular graph datasets used in the simulation experiments of the invention are four molecular compound datasets, NCI-1, NCI-33, NCI-83 and NCI-123, published in 2006 by the National Cancer Institute on the website https://www.cancer.gov/. NCI-1 is the balanced dataset of compounds screened for activity against non-small cell lung cancer, with two activity class labels; NCI-33 is the balanced dataset of compounds screened for activity against melanoma, with two activity class labels; NCI-83 is the balanced dataset of compounds screened for activity against breast cancer, with two activity class labels; NCI-123 is the balanced dataset of compounds screened for activity against breast cancer, with two activity class labels.

2. Simulation content and result analysis:

The simulation experiment of the invention classifies the molecular graphs using the invention and three prior-art methods (the Kernel-GK method, the Deep-SP method, and the GAM method), and obtains the classification results.

In the simulation experiments, the three prior-art methods are:

The existing Kernel-GK method refers to the molecular graph prediction method proposed by Nino Shervashidze et al. in "Efficient graphlet kernels for large graph comparison", Conference on Artificial Intelligence and Statistics, 2009, abbreviated as the similar-substructure graph kernel method.

The existing Deep-SP method refers to the molecular graph prediction method proposed by Pinar Yanardag et al. in "Deep Graph Kernels", Knowledge Discovery and Data Mining conference, 2015, abbreviated as the deep shortest-path graph kernel method.

The existing GAM method refers to the molecular graph prediction method proposed by J. B. Lee et al. in "Graph classification using structural attention", Knowledge Discovery and Data Mining conference, 2018, abbreviated as the attention neural network graph classification method.

The classification results of the three methods are each evaluated using the average accuracy (AA) evaluation index. The classification accuracy on the datasets NCI1, NCI33, NCI83 and NCI123 was calculated using the following formula, and all the results are listed in Table 1:

TABLE 1 quantitative analysis Table of the classification results of the invention and the respective prior arts in simulation experiments

As can be seen from Table 1, the classification accuracy (AA index) of the invention is higher than that of the three prior-art methods, showing that the invention obtains higher molecular activity prediction accuracy.
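A classification accuracy of this kind can be sketched as below, assuming the usual definition: the number of correctly classified molecular graphs divided by the total number of graphs. The function name and the label lists are illustrative and are not results from the experiments.

```python
def classification_accuracy(predicted, actual):
    """Fraction of molecular graphs whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Made-up labels for five test graphs (1 = active, 0 = inactive).
predicted_labels = [1, 0, 1, 1, 0]
true_labels = [1, 0, 0, 1, 0]
accuracy = classification_accuracy(predicted_labels, true_labels)
```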

The simulation experiment shows that the method identifies different substructure features of the molecular graph using the random walk idea. Requiring the extracted substructures to be open-loop overcomes the drawback of the prior-art methods that the trained network cannot distinguish differences between substructures because the extracted molecular graph substructures contain closed loops. In addition, fusing the features of similar substructures reduces the differences among the features of different substructures, which addresses the prior-art problem that different substructures predict different results and lower the accuracy of molecular activity prediction. Compared with the other methods, the multi-layer neural network prediction model of the invention has a short training time and strong generalization, making it a very practical molecular activity prediction method.

Claims

1. A molecular activity prediction method based on multi-substructure feature fusion, characterized in that a random walk method is used to extract a plurality of substructure features of a molecular graph, and the fused substructure features are input into a trained multi-layer neural network to predict molecular activity, the method specifically comprising the following steps:

(1) Obtaining a feature matrix corresponding to the drug molecular information:

(2) Selecting an initial node:

(3) Extracting a plurality of sub-structural features of a molecular diagram:

(4) Calculating the similarity of the substructures:

(5) Fusing the substructural feature matrix:

(6) Training a neural network:

(6a) Arbitrarily selecting two substructure features from different sets, inputting the two selected substructure features into a 4-layer multi-layer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding real molecular graph label using a cross-entropy loss function;

(6b) Inputting the fused substructure features into the 4-layer multi-layer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding real molecular graph label using a cross-entropy loss function;

2. The molecular activity prediction method based on multi-substructure feature fusion according to claim 1, wherein the random walk method in step (3) is as follows: the random walk selects an unselected node in the neighborhood of the current node of the molecular graph; if, during selection, the neighborhood of the current node contains no unselected node, the walk backtracks to a previously selected node, wherein the neighborhood of a node is the set of all other nodes connected to that node in the molecular graph.

3. The molecular activity prediction method based on multi-substructure feature fusion according to claim 1, wherein the step of superimposing the two loss values in step (6c) is:

firstly, inputting the fused feature matrix into the neural network and obtaining the loss value $\lambda_1$ using the cross-entropy loss function, and inputting the substructures selected from the different sets into the neural network to obtain the loss value $\lambda_2$;

Secondly, obtaining the neural network loss value L according to the following formula:

$$L = p\lambda_1 + (1 - p)\lambda_2$$