CN111916143A - Molecular activity prediction method based on multiple substructure feature fusion - Google Patents
Molecular activity prediction method based on multiple substructure feature fusion
- Publication number
- CN111916143A (application number CN202010729533.9A)
- Authority
- CN
- China
- Prior art keywords
- substructure
- molecular
- neural network
- substructures
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Crystallography & Structural Chemistry (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Evolutionary Biology (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Biotechnology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a molecular activity prediction method based on multi-substructure feature fusion. The method trains a neural network on substructure features extracted from a molecular graph, and addresses three problems of the prior art: closed loops in the extracted substructures, poor network prediction accuracy, and difficult computation. The method comprises the following steps: 1) converting the drug molecule information into a molecular feature matrix; 2) selecting an initial node; 3) obtaining a plurality of substructures; 4) calculating the similarity of the substructures; 5) fusing the substructure feature matrices; 6) training a neural network; 7) judging whether the neural network training has converged; 8) obtaining the molecular activity to be predicted. The invention can distinguish the differences between substructures, mitigates the problem of noise in molecular graphs, and predicts molecular activity with high accuracy.
Description
Technical Field
The invention belongs to the technical field of biology, and further relates to a molecular activity prediction method based on multi-substructure feature fusion in the technical field of biological activity. Using the molecular structure information of a class of drug molecules and their corresponding biological activity, the invention can predict the effect of unknown drug molecules of the same class on biological activity.
Background
Molecular activity prediction trains a neural network model on the molecular structure information of a class of drugs and their measured effect on biological activity, so that the model can predict the effect on biological activity of unknown drug molecules of the same class from their structure information alone. The model can therefore screen a large number of molecules of the same class for the drug compounds best suited to biological activity assays in the laboratory. To convert drug molecules into information a computer can process, the molecular structure information is converted into a molecular graph, and the effect of the drug molecule on biological activity is quantified as the molecular graph label. Molecular activity prediction can simplify the drug development process, reduce the safety hazards of biological experiments and save their cost. Current molecular activity prediction techniques are still challenged by label noise on molecular graph nodes.
In the paper "Deep Graph Kernels" (2015 Knowledge Discovery and Data Mining conference), Pinar Yanardag proposed a method that predicts molecular activity by comparing the similarity of substructures across a class of molecular graphs. The method divides a class of molecular graphs into a training set and a test set, divides the training-set molecular graphs into a plurality of substructures, trains a neural network model with the training-set substructures and labels, and finally obtains the labels of the test-set molecular graphs from the similarity between their substructures and the substructures of the training-set molecular graphs. The disadvantage of this method is that a molecular graph is randomly divided into several different substructures, and different substructures may yield different predictions, which reduces the accuracy of the predicted molecular graph labels.
In the paper "Graph Classification using Structural Attention" (2018 Knowledge Discovery and Data Mining conference), Lee proposed a molecular activity prediction method based on an attention neural network. The method divides a class of molecular graphs into a training set and a test set, uses an attention mechanism to find molecular graph substructures in the training set, and trains an LSTM model with these substructures and the molecular graph labels. Finally, the model uses the attention mechanism to search for substructures of the test-set molecular graphs and predicts their labels. The disadvantage of this method is that the substructures it finds contain closed loops, so the trained network cannot distinguish the differences between substructures.
Disclosure of Invention
In view of the defects of existing molecular activity prediction techniques, the invention aims to provide a molecular activity prediction method based on multi-substructure feature fusion, to solve the problems that structural features are difficult to extract from noisy molecular graphs and that prediction accuracy is poor.
The idea for realizing the purpose of the invention is as follows: based on the distinctive structural features and node label features of molecular graph substructures, a set of molecular graph substructures is extracted by random walk, the feature information of the substructures is completed, and the fused features of part of the substructures are input into a constructed neural network for training, so that molecular activity is predicted more accurately and more quickly.
The specific steps of the invention are as follows:
(1) obtaining the feature matrices corresponding to the drug molecule information:
applying byte-based one-hot encoding to the atoms of a drug molecule to obtain a one-hot feature matrix, representing the bond-value pairs between the atoms of the drug as a neighborhood feature matrix, and applying byte-based one-hot encoding to the activity of the drug molecule to obtain a one-hot label feature matrix;
(2) selecting an initial node:
(2a) representing the atoms of the drug molecule as nodes, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, so that the nodes, the connecting edges and the molecular graph label together form a molecular graph;
(2b) calculating the centrality value of each node in the molecular graph using the betweenness method, and selecting the node with the highest centrality value as the initial node;
(3) extracting a plurality of substructure characteristics of the molecular diagram:
starting from the initial node, using a random walk method to select from the molecular graph l substructures without repeated nodes, where l is less than the number of nodes of the molecular graph, and selecting a substructure set by the same method;
(4) calculating the similarity of the substructures:
(4a) encoding each substructure in the substructure set node by node to obtain the feature matrix of the substructure;
(4b) calculating the similarity of every pair of substructures in the substructure set using the similarity formula:
$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$
wherein $J_{m,n}$ denotes the similarity between the mth and the nth substructure in the substructure set, $p_m$ denotes the feature matrix of the mth substructure, $p_n$ denotes the feature matrix of the nth substructure, $|\cdot|$ denotes the matrix modulus operation, $\cap$ denotes the intersection operation, and $\cup$ denotes the union operation;
(4c) storing all substructures whose similarity is greater than or equal to a threshold in a similar set, and storing the remaining substructures in a different set, wherein the threshold lies in the range (0.5, 1) and is chosen according to the number of nodes in the different classes of molecular graph;
(5) fusing the substructure feature matrices:
averaging all the substructure feature matrices in the similar set to obtain a fused substructure feature matrix;
(6) training a neural network:
(6a) inputting the fused substructure features into a 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using a cross-entropy loss function;
(6b) randomly selecting two substructure features from the different set, inputting them into the 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using the cross-entropy loss function;
(6c) superposing the two loss values to obtain the loss value of the neural network being trained;
(7) judging whether the loss value of the neural network has converged; if so, stopping training to obtain the trained multilayer perceptron neural network and executing step (8); otherwise, executing step (3);
(8) inputting a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, and outputting the molecular graph label to obtain the activity class corresponding to that label.
Compared with the prior art, the invention has the following advantages:
First, the invention averages all the substructure feature matrices in the similar set to obtain a fused substructure feature matrix, and inputs the fused matrix into the network being trained to obtain the molecular graph label. This overcomes the problem in the prior art that a molecular graph is randomly divided into several different substructures whose differing predictions reduce the accuracy of the predicted molecular graph label. The invention therefore extracts molecular graph substructure features well and improves the accuracy of the predicted molecular graph label.
Second, the invention uses a random walk method to select substructures without closed loops from the molecular graph, and divides them into a similar substructure set and a different substructure set to train the network. This overcomes the problem in the prior art that the substructures found contain closed loops, so that the trained network cannot distinguish the differences between substructures. The invention can therefore distinguish the differences between substructures.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the process of extracting molecular graph substructure using random walk according to the present invention;
FIG. 3 is a schematic diagram of the fusion of similar sets of substructure features according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1.
Step 1, obtaining the feature matrices corresponding to the drug molecule information.
Byte-based one-hot encoding is applied to the atoms of a drug molecule to obtain a one-hot feature matrix, the bond-value pairs between the atoms of the drug are represented as a neighborhood feature matrix, and byte-based one-hot encoding is applied to the activity of the drug molecule to obtain a one-hot label feature matrix.
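The encoding in this step can be illustrated with a minimal sketch. The vocabulary of atom symbols, the toy three-atom molecule, the activity classes and the helper names `one_hot` and `adjacency_matrix` are assumptions made for the example; the source does not specify them, and the neighborhood feature matrix is interpreted here as a symmetric 0/1 adjacency matrix.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot(labels: Sequence[str], vocabulary: Sequence[str]) -> List[List[int]]:
    """Return one row per label with a single 1 at the label's vocabulary index."""
    index: Dict[str, int] = {symbol: i for i, symbol in enumerate(vocabulary)}
    rows = []
    for label in labels:
        row = [0] * len(vocabulary)
        row[index[label]] = 1
        rows.append(row)
    return rows


def adjacency_matrix(num_atoms: int, bonds: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 matrix from (atom_i, atom_j) bond pairs."""
    matrix = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Hypothetical toy molecule with heavy atoms C-C-O (not taken from the source).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

atom_features = one_hot(atoms, vocabulary=["C", "N", "O"])
neighborhood = adjacency_matrix(len(atoms), bonds)
activity_label = one_hot(["active"], vocabulary=["inactive", "active"])[0]
```

Here `atom_features` plays the role of the one-hot feature matrix, `neighborhood` the neighborhood feature matrix, and `activity_label` one row of the one-hot label feature matrix.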
Step 2, selecting an initial node.
The atoms of the drug molecule are represented as nodes, the atomic properties of the drug molecule as node labels, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label; the nodes, the connecting edges and the molecular graph label together form a molecular graph.
The centrality value of each node in the molecular graph is calculated using the betweenness method, and the node with the highest centrality value is selected as the initial node.
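A minimal sketch of the initial-node selection follows. It assumes the molecular graph is undirected and unweighted and is stored as an adjacency list, and it computes unnormalized betweenness centrality with Brandes' algorithm; the source names only the betweenness method, so the choice of algorithm, the tie-breaking rule and the function names are assumptions.

```python
from collections import deque
from typing import Dict, Hashable, List


def betweenness(graph: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, float]:
    """Unnormalized betweenness centrality of an undirected, unweighted graph."""
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in graph}
        num_paths = {v: 0 for v in graph}
        distance = {v: -1 for v in graph}
        num_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Every unordered pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in centrality.items()}


def pick_initial_node(graph: Dict[Hashable, List[Hashable]]) -> Hashable:
    """Return the node with the highest betweenness (first one wins on ties)."""
    scores = betweenness(graph)
    return max(scores, key=scores.get)


# Hypothetical chain of five atoms 0-1-2-3-4; the middle atom is the most central.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
initial_node = pick_initial_node(chain)
```

On the chain above the middle node lies on the most shortest paths, so it is chosen as the starting point of the random walk.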
Step 3, extracting a plurality of substructure features of the molecular graph.
The process of extracting molecular graph substructures by random walk is described in detail below with reference to fig. 2.
In FIG. 2, nodes with black borders are nodes already selected by the random walk, nodes without black borders are nodes not yet selected, and the number inside each node is the index of the atom in the molecule. $c_{t-1}$ denotes the node the random walk points to at time t-1, and $c_t$ denotes the node it points to at time t. FIG. 2(a) shows the random walk selecting node 4 at time t-1. FIG. 2(b) shows backtracking to node 9: none of the nodes connected to the current node 9 is unselected, so to find an unselected node, $c_{t-1}$ must continue backtracking to earlier nodes that may still have unselected neighbors. FIG. 2(c) shows $c_{t-1}$ backtracking to node 2, where node 1, connected to the current node 2, has not been selected. FIG. 2(d) shows the selection of node 1: because node 1 has not been selected, the random walk selects it, points $c_t$ to node 1 and sets its border to black.
As shown in fig. 2, starting from the initial node, the random walk method is used to select from the molecular graph l substructures without repeated nodes, where l is less than the number of nodes of the molecular graph, and a substructure set is selected by the same method.
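The backtracking random walk described above can be sketched as follows. The sketch assumes that the parameter `length` caps the number of distinct nodes in one substructure and that a substructure set is built by calling the function repeatedly; both are interpretations of the text, and the graph and the function name are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Optional


def random_walk_substructure(
    graph: Dict[Hashable, List[Hashable]],
    start: Hashable,
    length: int,
    rng: Optional[random.Random] = None,
) -> List[Hashable]:
    """Select up to `length` distinct nodes by a random walk that backtracks at dead ends."""
    rng = rng or random.Random()
    selected = [start]   # nodes in the order they were picked
    visited = {start}
    path = [start]       # current walk path, used for backtracking
    while len(selected) < length and path:
        current = path[-1]
        candidates = [n for n in graph[current] if n not in visited]
        if candidates:
            nxt = rng.choice(candidates)
            visited.add(nxt)
            selected.append(nxt)
            path.append(nxt)
        else:
            # No unselected neighbor: backtrack to the previously selected node.
            path.pop()
    return selected


# Hypothetical molecular graph as an adjacency list (node ids are atom indices).
molecular_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 9], 9: [4]}
walk = random_walk_substructure(molecular_graph, start=3, length=5, rng=random.Random(0))
```

Because every newly selected node is reached from an already selected one and no node is selected twice, the selected nodes and the edges used to reach them form a connected substructure without a closed loop.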
Step 4, calculating the similarity of the substructures.
First step: encode each substructure in the substructure set node by node to obtain the feature matrix of the substructure.
Second step: calculate the similarity of every pair of substructures in the substructure set using the similarity formula:
$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$
where $J_{m,n}$ denotes the similarity between the mth and the nth substructure in the substructure set, $p_m$ denotes the feature matrix of the mth substructure, $p_n$ denotes the feature matrix of the nth substructure, $|\cdot|$ denotes the matrix modulus operation, $\cap$ denotes the intersection operation, and $\cup$ denotes the union operation.
Third step: $|p_m \cap p_n|$ is obtained from the Jaccard similarity and the Hamming distance of the node sequence features of substructures $p_m$ and $p_n$, as follows. First, the node label transition sequence of each substructure is derived. For example, suppose substructure $p_m$ has the node sequence feature [1,1,2,2,3] and $p_n$ has the node sequence feature [1,2,3,2,3], where each element is a node label. From $p_m$ = [1,1,2,2,3] the transition sequence [0,1,0,1] is obtained, and from $p_n$ = [1,2,3,2,3] the transition sequence [1,1,1,1], where 1 indicates that the labels of two adjacent nodes in the substructure sequence differ and 0 indicates that they are the same. The Jaccard similarity and the Hamming distance of the two node label transition sequences are then computed and normalized to obtain the similarity value.
Fourth step: store all substructures whose similarity is greater than or equal to a threshold in a similar set, and store the remaining substructures in a different set, where the threshold lies in the range (0.5, 1) and is chosen according to the number of nodes in the different molecular graphs.
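The similarity step can be sketched on the example sequences given above. The source does not state how the Jaccard and Hamming values are combined, nor how pairwise similarities are turned into the two sets, so the sketch makes two labeled assumptions: only the Jaccard value is compared with the threshold, and a substructure joins the similar set if it reaches the threshold with at least one other substructure. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def transition_sequence(labels: Sequence[int]) -> List[int]:
    """1 where two adjacent node labels differ, 0 where they are equal."""
    return [int(a != b) for a, b in zip(labels, labels[1:])]


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard similarity of two equal-length 0/1 sequences (positions holding 1 are set members)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / union if union else 1.0


def normalized_hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def split_by_similarity(
    sequences: Sequence[Sequence[int]], threshold: float
) -> Tuple[List[int], List[int]]:
    """Assumed rule: an index is 'similar' if it reaches the threshold with any other index."""
    similar = set()
    for m, n in combinations(range(len(sequences)), 2):
        if jaccard(sequences[m], sequences[n]) >= threshold:
            similar.update((m, n))
    different = [i for i in range(len(sequences)) if i not in similar]
    return sorted(similar), different


p_m = [1, 1, 2, 2, 3]
p_n = [1, 2, 3, 2, 3]
t_m = transition_sequence(p_m)
t_n = transition_sequence(p_n)
similar_set, different_set = split_by_similarity([t_m, t_m, t_n], threshold=0.6)
```

With a threshold of 0.6, the two identical transition sequences fall into the similar set and the third sequence falls into the different set.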
Step 5, fusing the substructure feature matrices.
The process of fusing the substructures in the similar set is described in detail below with reference to fig. 3.
In fig. 3, each substructure is shown as one row of a node sequence, and each such node sequence corresponds to one feature matrix. Fig. 3(a) shows the feature matrices of three substructures. Fig. 3(b) shows the fused substructure feature matrix, obtained by averaging the three substructure feature matrices of fig. 3(a).
All the substructure feature matrices in the similar set are averaged to obtain the fused substructure feature matrix. Each substructure is encoded into a feature matrix according to its node features, and the several substructure feature matrices in the similar set are fused by averaging into a single substructure feature matrix.
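The averaging in this step amounts to an element-wise mean, sketched below. The sketch assumes that all feature matrices in the similar set have the same shape; the function name `fuse` and the example matrices are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def fuse(matrices: Sequence[Sequence[Sequence[float]]]) -> Matrix:
    """Element-wise mean of equally shaped feature matrices."""
    if not matrices:
        raise ValueError("the similar set is empty")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    for m in matrices:
        if len(m) != rows or any(len(row) != cols for row in m):
            raise ValueError("all feature matrices must have the same shape")
    count = len(matrices)
    return [
        [sum(m[r][c] for m in matrices) / count for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical one-hot feature matrices (rows: nodes, columns: node labels).
first = [[1, 0], [0, 1]]
second = [[0, 1], [0, 1]]
fused = fuse([first, second])
```

Entries on which the substructures agree keep their value, while entries on which they disagree are smoothed toward the mean.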
Step 6, training the neural network.
First step: inputting the fused substructure features into a 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and using a cross-entropy loss function to calculate the loss value $\lambda_1$ between the predicted molecular graph label and the corresponding true molecular graph label.
Second step: randomly selecting two substructures from the different set, inputting the features of the two selected substructures into the 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and using the cross-entropy loss function to calculate the loss value $\lambda_2$ between the predicted molecular graph label and the corresponding true molecular graph label.
Third step: obtaining the neural network loss value L according to the following formula:
$L = p\lambda_1 + (1-p)\lambda_2$
where p denotes a bias value, selected in the range (0.8, 1) according to the number of nodes in the different molecular graphs.
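The loss combination can be checked numerically with a short sketch. It assumes that the network outputs class probabilities and that labels are one-hot vectors; the function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probabilities: Sequence[float], one_hot_label: Sequence[int]) -> float:
    """Cross-entropy between predicted class probabilities and a one-hot label."""
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(max(q, eps)) for q, y in zip(probabilities, one_hot_label))


def combined_loss(loss_fused: float, loss_different: float, p: float) -> float:
    """L = p * lambda_1 + (1 - p) * lambda_2 with the bias value p in (0.8, 1)."""
    if not 0.8 < p < 1.0:
        raise ValueError("p must lie in the open interval (0.8, 1)")
    return p * loss_fused + (1.0 - p) * loss_different


# Hypothetical network outputs for a molecule whose true class is index 1.
lambda_1 = cross_entropy([0.2, 0.8], [0, 1])  # fused similar-set features
lambda_2 = cross_entropy([0.5, 0.5], [0, 1])  # features drawn from the different set
total = combined_loss(lambda_1, lambda_2, p=0.9)
```

Because p is close to 1, the loss on the fused features dominates, and the loss on the different set acts as a smaller correction term.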
Step 7, judging whether the loss value of the neural network has converged; if so, stopping training to obtain the trained multilayer perceptron neural network and executing step 8; otherwise, executing step 3.
Step 8, inputting a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, and outputting the molecular graph label to obtain the activity class corresponding to that label.
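The prediction step can be illustrated by the forward pass of a small multilayer perceptron. The sketch is not a trained model: the layer sizes, the ReLU activation, the random weights, and the reading of "4-layer" as four dense layers are all assumptions, and the input stands for a fused feature matrix flattened into a vector.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def dense(x: Sequence[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(x: Sequence[float]) -> List[float]:
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def init_mlp(sizes: Sequence[int], rng: random.Random) -> List[Layer]:
    """Random layers; sizes [in, h1, h2, h3, out] yields four dense layers."""
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def predict(layers: List[Layer], features: Sequence[float]) -> Tuple[int, List[float]]:
    """Return the predicted class index and the class probabilities."""
    x = list(features)
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    probabilities = softmax(dense(x, weights, bias))
    return probabilities.index(max(probabilities)), probabilities


# Untrained, hypothetical network: 6 input features, 2 activity classes.
layers = init_mlp([6, 8, 8, 4, 2], random.Random(0))
label, probabilities = predict(layers, [0.5] * 6)
```

In the method itself the weights would come from the training in steps 3 to 7, and the returned index would be mapped to the corresponding activity class.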
The effect of the invention is further explained below with reference to simulation experiments:
1. Simulation experiment conditions:
The hardware platform of the simulation experiments is: an Intel i5-3470 CPU with a main frequency of 3.20 GHz and 4 GB of memory.
The software platform of the simulation experiments is: the Ubuntu operating system and Python 3.6.
The input molecular graph datasets used in the simulation experiments are the following four molecular compound datasets published in 2006 by the U.S. National Cancer Institute (https://www.cancer.gov/): NCI-1, NCI-33, NCI-83 and NCI-123. NCI-1 is a balanced dataset of compounds screened for activity against non-small cell lung cancer, with two activity class labels. NCI-33 is a balanced dataset of compounds screened for activity against melanoma, with two activity class labels. NCI-83 is a balanced dataset of compounds screened for activity against breast cancer, with two activity class labels. NCI-123 is a balanced dataset of compounds screened for activity against breast cancer, with two activity class labels.
2. Simulation content and result analysis thereof:
In the simulation experiments, the molecular graphs are classified with the invention and with three prior-art methods (the Kernel-GK method, the Deep-SP method and the GAM method) to obtain the classification results.
The three prior-art methods used in the simulation experiments are:
The Kernel-GK method is the molecular graph prediction method proposed by Nino Shervashidze et al. in "Efficient graphlet kernels for large graph comparison", 2009 Conference on Artificial Intelligence and Statistics, referred to for short as the similar-substructure graph kernel method.
The Deep-SP method is the molecular graph prediction method proposed by Pinar Yanardag et al. in "Deep Graph Kernels", 2015 Knowledge Discovery and Data Mining conference, referred to for short as the deep shortest-path graph kernel method.
The GAM method is the molecular graph prediction method proposed by J. B. Lee et al. in "Graph Classification using Structural Attention", 2018 Knowledge Discovery and Data Mining conference, referred to for short as the attention neural network graph classification method.
The classification results of each method are evaluated with the average accuracy (AA) index. The classification accuracy on the NCI-1, NCI-33, NCI-83 and NCI-123 datasets is calculated using the following formula, and all results are listed in Table 1:
TABLE 1 quantitative analysis table of classification results of the present invention and various prior arts in simulation experiment
As Table 1 shows, the classification accuracy (AA) of the invention is higher than that of all three prior-art methods on every dataset, which demonstrates that the invention achieves higher molecular activity prediction accuracy.
The above simulation experiments show that the method can identify the different substructure features of a molecular graph by random walk. Requiring the extracted substructures to be open (free of closed loops) overcomes the defect of the prior art that the extracted substructures contain closed loops, so that the trained network cannot distinguish the differences between substructures. In addition, fusing the features of similar substructures reduces the differences among substructure features, and so solves the prior-art problem that differing predictions from different substructures reduce the accuracy of molecular activity prediction. Compared with the other methods, the multilayer neural network prediction model of the invention trains in less time and generalizes better, making it a very practical molecular activity prediction method.
Claims (3)
1. A molecular activity prediction method based on multi-substructure feature fusion, characterized in that a random walk method is used to extract a plurality of substructure features of a molecular graph, and the fused substructure features are input into a trained multilayer neural network to predict molecular activity, the method specifically comprising the following steps:
(1) obtaining the feature matrices corresponding to the drug molecule information:
applying byte-based one-hot encoding to the atoms of a drug molecule to obtain a one-hot feature matrix, representing the bond-value pairs between the atoms of the drug as a neighborhood feature matrix, and applying byte-based one-hot encoding to the activity of the drug molecule to obtain a one-hot label feature matrix;
(2) selecting an initial node:
(2a) representing the atoms of the drug molecule as nodes, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, so that the nodes, the connecting edges and the molecular graph label together form a molecular graph;
(2b) calculating the centrality value of each node in the molecular graph using the betweenness method, and selecting the node with the highest centrality value as the initial node;
(3) extracting a plurality of substructure features of the molecular graph:
starting from the initial node, using a random walk method to select from the molecular graph l substructures without repeated nodes, where l is less than the number of nodes of the molecular graph, and selecting a substructure set by the same method;
(4) calculating the similarity of the substructures:
(4a) encoding each substructure in the substructure set node by node to obtain the feature matrix of the substructure;
(4b) calculating the similarity of every pair of substructures in the substructure set using the similarity formula:
$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$
wherein $J_{m,n}$ denotes the similarity between the mth and the nth substructure in the substructure set, $p_m$ denotes the feature matrix of the mth substructure, $p_n$ denotes the feature matrix of the nth substructure, $|\cdot|$ denotes the matrix modulus operation, $\cap$ denotes the intersection operation, and $\cup$ denotes the union operation;
(4c) storing all substructures whose similarity is greater than or equal to a threshold in a similar set, and storing the remaining substructures in a different set, wherein the threshold lies in the range (0.5, 1) and is chosen according to the number of nodes in the different classes of molecular graph;
(5) fusing the substructure feature matrices:
averaging all the substructure feature matrices in the similar set to obtain a fused substructure feature matrix;
(6) training a neural network:
(6a) randomly selecting two substructure features from the different set, inputting them into a 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using a cross-entropy loss function;
(6b) inputting the fused substructure features into the 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using the cross-entropy loss function;
(6c) superposing the two loss values to obtain the loss value of the neural network being trained;
(7) judging whether the loss value of the neural network has converged; if so, stopping training to obtain the trained multilayer perceptron neural network and executing step (8); otherwise, executing step (3);
(8) inputting a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, and outputting the molecular graph label to obtain the activity class corresponding to that label.
2. The method for predicting the activity of a molecule based on the fusion of various substructure features according to claim 1, wherein the step of the random walk method in step (3) is: and selecting unselected nodes in the node neighborhood of the molecular graph by using a random walk method, and backtracking to the previously selected nodes if the unselected nodes do not exist in the current node neighborhood in the selection process, wherein the node neighborhood represents all other node sets connected with the node in the molecular graph.
3. The method for predicting the activity of a molecule based on the fusion of various substructure features according to claim 1, wherein superposing the two loss values in step (6c) comprises:
firstly, using a cross-entropy loss function, inputting the fused feature matrix into the neural network to obtain a loss value $\lambda_1$, and inputting the substructure features selected from different sets into the neural network to obtain a loss value $\lambda_2$;
secondly, obtaining the neural network loss value $L$ according to the following formula:
$L = p\lambda_1 + (1-p)\lambda_2$
wherein $p$ represents a bias value selected in the range (0.8, 1) according to the number of nodes in the different molecular graphs.
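A short worked example of the weighted loss in claim 3 is given below. The function name and the numeric loss values are hypothetical, and the range (0.8, 1) is read here as an open interval. The claim does not say how p is derived from the number of nodes, so p is simply passed in as an argument.

```python
def combined_loss(loss_fused, loss_pair, p):
    """Return L = p * lambda1 + (1 - p) * lambda2 for a bias value p in (0.8, 1)."""
    if not 0.8 < p < 1.0:
        raise ValueError("p is expected to lie in the open interval (0.8, 1)")
    return p * loss_fused + (1.0 - p) * loss_pair

# Hypothetical values: lambda1 = 0.40 (fused features), lambda2 = 0.90 (selected pair).
total = combined_loss(0.40, 0.90, p=0.9)  # 0.9 * 0.40 + 0.1 * 0.90, about 0.45
```

With p close to 1, the loss on the fused features dominates, and the loss on the randomly selected pair of substructure features acts as a smaller auxiliary term.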
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010729533.9A CN111916143B (en) | 2020-07-27 | 2020-07-27 | Molecular activity prediction method based on multi-substructural feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111916143A | 2020-11-10 |
CN111916143B | 2023-07-28 |
Family
ID=73281083
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202010729533.9A | Active | CN111916143B (en) | 2020-07-27 | 2020-07-27 | Molecular activity prediction method based on multi-substructural feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111916143B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112420131A (en) * | 2020-11-20 | 2021-02-26 | 中国科学技术大学 | Molecular generation method based on data mining |
WO2022222492A1 (en) * | 2021-04-23 | 2022-10-27 | 中国科学院深圳先进技术研究院 | Prediction method and device for drug molecular feature attribute |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1999012118A1 (en) * | 1997-09-03 | 1999-03-11 | Commonwealth Scientific And Industrial Research Organisation | Compound screening system |
CN106874688A (en) * | 2017-03-01 | 2017-06-20 | 中国药科大学 | Intelligent lead compound based on convolutional neural networks finds method |
WO2018220368A1 (en) * | 2017-05-30 | 2018-12-06 | Gtn Ltd | Tensor network machine learning system |
CN109033738A (en) * | 2018-07-09 | 2018-12-18 | 湖南大学 | A kind of pharmaceutical activity prediction technique based on deep learning |
RU2721190C1 (en) * | 2018-12-25 | 2020-05-18 | Общество с ограниченной ответственностью "Аби Продакшн" | Training neural networks using loss functions reflecting relationships between neighbouring tokens |
CN111429977A (en) * | 2019-09-05 | 2020-07-17 | 中国海洋大学 | Novel molecular similarity search algorithm based on graph structure attention |
CN111428848A (en) * | 2019-09-05 | 2020-07-17 | 中国海洋大学 | Molecular intelligent design method based on self-encoder and 3-order graph convolution |
Non-Patent Citations (3)
Title |
---|
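张小驰; 于华; 宫秀军: "An iterative weighted subgraph query algorithm based on random walk", 计算机研究与发展, no. 12 * |
潘永昊; 于洪涛; 刘树新: "A link prediction algorithm based on neural networks", 网络与信息安全学报, no. 07 * |
秦琦枫; 曾斌; 刘思莹: "Research on the application of deep neural networks in chemistry", 江西化工, no. 03 * |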
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112420131A (en) * | 2020-11-20 | 2021-02-26 | 中国科学技术大学 | Molecular generation method based on data mining |
CN112420131B (en) * | 2020-11-20 | 2022-07-15 | 中国科学技术大学 | Molecular generation method based on data mining |
WO2022222492A1 (en) * | 2021-04-23 | 2022-10-27 | 中国科学院深圳先进技术研究院 | Prediction method and device for drug molecular feature attribute |
Also Published As
Publication number | Publication date |
---|---|
CN111916143B (en) | 2023-07-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||