CN111916143A - Molecular activity prediction method based on multiple substructure feature fusion - Google Patents

Molecular activity prediction method based on multiple substructure feature fusion

Info

Publication number
CN111916143A
Authority
CN
China
Prior art keywords
substructure
molecular
neural network
substructures
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010729533.9A
Other languages
Chinese (zh)
Other versions
CN111916143B (en)
Inventor
丁静怡
宋健
焦李成
吴建设
成若晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010729533.9A (patent CN111916143B)
Publication of CN111916143A
Application granted
Publication of CN111916143B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a molecular activity prediction method based on multi-substructure feature fusion. The method trains a neural network on substructure features extracted from a molecular graph, and addresses the problems in the prior art that the extracted substructures contain closed loops, that network prediction accuracy is poor, and that computation is difficult. The method comprises the following steps: 1) converting the drug molecule information into molecular feature matrices; 2) selecting an initial node; 3) obtaining a plurality of substructures; 4) calculating the similarity of the substructures; 5) fusing the substructure feature matrices; 6) training a neural network; 7) judging whether the neural network being trained has converged; 8) obtaining the molecular activity to be predicted. The invention can distinguish the differences between substructures, mitigates the problem of noise in molecular graphs, and predicts molecular activity with high accuracy.

Description

Molecular activity prediction method based on multiple substructure feature fusion
Technical Field
The invention belongs to the technical field of biology, and further relates to a molecular activity prediction method based on multi-substructure feature fusion in the technical field of biological activity. Using the molecular structure information of a class of drug molecules and their corresponding biological activity, the invention can predict the effect of unknown drug molecules of the same class on biological activity.
Background
Molecular activity prediction trains a neural network model on the molecular structure information of a class of drugs and their corresponding effect on biological activity; the trained model can then predict the effect on biological activity of unknown drug molecules of the same class from their structure information. The model thus makes it possible to screen a large number of molecules of the same class for the drug compounds best suited to biological activity assays in the laboratory. To convert drug molecules into information that a computer can process, the molecular structure information is converted into a molecular graph, and the effect of the drug molecule on biological activity is quantified as the molecular graph label. Molecular activity prediction can simplify the drug development process, reduce the safety hazards of biological experiments, and save their cost. However, current molecular activity prediction techniques are challenged by label noise on the nodes of molecular graphs.
In the paper "Deep Graph Kernels" (2015 conference on knowledge discovery and data mining), Pinar Yanardag proposed a method that predicts molecular activity by comparing the similarity of substructures across a class of molecular graphs. The method divides a class of molecular graphs into a training set and a test set, decomposes the training-set molecular graphs into a plurality of substructures, trains a neural network model on the training-set substructures and labels, and finally obtains the labels of the test-set molecular graphs from the similarity between their substructures and those of the training-set molecular graphs. The disadvantage of this method is that, because a molecular graph is randomly divided into a plurality of different substructures, different substructures can yield different predictions, which reduces the accuracy of the predicted molecular graph labels.
In the paper "Graph Classification using Structural Attention" (2018 conference on knowledge discovery and data mining), Lee proposed a molecular activity prediction method based on an attention neural network. The method divides a class of molecular graphs into a training set and a test set, uses an attention mechanism to find substructures of the training-set molecular graphs, and trains an LSTM model on these substructures and the molecular graph labels. Finally, the model uses the attention mechanism to search the substructures of the test-set molecular graphs and predicts their labels. The disadvantage of this method is that the substructures it finds contain closed loops, so the trained network cannot distinguish the differences between substructures.
Disclosure of Invention
In view of the shortcomings of existing molecular activity prediction techniques, the invention aims to provide a molecular activity prediction method based on multi-substructure feature fusion, which addresses the difficulty of extracting structural features from noisy molecular graphs and the resulting poor prediction accuracy.
The idea for realizing the purpose of the invention is as follows: based on the distinctive structural features and node label features of molecular graph substructures, a set of substructures is extracted from the molecular graph by random walk, the feature information of the substructures is completed, and the fused features of some of the substructures are input into a constructed neural network for training, so that molecular activity is predicted more accurately and more quickly.
The specific implementation steps of the invention are as follows:
(1) obtaining the feature matrices corresponding to the drug molecule information:
performing byte-based one-hot encoding on the atoms in a drug molecule to obtain a one-hot encoded feature matrix, expressing the bond value pairs between the atoms of the drug molecule as a neighborhood feature matrix, and performing byte-based one-hot encoding on the activity of the drug molecule to obtain a one-hot encoded label feature matrix;
(2) selecting an initial node:
(2a) representing the atoms of the drug molecule as nodes, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, the nodes, connecting edges, and molecular graph label together forming the molecular graph;
(2b) calculating the centrality value of each node in the molecular graph by the betweenness centrality method, and selecting the node with the highest centrality value as the initial node;
(3) extracting a plurality of substructure features of the molecular graph:
starting from the initial node, using a random walk method to select from the molecular graph l substructures composed of non-repeated nodes, where l is less than the number of nodes of the molecular graph, the substructures selected in this way forming a substructure set;
(4) calculating the similarity of the substructures:
(4a) encoding each substructure in the substructure set on the basis of its nodes to obtain the feature matrix of the substructure;
(4b) calculating the similarity of every pair of substructures in the substructure set using the similarity formula:
$$J_{m,n} = \frac{|g \cap p|}{|g \cup p|}$$
where $J_{m,n}$ denotes the similarity between the m-th and n-th substructures in the substructure set, $g$ denotes the feature matrix of the m-th substructure in the substructure set, $p$ denotes the feature matrix of the n-th substructure in the substructure set, $|\cdot|$ denotes the matrix modulus operation, $\cap$ denotes the intersection operation, and $\cup$ denotes the union operation;
(4c) storing all substructures whose similarity is greater than or equal to a threshold in a similar set, and storing the remaining substructures in a different set, where the threshold is chosen in the range (0.5, 1) according to the number of nodes in the different classes of molecular graphs;
(5) fusing the substructure feature matrices:
averaging all substructure feature matrices in the similar set to obtain a fused substructure feature matrix;
(6) training a neural network:
(6a) inputting the fused substructure features into a 4-layer multilayer perceptron neural network, outputting the predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label with a cross-entropy loss function;
(6b) randomly selecting two substructure features from the different set, inputting them into the 4-layer multilayer perceptron neural network, outputting the predicted molecular graph labels, and calculating the loss value between the predicted molecular graph labels and the corresponding true molecular graph labels with a cross-entropy loss function;
(6c) superposing the two loss values to obtain the loss value of the neural network being trained;
(7) judging whether the loss value of the neural network being trained has converged; if so, stopping training to obtain the trained multilayer perceptron neural network and executing step (8); otherwise, executing step (3);
(8) inputting a molecular graph of the same class to be predicted into the trained multilayer perceptron neural network, and outputting the molecular graph label to obtain the corresponding activity class.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention averages all substructure feature matrices in the similar set to obtain a fused substructure feature matrix, and inputs this fused matrix into the network being trained to obtain the molecular graph label. This overcomes the problem in the prior art that, because a molecular graph is randomly divided into a plurality of different substructures, different substructures can yield different predictions and so reduce the accuracy of the predicted molecular graph label. The invention therefore extracts the substructure features of the molecular graph well and improves the accuracy of the predicted molecular graph label.
Secondly, the invention selects substructures without closed loops from the molecular graph by a random walk method, and divides them into a similar substructure set and a different substructure set for training the network. This overcomes the problem in the prior art that the trained network cannot distinguish the differences between substructures because the substructures found contain closed loops, so that the invention can distinguish the differences between different substructures.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the process of extracting molecular graph substructure using random walk according to the present invention;
FIG. 3 is a schematic diagram of the fusion of similar sets of substructure features according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1.
Step 1, obtaining the feature matrices corresponding to the drug molecule information.
Byte-based one-hot encoding is performed on the atoms in the drug molecule to obtain a one-hot encoded feature matrix, the bond value pairs between the atoms are expressed as a neighborhood feature matrix, and byte-based one-hot encoding is performed on the activity of the drug molecule to obtain a one-hot encoded label feature matrix.
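Step 1 can be illustrated with a minimal sketch. The patent does not give the exact encoding layout, so everything here is an assumption made for illustration: the function names, the use of sorted atom symbols as the one-hot vocabulary, and a 0/1 symmetric matrix as the neighborhood feature matrix.

```python
from typing import Dict, List, Tuple


def one_hot_atoms(atom_types: List[str]) -> Tuple[List[List[int]], Dict[str, int]]:
    """One-hot encode atom symbols; columns follow the sorted unique symbols (assumed layout)."""
    vocab = {symbol: i for i, symbol in enumerate(sorted(set(atom_types)))}
    rows = []
    for symbol in atom_types:
        row = [0] * len(vocab)
        row[vocab[symbol]] = 1
        rows.append(row)
    return rows, vocab


def neighborhood_matrix(num_atoms: int, bonds: List[Tuple[int, int]]) -> List[List[int]]:
    """Symmetric 0/1 matrix with a 1 wherever two atoms share a bond."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for a, b in bonds:
        adj[a][b] = 1
        adj[b][a] = 1
    return adj


def one_hot_label(label: int, num_classes: int) -> List[int]:
    """One-hot encode the activity class of the molecule."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


# Hypothetical toy molecule: a C-C-O chain labelled as activity class 1 of 2.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
features, vocab = one_hot_atoms(atoms)
adjacency = neighborhood_matrix(len(atoms), bonds)
label = one_hot_label(1, 2)
```

In this sketch `features` has one row per atom, `adjacency` records which atom pairs are bonded, and `label` is the one-hot activity class.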
Step 2, selecting an initial node.
The atoms of the drug molecule are represented as nodes, the atomic properties as node labels, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label; the nodes, connecting edges, and molecular graph label together form the molecular graph.
The centrality value of each node in the molecular graph is calculated by the betweenness centrality method, and the node with the highest centrality value is selected as the initial node.
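The initial-node selection can be sketched as follows. The patent only names the betweenness method; the use of Brandes' algorithm for unweighted, undirected graphs, the adjacency-list representation, and the tie-breaking rule (smallest node index) are assumptions of this sketch.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency lists of an undirected molecular graph


def betweenness_centrality(graph: Graph) -> Dict[int, float]:
    """Unnormalized betweenness centrality via Brandes' algorithm (BFS variant)."""
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        order = []                          # nodes in order of non-decreasing distance
        preds = {v: [] for v in graph}      # shortest-path predecessors
        sigma = {v: 0 for v in graph}       # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:             # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while order:                        # accumulate dependencies, farthest first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each unordered pair of endpoints was counted twice in an undirected graph.
    return {v: c / 2.0 for v, c in centrality.items()}


def pick_initial_node(graph: Graph) -> int:
    """Node with the highest betweenness; ties go to the smallest index (assumption)."""
    scores = betweenness_centrality(graph)
    return max(sorted(scores), key=scores.get)


# Hypothetical 5-atom chain 0-1-2-3-4: the middle atom lies on the most shortest paths.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = betweenness_centrality(chain)
start = pick_initial_node(chain)
```

For the toy chain, the middle node is chosen as the starting point of the random walk in Step 3.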
Step 3, extracting a plurality of substructure features of the molecular graph.
The process of extracting molecular graph substructures by random walk is described in detail below with reference to FIG. 2.
In FIG. 2, nodes with a black border are nodes selected by the random walk, nodes without a black border are nodes not yet selected, and the numbers in the nodes are the atom numbers in the molecule. $c_{t-1}$ denotes the node pointed to by the random walk at time t-1, and $c_t$ denotes the node pointed to by the random walk at time t. FIG. 2(a) shows the random walk selecting node 4 at time t-1. FIG. 2(b) shows the process of backtracking to node 9; as can be seen from FIG. 2(b), no unselected node is connected to the current node 9, so to find an unselected node the walk must continue backtracking to an earlier node $c_{t-1}$ that may still have unselected neighbors. FIG. 2(c) shows $c_{t-1}$ backtracking to node 2; as can be seen from FIG. 2(c), node 1, which is connected to the current node 2, has not been selected. FIG. 2(d) shows the selection of node 1: since node 1 has not been selected, the random walk selects it, sets $c_t$ to point to node 1, and sets the border of node 1 to black.
As shown in FIG. 2, starting from the initial node, the random walk method is used to select from the molecular graph l substructures composed of non-repeated nodes, where l is less than the number of nodes of the molecular graph; the substructures selected in this way form a substructure set.
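The random walk with backtracking described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the parameter names `walk_length` and `count`, the fixed seed, and the choice to stop once `walk_length` distinct nodes are collected are all hypothetical.

```python
import random
from typing import Dict, List

Graph = Dict[int, List[int]]


def random_walk_substructure(graph: Graph, start: int, walk_length: int,
                             rng: random.Random) -> List[int]:
    """Collect up to walk_length distinct nodes, backtracking when the walk is stuck."""
    selected = [start]   # nodes in the order they were selected (no repeats)
    visited = {start}
    trail = [start]      # nodes we can backtrack through; trail[-1] is the current node
    while len(selected) < walk_length and trail:
        current = trail[-1]
        candidates = [n for n in graph[current] if n not in visited]
        if candidates:
            nxt = rng.choice(candidates)   # pick a random unselected neighbor
            visited.add(nxt)
            selected.append(nxt)
            trail.append(nxt)
        else:
            trail.pop()                    # no unselected neighbor: backtrack one step
    return selected


def sample_substructures(graph: Graph, start: int, walk_length: int,
                         count: int, seed: int = 0) -> List[List[int]]:
    """Draw `count` substructures from the same initial node."""
    rng = random.Random(seed)
    return [random_walk_substructure(graph, start, walk_length, rng)
            for _ in range(count)]


# Hypothetical 6-node molecular graph; node 2 plays the role of the initial node.
graph = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
subs = sample_substructures(graph, start=2, walk_length=4, count=5, seed=0)
```

Because a node is only ever added once, each sampled node sequence is free of repeated nodes, which corresponds to the open-loop requirement on the substructures.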
Step 4, calculating the similarity of the substructures.
Firstly, each substructure in the substructure set is encoded on the basis of its nodes to obtain the feature matrix of the substructure.
Secondly, the similarity of every pair of substructures in the substructure set is calculated using the similarity formula:
$$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$$
where $J_{m,n}$ denotes the similarity between the m-th and n-th substructures in the substructure set, $p_m$ denotes the feature matrix of the m-th substructure in the substructure set, $p_n$ denotes the feature matrix of the n-th substructure in the substructure set, $|\cdot|$ denotes the matrix modulus operation, $\cap$ denotes the intersection operation, and $\cup$ denotes the union operation.
Thirdly, $|p_m \cap p_n|$ is evaluated from the Jaccard similarity and the Hamming distance of the node sequence features of substructures $p_m$ and $p_n$, as follows. First, the node label hopping sequence of each substructure is obtained from its node sequence. For example, suppose substructure $p_m$ has the node sequence feature [1,1,2,2,3] and substructure $p_n$ has the node sequence feature [1,2,3,2,3], where the elements are node labels. From $p_m$ = [1,1,2,2,3] the hopping sequence [0,1,0,1] is obtained, and from $p_n$ = [1,2,3,2,3] the hopping sequence [1,1,1,1] is obtained, where a 1 in the hopping sequence indicates that the labels of adjacent nodes in the substructure sequence differ, and a 0 indicates that they are the same. The Jaccard similarity measure and the normalized Hamming distance measure are then applied to the node label hopping sequences of the two paths to obtain the similarity value.
Fourthly, all substructures whose similarity is greater than or equal to a threshold are stored in a similar set, and the remaining substructures are stored in a different set, where the threshold is chosen in the range (0.5, 1) according to the number of nodes in the different molecular graphs.
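The hopping sequences and the similar/different split can be sketched as follows. The hopping-sequence rule follows the worked example in the text. The rest is assumed for illustration, since the text does not specify it: taking the Jaccard index over the positions where a hop occurs, averaging it with one minus the normalized Hamming distance, and placing a substructure in the similar set when at least one of its pairs reaches the threshold.

```python
from itertools import combinations
from typing import List, Tuple


def hop_sequence(labels: List[int]) -> List[int]:
    """1 where adjacent node labels differ, 0 where they are equal."""
    return [int(a != b) for a, b in zip(labels, labels[1:])]


def hop_similarity(labels_m: List[int], labels_n: List[int]) -> float:
    """Assumed combination: mean of Jaccard index and (1 - normalized Hamming distance)."""
    hops_m, hops_n = hop_sequence(labels_m), hop_sequence(labels_n)
    if len(hops_m) != len(hops_n):
        raise ValueError("this sketch assumes node sequences of equal length")
    if not hops_m:
        return 1.0
    set_m = {i for i, h in enumerate(hops_m) if h}   # positions with a label change
    set_n = {i for i, h in enumerate(hops_n) if h}
    union = set_m | set_n
    jaccard = len(set_m & set_n) / len(union) if union else 1.0
    hamming = sum(a != b for a, b in zip(hops_m, hops_n))
    return 0.5 * (jaccard + (1.0 - hamming / len(hops_m)))


def split_by_similarity(sequences: List[List[int]],
                        threshold: float) -> Tuple[List[int], List[int]]:
    """Return (similar, different) index lists under the pairwise rule assumed above."""
    similar = set()
    for i, j in combinations(range(len(sequences)), 2):
        if hop_similarity(sequences[i], sequences[j]) >= threshold:
            similar.update((i, j))
    different = [i for i in range(len(sequences)) if i not in similar]
    return sorted(similar), different


p_m = [1, 1, 2, 2, 3]   # node label sequence from the example in the text
p_n = [1, 2, 3, 2, 3]
p_k = [1, 1, 2, 2, 1]   # hypothetical third substructure
similar, different = split_by_similarity([p_m, p_k, p_n], threshold=0.8)
```

With these toy sequences, `p_m` and `p_k` share the same hopping sequence and land in the similar set, while `p_n` is placed in the different set.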
Step 5, fusing the substructure feature matrices.
The process of fusing the substructures in the similar set is described in detail below with reference to FIG. 3.
In FIG. 3, each substructure is represented as a row of node sequences, and each row of node sequences corresponds to one feature matrix. FIG. 3(a) shows the feature matrices of three substructures. FIG. 3(b) shows the fused substructure feature matrix, which is obtained by averaging the three substructure feature matrices in FIG. 3(a).
All substructure feature matrices in the similar set are averaged to obtain the fused substructure feature matrix. A substructure can be encoded into a feature matrix according to its node features, and the plurality of substructure feature matrices in the similar set are fused by averaging into a single substructure feature matrix.
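The fusion step amounts to an element-wise mean over matrices of equal shape. A minimal sketch, assuming the feature matrices are plain nested lists with identical dimensions (the function name is hypothetical):

```python
from typing import List

Matrix = List[List[float]]


def fuse_feature_matrices(matrices: List[Matrix]) -> Matrix:
    """Element-wise average of equally shaped substructure feature matrices."""
    if not matrices:
        raise ValueError("the similar set is empty")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    for m in matrices:
        if len(m) != rows or any(len(row) != cols for row in m):
            raise ValueError("all feature matrices must have the same shape")
    count = len(matrices)
    return [[sum(m[r][c] for m in matrices) / count for c in range(cols)]
            for r in range(rows)]


# Two hypothetical one-hot feature matrices (2 nodes, 2 node labels).
sub_a = [[1, 0], [0, 1]]
sub_b = [[0, 1], [0, 1]]
fused = fuse_feature_matrices([sub_a, sub_b])
```

Entries on which the substructures agree stay at 0 or 1, while entries on which they disagree become fractional, which is how the averaging smooths out differences within the similar set.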
Step 6, training a neural network.
Firstly, the fused substructure features are input into a 4-layer multilayer perceptron neural network, the predicted molecular graph label is output, and a cross-entropy loss function is used to calculate the loss value $\lambda_1$ between the predicted molecular graph label and the corresponding true label.
Secondly, two substructures are randomly selected from the different set, the features of the two selected substructures are input into the 4-layer multilayer perceptron neural network, the predicted molecular graph labels are output, and a cross-entropy loss function is used to calculate the loss value $\lambda_2$ between the predicted molecular graph labels and the corresponding true labels.
Thirdly, the neural network loss value L is obtained according to the following formula:
$$L = p\lambda_1 + (1-p)\lambda_2$$
where p denotes a bias value, which is chosen in the range (0.8, 1) according to the number of nodes in the different molecular graphs.
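The combined loss can be sketched numerically as follows. The weighting formula and the range of p come from the text; the predicted probabilities are made up, and averaging the two cross-entropy values of the different-set substructures to form $\lambda_2$ is an assumption, since the text does not say how the two are combined.

```python
import math
from typing import List


def cross_entropy(probs: List[float], one_hot: List[int]) -> float:
    """Cross-entropy between predicted class probabilities and a one-hot label."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, one_hot))


def combined_loss(loss_fused: float, loss_different: float, p: float) -> float:
    """L = p * lambda_1 + (1 - p) * lambda_2, with p in the open interval (0.8, 1)."""
    if not 0.8 < p < 1.0:
        raise ValueError("p must lie in (0.8, 1)")
    return p * loss_fused + (1.0 - p) * loss_different


true_label = [1, 0]                                   # hypothetical two-class label
lambda_1 = cross_entropy([0.9, 0.1], true_label)      # fused-feature prediction
lambda_2 = 0.5 * (cross_entropy([0.6, 0.4], true_label)
                  + cross_entropy([0.7, 0.3], true_label))  # assumed: mean over the pair
total = combined_loss(lambda_1, lambda_2, p=0.9)
```

Since p is close to 1, the loss on the fused features dominates and the different-set loss acts as a smaller corrective term.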
Step 7, judging whether the loss value of the neural network has converged; if so, training is stopped to obtain the trained multilayer perceptron neural network and Step 8 is executed; otherwise, Step 3 is executed.
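The loop between Step 3 and Step 7 can be sketched as follows. The patent does not state a convergence criterion, so the one used here (the last few loss changes all falling below a tolerance) and the names `tol`, `patience`, `max_iters`, and `train_step` are assumptions; `train_step` stands in for one pass of Steps 3 to 6 and returns the loss L.

```python
from typing import Callable, List


def has_converged(loss_history: List[float], tol: float = 1e-4, patience: int = 3) -> bool:
    """True when the last `patience` consecutive loss changes are all below `tol`."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(a - b) < tol for a, b in zip(recent, recent[1:]))


def train_until_converged(train_step: Callable[[], float],
                          max_iters: int = 1000) -> List[float]:
    """Repeat Steps 3 to 6 until the loss converges or max_iters is reached."""
    history: List[float] = []
    for _ in range(max_iters):
        history.append(train_step())   # re-sample substructures and update the network
        if has_converged(history):
            break                      # proceed to Step 8
    return history


def make_toy_step() -> Callable[[], float]:
    """Hypothetical stand-in for a real training pass: the loss halves on every call."""
    state = {"loss": 1.0}

    def step() -> float:
        state["loss"] *= 0.5
        return state["loss"]

    return step


history = train_until_converged(make_toy_step())
```

Returning to Step 3 rather than Step 6 means that new substructures are sampled on every iteration, so the network sees different random walks over the course of training.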
Step 8, a molecular graph of the same class to be predicted is input into the trained multilayer perceptron neural network, and the molecular graph label is output to obtain the activity class corresponding to the label.
The effect of the invention is further illustrated by the following simulation experiments:
1. Simulation experiment conditions:
The hardware platform of the simulation experiment is an Intel i5-3470 CPU with a clock frequency of 3.20 GHz and 4 GB of memory.
The software platform of the simulation experiment is the Ubuntu operating system and Python 3.6.
The input molecular graph datasets used in the simulation experiment are the following four compound datasets published in 2006 by the U.S. National Cancer Institute (https://www.cancer.gov/): NCI-1, NCI-33, NCI-83, and NCI-123. NCI-1 is a balanced dataset of compounds screened for activity against non-small cell lung cancer, with two activity class labels; NCI-33 is a balanced dataset of compounds screened for activity against melanoma, with two activity class labels; NCI-83 is a balanced dataset of compounds screened for activity against breast cancer, with two activity class labels; and NCI-123 is a balanced dataset of compounds screened for activity against breast cancer, with two activity class labels.
2. Simulation content and result analysis thereof:
In the simulation experiment, the molecular graphs are classified with the invention and with three prior-art methods (the Kernel-GK method, the Deep-SP method, and the GAM method) to obtain classification results.
The three prior-art methods used in the simulation experiment are as follows:
The Kernel-GK method is the molecular graph prediction method proposed by Nino Shervashidze et al. in "Efficient graphlet kernels for large graph comparison", 2009 Conference on Artificial Intelligence and Statistics, referred to for short as the similar-substructure graph kernel method.
The Deep-SP method is the molecular graph prediction method proposed by Pinar Yanardag et al. in "Deep Graph Kernels", 2015 conference on knowledge discovery and data mining, referred to for short as the deep shortest-path graph kernel method.
The GAM method is the molecular graph prediction method proposed by J. B. Lee et al. in "Graph Classification using Structural Attention", 2018 conference on knowledge discovery and data mining, referred to for short as the attention neural network graph classification method.
The classification results of these methods are each evaluated with the average accuracy index. The classification accuracy on the datasets NCI1, NCI33, NCI83, and NCI123 is calculated using the following formula, and all results are listed in Table 1:
[Formula image BDA0002602583440000071: classification accuracy]
Table 1. Quantitative comparison of the classification results of the invention and the prior-art methods in the simulation experiment
[Table image BDA0002602583440000081]
As can be seen from Table 1, the average accuracy (AA) of the invention is higher than that of all three prior-art methods, which shows that the invention achieves higher molecular activity prediction accuracy.
The above simulation experiments show that the method can identify the different substructure features of a molecular graph by means of random walk. The open-loop requirement on the extracted substructures overcomes the defect in the prior art that the trained network cannot distinguish the differences between substructures because the extracted molecular graph substructures contain closed loops. In addition, fusing the features of similar substructures reduces the differences among substructure features, which addresses the problem in the prior art that prediction accuracy is reduced because different substructures yield different predictions. Compared with the other methods, the multilayer neural network prediction model of the invention has a short training time and generalizes well, making it a practical molecular activity prediction method.

Claims (3)

1. A molecular activity prediction method based on multi-substructure feature fusion, characterized in that a random walk method is used to extract a plurality of substructure features of a molecular graph, and the fused substructure features are input into a trained multilayer neural network to predict molecular activity, the method specifically comprising the following steps:
(1) obtaining the feature matrices corresponding to the drug molecule information:
performing byte-based one-hot encoding on the atoms in a drug molecule to obtain a one-hot encoded feature matrix, expressing the bond value pairs between the atoms of the drug molecule as a neighborhood feature matrix, and performing byte-based one-hot encoding on the activity of the drug molecule to obtain a one-hot encoded label feature matrix;
(2) selecting an initial node:
(2a) representing the atoms of the drug molecule as nodes, the chemical bonds between atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, the nodes, connecting edges, and molecular graph label together forming the molecular graph;
(2b) calculating the centrality value of each node in the molecular graph by the betweenness centrality method, and selecting the node with the highest centrality value as the initial node;
(3) extracting a plurality of substructure features of the molecular graph:
starting from the initial node, using a random walk method to select from the molecular graph l substructures composed of non-repeated nodes, where l is less than the number of nodes of the molecular graph, the substructures selected in this way forming a substructure set;
(4) calculating the similarity of the substructures:
(4a) encoding each substructure in the substructure set on the basis of its nodes to obtain the feature matrix of the substructure;
(4b) calculating the similarity of every pair of substructures in the substructure set using the similarity formula:
$$J_{m,n} = \frac{|g \cap p|}{|g \cup p|}$$
where $J_{m,n}$ denotes the similarity between the m-th and n-th substructures in the substructure set, $g$ denotes the feature matrix of the m-th substructure in the substructure set, $p$ denotes the feature matrix of the n-th substructure in the substructure set, $|\cdot|$ denotes the matrix modulus operation, $\cap$ denotes the intersection operation, and $\cup$ denotes the union operation;
(4c) storing all the substructures with the similarity greater than or equal to a threshold value into a similar set, and storing the rest substructures into a different set, wherein the threshold value is in a range of (0.5,1), and is selected according to the number of nodes in different molecular diagram classes;
(5) fusion substructure feature matrix:
averaging all the substructure feature matrixes in the similar set to obtain a fused substructure feature matrix;
(6) training a neural network:
(6a) randomly selecting two substructure features from the different set, inputting them into a 4-layer multilayer perceptron neural network, outputting predicted molecular graph labels, and calculating, by using a cross-entropy loss function, the loss value between the predicted molecular graph labels and the corresponding real molecular graph labels;
(6b) inputting the fused substructure features into the 4-layer multilayer perceptron neural network, outputting predicted molecular graph labels, and calculating, by using the cross-entropy loss function, the loss value between the predicted molecular graph labels and the corresponding real molecular graph labels;
(6c) superposing the two loss values to obtain the loss value for training the neural network;
(7) judging whether the loss value of the neural network being trained has converged; if so, stopping training to obtain the trained multilayer perceptron neural network and executing step (8); otherwise, executing step (3);
(8) inputting a molecular graph of the same kind to be predicted into the trained multilayer perceptron neural network, and outputting its molecular graph label to obtain the activity type corresponding to that label.
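Steps (4b), (4c) and (5) can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not say how the intersection and union of two feature matrices are formed, so the sketch assumes that each row of a feature matrix is one set element and that |·| is the set cardinality. It also assumes that a substructure belongs to the similar set when its similarity to at least one other substructure reaches the threshold, and that all matrices to be fused have the same shape. The function names `jaccard`, `split_by_similarity` and `fuse` are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Matrix = List[List[float]]


def jaccard(g: Matrix, p: Matrix) -> float:
    """Return |g ∩ p| / |g ∪ p|, treating each matrix row as a set element (assumption)."""
    rows_g = {tuple(row) for row in g}
    rows_p = {tuple(row) for row in p}
    union = rows_g | rows_p
    if not union:
        return 0.0  # two empty matrices: define the similarity as 0
    return len(rows_g & rows_p) / len(union)


def split_by_similarity(subs: List[Matrix], threshold: float) -> Tuple[List[int], List[int]]:
    """Return (similar, different) index lists for step (4c).

    Assumption: a substructure is 'similar' if its similarity to at least
    one other substructure is greater than or equal to the threshold.
    """
    similar = set()
    for m, n in combinations(range(len(subs)), 2):
        if jaccard(subs[m], subs[n]) >= threshold:
            similar.update((m, n))
    different = [i for i in range(len(subs)) if i not in similar]
    return sorted(similar), different


def fuse(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped feature matrices, as in step (5)."""
    count = len(mats)
    return [
        [sum(values) / count for values in zip(*rows)]  # same column across matrices
        for rows in zip(*mats)                          # same row index across matrices
    ]
```

With three 4-by-2 matrices where the first two share three of their four rows, `split_by_similarity` with a threshold of 0.6 places the first two in the similar set and the third in the different set, and `fuse` then averages the first two element by element.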
2. The molecular activity prediction method based on multiple substructure feature fusion according to claim 1, wherein the random walk method in step (3) comprises: selecting an unselected node in the neighborhood of the current node of the molecular graph, and, if no unselected node exists in the neighborhood of the current node during selection, backtracking to the previously selected node, wherein the neighborhood of a node is the set of all other nodes connected to that node in the molecular graph.
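The random walk with backtracking described in claim 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's code. The molecular graph is assumed to be given as an adjacency dictionary mapping each node to its neighbours, the walk is assumed to stop once a requested number of nodes has been selected, and the function name `random_walk_substructure` with its parameters `start`, `size` and `rng` is hypothetical.

```python
import random
from typing import Dict, List, Optional


def random_walk_substructure(
    adjacency: Dict[int, List[int]],
    start: int,
    size: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Select up to `size` nodes by a random walk that backtracks at dead ends."""
    rng = rng or random.Random()
    selected = [start]   # nodes in the order they were selected
    visited = {start}
    path = [start]       # stack of nodes used for backtracking
    while path and len(selected) < size:
        current = path[-1]
        # Unselected nodes in the neighbourhood of the current node.
        candidates = [n for n in adjacency[current] if n not in visited]
        if candidates:
            nxt = rng.choice(candidates)
            visited.add(nxt)
            selected.append(nxt)
            path.append(nxt)
        else:
            # No unselected neighbour: backtrack to the previously selected node.
            path.pop()
    return selected
```

On a star-shaped graph with centre 0 and leaves 1, 2 and 3, a walk started at leaf 1 must first move to node 0, then visit one remaining leaf, backtrack to node 0, and finally visit the last leaf. If `size` exceeds the number of reachable nodes, the walk returns every reachable node and stops when the backtracking stack is empty.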
3. The molecular activity prediction method based on multiple substructure feature fusion according to claim 1, wherein the step of superposing the two loss values in step (6c) comprises:
firstly, using the cross-entropy loss function, inputting the fused feature matrix into the neural network to obtain a loss value λ1, and inputting the substructures selected from the different set into the neural network to obtain a loss value λ2;
secondly, obtaining the neural network loss value L according to the following formula:
L = p\lambda_1 + (1 - p)\lambda_2
wherein p represents a bias value, which is selected in the range (0.8, 1) according to the number of nodes in the different molecular graphs.
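The weighted combination of the two loss values in claim 3 can be checked with a small numerical sketch. It is a hedged illustration only. It assumes that the network outputs a probability for each class and that the cross-entropy of a single sample is the negative log-probability of the true class. The function names `cross_entropy` and `combined_loss` are hypothetical, and the range check on p simply mirrors the open interval (0.8, 1) stated in the claim.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Cross-entropy of one sample: negative log-probability of the true class."""
    return -math.log(probs[true_class])


def combined_loss(lambda1: float, lambda2: float, p: float) -> float:
    """Return L = p * lambda1 + (1 - p) * lambda2.

    lambda1: loss obtained from the fused substructure features.
    lambda2: loss obtained from the substructures drawn from the different set.
    p:       bias value, required here to lie in the open interval (0.8, 1).
    """
    if not 0.8 < p < 1.0:
        raise ValueError("p must lie in the open interval (0.8, 1)")
    return p * lambda1 + (1.0 - p) * lambda2
```

For example, with λ1 = 1.0, λ2 = 2.0 and p = 0.9, the combined loss is 0.9 × 1.0 + 0.1 × 2.0 = 1.1, so the loss from the fused features dominates the training signal.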
CN202010729533.9A 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion Active CN111916143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010729533.9A CN111916143B (en) 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010729533.9A CN111916143B (en) 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion

Publications (2)

Publication Number Publication Date
CN111916143A (en) 2020-11-10
CN111916143B (en) 2023-07-28

Family

ID=73281083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010729533.9A Active CN111916143B (en) 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion

Country Status (1)

Country Link
CN (1) CN111916143B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999012118A1 (en) * 1997-09-03 1999-03-11 Commonwealth Scientific And Industrial Research Organisation Compound screening system
CN106874688A (en) * 2017-03-01 2017-06-20 中国药科大学 Intelligent lead compound based on convolutional neural networks finds method
WO2018220368A1 (en) * 2017-05-30 2018-12-06 Gtn Ltd Tensor network machine learning system
CN109033738A (en) * 2018-07-09 2018-12-18 湖南大学 A kind of pharmaceutical activity prediction technique based on deep learning
RU2721190C1 (en) * 2018-12-25 2020-05-18 Общество с ограниченной ответственностью "Аби Продакшн" Training neural networks using loss functions reflecting relationships between neighbouring tokens
CN111428848A (en) * 2019-09-05 2020-07-17 中国海洋大学 Molecular intelligent design method based on self-encoder and 3-order graph convolution
CN111429977A (en) * 2019-09-05 2020-07-17 中国海洋大学 Novel molecular similarity search algorithm based on graph structure attention

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
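Zhang Xiaochi (张小驰); Yu Hua (于华); Gong Xiujun (宫秀军): "An iterative weighted subgraph query algorithm based on random walk", Journal of Computer Research and Development (计算机研究与发展), no. 12 *
Pan Yonghao (潘永昊); Yu Hongtao (于洪涛); Liu Shuxin (刘树新): "Link prediction algorithm based on neural networks", Chinese Journal of Network and Information Security (网络与信息安全学报), no. 07 *
Qin Qifeng (秦琦枫); Zeng Bin (曾斌); Liu Siying (刘思莹): "Research on the application of deep neural networks in chemistry", Jiangxi Chemical Industry (江西化工), no. 03 *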

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420131A (en) * 2020-11-20 2021-02-26 中国科学技术大学 Molecular generation method based on data mining
CN112420131B (en) * 2020-11-20 2022-07-15 中国科学技术大学 Molecular generation method based on data mining
WO2022222492A1 (en) * 2021-04-23 2022-10-27 中国科学院深圳先进技术研究院 Prediction method and device for drug molecular feature attribute

Also Published As

Publication number Publication date
CN111916143B (en) 2023-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant