CN111916143B - Molecular activity prediction method based on multi-substructural feature fusion - Google Patents

Molecular activity prediction method based on multi-substructural feature fusion

Info

Publication number
CN111916143B
Authority
CN
China
Prior art keywords
molecular
substructure
substructures
neural network
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010729533.9A
Other languages
Chinese (zh)
Other versions
CN111916143A (en)
Inventor
丁静怡
宋健
焦李成
吴建设
成若晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010729533.9A
Publication of CN111916143A
Application granted
Publication of CN111916143B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a molecular activity prediction method based on multi-substructure feature fusion. The method trains a neural network on substructure features extracted from a molecular graph, and addresses three problems of the prior art: closed loops in the extracted substructures, poor network prediction accuracy, and difficult computation. The method comprises the following steps: 1) converting the drug molecule information into molecular feature matrices; 2) selecting an initial node; 3) obtaining a plurality of substructures; 4) calculating the similarity of the substructures; 5) fusing the substructure feature matrices; 6) training a neural network; 7) judging whether the neural network being trained has converged; 8) obtaining the activity of the molecule to be predicted. The invention can distinguish the differences between different substructures, mitigates the problem of noise in the molecular graph, and predicts molecular activity with high accuracy.

Description

Molecular activity prediction method based on multi-substructural feature fusion
Technical Field
The invention belongs to the technical field of biology, and further relates to a molecular activity prediction method based on multi-substructure feature fusion. Using the substructure information of drug molecules of one class and their corresponding biological activity, the invention can predict the effect on biological activity of drug molecules of the same class whose activity is unknown.
Background
Molecular activity prediction trains a neural network model on the molecular structure information of a class of drugs and their corresponding effect on biological activity; the trained model then predicts the effect on biological activity of drugs of the same class from their structure information alone. Such a model can screen, from a large pool of similar molecules, the drug compound molecules best suited for biological activity analysis in a biological laboratory. To convert drug molecules into information a computer can process, the molecular structure information is converted into a molecular graph, and the effect of the drug molecule on biological activity is quantified as a molecular graph label. Molecular activity prediction can simplify the drug development process, reduce the safety hazards of biological experiments, and save experimental cost. Current molecular activity prediction techniques are, however, challenged by noise in the node labels of molecular graphs.
In the paper "Deep Graph Kernels" (Conference on Knowledge Discovery and Data Mining, 2015), Pinar Yanardag proposes a method that predicts molecular activity by comparing the similarity of substructures among the molecular graphs of one class. The method divides the molecular graphs of one class into a training set and a test set, divides each training-set molecular graph into a plurality of substructures, and trains a neural network model on the training-set substructures and labels. Finally, the model obtains the labels of the test-set molecular graphs from the similarity between the test-set and training-set molecular graph substructures. The drawback of this method is that a molecular graph is divided randomly into several different substructures, and different substructures can predict different results, which reduces the accuracy of the predicted molecular graph labels.
In the paper "Graph classification using structural attention" (Conference on Knowledge Discovery and Data Mining, 2018), Lee proposes a molecular activity prediction method based on an attention neural network. The method divides the molecular graphs of one class into a training set and a test set, searches for molecular graph substructures in the training set using an attention mechanism, and trains an LSTM model on the substructures and the molecular graph labels. Finally, the model searches for the substructures of the test-set molecular graphs using the attention mechanism and predicts their labels. The drawback of this method is that the substructures it finds contain closed loops, so the trained network cannot distinguish the differences between substructures.
Disclosure of Invention
The invention aims to overcome the above drawbacks of the existing molecular activity prediction techniques. It provides a molecular activity prediction method based on multi-substructure feature fusion that addresses the difficulty of extracting structural features from a noisy molecular graph and the resulting poor prediction accuracy.
The idea for achieving this aim is as follows: based on the structural features and node label features specific to molecular substructures, a set of molecular substructures is extracted by random walk, which enriches the substructure feature information; the fused features of part of the substructures are then input into a constructed neural network to train the model, so that molecular activity is predicted more accurately and more quickly.
The specific implementation of the invention includes the following steps:
(1) Obtaining the feature matrices corresponding to the drug molecule information:
applying byte-based one-hot encoding to the atoms in a drug molecule to obtain a one-hot encoded feature matrix, representing the bond pairs between the atoms of the drug molecule as an adjacency feature matrix, and applying byte-based one-hot encoding to the activity of the drug molecule to obtain a one-hot encoded label feature matrix;
(2) Selecting an initial node:
(2a) Representing the atoms of the drug molecule as nodes, the chemical bonds between the atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, the nodes, connecting edges and molecular graph label together forming the molecular graph;
(2b) Calculating the centrality value of each node in the molecular graph using the betweenness method, and selecting the node with the highest centrality value as the initial node;
(3) Extracting a plurality of substructure features of the molecular graph:
starting from the initial node, using a random walk method to select l non-repeated nodes from the molecular graph to form a substructure, where l is smaller than the number of nodes of the molecular graph, and selecting a set of substructures by the same method;
(4) Calculating the similarity of the substructures:
(4a) Encoding each substructure in the substructure set based on its nodes to obtain the feature matrix of the substructure;
(4b) Calculating the similarity of every two substructures in the substructure set using the following similarity formula:
$$J_{m,n} = \frac{|g \cap p|}{|g \cup p|}$$
wherein $J_{m,n}$ represents the similarity of the m-th substructure and the n-th substructure in the substructure set, g represents the feature matrix of the m-th substructure in the substructure set, p represents the feature matrix of the n-th substructure in the substructure set, |·| represents the matrix modulus operation, ∩ represents the intersection operation, and ∪ represents the union operation;
(4c) Storing all substructures whose similarity is greater than or equal to a threshold in a similar set, and storing the remaining substructures in a dissimilar set, wherein the threshold lies in the range (0.5, 1) and is selected according to the number of nodes in the different molecule classes;
(5) Fusing the substructure feature matrices:
averaging all substructure feature matrices in the similar set to obtain a fused substructure feature matrix;
(6) Training a neural network:
(6a) Inputting the fused substructure features into a 4-layer multi-layer perceptron neural network, outputting a predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using a cross-entropy loss function;
(6b) Arbitrarily selecting two substructure features from different sets, inputting them into the 4-layer multi-layer perceptron neural network, outputting a predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using a cross-entropy loss function;
(6c) Combining the two loss values to obtain the loss value of the neural network being trained;
(7) Judging whether the loss value of the neural network being trained has converged; if so, stopping training to obtain the trained multi-layer perceptron neural network and executing step (8); otherwise, returning to step (3);
(8) Inputting a molecular graph of the same class whose activity is to be predicted into the trained multi-layer perceptron neural network, and outputting a molecular graph label to obtain the activity type corresponding to that label.
Compared with the prior art, the invention has the following advantages:
First, the invention averages all substructure feature matrices in the similar set to obtain a fused substructure feature matrix, and inputs the fused matrix into the network being trained to obtain the molecular graph label. This overcomes the problem in the prior art that a molecular graph is divided randomly into several different substructures which may predict different results and thereby reduce the accuracy of the predicted molecular graph label. The invention therefore extracts molecular substructure features well and improves the accuracy of the predicted molecular graph label.
Second, the invention uses a random walk method to select substructures without closed loops from the molecular graph, and divides them into a similar substructure set and a dissimilar substructure set to train the network. This overcomes the problem in the prior art that the substructures found in the molecular graph contain closed loops, so that the trained network cannot distinguish the differences between substructures. The invention can therefore distinguish the differences between different substructures.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the process of extracting a substructure of a molecular graph by random walk according to the present invention;
FIG. 3 is a schematic diagram of the fusion of the substructure features of a similar set according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1.
Step 1, obtaining the feature matrices corresponding to the drug molecule information.
Byte-based one-hot encoding is applied to the atoms in a drug molecule to obtain a one-hot encoded feature matrix; the bond pairs between the atoms of the drug molecule are represented as an adjacency feature matrix; and byte-based one-hot encoding is applied to the activity of the drug molecule to obtain a one-hot encoded label feature matrix.
Step 2, selecting an initial node.
The atoms of the drug molecule are represented as nodes, the atomic properties as node labels, the chemical bonds between the atoms as connecting edges, and the activity of the drug molecule as the molecular graph label. The nodes, connecting edges and molecular graph label together form the molecular graph.
The centrality value of each node in the molecular graph is calculated using the betweenness method, and the node with the highest centrality value is selected as the initial node.
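Step 1 can be illustrated with a minimal sketch. The toy molecule, the element vocabulary, the two-class label set and the helper names `one_hot` and `adjacency_matrix` are all assumptions made for illustration; the patent does not specify them, and the sketch encodes element symbols directly rather than bytes.

```python
def one_hot(labels, vocabulary):
    """Return one one-hot row per label, with columns ordered as in vocabulary."""
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    rows = []
    for label in labels:
        row = [0] * len(vocabulary)
        row[index[label]] = 1
        rows.append(row)
    return rows


def adjacency_matrix(num_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from (i, j) bonded atom pairs."""
    matrix = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Hypothetical toy molecule: heavy atoms C-C-O (hydrogens omitted).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
atom_vocabulary = ["C", "N", "O"]           # assumed element vocabulary
activity_classes = ["inactive", "active"]   # assumed two-class labels

node_features = one_hot(atoms, atom_vocabulary)
adjacency = adjacency_matrix(len(atoms), bonds)
label_vector = one_hot(["active"], activity_classes)[0]
```

The three results correspond to the one-hot feature matrix, the adjacency feature matrix and the one-hot label of the step above.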
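The choice of the initial node can be sketched as follows. The sketch assumes that the "betweenness method" means standard shortest-path betweenness centrality on an unweighted, undirected graph, computed here with Brandes' algorithm; the example graph and the function names are hypothetical.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph.

    adjacency maps each node to a list of its neighbours.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from source.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    path_count[neighbour] += path_count[node]
                    predecessors[neighbour].append(node)
        # Accumulate pair dependencies from the farthest nodes inwards.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Every undirected pair was counted once from each endpoint.
    return {node: value / 2.0 for node, value in centrality.items()}


def select_initial_node(adjacency):
    """Return the node with the highest betweenness (ties: first encountered)."""
    scores = betweenness_centrality(adjacency)
    return max(scores, key=scores.get)


# Hypothetical 5-atom branched skeleton with bonds 0-1, 1-2, 2-3, 2-4.
molecule = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
initial_node = select_initial_node(molecule)
```

In this example the branching atom 2 lies on the most shortest paths and is therefore selected.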
Step 3, extracting a plurality of substructure features of the molecular graph.
The process of extracting a molecular substructure by random walk is described in detail below with reference to FIG. 2.
In FIG. 2, nodes with a black border are nodes selected by the random walk, and nodes without a black border are unselected nodes. The number inside a node is the number of the atom in the molecule. $c_{t-1}$ denotes the node the random walk points to at time t-1, and $c_t$ denotes the node it points to at time t. FIG. 2(a) shows the random walk selecting node 4 at time t-1. FIG. 2(b) shows the walk backtracking to node 9: none of the nodes adjacent to the current node 9 is unselected, so to find an unselected node, $c_{t-1}$ must continue backtracking to earlier nodes that may still have unselected neighbours. FIG. 2(c) shows $c_{t-1}$ backtracking to node 2, where node 1, adjacent to the current node 2, has not been selected. FIG. 2(d) shows the selection of node 1: because node 1 has not been selected, the random walk selects it, sets $c_t$ to point to node 1, and sets its border to black.
As shown in FIG. 2, starting from the initial node, a random walk method is used to select l non-repeated nodes from the molecular graph to form a substructure, where l is smaller than the number of nodes of the molecular graph, and a set of substructures is selected by the same method.
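The random walk with backtracking described for FIG. 2 can be sketched as below. This is an illustrative reading, not the patent's implementation: the function names, the seeded `random.Random` generator, the example graph and the choice of three walks are assumptions.

```python
import random


def random_walk_substructure(adjacency, start, length, rng):
    """Select `length` distinct nodes by a random walk with backtracking.

    adjacency maps each node to a list of neighbours; rng is a random.Random.
    """
    if not 0 < length < len(adjacency):
        raise ValueError("length must be positive and below the node count")
    selected = [start]   # nodes in the order they were picked
    visited = {start}
    trail = [start]      # current walk path, used for backtracking
    while len(selected) < length and trail:
        current = trail[-1]
        candidates = [n for n in adjacency[current] if n not in visited]
        if candidates:
            chosen = rng.choice(candidates)
            visited.add(chosen)
            selected.append(chosen)
            trail.append(chosen)
        else:
            trail.pop()  # no unselected neighbour: step back one node
    return selected


def sample_substructures(adjacency, start, length, count, seed=0):
    """Repeat the walk `count` times from the same initial node."""
    rng = random.Random(seed)
    return [random_walk_substructure(adjacency, start, length, rng)
            for _ in range(count)]


# Hypothetical 5-atom branched skeleton with bonds 0-1, 1-2, 2-3, 2-4.
molecule = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
substructures = sample_substructures(molecule, start=2, length=4, count=3)
```

Because each node enters the substructure only once, through the edge by which it was reached, the selected nodes and those edges form a tree, which is one way to obtain substructures free of closed loops.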
Step 4, calculating the similarity of the substructures.
First, each substructure in the substructure set is encoded based on its nodes to obtain the feature matrix of the substructure.
Second, the similarity of every two substructures in the substructure set is calculated using the following similarity formula:
$$J_{m,n} = \frac{|p_m \cap p_n|}{|p_m \cup p_n|}$$
wherein $J_{m,n}$ represents the similarity of the m-th substructure and the n-th substructure in the substructure set, $p_m$ represents the feature matrix of the m-th substructure in the substructure set, $p_n$ represents the feature matrix of the n-th substructure in the substructure set, |·| represents the matrix modulus operation, ∩ represents the intersection operation, and ∪ represents the union operation.
Third, $|p_m \cap p_n|$ represents the Jaccard similarity and Hamming distance of the node sequence features of substructures $p_m$ and $p_n$. The specific process is as follows. First, the node-label hopping sequence of each substructure is obtained from the substructure. For example, substructure $p_m$ has the node sequence feature [1,2,3] and $p_n$ has the node sequence feature [1,2,3,2,3], where each element is a node label. From [1,2,3] the hopping sequence [0,1,0,1] is obtained, and from $p_n$ = [1,2,3,2,3] the hopping sequence [1,1] is obtained, where a 1 in the hopping sequence indicates that the label changes between adjacent nodes in the substructure sequence and a 0 indicates no change. Then the Jaccard similarity measure and the Hamming distance measure are applied, with normalization, to the node-label hopping sequences of the two paths to obtain the similarity value.
Fourth, all substructures whose similarity is greater than or equal to the threshold are stored in the similar set, and the remaining substructures are stored in the dissimilar set, wherein the threshold lies in the range (0.5, 1) and is selected according to the number of nodes in the different molecule classes.
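The similarity test and the split into two sets can be sketched as follows. Several simplifications are assumed: each substructure is treated as a set of node identifiers, only the Jaccard ratio is computed (the Hamming-distance step on hopping sequences is not reproduced), and each substructure is compared with one reference substructure, a rule the source does not spell out. The names `jaccard_similarity` and `split_by_similarity` are hypothetical.

```python
def jaccard_similarity(features_m, features_n):
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_m, set_n = set(features_m), set(features_n)
    union = set_m | set_n
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_m & set_n) / len(union)


def split_by_similarity(substructures, threshold, reference_index=0):
    """Split substructures by their similarity to one reference substructure."""
    if not 0.5 < threshold < 1.0:
        raise ValueError("threshold is expected in the open interval (0.5, 1)")
    reference = substructures[reference_index]
    similar, dissimilar = [], []
    for sub in substructures:
        if jaccard_similarity(reference, sub) >= threshold:
            similar.append(sub)
        else:
            dissimilar.append(sub)
    return similar, dissimilar


# Hypothetical substructures given as lists of node identifiers.
subs = [[2, 1, 0, 3], [2, 3, 1, 0], [2, 4, 3, 1]]
similar_set, dissimilar_set = split_by_similarity(subs, threshold=0.7)
```

Here the first two substructures contain the same nodes (similarity 1), while the third shares three of five nodes with the reference and falls below the assumed threshold of 0.7.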
Step 5, fusing the substructure feature matrices.
The process of fusing the substructures in a similar set is described in detail below with reference to FIG. 3.
In FIG. 3, each substructure is represented as one row of node sequences, and each such row corresponds to one feature matrix. FIG. 3(a) shows the feature matrices of three substructures. FIG. 3(b) shows the fused substructure feature matrix, obtained by averaging the three feature matrices of FIG. 3(a).
All substructure feature matrices in the similar set are averaged to obtain one fused substructure feature matrix. Each substructure is encoded as a feature matrix according to its node features, and the several feature matrices in the similar set are merged into a single feature matrix by averaging.
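The fusion step amounts to an element-wise mean, sketched below under the assumption that all feature matrices in the similar set share one shape. The function name and the two small example matrices are hypothetical.

```python
def fuse_feature_matrices(matrices):
    """Return the element-wise mean of equally shaped matrices (lists of rows)."""
    if not matrices:
        raise ValueError("need at least one feature matrix")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    for matrix in matrices:
        if len(matrix) != rows or any(len(row) != cols for row in matrix):
            raise ValueError("all feature matrices must share one shape")
    count = len(matrices)
    return [[sum(m[r][c] for m in matrices) / count for c in range(cols)]
            for r in range(rows)]


# Two hypothetical one-hot feature matrices from a similar set.
first = [[1, 0], [0, 1]]
second = [[0, 1], [0, 1]]
fused = fuse_feature_matrices([first, second])
```

Rows on which the substructures agree stay one-hot, while rows on which they differ become fractional, so the fused matrix reflects how often each feature occurs across the similar set.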
Step 6, training the neural network.
First, the fused substructure features are input into a 4-layer multi-layer perceptron neural network, a predicted molecular graph label is output, and the loss value $\lambda_1$ between the predicted label and the corresponding true label is calculated using a cross-entropy loss function.
Second, two substructures are selected arbitrarily from different sets, their features are input into the 4-layer multi-layer perceptron neural network, a predicted molecular graph label is output, and the loss value $\lambda_2$ between the predicted label and the corresponding true label is calculated using a cross-entropy loss function.
Third, the neural network loss value L is obtained according to the following formula:
$$L = p\lambda_1 + (1-p)\lambda_2$$
where p is a bias value selected from the range (0.8, 1) according to the number of nodes in the different classes of molecules.
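The weighted combination of the two losses can be sketched as follows. The predicted probability vectors stand in for the outputs of the 4-layer multi-layer perceptron, which is not modelled here; the probability values, the value p = 0.9 and the function names are assumptions made for illustration.

```python
import math


def cross_entropy(predicted_probs, true_one_hot):
    """Cross entropy between a probability vector and a one-hot label."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(max(q, eps))
                for q, t in zip(predicted_probs, true_one_hot))


def combined_loss(loss_fused, loss_pair, p):
    """Return L = p * lambda_1 + (1 - p) * lambda_2 for p in (0.8, 1)."""
    if not 0.8 < p < 1.0:
        raise ValueError("p is expected in the open interval (0.8, 1)")
    return p * loss_fused + (1.0 - p) * loss_pair


true_label = [0, 1]                                # one-hot "active" label
lambda_1 = cross_entropy([0.2, 0.8], true_label)   # fused-feature branch
lambda_2 = cross_entropy([0.4, 0.6], true_label)   # cross-set pair branch
total = combined_loss(lambda_1, lambda_2, p=0.9)
```

With p close to 1 the fused-feature loss dominates, and the loss from the pair drawn from different sets acts as a smaller corrective term.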
Step 7, judging whether the loss value of the neural network has converged; if so, training stops and the trained multi-layer perceptron neural network is obtained, and step 8 is executed; otherwise, the method returns to step 3.
Step 8, inputting a molecular graph of the same class whose activity is to be predicted into the trained multi-layer perceptron neural network, outputting a molecular graph label, and obtaining the activity type corresponding to that label.
The effects of the present invention are further described below in conjunction with simulation experiments:
1. Simulation experiment conditions:
The hardware platform of the simulation experiment is an Intel i5 3470 CPU with a main frequency of 3.20 GHz and 4 GB of memory.
The software platform of the simulation experiment is the Ubuntu operating system and Python 3.6.
The input molecular graph datasets used in the simulation experiments are four molecular compound datasets, NCI-1, NCI-33, NCI-83 and NCI-123, published by the National Cancer Institute in 2006 on its website, https://www.cancer.gov/. NCI-1 is the balanced dataset of compounds screened for activity against non-small cell lung cancer, NCI-33 is the balanced dataset of compounds screened for activity against melanoma, NCI-83 is the balanced dataset of compounds screened for activity against breast cancer, and NCI-123 is the balanced dataset of compounds screened for activity against breast cancer. Each dataset has two activity class labels.
2. Simulation content and result analysis:
In the simulation experiment, the input molecular graphs are classified using the invention and three prior-art methods (the Kernel-GK method, the Deep-SP method and the GAM method), and the classification results are obtained.
The three prior-art methods used in the simulation experiments are:
The Kernel-GK method is the molecular graph prediction method proposed by Nino Shervashidze et al. in "Efficient graphlet kernels for large graph comparison", Conference on Artificial Intelligence and Statistics, 2009, referred to for short as the similar-substructure graph kernel method.
The Deep-SP method is the molecular graph prediction method proposed by Pinar Yanardag et al. in "Deep Graph Kernels", Conference on Knowledge Discovery and Data Mining, 2015, referred to for short as the deep shortest-path graph kernel method.
The GAM method is the molecular graph prediction method proposed by J. B. Lee et al. in "Graph classification using structural attention", Conference on Knowledge Discovery and Data Mining, 2018, referred to for short as the attention neural network graph classification method.
The classification results of the four methods (the invention and the three prior-art methods) are evaluated using the average accuracy evaluation index. The classification accuracy on the datasets NCI1, NCI33, NCI83 and NCI123 was calculated using the following formula, and all results are listed in Table 1:
TABLE 1 Quantitative analysis of the classification results of the invention and the prior-art methods in the simulation experiments
As can be seen from Table 1, the classification accuracy (AA index) of the invention is higher than that of the three prior-art methods, showing that the invention achieves higher molecular activity prediction accuracy.
The simulation experiments show that the method can identify the different substructure features of a molecular graph by means of random walk. Because the extracted substructures are required to be free of closed loops, the method overcomes the drawback of the prior art that the extracted molecular graph substructures contain closed loops, so that the trained network cannot distinguish the differences between substructures. In addition, fusing the features of similar substructures reduces the differences between the features of different substructures, which overcomes the problem in the prior art that different substructures predict different results and lower the accuracy of molecular activity prediction. Compared with the other methods, the multi-layer neural network prediction model of the invention has a short training time and generalizes well, making it a practical molecular activity prediction method.

Claims (3)

1. A molecular activity prediction method based on multi-substructure feature fusion, characterized in that a random walk method is used to extract a plurality of substructure features of a molecular graph, and the fused substructure features are input into a trained multi-layer neural network to predict molecular activity, the method specifically comprising the following steps:
(1) Obtaining the feature matrices corresponding to the drug molecule information:
applying byte-based one-hot encoding to the atoms in a drug molecule to obtain a one-hot encoded feature matrix, representing the bond pairs between the atoms of the drug molecule as an adjacency feature matrix, and applying byte-based one-hot encoding to the activity of the drug molecule to obtain a one-hot encoded label feature matrix;
(2) Selecting an initial node:
(2a) Representing the atoms of the drug molecule as nodes, the chemical bonds between the atoms as connecting edges, and the activity of the drug molecule as the molecular graph label, the nodes, connecting edges and molecular graph label together forming the molecular graph;
(2b) Calculating the centrality value of each node in the molecular graph using the betweenness method, and selecting the node with the highest centrality value as the initial node;
(3) Extracting a plurality of substructure features of the molecular graph:
starting from the initial node, using a random walk method to select l non-repeated nodes from the molecular graph to form a substructure, where l is smaller than the number of nodes of the molecular graph, and selecting a set of substructures by the same method;
(4) Calculating the similarity of the substructures:
(4a) Encoding each substructure in the substructure set based on its nodes to obtain the feature matrix of the substructure;
(4b) Calculating the similarity of every two substructures in the substructure set using the following similarity formula:
$$J_{m,n} = \frac{|g \cap p|}{|g \cup p|}$$
wherein $J_{m,n}$ represents the similarity of the m-th substructure and the n-th substructure in the substructure set, g represents the feature matrix of the m-th substructure in the substructure set, p represents the feature matrix of the n-th substructure in the substructure set, |·| represents the matrix modulus operation, ∩ represents the intersection operation, and ∪ represents the union operation;
(4c) Storing all substructures whose similarity is greater than or equal to a threshold in a similar set, and storing the remaining substructures in a dissimilar set, wherein the threshold lies in the range (0.5, 1) and is selected according to the number of nodes in the different molecule classes;
(5) Fusing the substructure feature matrices:
averaging all substructure feature matrices in the similar set to obtain a fused substructure feature matrix;
(6) Training a neural network:
(6a) Arbitrarily selecting two substructure features from different sets, inputting them into a 4-layer multi-layer perceptron neural network, outputting a predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using a cross-entropy loss function;
(6b) Inputting the fused substructure features into the 4-layer multi-layer perceptron neural network, outputting a predicted molecular graph label, and calculating the loss value between the predicted molecular graph label and the corresponding true molecular graph label using a cross-entropy loss function;
(6c) Combining the two loss values to obtain the loss value of the neural network being trained;
(7) Judging whether the loss value of the neural network being trained has converged; if so, stopping training to obtain the trained multi-layer perceptron neural network and executing step (8); otherwise, returning to step (3);
(8) Inputting a molecular graph of the same class whose activity is to be predicted into the trained multi-layer perceptron neural network, and outputting a molecular graph label to obtain the activity type corresponding to that label.
2. The method for predicting molecular activity based on fusion of multiple substructures according to claim 1, wherein the step of the random walk method in step (3) is as follows: and selecting unselected nodes in the node neighborhood of the molecular graph by using a random walk method, and backtracking to the previously selected nodes if the unselected nodes do not exist in the current node neighborhood in the selection process, wherein the node neighborhood represents all other node sets connected with the node in the molecular graph.
3. The molecular activity prediction method based on multi-substructural feature fusion according to claim 1, wherein the step of superposing the two loss values in step (6c) is:
firstly, inputting the fused feature matrix into the neural network and obtaining a loss value $\lambda_1$ by using a cross-entropy loss function, and inputting the substructures selected from the different sets into the neural network to obtain a loss value $\lambda_2$;
secondly, obtaining the neural network loss value $L$ according to the following formula:
$L = p\lambda_1 + (1 - p)\lambda_2$
where $p$ represents a bias value selected in the range (0.8, 1) according to the number of nodes in the different sub-classes of molecules.
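The weighted superposition in claim 3 can be illustrated with a short Python sketch. It assumes that the perceptron outputs class probabilities (for instance after a softmax), that each loss is the cross-entropy for a single sample, and that the range (0.8, 1) is an open interval; `cross_entropy` and `combined_loss` are hypothetical helper names.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy of one sample given predicted class probabilities."""
    # A small floor avoids log(0) when the true class gets zero probability.
    return -math.log(max(probs[label], 1e-12))


def combined_loss(
    fused_probs: Sequence[float],
    pair_probs: Sequence[float],
    label: int,
    p: float,
) -> float:
    """L = p * lambda_1 + (1 - p) * lambda_2, as in the formula of claim 3."""
    # Assumption: the claimed range (0.8, 1) is treated as an open interval.
    if not 0.8 < p < 1.0:
        raise ValueError("p must lie in the open interval (0.8, 1)")
    lambda_1 = cross_entropy(fused_probs, label)  # loss on fused features
    lambda_2 = cross_entropy(pair_probs, label)   # loss on selected substructures
    return p * lambda_1 + (1.0 - p) * lambda_2
```

Since p is greater than 0.8, the loss on the fused features dominates the total, and the loss on the substructures selected from the different sets acts as a smaller correction term.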
CN202010729533.9A 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion Active CN111916143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010729533.9A CN111916143B (en) 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010729533.9A CN111916143B (en) 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion

Publications (2)

Publication Number Publication Date
CN111916143A (en) 2020-11-10
CN111916143B (en) 2023-07-28

Family

ID=73281083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010729533.9A Active CN111916143B (en) 2020-07-27 2020-07-27 Molecular activity prediction method based on multi-substructural feature fusion

Country Status (1)

Country Link
CN (1) CN111916143B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420131B (en) * 2020-11-20 2022-07-15 中国科学技术大学 Molecular generation method based on data mining
CN115240781A (en) * 2021-04-23 2022-10-25 中国科学院深圳先进技术研究院 Prediction method and prediction device for drug molecular characteristic attributes

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999012118A1 (en) * 1997-09-03 1999-03-11 Commonwealth Scientific And Industrial Research Organisation Compound screening system
CN106874688A (en) * 2017-03-01 2017-06-20 中国药科大学 Intelligent lead compound based on convolutional neural networks finds method
WO2018220368A1 (en) * 2017-05-30 2018-12-06 Gtn Ltd Tensor network machine learning system
CN109033738A (en) * 2018-07-09 2018-12-18 湖南大学 A kind of pharmaceutical activity prediction technique based on deep learning
RU2721190C1 (en) * 2018-12-25 2020-05-18 Общество с ограниченной ответственностью "Аби Продакшн" Training neural networks using loss functions reflecting relationships between neighbouring tokens
CN111429977A (en) * 2019-09-05 2020-07-17 中国海洋大学 Novel molecular similarity search algorithm based on graph structure attention
CN111428848A (en) * 2019-09-05 2020-07-17 中国海洋大学 Molecular intelligent design method based on self-encoder and 3-order graph convolution

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
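An iterative weighted subgraph query algorithm based on random walk; 张小驰; 于华; 宫秀军; Journal of Computer Research and Development (No. 12); full text *
Link prediction algorithm based on neural networks; 潘永昊; 于洪涛; 刘树新; Chinese Journal of Network and Information Security (No. 07); full text *
Research on the application of deep neural networks in chemistry; 秦琦枫; 曾斌; 刘思莹; Jiangxi Chemical Industry (No. 03); full text *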

Also Published As

Publication number Publication date
CN111916143A (en) 2020-11-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant