CN112185480A - Graph feature extraction, lipid-water distribution coefficient prediction method and graph feature extraction model - Google Patents


Info

Publication number
CN112185480A
Authority
CN
China
Prior art keywords
layer
graph
feature
water distribution
lipid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011159909.3A
Other languages
Chinese (zh)
Other versions
CN112185480B (en)
Inventor
周文彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wangshi Intelligent Technology Co ltd
Original Assignee
Beijing Wangshi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wangshi Intelligent Technology Co ltd
Priority to CN202011159909.3A
Publication of CN112185480A
Application granted
Publication of CN112185480B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a graph feature extraction method, a lipid-water distribution coefficient prediction method and a graph feature extraction model. The graph feature extraction method comprises: acquiring a feature graph to be extracted, which consists of a plurality of nodes and edges connecting the nodes that have association relations; inputting the feature graph to be extracted into a graph feature extraction model to obtain the features of each node, wherein the graph feature extraction model comprises a plurality of convolution layers and GRU network layers arranged alternately, and the features of associated nodes are fused by the GRU network layer; and inputting the features of each node output by the last convolution layer into a merging layer for feature fusion to obtain the features of the feature graph to be extracted. Because the GRU network layer fuses the feature information of associated nodes during each convolution operation, the network has better expressive power and is better suited to expressing the interactions between nodes in the graph, and the workload of up-front feature extraction is reduced.

Description

Graph feature extraction, lipid-water distribution coefficient prediction method and graph feature extraction model
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a graph feature extraction and lipid-water distribution coefficient prediction method and a graph feature extraction model.
Background
The lipid-water distribution coefficient logP, the base-10 logarithm of the ratio of a substance's concentration in octanol to its concentration in water, is an important reference in drug design because it influences how a drug is absorbed in the body. This index can be measured by simple experiments, but experimentally measuring a large number of candidate small molecules is unrealistic in the early virtual screening stage of drug design, so pharmacology experts often use software-computed logP values for coarse screening.
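As a small worked illustration of the definition above, the following sketch computes logP from two concentrations. The function name `log_p` and the concentration values are hypothetical and chosen only to make the arithmetic easy to follow; concentrations are assumed to be in the same units.

```python
import math


def log_p(conc_octanol: float, conc_water: float) -> float:
    """Return logP = log10(concentration in octanol / concentration in water)."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("both concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical compound that is 100 times more concentrated in octanol.
lipophilic = log_p(conc_octanol=5.0, conc_water=0.05)
# Hypothetical compound that is 10 times more concentrated in water.
hydrophilic = log_p(conc_octanol=0.2, conc_water=2.0)
```

A positive value indicates a preference for the octanol phase and a negative value a preference for the water phase.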
In the related art, machine learning models are typically used to predict the logP of small molecules, but feature extraction requires a great deal of preliminary work as well as deep professional knowledge and experience, so the workload is heavy. A method for extracting small-molecule features is therefore needed that learns reasonable and sufficient characterization information of the small molecule, expresses the features of the molecule more accurately, and reduces the workload.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect of the prior art that feature extraction requires a large amount of preliminary work and a heavy workload. To this end, a graph feature extraction method, a lipid-water distribution coefficient prediction method and a graph feature extraction model are provided.
According to a first aspect, an embodiment of the present invention discloses a graph feature extraction method, including: acquiring a feature graph to be extracted, wherein the feature graph to be extracted consists of a plurality of nodes and edges connecting the nodes with incidence relations; inputting the feature graph to be extracted into a graph feature extraction model for feature extraction to obtain the feature of each node, wherein the graph feature extraction model comprises a plurality of convolution layers and a GRU network layer, the convolution layers and the GRU network layer are arranged at intervals, and feature fusion with nodes in an incidence relation is carried out through the GRU network layer; and inputting the characteristics of each node output by the last convolution layer into a merging layer for characteristic fusion to obtain the characteristics of the characteristic diagram to be extracted.
Optionally, performing feature fusion of associated nodes through the GRU network layer includes:
$$\mathrm{around}_w = \sum_{v \in N(w)} NN_{e(w,v)}(v)$$
$$w' = \mathrm{GRU}(w, \mathrm{around}_w)$$
where $\mathrm{around}_w$ represents the total influence on node w of all other nodes v connected to node w in the graph; $N(w)$ is the set of those connected nodes; $NN_{e(w,v)}$ is an MLP neural network selected by the type $e(w,v)$ of the edge between w and v, with different edge types corresponding to different network parameters; and $w'$ represents the updated feature vector of node w.
Optionally, inputting the features of each node output by the last convolution layer into the merging layer for feature fusion includes: inputting the features of each node output by the last convolution layer into the merging layer, mapping the features of each node to a simulated fingerprint through the merging layer, and then performing feature fusion.
Optionally, mapping the features of each node to the simulated fingerprint through the merging layer and then performing feature fusion includes:
[Formula images BDA0002742794390000023 and BDA0002742794390000024 are not reproduced in this text.]
where $w^{(n)}$ is the output feature vector of node w after the n-th graph convolution layer; dim indicates that every node feature vector is mapped into a dim-dimensional vector space; the symbol in formula image BDA0002742794390000025 denotes the mapped value of the n-th layer convolution output feature vector of node w; and softBitMap denotes the output of the merging layer.
According to a second aspect, the embodiment of the present invention further discloses a method for predicting a lipid-water distribution coefficient, comprising the following steps: carrying out feature extraction on the biological small molecules by using a graph feature extraction method according to the first aspect or any optional embodiment of the first aspect; and predicting the lipid-water distribution coefficient of the extracted biological micromolecule characteristics by using a pre-trained lipid-water distribution coefficient prediction model.
Optionally, the lipid-water distribution coefficient prediction model is obtained by training in the following way: obtaining first lipid-water distribution coefficient training data and training data associated with the first lipid-water distribution coefficient, the associated training data comprising at least one of solubility, melting point, dissociation coefficient, and lipid-water distribution coefficient measured at pH 7.4; and inputting the first lipid-water distribution coefficient training data and its associated training data into a machine learning model for pre-training to obtain the lipid-water distribution coefficient prediction model.
Optionally, the method further comprises: acquiring second lipid-water distribution coefficient training data and training data associated with the second lipid-water distribution coefficient, wherein the second lipid-water distribution coefficient training data is of the same type as the first lipid-water distribution coefficient training data, and the second training data and its associated training data are more accurate than the first training data and its associated training data; and inputting the second lipid-water distribution coefficient training data and its associated training data into the lipid-water distribution coefficient prediction model for training to obtain a target lipid-water distribution coefficient prediction model.
According to a third aspect, an embodiment of the present invention further discloses a graph feature extraction model, including: an input layer for acquiring a feature graph to be extracted, wherein the feature graph to be extracted consists of a plurality of nodes and edges connecting the nodes that have association relations; a plurality of convolution layers and GRU network layers arranged alternately, wherein each convolution layer inputs the features it extracts for associated nodes in the feature graph into the GRU network layer for feature fusion and passes the result to the next convolution layer, and this step is repeated until the last convolution layer; and a merging layer for performing feature fusion on the features of each node output by the last convolution layer and outputting the feature fusion result through an output layer.
According to a fourth aspect, an embodiment of the present invention further discloses a computer device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the at least one processor to perform the steps of the graph feature extraction method according to the first aspect or any one of the alternative embodiments of the first aspect or the steps of the lipid water partition coefficient prediction method according to the second aspect or any one of the alternative embodiments of the second aspect.
According to a fifth aspect, the embodiments of the present invention further disclose a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the graph feature extraction method according to the first aspect or any one of the optional embodiments of the first aspect, or the steps of the lipid-water distribution coefficient prediction method according to the second aspect or any one of the optional embodiments of the second aspect.
The technical scheme of the invention has the following advantages:
1. The graph feature extraction method provided by the invention comprises: acquiring a feature graph to be extracted, which consists of a plurality of nodes and edges connecting the nodes that have association relations; inputting the feature graph to be extracted into a graph feature extraction model to obtain the features of each node, wherein the graph feature extraction model comprises a plurality of convolution layers and GRU network layers arranged alternately, and the features of associated nodes are fused by the GRU network layer; and inputting the features of each node output by the last convolution layer into a merging layer for feature fusion to obtain the features of the feature graph to be extracted. The embodiment of the invention uses the GRU network layer to fuse the feature information of associated nodes during each convolution operation. Because the GRU is sensitive to different input information to different degrees, its internal gating sub-networks retain useful information and discard useless information, so the network has better expressive power and is better suited to expressing the interactions between nodes in the graph, which reduces the workload of up-front feature extraction and improves efficiency.
2. The lipid-water distribution coefficient prediction method provided by the invention is characterized in that a graph characteristic extraction method is used for extracting the characteristics of the biological micromolecules, and a pre-trained lipid-water distribution coefficient prediction model is used for predicting the lipid-water distribution coefficient of the extracted biological micromolecules. According to the embodiment of the invention, the characteristic extraction is carried out on the biological micromolecules by using the graph characteristic extraction method, so that the reasonable and sufficient characterization information of the micromolecules is learned, the characteristics of the molecules are more accurately expressed, and the lipid-water distribution coefficient prediction is carried out on the extracted biological micromolecule characteristics by using the pre-trained lipid-water distribution coefficient prediction model, so that the lipid-water distribution coefficient prediction is more accurate, and the workload is reduced.
3. The invention provides a graph feature extraction model comprising: an input layer for acquiring a feature graph to be extracted, which consists of a plurality of nodes and edges connecting the nodes that have association relations; a plurality of convolution layers and GRU network layers arranged alternately, wherein each convolution layer inputs the features it extracts for associated nodes in the feature graph into the GRU network layer for feature fusion and passes the result to the next convolution layer, and this step is repeated until the last convolution layer; and a merging layer for performing feature fusion on the features of each node output by the last convolution layer and outputting the feature fusion result through an output layer. By adding the GRU network layer and using it to fuse the feature information of associated nodes during the convolution operation, and because the GRU is sensitive to different input information to different degrees, its internal gating sub-networks retain useful information and discard useless information, so the network has better expressive power and is better suited to expressing the interactions between nodes in the graph.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a specific example of a graph feature extraction method in an embodiment of the present invention;
FIG. 2 is a flow chart of a specific example of a method for predicting a lipid-water distribution coefficient according to an embodiment of the present invention;
FIG. 3 is a diagram of an exemplary training method of a lipid-water distribution coefficient prediction model according to an embodiment of the present invention;
FIG. 4 is a diagram of another example of a training method of a lipid-water distribution coefficient prediction model according to an embodiment of the present invention;
FIG. 5 is a predicted scatter plot of XlogP3 and logP in an embodiment of the present invention;
FIG. 6 is a diagram illustrating an exemplary graph feature extraction model according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram showing a specific example of the feature extraction device in the embodiment of the present invention;
FIG. 8 is a schematic block diagram of a specific example of a fat water distribution coefficient prediction apparatus according to an embodiment of the present invention;
FIG. 9 is a diagram showing a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; the two elements may be directly connected or indirectly connected through an intermediate medium, or may be communicated with each other inside the two elements, or may be wirelessly connected or wired connected. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
An embodiment of the invention discloses a graph feature extraction method, which comprises the following steps:
S11: Acquire a feature graph to be extracted, wherein the feature graph to be extracted consists of a plurality of nodes and edges connecting the nodes that have association relations.
The feature graph to be extracted is not particularly limited in the embodiment of the present invention and may be chosen by a person skilled in the art according to the actual situation. The feature graph to be extracted is composed of a plurality of nodes and edges connecting the nodes that have association relations. For example, in a molecular graph the nodes are the atoms constituting the molecule, and the attribute information of an atom includes its formal charge, partial charge, hybridization type (e.g., sp2, sp3) and the like; the edges are the covalent bonds between atoms, including single, double, triple and pi bonds.
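One way to hold such a molecular graph in memory is sketched below. The class names `Atom` and `MolecularGraph`, their fields, and the bond-type strings are illustrative assumptions, not structures defined by the source; partial charges are left at a placeholder value because the source does not say how they are obtained.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """Node of the molecular graph with the attributes named in the text."""
    symbol: str
    formal_charge: int = 0
    partial_charge: float = 0.0  # placeholder; a real pipeline would compute this
    hybridization: str = "sp3"


@dataclass
class MolecularGraph:
    """Hypothetical container: atoms are nodes, typed covalent bonds are edges."""
    atoms: list = field(default_factory=list)
    bonds: dict = field(default_factory=dict)  # (i, j) with i < j -> bond type

    def add_atom(self, atom: Atom) -> int:
        self.atoms.append(atom)
        return len(self.atoms) - 1

    def add_bond(self, i: int, j: int, bond_type: str) -> None:
        self.bonds[(min(i, j), max(i, j))] = bond_type

    def neighbors(self, i: int) -> list:
        """Return (neighbor index, bond type) pairs for atom i, sorted by index."""
        found = []
        for (a, b), bond_type in self.bonds.items():
            if a == i:
                found.append((b, bond_type))
            elif b == i:
                found.append((a, bond_type))
        return sorted(found)


# Heavy-atom graph of acetaldehyde, CH3-CH=O (hydrogens left implicit).
mol = MolecularGraph()
c_methyl = mol.add_atom(Atom("C", hybridization="sp3"))
c_carbonyl = mol.add_atom(Atom("C", hybridization="sp2"))
oxygen = mol.add_atom(Atom("O", hybridization="sp2"))
mol.add_bond(c_methyl, c_carbonyl, "single")
mol.add_bond(c_carbonyl, oxygen, "double")
```

Keeping the bond type on each edge matters for the later convolution step, where different edge types use different network parameters.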
The feature graph to be extracted may be acquired by manual upload to a server or by retrieval from a database; the acquisition method is not particularly limited in the embodiment of the present invention and may be chosen by a person skilled in the art according to the actual situation.
S12: Input the feature graph to be extracted into a graph feature extraction model for feature extraction to obtain the features of each node, wherein the graph feature extraction model comprises a plurality of convolution layers and GRU network layers arranged alternately, and the features of associated nodes are fused by the GRU network layer.
Illustratively, the GRU network layer is a special neural network whose internal gating sub-networks retain the important information in the input and pass it to the output feature vector while discarding useless information. The GRU network layer therefore accumulates the information of a target node's neighbor nodes with reference to the target node's own information, and then fuses it with the feature information of the target node.
The specific fusion method is as follows: any node w is updated using the other nodes v directly connected to it. First, formula (1) computes the summary value around_w over all neighbor nodes v of w, which is the total influence of all neighbors of w on w. Then formula (2) updates the value of w itself. The update uses the GRU, the basic unit of a recurrent neural network, so that its gates decide what to update and what to forget.
$$\mathrm{around}_w = \sum_{v \in N(w)} NN_{e(w,v)}(v) \qquad (1)$$
$$w' = \mathrm{GRU}(w, \mathrm{around}_w) \qquad (2)$$
where $\mathrm{around}_w$ represents the total influence on node w of all other nodes v connected to node w in the graph; $N(w)$ is the set of those connected nodes; $NN_{e(w,v)}$ is an MLP neural network selected by the type $e(w,v)$ of the edge between w and v, with different edge types corresponding to different network parameters. For example, in a molecular graph, if the bond between w and v is a single bond, the network is $NN_{single\_bond}$; if it is a double bond, the network is $NN_{double\_bond}$. $w'$ represents the updated feature vector of node w after the convolution operation.
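Second-layer convolution operation:
$$\mathrm{around}_{w'} = \sum_{v \in N(w)} NN_{e(w,v)}(v') \qquad (3)$$
$$w'' = \mathrm{GRU}(w', \mathrm{around}_{w'}) \qquad (4)$$
where $N(w)$ is the set of nodes connected to node w, $NN_{e(w,v)}$ is the MLP for the type of edge between w and v, and $v'$ is the feature vector of neighbor v after the first convolution; $\mathrm{around}_{w'}$ represents the total influence on node w of all other nodes v connected to node w in the graph; $w''$ represents the feature vector of node w after the second convolution update.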
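Third-layer convolution operation:
$$\mathrm{around}_{w''} = \sum_{v \in N(w)} NN_{e(w,v)}(v'') \qquad (5)$$
$$w''' = \mathrm{GRU}(w'', \mathrm{around}_{w''}) \qquad (6)$$
where $N(w)$ is the set of nodes connected to node w, $NN_{e(w,v)}$ is the MLP for the type of edge between w and v, and $v''$ is the feature vector of neighbor v after the second convolution; $\mathrm{around}_{w''}$ represents the total influence on node w of all other nodes v connected to node w in the graph; $w'''$ represents the feature vector of node w after the third convolution update.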
And so on, to obtain the feature information of the 4th through n-th convolution layers.
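The layer-by-layer update described above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the source does not state: the MLP has one ReLU hidden layer, the GRU cell uses the common update-gate and reset-gate form without bias terms, weights are random and untrained, and the same parameters are reused across the three layers. The names `EdgeMLP`, `GRUCell` and `conv_layer` are hypothetical.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class EdgeMLP:
    """MLP with one ReLU hidden layer; one instance per edge type."""

    def __init__(self, dim, rng):
        self.w1 = rand_matrix(dim, dim, rng)
        self.w2 = rand_matrix(dim, dim, rng)

    def __call__(self, vec):
        hidden = [max(0.0, h) for h in matvec(self.w1, vec)]
        return matvec(self.w2, hidden)


class GRUCell:
    """GRU cell without bias terms: state h is the node, input x is around_w."""

    def __init__(self, dim, rng):
        self.wz, self.uz = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)
        self.wr, self.ur = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)
        self.wh, self.uh = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)

    def __call__(self, h, x):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.wz, x), matvec(self.uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.wr, x), matvec(self.ur, h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.wh, x), matvec(self.uh, gated))]
        # The update gate z blends the old state with the candidate state.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def conv_layer(features, edges, edge_mlps, gru):
    """One convolution: sum edge-typed neighbor messages, then GRU update."""
    dim = len(next(iter(features.values())))
    around = {w: [0.0] * dim for w in features}
    for u, v, edge_type in edges:  # undirected edge: messages flow both ways
        mlp = edge_mlps[edge_type]
        around[u] = [a + m for a, m in zip(around[u], mlp(features[v]))]
        around[v] = [a + m for a, m in zip(around[v], mlp(features[u]))]
    return {w: gru(features[w], around[w]) for w in features}


rng = random.Random(0)
dim = 4
edge_mlps = {"single": EdgeMLP(dim, rng), "double": EdgeMLP(dim, rng)}
gru = GRUCell(dim, rng)
features = {w: [rng.random() for _ in range(dim)] for w in range(3)}
edges = [(0, 1, "single"), (1, 2, "double")]
for _ in range(3):  # three stacked layers (parameters shared in this sketch)
    features = conv_layer(features, edges, edge_mlps, gru)
```

All nodes are updated from the previous layer's features, so the order in which edges are visited does not affect the result.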
S13: Input the features of each node output by the last convolution layer into the merging layer for feature fusion to obtain the features of the feature graph to be extracted.
For example, the features of each node output by the last convolution layer may be fused in the merging layer in either of two ways: the features of all nodes in the graph may be input to the merging layer and summed directly, or the features of each node may be input to the merging layer, mapped to a simulated fingerprint, and then fused.
The graph feature extraction method provided by the invention comprises the steps of obtaining a feature graph to be extracted, wherein the feature graph to be extracted consists of a plurality of nodes and edges connecting the nodes with incidence relations; inputting a feature graph to be extracted into a graph feature extraction model for feature extraction to obtain the feature of each node, wherein the graph feature extraction model comprises a plurality of convolution layers and a GRU network layer, the convolution layers and the GRU network layer are arranged at intervals, and feature fusion of nodes with incidence relation is carried out through the GRU network layer; and inputting the characteristics of each node output by the last convolution layer into the merging layer for characteristic fusion to obtain the characteristics of the characteristic diagram to be extracted. The embodiment of the invention uses the GRU network layer to fuse the characteristic information of the nodes with the incidence relation during each convolution operation, and because the GRU has different degrees of sensitivity to different input information, the GRU network layer can keep some information through a plurality of internal gating sub-networks and discard some useless information, so that the network expression capability is better, the GRU network layer is more suitable for expressing the interaction between the nodes in the graph, the early-stage extraction workload is reduced, and the efficiency is improved.
As an optional implementation manner of the embodiment of the present invention, the step S13 includes:
inputting the features of each node output by the last convolution layer into the merging layer, mapping the features of each node to a simulated fingerprint through the merging layer, and then performing feature fusion.
Exemplarily, this proceeds as follows: the features of each node output by the last convolution layer are input into the merging layer, which maps all node information into an x-dimensional bitmap to construct an x-dimensional molecular fingerprint, so that the feature information is mapped to different positions of the bitmap, and the features of the nodes are then fused. In an embodiment of the present invention, the bitmap may be 2048-dimensional. Compared with directly summing the feature information of all nodes to obtain the features of the feature graph to be extracted, this approach has better feature expression capability, avoids feature information being confused or masking one another, and better expresses the local and global features of the feature graph to be extracted.
Specifically, the features of the feature graph to be extracted may be computed by the following formulas:
[Formula images BDA0002742794390000101 and BDA0002742794390000102 are not reproduced in this text.]
where $w^{(n)}$ is the output feature vector of node w after the n-th graph convolution layer; dim indicates that every node feature vector is mapped into a dim-dimensional vector space; the symbol in formula image BDA0002742794390000103 denotes the mapped value of the n-th layer convolution output feature vector of node w; and softBitMap denotes the output of the merging layer.
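The two merging options described above (direct summation, and mapping each node to positions of a fingerprint-like bitmap before fusing) can be contrasted with the sketch below. The exact mapping used by the source is given only in formula images, so the softmax-based "soft index" used here is an assumption borrowed from the general style of learned molecular fingerprints and may differ from the source's formula. The names `sum_readout`, `soft_bitmap_readout` and `projection` are hypothetical, and the projection weights are random rather than trained.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sum_readout(node_features):
    """Baseline: add all node feature vectors element-wise."""
    return [sum(column) for column in zip(*node_features.values())]


def soft_bitmap_readout(node_features, projection):
    """Each node writes a softmax distribution over `dim` positions; sum them."""
    dim = len(projection)
    bitmap = [0.0] * dim
    for vec in node_features.values():
        scores = [sum(p * x for p, x in zip(row, vec)) for row in projection]
        for position, weight in enumerate(softmax(scores)):
            bitmap[position] += weight
    return bitmap


rng = random.Random(0)
feat_dim, dim = 4, 8  # the text mentions 2048 dimensions; 8 keeps the example small
node_features = {w: [rng.uniform(-1, 1) for _ in range(feat_dim)] for w in range(3)}
projection = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(dim)]

graph_sum = sum_readout(node_features)
graph_bitmap = soft_bitmap_readout(node_features, projection)
```

In this sketch each node contributes a total weight of 1 spread over the bitmap positions, so different nodes can write to different positions instead of being added into the same coordinates.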
An embodiment of the invention also discloses a method for predicting the lipid-water distribution coefficient which, as shown in FIG. 2, comprises the following steps:
S21: Perform feature extraction on the biological small molecule by using the graph feature extraction method described in the above embodiment.
Illustratively, the graph of the biological small molecule is input directly into the graph feature extraction model, and the features of the biological small molecule are extracted by the graph feature extraction method described in the above embodiment.
S22: and predicting the lipid-water distribution coefficient of the extracted biological micromolecule characteristics by using a pre-trained lipid-water distribution coefficient prediction model.
Illustratively, the extracted biological small molecule characteristics are directly input into a pre-trained lipid water distribution coefficient prediction model to obtain a lipid water distribution coefficient prediction value. The fat-water distribution coefficient prediction model can use the existing fat-water distribution coefficient prediction model and can also be trained in advance according to requirements.
As shown in fig. 3, the training method of the lipid-water distribution coefficient prediction model can be obtained by training with a method of pre-training and fine-tuning, and under the condition of initializing model parameters, a large amount of software calculation data is firstly used for pre-training, and then more accurate experiment and patent data are used for fine-tuning.
The lipid-water distribution coefficient prediction method provided by the invention is characterized in that a graph characteristic extraction method is used for extracting the characteristics of the biological micromolecules, and a pre-trained lipid-water distribution coefficient prediction model is used for predicting the lipid-water distribution coefficient of the extracted biological micromolecules. According to the embodiment of the invention, the characteristic extraction is carried out on the biological micromolecules by using the graph characteristic extraction method, so that the reasonable and sufficient characterization information of the micromolecules is learned, the characteristics of the molecules are more accurately expressed, and the lipid-water distribution coefficient prediction is carried out on the extracted biological micromolecule characteristics by using the pre-trained lipid-water distribution coefficient prediction model, so that the lipid-water distribution coefficient prediction is more accurate, and the workload is reduced.
As an optional implementation manner of the embodiment of the present invention, the lipid-water distribution coefficient prediction model is obtained by training in the following manner:
obtaining first lipid-water distribution coefficient training data and training data related to the first lipid-water distribution coefficient, the training data related to the first lipid-water distribution coefficient comprising at least one of: solubility (logS), melting point (mp), dissociation coefficient (pKa), and lipid-water distribution coefficient measured at pH 7.4 (logD).
Illustratively, the first lipid-water distribution coefficient training data and the training data related to the first lipid-water distribution coefficient may be obtained from a chemical database, where the chemical database contains 1.8 million small-molecule compounds, and the lipid-water distribution coefficient training data and the related training data in the chemical database are calculated by software. The training data related to the first lipid-water distribution coefficient comprises at least one of: solubility, melting point, dissociation coefficient, and lipid-water distribution coefficient measured at pH 7.4.
inputting the first lipid-water distribution coefficient training data and the training data related to the first lipid-water distribution coefficient into a machine learning model for pre-training to obtain the lipid-water distribution coefficient prediction model.
Illustratively, as shown in FIG. 4, the first lipid-water distribution coefficient training data and the training data related to the first lipid-water distribution coefficient are both input into the machine learning model for strongly supervised or weakly supervised multi-task pre-training to obtain the lipid-water distribution coefficient prediction model, and the final loss is the weighted sum of the losses of the individual terms. Specifically, different parameters may be given different weights according to their correlation with the lipid-water distribution coefficient, for example: lipid-water distribution coefficient measured at pH 7.4: 1; dissociation coefficient: 0.1; solubility: 1; melting point: 0.1. The embodiment of the present invention does not specifically limit the weights, and those skilled in the art can set them according to the actual situation.
The embodiment of the invention adopts multi-task learning to learn several indexes related to the lipid-water distribution coefficient simultaneously, which enhances the robustness of the model through the synergy among the data and mitigates the data sparsity problem in this field.
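The weighted multi-task loss described above can be sketched as follows. This is a minimal illustration, not the training code of the embodiment: the mean squared error as the per-task loss, the task keys, the weight of 1.0 for the main lipid-water distribution coefficient task (`"logP"`), and the rule of skipping tasks without labels are all assumptions; only the weights for logD, pKa, solubility, and melting point follow the example in the text.

```python
# Example task weights; the "logP" entry is an assumed weight for the main task.
TASK_WEIGHTS = {
    "logP": 1.0,  # assumption: main lipid-water distribution coefficient task
    "logD": 1.0,  # lipid-water distribution coefficient measured at pH 7.4
    "pKa": 0.1,   # dissociation coefficient
    "logS": 1.0,  # solubility
    "mp": 0.1,    # melting point
}


def mean_squared_error(predictions, targets):
    """Average squared difference between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def multitask_loss(predictions, targets, weights=TASK_WEIGHTS):
    """Weighted sum of per-task losses; tasks without labels are skipped."""
    total = 0.0
    for task, weight in weights.items():
        if task not in targets:
            continue  # no label available for this task in the batch
        total += weight * mean_squared_error(predictions[task], targets[task])
    return total


predictions = {"logP": [1.0, 2.0], "pKa": [4.0]}
targets = {"logP": [1.0, 4.0], "pKa": [5.0]}
loss = multitask_loss(predictions, targets)  # 1.0 * 2.0 + 0.1 * 1.0 = 2.1
```

Giving the weakly correlated tasks a small weight lets them regularize the shared features without dominating the gradient of the main task.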
As an optional implementation manner of the embodiment of the present invention, the method for predicting a lipid-water distribution coefficient further includes:
acquiring second lipid-water distribution coefficient training data and training data related to the second lipid-water distribution coefficient, where the training data related to the second lipid-water distribution coefficient is of the same types as the training data related to the first lipid-water distribution coefficient, and the accuracy of the second lipid-water distribution coefficient training data and its related training data is higher than that of the first lipid-water distribution coefficient training data and its related training data.
Illustratively, the second lipid-water distribution coefficient training data and the training data related to the second lipid-water distribution coefficient may be obtained from chemical databases and patent documents, and are experimental data.
inputting the second lipid-water distribution coefficient training data and the training data related to the second lipid-water distribution coefficient into the lipid-water distribution coefficient prediction model for training to obtain a target lipid-water distribution coefficient prediction model. The specific training method is described above in connection with the lipid-water distribution coefficient prediction model and is not repeated here.
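The two-stage training described above (pre-training on plentiful calculated data, then fine-tuning on scarcer but more accurate data) can be sketched with a toy model. Everything here is an assumption made for illustration: a two-parameter linear model stands in for the prediction model, plain gradient descent stands in for the optimizer, and both data sets are synthetic, with the "calculated" labels carrying a constant bias of 0.5.

```python
def mse(params, data):
    """Mean squared error of the linear model y = a * x + b on (x, y) pairs."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, learning_rate, epochs):
    """Full-batch gradient descent starting from the given parameters."""
    a, b = params
    n = len(data)
    for _ in range(epochs):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in data) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Synthetic stand-ins: many biased "calculated" labels, few accurate "experimental" ones.
calculated = [(i / 10, 2 * (i / 10) + 1.5) for i in range(50)]
experimental = [(x, 2 * x + 1.0) for x in (0.5, 1.5, 2.5, 3.5)]

pretrained = train((0.0, 0.0), calculated, learning_rate=0.05, epochs=500)
fine_tuned = train(pretrained, experimental, learning_rate=0.05, epochs=200)

error_before = mse(pretrained, experimental)
error_after = mse(fine_tuned, experimental)
```

In this toy setting the pre-trained parameters already capture the slope, so fine-tuning mainly corrects the bias inherited from the calculated labels.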
To test the effect of the method, it was compared with XLOGP3, which is currently widely used and performs well. As shown in FIG. 5, the same 1800 drug-like small molecules were input into both models; the Pearson correlation coefficient obtained by the scheme of the embodiment of the invention is 0.83, whereas that of XLOGP3 is 0.76, indicating that the lipid-water distribution coefficient prediction of the invention is more accurate.
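For reference, the Pearson correlation coefficient used in this comparison can be computed as in the following sketch. The function is the standard definition; the sample values are made up for illustration and are not the evaluation data of the embodiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


measured = [1.2, 2.5, 0.3, 3.1]   # made-up reference values
predicted = [1.0, 2.9, 0.5, 2.8]  # made-up model outputs
r = pearson(measured, predicted)
```

A value closer to 1 means the predictions follow the measured values more closely in a linear sense, which is the basis on which 0.83 is read as better than 0.76.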
The embodiment of the invention also discloses a graph feature extraction model, as shown in fig. 6, comprising:
an input layer 31, configured to acquire a feature graph to be extracted, where the feature graph to be extracted consists of a plurality of nodes and edges connecting nodes that have an association relationship; the specific implementation is described in step S11 of the above embodiment and is not repeated here.
a plurality of convolutional layers 32 and a GRU network layer 33, arranged alternately, where each convolutional layer 32 inputs the extracted features of the associated nodes in the feature graph to be extracted into the GRU network layer 33 for feature fusion, and the fused features are input into the next convolutional layer 32; this step is repeated until the last convolutional layer 32; the specific implementation is described in step S12 of the above embodiment and is not repeated here.
a merging layer 34, configured to perform feature fusion on the features of each node output by the last convolutional layer 32 and to output the feature fusion result through an output layer 35; the specific implementation is described in step S13 of the above embodiment and is not repeated here.
The graph feature extraction model provided by the invention comprises: an input layer for acquiring a feature graph to be extracted; a plurality of convolutional layers and a GRU network layer arranged alternately, where each convolutional layer inputs the extracted features of the associated nodes in the feature graph to be extracted into the GRU network layer for feature fusion and the fused features are input into the next convolutional layer, this step being repeated until the last convolutional layer; and a merging layer for performing feature fusion on the features of each node output by the last convolutional layer and outputting the feature fusion result through the output layer. By adding the GRU network layer and using it to fuse the feature information of associated nodes at each convolution operation, the embodiment of the invention obtains better network expression capability: because the GRU is sensitive to different input information to different degrees, it can retain some information and discard useless information through its internal gating sub-networks, which makes it better suited to expressing the interactions between nodes in the graph.
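One round of the GRU-based fusion described above, following the relation w′ = GRU(w, around_w) stated in claim 2, can be sketched as follows. This is a minimal sketch under stated assumptions, not the model of the embodiment: around_w is assumed to be the sum of neighbor features passed through a network selected by edge type, a single tanh layer stands in for each per-edge-type MLP, the GRU cell is the standard formulation without bias terms, and all weights are random placeholders. The names `GRUCell`, `fuse_neighbors`, and `edge_nets` are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(size, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(size)]


class GRUCell:
    """Standard GRU update without bias terms; weights are random placeholders."""

    def __init__(self, size, rng):
        self.w_z, self.u_z = rand_matrix(size, rng), rand_matrix(size, rng)
        self.w_r, self.u_r = rand_matrix(size, rng), rand_matrix(size, rng)
        self.w_h, self.u_h = rand_matrix(size, rng), rand_matrix(size, rng)

    def __call__(self, state, inputs):
        # Update gate z and reset gate r decide what to keep from the old state.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w_z, inputs), matvec(self.u_z, state))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w_r, inputs), matvec(self.u_r, state))]
        gated = [ri * si for ri, si in zip(r, state)]
        candidate = [math.tanh(a + b)
                     for a, b in zip(matvec(self.w_h, inputs), matvec(self.u_h, gated))]
        return [(1 - zi) * si + zi * ci for zi, si, ci in zip(z, state, candidate)]


def fuse_neighbors(features, edges, edge_nets, gru):
    """Compute around_w for every node, then update each node as GRU(w, around_w)."""
    size = len(next(iter(features.values())))
    around = {node: [0.0] * size for node in features}
    for u, v, edge_type in edges:
        for source, target in ((u, v), (v, u)):  # edges are treated as undirected
            message = [math.tanh(x) for x in matvec(edge_nets[edge_type], features[source])]
            around[target] = [a + m for a, m in zip(around[target], message)]
    return {node: gru(features[node], around[node]) for node in features}


rng = random.Random(0)
size = 4
features = {"C1": [0.1, 0.2, 0.3, 0.4], "C2": [0.4, 0.3, 0.2, 0.1], "O": [0.9, -0.5, 0.0, 0.2]}
edges = [("C1", "C2", "single"), ("C2", "O", "double")]
edge_nets = {"single": rand_matrix(size, rng), "double": rand_matrix(size, rng)}
gru = GRUCell(size, rng)
updated = fuse_neighbors(features, edges, edge_nets, gru)
```

Calling `fuse_neighbors` once per convolutional layer would mirror the alternating arrangement of convolution and GRU fusion; the gates let each node keep part of its own state while absorbing only part of the neighborhood signal.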
The embodiment of the invention discloses a graph feature extraction device, as shown in fig. 7, comprising:
an obtaining module 41, configured to obtain a feature graph to be extracted, where the feature graph to be extracted consists of a plurality of nodes and edges connecting nodes that have an association relationship; the specific implementation is described in step S11 of the above embodiment and is not repeated here.
a first extraction module 42, configured to input the feature graph to be extracted into a graph feature extraction model for feature extraction to obtain the features of each node, where the graph feature extraction model includes a plurality of convolutional layers and a GRU network layer, the convolutional layers and the GRU network layer are arranged alternately, and feature fusion of associated nodes is performed through the GRU network layer; the specific implementation is described in step S12 of the above embodiment and is not repeated here.
a fusion module 43, configured to input the features of each node output by the last convolutional layer into the merging layer for feature fusion to obtain the features of the feature graph to be extracted; the specific implementation is described in step S13 of the above embodiment and is not repeated here.
The graph feature extraction device provided by the invention obtains a feature graph to be extracted, where the feature graph to be extracted consists of a plurality of nodes and edges connecting nodes that have an association relationship; inputs the feature graph to be extracted into a graph feature extraction model for feature extraction to obtain the features of each node, where the graph feature extraction model includes a plurality of convolutional layers and a GRU network layer arranged alternately and feature fusion of associated nodes is performed through the GRU network layer; and inputs the features of each node output by the last convolutional layer into the merging layer for feature fusion to obtain the features of the feature graph to be extracted. The embodiment of the invention uses the GRU network layer to fuse the feature information of associated nodes at each convolution operation. Because the GRU is sensitive to different input information to different degrees, the GRU network layer can retain some information and discard useless information through its internal gating sub-networks, so that the network expression capability is better and the interactions between nodes in the graph are expressed more suitably.
The embodiment of the invention also discloses a fat-water distribution coefficient prediction device, as shown in fig. 8, comprising:
a second extraction module 51, configured to perform feature extraction on a biological small molecule by using the graph feature extraction method described above; the specific implementation is described in step S21 of the above embodiment and is not repeated here.
a prediction module 52, configured to predict the lipid-water distribution coefficient from the extracted features of the biological small molecule by using a pre-trained lipid-water distribution coefficient prediction model; the specific implementation is described in step S22 of the above embodiment and is not repeated here.
The lipid-water distribution coefficient prediction device provided by the invention extracts the features of a biological small molecule by using the graph feature extraction method, and predicts the lipid-water distribution coefficient from the extracted features by using a pre-trained lipid-water distribution coefficient prediction model. Because the graph feature extraction method learns reasonable and sufficient characterization information of the small molecule, the features of the molecule are expressed more accurately; predicting the lipid-water distribution coefficient from these features with the pre-trained model therefore makes the prediction more accurate and reduces the workload.
An embodiment of the present invention further provides a computer device. As shown in FIG. 9, the computer device may include a processor 61 and a memory 62, where the processor 61 and the memory 62 may be connected by a bus or in another manner; FIG. 9 takes connection by a bus as an example.
The processor 61 may be a Central Processing Unit (CPU). The processor 61 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory 62, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the graph feature extraction method or the lipid-water distribution coefficient prediction method in the embodiments of the present invention (for example, the obtaining module 41, the first extraction module 42, and the fusion module 43 shown in FIG. 7, or the second extraction module 51 and the prediction module 52 shown in FIG. 8). The processor 61 performs various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory 62, that is, it implements the graph feature extraction method or the lipid-water distribution coefficient prediction method of the above method embodiments.
The memory 62 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 61, and the like. Further, the memory 62 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 62 may optionally include memory located remotely from the processor 61, and these remote memories may be connected to the processor 61 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 62 and, when executed by the processor 61, perform a graph feature extraction method as in the embodiment of fig. 1 or a lipid water distribution coefficient prediction method as in the embodiment of fig. 2.
The details of the computer device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 2, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method for extracting a graph feature, comprising:
acquiring a feature graph to be extracted, wherein the feature graph to be extracted consists of a plurality of nodes and edges connecting nodes that have an association relationship;
inputting the feature graph to be extracted into a graph feature extraction model for feature extraction to obtain the features of each node, wherein the graph feature extraction model comprises a plurality of convolutional layers and a GRU network layer, the convolutional layers and the GRU network layer are arranged alternately, and feature fusion of associated nodes is performed through the GRU network layer;
and inputting the features of each node output by the last convolutional layer into a merging layer for feature fusion to obtain the features of the feature graph to be extracted.
2. The graph feature extraction method according to claim 1, wherein performing feature fusion of associated nodes through the GRU network layer comprises:
Figure FDA0002742794380000011
w′ = GRU(w, around_w)
wherein around_w represents the total influence on node w of all other nodes v connected to node w in the graph feature extraction model; the function shown in Figure FDA0002742794380000012 is an MLP neural network, with different types of edges corresponding to different network parameters; and w′ represents the updated feature vector of node w.
3. The graph feature extraction method according to claim 1, wherein inputting the features of each node output by the last convolutional layer into a merging layer for feature fusion comprises:
inputting the features of each node output by the last convolutional layer into the merging layer, and mapping the features of each node into a simulated fingerprint through the merging layer for feature fusion.
4. The graph feature extraction method according to claim 3, wherein mapping the features of each node into a simulated fingerprint through the merging layer for feature fusion comprises:
Figure FDA0002742794380000021
Figure FDA0002742794380000022
wherein w^(n) is the output feature vector of node w after the n-th graph convolution layer; dim indicates that any node feature vector is mapped into a dim-dimensional vector space; the symbol shown in Figure FDA0002742794380000023 denotes the mapped value of the output feature vector of node w at the n-th convolution layer; and softBitMap denotes the output of the merging layer.
5. A lipid-water distribution coefficient prediction method, comprising:
performing feature extraction on a biological small molecule by using the graph feature extraction method according to any one of claims 1 to 4;
and predicting the lipid-water distribution coefficient from the extracted features of the biological small molecule by using a pre-trained lipid-water distribution coefficient prediction model.
6. The method of claim 5, wherein the lipid-water distribution coefficient prediction model is trained by:
obtaining first lipid-water distribution coefficient training data and training data related to the first lipid-water distribution coefficient, the training data related to the first lipid-water distribution coefficient comprising at least one of: solubility, melting point, dissociation coefficient, and lipid-water distribution coefficient measured at pH 7.4;
and inputting the first lipid-water distribution coefficient training data and the training data related to the first lipid-water distribution coefficient into a machine learning model for pre-training to obtain the lipid-water distribution coefficient prediction model.
7. The method of claim 6, further comprising:
acquiring second lipid-water distribution coefficient training data and training data related to the second lipid-water distribution coefficient, wherein the second lipid-water distribution coefficient training data is the same as the first lipid-water distribution coefficient training data, and the accuracy of the second lipid-water distribution coefficient training data and the training data related to the second lipid-water distribution coefficient is greater than that of the first lipid-water distribution coefficient training data and the training data related to the first lipid-water distribution coefficient;
and inputting the second lipid-water distribution coefficient training data and the training data related to the second lipid-water distribution coefficient into the lipid-water distribution coefficient prediction model for training to obtain a target lipid-water distribution coefficient prediction model.
8. A graph feature extraction model, comprising:
an input layer, configured to acquire a feature graph to be extracted, wherein the feature graph to be extracted consists of a plurality of nodes and edges connecting nodes that have an association relationship;
a plurality of convolutional layers and a GRU network layer, arranged alternately, wherein each convolutional layer inputs the extracted features of the associated nodes in the feature graph to be extracted into the GRU network layer for feature fusion, and the fused features are input into the next convolutional layer, this step being repeated until the last convolutional layer;
and a merging layer, configured to perform feature fusion on the features of each node output by the last convolutional layer and to output the feature fusion result through an output layer.
9. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the graph feature extraction method of any one of claims 1-4 or the steps of the lipid-water distribution coefficient prediction method of any one of claims 5-7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the graph feature extraction method according to any one of claims 1 to 4 or the steps of the lipid-water distribution coefficient prediction method according to any one of claims 5 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011159909.3A CN112185480B (en) 2020-10-26 2020-10-26 Graph feature extraction and lipid water distribution coefficient prediction method and graph feature extraction model

Publications (2)

Publication Number Publication Date
CN112185480A true CN112185480A (en) 2021-01-05
CN112185480B CN112185480B (en) 2024-01-26

Family

ID=73923371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011159909.3A Active CN112185480B (en) 2020-10-26 2020-10-26 Graph feature extraction and lipid water distribution coefficient prediction method and graph feature extraction model

Country Status (1)

Country Link
CN (1) CN112185480B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant