CN116525029A - Molecular diagram generation method and device based on flow model - Google Patents

Molecular diagram generation method and device based on flow model

Info

Publication number
CN116525029A
Authority
CN
China
Prior art keywords
model
matrix
node
edge
molecular
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310499015.6A
Other languages
Chinese (zh)
Inventor
杜博
纪颖
万国佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Medical Informatics (AREA)
  • Medicinal Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a flow-model-based molecular graph generation method and device, relating to the field of artificial-intelligence-assisted drug discovery. The method comprises the following steps: collecting a public molecular dataset, converting the representation of the molecules, and computing topological and chemical information to extract a first node matrix and a first edge matrix of each molecule; constructing a node model incorporating self-attention from the first node matrix and the first edge matrix, and constructing an edge model incorporating two-dimensional convolution from the first edge matrix; optimizing the parameters of the node model and the edge model by maximum likelihood estimation; obtaining random node samples and edge samples, generating a second node matrix through the inverse mapping of the optimized node model, and generating a second edge matrix through the inverse mapping of the optimized edge model; and combining the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph. The invention can rapidly generate candidate molecules that are valid, diverse and novel.

Description

Molecular diagram generation method and device based on flow model
Technical Field
The invention belongs to the field of artificial-intelligence-assisted drug discovery, and in particular relates to a flow-model-based molecular graph generation method and device.
Background
In recent years, deep learning has shown broad research prospects in drug discovery, and is of particular research value for generating new molecular structures. Rapidly obtaining a large number of novel molecular structures is an important task in applications such as drug discovery and materials science; it helps accelerate the discovery of effective new drugs and other chemical materials, which is significant for the development of medicine and chemistry. However, because chemical space is inherently discrete and the search space is very large, the generation of new molecules has long suffered from low efficiency and a high failure rate.
In the molecule generation task, the generated molecules must not only be valence-valid but should also have good drug-likeness, diversity, novelty and so on. In recent years, various deep generative methods have proven effective at rapidly generating large numbers of molecules, including generation based on Simplified Molecular Input Line Entry System (SMILES) sequences or on graph structures. At present, graph-based molecule generation algorithms mainly fall into models based on variational autoencoders (VAEs), autoregressive generative models, generative adversarial networks (GANs) and the like. Typically, a graph generation algorithm involves two steps, learning a feature representation of the molecule and generating the molecular graph, and follows either a sequential or a one-shot generation approach.
Early molecule generation adopted a variational-autoencoder-based approach [40]. An encoder maps the high-dimensional raw data into a feature space so that molecules are represented continuously, and a decoder then maps the continuous molecular representation back to a molecular graph. The difference between the generated distribution and the true distribution is measured with the KL divergence, and the optimization objective is difficult to compute. JT-VAE is a classical VAE-based generative model: it decomposes the molecular graph into a tree of substructures, encodes the substructures, and then decodes and assembles the tree of substructures back into a molecular graph. However, given the complexity of the tree-structured substructures, the model is not well suited to generating small-molecule compounds with few atoms. Because VAE-based generative models are trained by optimizing a lower bound on the data log-likelihood rather than the log-likelihood itself, the gap between the log-likelihood and its lower bound can be large, which makes optimization harder.
Some molecule generation methods apply generative adversarial networks (GANs) to molecular datasets and combine them with reinforcement learning to generate molecules with desired properties. GAN-based molecule generation models typically comprise a generator, a discriminator and a reward network, which are trained adversarially against one another; the discriminator and the reward network may be implemented with graph neural networks. However, GAN-based generative models are prone to mode collapse, that is, poor diversity of the generated samples, and to unstable training, which must be addressed by careful hyperparameter tuning.
In summary, common problems of current methods include poor validity or diversity of the generated molecules, a low molecular reconstruction success rate, low drug-likeness of the generated molecules, and high model complexity caused by the need to incorporate reinforcement learning. Recently, flow-based generative models have become one of the leading directions in graph generation. A flow-based generative model aims to learn an invertible transformation between a base distribution (such as a Gaussian distribution) and the real high-dimensional data space. Its advantages are that no additional decoder needs to be designed, real sample points can be encoded exactly into their corresponding latent variables, the log-likelihood can be computed exactly, and maximum likelihood estimation can be used directly as the optimization objective; such models have shown excellent performance in image generation. In addition, since molecules are naturally graph-structured data, molecular graph generation can be realized effectively by incorporating a graph neural network into the flow-based generative model.
Therefore, a flow-based generative model incorporating a graph neural network is needed for molecule generation. Such a model treats molecules as graph data, learns the chemical features and graph topology of molecules through the graph neural network, and uses the flow framework to encode molecules into a more accurate latent space, so that a large number of molecular graphs can be generated efficiently, with a low repetition rate and high novelty. This avoids searching for molecules in a huge chemical space and improves the speed and success rate of the drug discovery process.
Disclosure of Invention
In view of the above problems, a first aspect of the present invention provides a molecular flow-based generative model incorporating graph attention. It uses the chemical information and the graph topology of molecules, and learns the features of molecular graphs through graph convolution incorporating a self-attention mechanism, so that latent representations of molecules can be learned effectively.
To achieve the above purpose, the invention adopts the following technical solution:
A flow-model-based molecular graph generation method, characterized by comprising the following steps:
collecting a public molecular dataset, converting the representation of the molecules, and computing topological and chemical information to extract a first node matrix and a first edge matrix of each molecule;
constructing a node model incorporating self-attention from the first node matrix and the first edge matrix, and constructing an edge model incorporating two-dimensional convolution from the first edge matrix;
optimizing the parameters of the node model and the edge model by maximum likelihood estimation;
obtaining random node samples and edge samples, generating a second node matrix through the inverse mapping of the optimized node model, and generating a second edge matrix through the inverse mapping of the optimized edge model;
and combining the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph.
In some embodiments, collecting the public molecular dataset, converting the representation of the molecules, and computing topological and chemical information to extract the first node matrix and the first edge matrix of each molecule comprises:
computing the chemical properties of each molecule with the cheminformatics software package RDKit to obtain atom and edge (bond) information;
kekulizing the molecules and retaining only their heavy-atom skeletons;
converting the heavy-atom skeletons into graph-structured data;
and obtaining a node matrix and an edge matrix from the graph-structured data.
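The last two steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the atom vocabulary `ATOM_TYPES`, the padding size `max_atoms`, the one-hot channel layout and the helper name `to_matrices` are hypothetical, and the heavy-atom skeleton is written by hand instead of being obtained from RDKit. Channel 0 is used for the empty node and the empty edge, in line with the type-0 convention described later in the text.

```python
# Illustrative sketch: vocabulary, shapes and names are assumptions,
# not taken from the patent text.
ATOM_TYPES = ["C", "N", "O", "F"]   # channel 0 is reserved for the empty node
N_BOND_TYPES = 3                    # single, double, triple (after kekulization)


def to_matrices(atoms, bonds, max_atoms):
    """Build a one-hot node matrix and edge tensor for a heavy-atom skeleton.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples with order in {1, 2, 3}
    """
    n_node_ch = len(ATOM_TYPES) + 1
    nodes = [[0] * n_node_ch for _ in range(max_atoms)]
    for i in range(max_atoms):
        ch = ATOM_TYPES.index(atoms[i]) + 1 if i < len(atoms) else 0
        nodes[i][ch] = 1

    # Edge tensor of shape (bond channels, max_atoms, max_atoms).
    n_edge_ch = N_BOND_TYPES + 1
    edges = [[[0] * max_atoms for _ in range(max_atoms)] for _ in range(n_edge_ch)]
    for i in range(max_atoms):
        for j in range(max_atoms):
            edges[0][i][j] = 1          # start with every edge marked empty
    for i, j, order in bonds:
        for a, b in ((i, j), (j, i)):   # undirected graph: fill both directions
            edges[0][a][b] = 0
            edges[order][a][b] = 1
    return nodes, edges


# Acetaldehyde heavy-atom skeleton C-C=O, padded to four atoms.
V, E = to_matrices(["C", "C", "O"], [(0, 1, 1), (1, 2, 2)], max_atoms=4)
```

Padding every molecule to the same `max_atoms` keeps the matrices a fixed size, which is what allows them to be fed to convolutional and coupling layers in batches.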
In some embodiments, constructing the node model incorporating self-attention from the first node matrix and the first edge matrix comprises:
splitting the input node matrix $V$ into two parts $(V_1, V_2)$ along the channel dimension, with the input edge matrix $E$ treated as a constant;
by the formulas:
$V_1, V_2 = \mathrm{Split}(V)$
$h = \sigma(\rho(\mathrm{AttnGCN}(V_1, E)))$
$t, \log s = \mathrm{Chunk}(h)$
$s = \mathrm{Sigmoid}(\log s)$
obtaining a single graph coupling layer $f_V(V, E)$, where $h$ is the output tensor, $t$ is the translation coefficient of the affine transformation, $\log s$ is the scaling coefficient in logarithmic form, $\sigma$ denotes the activation function, $\rho$ denotes normalization, AttnGCN is the attention-fused graph convolution module, Split is the splitting operation, and Chunk is the chunking function;
and obtaining the node model $f_{\mathcal{V}}$ by composing such layers, where each layer is a single graph coupling layer $f_V(V, E)$.
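A single graph coupling layer can be sketched as below. The sketch follows the Split, Chunk and Sigmoid steps listed above; the final affine step $Z_2 = s \odot V_2 + t$ is an assumption inferred from the diag(s) Jacobian block described in the detailed description, and `toy_net` is a hypothetical stand-in for $\sigma(\rho(\mathrm{AttnGCN}(V_1, E)))$, not the patent's network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_net(v1, adj):
    """Hypothetical stand-in for sigma(rho(AttnGCN(V1, E))).

    Returns h with twice as many channels as v1 so it can be chunked
    into the translation t and the log-scale log s.
    """
    h = []
    for i, row in enumerate(v1):
        # Sum the features of neighbouring nodes according to the adjacency matrix.
        agg = [sum(adj[i][j] * v1[j][c] for j in range(len(v1))) for c in range(len(row))]
        h.append([math.tanh(a + r) for a, r in zip(agg, row)]
                 + [math.tanh(a - r) for a, r in zip(agg, row)])
    return h


def coupling_forward(v, adj, net=toy_net):
    half = len(v[0]) // 2
    v1 = [row[:half] for row in v]                    # V1, V2 = Split(V)
    v2 = [row[half:] for row in v]
    h = net(v1, adj)                                  # h = sigma(rho(AttnGCN(V1, E)))
    t = [row[:half] for row in h]                     # t, log s = Chunk(h)
    log_s = [row[half:] for row in h]
    s = [[sigmoid(x) for x in row] for row in log_s]  # s = Sigmoid(log s)
    # Assumed affine step: V1 passes through unchanged, V2 is scaled and shifted.
    z2 = [[si * xi + ti for si, xi, ti in zip(sr, xr, tr)]
          for sr, xr, tr in zip(s, v2, t)]
    z = [r1 + r2 for r1, r2 in zip(v1, z2)]
    log_det = sum(math.log(si) for row in s for si in row)
    return z, log_det


V = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, 2.0, 1.0]]     # two nodes, four channels
A = [[0, 1], [1, 0]]                                  # the two nodes are bonded
z, log_det = coupling_forward(V, A)
```

Because $s$ and $t$ depend only on the untouched half $V_1$, the layer can be inverted without inverting the network itself, and the log-determinant is just the sum of the log-scales.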
In some embodiments, constructing the edge model incorporating two-dimensional convolution from the first edge matrix comprises:
splitting the input edge matrix $E$ into two parts $(E_1, E_2)$ along the channel dimension;
by the formulas:
$E_1, E_2 = \mathrm{Split}(E)$
$h = \sigma(\rho(\mathrm{Conv2d}(E_1)))$
$t, \log s = \mathrm{Chunk}(h)$
$s = \mathrm{Sigmoid}(\log s)$
$Z_E = f_{\mathcal{E}}(E)$
obtaining a single edge coupling layer $f_E(E)$, where $h$ is the output tensor, $t$ is the translation coefficient of the affine transformation, $\log s$ is the scaling coefficient in logarithmic form, $\sigma$ denotes the activation function, $\rho$ denotes normalization, and Conv2d is the two-dimensional convolution function;
and obtaining the edge model $f_{\mathcal{E}}$ by composing such layers, where each layer is a single edge coupling layer $f_E(E)$.
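In some embodiments, optimizing the parameters of the node model and the edge model by maximum likelihood estimation comprises:
sampling points $(V, E)$ from the true distribution $P_{\mathrm{data}}$;
by the formulas:
$P_V(V) = P_{Z_V}(Z_V) \left| \det\left( \frac{\partial Z_V}{\partial V} \right) \right|$
$P_E(E) = P_{Z_E}(Z_E) \left| \det\left( \frac{\partial Z_E}{\partial E} \right) \right|$
deriving the distribution $P_V(V)$ generated by the node model and the distribution $P_E(E)$ generated by the edge model, where $Z_V = f_{\mathcal{V}}(V, E)$ and $Z_E = f_{\mathcal{E}}(E)$ are the latent variables;
by the formulas:
$\log P_V(V) = \log P_{Z_V}(Z_V) + \log \left| \det\left( \frac{\partial Z_V}{\partial V} \right) \right|$
$\log P_E(E) = \log P_{Z_E}(Z_E) + \log \left| \det\left( \frac{\partial Z_E}{\partial E} \right) \right|$
deriving the log-likelihood $\log P_V(V)$ of the node model and the log-likelihood $\log P_E(E)$ of the edge model;
by the formula:
$\theta^{*} = \arg\max_{\theta} \mathbb{E}_{(V, E) \sim P_{\mathrm{data}}} \left[ \log P_V(V) + \log P_E(E) \right]$
and computing the optimized model parameters $\theta$.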
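The training objective can be sketched as a negative log-likelihood computed from the latent codes and the log-determinants returned by the two models. This is a minimal sketch assuming a standard normal base distribution; the function names and the flat-list representation of the latent codes are hypothetical.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of a flat list of latent values under N(0, I) (assumed base distribution)."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)


def negative_log_likelihood(z_v, logdet_v, z_e, logdet_e):
    """NLL of one molecule: -(log P_V(V) + log P_E(E)).

    z_v, z_e: flattened latent codes produced by the node and edge models
    logdet_v, logdet_e: log|det Jacobian| accumulated over the coupling layers
    """
    log_p_v = standard_normal_logpdf(z_v) + logdet_v   # change of variables
    log_p_e = standard_normal_logpdf(z_e) + logdet_e
    return -(log_p_v + log_p_e)


nll = negative_log_likelihood([0.0, 0.0], 0.0, [0.0], 0.0)
```

Minimizing this quantity over the training set with a gradient-based optimizer is one common way to carry out the maximum likelihood estimation described above.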
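In some embodiments, obtaining the random node samples and edge samples, generating the second node matrix through the inverse mapping of the optimized node model, and generating the second edge matrix through the inverse mapping of the optimized edge model comprises:
randomly sampling a node sample $Z_V$ and an edge sample $Z_E$ in the latent space;
by the formulas:
$E_1, Z_{E,2} = \mathrm{Split}(Z_E)$
$h = \sigma(\rho(\mathrm{Conv2d}(E_1)))$
$t, \log s = \mathrm{Chunk}(h)$
$s = \mathrm{Sigmoid}(\log s)$
$E_2 = (Z_{E,2} - t) \oslash s$
$E' = E_1 \,\|\, E_2$
obtaining an edge matrix $E'$, where $\|$ is the concatenation operation, $\oslash$ is element-wise division, $\sigma$ denotes the activation function, and $\rho$ denotes normalization;
by the formulas:
$V_1, Z_{V,2} = \mathrm{Split}(Z_V)$
$h = \sigma(\rho(\mathrm{AttnGCN}(V_1, E')))$
$t, \log s = \mathrm{Chunk}(h)$
$s = \mathrm{Sigmoid}(\log s)$
$V_2 = (Z_{V,2} - t) \oslash s$
$V' = V_1 \,\|\, V_2$
and obtaining a node matrix $V'$, where $\|$ is the concatenation operation, $\oslash$ is element-wise division, $\sigma$ denotes the activation function, and $\rho$ denotes normalization.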
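The inverse mapping can be illustrated on a flat vector. The sketch below assumes an affine coupling of the form $z_2 = s \odot x_2 + t$, so that the inverse is $x_2 = (z_2 - t) / s$; the conditioner `net` is a hypothetical toy function standing in for the convolution or attention block, and the latent sample is drawn from a standard normal distribution as an assumption.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def net(x1):
    """Hypothetical conditioner: returns (t, log s) computed from the fixed half."""
    t = [math.sin(a) for a in x1]
    log_s = [math.cos(a) for a in x1]
    return t, log_s


def forward(x):
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    t, log_s = net(x1)
    s = [sigmoid(a) for a in log_s]
    return x1 + [si * b + ti for si, b, ti in zip(s, x2, t)]


def inverse(z):
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    # z1 equals x1, so t and s are identical to those of the forward pass.
    t, log_s = net(z1)
    s = [sigmoid(a) for a in log_s]
    x2 = [(b - ti) / si for si, b, ti in zip(s, z2, t)]
    return z1 + x2                      # concatenation, as in V' = V1 || V2


random.seed(0)
z_sample = [random.gauss(0.0, 1.0) for _ in range(4)]  # random latent sample
x_generated = inverse(z_sample)                        # generated data point
x = [0.3, -1.2, 2.0, 0.7]
x_roundtrip = inverse(forward(x))
```

Since the sigmoid output is strictly positive, the division is always defined, which is what makes each coupling layer exactly invertible.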
In some embodiments, combining the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph comprises:
creating an empty molecule object using the RDKit cheminformatics toolkit;
adding all atoms to the molecule object according to the generated node matrix;
adding chemical bonds between atoms one by one according to the generated adjacency matrix, and performing a valence check each time a bond is added;
and, where a check fails, randomly selecting a bond type within the valid range, and outputting the largest connected component as the molecular graph.
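The assembly procedure can be sketched without RDKit as follows. Everything here is illustrative: the valence table `MAX_VALENCE` is a simplified assumption, the helper names are hypothetical, and skipping a bond when even a single bond does not fit is an assumed fallback that the text does not spell out.

```python
import random

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}   # simplified, assumed valence table


def largest_component(n, bonds):
    """Return the sorted atom indices of the largest connected component."""
    adj = {i: set() for i in range(n)}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(comp) > len(best):
            best = comp
    return sorted(best)


def assemble(atoms, bond_orders, rng=random):
    """atoms: element symbols; bond_orders: {(i, j): order} read from the edge matrix."""
    used = [0] * len(atoms)
    bonds = {}
    for (i, j), order in sorted(bond_orders.items()):
        free = min(MAX_VALENCE[atoms[i]] - used[i], MAX_VALENCE[atoms[j]] - used[j])
        # Valence check: resample a lower bond type, randInt(1, bondtype - 1).
        while order > free and order > 1:
            order = rng.randint(1, order - 1)
        if order > free:
            continue                    # assumed fallback: leave this edge empty
        bonds[(i, j)] = order
        used[i] += order
        used[j] += order
    return largest_component(len(atoms), bonds), bonds


atoms = ["C", "O", "O", "F"]
proposed = {(0, 1): 2, (0, 2): 3, (1, 3): 1}     # the triple bond to O is invalid
component, bonds = assemble(atoms, proposed)
```

In this example the C=O double bond is accepted, the proposed triple bond is lowered to a single or double bond, and the O-F bond is dropped because the first oxygen is already saturated, so the fluorine atom falls outside the largest connected component.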
A second aspect of the invention provides a flow-model-based molecular graph generation device, which uses the chemical information and the graph topology of molecules and learns the features of molecular graphs through graph convolution incorporating a self-attention mechanism, so that latent representations of molecules can be learned effectively.
A flow-model-based molecular graph generation device, comprising:
an acquisition module, configured to collect a public molecular dataset, convert the representation of the molecules, and compute topological and chemical information to extract a first node matrix and a first edge matrix of each molecule;
a model construction module, configured to construct a node model incorporating self-attention from the first node matrix and the first edge matrix, and to construct an edge model incorporating two-dimensional convolution from the first edge matrix;
an optimization module, configured to optimize the parameters of the node model and the edge model by maximum likelihood estimation;
a generation module, configured to obtain random node samples and edge samples, generate a second node matrix through the inverse mapping of the optimized node model, and generate a second edge matrix through the inverse mapping of the optimized edge model; and to combine the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph.
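In some embodiments, the model construction module is configured to:
split the input node matrix $V$ into two parts $(V_1, V_2)$ along the channel dimension, with the input edge matrix $E$ treated as a constant;
by the formulas:
$V_1, V_2 = \mathrm{Split}(V)$
$h = \sigma(\rho(\mathrm{AttnGCN}(V_1, E)))$
$t, \log s = \mathrm{Chunk}(h)$
$s = \mathrm{Sigmoid}(\log s)$
obtain a single graph coupling layer $f_V(V, E)$, where $h$ is the output tensor, $t$ is the translation coefficient of the affine transformation, $\log s$ is the scaling coefficient in logarithmic form, $\sigma$ denotes the activation function, $\rho$ denotes normalization, AttnGCN is the attention-fused graph convolution module, Split is the splitting operation, and Chunk is the chunking function;
and obtain the node model $f_{\mathcal{V}}$ by composing such layers, where each layer is a single graph coupling layer $f_V(V, E)$.
The model construction module is further configured to:
split the input edge matrix $E$ into two parts $(E_1, E_2)$ along the channel dimension;
by the formulas:
$E_1, E_2 = \mathrm{Split}(E)$
$h = \sigma(\rho(\mathrm{Conv2d}(E_1)))$
$t, \log s = \mathrm{Chunk}(h)$
$s = \mathrm{Sigmoid}(\log s)$
$Z_E = f_{\mathcal{E}}(E)$
obtain a single edge coupling layer $f_E(E)$, where $h$ is the output tensor, $t$ is the translation coefficient of the affine transformation, $\log s$ is the scaling coefficient in logarithmic form, $\sigma$ denotes the activation function, $\rho$ denotes normalization, and Conv2d is the two-dimensional convolution function;
and obtain the edge model $f_{\mathcal{E}}$ by composing such layers, where each layer is a single edge coupling layer $f_E(E)$.
In some embodiments, the optimization module is configured to:
sample points $(V, E)$ from the true distribution $P_{\mathrm{data}}$;
by the formulas:
$P_V(V) = P_{Z_V}(Z_V) \left| \det\left( \frac{\partial Z_V}{\partial V} \right) \right|$
$P_E(E) = P_{Z_E}(Z_E) \left| \det\left( \frac{\partial Z_E}{\partial E} \right) \right|$
derive the distribution $P_V(V)$ generated by the node model and the distribution $P_E(E)$ generated by the edge model, where $Z_V = f_{\mathcal{V}}(V, E)$ and $Z_E = f_{\mathcal{E}}(E)$ are the latent variables;
by the formulas:
$\log P_V(V) = \log P_{Z_V}(Z_V) + \log \left| \det\left( \frac{\partial Z_V}{\partial V} \right) \right|$
$\log P_E(E) = \log P_{Z_E}(Z_E) + \log \left| \det\left( \frac{\partial Z_E}{\partial E} \right) \right|$
derive the log-likelihood $\log P_V(V)$ of the node model and the log-likelihood $\log P_E(E)$ of the edge model;
by the formula:
$\theta^{*} = \arg\max_{\theta} \mathbb{E}_{(V, E) \sim P_{\mathrm{data}}} \left[ \log P_V(V) + \log P_E(E) \right]$
and compute the optimized model parameters $\theta$.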
The flow-model-based molecular graph generation method of the invention comprises the following steps: collecting a public molecular dataset, converting the representation of the molecules, and computing topological and chemical information to extract a first node matrix and a first edge matrix of each molecule; constructing a node model incorporating self-attention from the first node matrix and the first edge matrix, and constructing an edge model incorporating two-dimensional convolution from the first edge matrix; optimizing the parameters of the node model and the edge model by maximum likelihood estimation; obtaining random node samples and edge samples, generating a second node matrix through the inverse mapping of the optimized node model, and generating a second edge matrix through the inverse mapping of the optimized edge model; and combining the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph. The method can quickly generate candidate molecules that are valid, diverse and novel, which helps reduce the cost and time of the early stage of drug discovery and the failure rate of its later stages.
Drawings
FIG. 1 is a flow chart of the flow-model-based molecular graph generation method in an embodiment of the invention;
FIG. 2 is a schematic diagram of the flow-model-based molecular graph generation method in an embodiment of the invention;
FIG. 3 is a structural diagram of the node model in an embodiment of the invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments herein without creative effort fall within the scope of protection of the present application.
Referring to FIG. 1 and FIG. 2, an embodiment of the present invention provides a flow-model-based molecular graph generation method comprising the following steps: collecting a public molecular dataset, converting the representation of the molecules, and computing topological and chemical information to extract a first node matrix and a first edge matrix of each molecule; constructing a node model incorporating self-attention from the first node matrix and the first edge matrix, and constructing an edge model incorporating two-dimensional convolution from the first edge matrix; optimizing the parameters of the node model and the edge model by maximum likelihood estimation; obtaining random node samples and edge samples, generating a second node matrix through the inverse mapping of the optimized node model, and generating a second edge matrix through the inverse mapping of the optimized edge model; and combining the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph.
In the step of collecting the public molecular dataset, converting the representation of the molecules, and computing topological and chemical information to extract the first node matrix and the first edge matrix, the original public molecular dataset contains four types of chemical bonds (edges), namely single bonds, double bonds, triple bonds and aromatic (benzene-ring) bonds, and provides molecules encoded in the canonical SMILES representation.
The molecules are preprocessed with the cheminformatics software package RDKit: the chemical properties of each molecule are computed and atom and edge information is obtained. In addition, following the baseline method, all molecules are kekulized and hydrogen atoms are removed, leaving only the heavy-atom skeleton of each molecule.
Notably, the processed dataset contains only three bond types, namely single, double and triple bonds; benzene-ring structures are expressed as combinations of single and double bonds. To make it easier to assemble the generated node and adjacency matrices into molecules, an edge type 0 and a node type 0 are additionally defined, representing empty edges and empty nodes respectively.
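Referring to FIG. 3, the node model incorporating self-attention is constructed from the first node matrix and the first edge matrix, where the Jacobian matrix of each graph coupling layer $f_V$ is:
$\frac{\partial f_V}{\partial V} = \begin{bmatrix} I & 0 \\ \frac{\partial Z_{V,2}}{\partial V_1} & \mathrm{diag}(s) \end{bmatrix}$
where $I$ is the identity matrix, $Z_{V,2}$ is the transformed part of the layer output, and $\mathrm{diag}(s)$ is the diagonal matrix formed from the vector $s$. Because this matrix is block lower-triangular, the logarithm of the Jacobian determinant of the graph coupling layer $f_V$ can be computed exactly as:
$\log \left| \det\left( \frac{\partial f_V}{\partial V} \right) \right| = \sum_j \log |s_j|$
By the chain rule, the Jacobian determinant of the node model $f_{\mathcal{V}}$ is the product of the Jacobian determinants of its graph coupling layers:
$\det\left( J_{f_{\mathcal{V}}} \right) = \prod_i \det\left( J_{f_V^{i}} \right)$
where $f_V^{i}$ denotes the $i$-th graph coupling layer and $J_f$ denotes the Jacobian matrix of $f$.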
Notably, for the attention-fused graph convolution module AttnGCN in the graph coupling layer, the input is the first node matrix $V$ and the edge matrix $E$, and the output is the embedding vectors of all nodes. The computation is divided into the following stages:
1) Message generation stage.
First, for each node in the molecular graph, the node feature vector is mapped into the latent space by AttnNode to obtain $h'_v$.
In addition, the node message generation function AttnMessage generates the feature messages of each node under each relation type $r \in R$.
AttnNode and AttnMessage have the same structure but different parameters. Taking AttnNode as an example, the calculation is as follows:
$q = W_q v$
$k = W_k v$
$z = W_z v$
where the dimensions of the parameter matrices $W_q$, $W_k$, $W_z$ are $d_q$, $d_k$, $d_z$ respectively.
The attention score $\alpha_v$ is then computed [formula not reproduced in the source text].
The attention score is then normalized with the Softmax function to obtain the attention weight coefficients, and the input feature vectors are weighted and summed according to these coefficients to obtain the reconstructed, attention-fused node feature vector:
$h'_v = \mathrm{Softmax}(\alpha_v) * v$
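The attention step can be sketched as below. The score formula is not available in this text, so the sketch assumes a scaled dot product between each node's query and every node's key; that choice, the function names and the list-based representation are all assumptions. The projection $z = W_z v$ is left out because the final weighted sum in the text is taken over the input feature vectors $v$.

```python
import math


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(xs):
    m = max(xs)                          # subtract the maximum for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attn_node(features, w_q, w_k):
    """Attention-refined node features; the scoring rule is an assumed scaled dot product."""
    qs = [matvec(w_q, v) for v in features]   # q = W_q v
    ks = [matvec(w_k, v) for v in features]   # k = W_k v
    d_k = len(ks[0])
    out = []
    for q in qs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in ks]
        weights = softmax(scores)             # attention weight coefficients
        # Weighted sum of the input feature vectors, as in h'_v = Softmax(alpha_v) * v.
        out.append([sum(w * v[c] for w, v in zip(weights, features))
                    for c in range(len(features[0]))])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
h_prime = attn_node(feats, identity, identity)
```

With identity projections each node attends most strongly to itself, so the first component of `h_prime[0]` is larger than the second while the two weights still sum to one.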
2) Message aggregation stage.
The feature messages of neighboring nodes are propagated along the adjacency matrix and aggregated into $m_v$ [formula not reproduced in the source text], where $\mathcal{N}_r(v)$ denotes the set of neighbors of node $v$ connected to it by edges of type $r$.
3) Update and readout stage.
The aggregated neighborhood feature message $m_v$ is added to the node's own feature $h'_v$ to update the node's embedding vector, which is then output through the readout function ReadOut:
$h_v = \sigma(\rho(\mathrm{ReadOut}(h'_v, m_v)))$
$\quad = \sigma(h'_v + m_v)$
Here, the readout function may be implemented as a multilayer perceptron or as a network with the same structure as AttnNode; it ensures that the input and output dimensions of the graph coupling layer are the same.
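Stages 2) and 3) can be sketched as follows. The aggregation formula is not available in this text, so the sketch assumes an unnormalized sum of messages over all relation types and neighbors, and uses ReLU for the activation $\sigma$; both choices and all names are assumptions made for illustration.

```python
def aggregate_and_update(h_self, messages, edges):
    """Sum neighbour messages per relation type, add the node's own feature, apply ReLU.

    h_self[v]      : attention-refined feature of node v
    messages[r][u] : message of node u under relation (bond) type r
    edges[r][v][u] : 1 if v and u are connected by an edge of type r, else 0
    """
    n, d = len(h_self), len(h_self[0])
    out = []
    for v in range(n):
        m_v = [0.0] * d
        for r, adj in enumerate(edges):
            for u in range(n):
                if adj[v][u]:
                    for c in range(d):
                        m_v[c] += messages[r][u][c]
        # Update: h_v = sigma(h'_v + m_v), with sigma assumed to be ReLU.
        out.append([max(0.0, a + b) for a, b in zip(h_self[v], m_v)])
    return out


h_self = [[1.0, -2.0], [0.0, 0.0]]
messages = [[[0.5, 0.5], [1.0, 1.0]]]      # a single relation type
edges = [[[0, 1], [1, 0]]]                 # the two nodes are bonded
h = aggregate_and_update(h_self, messages, edges)
```

Keeping one message table per relation type lets single, double and triple bonds contribute differently to a node's neighborhood summary.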
In this embodiment, the edge model incorporating two-dimensional convolution is constructed from the first edge matrix. The Jacobian determinant of the edge model $f_{\mathcal{E}}$ is the product of the Jacobian determinants of its edge coupling layers:
$\det\left( J_{f_{\mathcal{E}}} \right) = \prod_i \det\left( J_{f_E^{i}} \right)$
where $f_E^{i}$ denotes the $i$-th edge coupling layer and the logarithm of the Jacobian determinant of each edge coupling layer is $\sum_j \log |s_j|$.
To allow two-dimensional convolution on the adjacency tensor and to obtain more feature channels in the affine coupling transformation, each adjacency tensor is first reshaped from its original feature space into one with more channels before the edge tensor is fed into the model [the tensor shapes are not reproduced in the source text], and the affine transformation is then applied to the reshaped data. The inverse reshaping is applied to the data output by the edge model to restore an edge tensor of the original dimensions. Furthermore, to enhance the nonlinearity of the model, adjacent edge coupling layers partition the input alternately: if one edge coupling layer partitions its input as $(E_1, E_2) = \mathrm{Split}_l(E)$, the next layer partitions it as $(E_2, E_1) = \mathrm{Split}_{l+1}(E)$, so that the part passed through unchanged by one coupling layer is transformed in the next.
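The effect of alternating the partition can be seen on a flat vector. In this sketch the conditioner is a hypothetical toy function and the affine form is assumed; the `swap` flag plays the role of choosing between the two partitions of consecutive layers.

```python
import math


def coupling(x, swap):
    """One affine coupling layer on a flat vector; swap selects which half stays fixed."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    if swap:                       # alternate the roles of the two halves
        a, b = b, a
    # Toy conditioner (assumption): scale and shift computed from the fixed half.
    s = [1.0 / (1.0 + math.exp(-math.cos(v))) for v in a]
    t = [math.sin(v) for v in a]
    b = [si * bi + ti for si, bi, ti in zip(s, b, t)]
    if swap:
        a, b = b, a                # restore the original ordering of the halves
    return a + b


x = [0.5, -1.0, 2.0, 0.25]
y1 = coupling(x, swap=False)       # first half passes through, second half is transformed
y2 = coupling(y1, swap=True)       # second half passes through, first half is transformed
```

After two layers every coordinate has been transformed at least once, which a stack of layers with a fixed partition would not achieve.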
In this embodiment, the parameters of the node model and the edge model are optimized by maximum likelihood estimation. During training, maximizing the likelihood is equivalent to minimizing the KL divergence between the true distribution on the training set and the distribution generated by the model, so that the two are as similar as possible. Using the factorization of the molecular graph distribution, $P_G(G) = P_V(V) \times P_E(E)$, this gives:
$\theta^{*} = \arg\max_{\theta} \mathbb{E}_{(V, E) \sim P_{\mathrm{data}}} \left[ \log P_V(V) + \log P_E(E) \right]$
The larger the likelihood, the better the model parameters; equivalently, after negating the log-likelihood, the smaller the negative log-likelihood, the better. The negative log-likelihood of the sample points is therefore computed during training.
In this embodiment, combining the second node matrix with the second edge matrix, with checking and correction against chemical rules, to generate a molecular graph proceeds as follows. First, an empty molecule object is created using the RDKit cheminformatics toolkit. All atoms are then added to the molecule object according to the generated second node matrix, and chemical bonds between atoms are added one by one according to the generated second edge matrix, with a valence check performed each time a bond is added. If a newly added bond is inconsistent with the valence of the atoms at its two ends, the valence rule is violated; a new bond type is then selected at random within the valid range, newbondtype = randInt(1, bondtype - 1), and the newly added bond is replaced by this lower-order bond. Finally, the largest connected component of the molecular graph is output. This keeps the change to the generated molecule minimal while preserving diversity, and the resulting molecule does not need to be chemically checked and regenerated.
Notably, correction may also be carried out as follows: valence checking is performed after the complete molecular graph has been assembled, and atoms or edges that violate the rules are corrected. If the molecule satisfies the chemical constraints, its largest connected component is output as the final generated molecule. If the molecule contains an atom with invalid valence, the bond of highest order among the bonds attached to that atom is found and replaced by a lower-order bond selected at random within the valid range, newbondtype = randInt(1, bondtype - 1); the resulting molecule is then chemically checked and corrected again. In this way the change to the generated molecule is small while diversity is retained.
In summary, the molecular diagram generation method based on the flow model of the invention includes the following steps: collecting a public molecular dataset, converting the representation of the molecules, and calculating topological structure and chemical information to extract a first node matrix and a first edge matrix of the molecules; constructing a node model fusing self-attention according to the first node matrix and the first edge matrix, and constructing an edge model fusing two-dimensional convolution according to the first edge matrix; optimizing the model parameters of the node model and the edge model through maximum likelihood estimation; acquiring a random node sample and a random edge sample, generating a second node matrix through the inverse mapping of the optimized node model, and generating a second edge matrix through the inverse mapping of the optimized edge model; and combining the second node matrix with the second edge matrix through the checking and correction of chemical rules to generate a molecular graph. The method can quickly generate candidate molecules that are valid, diverse and novel, which helps reduce the cost and time of early-stage drug discovery and the failure rate of its later stages.
In the description of the present application, it should be noted that the azimuth or positional relationship indicated by the terms "upper", "lower", etc. are based on the azimuth or positional relationship shown in the drawings, and are merely for convenience of description of the present application and simplification of the description, and are not indicative or implying that the apparatus or element in question must have a specific azimuth, be configured and operated in a specific azimuth, and thus should not be construed as limiting the present application. Unless specifically stated or limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art as the case may be.
It should be noted that in this application, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A molecular diagram generation method based on a flow model, characterized by comprising the following steps:
collecting a public molecular dataset, converting the representation of the molecules, and calculating topological structure and chemical information to extract a first node matrix and a first edge matrix of the molecules;
constructing a node model fusing self-attention according to the first node matrix and the first edge matrix, and constructing an edge model fusing two-dimensional convolution according to the first edge matrix;
optimizing model parameters of the node model and the edge model through maximum likelihood estimation;
acquiring a random node sample and an edge sample, generating a second node matrix through inverse mapping of the optimized node model, and generating a second edge matrix through inverse mapping of the optimized edge model;
and combining the second node matrix with the second edge matrix through the checking and correction of chemical rules to generate a molecular graph.
2. The flow model based molecular graph generation method of claim 1, wherein collecting the public molecular dataset, converting the representation of the molecules, and calculating topological structure and chemical information to extract the first node matrix and the first edge matrix of the molecules comprises:
calculating the chemical properties of each molecule using the cheminformatics software package RDKit to acquire atom and edge information;
kekulizing the molecules and retaining their heavy-atom skeletons;
converting the heavy-atom skeletons into graph-structured data;
and acquiring the node matrix and the edge matrix from the graph-structured data.
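The conversion from a heavy-atom skeleton to a node matrix and an edge matrix can be sketched as follows. The encoding is an assumption, since the text does not specify it: atoms are one-hot encoded with an extra padding column, and bonds are stored as one adjacency slice per bond order. The vocabulary `ATOM_TYPES`, the list `BOND_ORDERS` and the function `to_matrices` are hypothetical names.

```python
ATOM_TYPES = ["C", "N", "O", "F"]          # hypothetical atom vocabulary
BOND_ORDERS = [1, 2, 3]                    # single, double, triple

def to_matrices(atoms, bonds, max_atoms):
    """Build a one-hot node matrix and a channel-first edge tensor."""
    # node matrix: max_atoms x (len(ATOM_TYPES) + 1); last column marks padding
    node = [[0] * (len(ATOM_TYPES) + 1) for _ in range(max_atoms)]
    for row in range(max_atoms):
        col = ATOM_TYPES.index(atoms[row]) if row < len(atoms) else len(ATOM_TYPES)
        node[row][col] = 1
    # edge tensor: one max_atoms x max_atoms adjacency slice per bond order
    edge = [[[0] * max_atoms for _ in range(max_atoms)] for _ in BOND_ORDERS]
    for i, j, order in bonds:
        channel = BOND_ORDERS.index(order)
        edge[channel][i][j] = 1
        edge[channel][j][i] = 1            # molecular graphs are undirected
    return node, edge

# Heavy-atom skeleton of formaldehyde: C=O, padded to three atom slots.
node, edge = to_matrices(["C", "O"], [(0, 1, 2)], max_atoms=3)
```

Padding every molecule to the same `max_atoms` gives all samples the same tensor shape, which is what allows them to be batched through the node and edge models.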
3. The flow model-based molecular graph generation method of claim 1, wherein constructing the node model fusing self-attention according to the first node matrix and the first edge matrix comprises:
dividing the input node matrix V into two parts (V_1, V_2) along the channel dimension, with the input edge matrix E used as a constant;
by the formula:
V_1, V_2 = Split(V)
h = σ(ρ(AttnGCN(V_1, E)))
t, log s = Chunk(h)
s = Sigmoid(log s)
obtaining a single graph coupling layer, wherein h is an output tensor, t is the transformation coefficient of the affine transformation, log s is the scaling coefficient in logarithmic form, σ denotes an activation function, ρ denotes normalization, AttnGCN is the graph convolution module fusing attention, Split is a splitting operation, and Chunk is a chunking function;
by the formula:
[formula not reproduced in the source text]
obtaining the node model, wherein each constituent layer is the single graph coupling layer.
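The Split, Chunk and Sigmoid steps of a coupling layer can be illustrated with a small, self-contained sketch. Several things here are assumptions rather than the claimed method: the affine form `z2 = v2 * s + t` is the conventional coupling update and is assumed because the corresponding formula is not reproduced in the text, and `conditioner` is a hypothetical stand-in for σ(ρ(AttnGCN(V_1, E))) that ignores the edge matrix.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conditioner(v1):
    """Hypothetical stand-in for sigma(rho(AttnGCN(V1, E))); returns h of length 2 * len(v1)."""
    return [math.tanh(x) for x in v1] + [0.5 * x for x in v1]

def coupling_forward(v):
    """Affine coupling: V1 passes through, V2 is scaled and shifted using V1."""
    half = len(v) // 2
    v1, v2 = v[:half], v[half:]                  # Split
    h = conditioner(v1)
    t, log_s = h[:half], h[half:]                # Chunk
    s = [sigmoid(x) for x in log_s]              # s = Sigmoid(log s)
    z2 = [a * b + c for a, b, c in zip(v2, s, t)]
    log_det = sum(math.log(x) for x in s)        # log|det Jacobian| of the layer
    return v1 + z2, log_det

def coupling_inverse(z):
    """Exact inverse: s and t are recomputed from the untouched half."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    h = conditioner(z1)
    t, log_s = h[:half], h[half:]
    s = [sigmoid(x) for x in log_s]
    v2 = [(a - c) / b for a, b, c in zip(z2, s, t)]
    return z1 + v2

v = [0.2, -1.0, 0.7, 1.5]                        # a flattened toy input of even length
z, log_det = coupling_forward(v)
recovered = coupling_inverse(z)
```

Because the first half is left unchanged, the conditioner never has to be inverted, which is why an arbitrary network can be used inside a coupling layer while the layer as a whole stays invertible.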
4. The flow model-based molecular graph generation method of claim 1, wherein constructing the edge model fusing two-dimensional convolution according to the first edge matrix comprises:
dividing the input edge matrix E into two parts (E_1, E_2) along the channel dimension;
by the formula:
E_1, E_2 = Split(E)
h = σ(ρ(Conv2d(E_1)))
t, log s = Chunk(h)
s = Sigmoid(log s)
Z_E = f_ε(E)
obtaining a single edge coupling layer f_ε(E), wherein h is an output tensor, t is the transformation coefficient of the affine transformation, log s is the scaling coefficient in logarithmic form, σ denotes an activation function, ρ denotes normalization, and Conv2d is a two-dimensional convolution function;
by the formula:
[formula not reproduced in the source text]
obtaining the edge model, wherein f_ε is the single edge coupling layer f_ε(E).
5. The flow model-based molecular graph generation method of claim 1, wherein optimizing the model parameters of the node model and the edge model by maximum likelihood estimation comprises:
sampling points (V, E) from the true distribution;
by the formula:
[formula not reproduced in the source text]
deriving the distribution generated by the node model and the distribution P_ε(E) generated by the edge model;
by the formula:
[formula not reproduced in the source text]
deriving the log-likelihood of the distribution generated by the node model and the log-likelihood log P_ε(E) of the distribution P_ε(E) generated by the edge model;
by the formula:
[formula not reproduced in the source text]
and calculating the optimized model parameters θ.
6. The flow model-based molecular graph generation method of claim 1, wherein acquiring the random node sample and the random edge sample, generating the second node matrix through the inverse mapping of the optimized node model, and generating the second edge matrix through the inverse mapping of the optimized edge model comprises:
randomly sampling a node sample Z_V and an edge sample Z_E in the latent space;
by the formula:
t, log s = Chunk(h)
s = Sigmoid(log s)
E′ = E_1 || E_2
obtaining the edge matrix E′, wherein || is a concatenation operation, σ denotes an activation function, and ρ denotes normalization;
by the formula:
t, log s = Chunk(h)
s = Sigmoid(log s)
V′ = V_1 || V_2
and obtaining the node matrix V′, wherein || is a concatenation operation, σ denotes an activation function, and ρ denotes normalization.
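Generation by inverse mapping can be sketched as: draw a latent sample, undo the stacked layers in reverse order, concatenate the two halves, and discretize. The sketch below is a simplification under stated assumptions: it uses toy additive coupling layers instead of the affine layers of the claims, the per-layer `shifts` stand in for trained parameters, and the row-wise argmax used to pick an atom type is an assumed discretization step. All function names are hypothetical.

```python
import math
import random

def layer_forward(x, shift):
    """Toy invertible layer: the first half conditions an additive shift on the second half."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    b = [bi + shift * math.tanh(ai) for ai, bi in zip(a, b)]
    return b + a            # swap halves so the next layer transforms the other part

def layer_inverse(z, shift):
    """Undo layer_forward: un-swap the halves, then subtract the shift."""
    half = len(z) // 2
    b, a = z[:half], z[half:]
    b = [bi - shift * math.tanh(ai) for ai, bi in zip(a, b)]
    return a + b            # concatenation, the || of the claims

def flow_inverse(z, shifts):
    """Undo a stack of layers by applying their inverses in reverse order."""
    for shift in reversed(shifts):
        z = layer_inverse(z, shift)
    return z

def argmax_rows(flat, n_cols):
    """Discretize a flat row-major matrix into one category index per row."""
    rows = [flat[k:k + n_cols] for k in range(0, len(flat), n_cols)]
    return [max(range(n_cols), key=row.__getitem__) for row in rows]

rng = random.Random(0)
shifts = [0.3, -0.8, 1.1]                       # hypothetical trained parameters
z_v = [rng.gauss(0.0, 1.0) for _ in range(8)]   # latent node sample Z_V
v_prime = flow_inverse(z_v, shifts)             # continuous node matrix V'
atom_ids = argmax_rows(v_prime, n_cols=4)       # 2 atoms x 4 atom types
```

The same pattern would apply to the edge sample Z_E, with the resulting edge matrix E′ passed to the valence-checked assembly step.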
7. The flow model based molecular graph generation method of claim 1, wherein combining the second node matrix with the second edge matrix through the checking and correction of chemical rules to generate the molecular graph comprises:
creating a blank molecule object using the RDKit chemistry toolkit;
adding all atoms to the molecule object according to the generated node matrix;
adding chemical bonds between the atoms one at a time according to the generated adjacency matrix, and performing a valence check each time a chemical bond is added;
and randomly selecting a chemical bond within the valid range to replace a bond that fails the check, and outputting the largest connected subgraph as the molecular graph.
8. A molecular diagram generating device based on a flow model, comprising:
the acquisition module is used for collecting a public molecular data set, converting the representation form of the molecules and calculating the topological structure and chemical information so as to extract a first node matrix and a first edge matrix of the molecules;
the model construction module is used for constructing a node model fused with self-attention according to the first node matrix and the first edge matrix and constructing an edge model fused with two-dimensional convolution according to the first edge matrix;
an optimization module that optimizes model parameters of the node model and the edge model by maximum likelihood estimation;
the generation module is used for acquiring a random node sample and a random edge sample, generating a second node matrix through the inverse mapping of the optimized node model, and generating a second edge matrix through the inverse mapping of the optimized edge model; and for combining the second node matrix with the second edge matrix through the checking and correction of chemical rules to generate a molecular graph.
9. The molecular diagram generating device based on a flow model according to claim 8, wherein the model construction module is configured to:
divide the input node matrix V into two parts (V_1, V_2) along the channel dimension, with the input edge matrix E used as a constant;
by the formula:
V_1, V_2 = Split(V)
h = σ(ρ(AttnGCN(V_1, E)))
t, log s = Chunk(h)
s = Sigmoid(log s)
obtain a single graph coupling layer, wherein h is an output tensor, t is the transformation coefficient of the affine transformation, log s is the scaling coefficient in logarithmic form, σ denotes an activation function, ρ denotes normalization, AttnGCN is the graph convolution module fusing attention, Split is a splitting operation, and Chunk is a chunking function;
by the formula:
[formula not reproduced in the source text]
obtain the node model, wherein each constituent layer is the single graph coupling layer.
The model construction module is further configured to:
divide the input edge matrix E into two parts (E_1, E_2) along the channel dimension;
by the formula:
E_1, E_2 = Split(E)
h = σ(ρ(Conv2d(E_1)))
t, log s = Chunk(h)
s = Sigmoid(log s)
Z_E = f_ε(E)
obtain a single edge coupling layer f_ε(E), wherein h is an output tensor, t is the transformation coefficient of the affine transformation, log s is the scaling coefficient in logarithmic form, σ denotes an activation function, ρ denotes normalization, and Conv2d is a two-dimensional convolution function;
by the formula:
[formula not reproduced in the source text]
obtain the edge model, wherein f_ε is the single edge coupling layer f_ε(E).
10. The flow model based molecular diagram generating device of claim 8, wherein the training module is configured to:
sample points (V, E) from the true distribution;
by the formula:
[formula not reproduced in the source text]
derive the distribution generated by the node model and the distribution P_ε(E) generated by the edge model;
by the formula:
[formula not reproduced in the source text]
derive the log-likelihood of the distribution generated by the node model and the log-likelihood log P_ε(E) of the distribution P_ε(E) generated by the edge model;
by the formula:
[formula not reproduced in the source text]
and calculate the optimized model parameters θ.
CN202310499015.6A 2023-04-26 2023-04-26 Molecular diagram generation method and device based on flow model Pending CN116525029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310499015.6A CN116525029A (en) 2023-04-26 2023-04-26 Molecular diagram generation method and device based on flow model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310499015.6A CN116525029A (en) 2023-04-26 2023-04-26 Molecular diagram generation method and device based on flow model

Publications (1)

Publication Number Publication Date
CN116525029A true CN116525029A (en) 2023-08-01

Family

ID=87400746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310499015.6A Pending CN116525029A (en) 2023-04-26 2023-04-26 Molecular diagram generation method and device based on flow model

Country Status (1)

Country Link
CN (1) CN116525029A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012304A (en) * 2023-09-18 2023-11-07 河北农业大学 Deep learning molecule generation system and method fused with GGNN-GAN
CN117012304B (en) * 2023-09-18 2024-02-02 河北农业大学 Deep learning molecule generation system and method fused with GGNN-GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination