CN112509644A - Molecular optimization method, system, terminal equipment and readable storage medium - Google Patents
- Publication number
- CN112509644A (application CN202011502775.0A)
- Authority
- CN
- China
- Prior art keywords
- molecule
- source
- branch
- molecular
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/08—Probabilistic or stochastic CAD
Abstract
The present application belongs to the technical field of data synthesis, and in particular relates to a molecular optimization method, system, terminal device and readable storage medium. Existing systems perform poorly when the chemical structure of the target molecule is complex. The application provides a molecular optimization method comprising: obtaining a source molecule, and cutting the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule; converting the source molecule into a Simplified Molecular Input Line Entry Specification (SMILES) string, wherein the SMILES string represents the non-common molecular substructure of the target molecule; and splicing the source molecule retention region with the SMILES string to obtain the target molecule. The system error rate is greatly reduced.
Description
Technical Field
The present application belongs to the technical field of data synthesis, and in particular, relates to a method, a system, a terminal device and a readable storage medium for molecular optimization.
Background
Definition of the molecular optimization task: a given source molecule is input into a molecule optimizer (generator), which converts it into another molecule (the target molecule). The target molecule has a chemical structure similar to that of the source molecule, but some of its other chemical properties (e.g., water solubility) are greatly improved.
Existing methods convert the graph generation task into a sequential decision process. Each element in the sequence is a specific decision, and there are three main types of decision: 1) whether to add a new node at the current step (generation is considered finished when no new node is added); 2) whether to add a new edge at the current step; 3) which node to connect to the current new node. The complete generation process starts from an empty molecular graph, and more than one decision is made at each step. Building on this graph generation paradigm, some methods further introduce reinforcement learning and define the state space, action space and reward function used by a standard reinforcement learning model. The overall molecule generation logic, however, is unchanged.
The solution space of the target molecule is too large (any modification to the source molecule can yield a candidate target molecule). These methods are all conditioned on a source molecule, after which the optimization model starts from an empty molecular graph and, in a stochastic manner, generates one node at a time and establishes chemical bonds with previously generated nodes until an end-of-generation signal is obtained. Because target molecules are often large (they contain many nodes), existing optimization systems can guarantee neither the similarity between the generated molecule and the source molecule nor the property improvement, and they consume excessive computational resources.
Disclosure of Invention
1. Technical problem to be solved
Existing molecular optimization methods can be classified as "from scratch" molecule generation processes: generating the optimized target molecule proceeds from an empty graph to the complete target molecule. Because existing systems perform poorly when the chemical structure of the target molecule is complex, the present application provides a molecular optimization method, system, terminal device and readable storage medium.
2. Technical scheme
In order to achieve the above object, the present application provides a molecular optimization method, comprising: obtaining a source molecule, and cutting the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule; converting the source molecule into a new SMILES (Simplified Molecular Input Line Entry Specification) string, wherein the new SMILES string represents the non-common molecular substructure of the target molecule; and splicing the source molecule retention region with the new SMILES string to obtain the target molecule.
Another embodiment provided by the present application is: the cutting comprises: analyzing the region to be cut of the source molecule, determining a cutting center according to the region to be cut, determining cutting branches according to the cutting center, and cutting off the cutting branches to obtain the source molecule retention region.
Another embodiment provided by the present application is: analyzing the region to be cut of the source molecule comprises: traversing node one in the source molecule and node two in the target molecule, wherein node one and node two have the same chemical element; traversing the branches of node one to obtain a first branch and traversing the branches of node two to obtain a second branch, wherein the first branch is identical to the second branch, and recording the number of nodes in the first branch or the second branch; and taking the node one for which this node count is largest, wherein its first branch is the retention region and its branches other than the first branch are the region to be cut.
Another embodiment provided by the present application is: determining the cutting center comprises: obtaining vector representation one of node one; aggregating the vector representations one to obtain vector representation two of the source molecule; and combining vector representation one and vector representation two to predict the probability that node one is the cutting center, and normalizing the node counts to obtain a node probability distribution, wherein the node with the largest probability is taken as the cutting center.
Another embodiment provided by the present application is: determining the cutting branches comprises: obtaining vector representation three of branch three, wherein branch three is a branch of the cutting center; and predicting the retention probability of branch three from vector representation one, vector representation three and the vector representation of the first branch, so as to decide whether branch three is retained.
Another embodiment provided by the present application is: converting the source molecule into a SMILES string comprises: converting the source molecule into a standard SMILES representation to obtain an encoded representation of the source molecule; and processing the encoded representation of the source molecule to obtain a new SMILES string.
Another embodiment provided by the present application is: splicing the source molecule retention region with the SMILES string to obtain the target molecule comprises: converting the new SMILES string into a molecular graph; and merging the molecular graph with the source molecule retention region to generate the target molecule.
The present application further provides a molecular optimization system, comprising: a molecule cutting unit for determining a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule; a time-series unit for converting the source molecule into a new SMILES string, the new SMILES string representing the non-common molecular substructure of the target molecule; and a molecule splicing unit for splicing the source molecule retention region with the new SMILES string to obtain the target molecule.
Optionally, the molecule cutting unit comprises an analysis module and a molecule cutter; the time-series unit comprises a first time-series module and a second time-series module; and the molecule splicing unit comprises a molecule splicing module. The analysis module is used for analyzing the region to be cut of the source molecule; the molecule cutter is used for predicting the molecule retention region; the first time-series module is used for obtaining an encoded representation of the whole source molecule; the second time-series module is used for obtaining a new SMILES string; and the molecule splicing module is used for generating the target molecule. The cutting unit further comprises a database and a molecule pair matching module, wherein the database provides molecule pair data, and the molecule pair matching module obtains qualified molecule pairs from the database, each molecule pair comprising a source molecule and a target molecule.
Optionally, the first time-series module is an encoder and the second time-series module is a decoder; the molecule splicing module comprises a molecular graph conversion sub-module and a merging sub-module, wherein the molecular graph conversion sub-module converts the new SMILES string into a molecular graph, and the merging sub-module merges the molecular graph with the retention region to generate the target molecule.
The application also provides a terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the method.
The present application also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the method.
3. Advantageous effects
Compared with the prior art, the molecular optimization method, the molecular optimization system, the terminal device and the readable storage medium have the advantages that:
The molecular optimization system provided by the application comprises a molecule cutter, and the retention region is obtained through the molecule cutter, so that the scale of the remaining molecular structure to be generated can be greatly reduced, and the error rate of the system is greatly reduced.
According to the molecular optimization method, the common molecular substructure of the source molecule and the target molecule is found through a molecule cutting strategy, so that the accuracy and efficiency of the final molecule generation are improved.
The molecular optimization method provided by the application is based on a general observation: in a molecular optimization task there is often a strong graph-structural correlation between the source molecule and the target molecule, i.e., they share a large number of identical sub-molecular structures. Retaining these identical structures and generating only the remaining different parts can therefore greatly improve the accuracy of model optimization.
The molecular optimization method provided by the application addresses a fatal problem of "from scratch" molecule generation, namely its huge computational cost: if the common sub-molecular structure of the source molecule and the target molecule is retained in advance, the molecular part that actually needs to be generated is greatly reduced, and so is the consumption of computational resources.
Drawings
FIG. 1 is a schematic diagram of an exemplary molecular pair of the present application;
FIG. 2 is a schematic diagram of the cropping process of the present application;
FIG. 3 is a schematic diagram of an encoder-decoder framework of the present application;
FIG. 4 is a schematic diagram of the molecular optimization system of the present application;
FIG. 5 is a schematic structural diagram of a terminal device of the present application.
Detailed Description
Hereinafter, specific embodiments of the present application will be described in detail with reference to the accompanying drawings, and it will be apparent to those skilled in the art from this detailed description that the present application can be practiced. Features from different embodiments may be combined to yield new embodiments, or certain features may be substituted for certain embodiments to yield yet further preferred embodiments, without departing from the principles of the present application.
Since the generation process must predict newly added nodes and chemical bonds at every step, the number of steps is proportional to the size of the target molecule. The larger the target molecule, the more severe the consumption of computing resources.
SMILES (Simplified Molecular Input Line Entry Specification) is a specification for unambiguously describing the structure of a molecule using ASCII character strings. It provides a method and tooling for representing a molecular graph as a string: given a molecular graph structure as input, SMILES provides the corresponding character representation. The target molecule generation process of the present application is primarily based on this molecular representation, but it should be noted that SMILES itself cannot directly generate molecules. The present application only uses SMILES as a representation to encode the source molecule and the target molecule.
The molecular optimization method provided by the embodiment of the present application can be applied to terminal devices such as a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a Personal Digital Assistant (PDA), and the embodiment of the present application does not limit the specific types of the terminal devices.
For example, the terminal device may be a Station (ST) in a WLAN, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a computer, a laptop, a handheld communication device, a handheld computing device, a satellite radio, a wireless modem card.
Referring to FIGS. 1 to 5, the present application provides a molecular optimization method, comprising: obtaining a source molecule, and cutting the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule. The source molecule retention region is predicted, and all molecular substructures outside the retention region are cut away. This step reduces the scale of the remaining molecular structure to be generated and reduces the error rate.
In the training phase, molecule pair data is acquired, each molecule pair including a source molecule and a target molecule. Any molecule database containing a variety of different molecules may be selected, such as the ZINC database. The target molecule is known in the training phase; in the testing phase, the target molecule is unknown, and the given source molecule is processed to produce the target molecule.
The source molecule is converted into a new SMILES string, wherein the new SMILES string represents the non-common molecular substructure of the target molecule; that is, the remaining part of the target molecule is generated.
The source molecule retention region and the new SMILES string are spliced to obtain the target molecule.
Further, the cutting comprises: analyzing the region to be cut of the source molecule, determining a cutting center according to the region to be cut, determining cutting branches according to the cutting center, and cutting off the cutting branches to obtain the source molecule retention region.
Further, the analyzing the region to be cut of the source molecule includes:
traversing node one in the source molecule and node two in the target molecule, wherein node one and node two have the same chemical element; node one is, for example, $C_1$ of the source molecule in FIG. 2, and node two is $C_1$ of the target molecule in FIG. 2.
Traversing the branches of node one to obtain a first branch, and traversing the branches of node two to obtain a second branch, wherein the first branch is identical to the second branch, and recording the number of nodes in the first branch or the second branch.
Here, the branches of node one are all branches of node one, and the branches of node two are all branches of node two.
Taking the node one for which the node count is largest, wherein its first branch is the retention region and its other branches are the region to be cut.
Specifically, all atoms $i$ (nodes $i$) in the source molecule are traversed (e.g., $C_1$ of the source molecule in FIG. 1 or 2); for each, all atoms $j$ (nodes $j$) in the target molecule whose chemical element is identical to that of atom $i$ are traversed (e.g., $C_1$ of the target molecule in FIG. 1 or 2); and all branches of atom $i$ and atom $j$ are traversed to find the branches that are identical between the two, with the total number of atoms in the identical branches recorded as $s_{i,j}$.
A branch is the region around a given node that is grown outward from one of the nodes connected to it. For example, $(C_2, H_3, H_4, H_5)$, $(H_6)$, $(H_7)$ and $(C_8, H_9)$ are the 4 branches of node $C_1$ of the source molecule in FIG. 1 or 2.
As shown in FIG. 1 or 2, $C_1$ of the source molecule and $C_1$ of the target molecule have 2 identical branches, namely $(C_2, H_3, H_4, H_5)$ and $(H_6)$, so the total number of atoms in the identical branches is $s_{i,j} = 5$.
The $s_{i,j}$ with the largest value is taken as $s_i$, and the atom $j$ corresponding to atom $i$ is recorded.
The atom $i$ with the largest $s_i$ is taken as the cutting center $c_{te}$, and the atom $j$ corresponding to $c_{te}$ is recorded.
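The branch comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the helper names (`build`, `signature`, `shared_size`, `cutting_center`) are hypothetical, branches are assumed to be acyclic so that a sorted nested tuple can serve as a canonical form, and the target molecule's `O7` and `F8` substituents are invented to complete the toy example.

```python
from collections import Counter

def build(edges):
    """Build (adjacency, element) maps from bonds between labels like 'C1'."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    elem = {name: name.rstrip("0123456789") for name in adj}
    return adj, elem

def signature(adj, elem, node, parent):
    """Canonical nested tuple for the acyclic branch rooted at `node`."""
    kids = sorted(signature(adj, elem, n, node) for n in adj[node] if n != parent)
    return (elem[node], tuple(kids))

def size(sig):
    """Number of atoms in a branch signature."""
    return 1 + sum(size(kid) for kid in sig[1])

def shared_size(src, tgt, i, j):
    """s_{i,j}: atoms in branches common to source atom i and target atom j."""
    (s_adj, s_elem), (t_adj, t_elem) = src, tgt
    if s_elem[i] != t_elem[j]:
        return 0
    s_branches = Counter(signature(s_adj, s_elem, n, i) for n in s_adj[i])
    t_branches = Counter(signature(t_adj, t_elem, n, j) for n in t_adj[j])
    common = s_branches & t_branches  # multiset intersection of branch shapes
    return sum(size(sig) * count for sig, count in common.items())

def cutting_center(src, tgt):
    """s_i = max_j s_{i,j}; the cutting center is the atom i with the largest s_i."""
    s = {i: max(shared_size(src, tgt, i, j) for j in tgt[0]) for i in src[0]}
    return max(s, key=s.get), s

methyl = [("C1", "C2"), ("C2", "H3"), ("C2", "H4"), ("C2", "H5"), ("C1", "H6")]
source = build(methyl + [("C1", "H7"), ("C1", "C8"), ("C8", "H9")])
target = build(methyl + [("C1", "O7"), ("C1", "F8")])  # hypothetical target
center, scores = cutting_center(source, target)
print(center, scores[center])
```

For molecules with rings, a proper subgraph-isomorphism check would be needed in place of the tree signature.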
In addition, all $s_i$ are converted into a normalized probability distribution for model learning:
$$s^{te}_i = \frac{\exp(s_i)}{\sum_{k \in V_X} \exp(s_k)}$$
where $V_X$ represents the set of all atoms in the source molecule and $\exp(\cdot)$ is the exponential function.
$s^{te}_i$ is the normalized distribution. The branches of $c_{te}$ in the source molecule that are identical to branches of the corresponding atom $j$ (there may be several) form the retention region; the other branches are the region to be cut. Each branch is labeled 1 (retained) or 0 (deleted), and the labels as a whole are denoted by a variable $U$, e.g. $U = \{(C_2, H_3, H_4, H_5): 1, (H_6): 1, (H_7): 0, (C_8, H_9): 0\}$; this is the distribution that model training needs to fit.
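As a small numeric illustration of the normalization step, the sketch below turns overlap counts into a probability distribution with an exponential normalization and writes out the branch labels from the example in the text. The score values for atoms other than `C1` are hypothetical, and the function name `normalize` is an assumption made for this example.

```python
import math

def normalize(scores):
    """Exponentially normalize raw counts s_i so they sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {atom: math.exp(value - top) for atom, value in scores.items()}
    total = sum(exps.values())
    return {atom: value / total for atom, value in exps.items()}

s = {"C1": 5, "C2": 3, "C8": 1}  # hypothetical s_i values
s_te = normalize(s)
c_te = max(s_te, key=s_te.get)

# Branch labels U around the cutting center: 1 = retained, 0 = deleted.
U = {("C2", "H3", "H4", "H5"): 1, ("H6",): 1, ("H7",): 0, ("C8", "H9"): 0}
print(c_te, round(s_te[c_te], 3), sum(U.values()))
```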
In addition, the branches of atom $j$ that are not present around $c_{te}$ are marked as "to be generated", such as branch $O_7$ around $C_1$ of the target molecule in FIG. 2, and the branch subgraph to be generated is converted into a SMILES string representation.
In summary, the molecular optimization method provided by the present application can complete the optimization of the source molecule by generating the smallest molecular substructure.
The nodes are represented by atoms.
In the training phase, since the target molecules are known, molecule pairs can be matched: the goal of the molecular optimization task is determined, and the molecule pairs that satisfy it are obtained.
The goal or constraint of the molecular optimization task is determined first. In the present application, it may be that the generated target molecule is required to have higher water solubility while remaining similar to the source molecule, although other goals are not excluded. Existing open-source tools, such as RDKit, provide functions for calculating the relevant properties of a molecule and the similarity between molecules. The molecule database is traversed to obtain the molecule pairs that satisfy the goal or constraint of the molecular optimization task. FIG. 1 shows one extracted molecule pair.
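The pair-matching traversal can be illustrated as follows. This is a sketch under assumptions, not the patent's procedure: the fingerprints are plain Python sets of made-up bit indices, the solubility values are invented, and the thresholds `min_similarity` and `min_gain` are hypothetical. In practice a cheminformatics toolkit would supply the fingerprints and property values.

```python
from itertools import permutations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def qualified_pairs(molecules, min_similarity=0.4, min_gain=1.0):
    """Return (source, target) names where the target is similar but more soluble."""
    pairs = []
    for src, tgt in permutations(molecules, 2):
        sim = tanimoto(molecules[src]["bits"], molecules[tgt]["bits"])
        gain = molecules[tgt]["solubility"] - molecules[src]["solubility"]
        if sim >= min_similarity and gain >= min_gain:
            pairs.append((src, tgt))
    return pairs

# Hypothetical database entries: fingerprint bits and a solubility score.
database = {
    "A": {"bits": {1, 2, 3, 4}, "solubility": -3.0},
    "B": {"bits": {1, 2, 3, 5}, "solubility": -1.5},
    "C": {"bits": {7, 8, 9}, "solubility": 0.5},
}
print(qualified_pairs(database))
```

Only the ordered pair with both sufficient similarity and a sufficient property gain survives the filter, which mirrors the "similar structure, improved property" task definition.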
Further, determining the cutting center comprises: obtaining vector representation one of node one; aggregating the vector representations one to obtain vector representation two of the source molecule; and combining vector representation one and vector representation two to predict the probability that node one is the cutting center, and normalizing the node counts to obtain a node probability distribution, wherein the node with the largest probability is taken as the cutting center.
Specifically, graph message passing networks (MPNs) are used to perform representation learning on the source molecule; a preset message-passing formula yields a vector representation of each node (atom), i.e., vector representation one.
Here $x_i$ is the feature representation of node (atom) $i$, $x_{i,j}$ is the feature representation of the edge (chemical bond) between nodes $i$ and $j$, $m^t_{i,j}$ denotes the message passed from node $i$ to node $j$ at time $t$, $N(i)$ denotes all neighbor nodes of $i$, $N(i) \setminus j$ denotes all neighbor nodes of $i$ except $j$, and $f_1$ and $f_2$ are both neural networks. After the message-passing iterations, the final representation $h_i$ of node $i$ is obtained. The feature representations of atoms and chemical bonds can be encoded with simple one-hot vectors.
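One update rule that is consistent with the symbols defined above is the common edge-wise message-passing scheme sketched below. It is an assumed form given for orientation only, not necessarily the exact formula of the application; in particular the number of rounds $T$ and the use of a sum over incoming messages are assumptions.

```latex
% Assumed form: messages exclude the receiving neighbor, nodes read out after T rounds.
m^{t+1}_{i,j} = f_1\Big(x_i,\; x_{i,j},\; \sum_{k \in N(i) \setminus j} m^{t}_{k,i}\Big),
\qquad
h_i = f_2\Big(x_i,\; \sum_{k \in N(i)} m^{T}_{k,i}\Big)
```

Under this reading, each message from $i$ to $j$ summarizes what $i$ has heard from its other neighbors, and $h_i$ combines the atom's own features with everything it received.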
The vector representations one are then aggregated to obtain a vector representation $h_X$ of the whole source molecule, i.e., vector representation two.
Vector representation one and vector representation two are combined, and the probability that node $i$ is the cutting center is predicted by the following formula:
$s_i = f_3([h_X, h_i])$
where $[\cdot,\cdot]$ denotes vector concatenation and $f_3$ is a standard neural network.
In the same way as for $s^{te}_i$, $s_i$ is normalized by:
$$s^{st}_i = \frac{\exp(s_i)}{\sum_{k \in V_X} \exp(s_k)}$$
where the node with the largest $s^{st}_i$ is the predicted cutting center $c_{st}$.
In the model training phase, $s^{st}_i$ is fitted to the label values $s^{te}_i$ through a loss function (e.g., the KL divergence). In the testing phase, it suffices to output the node with the largest $s^{st}_i$ as the cutting center $c_{st}$.
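The scoring, normalization and fitting steps can be traced with a toy calculation. Everything numeric here is hypothetical: the node vectors, the sum-pooling used to form the molecule vector, and the single linear layer standing in for the neural network $f_3$ are assumptions chosen only to keep the sketch self-contained.

```python
import math

def softmax(values):
    """Exponential normalization of a list of scores."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def score(h_x, h_i, weights):
    """Stand-in for f_3: one linear layer applied to the concatenation [h_X, h_i]."""
    return sum(w * v for w, v in zip(weights, h_x + h_i))

h = [[1.0, 0.0], [0.2, 0.5], [0.0, 0.1]]  # hypothetical node vectors h_i
h_x = [sum(col) for col in zip(*h)]       # sum-pooling, an assumed aggregator
weights = [0.1, 0.1, 2.0, -1.0]           # hypothetical parameters of f_3

s_st = softmax([score(h_x, h_i, weights) for h_i in h])  # predicted distribution
s_te = softmax([5, 3, 1])                                # label distribution
loss = kl_divergence(s_te, s_st)                         # training objective
c_st = max(range(len(h)), key=lambda i: s_st[i])         # test-time output
print(c_st, round(loss, 4))
```

During training the loss would be minimized with respect to the network parameters; at test time only the final arg-max is needed.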
Further, determining the cutting branches comprises: obtaining vector representation three of branch three, wherein branch three is a branch of the cutting center; and predicting the retention probability of branch three from vector representation one, vector representation three and the vector representation of the first branch, so as to decide whether branch three is retained.
Branch three here is any branch of the cutting center.
Specifically, each branch around the clipping center is predicted from the clipping center, and the remaining branch and the deleted branch are decided. Representing an h by the vectoriA vector representation of any branch is obtained by the following equation:
wherein c isstAs a result of the said cutting centre,representing the center c around said cut in the source moleculestA branch subgraph of (1). |. | is the number of atoms in the subgraph.
Then, the retention probability of this branch j is predicted by the neural network:
wherein f is4Is a standard neural network, sigma is sigmoid function,andrespectively, a vector representation of the reaction center, a vector representation of the branch whether the decision is currently to be made to reserve, and a vector representation of the branch that has been determined to reserve. The output is greater than or equal to 0.5, which means retention, and less than 0.5, which means deletion. Aboutt-1 represents the sequence number of the last iteration, each iteration model needs to make a decision on the retention/deletion of a branch, and if the decision is retained, the branch is added into the set Ust t-1Each element in the set is a subgraph.
In this way, the complete reserved area after cutting is obtained. As shown in fig. 2, the shaded portion is retained and the boxed area is deleted.
In the testing stage, the output is obtained directly by the above process. In the training phase, the output needs to be fitted to $U$, and the cross entropy between the two can be used as the loss function.
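The iterative retain/delete decision can be sketched as below. This is an illustration under stated assumptions: `toy_f4` is a hypothetical fixed linear scorer standing in for the trained network $f_4$, and summarizing the already-retained branches by the mean of their vectors is an assumption, since the text does not say how the retained set is turned into a vector.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_vector(vectors, dim):
    """Summarize the retained branches as one vector (assumed: element-wise mean)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def toy_f4(center_vec, branch_vec, retained_vec):
    """Hypothetical stand-in for f4: a fixed linear score over the three inputs."""
    features = center_vec + branch_vec + retained_vec
    weights = [0.5, -0.5, 1.0, -1.0, 0.25, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def decide_branches(center_vec, branch_vecs, score_fn=toy_f4):
    """One decision per iteration: retain if sigmoid(score) >= 0.5, else delete."""
    retained, decisions, probs = [], [], []
    for branch_vec in branch_vecs:
        summary = mean_vector(retained, len(center_vec))
        p = sigmoid(score_fn(center_vec, branch_vec, summary))
        keep = p >= 0.5
        probs.append(p)
        decisions.append(keep)
        if keep:
            retained.append(branch_vec)  # the branch joins the retained set
    return decisions, probs, retained


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Training loss between retain/delete labels and predicted probabilities."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(labels, probs)
    ) / len(labels)


center_vec = [1.0, 1.0]
branches = [[2.0, 0.0], [0.0, 3.0]]
decisions, probs, retained = decide_branches(center_vec, branches)
loss = binary_cross_entropy([1, 0], probs)
```

Because each decision conditions on the branches retained so far, the order in which branches are visited can affect the result; the source does not specify an ordering.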
Further, converting the source molecule into a simplified molecule linear input canonical string comprises: converting the source molecule into a standard simplified molecule linear input canonical representation to obtain a coded representation of the source molecule; processing the encoded representation of the source molecule to obtain a new simplified molecular linear input specification.
Specifically, the source molecule is converted into a standard simplified molecular-input line-entry system (SMILES) representation, and a first long short-term memory (LSTM) network encodes the SMILES characters in their order of appearance; a second LSTM network then decodes a new SMILES string from the encoding of the source molecule.
Further, the obtaining the target molecule by splicing the source molecule retention region and the simplified molecule linear input canonical character string includes: converting the new simplified molecular linear input specification into a molecular graph; and merging the molecular graph and the source molecule reserved area to generate the target molecule.
Specifically, the SMILES string is converted into a molecular graph, and the first atom in the SMILES representation (i.e., the cutting center) is then merged with the source molecule retention region to obtain the final, complete target molecule.
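The splicing step can be illustrated with a small graph merge. The sketch assumes the generated SMILES has already been converted into a graph by a cheminformatics toolkit, represents a molecule as an atom dictionary plus a set of bonds, and ignores bond orders; the function `splice` and this data layout are hypothetical.

```python
def splice(retained_atoms, retained_bonds, center, frag_atoms, frag_bonds):
    """Merge a generated fragment into the retained region at the cutting center.

    Atoms are dicts {index: element}; bonds are sets of (i, j) index pairs.
    Fragment atom 0 is identified with the cutting center `center`.
    """
    if frag_atoms[0] != retained_atoms[center]:
        raise ValueError("fragment must start with the cutting-center atom")
    atoms = dict(retained_atoms)
    next_id = max(atoms) + 1
    mapping = {0: center}
    for idx in sorted(frag_atoms):
        if idx == 0:
            continue
        mapping[idx] = next_id          # renumber new atoms after the retained ones
        atoms[next_id] = frag_atoms[idx]
        next_id += 1
    bonds = set(retained_bonds)
    for a, b in frag_bonds:
        bonds.add(tuple(sorted((mapping[a], mapping[b]))))
    return atoms, bonds


# Retained region C-C-N with the nitrogen (index 2) as the cutting center.
retained_atoms = {0: "C", 1: "C", 2: "N"}
retained_bonds = {(0, 1), (1, 2)}
# Generated fragment N-C-O, whose first atom is the cutting center.
frag_atoms = {0: "N", 1: "C", 2: "O"}
frag_bonds = {(0, 1), (1, 2)}

atoms, bonds = splice(retained_atoms, retained_bonds, 2, frag_atoms, frag_bonds)
```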
As shown in fig. 4, the present application further provides a molecular optimization system, the system comprising: a molecule cutting unit 1, configured to determine a source molecule retention region, where the source molecule retention region is a common molecular substructure of the source molecule and the target molecule; the molecule cutting unit 1 comprises a molecule cutting unit, an analysis module and a molecule cutter; the analysis module is used for analyzing the region of the source molecule to be cut; the molecule cutter is used for predicting the source molecule retention region.
The molecule cutting unit 1 further comprises a database and a molecule pair matching module, wherein the database is used for providing molecule pair data; the molecule pair matching module is used for acquiring qualified molecule pairs from the database, wherein each molecule pair comprises a source molecule and a target molecule; in the testing stage, the target molecules are unknown, and the molecule pair matching module is not required to be called, and in the training stage, the molecule pair is required to be selected through the molecule pair matching module.
A time-series unit 2, configured to convert the source molecule into a new simplified molecule linear input canonical string, where the new simplified molecule linear input canonical string is a non-common molecule substructure part in the target molecule; the time-series unit 2 comprises a first time-series module and a second time-series module, wherein the first time-series module is used for acquiring a coded representation of the whole source molecule; the second time series module is used for acquiring a new simplified molecular linear input specification.
And the molecule splicing unit 3 is used for splicing the source molecule reserved area and the new simplified molecule linear input canonical character string to obtain the target molecule. The molecule splicing unit 3 comprises a molecule splicing module; the molecule splicing module is used for generating target molecules.
The molecule splicing module comprises a molecule graph conversion sub-module and a merging sub-module, the molecule graph conversion sub-module converts the new simplified molecule linear input specification into a molecule graph, and the merging sub-module merges the molecule graph and the reserved area to generate target molecules.
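The data flow between the three units can be sketched as a simple composition. The class name `MolecularOptimizationSystem`, its field names, and the string-based stand-ins are hypothetical; real units would operate on molecular graphs and trained models rather than on raw strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MolecularOptimizationSystem:
    """Wires the three units together; each unit is a plain callable here."""
    cutting_unit: Callable[[str], str]        # source -> retained region
    time_series_unit: Callable[[str], str]    # source -> new SMILES fragment
    splicing_unit: Callable[[str, str], str]  # (retained, fragment) -> target

    def optimize(self, source: str) -> str:
        retained = self.cutting_unit(source)
        fragment = self.time_series_unit(source)
        return self.splicing_unit(retained, fragment)


# Toy stand-ins that only manipulate strings, to make the data flow visible.
system = MolecularOptimizationSystem(
    cutting_unit=lambda s: s[:2],
    time_series_unit=lambda s: "NC(=O)O",
    splicing_unit=lambda kept, frag: kept + frag,
)
target = system.optimize("CCNCC")
```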
The first time-series module is an encoder, and the second time-series module is a decoder.
This section generally employs an encoder-decoder framework, as shown in fig. 3.
In particular, the encoder converts the input molecule into a SMILES representation and then encodes the SMILES characters in their order of occurrence using a standard long short-term memory (LSTM) network. The hidden state of the LSTM at the last time step is taken as the final output C of the encoder.
The decoder decodes a SMILES string, which is the remaining part to be generated, using an LSTM separate from that of the encoder.
In this LSTM decoder, the initial hidden state is the C provided by the encoder, in which the information of the source molecule has been encoded. The output of the LSTM at each time step is a specific character sampled from a character set consisting of all letters and symbols that can occur in SMILES, such as Br, Cl, N, O, S, p.
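Because the character set contains two-letter symbols such as Br and Cl, a SMILES string has to be split into tokens before it is fed to the LSTM. The sketch below shows one common way to do this with a regular expression; the pattern and the name `tokenize` are illustrative assumptions, not taken from the source.

```python
import re

# Match two-letter halogens and bracket atoms first, then any single character.
TOKEN_PATTERN = re.compile(r"Br|Cl|\[[^\]]+\]|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens in their order of occurrence."""
    return TOKEN_PATTERN.findall(smiles)


smiles = "CC(Br)c1ccccc1Cl"
tokens = tokenize(smiles)
```

Joining the tokens back together reproduces the original string, so the tokenization is lossless.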
In the decoder, the character output at the current time step is used as the input at the next time step; e.g., the character "C" predicted by the model at time t-1 is used as the input to the LSTM at time t. In particular, the input at the first time step is fixed to the cutting center previously selected in the source molecule, and generation starts from there; when the output character is "</s>", generation is complete. The input and output of the decoder can use simple one-hot encoding, i.e., an encoding vector constructed over the character set.
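The autoregressive loop described above can be sketched as follows. The sketch uses greedy selection rather than sampling, a seven-symbol vocabulary, and `toy_step`, a hypothetical stand-in that replays a fixed script instead of running a trained LSTM; all of these are assumptions made to keep the example self-contained.

```python
VOCAB = ["C", "N", "O", "(", ")", "=", "</s>"]


def one_hot(token):
    """One-hot encoding vector over the character set."""
    vec = [0] * len(VOCAB)
    vec[VOCAB.index(token)] = 1
    return vec


def toy_step(input_vec, state):
    """Hypothetical stand-in for one LSTM decoder step: replays a fixed script."""
    script = ["C", "(", "=", "O", ")", "O", "</s>"]
    token = script[min(state, len(script) - 1)]
    return one_hot(token), state + 1


def decode(center_token, initial_state, step_fn=toy_step, max_len=50):
    """Start from the cutting center and feed each output back as the next input."""
    tokens = [center_token]
    current = one_hot(center_token)
    state = initial_state
    for _ in range(max_len):
        output_vec, state = step_fn(current, state)
        token = VOCAB[output_vec.index(max(output_vec))]
        if token == "</s>":      # end-of-sequence symbol: generation is complete
            break
        tokens.append(token)
        current = one_hot(token)
    return "".join(tokens)


fragment = decode("N", 0)
```

In a real decoder, `initial_state` would be the encoder output C, and `step_fn` would return a probability distribution from which the next character is sampled.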
During the training phase, the SMILES string generated by this part needs to be fitted to the SMILES string of the molecular subgraph to be generated; cross entropy can likewise be used as the loss function.
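A per-character cross-entropy loss for the decoder can be sketched as below. Representing each time step's prediction as a dictionary of probabilities, and averaging over the sequence length, are illustrative choices; the name `sequence_cross_entropy` is hypothetical.

```python
import math


def sequence_cross_entropy(step_probs, target_tokens, eps=1e-12):
    """Mean negative log-likelihood of the target tokens.

    step_probs: one dict per time step mapping token -> predicted probability.
    target_tokens: the tokens of the SMILES string to be generated.
    """
    total = 0.0
    for probs, token in zip(step_probs, target_tokens):
        total -= math.log(probs.get(token, 0.0) + eps)
    return total / len(target_tokens)


# Hypothetical predictions for a two-character target "CO".
step_probs = [{"C": 0.5, "O": 0.5}, {"O": 1.0}]
loss = sequence_cross_entropy(step_probs, ["C", "O"])
```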
The present application further provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps in any of the method embodiments described above are implemented.
The terminal device of this embodiment includes: at least one processor (only one shown in fig. 5), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor when executing the computer program implementing the steps in any of the molecular optimization method embodiments described above.
The terminal device can be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the terminal device shown is merely an example and does not constitute a limitation; it may include more or fewer components than those shown, combine some components, or use different components, and may further include, for example, input and output devices, network access devices, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. In other embodiments, the memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like provided on the terminal device.
Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product which, when run on a terminal device, enables the terminal device to implement the steps in the above method embodiments. The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which implements the steps of the method embodiments described above when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to the terminal device, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Claims (10)
1. A method of molecular optimization, characterized by: the method comprises the following steps:
obtaining source molecules, and cutting the source molecules according to target molecules to obtain a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecules and the target molecules;
converting the source molecule into a new simplified molecule linear input canonical string, wherein the new simplified molecule linear input canonical string is a non-common molecule substructure part in the target molecule;
and splicing the source molecule reserved area and the new simplified molecule linear input canonical character string to obtain the target molecule.
2. The molecular optimization method of claim 1, wherein: the cutting comprises the following steps:
analyzing the region to be cut of the source molecule, determining a cutting center according to the region to be cut, determining a cutting branch according to the cutting center, and cutting the cutting branch to obtain the source molecule reserved region.
3. The molecular optimization method of claim 2, wherein: the analyzing the region to be cut of the source molecule comprises:
traversing a first node in the source molecule and a second node in the target molecule, wherein the first node and the second node have the same chemical elements;
traversing a first branch of a first node to obtain a first branch, traversing a second branch of a second node to obtain a second branch, wherein the first branch is the same as the second branch, and recording the number of nodes in the first branch or the second branch;
taking the node I with the maximum numerical value in the node number, wherein the first branch is a reserved area; and the branches except the first branch are areas to be cut.
4. A molecular optimization method according to claim 3, characterized in that: the determining the clipping center includes:
acquiring a vector representation I of the node I; aggregating the first vector representation to obtain a second vector representation of the source molecule; and combining the first vector representation and the second vector representation to predict the probability that the first node is taken as the cutting center, and carrying out normalization processing on the number of the nodes to obtain node probability distribution, wherein the node with the maximum node probability distribution value is taken as the cutting center.
5. The molecular optimization method of claim 4, wherein: the determining the clipping branch comprises:
and acquiring a vector representation III of a branch III, wherein the branch III is a branch of the cutting center, and making a decision on whether the branch III is reserved or not by predicting the reservation probability of the branch III through the vector representation I, the vector representation III and the vector representation of the first branch.
6. The molecular optimization method of claim 1, wherein: converting the source molecule into a simplified molecule linear input canonical string includes:
converting the source molecule into a standard simplified molecule linear input canonical representation to obtain a coded representation of the source molecule;
processing the encoded representation of the source molecule to obtain a new simplified molecular linear input specification.
7. The molecular optimization method of claim 1, wherein: splicing the source molecule reserved area and the simplified molecule linear input canonical character string to obtain the target molecule comprises the following steps:
converting the new simplified molecular linear input specification into a molecular graph;
and merging the molecular graph and the source molecule reserved area to generate the target molecule.
8. A molecular optimization system, characterized by: the system comprises:
a molecule cutting unit for determining a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule;
a time-series unit for converting the source molecule into a new simplified molecule linear input canonical string, the new simplified molecule linear input canonical string being a non-common molecular substructure part in the target molecule;
and the molecule splicing unit is used for splicing the source molecule reserved area and the new simplified molecule linear input canonical character string to obtain the target molecule.
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, implements the method of any of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that: the computer program when executed by a processor implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011502775.0A CN112509644B (en) | 2020-12-18 | 2020-12-18 | Molecular optimization method, system, terminal equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112509644A (en) | 2021-03-16 |
CN112509644B (en) | 2024-09-20 |
Family
ID=74922321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011502775.0A Active CN112509644B (en) | 2020-12-18 | 2020-12-18 | Molecular optimization method, system, terminal equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112509644B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180011899A1 (en) * | 2016-07-05 | 2018-01-11 | Zymergen, Inc. | Complex chemical substructure search query building and execution |
CN110277144A (en) * | 2018-03-15 | 2019-09-24 | 国际商业机器公司 | Have the new chemical compound of desirable properties to construct the new chemical structure for synthesis using the chemical data creation of accumulation |
CN111312340A (en) * | 2018-12-12 | 2020-06-19 | 深圳市云网拜特科技有限公司 | SMILES-based quantitative structure effect method and device |
WO2020243440A1 (en) * | 2019-05-31 | 2020-12-03 | D. E. Shaw Research, Llc. | Molecular graph generation from structural features using an artificial neural network |
US20220230713A1 (en) * | 2019-05-31 | 2022-07-21 | D. E. Shaw Research, Llc | Molecular Graph Generation from Structural Features Using an Artificial Neural Network |
CN110634539A (en) * | 2019-09-12 | 2019-12-31 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based drug molecule processing method and device and storage medium |
CN111524557A (en) * | 2020-04-24 | 2020-08-11 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111695702A (en) * | 2020-06-16 | 2020-09-22 | 腾讯科技(深圳)有限公司 | Training method, device, equipment and storage medium of molecular generation model |
CN111816265A (en) * | 2020-06-30 | 2020-10-23 | 北京晶派科技有限公司 | Molecule generation method and computing device |
CN111755078A (en) * | 2020-07-30 | 2020-10-09 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
Non-Patent Citations (2)
Title |
---|
ANDREW DALKE, ET AL.: "mmpdb: An Open-Source Matched Molecular Pair Platform for Large Multiproperty Data Sets", JOURNAL OF CHEMICAL INFORMATION AND MODELING, vol. 58, no. 5, 17 May 2018 (2018-05-17), pages 902 - 910 * |
JIN W, ET AL.: "Learning multimodal graph-to-graph translation for molecule optimization", 7TH INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, ICLR 2019, NEW ORLEANS, LA, USA, 9 May 2019 (2019-05-09) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113140261A (en) * | 2021-04-25 | 2021-07-20 | 清华大学 | Chemical molecule synthesis simulation method and device |
CN113140261B (en) * | 2021-04-25 | 2022-05-06 | 清华大学 | Chemical molecule synthesis simulation method and device |
CN114171134A (en) * | 2021-11-26 | 2022-03-11 | 北京晶泰科技有限公司 | Molecule generation method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112509644B (en) | 2024-09-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||