CN112509644A - Molecular optimization method, system, terminal equipment and readable storage medium - Google Patents

Molecular optimization method, system, terminal equipment and readable storage medium

Info

Publication number
CN112509644A
Authority
CN
China
Prior art keywords
molecule
source
branch
molecular
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011502775.0A
Other languages
Chinese (zh)
Other versions
CN112509644B (en)
Inventor
纪超杰
吴红艳
蔡云鹏
郑奕嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011502775.0A
Publication of CN112509644A
Application granted
Publication of CN112509644B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/08 Probabilistic or stochastic CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application belongs to the technical field of data synthesis and relates in particular to a molecular optimization method, system, terminal device and readable storage medium. Existing systems perform poorly when the chemical structure of the target molecule is complex. The application provides a molecular optimization method comprising: obtaining a source molecule and clipping it according to a target molecule to obtain a source molecule retention region, the retention region being the molecular substructure common to the source molecule and the target molecule; converting the source molecule into a new Simplified Molecular Input Line Entry System (SMILES) string that represents the part of the target molecule outside the common substructure; and splicing the source molecule retention region with the new SMILES string to obtain the target molecule. Because only the non-shared part has to be generated, the system error rate is greatly reduced.

Description

Molecular optimization method, system, terminal equipment and readable storage medium
Technical Field
The present application belongs to the technical field of data synthesis and relates in particular to a molecular optimization method, system, terminal device and readable storage medium.
Background
Definition of the molecular optimization task: a source molecule is given as input to a molecule optimizer (generator), which converts it into another molecule (the target molecule). The target molecule has a chemical structure similar to that of the source molecule, but some of its chemical properties (e.g., water solubility) are greatly improved.
Existing methods convert the graph generation task into a sequential decision process. Each element of the sequence is a specific decision, and there are three main kinds of decision: 1) whether to add a new node at the current step (generation is considered finished once no new node is added); 2) whether to add a new edge at the current step; 3) which existing node to connect to the current new node. The complete generation process starts from an empty molecular graph, and more than one decision may be made at each step. On top of this graph generation paradigm, some methods add reinforcement learning and define the state space, action space and reward function used by a standard reinforcement learning model, but the overall molecule generation logic is unchanged.
The solution space of the target molecule is too large, since any modification of the source molecule yields a candidate target molecule. These methods all take a source molecule as input; the optimization model then starts from an empty molecular graph and, in a stochastic manner, generates one node at a time and bonds it to previously generated nodes until an end-of-generation signal is obtained. Because target molecules are often large (they contain many nodes), existing optimization systems can guarantee neither the similarity between the generated molecule and the source molecule nor the improvement of the property, and they consume excessive computing resources.
Disclosure of Invention
1. Technical problem to be solved
Existing molecular optimization methods can be classified as "from scratch" molecule generation: the optimized target molecule is generated starting from a blank graph and ending with the complete target molecule. Such systems perform poorly when the chemical structure of the target molecule is complex. To address this, the present application provides a molecular optimization method, a system, a terminal device and a readable storage medium.
2. Technical scheme
In order to achieve the above object, the present application provides a molecular optimization method, comprising: obtaining a source molecule and clipping the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is the molecular substructure common to the source molecule and the target molecule; converting the source molecule into a new Simplified Molecular Input Line Entry System (SMILES) string, wherein the new SMILES string represents the part of the target molecule outside the common molecular substructure; and splicing the source molecule retention region with the new SMILES string to obtain the target molecule.
In another embodiment provided by the present application, the clipping comprises: analyzing the region of the source molecule to be clipped, determining a clipping center according to the region to be clipped, determining the branches to clip according to the clipping center, and clipping those branches to obtain the source molecule retention region.
In another embodiment provided by the present application, analyzing the region of the source molecule to be clipped comprises: traversing each first node in the source molecule and each second node in the target molecule, wherein the first node and the second node are of the same chemical element; traversing the branches of the first node and the branches of the second node to obtain first branches and second branches, wherein a first branch is identical to a second branch, and recording the number of nodes in the first branches (equivalently, the second branches); and taking the first node with the largest node count, wherein its first branches form the retention region and its branches other than the first branches form the region to be clipped.
In another embodiment provided by the present application, determining the clipping center comprises: obtaining a first vector representation of each first node; aggregating the first vector representations to obtain a second vector representation of the source molecule; combining the first vector representation and the second vector representation to predict the probability that the first node is the clipping center; and normalizing the predictions over the nodes to obtain a node probability distribution, wherein the node with the largest probability is taken as the clipping center.
In another embodiment provided by the present application, determining the branches to clip comprises: obtaining a third vector representation of a third branch, wherein the third branch is a branch of the clipping center; predicting the retention probability of the third branch from the first vector representation, the third vector representation and the vector representation of the first branches; and deciding on that basis whether the third branch is retained.
In another embodiment provided by the present application, converting the source molecule into a SMILES string comprises: converting the source molecule into a standard SMILES representation and encoding it to obtain an encoded representation of the source molecule; and processing the encoded representation of the source molecule to obtain the new SMILES string.
In another embodiment provided by the present application, splicing the source molecule retention region with the new SMILES string to obtain the target molecule comprises: converting the new SMILES string into a molecular graph; and merging the molecular graph with the source molecule retention region to generate the target molecule.
The present application further provides a molecular optimization system, comprising: a molecule clipping unit for determining a source molecule retention region, wherein the source molecule retention region is the molecular substructure common to the source molecule and the target molecule; a time-series unit for converting the source molecule into a new SMILES string, wherein the new SMILES string represents the part of the target molecule outside the common molecular substructure; and a molecule splicing unit for splicing the source molecule retention region with the new SMILES string to obtain the target molecule.
Optionally, the molecule clipping unit comprises an analysis module and a molecule clipper; the time-series unit comprises a first time-series module and a second time-series module; and the molecule splicing unit comprises a molecule splicing module. The analysis module is used for analyzing the region of the source molecule to be clipped; the molecule clipper is used for predicting the molecule retention region; the first time-series module is used for obtaining an encoded representation of the whole source molecule; the second time-series module is used for obtaining the new SMILES string; and the molecule splicing module is used for generating the target molecule. The clipping unit further comprises a database and a molecule pair matching module, wherein the database provides molecule pair data and the molecule pair matching module obtains qualifying molecule pairs from the database, each molecule pair comprising a source molecule and a target molecule.
Optionally, the first time-series module is an encoder and the second time-series module is a decoder; the molecule splicing module comprises a molecular graph conversion sub-module and a merging sub-module, wherein the molecular graph conversion sub-module converts the new SMILES string into a molecular graph and the merging sub-module merges the molecular graph with the retention region to generate the target molecule.
The application also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method.
3. Advantageous effects
Compared with the prior art, the molecular optimization method, system, terminal device and readable storage medium provided by the present application have the following advantages:
The molecular optimization system provided by the application includes a molecule clipper. Obtaining the retention region through the molecule clipper greatly reduces the scale of the remaining molecular structure to be generated and therefore greatly reduces the error rate of the system.
The molecular optimization method provided by the application finds the molecular substructure common to the source molecule and the target molecule through a molecule clipping strategy, which improves the accuracy and efficiency of the final molecule generation.
The molecular optimization method provided by the application is based on a general phenomenon: in a molecular optimization task there is often a strong graph-structural correlation between the source molecule and the target molecule, i.e. they share a large number of identical substructures. Retaining these identical structures and generating only the remaining, differing parts can greatly improve the accuracy of model optimization.
The molecular optimization method provided by the application addresses a fatal problem of "from scratch" molecule generation, namely its huge amount of computation: if the substructure common to the source molecule and the target molecule is retained in advance, the remaining part of the molecule that actually has to be generated is greatly reduced, and so is the consumption of computing resources.
Drawings
FIG. 1 is a schematic diagram of an exemplary molecular pair of the present application;
FIG. 2 is a schematic diagram of the cropping process of the present application;
FIG. 3 is a schematic diagram of an encoder-decoder framework of the present application;
FIG. 4 is a schematic diagram of the molecular optimization system of the present application;
FIG. 5 is a schematic structural diagram of a terminal device of the present application.
Detailed Description
Hereinafter, specific embodiments of the present application will be described in detail with reference to the accompanying drawings, and it will be apparent to those skilled in the art from this detailed description that the present application can be practiced. Features from different embodiments may be combined to yield new embodiments, or certain features may be substituted for certain embodiments to yield yet further preferred embodiments, without departing from the principles of the present application.
Since the whole generation process must predict the newly added node and chemical bonds at each step, the number of steps is proportional to the size of the target molecule. The larger the target molecule, the more severe the consumption of computing resources.
SMILES (Simplified Molecular Input Line Entry System) is a specification for unambiguously describing the structure of a molecule using ASCII strings. It provides a method and tooling for representing a molecular graph as a string: given a molecular graph structure as input, SMILES gives the corresponding string representation. The target molecule generation process of the present application is mainly based on this representation, but it should be noted that SMILES itself cannot generate molecules; the present application only uses SMILES as a representation to encode the source molecule and the target molecule.
The molecular optimization method provided by the embodiments of the present application can be applied to terminal devices such as a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a personal digital assistant (PDA); the embodiments of the present application do not limit the specific type of terminal device.
For example, the terminal device may be a Station (ST) in a WLAN, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a computer, a laptop, a handheld communication device, a handheld computing device, a satellite radio, a wireless modem card.
Referring to FIGS. 1 to 5, the present application provides a molecular optimization method, comprising: obtaining a source molecule and clipping it according to a target molecule to obtain a source molecule retention region, wherein the retention region is the molecular substructure common to the source molecule and the target molecule. That is, the source molecule retention region is predicted, and all molecular substructures outside it are clipped away. This step reduces the scale of the remaining molecular structure to be generated and thereby the error rate.
In the training phase, molecule pair data is acquired, each molecule pair comprising a source molecule and a target molecule. Any molecule database containing a variety of different molecules may be used, such as the ZINC database. In the training phase the target molecule is known; in the test phase the target molecule is unknown, and the given source molecule is processed to produce the target molecule.
The source molecule is then converted into a new SMILES string, wherein the new SMILES string represents the part of the target molecule outside the common molecular substructure; that is, the remaining part of the target molecule is generated.
Finally, the source molecule retention region is spliced with the new SMILES string to obtain the target molecule.
Further, the clipping comprises: analyzing the region of the source molecule to be clipped, determining a clipping center according to the region to be clipped, determining the branches to clip according to the clipping center, and clipping those branches to obtain the source molecule retention region.
Further, analyzing the region of the source molecule to be clipped comprises:
traversing each first node in the source molecule and each second node in the target molecule, wherein the first node and the second node are of the same chemical element; for example, the first node is C1 of the source molecule in FIG. 2, and the second node is C1 of the target molecule in FIG. 2;
traversing the branches of the first node and the branches of the second node to obtain first branches and second branches, wherein a first branch is identical to a second branch, and recording the number of nodes in the first branches (equivalently, the second branches).
Here, all branches of the first node and all branches of the second node are traversed.
The first node with the largest node count is taken; its first branches form the retention region, and its branches other than the first branches form the region to be clipped.
Specifically, all atoms i in the source molecule, i.e. nodes i, are traversed (e.g. $C_1$ of the source molecule in FIG. 1 or 2). For each atom i, all atoms j in the target molecule, i.e. nodes j, whose chemical element is the same as that of atom i are traversed (e.g. $C_1$ of the target molecule in FIG. 1 or 2). All branches of atom i and atom j are traversed to find the branches that are identical between the two, and the total number of atoms in these identical branches is recorded as $s_{i,j}$.
A branch is the region around a given node that expands from one of the nodes connected to it. For example, $(C_2, H_3, H_4, H_5)$, $(H_6)$, $(H_7)$ and $(C_8, H_9)$ are the 4 branches of node $C_1$ of the source molecule in FIG. 1 or 2.
As shown in FIG. 1 or 2, $C_1$ of the source molecule and $C_1$ of the target molecule have 2 identical branches, namely $(C_2, H_3, H_4, H_5)$ and $(H_6)$, so the total number of atoms in identical branches is $s_{i,j} = 5$.
For each atom i, the largest $s_{i,j}$ is taken as $s_i$, and the corresponding atom j is recorded.
The atom i with the largest $s_i$ is the clipping center $c^{te}$, and the atom j corresponding to $c^{te}$ is recorded.
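The counting of identical branches can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the adjacency-dictionary graph format and the string signature used to compare branches are all assumptions made for illustration, and the signature only works for acyclic branches (ring systems would need real subgraph matching). The toy source molecule follows the branches named for FIG. 1 or 2; the toy target molecule is hypothetical beyond its two shared branches and its O7 branch.

```python
from collections import Counter

def make_graph(elements, bonds):
    """Build (adjacency, elements) from an atom-id -> element map and a bond list."""
    adj = {a: [] for a in elements}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj, elements

def branch_signature(adj, elem, root, parent):
    """Canonical string for the branch rooted at `root`, walking away from `parent`.
    Assumption: the branch is a tree (no rings)."""
    children = sorted(branch_signature(adj, elem, n, root) for n in adj[root] if n != parent)
    return elem[root] + "(" + "".join(children) + ")"

def shared_atom_count(src, tgt, i, j):
    """s_{i,j}: number of atoms in branches identical around source atom i and target atom j."""
    (adj_s, elem_s), (adj_t, elem_t) = src, tgt
    if elem_s[i] != elem_t[j]:
        return 0
    sig_s = Counter(branch_signature(adj_s, elem_s, n, i) for n in adj_s[i])
    sig_t = Counter(branch_signature(adj_t, elem_t, n, j) for n in adj_t[j])
    common = sig_s & sig_t  # multiset intersection of branch signatures
    # every atom contributes exactly one "(" to a signature
    return sum(sig.count("(") * k for sig, k in common.items())

def clipping_center(src, tgt):
    """Return (source atom i, matched target atom j, s_i) with the largest s_i."""
    best = {}
    for i in src[0]:
        scores = {j: shared_atom_count(src, tgt, i, j) for j in tgt[0]}
        j_best = max(scores, key=scores.get)
        best[i] = (scores[j_best], j_best)
    center = max(best, key=lambda i: best[i][0])
    return center, best[center][1], best[center][0]

# Toy source molecule: C1 with branches (C2,H3,H4,H5), (H6), (H7), (C8,H9)
SRC = make_graph(
    {1: "C", 2: "C", 3: "H", 4: "H", 5: "H", 6: "H", 7: "H", 8: "C", 9: "H"},
    [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6), (1, 7), (1, 8), (8, 9)],
)
# Hypothetical target molecule: shares (C2,H3,H4,H5) and (H6), has O7 instead of H7
TGT = make_graph(
    {1: "C", 2: "C", 3: "H", 4: "H", 5: "H", 6: "H", 7: "O", 8: "C", 9: "H", 10: "H"},
    [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6), (1, 7), (1, 8), (8, 9), (8, 10)],
)

print(shared_atom_count(SRC, TGT, 1, 1))
print(clipping_center(SRC, TGT))
```

On this toy pair the count for the two C1 atoms matches the value of 5 given in the text, and atom 1 is selected as the clipping center.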
In addition, all $s_i$ are converted into a normalized probability distribution for model learning:
$$s^{te}_i = \frac{\exp(s_i)}{\sum_{k \in V_X} \exp(s_k)}$$
where $V_X$ is the set of all atoms in the source molecule and $\exp(\cdot)$ is the exponential function.
$s^{te}_i$ is the normalized distribution. The branches around $c^{te}$ in the source molecule that are identical to branches of the corresponding atom j (there may be several) form the retention region; the other branches form the region to be clipped. Each branch is labelled 1 (retained) or 0 (deleted), and the labels as a whole are denoted by a variable U, e.g. $U = \{(C_2, H_3, H_4, H_5): 1,\ (H_6): 1,\ (H_7): 0,\ (C_8, H_9): 0\}$. These are the labels that the model is trained to fit.
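As a small illustration of these training labels, the sketch below normalizes a set of counts with an exponential (softmax) normalization and builds the 1/0 branch labels U. The count 5 for atom 1 and the branch names come from the example in the text; the other counts, the function names and the dictionary layout are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw counts s_i into a probability distribution over source atoms."""
    peak = max(scores.values())  # subtracting the maximum keeps exp() numerically stable
    weights = {i: math.exp(s - peak) for i, s in scores.items()}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def branch_labels(center_branches, identical_branches):
    """U: 1 for a branch of the clipping center that is retained, 0 for one that is cut."""
    return {branch: int(branch in identical_branches) for branch in center_branches}

# s_i for each source atom; only the value 5 for atom 1 is taken from the text
s = {1: 5, 2: 3, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 0}
s_te = softmax(s)

branches_of_c1 = [("C2", "H3", "H4", "H5"), ("H6",), ("H7",), ("C8", "H9")]
identical = {("C2", "H3", "H4", "H5"), ("H6",)}
U = branch_labels(branches_of_c1, identical)

print(max(s_te, key=s_te.get))  # atom with the largest normalized value
print(U)
```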
In addition, every branch around the atom j corresponding to $c^{te}$ that is not among the identical branches is marked "to be generated" (for example, branch $O_7$ around $C_1$ of the target molecule in FIG. 2), and the subgraph formed by the branches to be generated is converted into a SMILES string representation.
In summary, the molecular optimization method provided by the present application can complete the optimization of the source molecule by generating only a minimal molecular substructure.
Each node represents an atom.
In the training stage, since the target molecules are known, molecule pairs can be matched: the objective of the molecular optimization task is determined first, and the molecule pairs that satisfy it are then obtained.
The objective or constraint of the molecular optimization task is determined first. In the present application it may be, for example, that the generated target molecule must have higher water solubility and a certain similarity to the source molecule; other objectives are not excluded. Existing open-source tools, such as RDKit, provide functions for calculating the relevant properties of a molecule and the similarity between molecules. The molecule database is then traversed to obtain the molecule pairs that satisfy the objective or constraint of the molecular optimization task. FIG. 1 shows one extracted molecule pair.
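The pair-matching traversal can be sketched as follows. This is an illustration under stated assumptions: `match_pairs`, its thresholds and the two toy callables are hypothetical, the solubility values are invented placeholders, and the character-set Jaccard score is only a stand-in for a real molecular similarity such as one computed by a cheminformatics toolkit.

```python
from itertools import permutations

def match_pairs(molecules, property_of, similarity, min_similarity, min_gain):
    """Return (source, target) pairs where the target is similar enough to the
    source and improves the chosen property by at least `min_gain`."""
    pairs = []
    for source, target in permutations(molecules, 2):
        gain = property_of(target) - property_of(source)
        if similarity(source, target) >= min_similarity and gain >= min_gain:
            pairs.append((source, target))
    return pairs

# Placeholder property table (invented values, not measured solubilities)
TOY_SOLUBILITY = {"CCO": 0.9, "CCN": 0.6, "CCCC": 0.1}

def toy_similarity(a, b):
    """Jaccard overlap of the character sets of two SMILES strings (toy stand-in)."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

pairs = match_pairs(
    list(TOY_SOLUBILITY),
    property_of=TOY_SOLUBILITY.get,
    similarity=toy_similarity,
    min_similarity=0.4,
    min_gain=0.4,
)
print(pairs)
```

Passing the property and similarity functions in as arguments keeps the traversal independent of whichever toolkit supplies them.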
Further, determining the clipping center comprises: obtaining a first vector representation of each first node; aggregating the first vector representations to obtain a second vector representation of the source molecule; combining the first vector representation and the second vector representation to predict the probability that the first node is the clipping center; and normalizing the predictions over the nodes to obtain a node probability distribution, wherein the node with the largest probability is taken as the clipping center.
Specifically, a graph message passing network (MPN) is used to learn representations of the source molecule, for example through the preset formulas:
(formula image BDA0002843965110000061)
(formula image BDA0002843965110000062)
which yield a vector representation of each node (atom), i.e. the first vector representation.
Here $x_i$ is the feature representation of node (atom) i, $x_{i,j}$ is the feature representation of the edge (chemical bond) between nodes i and j, $m^t_{i,j}$ is the message passed from node i to node j at step t, $N(i)$ is the set of all neighbors of i, $N(i) \setminus j$ is the set of all neighbors of i except j, and $f_1$ and $f_2$ are neural networks. After the number of iterations given in formula image BDA0002843965110000064, the final representation $h_i$ of node i is obtained. The feature representations of atoms and chemical bonds can be simple one-hot encodings.
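The formula images are not reproduced in this text. As a hedged sketch only, one common message passing form that is consistent with the symbol definitions given here is shown below; the iteration count $T$, the use of sums as the aggregation and the exact argument lists of $f_1$ and $f_2$ are assumptions, not the published formulas.

```latex
m^{t+1}_{i,j} = f_1\Big(x_i,\; x_{i,j},\; \sum_{k \in N(i) \setminus j} m^{t}_{k,i}\Big),
\qquad
h_i = f_2\Big(x_i,\; \sum_{k \in N(i)} m^{T}_{k,i}\Big)
```

In this form a message from i to j ignores what j itself sent to i, which is why the first sum excludes j while the final readout sums over all neighbors.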
The first vector representations are then aggregated to obtain a vector representation of the whole source molecule, i.e. the second vector representation:
(formula image BDA0002843965110000063)
Combining the first vector representation and the second vector representation, the probability that node i is the clipping center is predicted by the following formula:
$$s_i = f_3([h_X, h_i])$$
where $[\cdot,\cdot]$ denotes vector concatenation and $f_3$ is a standard neural network.
In the same way as for $s^{te}_i$, $s_i$ is normalized by the following formula:
$$s^{st}_i = \frac{\exp(s_i)}{\sum_{k \in V_X} \exp(s_k)}$$
The node with the largest $s^{st}_i$ is the predicted clipping center $c^{st}$.
In the model training phase, $s^{st}_i$ is fitted to the label values $s^{te}_i$ through a loss function (e.g., the KL divergence). In the test phase it is only necessary to output the node with the largest $s^{st}_i$ as the clipping center $c^{st}$.
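The prediction and training signal for the clipping center can be sketched in plain Python. Everything here is illustrative: `toy_f3` is a single linear layer standing in for the neural network $f_3$, the node vectors and weights are made up, summation is assumed as the aggregation that produces the molecule vector, and the direction KL(label || prediction) is an assumption since the text only names the KL divergence as an example.

```python
import math

def softmax(values):
    """Exponential normalization of a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def toy_f3(features, weights):
    """Hypothetical stand-in for f_3: one linear layer on the concatenation [h_X, h_i]."""
    return sum(w * x for w, x in zip(weights, features))

node_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up h_i for three atoms
h_mol = [sum(column) for column in zip(*node_vectors)]  # assumed aggregation: sum over atoms
weights = [0.1, 0.1, 2.0, -1.0]  # made-up parameters of toy_f3

scores = [toy_f3(h_mol + h_i, weights) for h_i in node_vectors]  # list + list concatenates
predicted = softmax(scores)  # s^{st}_i
center = max(range(len(predicted)), key=predicted.__getitem__)  # predicted clipping center

labels = softmax([5.0, 3.0, 1.0])  # s^{te}_i built from made-up counts
loss = kl_divergence(labels, predicted)  # training loss to minimize

print(center, loss)
```

At test time only the `center` line is needed; the label distribution and the loss exist only during training.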
Further, determining the branches to clip comprises: obtaining a third vector representation of a third branch, wherein the third branch is a branch of the clipping center; predicting the retention probability of the third branch from the first vector representation, the third vector representation and the vector representation of the first branches; and deciding on that basis whether the third branch is retained.
The third branch here is any branch of the clipping center.
Specifically, starting from the clipping center, each branch around the clipping center is examined in turn, and a decision is made as to which branches are retained and which are deleted. From the first vector representations $h_i$, a vector representation of any branch is obtained by the following formula:
(formula image BDA0002843965110000072)
where $c^{st}$ is the clipping center, the symbol shown in formula image BDA0002843965110000073 denotes a branch subgraph around the clipping center $c^{st}$ in the source molecule, and $|\cdot|$ is the number of atoms in a subgraph.
The retention probability of branch j is then predicted by a neural network:
(formula image BDA0002843965110000074)
where $f_4$ is a standard neural network, $\sigma$ is the sigmoid function, and the input vectors (including those shown in formula images BDA00028439651100000710 and BDA0002843965110000078) are, respectively, the vector representation of the clipping center, the vector representation of the branch currently being decided, and the vector representation of the branches already determined to be retained. An output greater than or equal to 0.5 means retention; an output less than 0.5 means deletion. In the symbol shown in formula image BDA0002843965110000075, t-1 is the index of the previous iteration. In each iteration the model decides whether to retain or delete one branch; a retained branch is added to the set $U^{st}_{t-1}$, each element of which is a subgraph.
The vector shown in formula image BDA0002843965110000079 is obtained by the following formula:
(formula image BDA0002843965110000081)
In this way the complete retention region after clipping is obtained. As shown in FIG. 2, the shaded portion is retained and the boxed region is deleted.
In the test phase, the output is obtained directly by the above process. In the training phase, the output needs to be fitted to U, and the cross entropy between the two can be used as the loss function.
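The iterative retain/delete decision and its cross-entropy loss can be sketched as below. This is a minimal illustration, not the published model: `toy_f4` is a linear stand-in for the network $f_4$, the vectors and weights are invented, and representing the already-retained branches by the mean of their vectors (zeros when none is retained yet) is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_vector(vectors, dim):
    """Assumed summary of the retained branches: element-wise mean, zeros if empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def decide_branches(h_center, branch_vectors, f4):
    """Decide one branch per iteration; a branch is kept when its probability is >= 0.5."""
    kept, probs, decisions = [], [], []
    for h_branch in branch_vectors:
        h_kept = mean_vector(kept, len(h_branch))
        p = sigmoid(f4(h_center + h_branch + h_kept))  # concatenated input
        probs.append(p)
        decisions.append(int(p >= 0.5))
        if p >= 0.5:
            kept.append(h_branch)
    return probs, decisions

def cross_entropy(probs, labels):
    """Mean binary cross entropy between predicted probabilities and 1/0 labels U."""
    terms = (y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(probs, labels))
    return -sum(terms) / len(labels)

W = [0.0, 0.0, 1.0, 1.0, 0.5, 0.5]  # made-up parameters of the stand-in network

def toy_f4(features):
    """Hypothetical stand-in for f_4: one linear layer."""
    return sum(w * x for w, x in zip(W, features))

probs, decisions = decide_branches(
    h_center=[1.0, 0.0],
    branch_vectors=[[2.0, 0.0], [0.0, -3.0], [1.0, 1.0]],
    f4=toy_f4,
)
loss = cross_entropy(probs, [1, 0, 1])  # labels U for the three branches
print(decisions, loss)
```

Because each decision feeds the summary of retained branches into the next one, the order in which branches are visited can change the outcome; the text does not specify that order.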
Further, converting the source molecule into a SMILES string comprises: converting the source molecule into a standard SMILES representation and encoding it to obtain an encoded representation of the source molecule; and processing the encoded representation of the source molecule to obtain the new SMILES string.
Specifically, the source molecule is converted into a standard SMILES representation, and a first long short-term memory (LSTM) network encodes the SMILES characters in the order in which they appear. A second LSTM network then decodes the encoding of the source molecule into the new SMILES string.
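Before an encoder of this kind can read a SMILES string, the characters have to be turned into a sequence of indices in order of appearance. The sketch below shows only that preprocessing step and does not implement the two LSTM networks. The start and end markers, the function names and the character-level split are assumptions; a character-level split treats two-letter symbols such as Cl as two tokens, which a real tokenizer would need to handle.

```python
START, END = "<s>", "</s>"  # hypothetical sequence markers

def build_vocab(smiles_list):
    """Map every character seen in the SMILES strings, plus the markers, to an index."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {token: index for index, token in enumerate([START, END] + chars)}

def encode(smiles, vocab):
    """Index sequence in character order, wrapped in start and end markers."""
    return [vocab[START]] + [vocab[ch] for ch in smiles] + [vocab[END]]

def decode(indices, vocab):
    """Inverse of `encode`: drop the markers and join the characters."""
    inverse = {index: token for token, index in vocab.items()}
    return "".join(inverse[i] for i in indices if inverse[i] not in (START, END))

vocab = build_vocab(["CCO", "CN"])
ids = encode("CCO", vocab)
print(ids)
print(decode(ids, vocab))
```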
Further, the obtaining the target molecule by splicing the source molecule retention region and the simplified molecule linear input canonical character string includes: converting the new simplified molecular linear input specification into a molecular graph; and merging the molecular graph and the source molecule reserved area to generate the target molecule.
Specifically, the SMILES is converted into a molecular diagram representation form, and then the first atom (i.e., the clipping center) in the SMILES representation is spliced and combined with the source molecule retention region to obtain the final complete target molecule.
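The splicing step can be sketched as a simple graph merge. The representation used here (a dict of atom ids to element symbols plus a set of index pairs for bonds), the function name `splice` and the omission of bond orders are all simplifying assumptions made for illustration; parsing the generated SMILES string into a graph is assumed to have been done already by a cheminformatics toolkit.

```python
def splice(retained_atoms, retained_bonds, center, frag_atoms, frag_bonds):
    """Merge a generated fragment graph into the retained region.

    retained_atoms: dict {atom_id: element symbol} of the retained region
    retained_bonds: iterable of (atom_id, atom_id) pairs
    center:         atom_id of the clipping center inside the retained region
    frag_atoms:     list of element symbols; index 0 is the clipping center
    frag_bonds:     iterable of (index, index) pairs into frag_atoms
    Bond orders are ignored in this sketch.
    """
    if retained_atoms[center] != frag_atoms[0]:
        raise ValueError("first fragment atom must match the clipping center")
    atoms = dict(retained_atoms)
    bonds = {tuple(sorted(b)) for b in retained_bonds}
    mapping = {0: center}  # fragment atom 0 is identified with the center
    next_id = max(atoms) + 1
    for idx in range(1, len(frag_atoms)):
        mapping[idx] = next_id
        atoms[next_id] = frag_atoms[idx]
        next_id += 1
    for i, j in frag_bonds:
        bonds.add(tuple(sorted((mapping[i], mapping[j]))))
    return atoms, bonds
```

Identifying fragment atom 0 with the clipping center, rather than adding it as a new atom, is what keeps the center from being duplicated in the merged molecule.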
As shown in fig. 4, the present application further provides a molecular optimization system, the system comprising: a molecule cutting unit 1, configured to determine a source molecule retention region, where the source molecule retention region is a common molecular substructure of the source molecule and the target molecule. The molecule cutting unit 1 comprises an analysis module and a molecule cutter; the analysis module is used for analyzing the region to be cut of the source molecule, and the molecule cutter is used for predicting the molecule retention region.
The molecule cutting unit 1 further comprises a database and a molecule pair matching module. The database is used for providing molecule pair data; the molecule pair matching module is used for acquiring qualified molecule pairs from the database, each molecule pair comprising a source molecule and a target molecule. In the testing stage the target molecule is unknown, so the molecule pair matching module does not need to be called; in the training stage, molecule pairs need to be selected through the molecule pair matching module.
A time-series unit 2, configured to convert the source molecule into a new SMILES string, where the new SMILES string represents the non-common molecular substructure part of the target molecule. The time-series unit 2 comprises a first time-series module and a second time-series module; the first time-series module is used for acquiring an encoded representation of the whole source molecule, and the second time-series module is used for acquiring the new SMILES string.
A molecule splicing unit 3, configured to splice the source molecule retention region and the new SMILES string to obtain the target molecule. The molecule splicing unit 3 comprises a molecule splicing module, which is used for generating the target molecule.
The molecule splicing module comprises a molecular graph conversion sub-module and a merging sub-module; the molecular graph conversion sub-module converts the new SMILES string into a molecular graph, and the merging sub-module merges the molecular graph with the retention region to generate the target molecule.
The first time-series module is an encoder, and the second time-series module is a decoder.
This part generally employs an encoder-decoder framework, as shown in fig. 3.
Specifically, the encoder converts the input molecule into a SMILES representation and then encodes the SMILES characters in their order of occurrence using a standard long short-term memory (LSTM) network. The hidden state of the LSTM at the last time step is taken as the final output C of the encoder.
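To make the encoder step concrete, the following is a minimal pure-Python sketch of a single-layer LSTM that reads one-hot SMILES characters in order and returns the last hidden state. The helper names (`make_lstm`, `lstm_step`, `encode`), the random initialization and the tiny hidden size are assumptions for illustration; a real system would use a deep-learning framework and trained weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_size, hidden_size, seed=0):
    """Randomly initialized parameters: one (weights, bias) pair per gate."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # i = input gate, f = forget gate, o = output gate, g = candidate values
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in "ifog"}


def lstm_step(params, x, h, c):
    z = x + h  # concatenate the input with the previous hidden state

    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode(tokens, vocab, params, hidden_size):
    """Return the last hidden state, i.e. the encoder output called C in the text."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size  # LSTM cell state (not the encoder output C)
    for tok in tokens:
        x = [1.0 if tok == v else 0.0 for v in vocab]  # one-hot input
        h, c = lstm_step(params, x, h, c)
    return h
```

Note that the variable `c` inside the sketch is the LSTM cell state; the encoder output that the text calls C corresponds to the returned hidden state `h`.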
The decoder uses an LSTM separate from that of the encoder to decode a SMILES string, which is the remaining part to be generated.
In this LSTM decoder, the initial hidden state is the C provided by the encoder, in which the information of the source molecule has been encoded. The output of the LSTM at each time step is a specific character sampled from a character set consisting of all possible constituent letters or symbols of SMILES, such as Br, Cl, N, O, S, p.
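Because the character set contains two-letter symbols such as Br and Cl, a SMILES string cannot simply be split character by character. The sketch below shows one possible tokenizer and vocabulary builder; the regular expression, the function names and the decision to treat a bracketed atom as a single token are assumptions for illustration (multi-digit ring closures written with `%` are not handled).

```python
import re

# Illustrative token pattern: bracket atoms first, then the two-letter
# halogens, then any remaining single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens of the character set."""
    return _TOKEN.findall(smiles)


def build_vocab(smiles_list, specials=("</s>",)):
    """Collect the character set from a corpus, with the end marker first."""
    tokens = sorted({t for s in smiles_list for t in tokenize_smiles(s)})
    return list(specials) + tokens
```

For example, `tokenize_smiles("BrCCl")` keeps `Br` and `Cl` intact instead of splitting them into `B`, `r`, `C`, `l`.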
In the decoder, the character output at the current time step is used as the input at the next time step; e.g., the character "C" predicted by the model at time step t-1 is used as the input of the LSTM at time step t. In particular, the input at the first time step is fixed to the clipping center in the previously selected source molecule, from which generation starts; when the output character is "</s>", generation is complete. The input and output of the decoder can adopt simple one-hot encoding, i.e., an encoding vector constructed over the character set.
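The autoregressive loop described above can be sketched as follows. The callable `step_fn` is a hypothetical stand-in for the decoder LSTM plus its output layer, and `max_len` is an assumed safety limit. The text describes sampling a character at each step; this sketch uses greedy argmax selection instead, purely to keep the example deterministic.

```python
def one_hot(token, vocab):
    """Encoding vector constructed over the character set."""
    return [1.0 if token == v else 0.0 for v in vocab]


def greedy_decode(step_fn, state, center_atom, vocab, end_token="</s>", max_len=50):
    """Generate a SMILES fragment starting from the clipping center.

    step_fn(x, state) -> (probs, new_state) is assumed to return one
    probability per vocabulary entry. `state` would be initialized from
    the encoder output C.
    """
    tokens = [center_atom]
    x = one_hot(center_atom, vocab)  # first input is fixed to the clipping center
    for _ in range(max_len):
        probs, state = step_fn(x, state)
        best = max(range(len(vocab)), key=probs.__getitem__)
        nxt = vocab[best]
        if nxt == end_token:  # "</s>" marks the end of generation
            break
        tokens.append(nxt)
        x = one_hot(nxt, vocab)  # feed the prediction back as the next input
    return "".join(tokens)
```

Because the generated string starts with the clipping center, its first atom is the one that is later identified with the corresponding atom of the retained region during splicing.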
In the training stage, the SMILES string generated by this part needs to be fitted to the SMILES string of the molecular subgraph to be generated; cross entropy can likewise be used as the loss function.
The present application further provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps in any of the method embodiments described above are implemented.
The terminal device of this embodiment includes: at least one processor (only one is shown in fig. 5), a memory, and a computer program stored in the memory and executable on the at least one processor; when executing the computer program, the processor implements the steps in any of the molecular optimization method embodiments described above.
The terminal device can be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the terminal device described here is merely an example and does not constitute a limitation; it may include more or fewer components than those shown, combine some components, or use different components, such as input and output devices, network access devices, etc.
The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor or the like.
The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. In other embodiments, the memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like provided on the terminal device.
Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a terminal device, enables the terminal device to implement the steps in the above method embodiments. The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

Claims (10)

1. A molecular optimization method, characterized in that the method comprises the following steps:
obtaining a source molecule, and cutting the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule;
converting the source molecule into a new simplified molecular-input line-entry system (SMILES) string, wherein the new SMILES string represents a non-common molecular substructure part of the target molecule;
and splicing the source molecule retention region and the new SMILES string to obtain the target molecule.
2. The molecular optimization method of claim 1, wherein the cutting comprises:
analyzing the region to be cut of the source molecule, determining a cutting center according to the region to be cut, determining a cutting branch according to the cutting center, and cutting the cutting branch to obtain the source molecule retention region.
3. The molecular optimization method of claim 2, wherein analyzing the region to be cut of the source molecule comprises:
traversing first nodes in the source molecule and second nodes in the target molecule, wherein a first node and a second node have the same chemical element;
traversing the branches of the first node to obtain a first branch, and traversing the branches of the second node to obtain a second branch, wherein the first branch is the same as the second branch, and recording the number of nodes in the first branch or the second branch;
taking the first node with the largest recorded number of nodes, wherein the first branch is the retention region and the branches other than the first branch are the regions to be cut.
4. The molecular optimization method of claim 3, wherein determining the cutting center comprises:
acquiring a first vector representation of the first node; aggregating the first vector representations to obtain a second vector representation of the source molecule; combining the first vector representation and the second vector representation to predict the probability that the first node is the cutting center; and performing normalization over the nodes to obtain a node probability distribution, wherein the node with the largest value in the node probability distribution is taken as the cutting center.
5. The molecular optimization method of claim 4, wherein determining the cutting branch comprises:
acquiring a third vector representation of a third branch, wherein the third branch is a branch of the cutting center; and predicting the retention probability of the third branch from the first vector representation, the third vector representation and the vector representation of the first branch, so as to decide whether the third branch is retained.
6. The molecular optimization method of claim 1, wherein converting the source molecule into the new SMILES string comprises:
converting the source molecule into a standard SMILES representation and encoding it to obtain an encoded representation of the source molecule;
processing the encoded representation of the source molecule to obtain the new SMILES string.
7. The molecular optimization method of claim 1, wherein splicing the source molecule retention region and the new SMILES string to obtain the target molecule comprises:
converting the new SMILES string into a molecular graph;
and merging the molecular graph with the source molecule retention region to generate the target molecule.
8. A molecular optimization system, characterized in that the system comprises:
a molecule cutting unit for determining a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule;
a time-series unit for converting the source molecule into a new SMILES string, the new SMILES string representing a non-common molecular substructure part of the target molecule;
and a molecule splicing unit for splicing the source molecule retention region and the new SMILES string to obtain the target molecule.
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, implements the method of any of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that: the computer program when executed by a processor implements the method of any one of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011502775.0A CN112509644B (en) 2020-12-18 2020-12-18 Molecular optimization method, system, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112509644A (en) 2021-03-16
CN112509644B (en) 2024-09-20

Family

ID=74922321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011502775.0A Active CN112509644B (en) 2020-12-18 2020-12-18 Molecular optimization method, system, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112509644B (en)

Also Published As

Publication number Publication date
CN112509644B (en) 2024-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant