CN112509644A

CN112509644A - Molecular optimization method, system, terminal equipment and readable storage medium

Info

Publication number: CN112509644A
Application number: CN202011502775.0A
Authority: CN
Inventors: 纪超杰; 吴红艳; 蔡云鹏; 郑奕嘉
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-16
Anticipated expiration: 2040-12-18
Also published as: CN112509644B

Abstract

The present application belongs to the technical field of data synthesis, and in particular relates to a molecular optimization method, system, terminal device and readable storage medium. Existing systems perform poorly when the chemical structure of the target molecule is complex. The application provides a molecular optimization method comprising the following steps: obtaining a source molecule, and clipping the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a molecular substructure common to the source molecule and the target molecule; converting the source molecule into a new simplified molecular input line entry system (SMILES) string, wherein the new SMILES string represents the non-common molecular substructure of the target molecule; and splicing the source molecule retention region with the new SMILES string to obtain the target molecule. The error rate of the system is thereby greatly reduced.

Description

Molecular optimization method, system, terminal equipment and readable storage medium

Technical Field

The present application belongs to the technical field of data synthesis, and in particular, relates to a method, a system, a terminal device and a readable storage medium for molecular optimization.

Background

Definition of the molecular optimization task: a source molecule is input into a molecule optimizer (generator), which converts it into another molecule (the target molecule). The target molecule has a chemical structure similar to that of the source molecule, but certain chemical properties (e.g., water solubility) are greatly improved.

Existing methods convert the graph generation task into a sequential decision process. Each element in the sequence is a specific decision, and there are mainly three kinds of decision: 1) whether to add a new node at the current moment (generation is considered finished when no new node is added); 2) whether to add a new edge at the current moment; and 3) which node to connect to the current new node. The complete generation process starts from an empty molecular graph, with one or more decisions made at each moment. Building on this graph generation paradigm, some methods further incorporate reinforcement learning and define the state space, action space and reward function used by a standard reinforcement learning model, but the overall molecular generation logic is unchanged.

The solution space of the target molecule is too large, since any modification of the source molecule yields a candidate target molecule. These methods all take a source molecule as input; the optimization model then starts from an empty molecular graph and, in a random manner, generates one node at a time and establishes chemical bonds with previously generated nodes until an end-of-generation signal is obtained. Because target molecules are often large (they contain many nodes), existing optimization systems can guarantee neither the similarity between the generated molecule and the source molecule nor the improvement of the property, and they consume excessive computational resources.

Disclosure of Invention

1. Technical problem to be solved

Existing molecular optimization methods can be classified as "from nothing to something" molecular generation processes: the optimized target molecule is generated from a blank graph up to the complete target molecule. Because existing systems perform poorly when the chemical structure of the target molecule is complex, the present application provides a molecular optimization method, system, terminal device and readable storage medium.

2. Technical scheme

To achieve the above object, the present application provides a molecular optimization method comprising: obtaining a source molecule, and clipping the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a molecular substructure common to the source molecule and the target molecule; converting the source molecule into a new simplified molecular input line entry system (SMILES) string, wherein the new SMILES string represents the non-common molecular substructure of the target molecule; and splicing the source molecule retention region with the new SMILES string to obtain the target molecule.

In another embodiment provided by the present application, the clipping comprises: analyzing the region to be clipped of the source molecule, determining a clipping center according to the region to be clipped, determining the branches to be clipped according to the clipping center, and clipping those branches to obtain the source molecule retention region.

In another embodiment provided by the present application, analyzing the region to be clipped of the source molecule comprises: traversing first nodes in the source molecule and second nodes in the target molecule, wherein a first node and a second node have the same chemical element; traversing the branches of the first node to obtain first branches and the branches of the second node to obtain second branches, and, where a first branch is identical to a second branch, recording the number of nodes in the identical branches; and taking the first node with the largest node count, wherein its identical first branches form the retention region and its remaining branches form the region to be clipped.

In another embodiment provided by the present application, determining the clipping center comprises: acquiring a first vector representation of the first node; aggregating the first vector representations to obtain a second vector representation of the source molecule; combining the first vector representation and the second vector representation to predict the probability that the first node is the clipping center; and normalizing the resulting node values to obtain a node probability distribution, wherein the node with the largest probability is taken as the clipping center.

In another embodiment provided by the present application, determining the branches to be clipped comprises: acquiring a third vector representation of a third branch, wherein the third branch is a branch of the clipping center; and predicting the retention probability of the third branch from the first vector representation, the third vector representation and the vector representation of the first branch, so as to decide whether the third branch is retained.

In another embodiment provided by the present application, converting the source molecule into the new SMILES string comprises: converting the source molecule into a standard SMILES representation to obtain an encoded representation of the source molecule; and processing the encoded representation of the source molecule to obtain the new SMILES string.

In another embodiment provided by the present application, splicing the source molecule retention region with the new SMILES string to obtain the target molecule comprises: converting the new SMILES string into a molecular graph; and merging the molecular graph with the source molecule retention region to generate the target molecule.

The present application further provides a molecular optimization system comprising: a molecule clipping unit for determining a source molecule retention region, wherein the source molecule retention region is a molecular substructure common to the source molecule and the target molecule; a time-series unit for converting the source molecule into a new SMILES string, wherein the new SMILES string represents the non-common molecular substructure of the target molecule; and a molecule splicing unit for splicing the source molecule retention region with the new SMILES string to obtain the target molecule.

Optionally, the molecule clipping unit comprises an analysis module and a molecule clipper; the time-series unit comprises a first time-series module and a second time-series module; and the molecule splicing unit comprises a molecule splicing module. The analysis module is used for analyzing the region to be clipped of the source molecule; the molecule clipper is used for predicting the molecule retention region; the first time-series module is used for acquiring an encoded representation of the whole source molecule; the second time-series module is used for acquiring the new SMILES string; and the molecule splicing module is used for generating the target molecule. The clipping unit further comprises a database and a molecule pair matching module, wherein the database is used for providing molecule pair data, and the molecule pair matching module is used for acquiring qualifying molecule pairs from the database, each molecule pair comprising a source molecule and a target molecule.

Optionally, the first time-series module is an encoder and the second time-series module is a decoder; the molecule splicing module comprises a molecular graph conversion sub-module and a merging sub-module, wherein the molecular graph conversion sub-module converts the new SMILES string into a molecular graph, and the merging sub-module merges the molecular graph with the retention region to generate the target molecule.

The application also provides a terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the method.

The present application also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the method.

3. Advantageous effects

Compared with the prior art, the molecular optimization method, the molecular optimization system, the terminal device and the readable storage medium have the advantages that:

The molecular optimization system provided by the application comprises a molecule clipper, and the retention region is obtained through the molecule clipper, so that the scale of the remaining molecular structure to be generated is greatly reduced and the error rate of the system is greatly reduced.

According to the molecular optimization method, the common molecular substructure in the source molecule and the target molecule is found through a molecular cutting strategy, so that the accuracy and the efficiency of final molecule generation are improved.

The molecular optimization method provided by the application is based on a general phenomenon: in a molecular optimization task there is often a strong graph-structure correlation between the source molecule and the target molecule, that is, they share a large number of identical sub-molecular structures. Retaining these identical structures and generating only the remaining, differing parts can greatly improve the accuracy of model optimization.

The molecular optimization method provided by the application addresses a fatal problem of the "from nothing to something" mode of molecule generation, namely its huge amount of computation: if the sub-molecular structure common to the source molecule and the target molecule is retained in advance, the remaining molecular part that actually needs to be generated is greatly reduced, which reduces the method's consumption of computational resources.

Drawings

FIG. 1 is a schematic diagram of an exemplary molecular pair of the present application;

FIG. 2 is a schematic diagram of the cropping process of the present application;

FIG. 3 is a schematic diagram of an encoder-decoder framework of the present application;

FIG. 4 is a schematic diagram of the molecular optimization system of the present application;

FIG. 5 is a schematic structural diagram of a terminal device of the present application.

Detailed Description

Hereinafter, specific embodiments of the present application will be described in detail with reference to the accompanying drawings, and it will be apparent to those skilled in the art from this detailed description that the present application can be practiced. Features from different embodiments may be combined to yield new embodiments, or certain features may be substituted for certain embodiments to yield yet further preferred embodiments, without departing from the principles of the present application.

Since the generation process requires a prediction of newly added nodes and chemical bonds at every step, the number of steps performed is proportional to the size of the target molecule. The larger the target molecule, the more severe the consumption of computational resources.

SMILES (Simplified Molecular Input Line Entry System) is a specification that describes the structure of a molecule unambiguously using ASCII character strings. It serves as a method and tool for representing a molecular graph as a character string: given a molecular graph structure, SMILES provides the corresponding string representation. The target molecule generation process of the present application is primarily based on this molecular representation, but it should be noted that SMILES itself cannot generate molecules directly. The present application uses SMILES only as a representation for encoding the source molecule and the target molecule.

The molecular optimization method provided by the embodiments of the present application can be applied to terminal devices such as a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a personal digital assistant (PDA); the embodiments of the present application do not limit the specific type of terminal device.

For example, the terminal device may be a Station (ST) in a WLAN, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a computer, a laptop, a handheld communication device, a handheld computing device, a satellite radio, a wireless modem card.

Referring to FIGS. 1 to 5, the present application provides a molecular optimization method comprising: obtaining a source molecule, and clipping the source molecule according to a target molecule to obtain a source molecule retention region, wherein the source molecule retention region is a molecular substructure common to the source molecule and the target molecule. That is, the source molecule retention region is predicted, and all molecular substructures outside it are clipped away. This step reduces the scale of the remaining molecular structure to be generated and thereby reduces the error rate.

In the training phase, molecule pair data are acquired, each molecule pair comprising a source molecule and a target molecule. Any molecule database containing a variety of different molecules may be selected, such as the ZINC database. In the training phase the target molecule is known; at test time the target molecule is unknown, and the given source molecule is processed to obtain the target molecule.

The source molecule is then converted into a new SMILES string, which represents the non-common molecular substructure of the target molecule; that is, the remaining part of the target molecule is generated.

The source molecule retention region and the new SMILES string are spliced to obtain the target molecule.

Further, the clipping comprises: analyzing the region to be clipped of the source molecule, determining a clipping center according to the region to be clipped, determining the branches to be clipped according to the clipping center, and clipping those branches to obtain the source molecule retention region.

Further, the analyzing the region to be cut of the source molecule includes:

traversing first nodes in the source molecule and second nodes in the target molecule, wherein a first node and a second node have the same chemical element; for example, the first node is C1 of the source molecule in FIG. 2 and the second node is C1 of the target molecule in FIG. 2;

traversing the branches of the first node to obtain first branches and the branches of the second node to obtain second branches, and, where a first branch is identical to a second branch, recording the number of nodes in the identical branches. Here, the first branches are all branches of the first node, and the second branches are all branches of the second node;

taking the first node with the largest node count, wherein its identical first branches form the retention region and its remaining branches form the region to be clipped.

Specifically, all atoms i (nodes i) in the source molecule are traversed (e.g., C1 of the source molecule in FIG. 1 or 2). For each atom i, all atoms j (nodes j) in the target molecule that have the same chemical element as atom i are traversed (e.g., C1 of the target molecule in FIG. 1 or 2). All branches of atom i and atom j are traversed to find the branches of atom i that are identical to branches of atom j, and the total number of atoms in the identical branches is recorded as $s_{i,j}$.

A branch is the region around a given node that is expanded from another node connected to it. For example, (C₂, H₃, H₄, H₅), (H₆), (H₇) and (C₈, H₉) are the 4 branches of node C₁ of the source molecule in FIG. 1 or 2.

As shown in FIG. 1 or 2, C₁ of the source molecule and C₁ of the target molecule have 2 identical branches, namely (C₂, H₃, H₄, H₅) and (H₆), so the total number of atoms in the identical branches is $s_{i,j} = 5$.

For each atom i, the largest $s_{i,j}$ is taken as $s_i$, and the corresponding atom j is recorded.

The atom i with the largest $s_i$ is denoted $c^{te}$; it is the clipping center, and the atom j corresponding to $c^{te}$ is recorded.
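The branch comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: molecules are acyclic, bond types are ignored, branches are compared through a canonical string, and the target molecule below is a hypothetical stand-in (the source text does not list the atoms of the target in FIG. 1). The helper names `make_mol`, `canon`, `branches`, `shared_size` and `clipping_center` are invented for this sketch.

```python
from collections import Counter

def make_mol(elements, bonds):
    """Build (elements, adjacency) from {atom_id: symbol} and a list of bonded id pairs."""
    adj = {a: [] for a in elements}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return elements, adj

def canon(mol, node, parent):
    """Return (canonical string, atom count) of the branch rooted at `node`, away from `parent`."""
    elements, adj = mol
    parts, size = [], 1
    for nb in adj[node]:
        if nb != parent:
            sub, n = canon(mol, nb, node)
            parts.append(sub)
            size += n
    return elements[node] + "(" + "".join(sorted(parts)) + ")", size

def branches(mol, center):
    """Multiset of branches around `center`, one per neighbour (acyclic graphs only)."""
    return Counter(canon(mol, nb, center) for nb in mol[1][center])

def shared_size(src, i, tgt, j):
    """s_ij: total number of atoms in the branches that atom i and atom j have in common."""
    if src[0][i] != tgt[0][j]:
        return 0
    common = branches(src, i) & branches(tgt, j)  # multiset intersection
    return sum(size * count for (_, size), count in common.items())

def clipping_center(src, tgt):
    """Return (s_i, i, j) for the source atom i with the largest s_i = max_j s_ij."""
    return max((shared_size(src, i, tgt, j), i, j) for i in src[0] for j in tgt[0])

# Source molecule with the four branches listed in the text around C1.
src = make_mol(
    {1: "C", 2: "C", 3: "H", 4: "H", 5: "H", 6: "H", 7: "H", 8: "C", 9: "H"},
    [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6), (1, 7), (1, 8), (8, 9)],
)
# Hypothetical target: shares (C2, H3, H4, H5) and (H6) with the source, differs elsewhere.
tgt = make_mol(
    {1: "C", 2: "C", 3: "H", 4: "H", 5: "H", 6: "H", 7: "O", 8: "H", 9: "N", 10: "H"},
    [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6), (1, 7), (7, 8), (1, 9), (9, 10)],
)

print(shared_size(src, 1, tgt, 1))
print(clipping_center(src, tgt))
```

For this toy pair, the shared branches around C1 are the methyl-like branch (4 atoms) and one hydrogen (1 atom), which matches the count of 5 given in the text. Molecules with rings would need a different definition of a branch.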

In addition, all $s_i$ are converted into a normalized probability distribution for model learning:

$$s^{te}_i = \frac{\exp(s_i)}{\sum_{k \in V_X} \exp(s_k)}$$

where $V_X$ denotes the set of all atoms in the source molecule and $\exp(\cdot)$ is the exponential function.

$s^{te}_i$ is the normalized distribution. The branches of $c^{te}$ in the source molecule that are identical to branches of the corresponding atom j (there may be several) form the retention region; the other branches form the region to be clipped. Each branch is labeled 1 or 0 for retained or deleted, and the labels as a whole are denoted by a variable U, e.g. U = {(C₂, H₃, H₄, H₅): 1, (H₆): 1, (H₇): 0, (C₈, H₉): 0}. This is the distribution to be fitted in model training.
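As a small numerical illustration of the label construction, the sketch below normalizes a set of counts with an exponential normalization (softmax) and writes the branch labels U as a dictionary. The counts for atoms other than C1 are hypothetical values chosen for the example, and the function name `softmax` is an assumption of this sketch.

```python
import math

def softmax(scores):
    """Exponential normalization of {atom_id: s_i} into a probability distribution."""
    m = max(scores.values())  # subtracting the maximum leaves the result unchanged but avoids overflow
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical counts s_i for a few source atoms; atom 1 (C1) has the largest count.
s = {1: 5, 2: 3, 6: 0, 8: 1}
s_te = softmax(s)
center = max(s_te, key=s_te.get)

# Branch labels around the clipping center: 1 = retained, 0 = deleted.
U = {("C2", "H3", "H4", "H5"): 1, ("H6",): 1, ("H7",): 0, ("C8", "H9"): 0}

print(center, round(s_te[center], 3))
```

Because the exponential grows quickly, the resulting distribution is sharply peaked at the clipping center whenever one count clearly dominates.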

In addition, the branches of the atom j corresponding to $c^{te}$ that have no identical branch in the source molecule are marked as "to be generated", such as branch O₇ around C₁ of the target molecule in FIG. 2, and the branch subgraph to be generated is converted into a SMILES string representation.

In summary, the molecular optimization method provided by the present application can complete the optimization of the source molecule by generating the smallest molecular substructure.

The nodes are represented by atoms.

In the training stage, since the target molecules are known, molecule pairs can be matched: the goal of the molecular optimization task is determined, and the molecule pairs that satisfy the conditions are obtained.

The goal or constraint of the molecular optimization task is determined first. In the present application it may, for example, require the generated target molecule to have higher water solubility while remaining similar to the source molecule, but it is not limited thereto. Existing open-source tools, such as RDKit, provide functions for calculating the relevant properties of a molecule and the similarity between molecules. The molecule database is traversed to obtain the molecule pairs that satisfy the goal or constraint of the molecular optimization task. FIG. 1 shows one extracted molecule pair.
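A minimal sketch of the pair matching step is given below. It does not call RDKit: each molecule is represented by a hypothetical set of feature identifiers (standing in for a fingerprint) and a hypothetical property value (standing in for water solubility), and similarity is computed as the Tanimoto coefficient of the two sets. The thresholds `min_similarity` and `min_gain` and the database contents are assumptions made for the example.

```python
from itertools import permutations

def tanimoto(a, b):
    """Tanimoto similarity of two feature sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def match_pairs(db, min_similarity, min_gain):
    """Return (source, target) names whose similarity and property gain meet the thresholds."""
    pairs = []
    for (src, s), (tgt, t) in permutations(db.items(), 2):
        similar = tanimoto(s["features"], t["features"]) >= min_similarity
        improved = t["prop"] - s["prop"] >= min_gain
        if similar and improved:
            pairs.append((src, tgt))
    return pairs

# Toy database: feature sets and property values are made up for illustration.
db = {
    "A": {"features": {1, 2, 3, 4}, "prop": -2.0},
    "B": {"features": {1, 2, 3, 5}, "prop": -0.5},
    "C": {"features": {7, 8}, "prop": 1.0},
}

print(match_pairs(db, min_similarity=0.5, min_gain=1.0))
```

Only the ordered pair (A, B) qualifies here: B is similar to A and has the higher property value, whereas C has a better property value but shares no features with the others.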

Further, determining the clipping center comprises: acquiring a first vector representation of the first node; aggregating the first vector representations to obtain a second vector representation of the source molecule; combining the first vector representation and the second vector representation to predict the probability that the first node is the clipping center; and normalizing the resulting node values to obtain a node probability distribution, wherein the node with the largest probability is taken as the clipping center.

Specifically, graph message passing networks (MPNs) are used to perform representation learning on the source molecule; a preset formula yields a vector representation of each node (atom), i.e., the first vector representation.

Here $x_i$ is the feature representation of node (atom) i, $x_{i,j}$ is the feature representation of the edge (chemical bond) between nodes i and j, $m^t_{i,j}$ is the message passed from node i to node j at time t, $N(i)$ is the set of all neighbor nodes of i, $N(i) \setminus j$ is the set of all neighbor nodes of i except j, and $f_1$ and $f_2$ are both neural networks. After the message passing iterations are complete, the final representation $h_i$ of node i is obtained. The feature representations of atoms and chemical bonds can be encoded with simple one-hot vectors.

These vector representations are then aggregated to obtain a vector representation $h_X$ of the whole source molecule, i.e., the second vector representation.

The first and second vector representations are combined, and the probability that node i is the clipping center is predicted by the following formula:

$$s_i = f_3([h_X, h_i])$$

where $[\cdot,\cdot]$ denotes vector concatenation and $f_3$ is a standard neural network.
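The message passing and aggregation formulas are not reproduced in this text. One form that is consistent with the symbols defined here is the common message passing scheme below; it is an assumption rather than the formula of the application. In particular, the number of iterations $T$, the zero initialization $m^{0}_{i,j} = 0$ and the use of a sum for the molecule-level aggregation are choices made for this sketch.

```latex
\begin{align}
m^{t}_{i,j} &= f_1\Big(x_i,\; x_{i,j},\; \sum_{k \in N(i)\setminus j} m^{t-1}_{k,i}\Big),
  \qquad t = 1,\dots,T \\
h_i &= f_2\Big(x_i,\; \sum_{k \in N(i)} m^{T}_{k,i}\Big) \\
h_X &= \sum_{i \in V_X} h_i \\
s_i &= f_3\big([h_X, h_i]\big)
\end{align}
```

Only the last line is taken directly from the text; the first three lines illustrate how $f_1$, $f_2$ and the neighbor sets $N(i)$ and $N(i)\setminus j$ would typically be combined.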

In the same way as for $s^{te}_i$, $s_i$ is normalized by the following formula:

$$s^{st}_i = \frac{\exp(s_i)}{\sum_{k \in V_X} \exp(s_k)}$$

where the node with the largest $s^{st}_i$ is the predicted clipping center $c^{st}$.

In the model training phase, $s^{st}_i$ is fitted to the label values $s^{te}_i$ through a loss function (e.g., the KL divergence). In the test phase, it suffices to output the node with the largest $s^{st}_i$ as the clipping center $c^{st}$.
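The training objective for the clipping center can be illustrated numerically. The sketch below computes the KL divergence between a label distribution and a predicted distribution, both obtained by exponential normalization; the raw values are hypothetical, and the small constant `eps` is an assumption added for numerical safety.

```python
import math

def softmax(values):
    """Exponential normalization of a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same atoms."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

s_te = softmax([5.0, 3.0, 1.0])    # labels derived from hypothetical shared-atom counts
s_st = softmax([2.0, 1.5, -1.0])   # hypothetical outputs of the network f3

loss = kl_divergence(s_te, s_st)
predicted_center = max(range(len(s_st)), key=s_st.__getitem__)
print(round(loss, 4), predicted_center)
```

At test time only the last step is needed: the index with the largest predicted probability is returned as the clipping center.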

Further, determining the branches to be clipped comprises: acquiring a third vector representation of a third branch, wherein the third branch is a branch of the clipping center; and predicting the retention probability of the third branch from the first vector representation, the third vector representation and the vector representation of the first branch, so as to decide whether the third branch is retained.

Here the third branch is any branch of the clipping center.

Specifically, starting from the clipping center, a prediction is made for each branch around it to decide which branches are retained and which are deleted. A vector representation of any branch is obtained from the vector representations $h_i$ by a preset equation, in which $c^{st}$ is the clipping center, the subgraph term denotes a branch subgraph around the clipping center $c^{st}$ in the source molecule, and $|\cdot|$ is the number of atoms in a subgraph.

Then the retention probability of branch j is predicted by a neural network, where $f_4$ is a standard neural network and $\sigma$ is the sigmoid function. Its three inputs are, respectively, the vector representation of the clipping center, the vector representation of the branch whose retention is currently being decided, and the vector representation of the branches already determined to be retained. An output greater than or equal to 0.5 means retention, and an output less than 0.5 means deletion. The index t-1 denotes the previous iteration: in each iteration the model decides whether one branch is retained or deleted, and if the branch is retained it is added to the set $U^{st}_{t-1}$, each element of which is a subgraph. The vector representation of this set is obtained by a preset formula.

In this way the complete retention region after clipping is obtained. As shown in FIG. 2, the shaded portion is retained and the boxed region is deleted.

In the test phase, the output is obtained directly by the above process. In the training phase, the output needs to be fitted to U, and the cross entropy between the two can be used as the loss function.
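The iterative retain-or-delete decision can be sketched as follows. Several parts are assumptions of this illustration rather than details given in the text: branch and retained-set vectors are computed by mean pooling, an empty retained set is represented by a zero vector, and `f4` is a hypothetical fixed linear scorer standing in for a trained network.

```python
import math

def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decide_branches(h, center, branches, f4):
    """Decide branch by branch whether to retain it; returns (retained branches, probabilities)."""
    kept, kept_atoms, probs = [], [], []
    dim = len(h[center])
    for branch in branches:
        h_branch = mean_vec([h[a] for a in branch])
        h_kept = mean_vec([h[a] for a in kept_atoms]) if kept_atoms else [0.0] * dim
        p = sigmoid(f4(h[center] + h_branch + h_kept))  # list concatenation of the three vectors
        probs.append(p)
        if p >= 0.5:  # >= 0.5 means retain, < 0.5 means delete
            kept.append(branch)
            kept_atoms.extend(branch)
    return kept, probs

def cross_entropy(probs, labels):
    """Mean binary cross entropy between predicted retention probabilities and 1/0 labels."""
    terms = [y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(probs, labels)]
    return -sum(terms) / len(terms)

# Hypothetical one-dimensional atom vectors and a hypothetical scorer that reads the branch vector.
h = {1: [0.0], 2: [1.0], 3: [1.0], 7: [-2.0]}
weights = [0.0, 1.0, 0.0]
f4 = lambda v: sum(w * x for w, x in zip(weights, v))

kept, probs = decide_branches(h, center=1, branches=[(2, 3), (7,)], f4=f4)
print(kept, round(cross_entropy(probs, [1, 0]), 4))
```

In training, the probabilities would be fitted to the labels in U through the cross entropy; at test time only the thresholded decisions are used.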

Further, converting the source molecule into the new SMILES string comprises: converting the source molecule into a standard SMILES representation to obtain an encoded representation of the source molecule; and processing the encoded representation of the source molecule to obtain the new SMILES string.

Specifically, the source molecule is converted into a standard SMILES representation, and a first long short-term memory (LSTM) network encodes the SMILES characters in their order of appearance; a second LSTM network then decodes the new SMILES string from the encoding of the source molecule.

Further, splicing the source molecule retention region with the new SMILES string to obtain the target molecule comprises: converting the new SMILES string into a molecular graph; and merging the molecular graph with the source molecule retention region to generate the target molecule.

Specifically, the new SMILES string is converted into a molecular graph, and the first atom in the SMILES representation (i.e., the clipping center) is then spliced and merged with the source molecule retention region to obtain the final complete target molecule.
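The merging step can be illustrated on plain graph data structures. In the sketch below the generated fragment is assumed to be already parsed into a list of atoms and index-based bonds whose first atom is the clipping center; the SMILES parsing itself is not shown, and the function name `splice` and the data layout are assumptions of this example.

```python
def splice(kept_elements, kept_bonds, center, frag_elements, frag_bonds):
    """Merge a generated fragment into the retained region by identifying its first atom with `center`."""
    if frag_elements[0] != kept_elements[center]:
        raise ValueError("fragment must start with the clipping-center element")
    elements = dict(kept_elements)
    bonds = list(kept_bonds)
    next_id = max(elements) + 1
    mapping = {0: center}                      # fragment atom 0 is the clipping center itself
    for idx in range(1, len(frag_elements)):   # every other fragment atom gets a fresh id
        mapping[idx] = next_id
        elements[next_id] = frag_elements[idx]
        next_id += 1
    bonds.extend((mapping[a], mapping[b]) for a, b in frag_bonds)
    return elements, bonds

# Retained region: C1 with the branches (C2, H3, H4, H5) and (H6).
kept_elements = {1: "C", 2: "C", 3: "H", 4: "H", 5: "H", 6: "H"}
kept_bonds = [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6)]

# Hypothetical generated fragment C-O-H, listed from the clipping center outwards.
elements, bonds = splice(kept_elements, kept_bonds, 1, ["C", "O", "H"], [(0, 1), (1, 2)])
print(elements)
print(bonds)
```

The retained region is left untouched; only new atoms and bonds are appended, so atom identifiers of the source molecule remain valid in the result.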

As shown in FIG. 4, the present application further provides a molecular optimization system comprising: a molecule clipping unit 1 for determining a source molecule retention region, wherein the source molecule retention region is a molecular substructure common to the source molecule and the target molecule; the molecule clipping unit 1 comprises an analysis module and a molecule clipper; the analysis module is used for analyzing the region to be clipped of the source molecule; and the molecule clipper is used for predicting the molecule retention region.

The molecule clipping unit 1 further comprises a database and a molecule pair matching module, wherein the database is used for providing molecule pair data, and the molecule pair matching module is used for acquiring qualifying molecule pairs from the database, each molecule pair comprising a source molecule and a target molecule. In the test phase the target molecule is unknown and the molecule pair matching module does not need to be called; in the training phase molecule pairs are selected through the molecule pair matching module.

A time-series unit 2 for converting the source molecule into a new SMILES string, wherein the new SMILES string represents the non-common molecular substructure of the target molecule; the time-series unit 2 comprises a first time-series module and a second time-series module, wherein the first time-series module is used for acquiring an encoded representation of the whole source molecule, and the second time-series module is used for acquiring the new SMILES string.

A molecule splicing unit 3 for splicing the source molecule retention region with the new SMILES string to obtain the target molecule. The molecule splicing unit 3 comprises a molecule splicing module, which is used for generating the target molecule.

The molecule splicing module comprises a molecular graph conversion sub-module and a merging sub-module, wherein the molecular graph conversion sub-module converts the new SMILES string into a molecular graph, and the merging sub-module merges the molecular graph with the retention region to generate the target molecule.

The first time-series module is an encoder, and the second time-series module is a decoder.

This part as a whole adopts an encoder-decoder framework, as shown in FIG. 3.

Specifically, the encoder converts the input molecule into a SMILES representation and then encodes the SMILES characters in their order of occurrence using a standard long short-term memory (LSTM) network. The hidden state of the LSTM at the last time step is taken as the final output C of the encoder.

The decoder decodes a SMILES string, which is the remaining part to be generated, using an LSTM separate from that of the encoder.

In this LSTM decoder, the initial hidden state is the C provided by the encoder, in which the information of the source molecule has been encoded. The output of the LSTM at each time step is a specific character sampled from a character set consisting of all possible constituent letters or symbols of SMILES, such as Br, Cl, N, O, S, p.
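Because the character set contains two-letter symbols such as Br and Cl, a SMILES string has to be split into symbols rather than single characters before it is fed to the encoder or compared with the decoder output. The sketch below shows one simplified way to do this with a regular expression; the pattern covers only a small subset of SMILES syntax and is an assumption of this example, as are the names `TOKEN` and `tokenize`.

```python
import re

# Two-letter halogens must be tried before single letters so that "Cl" is not read as "C" + "l".
TOKEN = re.compile(r"Br|Cl|[A-Za-z]|\d|[()\[\]=#+\-@/\\%.]")

def tokenize(smiles):
    """Split a SMILES string into symbols; raise if any character is not recognised."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised character in SMILES string")
    return tokens

tokens = tokenize("CC(Cl)Br")
vocab = sorted(set(tokens)) + ["</s>"]  # character set plus the end-of-generation symbol
print(tokens)
print(vocab)
```

The vocabulary built this way fixes the order of the symbols, which is what an index-based or one-hot encoding of the decoder inputs and outputs relies on.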

In the decoder, the character output at the current time step is used as the input at the next time step; for example, the character "C" predicted by the model at time step t-1 is used as the input to the LSTM at time step t. In particular, the input at the first time step is fixed to the clipping center previously selected in the source molecule, and generation starts from there; when the output character is "</s>", generation is complete. The input and output of the decoder can use simple one-hot coding, i.e., coding vectors constructed over the character set.
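The decoding loop can be sketched independently of any particular network. In the illustration below, `step` is a hypothetical callable standing in for one LSTM step (it takes a one-hot input and a state and returns a probability vector and a new state), the most probable character is chosen at each step instead of sampling, and the returned string is assumed to begin with the clipping center symbol.

```python
def one_hot(token, vocab):
    """One-hot coding vector of `token` over the character set `vocab`."""
    return [1.0 if t == token else 0.0 for t in vocab]

def greedy_decode(step, state, first_token, vocab, end_token="</s>", max_len=50):
    """Feed each output character back as the next input until the end symbol appears."""
    out, token = [first_token], first_token
    for _ in range(max_len):
        probs, state = step(one_hot(token, vocab), state)
        token = vocab[max(range(len(vocab)), key=probs.__getitem__)]
        if token == end_token:
            break
        out.append(token)
    return "".join(out)

# Stand-in for a trained decoder: replays a fixed script, ignoring its input.
VOCAB = ["C", "O", "</s>"]
SCRIPT = ["O", "</s>"]

def scripted_step(x, state):
    return one_hot(SCRIPT[state], VOCAB), state + 1

print(greedy_decode(scripted_step, 0, "C", VOCAB))
```

With a real decoder, `state` would be initialized with the encoder output C, and `max_len` guards against a model that never emits the end symbol.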

In the training phase, the SMILES string generated by this part needs to be fitted to the SMILES string of the molecular subgraph to be generated, and cross entropy can likewise be used as the loss function.

The present application further provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps in any of the method embodiments described above are implemented.

The terminal device of this embodiment includes: at least one processor (only one is shown in FIG. 5), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps in any of the molecular optimization method embodiments described above when executing the computer program.

The terminal device can be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the terminal device described here is merely an example and does not constitute a limitation; it may include more or fewer components than those shown, combine some components, or use different components, such as input and output devices, network access devices, etc.

The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. In other embodiments, the memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like provided on the terminal device.

Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory may also be used to temporarily store data that has been output or is to be output.

The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.

The embodiments of the present application provide a computer program product which, when run on a terminal device, enables the terminal device to implement the steps in the above method embodiments. The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to the terminal device, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

Claims

1. A method of molecular optimization, characterized by: the method comprises the following steps:

obtaining source molecules, and cutting the source molecules according to target molecules to obtain a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecules and the target molecules;

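converting the source molecule into a new simplified molecule linear input canonical string, wherein the new simplified molecule linear input canonical string is a non-common molecule substructure part in the target molecule;

and splicing the source molecule retention region and the new simplified molecule linear input canonical string to obtain the target molecule.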

2. The molecular optimization method of claim 1, wherein: the cutting comprises the following steps:

analyzing the region to be cut of the source molecule, determining a cutting center according to the region to be cut, determining a cutting branch according to the cutting center, and cutting the cutting branch to obtain the source molecule retention region.

3. The molecular optimization method of claim 2, wherein: the analyzing the region to be cut of the source molecule comprises:

traversing a first node in the source molecule and a second node in the target molecule, wherein the first node and the second node have the same chemical elements;

traversing the branches of the first node to obtain a first branch, traversing the branches of the second node to obtain a second branch, wherein the first branch is the same as the second branch, and recording the number of nodes in the first branch or the second branch;

4. The molecular optimization method of claim 3, wherein: the determining the cutting center comprises:

acquiring a first vector representation of the first node; aggregating the first vector representations to obtain a second vector representation of the source molecule; combining the first vector representation and the second vector representation to predict the probability that the first node is taken as the cutting center; and normalizing the number of nodes to obtain a node probability distribution, wherein the node with the maximum value in the node probability distribution is taken as the cutting center.

5. The molecular optimization method of claim 4, wherein: the determining the cutting branch comprises:

acquiring a third vector representation of a third branch, wherein the third branch is a branch of the cutting center, and deciding whether the third branch is retained by predicting the retention probability of the third branch from the first vector representation, the third vector representation, and the vector representation of the first branch.

6. The molecular optimization method of claim 1, wherein: converting the source molecule into a simplified molecule linear input canonical string includes:

converting the source molecule into a standard simplified molecule linear input canonical representation to obtain a coded representation of the source molecule;

processing the encoded representation of the source molecule to obtain the new simplified molecule linear input canonical string.

7. The molecular optimization method of claim 1, wherein: splicing the source molecule retention region and the new simplified molecule linear input canonical string to obtain the target molecule comprises the following steps:

converting the new simplified molecule linear input canonical string into a molecular graph;

and merging the molecular graph and the source molecule retention region to generate the target molecule.

8. A molecular optimization system, characterized by: the system comprises:

a molecule cutting unit for determining a source molecule retention region, wherein the source molecule retention region is a common molecular substructure of the source molecule and the target molecule;

a time-series unit for converting the source molecule into a new simplified molecule linear input canonical string, the new simplified molecule linear input canonical string being a non-common molecular substructure part in the target molecule;

and a molecule splicing unit for splicing the source molecule retention region and the new simplified molecule linear input canonical string to obtain the target molecule.

9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, implements the method of any of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that: the computer program when executed by a processor implements the method of any one of claims 1 to 7.