CN115798621A - Transformer-based context-aware single-step inverse synthesis prediction method and device

Info

Publication number: CN115798621A
Application number: CN202211467767.6A
Authority: CN
Inventors: 张强; 韩玉强; 宫志晨; 陈华钧
Original assignee: ZJU Hangzhou Global Scientific and Technological Innovation Center
Current assignee: ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority date: 2022-11-22
Filing date: 2022-11-22
Publication date: 2023-03-14

Abstract

The invention discloses a Transformer-based context-aware single-step inverse synthesis prediction method and device. The method comprises the following steps: searching the synthetic path search tree for the context path from the target molecule to the selected molecule; encoding the context molecule information related to the chemical reaction templates in the context path to obtain a context reaction characterization vector sequence, and encoding the selected molecule to obtain its deep semantic feature vector; encoding the concatenation of the context reaction characterization vector sequence and the deep semantic feature vector with a Transformer model to obtain a molecular characterization vector that fuses the context information; and predicting candidate chemical reaction templates from the molecular characterization vector with a classifier, then selecting the high-ranking candidate chemical reaction templates according to the prediction result and applying them to single-step inverse synthesis from the selected molecule to the target molecule. The method and device improve the accuracy of chemical reaction template prediction for molecules.

Description

Transformer-based context-aware single-step inverse synthesis prediction method and device

Technical Field

The invention belongs to the field of intelligent design of a chemical molecule inverse synthetic route, and particularly relates to a context-aware single-step inverse synthetic prediction method and device based on a Transformer.

Background

Inverse synthesis analysis (retrosynthetic analysis) studies how to design a reasonable and efficient synthesis path, that is, a series of chemical reactions, for a given target molecule starting from available basic molecules. As an important field of organic chemistry, it is applied directly to drug discovery and material design. Because the space of possible reaction combinations is enormous, the problem is challenging even for experienced chemists. Machine-learning-assisted inverse synthesis analysis has therefore been developed in recent years to help chemists find high-quality reaction routes.

Most machine-learning-assisted inverse synthesis analysis methods formulate inverse synthesis as a single-player game solved by a search algorithm and represent the synthesis path as a tree, i.e., a synthesis tree. Such a method comprises two core modules: a search algorithm module and a single-step inverse synthesis prediction module. The model first selects the most promising node using a heuristic function, either manually designed or learned, and then expands the synthesis tree using a single-step inverse synthesis model. In recent years, several efficient search algorithms have been applied to inverse synthesis analysis, such as Monte Carlo tree search and A* search. The single-step inverse synthesis prediction module predicts, for a given molecule, the precursor compounds from which that molecule can be synthesized. A well-designed single-step inverse synthesis prediction model reduces the search space and improves search efficiency.

At present, much research focuses on improving single-step inverse synthesis prediction based on chemical reaction templates, for example the single-step inverse synthesis method and system based on a multi-semantic network disclosed in patent document CN114496105A and the single-step inverse synthesis method and system disclosed in patent document CN112397155A. However, these methods use only the feature information of the molecule itself and ignore the context information available during the reaction path search, so they cannot obtain the optimal search result.

Disclosure of Invention

In view of the above, the present invention provides a Transformer-based context-aware single-step inverse synthesis prediction method and device, which improve the accuracy of chemical reaction template prediction by fusing the context information available during the synthesis route search, thereby improving the search efficiency and result quality of the whole synthesis route.

To achieve the above object, an embodiment of the present invention provides a Transformer-based context-aware single-step inverse synthesis prediction method, including the following steps:

context path extraction: acquiring a synthetic path search tree and determining target molecules, and searching context paths from the target molecules to selected molecules in the synthetic path search tree, wherein the context paths comprise a chemical reaction template and molecules;

context path coding: coding context molecule information related to a chemical reaction template in a context path to obtain a context reaction representation vector sequence, and coding a selected molecule to obtain a deep semantic feature vector of the selected molecule;

context molecular coding: encoding the concatenation of the context reaction characterization vector sequence and the deep semantic feature vector with a Transformer model to obtain a molecular characterization vector that fuses the context information;

chemical reaction template prediction: predicting candidate chemical reaction templates from the molecular characterization vector that fuses the context information with a classifier, and selecting the high-ranking candidate chemical reaction templates according to the prediction result, to be applied to single-step inverse synthesis from the selected molecule to the target molecule.

In one embodiment, the synthetic path search tree employs an AND-OR tree structure in which AND nodes represent chemical reaction templates and OR nodes represent molecules.

In one embodiment, the selected molecule is encoded with a first neural network: the selected molecule is first encoded as a molecular fingerprint, the fingerprint is input into the first neural network, and the network outputs the deep semantic feature vector of the selected molecule.

In one embodiment, the context molecule information related to a chemical reaction template is encoded with a second neural network: the chemical reaction template is first encoded as a chemical reaction difference fingerprint computed from its product molecules and reactant molecules, the difference fingerprint is input into the second neural network together with the fingerprint vector of the reactant molecule, and the network outputs the context reaction characterization vector of that template. The context reaction characterization vectors of the chemical reaction templates together form the context reaction characterization vector sequence.

In one embodiment, the chemical reaction template is encoded using the following formula:

$$f(r) = \sum_{j=1}^{n} p_j - \sum_{i=1}^{k} s_i$$

where $f(r)$ is the chemical reaction difference fingerprint of the chemical reaction template $r$, $p_j$ is the fingerprint vector of the $j$-th product molecule, $s_i$ is the fingerprint vector of the $i$-th reactant molecule, $n$ is the total number of product molecules, and $k$ is the total number of reactant molecules.

In one embodiment, the first neural network is an MLP, a CNN, or a self-attention network, and the second neural network is an MLP, a CNN, or a self-attention network.

In one embodiment, when a Transformer model is used to encode the concatenation of the context reaction characterization vector sequence and the deep semantic feature vector, the position code of each vector must also be input, and the molecular characterization vector $m_c$ that fuses the context information is calculated using the following formula:

$$m_c = \mathrm{Transformer}\left([\,Q(r_1) + t_1,\; Q(r_2) + t_2,\; \ldots,\; m_r + t_n\,]\right)$$

where $m_r$ is the deep semantic feature vector of the selected molecule, $Q(r_1)$ and $Q(r_2)$ are the context reaction characterization vectors of two chemical reaction templates, and $t_1$, $t_2$, $t_n$ are position-encoding vectors.

In one embodiment, the position-encoding vector represents the distance to the target molecule and is calculated as:

where $t_{k,2i}$ and $t_{k,2i+1}$ denote the $2i$-th and $(2i+1)$-th components of the encoding vector for position $k$, and $d$ is the dimension of the position-encoding vector.
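The component definitions above (even index $2i$, odd index $2i+1$, dimension $d$) match the sinusoidal position encoding commonly used with Transformer models. Assuming that form, and assuming the customary base of 10000 (the base is not stated in this text), the encoding of position $k$ would read:

```latex
t_{k,2i} = \sin\!\left(\frac{k}{10000^{2i/d}}\right), \qquad
t_{k,2i+1} = \cos\!\left(\frac{k}{10000^{2i/d}}\right)
```

Under this assumption, $k$ is the distance of a context element from the target molecule, so elements at the same depth of the search tree receive the same position code.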

In one embodiment, the method further comprises applying the method in a chemical molecule synthesis path search algorithm to carry out the single-step inverse synthesis prediction task, including: using the context-aware single-step inverse synthesis prediction method to predict and select high-ranking chemical reaction templates and product molecules, and adding these chemical reaction templates and product molecules to the synthetic path search tree.

To achieve the above object, an embodiment of the present invention provides a Transformer context-aware single-step inverse synthesis prediction apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above Transformer-based context-aware single-step inverse synthesis prediction method steps when executing the computer program.

Compared with the prior art, the invention has at least the following beneficial effects:

For single-step inverse synthesis prediction during the search for an inverse synthesis path of a chemical molecule, the chemical reaction templates already determined between the target molecule and the selected molecule in the synthetic path search tree are used as context. The chemical reaction templates in the context path are encoded to obtain a context reaction characterization vector sequence, and the selected molecule is encoded to obtain a deep semantic feature vector. The two are concatenated and used as the input of a Transformer, which models the influence of the context information on the selected molecule and on the candidate chemical reaction templates. This raises the predicted probability and ranking of the optimal candidate chemical reaction template and thereby improves the accuracy of single-step inverse synthesis prediction. Combined with a search algorithm, the method can improve the search efficiency of the whole synthesis path and the quality of the resulting path, as measured by indicators such as shorter path length and higher synthesis yield.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of a Transformer-based context-aware single-step inverse synthesis prediction method according to an embodiment;

FIG. 2 is a schematic diagram of the application of the context-aware single-step inverse synthesis prediction method in a synthesis path search according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

FIG. 1 is a flowchart of the Transformer-based context-aware single-step inverse synthesis prediction method. As shown in FIG. 1, the method according to the embodiment includes the following steps:

Step 1, context path extraction: obtaining a synthetic path search tree and determining a target molecule, and searching the synthetic path search tree for the context path from the target molecule to a selected molecule, wherein the context path comprises chemical reaction templates and molecules.

In an embodiment, the synthetic path search tree characterizes the synthesis path of the target molecule and adopts an AND-OR tree structure, in which AND nodes represent chemical reaction templates and OR nodes represent molecules. As shown in FIG. 1, the OR nodes representing molecules are drawn as circles and the AND nodes representing chemical reaction templates are drawn as squares, so the AND-OR tree structure represents the structure of the synthesis path clearly and intuitively.

In an embodiment, the synthesis path already determined between the target molecule and the selected molecule in the synthetic path search tree is used as the context path. As shown in FIG. 1, the chemical reaction templates and molecules between the target molecule t and the selected molecule m form the context path of m. The context path carries context information that has an important influence on the ranking of the candidate reaction templates of the selected molecule. As shown in FIG. 2, the synthetic path search tree may be generated by the synthesis path search algorithm Retro*.
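The context path extraction of Step 1 can be sketched as a walk from the selected molecule back to the root of the AND-OR tree. This is a minimal illustration only: the `Node` class, its field names, and the toy tree are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One node of a toy AND-OR synthesis tree (hypothetical structure)."""
    kind: str                          # "mol" for an OR node, "rxn" for an AND node
    label: str                         # molecule name or template identifier
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, kind: str, label: str) -> "Node":
        child = Node(kind, label, parent=self)
        self.children.append(child)
        return child


def context_path(selected: Node) -> List[Node]:
    """Return the nodes from the target molecule (root) down to `selected`."""
    path = []
    node: Optional[Node] = selected
    while node is not None:
        path.append(node)
        node = node.parent
    path.reverse()                     # target molecule first, selected molecule last
    return path


# Toy tree: t --r1--> {a, b}, then a --r2--> {m, c}
t = Node("mol", "t")
r1 = t.add_child("rxn", "r1")
a = r1.add_child("mol", "a")
b = r1.add_child("mol", "b")
r2 = a.add_child("rxn", "r2")
m = r2.add_child("mol", "m")
c = r2.add_child("mol", "c")

path = context_path(m)
print([n.label for n in path])                      # ['t', 'r1', 'a', 'r2', 'm']
print([n.label for n in path if n.kind == "rxn"])   # ['r1', 'r2']
```

The templates on the path (`r1`, `r2` here) are the elements that Step 2 would encode into the context reaction characterization vector sequence.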

Step 2, context path coding: encoding the context molecule information related to the chemical reaction templates in the context path to obtain a context reaction characterization vector sequence, and encoding the selected molecule to obtain the deep semantic feature vector of the selected molecule.

In an embodiment, the encodings of molecules and chemical reaction templates are initialized using molecular fingerprints and chemical reaction difference fingerprints, respectively. Specifically, the Extended Connectivity Fingerprint (ECFP) of a molecule can be obtained with the cheminformatics toolkit RDKit, with a fingerprint vector length of 1024. This fingerprint algorithm represents molecular structure through circular atom neighborhoods and can capture a large number of different molecular features.

The chemical reaction difference fingerprint is the difference between the molecular fingerprints of the product molecules and those of the reactant molecules in a chemical reaction, and provides the initial encoding of the context molecule information related to a chemical reaction template. It is computed by subtracting the reactant molecule fingerprints from the product molecule fingerprints, expressed as follows:

$$f(r) = \sum_{j=1}^{n} p_j - \sum_{i=1}^{k} s_i$$
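The subtraction between product and reactant fingerprints can be sketched with plain lists. The 8-bit vectors below are made-up toy values standing in for the 1024-bit ECFP vectors, and the function name `difference_fingerprint` is hypothetical.

```python
from typing import List, Sequence


def difference_fingerprint(products: Sequence[Sequence[int]],
                           reactants: Sequence[Sequence[int]]) -> List[int]:
    """Sum of the product fingerprints minus the sum of the reactant fingerprints."""
    diff = [0] * len(products[0])
    for fp in products:
        for idx, bit in enumerate(fp):
            diff[idx] += bit
    for fp in reactants:
        for idx, bit in enumerate(fp):
            diff[idx] -= bit
    return diff


# Toy 8-bit fingerprints; real ECFP vectors would have 1024 entries.
products = [[1, 1, 0, 1, 0, 0, 1, 0]]
reactants = [[1, 0, 0, 1, 0, 0, 0, 0],
             [0, 1, 1, 0, 0, 0, 0, 0]]

print(difference_fingerprint(products, reactants))
# [0, 0, -1, 0, 0, 0, 1, 0]
```

Positions where the result is nonzero mark substructures that appear on only one side of the reaction, which is what makes the difference a compact description of the transformation.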

The initial encoding of the context molecule information related to a chemical reaction template is input into a second neural network together with the fingerprint vector p of the reactant molecule, and the second neural network outputs the context reaction characterization vector of that chemical reaction template. The context reaction characterization vectors of the chemical reaction templates form the context reaction characterization vector sequence, which represents the influence of the context reactions on the chemical reaction template prediction for the selected molecule.

When the second neural network is an MLP, the context reaction characterization vector $Q(r)$ of a chemical reaction template is obtained using the following equation:

$$Q(r) = \mathrm{MLP}(p, f(r))$$

Based on the initial encoding of the selected molecule, i.e. its fingerprint vector, the fingerprint vector $m$ of the selected molecule is input into the first neural network, which outputs the deep semantic feature vector $m_r$ of the selected molecule; this vector is used to learn the degree of association between the molecule and the reaction templates.

When the first neural network is an MLP, the deep semantic feature vector $m_r$ of the selected molecule $m$ is calculated using the following formula:

$$m_r = \mathrm{MLP}(m)$$
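A minimal sketch of the two MLP encoders, assuming that the two arguments of MLP(p, f(r)) are combined by concatenation (the text does not say how they are combined). Layer sizes, the random untrained weights, and all helper names are illustrative assumptions.

```python
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[List[float]], List[float]]


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random weights and zero biases for one dense layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Vector, layer: Layer, relu: bool) -> Vector:
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def mlp(x: Vector, layers: List[Layer]) -> Vector:
    """ReLU on hidden layers, linear output layer."""
    for i, layer in enumerate(layers):
        x = dense(x, layer, relu=(i < len(layers) - 1))
    return x


rng = random.Random(0)
fp_len, hidden, d_model = 8, 16, 4    # toy sizes; the text uses 1024-bit fingerprints

first_net = [make_layer(fp_len, hidden, rng), make_layer(hidden, d_model, rng)]
second_net = [make_layer(2 * fp_len, hidden, rng), make_layer(hidden, d_model, rng)]

m = [1, 0, 0, 1, 0, 0, 0, 0]          # fingerprint of the selected molecule
p = [1, 1, 0, 1, 0, 0, 1, 0]          # molecule fingerprint passed alongside f(r)
f_r = [0, 0, -1, 0, 0, 0, 1, 0]       # chemical reaction difference fingerprint

m_r = mlp(m, first_net)               # m_r = MLP(m)
q_r = mlp(p + f_r, second_net)        # Q(r) = MLP(p, f(r)), inputs concatenated (assumption)
print(len(m_r), len(q_r))             # 4 4
```

Both encoders map into the same dimension so that their outputs can be placed in one sequence for the Transformer in Step 3.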

Step 3, context molecule coding: encoding the concatenation of the context reaction characterization vector sequence and the deep semantic feature vector with a Transformer model to obtain a molecular characterization vector that fuses the context information.

In the embodiment, when the Transformer model is used for encoding, the position code of each vector must be provided as input in addition to the context reaction characterization vector sequence and the deep semantic feature vector $m_r$. Based on the concatenation of all inputs, the Transformer model models the influence of the context information on the chemical reaction template prediction. Specifically, the molecular characterization vector $m_c$ that fuses the context information is calculated using the following formula:

$$m_c = \mathrm{Transformer}\left([\,Q(r_1) + t_1,\; Q(r_2) + t_2,\; \ldots,\; m_r + t_n\,]\right)$$
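How the input sequence is assembled and mixed can be sketched with a single scaled dot-product self-attention step. This is a simplified stand-in, not a full Transformer encoder: it uses identity projections and omits multiple heads, feed-forward layers, and normalization. Reading the output at the last position as $m_c$ is also an assumption, since the text does not state how the fused vector is read out. All numeric values are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention with identity projections."""
    d = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in seq]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, seq)) for j in range(d)])
    return out


def fuse_context(context_vectors: List[Vector], m_r: Vector,
                 position_codes: List[Vector]) -> Vector:
    """Build [Q(r_1)+t_1, ..., m_r+t_n], attend, and return the last position."""
    seq = context_vectors + [m_r]
    seq = [[x + t for x, t in zip(vec, code)] for vec, code in zip(seq, position_codes)]
    return self_attention(seq)[-1]


q1, q2 = [0.2, 0.1, 0.0, 0.3], [0.0, 0.4, 0.1, 0.1]   # Q(r_1), Q(r_2)
m_r = [0.5, 0.0, 0.2, 0.1]                            # deep semantic feature vector
t = [[0.0] * 4, [0.1] * 4, [0.2] * 4]                 # placeholder position codes

m_c = fuse_context([q1, q2], m_r, t)
print(len(m_c))   # 4
```

Because every output position is a weighted average over the whole sequence, the vector read out for the selected molecule depends on the context reactions as well as on $m_r$ itself.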

In an embodiment, the position-encoding vector represents a distance to the target molecule, and the calculation formula is:
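A minimal sketch of such a position code, assuming the standard sinusoidal encoding with the customary base of 10000. Both the functional form and the base are assumptions here; the function name `position_code` and the way distances are counted are hypothetical.

```python
import math
from typing import List


def position_code(k: int, d: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal code for position k with dimension d; `base` is an assumed constant."""
    code = [0.0] * d
    for i in range(0, d, 2):              # i runs over the even component indices
        angle = k / (base ** (i / d))
        code[i] = math.sin(angle)         # even component
        if i + 1 < d:
            code[i + 1] = math.cos(angle) # odd component
    return code


# Assumed distances to the target molecule for two context templates
# and the selected molecule.
codes = [position_code(k, d=4) for k in (1, 2, 3)]
print(position_code(0, d=4))   # [0.0, 1.0, 0.0, 1.0]
print(len(codes), len(codes[0]))   # 3 4
```

Each code has the same dimension as the vectors it is added to, so it can be summed element-wise with $Q(r_i)$ or $m_r$ before the Transformer.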

Step 4, chemical reaction template prediction: predicting candidate chemical reaction templates from the molecular characterization vector that fuses the context information with a classifier, and selecting the high-ranking candidate chemical reaction templates according to the prediction result, to be applied to single-step inverse synthesis of the target molecule.

In the embodiment, a linear multi-class classifier is applied to the molecular characterization vector that fuses the context information to obtain the probability $p_i$ that the $i$-th candidate reaction template is applicable to the selected molecule, which serves as the prediction result for the candidate chemical reaction templates:

$$p_i = \mathrm{Softmax}(\mathrm{Linear}(m_c))$$

where $\mathrm{Linear}(\cdot)$ denotes a linear layer and $\mathrm{Softmax}(\cdot)$ denotes the softmax activation function.

After the probability values of all candidate reaction templates are obtained, the candidate chemical reaction templates are ranked according to the probability values, and k chemical reaction templates with the highest ranking are selected to be applied to single-step inverse synthesis from the selected molecules to the target molecules.
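The ranking step can be sketched as a softmax over the classifier outputs followed by a top-k selection. The logits below are made-up values standing in for the output of the linear layer, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_templates(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (template index, probability) pairs for the k most probable templates."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    return ranked[:k]


logits = [0.5, 2.0, -1.0, 1.2]               # one score per candidate template
top = top_k_templates(logits, k=2)
print([idx for idx, _ in top])               # [1, 3]
```

Only the selected templates are then applied to the selected molecule, which keeps the branching factor of the search tree bounded by k.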

In terms of application, the Transformer-based context-aware single-step inverse synthesis prediction method provided by the embodiment is applied in a chemical molecule synthesis path search algorithm to carry out the single-step inverse synthesis prediction task. The synthesis path search algorithm may be 3N-MCTS, which is based on Monte Carlo tree search, or Retro*, which is based on A* search. Applying the method improves the search efficiency and result quality of the whole synthesis path.

Applying the context-aware single-step inverse synthesis prediction method in a chemical molecule synthesis path search algorithm specifically comprises: using the context-aware single-step inverse synthesis prediction method to predict and select high-ranking chemical reaction templates and product molecules, and adding these chemical reaction templates and product molecules to the synthetic path search tree.

As shown in FIG. 2, the Transformer-based context-aware single-step inverse synthesis prediction method serves as the expansion module of the Retro* search algorithm and carries out the single-step inverse synthesis prediction task. The search process is as follows:

Selection: based on the current AND-OR synthesis tree of the given target molecule, selecting an optimal molecule for the subsequent expansion operation, which determines the optimal search direction for the synthesis path of the target molecule;

Expansion: for the selected optimal molecule, selecting the optimal candidate chemical reaction templates and reactant molecules using the Transformer-based context-aware single-step inverse synthesis prediction method, and then adding the new chemical reaction template nodes and reactant molecule nodes (AND-OR tree branches) to the current AND-OR synthesis tree;

Update: based on the newly added AND-OR tree branches, updating the cost information of the related nodes in the whole synthesis tree so as to re-estimate the total cost of the current synthesis path.

This completes one search iteration. The three steps are repeated until a synthesis path is found in which the selected molecules are available raw materials, which completes the synthesis path search for the target molecule.
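The selection, expansion, and update loop can be sketched as follows. This is a heavily simplified illustration and not Retro* itself: selection is first-in-first-out instead of cost-guided, only the top proposal is kept, and the cost update is left as a comment. The `search` function, the `expand` callback, and the toy reaction table are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# A proposal is (template id, reactant molecules); all names are hypothetical.
Proposal = Tuple[str, List[str]]


def search(target: str,
           building_blocks: Set[str],
           expand: Callable[[str, List[str]], List[Proposal]],
           max_steps: int = 10) -> Dict[str, Proposal]:
    """Simplified selection / expansion / update loop over open molecules."""
    route: Dict[str, Proposal] = {}
    # Each open entry holds a molecule and the templates on its context path.
    open_list: List[Tuple[str, List[str]]] = [(target, [])]
    for _ in range(max_steps):
        if not open_list:
            break                                   # every leaf is an available raw material
        # Selection: a real search would rank open molecules by estimated cost.
        molecule, context = open_list.pop(0)
        # Expansion: the single-step model sees the molecule and its context path.
        proposals = expand(molecule, context)
        if not proposals:
            continue
        template, reactants = proposals[0]
        route[molecule] = (template, reactants)
        # Update: a real search would refresh the costs of related nodes here.
        for reactant in reactants:
            if reactant not in building_blocks:
                open_list.append((reactant, context + [template]))
    return route


def toy_expand(molecule: str, context: List[str]) -> List[Proposal]:
    """Stand-in for the context-aware single-step model (fixed lookup table)."""
    table = {"t": [("r1", ["a", "b"])], "a": [("r2", ["m", "c"])]}
    return table.get(molecule, [])


route = search("t", {"b", "m", "c"}, toy_expand)
print(route)   # {'t': ('r1', ['a', 'b']), 'a': ('r2', ['m', 'c'])}
```

The point of the sketch is the `context` argument: each expansion call receives the templates already chosen between the target molecule and the molecule being expanded, which is the information the context-aware model conditions on.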

Based on the same inventive concept, the embodiment also provides a Transformer context-aware single-step inverse synthesis prediction device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the above Transformer-based context-aware single-step inverse synthesis prediction method steps.

In practical applications, the memory may be local volatile memory such as RAM, non-volatile memory such as ROM, flash memory, a floppy disk, or a mechanical hard disk, or remote cloud storage. The processor may be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), or a field-programmable gate array (FPGA); that is, the steps of the Transformer-based context-aware single-step inverse synthesis prediction method can be implemented by any of these processors.

The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

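1. A Transformer-based context-aware single-step inverse synthesis prediction method, characterized by comprising the following steps:

context path extraction: acquiring a synthetic path search tree and determining a target molecule, and searching the synthetic path search tree for the context path from the target molecule to a selected molecule, wherein the context path comprises chemical reaction templates and molecules;

context path coding: encoding the context molecule information related to the chemical reaction templates in the context path to obtain a context reaction characterization vector sequence, and encoding the selected molecule to obtain the deep semantic feature vector of the selected molecule;

context molecular coding: encoding the concatenation of the context reaction characterization vector sequence and the deep semantic feature vector with a Transformer model to obtain a molecular characterization vector that fuses the context information;

chemical reaction template prediction: predicting candidate chemical reaction templates from the molecular characterization vector that fuses the context information with a classifier, and selecting the high-ranking candidate chemical reaction templates according to the prediction result, to be applied to single-step inverse synthesis from the selected molecule to the target molecule.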

2. The Transformer-based context-aware single-step inverse synthesis prediction method according to claim 1, wherein the synthetic path search tree adopts an AND-OR tree structure, in which AND nodes represent chemical reaction templates and OR nodes represent molecules.

3. The Transformer-based context-aware single-step inverse synthesis prediction method of claim 1, wherein the selected molecule is encoded with a first neural network: the selected molecule is encoded as a molecular fingerprint, the fingerprint is input into the first neural network, and the deep semantic feature vector of the selected molecule is obtained through the encoding of the first neural network.

4. The Transformer-based context-aware single-step inverse synthesis prediction method of claim 1, wherein a second neural network is used to encode context molecule information related to the chemical reaction templates, the chemical reaction templates are encoded by using chemical reaction difference fingerprints based on product molecules and reactant molecules related to the chemical reaction templates and then input to the second neural network, meanwhile, fingerprint vectors of the reactant molecules are also input to the second neural network, a context reaction characterization vector of each chemical reaction template is obtained through encoding of the second neural network, and context reaction characterization vectors of a plurality of chemical reaction templates constitute a context reaction characterization vector sequence.

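5. The Transformer-based context-aware single-step inverse synthesis prediction method of claim 1, wherein the chemical reaction template is encoded by using the following formula:

$$f(r) = \sum_{j=1}^{n} p_j - \sum_{i=1}^{k} s_i$$

where $f(r)$ is the chemical reaction difference fingerprint of the chemical reaction template $r$, $p_j$ is the fingerprint vector of the $j$-th product molecule, $s_i$ is the fingerprint vector of the $i$-th reactant molecule, $n$ is the total number of product molecules, and $k$ is the total number of reactant molecules.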

6. The method of claim 3 or 4, wherein the first neural network is an MLP, a CNN, or a self-attention network, and the second neural network is an MLP, a CNN, or a self-attention network.

7. The method of claim 1, wherein when a Transformer model is used to encode the concatenation of the context reaction characterization vector sequence and the deep semantic feature vector, the position code of each vector must also be input, and the molecular characterization vector $m_c$ that fuses the context information is calculated using the following formula:

$$m_c = \mathrm{Transformer}\left([\,Q(r_1) + t_1,\; Q(r_2) + t_2,\; \ldots,\; m_r + t_n\,]\right)$$

8. The method of claim 7, wherein the position-coding vector represents a distance to a target molecule, and the formula is as follows:

9. The method of claim 1, further comprising applying the method in a chemical molecule synthesis path search algorithm to carry out the single-step inverse synthesis prediction task, including: using the context-aware single-step inverse synthesis prediction method to predict and select high-ranking chemical reaction templates and product molecules, and adding these chemical reaction templates and product molecules to the synthetic path search tree.

10. A Transformer context-aware single-step inverse synthetic prediction apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the Transformer-based context-aware single-step inverse synthetic prediction method steps of any one of claims 1-9 when executing the computer program.