CN114360662A - Single-step inverse synthesis method and system based on two-way multi-branch CNN - Google Patents



Publication number
CN114360662A
Authority
CN
China
Prior art keywords
reaction
branch
input
molecules
molecule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111569573.2A
Other languages
Chinese (zh)
Inventor
刘娟
杨锋
杨志辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111569573.2A priority Critical patent/CN114360662A/en
Publication of CN114360662A publication Critical patent/CN114360662A/en
Pending legal-status Critical Current


Abstract

The invention provides a single-step inverse synthesis method and system based on a two-way multi-branch CNN. To make a single-step inverse synthesis prediction, the SMILES sequence of the molecule to be predicted is input; after it passes through the two-way multi-branch convolutional layers, the feature splicing layer and the fully connected layer, the model outputs the top k reaction rules in the reaction rule set that can generate the molecule. The reactant SMILES of the target molecule are then computed by applying the output reaction rules to the SMILES of the molecule to be predicted, thereby automating single-step inverse synthesis. The embodiment also provides a single-step inverse synthesis system based on the two-way multi-branch CNN, which automates single-step inverse synthesis of target molecules through modules for reaction data set acquisition, training set construction, model training, single-step inverse synthesis prediction and result visualization. The invention can be used for both chemical and biological inverse synthesis and is more widely applicable than existing methods.

Description

Single-step inverse synthesis method and system based on two-way multi-branch CNN
Technical Field
The invention relates to single-step inverse synthesis analysis using molecular sequence information and molecular fingerprint information, in particular to a single-step inverse synthesis method and system based on a two-way multi-branch CNN, and belongs to the application of convolutional neural network machine learning models to single-step inverse synthesis.
Background
Retrosynthetic analysis (inverse synthesis), proposed by Corey et al. in the 1960s, is a technique widely adopted by chemists for designing synthetic routes to target molecules: the target molecule is recursively converted into simpler "precursor" molecules until commercially available starting molecules are identified. Since the 1960s, chemists have recognized the promise of computer-aided retrosynthesis and have devised various methods for synthesizing high-value compounds. Like chemical retrosynthesis, biological retrosynthesis, first proposed in 2010, is a process that can simplify the design of biosynthetic pathways. As conceptual pathway design strategies, biological and chemical retrosynthesis share common advantages: both guide researchers step by step toward simpler, available intermediates.
Both chemical and biological retrosynthesis require predicting the reactants of intermediates along the reaction pathway, i.e., single-step inverse synthesis, which predicts the reactants of a target molecule from the target molecule itself. Single-step inverse synthesis methods are classified as rule-based or rule-free according to whether reaction rules are used. Rule-based methods match target molecules against a large number of reaction rules, while rule-free methods treat inverse synthesis prediction as a sequence-to-sequence problem.
Existing single-step inverse synthesis prediction relies mainly on the SMILES sequence of the molecule. Rule-based methods match the target molecule against a large number of reaction rules or jointly model the rules and reactants; rule-free methods use a seq2seq model to predict the reactant SMILES sequence directly. However, the SMILES representation imposes a linear order on the atoms of a molecule and therefore does not effectively reflect the complex relationships between those atoms. Using the SMILES sequence alone results in poor prediction performance, so other features must be combined with it to provide more molecular information for prediction.
Disclosure of Invention
The invention provides a single-step inverse synthesis method and system based on a two-way multi-branch CNN. The method predicts the reaction rules capable of generating a target molecule; it requires no complex domain knowledge, predicts the reaction rules directly and yields a predicted probability for each, thereby addressing the poor prediction performance of prior-art methods.
In order to solve the above technical problem, a first aspect of the present invention provides a single-step inverse synthesis method based on two-way multi-branch CNN, including:
S1: obtaining a given reaction data set R comprising different reactions, each reaction comprising substrate molecules and a product molecule; constructing a molecule set S from the product molecules of the reactions in R, and constructing a reaction rule set T from the reactions in R;
S2: constructing an input data set D = {(s, t)} from the molecule set, the reaction rule set and the correspondence between molecules and reaction rules, wherein s denotes a molecule, t denotes a reaction rule, s ∈ S, t ∈ T, and a tuple (s, t) indicates that among the reactions corresponding to reaction rule t there is a reaction that can generate s;
S3: building a two-way multi-branch CNN to construct a single-step inverse synthesis prediction model;
S4: training the single-step inverse synthesis prediction model constructed in step S3, using the constructed input data set D as the training set, to obtain a trained prediction model A;
S5: inputting the target molecule to be predicted into the trained prediction model A, predicting the probability that each reaction rule in the reaction rule set T generates the target molecule, and outputting the k rules with the highest probability as the result, wherein k is a set parameter.
In one embodiment, the given reaction data set R in step S1 is a collated set of known reactions collected from public resources, and each reaction in R includes the following components: a reaction ID, SMILES sequences representing one or more reactants, and a SMILES sequence representing one product molecule; wherein an original reaction having a plurality of products is decomposed into a plurality of single-product reactions that are given the same ID.
In one embodiment, the two-way multi-branch CNN constructed in step S3 includes five layers: an input layer, two-way multi-branch convolutional layers, a feature splicing layer, a fully connected layer and an output layer;
wherein the input layer comprises an input node for inputting the SMILES sequence of the molecule;
the two-way multi-branch convolutional layers consist of two networks with similar structures and are used to obtain two convolution features of the input molecule, wherein each path comprises a plurality of branches, and each branch consists, in order, of convolution, batch normalization, Sigmoid activation and max pooling; for the input of each path, different branches convolve with kernels of different sizes, each branch yields a convolution vector after batch normalization, Sigmoid activation and max pooling, and these vectors are spliced to obtain the convolution feature corresponding to that path's input;
the splicing layer is used for splicing the two obtained convolution features to obtain the fused representation of the input molecule;
the fully connected layer calculates, through a Softmax function, the probability that each reaction rule in the reaction rule set T generates the input molecule, each probability lying in the range [0, 1];
the output layer comprises |T| nodes, one for each reaction rule in the reaction rule set T, where |T| denotes the size of the set T.
In one embodiment, in the two-way multi-branch convolutional layers, the input V1 of one path is an extended connectivity fingerprint with a radius of 2 generated from the SMILES sequence of the input molecule, and the input V2 of the other path is a one-hot encoding matrix generated from the SMILES sequence of the input molecule and an alphabet, wherein the alphabet consists of the predetermined symbols contained in the SMILES sequences of all molecules.
In one embodiment, the convolution kernels used by the two-way multi-branch convolutional layers are all one-dimensional; for either path, if the kernel size of the first branch is size_0, then the kernel size of the i-th branch is size_0 + (i-1) × step, where step is the increment by which the kernel size grows from one branch to the next.
In one embodiment, in step S5, the form of the SMILES sequence is converted before the target molecule to be predicted is input into the model.
Based on the same inventive concept, the second aspect of the present invention provides a single-step inverse synthesis system based on two-way multi-branch CNN, comprising:
a reaction data set acquisition module for obtaining a given reaction data set R comprising different reactions, each reaction comprising substrate molecules and a product molecule, constructing a molecule set S from the product molecules of the reactions in R, and constructing a reaction rule set T from the reactions in R;
a training set construction module for constructing an input data set D = {(s, t)} from the molecule set, the reaction rule set and the correspondence between molecules and reaction rules, wherein s denotes a molecule, t denotes a reaction rule, s ∈ S, t ∈ T, and a tuple (s, t) indicates that among the reactions corresponding to reaction rule t there is a reaction that can generate s;
a model building module for building a two-way multi-branch CNN to construct a single-step inverse synthesis prediction model;
a model training module for training the constructed single-step inverse synthesis prediction model, using the constructed input data set D as the training set, to obtain a trained prediction model A;
and a single-step inverse synthesis prediction module for inputting the target molecule to be predicted into the trained prediction model A, predicting the probability that each reaction rule in the reaction rule set T generates the target molecule, and outputting the k rules with the highest probability as the result, wherein k is a set parameter.
In one embodiment, the system further comprises a result visualization module for graphically displaying the molecules, the reactions related to the molecules, and the predicted reaction rules.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
The single-step inverse synthesis method based on a two-way multi-branch CNN provided by the invention first constructs a molecule set S and a reaction rule set T from the obtained reaction data set R, and constructs an input data set D from the molecule set, the reaction rule set and the correspondence between molecules and reaction rules; it then builds a two-way multi-branch CNN to construct a single-step inverse synthesis prediction model; next, it trains the constructed model with the input data set D; finally, it performs single-step inverse synthesis prediction with the trained model A. The method requires no complex domain knowledge and directly predicts reaction rules together with their predicted probabilities. Because it combines more molecular information than existing methods, it can improve prediction performance.
In addition, the invention provides a single-step inverse synthesis system based on the two-way multi-branch CNN, which helps chemists carry out single-step inverse synthesis quickly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flow chart of the single-step inverse synthesis method based on two-way multi-branch CNN in the embodiment of the present invention;
Fig. 2 is a schematic diagram of the single-step inverse synthesis model A based on two-way multi-branch CNN used in the embodiment of the present invention;
Fig. 3 is a schematic diagram of the one-hot encoding matrix of a SMILES sequence used in the embodiment of the present invention;
Fig. 4 is a schematic block diagram of the single-step inverse synthesis system based on two-way multi-branch CNN according to the embodiment of the present invention.
Detailed Description
To address the shortcomings of the prior art, the invention provides a single-step inverse synthesis method based on a two-way multi-branch CNN. The method predicts the reaction rules capable of generating a target molecule; it requires no complex domain knowledge, predicts the reaction rules directly and yields a predicted probability for each. It combines more molecular information than existing methods. The invention also provides a single-step inverse synthesis system based on the two-way multi-branch CNN, which helps chemists carry out single-step inverse synthesis quickly.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment of the invention provides a single-step inverse synthesis method based on two paths of multi-branch CNNs, which comprises the following steps:
S1: obtaining a given reaction data set R comprising different reactions, each reaction comprising substrate molecules and a product molecule; constructing a molecule set S from the product molecules of the reactions in R, and constructing a reaction rule set T from the reactions in R;
S2: constructing an input data set D = {(s, t)} from the molecule set, the reaction rule set and the correspondence between molecules and reaction rules, wherein s denotes a molecule, t denotes a reaction rule, s ∈ S, t ∈ T, and a tuple (s, t) indicates that among the reactions corresponding to reaction rule t there is a reaction that can generate s;
S3: building a two-way multi-branch CNN to construct a single-step inverse synthesis prediction model;
S4: training the single-step inverse synthesis prediction model constructed in step S3, using the constructed input data set D as the training set, to obtain a trained prediction model A;
S5: inputting the target molecule to be predicted into the trained prediction model A, predicting the probability that each reaction rule in the reaction rule set T generates the target molecule, and outputting the k rules with the highest probability as the result, wherein k is a set parameter.
Specifically, step S1 constructs the molecule set S and the reaction rule set T from the given reaction data set R, step S2 constructs the input data set, step S3 constructs the single-step inverse synthesis prediction model, step S4 trains the model, and step S5 performs single-step inverse synthesis prediction using the trained model A.
In one embodiment, the given reaction data set R in step S1 is a collated set of known reactions collected from public resources, and each reaction in R includes the following components: a reaction ID, SMILES sequences representing one or more reactants, and a SMILES sequence representing one product molecule; wherein an original reaction having a plurality of products is decomposed into a plurality of single-product reactions that are given the same ID.
Wherein, when the reaction comprises a catalytic enzyme, the information contained in the reaction further comprises a catalytic enzyme number.
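The decomposition of a multi-product reaction into single-product reactions that share one ID can be sketched as follows. This is a minimal illustration only: the record field names and the function `split_by_product` are hypothetical, and it assumes that products are written as one SMILES string with `.` separating the individual molecules (which would also split the components of a salt).

```python
def split_by_product(reaction_id, reactants_smiles, products_smiles, enzyme=None):
    """Split one reaction into single-product records that keep the same ID.

    Hypothetical record layout; the patent only lists the components
    (reaction ID, reactant SMILES, product SMILES, optional enzyme number).
    """
    records = []
    # Assumption: '.' separates the individual product molecules.
    for product in products_smiles.split("."):
        records.append({
            "id": reaction_id,              # shared by all derived records
            "reactants": reactants_smiles,  # reactants are kept unchanged
            "product": product,             # exactly one product per record
            "enzyme": enzyme,               # None when no enzyme is recorded
        })
    return records


# Toy example: esterification written with two products (ester and water).
records = split_by_product("R1", "CCO.CC(=O)O", "CC(=O)OCC.O")
for rec in records:
    print(rec["id"], rec["reactants"], ">>", rec["product"])
```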
The reaction rule set T may be constructed by calling the template_extractor function in RDChiral. The input of the function is a specific reaction in the reaction data set R, and the output is the reaction rule corresponding to that reaction. Each reaction rule in the reaction rule set T is assigned a unique rule label.
In one embodiment, the two-way multi-branch CNN constructed in step S3 includes five layers: an input layer, two-way multi-branch convolutional layers, a feature splicing layer, a fully connected layer and an output layer;
wherein the input layer comprises an input node for inputting the SMILES sequence of the molecule;
the two-way multi-branch convolutional layers consist of two networks with similar structures and are used to obtain two convolution features of the input molecule, wherein each path comprises a plurality of branches, and each branch consists, in order, of convolution, batch normalization, Sigmoid activation and max pooling; for the input of each path, different branches convolve with kernels of different sizes, each branch yields a convolution vector after batch normalization, Sigmoid activation and max pooling, and these vectors are spliced to obtain the convolution feature corresponding to that path's input;
the splicing layer is used for splicing the two obtained convolution features to obtain the fused representation of the input molecule;
the fully connected layer calculates, through a Softmax function, the probability that each reaction rule in the reaction rule set T generates the input molecule, each probability lying in the range [0, 1];
the output layer comprises |T| nodes, one for each reaction rule in the reaction rule set T, where |T| denotes the size of the set T.
Fig. 2 is a schematic diagram of a single-step reverse synthetic model a based on two-way multi-branched CNNs according to an embodiment of the present invention.
In one embodiment, in the two-way multi-branch convolutional layers, the input V1 of one path is an extended connectivity fingerprint with a radius of 2 generated from the SMILES sequence of the input molecule, and the input V2 of the other path is a one-hot encoding matrix generated from the SMILES sequence of the input molecule and an alphabet, wherein the alphabet consists of the predetermined symbols contained in the SMILES sequences of all molecules.
Specifically, the Extended Connectivity Fingerprint (ECFP) has a length of 2048. The one-hot encoding matrix of the other input V2 has size 300 × 40. Here 300 is the maximum SMILES length: if a SMILES sequence is shorter than 300, it is padded with zeros at the end until its length is 300; if it is longer than 300, the part beyond position 300 is truncated. The value 40 is the length of the alphabet.
Please refer to fig. 3, which is a schematic diagram of a one-hot encoding matrix of a SMILES sequence used in the embodiment of the present invention.
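The padding and truncation rule for the one-hot matrix can be illustrated with a short sketch. It assumes, for simplicity, that every SMILES character is one symbol of the alphabet (the patent does not state how multi-character tokens such as `Cl` are handled), and it uses a toy alphabet rather than the 40-symbol alphabet of the embodiment; `one_hot_smiles` is a hypothetical name.

```python
MAX_LEN = 300  # maximum SMILES length stated in the embodiment


def one_hot_smiles(smiles, alphabet, max_len=MAX_LEN):
    """Return a max_len x len(alphabet) matrix of 0/1 values.

    Assumption: one character = one alphabet symbol. A character that is
    missing from the alphabet raises KeyError.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    # Start from all zeros, so short sequences are zero-padded at the end.
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    # Slicing drops everything beyond max_len (truncation).
    for pos, symbol in enumerate(smiles[:max_len]):
        matrix[pos][index[symbol]] = 1
    return matrix


toy_alphabet = "CO()="  # hypothetical 5-symbol alphabet for illustration
encoded = one_hot_smiles("CC(=O)O", toy_alphabet)
print(len(encoded), len(encoded[0]))  # rows x columns
```

With the embodiment's alphabet of 40 symbols, the same function would yield the 300 × 40 matrix described above.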
In one embodiment, the convolution kernels used by the two-way multi-branch convolutional layers are all one-dimensional; for either path, if the kernel size of the first branch is size_0, then the kernel size of the i-th branch is size_0 + (i-1) × step, where step is the increment by which the kernel size grows from one branch to the next.
The kernels of the two paths differ only in the number of input channels; the number of output channels is 40 in both. For the path with input V1, all convolution kernels have 1 input channel; for the path with input V2, all convolution kernels have 40 input channels, i.e., the length of the alphabet.
In one embodiment, in step S5, the form of the SMILES sequence is converted before the target molecule to be predicted is input into the model.
Referring to fig. 1, a flow chart of a single-step inverse synthesis prediction based on two-way multi-branch CNN according to an embodiment of the present invention is provided.
In a specific implementation, the given reaction data set R of step S1 is a collated set of known reactions collected from public resources. Each reaction in R comprises the following components: a reaction ID, SMILES sequences representing one or more reactants, a SMILES sequence representing one product molecule, and a catalytic enzyme number (if any). An original reaction with multiple products is decomposed into multiple single-product reactions that are given the same ID.
In one case, the given reaction data set R is obtained by collating all chemical reactions in the publicly available chemical reaction data set USPTO-50k. Each reaction in R comprises a reaction ID, SMILES sequences representing one or more reactants, and a SMILES sequence representing one product molecule.
In another case, the given reaction data set R is obtained by collating all metabolic reactions in the publicly available metabolic reaction data set MetaNetX. Each reaction in R comprises the following components: a reaction ID, SMILES sequences representing one or more reactants, a SMILES sequence representing one product molecule, and a catalytic enzyme number (which may be empty).
The reaction rule set T is constructed by calling the template_extractor function in RDChiral. The input of the function is a specific reaction in the reaction data set R, and the output is the reaction rule corresponding to that reaction. Each reaction rule in the reaction rule set T is assigned a unique rule label.
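Assigning unique rule labels and pairing each product molecule with the label of a rule that generates it (the data set D of step S2) might look like the sketch below. The rule extractor is passed in as a callable so that the example stays self-contained; `extract_rule`, `build_dataset` and the record fields are hypothetical stand-ins, not the RDChiral API.

```python
def build_dataset(records, extract_rule):
    """Build the rule-label table and the data set D = {(s, t)}.

    records      : iterable of single-product reaction records
    extract_rule : callable mapping a record to a rule string, or None
                   (hypothetical stand-in for the real rule extraction)
    """
    rule_to_label = {}  # reaction rule -> unique integer label
    dataset = set()     # set of (product SMILES, rule label) pairs
    for rec in records:
        rule = extract_rule(rec)
        if rule is None:  # skip reactions for which no rule was extracted
            continue
        # A rule seen for the first time gets the next free label.
        label = rule_to_label.setdefault(rule, len(rule_to_label))
        dataset.add((rec["product"], label))
    return rule_to_label, sorted(dataset)


# Toy records whose "rule" field plays the role of an extracted rule.
toy_records = [
    {"product": "CCO", "rule": "A"},
    {"product": "CCN", "rule": "B"},
    {"product": "CCC", "rule": "A"},
]
rule_to_label, dataset = build_dataset(toy_records, lambda rec: rec["rule"])
print(rule_to_label)
print(dataset)
```

The number of distinct labels is |T|, which is also the number of output nodes of the network.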
In step S3, the two-path multi-branch CNN network structure mainly includes five layers: the device comprises an input layer, two multi-branch convolution layers, a characteristic splicing layer, a full connection layer and an output layer. The processing steps of each layer are as follows:
3.1 The input layer contains an input node for inputting the SMILES sequence of the molecule;
3.2 The two-way multi-branch convolutional layers consist of two networks with similar structures and are used to obtain two convolution features of the input molecule in 3.1. Each path comprises a plurality of branches, and each branch consists, in order, of convolution, batch normalization, Sigmoid activation and max pooling. For the input of a path, different branches convolve with kernels of different sizes, each branch yields a convolution vector after batch normalization, Sigmoid activation and max pooling, and these vectors are spliced to obtain the convolution feature corresponding to that path's input;
3.3 The splicing layer splices the two convolution features obtained in 3.2 to obtain the fused representation of the input molecule in 3.1;
3.4 The fully connected layer calculates, through a Softmax function, the probability that each reaction rule in the reaction rule set T generates the input molecule in 3.1, each probability lying in the range [0, 1];
3.5 The output layer contains |T| nodes, one for each reaction rule in the reaction rule set T, where |T| denotes the size of the set T.
In 3.2, the input V1 of one path of the two-way multi-branch convolutional layers is an Extended Connectivity Fingerprint (ECFP) with a radius of 2 generated from the SMILES sequence of the input molecule in 3.1; its length is 2048. The other input V2 is a one-hot encoding matrix of size 300 × 40, generated from the SMILES sequence of the input molecule in 3.1 and a predetermined alphabet of the symbols contained in the SMILES sequences of all molecules. Here 300 is the maximum SMILES length: if a SMILES sequence is shorter than 300, it is padded with zeros at the end until its length is 300; if it is longer than 300, the part beyond position 300 is truncated. The value 40 is the length of the alphabet. In the one-hot encoding matrix, the entry corresponding to the symbol that appears at a given position of the input SMILES sequence is 1, and all other entries are 0.
The convolution kernels used in 3.2 are all one-dimensional convolution kernels. For the path with input V1, all convolution kernels have 1 input channel; for the path with input V2, all convolution kernels have 40 input channels, i.e., the length of the alphabet. For either path, if the kernel size of the first branch is size_0, then the kernel size of the i-th branch is size_0 + (i-1) × step, where step is the increment by which the kernel size grows from one branch to the next.
Fig. 2 is a network structure of a specific embodiment of the two-way multi-branch CNN.
In a particular embodiment of the present invention, the Sigmoid activation function in 3.2 is:
$$f(x) = \frac{1}{1 + e^{-x}}$$
where e is the natural constant, x is the output of the previous layer, and f(x) is the output of the activation function. The function is monotonically increasing, as is its inverse; it is smooth and easy to differentiate. It is used for the output of hidden-layer neurons, has the range (0, 1), and maps any real number into the interval (0, 1).
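As a small numerical illustration of these properties, the Sigmoid function can be evaluated as below. The two-case form is a common way to avoid overflow for large negative inputs; it is an implementation choice of this sketch, not something the patent specifies.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite as e^x / (1 + e^x) so exp() never overflows.
    z = math.exp(x)
    return z / (1.0 + z)


values = [sigmoid(x) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
print(values)
```

The printed values increase with x, and the value at x = 0 is 0.5.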
In one embodiment, for the multi-branch convolution path of the model in S3 whose input is the ECFP fingerprint, the initial convolution kernel size size_0 is set to 32 and the step by which the kernel size increases is set to 32; the maximum kernel size does not exceed 2048 (the length of the ECFP fingerprint vector), giving 64 branches in total.
In one embodiment, for the multi-branch convolution path of the model in S3 whose input is the one-hot encoding matrix, the initial convolution kernel size size_0 is set to 5 and the step by which the kernel size increases is set to 5; the maximum kernel size does not exceed 300 (the maximum length of the SMILES sequence), giving 60 branches in total.
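The branch counts follow directly from the kernel-size rule size_0 + (i-1) × step together with the stated upper bounds. A minimal sketch (the function name `branch_kernel_sizes` is hypothetical):

```python
def branch_kernel_sizes(size0, step, max_size):
    """Kernel size of every branch: size0, size0 + step, ... up to max_size."""
    return list(range(size0, max_size + 1, step))


# Path 1: ECFP fingerprint of length 2048.
ecfp_kernels = branch_kernel_sizes(size0=32, step=32, max_size=2048)
# Path 2: one-hot matrix with 300 sequence positions.
onehot_kernels = branch_kernel_sizes(size0=5, step=5, max_size=300)

print(len(ecfp_kernels), ecfp_kernels[0], ecfp_kernels[-1])
print(len(onehot_kernels), onehot_kernels[0], onehot_kernels[-1])
```

The largest kernel in each path spans the whole input (2048 and 300 respectively), so the last branch of each path sees the entire fingerprint or sequence at once.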
In 3.4, the Softmax function is defined as:
$$S_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
where e is the natural constant, z_i is the output of the i-th neuron, \sum_j e^{z_j} is the sum over all neurons of e raised to the power of the neuron's output, and S_i is the result of the i-th neuron after Softmax.
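The Softmax step and the top-k selection of step S5 can be sketched together. The scores in the example are made up, and subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the result unchanged; it is an addition of this sketch, not part of the patent text.

```python
import math


def softmax(scores):
    """Convert raw neuron outputs into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max does not change the result
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_rules(scores, k):
    """Return the k (rule label, probability) pairs with highest probability."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Made-up outputs for a rule set T with four rules.
toy_scores = [0.1, 2.0, -1.0, 1.0]
probabilities = softmax(toy_scores)
best = top_k_rules(toy_scores, k=2)
print(best)
```

Each returned label identifies one reaction rule in T; applying that rule to the target molecule would then give candidate reactants.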
In one embodiment, the loss function of the model is the cross-entropy loss, expressed as:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{i,c} \log(p_{i,c})$$
where M is the total number of labels, i.e., the size of the reaction rule set T, and N is the total number of samples; y_{i,c} is a binary indicator of whether the true label of sample i is c, i.e., it is 1 if rule c is the true rule of sample i and 0 otherwise; p_{i,c} is the predicted probability that the label of sample i is c, i.e., the predicted probability that the rule of sample i is c.
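Because y_{i,c} is 1 only for the true label, the inner sum over the M labels reduces to the log-probability of the true rule. The sketch below computes the loss on that basis; averaging over the N samples and clamping probabilities with a small `eps` are assumptions of this illustration.

```python
import math


def cross_entropy(probs, true_labels, eps=1e-12):
    """Mean cross-entropy over N samples.

    probs       : N lists of M predicted probabilities
    true_labels : N integers, the index of the true rule of each sample
    eps         : lower clamp to avoid log(0) (assumption of this sketch)
    """
    total = 0.0
    for p_i, c in zip(probs, true_labels):
        # Only the term with y_{i,c} = 1 contributes to the inner sum.
        total += -math.log(max(p_i[c], eps))
    return total / len(true_labels)


toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_labels = [0, 1]
loss = cross_entropy(toy_probs, toy_labels)
print(round(loss, 4))
```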
In one embodiment, the optimization of the model is performed using the Adam optimization method.
In a specific embodiment, the data set constructed in step S2 is randomly divided into a training set, a validation set and a test set in a ratio of 8:1:1. The training set and the validation set are used for training the model; the test set does not participate in training and is used to evaluate the single-step inverse synthesis prediction performance of the trained model. The number of training epochs is set to 20; each epoch runs multiple iterations until every training sample has participated once, and the number of training samples per iteration, batch_size, is set to 128. The initial learning rate is set to 0.001.
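A random 8:1:1 split and the resulting number of iterations per epoch can be sketched as follows. The fixed seed and the function name `split_dataset` are illustrative choices, and the toy data set of 1000 integers merely stands in for the (molecule, rule) pairs of D.

```python
import math
import random


def split_dataset(samples, seed=0):
    """Shuffle and split into training, validation and test sets (8:1:1)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


BATCH_SIZE = 128  # value stated in the embodiment

train, val, test = split_dataset(range(1000))
# One epoch = enough iterations for every training sample to be used once.
iterations_per_epoch = math.ceil(len(train) / BATCH_SIZE)
print(len(train), len(val), len(test), iterations_per_epoch)
```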
Example 1: All chemical reactions in the publicly available chemical reaction data set USPTO-50k are curated to obtain the given reaction data set R. The data set D is constructed from R as described in the third step and is randomly divided into a training set, a validation set, and a test set in a ratio of 8:1:1. The test set is used to measure the prediction accuracy of the trained single-step chemical inverse synthesis prediction model. Table 1 shows the performance of the proposed two-way multi-branch CNN method on single-step chemical inverse synthesis. The best top-1 prediction accuracy currently reported in the field does not exceed 52.5%, so the top-1 accuracy of the model obtained with this method is clearly higher than the best current result. The model's top-3, top-5, and top-10 prediction accuracies are also higher than those of existing models.
Table 1: Prediction accuracy of the single-step chemical inverse synthesis model built on USPTO-50k
Top-1 Top-3 Top-5 Top-10
61.1% 79.1% 83.9% 87.7%
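The top-k figures in Table 1 are of the kind computed by the sketch below: a test sample counts as correct when its true reaction rule is among the k rules with the highest predicted probability. The function name and the three-sample toy data are hypothetical and are not results from the invention.

```python
def top_k_accuracy(prob_rows, true_labels, k):
    """Fraction of samples whose true label is among the k highest-probability labels."""
    hits = 0
    for probs, label in zip(prob_rows, true_labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy predictions over three rules for three samples (illustrative values only).
prob_rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
true_labels = [0, 1, 0]
top1 = top_k_accuracy(prob_rows, true_labels, k=1)
top2 = top_k_accuracy(prob_rows, true_labels, k=2)
print(top1, top2)
```

Top-k accuracy never decreases as k grows, which matches the increasing trend from top-1 to top-10 in the table.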
Example 2: and (3) finishing all metabolic reactions in the publicly available metabolic reaction data set MetaNeTX to obtain a set reaction data set R. And D, arranging and constructing a data set D in the third step according to the R, and randomly dividing the data set D into a training set, a verification set and a test set according to the ratio of 8:1: 1. The training set and the verification set train the model, and the test set tests the performance of the single-step biological inverse synthesis prediction model obtained by training. Table 2 shows the predicted performance of the single-step reverse synthesis prediction method based on two-way multi-branch CNN proposed by the present invention in single-step reverse biosynthesis. In the biological reverse synthesis, no prediction work report of single-step biological reverse synthesis is found at present, and the single-step reverse synthesis method in the chemical field is successfully applied to the single-step biological reverse synthesis, so that a new idea of biological reverse synthesis is widened, and the blank of single-step biological reverse synthesis prediction is filled.
Table 2: Prediction accuracy of the single-step biological inverse synthesis model built on MetaNetX
[Table 2 is provided as an image in the original publication: Figures BDA0003423139490000101 and BDA0003423139490000111.]
The invention applies a convolutional neural network to single-step inverse synthesis prediction, realizing an end-to-end single-step inverse synthesis framework that requires no complicated domain-specific parameter setting. In addition, the invention uses fused features of molecular fingerprints and one-hot encodings, which can provide more latent information than existing methods that rely on a single type of information.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention applies a convolutional neural network to single-step inverse synthesis prediction, realizing an end-to-end single-step inverse synthesis framework that requires no complicated domain-specific parameter setting.
2. The method uses fused features of the fingerprint and the one-hot encoding, which can provide more latent information than existing methods.
3. The invention can be used in both chemical inverse synthesis and biological inverse synthesis and is therefore more widely applicable than existing methods. In particular, single-step inverse synthesis prediction is currently an unexplored area in biological inverse synthesis, and the method can fill this gap.
Embodiment two
Based on the same inventive concept, the present embodiment provides a single-step inverse synthesis system based on two-way multi-branch CNN, which is characterized by comprising:
a reaction data set acquisition module for acquiring a given reaction data set R, wherein the given reaction data set comprises different reactions, each reaction comprises a substrate molecule and a product molecule, a molecule set S is constructed according to the product molecules of the reaction in the given reaction data set, and a reaction rule set T is constructed according to the reaction in the given reaction data set;
the training set construction module is used for constructing an input data set D = {(s, t)} according to the constructed molecule set, the reaction rule set and the correspondence between molecules and reaction rules, wherein s represents a molecule, t represents a reaction rule, s ∈ S, t ∈ T, and the two-tuple (s, t) indicates that among the reactions corresponding to reaction rule t there is a reaction that can generate s;
the model building module is used for building two paths of multi-branch CNNs and building a single-step inverse synthesis prediction model;
the model training module is used for training the single-step inverse synthesis prediction model constructed in the step S3 by using the constructed input data set D as a training set to obtain a trained prediction model A;
and the single-step inverse synthesis prediction module is used for inputting the target molecule to be predicted into the trained prediction model A, predicting the probability that each reaction rule in the reaction rule set T generates the target molecule, and outputting the top k rules with the highest probability as the result, wherein k is a set parameter.
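The selection performed by the prediction module can be sketched as a simple ranking over the predicted probabilities. The rule identifiers and probability values below are hypothetical placeholders; in the system they would come from the Softmax output of the trained model A.

```python
import heapq


def top_k_rules(rule_probs, k):
    """Return the k (rule, probability) pairs with the highest probability, best first."""
    return heapq.nlargest(k, rule_probs.items(), key=lambda item: item[1])


# Hypothetical model output for one target molecule.
rule_probs = {"rule_a": 0.05, "rule_b": 0.60, "rule_c": 0.25, "rule_d": 0.10}
best = top_k_rules(rule_probs, k=2)
print(best)  # [('rule_b', 0.6), ('rule_c', 0.25)]
```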
The reaction data set acquisition module comprises a reaction data set construction unit, a reaction rule set generation unit and a molecule set construction unit.
The molecule processing module is a component of the single-step inverse synthesis prediction model; its functions include extracting the SMILES sequences of the relevant molecules from a reaction, generating a fingerprint vector from a SMILES sequence, and generating a one-hot encoding matrix from a SMILES sequence and an alphabet.
In one embodiment, the system further comprises a result visualization module, which is used for graphically displaying the molecules, the reactions associated with them, and the predicted reaction rules.
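The one-hot step can be sketched as below. Several details are assumptions of this sketch rather than statements from the text: the alphabet is built from single characters (multi-character SMILES tokens such as Cl or Br would need a proper tokenizer), the matrix has one row per sequence position and one column per symbol, and sequences are zero-padded to the maximum length of 300. Fingerprint generation is not shown because it requires a cheminformatics toolkit.

```python
def build_alphabet(smiles_list):
    """Collect every character that occurs in the given SMILES strings, in sorted order."""
    return sorted({ch for smiles in smiles_list for ch in smiles})


def one_hot_encode(smiles, alphabet, max_len=300):
    """Encode a SMILES string as a max_len x len(alphabet) matrix of 0s and 1s."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1  # raises KeyError for a symbol outside the alphabet
    return matrix


alphabet = build_alphabet(["CCO", "C=O"])
matrix = one_hot_encode("CCO", alphabet)
print(alphabet, matrix[:3])
```

Rows beyond the end of the sequence stay all-zero, so every molecule yields a matrix of the same shape.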
Fig. 4 is a schematic block diagram of a single-step inverse synthesis system based on two-way multi-branch CNN according to an embodiment of the present invention.
The reaction data set acquisition module comprises three parts: a reaction data set acquisition unit, a molecule set construction unit, and a reaction rule set generation unit. The reaction data set acquisition unit curates known reactions collected from public resources to obtain the reaction data set; the molecule set construction unit constructs the molecule set from the product molecules of the reactions in the reaction data set; and the reaction rule set generation unit generates the reaction rule set from all reactions in the reaction data set. The training set construction module generates a training set consisting of (molecule, reaction rule) pairs. The model training module trains and optimizes the two-way multi-branch CNN to obtain the single-step inverse synthesis prediction model. The single-step inverse synthesis prediction module predicts the top k reaction rules that can generate a previously unseen target molecule given as a SMILES sequence, where k is a preset parameter. The result visualization module graphically displays molecules, their associated reactions, the predicted reaction rules, and the like.
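How the molecule set S, the rule set T and the (molecule, reaction rule) training pairs relate to the reaction data set can be sketched as follows. The record layout, the three toy reactions and the rule identifiers "t1" and "t2" are hypothetical; in particular, this sketch assumes each reaction already carries its rule label and does not show how rules are extracted from reactions.

```python
# Hypothetical reaction records: ID, reactant SMILES, one product SMILES, rule label.
reactions = [
    {"id": "r1", "reactants": ["CC(=O)O", "OCC"], "product": "CC(=O)OCC", "rule": "t1"},
    {"id": "r2", "reactants": ["CC(=O)O", "OC"], "product": "CC(=O)OC", "rule": "t1"},
    {"id": "r3", "reactants": ["CC(=O)O", "NC"], "product": "CC(=O)NC", "rule": "t2"},
]


def build_dataset(reaction_records):
    """Return (S, T, D): product molecules, reaction rules, and (molecule, rule) pairs."""
    molecules, rules, pairs = set(), set(), set()
    for reaction in reaction_records:
        molecules.add(reaction["product"])
        rules.add(reaction["rule"])
        # A pair (s, t) records that a reaction following rule t produces molecule s.
        pairs.add((reaction["product"], reaction["rule"]))
    return sorted(molecules), sorted(rules), sorted(pairs)


S, T, D = build_dataset(reactions)
print(len(S), T, len(D))  # 3 ['t1', 't2'] 3
```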
In summary, in the method of the first embodiment, single-step inverse synthesis prediction takes the SMILES sequence of the molecule to be predicted as input; after the two-way multi-branch convolution layer, the feature splicing layer and the full connection layer, the model outputs the top k reaction rules in the reaction rule set that can generate the molecule. The reactant SMILES of the target molecule are then computed from the output reaction rules together with the SMILES of the molecule to be predicted, thereby automating single-step inverse synthesis. This embodiment also provides a single-step inverse synthesis system based on two-way multi-branch CNN, which automates single-step inverse synthesis of target molecules through the modules for reaction data set acquisition, training set construction, model training, single-step inverse synthesis prediction, and result visualization. The invention applies a convolutional neural network to single-step inverse synthesis prediction and realizes an end-to-end single-step inverse synthesis framework. The invention can be used in both chemical and biological inverse synthesis and is therefore more widely applicable than existing methods. In particular, single-step inverse synthesis prediction is currently an unexplored area in biological inverse synthesis, and the method can fill this gap.
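The data flow through the two-way multi-branch convolution layer and the feature splicing layer can be illustrated with a deliberately simplified, dependency-free sketch. It is not the invention's network: the kernels are fixed numbers rather than learned weights, each branch has a single filter, batch normalization is omitted, max pooling is taken over the whole sequence, and both inputs are short 1-D toy vectors (the one-hot path is a matrix in the method).

```python
import math


def conv1d_valid(signal, kernel):
    """One-dimensional convolution without padding (sliding dot product)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def branch_feature(signal, kernel):
    """One branch: convolution, Sigmoid activation, then max pooling (no batch norm here)."""
    return max(sigmoid(v) for v in conv1d_valid(signal, kernel))


def path_features(signal, kernels):
    """One path: run every branch (one kernel size each) and splice the branch outputs."""
    return [branch_feature(signal, kernel) for kernel in kernels]


# Toy inputs standing in for the fingerprint path and the one-hot path.
fingerprint_path = path_features([0, 1, 1, 0, 1, 0, 0, 1], [[1.0, -1.0], [0.5, 0.5, 0.5]])
onehot_path = path_features([1, 0, 0, 1, 0, 1, 1, 0], [[1.0, 1.0], [1.0, 0.0, -1.0, 0.5]])
fused = fingerprint_path + onehot_path  # feature splicing of the two paths
print(len(fused))  # 4
```

The fused vector is what a full connection layer with Softmax would consume to score each reaction rule; different kernel sizes per branch let the paths respond to patterns of different lengths in the input.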
Since the system introduced in the second embodiment of the present invention is a system adopted for implementing the single-step inverse synthesis method based on two-way multi-branch CNNs in the first embodiment of the present invention, a person skilled in the art can understand the specific structure of the system based on the method introduced in the first embodiment of the present invention, and details are not described herein. All systems adopted by the method of the first embodiment of the present invention are within the intended protection scope of the present invention.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A single-step inverse synthesis method based on two-way multi-branch CNN is characterized by comprising the following steps:
s1: obtaining a given reaction data set R, wherein the given reaction data set comprises different reactions, each reaction comprises a substrate molecule and a product molecule, a molecule set S is constructed according to the product molecules of the reaction in the given reaction data set, and a reaction rule set T is constructed according to the reaction in the given reaction data set;
s2: according to the constructed molecule set, the reaction rule set and the correspondence between molecules and reaction rules, constructing an input data set D = {(s, t)}, wherein s represents a molecule, t represents a reaction rule, s ∈ S, t ∈ T, and the two-tuple (s, t) indicates that among the reactions corresponding to reaction rule t there is a reaction that can generate s;
s3: building two paths of multi-branch CNNs and constructing a single-step inverse synthesis prediction model;
s4: training the single-step inverse synthesis prediction model constructed in the step S3 by using the constructed input data set D as a training set to obtain a trained prediction model A;
s5: inputting target molecules to be predicted into a trained prediction model A, predicting the probability of each reaction rule in a reaction rule set T for generating the target molecules, and selecting the first k rules with the highest probability as results to be output according to the probability values, wherein k is a set parameter.
2. The single-step inverse synthesis method of claim 1, wherein the given reaction data set R in step S1 is a set obtained by curating known reactions collected from public resources, and each reaction in R comprises the following components: a reaction ID, SMILES sequences representing one or more reactants, and a SMILES sequence representing one product molecule; wherein an original reaction having a plurality of products is decomposed into a plurality of single-product reactions that are given the same ID.
3. The single-step inverse synthesis method of claim 1, wherein the two-way multi-branch CNN constructed in step S3 includes five layers, which are an input layer, a two-way multi-branch convolution layer, a feature splicing layer, a full connection layer, and an output layer, respectively;
wherein the input layer comprises an input node for inputting the SMILES sequence of the molecule;
the two-way multi-branch convolution layer consists of two networks with similar structures and is used for obtaining two convolution features of the input molecule, wherein each path comprises a plurality of branches, and each branch consists, in order, of convolution, batch normalization, Sigmoid activation and max pooling operations; for each path's input, different branches use convolution kernels of different sizes, a different convolution vector is obtained from each branch after batch normalization, Sigmoid activation and max pooling, and these convolution vectors are spliced to obtain the convolution feature corresponding to that path's input;
the feature splicing layer is used for splicing the two obtained convolution features to obtain the fused representation feature of the input molecule;
the full connection layer calculates, through a Softmax function, the probability that each reaction rule in the reaction rule set T generates the input molecule, wherein each probability value lies in the range [0, 1];
the output layer comprises | T | nodes which respectively correspond to each reaction rule in the reaction rule set T, and | T | represents the size of the set T.
4. The single-step inverse synthesis method of claim 3, wherein, in the two-way multi-branch convolution layer, the input V1 of one path is an extended-connectivity fingerprint with a radius of 2 generated from the SMILES sequence of the input molecule, and the input V2 of the other path is a one-hot encoding matrix generated from the SMILES sequence of the input molecule and an alphabet, wherein the alphabet consists of the predetermined symbols contained in the SMILES sequences of all molecules.
5. The single-step inverse synthesis method of claim 3, wherein the convolution kernels used in the two-way multi-branch convolution layer are all one-dimensional convolution kernels, and for either path, if the convolution kernel size of the first branch is set to size_0, then the convolution kernel size of the i-th branch is set to size_0 + (i-1) × step, where step is the step by which the convolution kernel size increases.
6. The single-step inverse synthesis method of claim 1, wherein in step S5, the target molecule to be predicted is converted into the form of a SMILES sequence before being input into the model.
7. A single-step inverse synthesis system based on two-way multi-branch CNN is characterized by comprising:
a reaction data set acquisition module for acquiring a given reaction data set R, wherein the given reaction data set comprises different reactions, each reaction comprises a substrate molecule and a product molecule, a molecule set S is constructed according to the product molecules of the reaction in the given reaction data set, and a reaction rule set T is constructed according to the reaction in the given reaction data set;
the training set construction module is used for constructing an input data set D = {(s, t)} according to the constructed molecule set, the reaction rule set and the correspondence between molecules and reaction rules, wherein s represents a molecule, t represents a reaction rule, s ∈ S, t ∈ T, and the two-tuple (s, t) indicates that among the reactions corresponding to reaction rule t there is a reaction that can generate s;
the model building module is used for building two paths of multi-branch CNNs and building a single-step inverse synthesis prediction model;
the model training module is used for training the single-step inverse synthesis prediction model constructed in the step S3 by using the constructed input data set D as a training set to obtain a trained prediction model A;
and the single-step inverse synthesis prediction module is used for inputting the target molecule to be predicted into the trained prediction model A, predicting the probability that each reaction rule in the reaction rule set T generates the target molecule, and outputting the top k rules with the highest probability as the result, wherein k is a set parameter.
8. The single-step inverse synthesis system of claim 7, further comprising a result visualization module, which is used for graphically displaying the molecules, the reactions associated with them, and the predicted reaction rules.
CN202111569573.2A 2021-12-21 2021-12-21 Single-step inverse synthesis method and system based on two-way multi-branch CNN Pending CN114360662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111569573.2A CN114360662A (en) 2021-12-21 2021-12-21 Single-step inverse synthesis method and system based on two-way multi-branch CNN


Publications (1)

Publication Number Publication Date
CN114360662A true CN114360662A (en) 2022-04-15

Family

ID=81100706


Country Status (1)

Country Link
CN (1) CN114360662A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020023650A1 (en) * 2018-07-25 2020-01-30 Wuxi Nextcode Genomics Usa, Inc. Retrosynthesis prediction using deep highway networks and multiscale reaction classification
US20210125691A1 (en) * 2019-10-01 2021-04-29 Molecule One sp. z o.o., Systems and method for designing organic synthesis pathways for desired organic molecules
CN112397155A (en) * 2020-12-01 2021-02-23 中山大学 Single-step reverse synthesis method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
莫雨;朱玲嘉;: "新世纪生命科学之有机化合物的合成与逆合成分析", 天津化工, no. 01, 30 January 2020 (2020-01-30) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974450A (en) * 2022-06-28 2022-08-30 苏州沃时数字科技有限公司 Method for generating operation steps based on machine learning and automatic testing device
CN116578934A (en) * 2023-07-13 2023-08-11 烟台国工智能科技有限公司 Inverse synthetic analysis method and device based on Monte Carlo tree search
CN116578934B (en) * 2023-07-13 2023-09-19 烟台国工智能科技有限公司 Inverse synthetic analysis method and device based on Monte Carlo tree search
CN116935969A (en) * 2023-07-28 2023-10-24 宁波甬恒瑶瑶智能科技有限公司 Biological inverse synthesis prediction method and device based on depth search and electronic equipment
CN116935969B (en) * 2023-07-28 2024-03-26 宁波甬恒瑶瑶智能科技有限公司 Biological inverse synthesis prediction method and device based on depth search and electronic equipment
CN116705197A (en) * 2023-08-02 2023-09-05 北京深势科技有限公司 Method and device for processing synthetic and inverse synthetic molecular diagram prediction model
CN116705197B (en) * 2023-08-02 2023-11-17 北京深势科技有限公司 Method and device for processing synthetic and inverse synthetic molecular diagram prediction model
CN116959613A (en) * 2023-09-19 2023-10-27 烟台国工智能科技有限公司 Compound inverse synthesis method and device based on quantum mechanical descriptor information
CN116959613B (en) * 2023-09-19 2023-12-12 烟台国工智能科技有限公司 Compound inverse synthesis method and device based on quantum mechanical descriptor information
CN118197452A (en) * 2024-05-17 2024-06-14 烟台国工智能科技有限公司 Chemical synthesis route ranking analysis method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination