CN112270951B - Brand-new molecule generation method based on multitask capsule self-encoder neural network - Google Patents
- Publication number
- CN112270951B (application CN202011247808.1A)
- Authority
- CN
- China
- Prior art keywords
- capsule
- multitask
- molecules
- encoder
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a brand-new molecule generation method based on a multitask capsule self-encoder neural network. The method represents drug molecules as SMILES (simplified molecular-input line-entry system) strings, annotates them with target-property labels, and learns the characteristics of known drug molecules in a training phase to obtain a trained model; the trained model reconstructs molecules in a reconstruction phase and generates molecules in a generation phase. The generated molecules simultaneously possess multiple preset target properties, and a large proportion of them are novel molecules with novel scaffolds. The invention can be used to generate various molecules such as drugs or compounds: after a single training run, the characteristics and properties of known drugs are learned, so that molecules simultaneously satisfying the required physical, chemical and biological properties can be generated. The molecules generated by the method show higher validity and more favorable properties.
Description
Technical Field
The invention relates to the interdisciplinary field of computer artificial intelligence and de novo molecular design, and in particular to a brand-new molecule generation method using a multitask capsule self-encoder neural network. The method performs de novo molecular design based on a self-encoder framework combined with a multitask capsule classifier framework, and is suitable for generating molecules that simultaneously satisfy multiple physical, chemical and biological properties.
Background
The design of small-molecule drugs plays a key role in the development of active drugs. Traditional drug design methods such as virtual screening and pharmacophore models are mainly used to search libraries of known virtual compounds. Because of the enormous number of potentially synthesizable molecules in chemical space (10^23 to 10^60) and the limitations of current computing performance, a global search of the entire chemical space is difficult, and the analysis and processing of search results requires a great deal of expertise. As a data-driven computational approach, artificial intelligence can automatically learn knowledge such as the chemical structures and structure-activity relationships of drug molecules from a data set, help scientists design molecules with target properties, and bring hope for drug discovery and development. De novo molecular design methods based on deep neural networks, as a novel artificial intelligence technique, can be used to generate molecules with desired properties. Their advantage is that new molecules with optimized properties can be generated without enumerating virtual compound libraries. However, existing molecule generation methods consider only one target property at a time; they struggle to learn characteristics beyond that property, cannot optimize multiple properties of a molecule simultaneously, and so compromise the final generation result and fail to meet the requirements of new drug molecule design. A key difficulty in molecule generation lies in the choice of classifier: a common support vector machine cannot be trained jointly with a deep neural network, while a convolutional neural network classifies poorly and is hard to apply to the classification and generation of molecules with multiple target properties.
Disclosure of Invention
The invention aims to provide a method for generating molecules which can simultaneously meet various target properties such as molecular weight, lipid-water partition coefficient, hydrogen bond donor, hydrogen bond acceptor, rotatable bond number, polar surface area, synthesizability, specific target activity and the like.
The invention provides a new model, which takes a self-encoder as a basic frame and integrates a multi-task capsule classifier in a hidden layer. By adopting the method, various drug molecules with optimized target properties can be effectively generated.
The technical scheme of the method is as follows:
A brand-new molecule generation method based on a multitask capsule self-encoder neural network, characterized by comprising the following steps: drug molecules are represented as SMILES (simplified molecular-input line-entry system) strings, and the target-property labels of the molecules are annotated. Structurally, the model comprises three parts: an encoder, a multitask capsule classifier and a decoder. The encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network; the multitask capsule classifier uses a double capsule layer to encode vector representations of the drug-molecule property labels; the decoder decodes the hidden-layer vectors with a long short-term memory network to generate molecular SMILES, realizing either the reconstruction of molecules or the generation of brand-new molecules.
The method comprises the following steps:
step 1: collecting training data, extracting a one-hot encoding table of the molecular SMILES, and calculating property labels;
step 2: learning the characteristics of known drug molecules SMILES through a training stage to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, molecules with specific properties are generated by using a training model.
Further, the step 1 specifically includes:
collecting drug molecules and establishing a specific training data set;
using SMILES to represent drug molecules;
determining target properties of the drug molecules by experiment or calculation; if a property is quantitative, selecting a reasonable threshold to convert it into qualitative data, i.e. target property = 1 and non-target property = 0; the calculation of molecular target properties is carried out with the PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both drug molecules SMILES and specific property labels;
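The conversion of quantitative property data into qualitative 0/1 labels can be sketched as follows. This is a minimal sketch, assuming the descriptor values were computed beforehand (e.g. with PaDEL-Descriptor, RDKit or Discovery Studio); the thresholds follow the embodiment's example, and the helper name `property_labels` is hypothetical, not part of the patented method.

```python
# Sketch of step 1's label construction: each quantitative descriptor is
# thresholded into a qualitative 0/1 target-property label.  Descriptor
# values are assumed to be precomputed; thresholds follow the embodiment.
THRESHOLDS = {
    "mol_weight":       lambda v: v <= 500,
    "logp":             lambda v: 0 <= v <= 5,
    "h_donors":         lambda v: v <= 5,
    "h_acceptors":      lambda v: v <= 10,
    "rotatable_bonds":  lambda v: v <= 20,
    "polar_surface":    lambda v: v <= 200,
    "synthesizability": lambda v: v <= 6,
}

def property_labels(descriptors):
    """Map each quantitative descriptor value to a 0/1 property label."""
    return {name: int(check(descriptors[name]))
            for name, check in THRESHOLDS.items()}

# Example with aspirin-like descriptor values (illustrative numbers).
labels = property_labels({
    "mol_weight": 180.2, "logp": 1.3, "h_donors": 1, "h_acceptors": 4,
    "rotatable_bonds": 3, "polar_surface": 63.6, "synthesizability": 2.1,
})
```

A molecule is kept as a target-property training example only when every label equals 1.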
further, the step 2 specifically includes:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually adjusting the model's hyperparameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the training model with the minimum cross-entropy loss value;
and keeping the best model in multiple training processes as a training model.
Further, the step 3 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
a decoder decodes the fixed length vectors into reconstructed molecular data;
calculating a reconstruction rate by reconstructing the molecular data;
and saving the reconstructed molecular data.
Further, the step 4 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
the multitask capsule classifier calculates the property features of the molecules, and the molecules having the target properties are retained for generation;
performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution;
the decoder decodes the new vector distribution into newly generated molecular data;
manually debugging the hyper-parameters of the data enhancement process for multiple times, and keeping the best generation result;
when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
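The data-enhancement step above perturbs the fixed-length vector representations of target-property molecules to obtain a new vector distribution for the decoder. A minimal sketch follows, assuming Gaussian noise (the embodiment later sets the normal distribution's standard deviation to 0.2); the helper name `enhance` is hypothetical.

```python
import random

def enhance(latent_vectors, sigma=0.2, seed=None):
    """Perturb fixed-length latent vectors with Gaussian noise to obtain a
    new vector distribution (step 4's data enhancement).  sigma matches the
    embodiment's standard deviation of 0.2."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in vec] for vec in latent_vectors]

# Example: jitter two 4-dimensional latent vectors.
originals = [[0.1, -0.3, 0.7, 0.0], [0.5, 0.2, -0.1, 0.9]]
perturbed = enhance(originals, sigma=0.2, seed=42)
```

Decoding the perturbed vectors yields molecules that are close to, but distinct from, the original target-property molecules.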
Further, the multitask capsule self-encoder neural network comprises an encoder, a multitask capsule classifier and a decoder, wherein the training data serves as the input of the encoder, and the output of the encoder serves as the input of the multitask capsule classifier; the output of the multitask capsule classifier is used as the input of the decoder.
Further, the encoder directly encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network, in 3 steps:

1) The forward recurrent neural network reads the input sequence from $x_1$ to $x_{T_x}$ and computes the forward hidden states; the backward recurrent neural network reads the input sequence from $x_{T_x}$ to $x_1$ and computes the backward hidden states:

$$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$$

where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;

2) The hidden state at time $t$ is the concatenation of the forward and backward hidden states, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $h_t$ denotes the hidden state at time $t$;

3) The hidden-layer vector is generated from the hidden-state sequence:

$$c = q(\{h_1, \dots, h_{T_x}\})$$

where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
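The bidirectional encoding in steps 1)–3) can be sketched numerically as follows. This is an illustrative sketch only: a plain tanh recurrent cell stands in for the LSTM cell, and all weights, names and sizes are assumptions, not the patent's parameters.

```python
import numpy as np

def birnn_encode(xs, Wf, Uf, Wb, Ub):
    """Compute forward and backward hidden states over a one-hot sequence
    and concatenate them (a tanh cell stands in for the LSTM cell).
    xs: (T, d) inputs; W*: (h, d) input weights; U*: (h, h) recurrences."""
    h = Wf.shape[0]
    fwd, back = [], []
    hf = np.zeros(h)
    for x in xs:                       # forward: h_t = f(x_t, h_{t-1})
        hf = np.tanh(Wf @ x + Uf @ hf)
        fwd.append(hf)
    hb = np.zeros(h)
    for x in xs[::-1]:                 # backward: h_t = f(x_t, h_{t+1})
        hb = np.tanh(Wb @ x + Ub @ hb)
        back.append(hb)
    back = back[::-1]
    # h_t = [forward; backward]; c is taken here as the last concatenation.
    hs = [np.concatenate([f, b]) for f, b in zip(fwd, back)]
    return hs, hs[-1]

rng = np.random.default_rng(0)
T, d, h = 5, 8, 4                      # sequence length, vocab size, hidden size
xs = np.eye(d)[rng.integers(0, d, T)]  # random one-hot "SMILES" sequence
Wf, Wb = rng.normal(size=(h, d)), rng.normal(size=(h, d))
Uf, Ub = rng.normal(size=(h, h)), rng.normal(size=(h, h))
states, c = birnn_encode(xs, Wf, Uf, Wb, Ub)
```

Choosing $c$ as the final concatenated state is one simple instance of the nonlinear function $q$ over the hidden-state sequence.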
Further, the multitask capsule classifier adopts double Capsule Layers and optimizes a Margin Loss to encode and predict the property labels of the drug molecules. The hidden-layer vector is mapped to the double capsule layer, and the number of Routing iterations is optimized and adjusted, specifically comprising the following steps:
1) Matrix transformation: a prediction vector is computed from the mapped hidden-layer vector:

$$\hat{u}_{j|i} = W_{ij} u_i$$

where $i$ indexes the first capsule layer, $j$ the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ the weight matrix learned by back propagation, and $u_i$ the output of capsule $i$;

2) The total input vector $s_j$ of capsule $j$ is computed as the weighted sum of all prediction vectors:

$$s_j = \sum_i c_{ij} \hat{u}_{j|i}$$

where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;

3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

where $b_{ij}$ denotes the log prior probability of the coupling strength between capsule $i$ and capsule $j$; before the routing iterations, $b_{ij}$ is initialized to 0, and the values of $c_{ij}$ are updated over the routing iterations;

4) The vector output of capsule $j$ is computed with a nonlinear squash function:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|}$$

where $v_j$ denotes the vector output of capsule $j$;

5) The loss $L_k$ of each capsule in the capsule layer is computed:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$$

where $L_k$ denotes the loss of each capsule, $T_k$ an indicator function (1 or 0 according to the property label), $m^+$ and $m^-$ the upper and lower margins respectively, and $\lambda$ a scaling factor weighting the two terms.
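The routing and loss computation in steps 1)–5) can be sketched numerically as follows. The capsule counts, vector dimensions and margin parameters ($m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$) are illustrative assumptions in line with common capsule-network defaults, not values fixed by the patent.

```python
import numpy as np

def squash(s):
    """Nonlinear squash: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement between two capsule layers.
    u_hat: (I, J, dim) prediction vectors; returns (J, dim) outputs v_j."""
    n_i, n_j, _ = u_hat.shape
    b = np.zeros((n_i, n_j))                   # log priors b_ij, init to 0
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax c_ij
        s = (c[:, :, None] * u_hat).sum(axis=0)               # s_j
        v = squash(s)                                         # v_j
        b = b + np.einsum("ijd,jd->ij", u_hat, v)             # agreement update
    return v

def margin_loss(v, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_k = T_k max(0, m+ - |v_k|)^2 + lam (1 - T_k) max(0, |v_k| - m-)^2."""
    norms = np.linalg.norm(v, axis=-1)
    return (T * np.maximum(0, m_pos - norms) ** 2
            + lam * (1 - T) * np.maximum(0, norms - m_neg) ** 2).sum()

rng = np.random.default_rng(1)
u_hat = rng.normal(size=(6, 3, 4))             # 6 input caps, 3 property caps
v = dynamic_routing(u_hat)
loss = margin_loss(v, T=np.array([1.0, 0.0, 1.0]))
```

In the multitask setting, each second-layer capsule corresponds to one target property, so the length of $v_j$ serves as the predicted probability that the molecule possesses property $j$.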
Further, the decoder decodes the hidden-layer vector encoded by the multitask capsule classifier using a long short-term memory network, specifically comprising the following steps:

1) The predicted character at time $t$ is generated from the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$:

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

$$s_t = f'(s_{t-1}, y_{t-1}, c)$$

where $y_t$ denotes the predicted character at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;

2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:

$$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$$

where $p(Y)$ denotes the probability of the predicted sequence $Y$.
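The character-by-character decoding can be sketched as follows. This is an illustrative sketch: a tanh cell again stands in for the LSTM, greedy (argmax) selection is used, and all weight names and sizes are assumptions rather than the patent's configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(c, Wc, Ws, Wy, Wout, max_len=10, start=0, stop=1):
    """Greedy decoding: at each step pick argmax p(y_t | y_<t, c), with the
    decoder state s_t = f'(s_{t-1}, y_{t-1}, c) (tanh cell stands in for
    the LSTM).  Returns the emitted tokens and the sequence probability
    p(Y) = prod_t p(y_t | y_<t, c)."""
    V = Wy.shape[1]
    s = np.zeros(Ws.shape[0])
    y = start
    out, p_seq = [], 1.0
    for _ in range(max_len):
        y_vec = np.eye(V)[y]
        s = np.tanh(Wc @ c + Ws @ s + Wy @ y_vec)   # s_t
        probs = softmax(Wout @ s)                   # p(y_t | y_<t, c)
        y = int(np.argmax(probs))
        p_seq *= probs[y]
        if y == stop:
            break
        out.append(y)
    return out, p_seq

rng = np.random.default_rng(2)
h, V, dc = 6, 5, 4                                  # hidden, vocab, latent sizes
c = rng.normal(size=dc)
Wc, Ws = rng.normal(size=(h, dc)), rng.normal(size=(h, h))
Wy, Wout = rng.normal(size=(h, V)), rng.normal(size=(V, h))
tokens, p = greedy_decode(c, Wc, Ws, Wy, Wout)
```

In the actual method, each emitted token index would map back through the one-hot table to a SMILES character.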
The invention has the following positive effects. The invention provides a brand-new model consisting of an encoder, a multitask capsule classifier and a decoder. Its innovation lies in using the multitask capsule classifier to effectively capture the hard-to-quantify relationship between a drug molecule's structure and its multiple drug properties, to learn the correlated information among the data of the molecule's various properties, and to predict the multiple properties of generated molecules in advance, while the encoder and decoder realize the generation of molecules with multiple properties. Compared with previous machine-learning-based molecule generation methods, the method has the following advantages.
First, the method of the present invention produces better results than traditional machine-learning methods. A traditional generative model can target only one property at a time, so generating molecules with multiple properties requires multiple trainings. By applying the multitask capsule classifier, the invention realizes simultaneous classification and generation over multiple properties: molecules generated by the multitask capsule self-encoder can simultaneously satisfy target properties such as molecular weight, lipid-water partition coefficient, hydrogen-bond donors, hydrogen-bond acceptors, number of rotatable bonds, polar surface area, synthesizability and activity against specific targets, and at the same time the generated molecules exhibit novel scaffolds.
Second, the multitask capsule classifier used by the method performs better than a single-task capsule classifier and other conventional machine-learning methods. By exploiting the correlated information among different property data, it extracts molecular features effectively and improves prediction performance.
Drawings
FIG. 1 is a block flow diagram of the novel molecular generation method based on a multitask capsule self-encoder.
FIG. 2 is a model diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 3 shows the detailed steps of the novel molecule generation method based on the multitask capsule self-encoder.
FIG. 4 is a schematic diagram of the training of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 5 is a reconstruction diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 6 is a schematic diagram of the generation stage of the novel molecular generation method based on the multitask capsule self-encoder.
Detailed Description
The figures show specific processes for achieving molecular generation of multiple target properties using the present invention.
The invention provides a brand new molecule generation method based on a multitask capsule self-encoder, which relates to the cross technical field of computer artificial intelligence and new drug molecule design.
Target properties of the generated molecules of the invention include: (1) molecular weight; (2) lipid-water partition coefficient; (3) hydrogen-bond donors; (4) hydrogen-bond acceptors; (5) number of rotatable bonds; (6) polar surface area; (7) synthesizability; (8) activity against PDGF, renin, Bcl-2 and other targets.
See fig. 1.
The method comprises the steps of constructing an effective drug molecule database, measuring or calculating drug molecule property labels, constructing a self-encoder frame, constructing a multi-task capsule classifier frame, designing and realizing a data enhancement module, executing a generation process and the like.
See fig. 2.
The method is a brand-new molecule generation method based on a multitask capsule self-encoder. The model takes a self-encoder as a basic framework, and a multitask capsule classifier is configured in a hidden layer. The encoder directly encodes the drug molecules SMILES into vectors with fixed length by using a bidirectional long-short term memory network; the multi-task capsule classifier adopts a double-layer capsule layer to analyze and extract vector characteristics and predicts a property label of a drug molecule; the decoder decodes the hidden layer vector by using a long-short term memory network to realize output and molecule generation.
See fig. 3.
The method comprises the following specific operation steps:
step 1: collecting training data, extracting a one-hot encoding table of the molecules, and calculating property labels;
step 2: learning the characteristics of known drug molecules through a training phase to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, the target property molecules are generated using the training model.
In the present invention, the step 1 specifically comprises: collecting drug molecules in a database, and representing the drug molecules by adopting SMILES; testing or calculating physical, chemical or biological properties of the drug molecule; selecting a reasonable threshold value to convert the quantitative data into a qualitative category label; the training data includes both drug molecules SMILES and property labels. The training data will be used in the training process of the model.
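The one-hot encoding table used in step 1 can be sketched as follows. The character vocabulary and padding length here are illustrative assumptions, since the patent does not fix them.

```python
# Sketch of one-hot encoding a SMILES string against a character vocabulary.
# Positions beyond the string's length are left as all-zero padding rows.
def one_hot_encode(smiles, vocab, max_len):
    """Encode a SMILES string as a (max_len x len(vocab)) one-hot matrix."""
    char_to_idx = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][char_to_idx[ch]] = 1
    return matrix

# Example: encode ethanol ("CCO") with a toy vocabulary.
vocab = ["C", "O", "N", "(", ")", "=", "1", "2"]
encoded = one_hot_encode("CCO", vocab, max_len=5)
```

A real vocabulary would be extracted from the full training set, including ring-closure digits, bond symbols and multi-character atom tokens.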
See fig. 4.
In the present invention, the step 2 specifically includes: inputting training data into the multitask capsule self-encoder for training; manually adjusting the model's hyperparameters (learning rate, number of neurons, number of training steps) over multiple runs, keeping the training model with the minimum cross-entropy loss value; and retaining the best model across the multiple training runs as the training model.
See fig. 5.
In the present invention, the step 3 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; a decoder decodes the fixed length vectors into reconstructed molecular data; calculating a reconstruction rate by reconstructing the molecular data; and saving the reconstructed molecular data.
See fig. 6.
In the present invention, the step 4 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; predicting the property of the training molecule by the multi-task capsule classifier, and reserving the target property molecule; performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution; the decoder decodes the new vector distribution into new generated molecular data; manually debugging the hyper-parameters of the data enhancement process for multiple times, and keeping the best generation result; when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
The encoder directly encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network, comprising the following steps:

1) The forward recurrent neural network reads the input sequence from $x_1$ to $x_{T_x}$ and computes the forward hidden states; the backward recurrent neural network reads the input sequence from $x_{T_x}$ to $x_1$ and computes the backward hidden states:

$$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$$

where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;

2) The hidden state at time $t$ is the concatenation of the forward and backward hidden states, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $h_t$ denotes the hidden state at time $t$;

3) The hidden-layer vector is generated from the hidden-state sequence:

$$c = q(\{h_1, \dots, h_{T_x}\})$$

where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
The hidden-layer vector is mapped to the double capsule layer, and the number of Routing iterations is optimized and adjusted; the multitask capsule classifier predicts the properties of the training molecules, and the target-property molecules are retained. This specifically comprises the following steps:
1) Matrix transformation: a prediction vector is computed from the mapped hidden-layer vector:

$$\hat{u}_{j|i} = W_{ij} u_i$$

where $i$ indexes the first capsule layer, $j$ the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ the weight matrix learned by back propagation, and $u_i$ the output of capsule $i$;

2) The total input vector $s_j$ of capsule $j$ is computed as the weighted sum of all prediction vectors:

$$s_j = \sum_i c_{ij} \hat{u}_{j|i}$$

where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;

3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

where $b_{ij}$ denotes the log prior probability of the coupling strength between capsule $i$ and capsule $j$; before the routing iterations, $b_{ij}$ is initialized to 0, and the values of $c_{ij}$ are updated over the routing iterations;

4) The vector output of capsule $j$ is computed with a nonlinear squash function:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|}$$

where $v_j$ denotes the vector output of capsule $j$;

5) The loss $L_k$ of each capsule in the capsule layer is computed:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$$

where $L_k$ denotes the loss of each capsule, $T_k$ an indicator function, and $m^+$ and $m^-$ the upper and lower margins respectively.
The decoder decodes the hidden-layer vector using a long short-term memory network, specifically comprising:

1) The predicted character at time $t$ is generated from the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$:

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

$$s_t = f'(s_{t-1}, y_{t-1}, c)$$

where $y_t$ denotes the predicted character at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;

2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:

$$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$$

where $p(Y)$ denotes the probability of the predicted sequence $Y$.
An example embodiment is given below.
Generating molecules which simultaneously satisfy the properties of molecular weight, lipid-water partition coefficient, hydrogen bond donor, hydrogen bond acceptor, number of rotatable bonds, polar surface area, synthesizability and the like. The implementation process is as follows:
The first step: drug molecules (1757517 compounds) were collected from the open-source ChEMBL database (https://www.ebi.ac.uk/chembl/) and represented with SMILES.
The second step: the molecular weight, lipid-water partition coefficient, hydrogen-bond donors, hydrogen-bond acceptors, number of rotatable bonds, polar surface area and synthesizability of the drug molecules are calculated with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs. Molecules with molecular weight less than or equal to 500, lipid-water partition coefficient greater than or equal to 0 and less than or equal to 5, hydrogen-bond donors less than or equal to 5, hydrogen-bond acceptors less than or equal to 10, number of rotatable bonds less than or equal to 20, polar surface area less than or equal to 200 and synthesizability less than or equal to 6 are selected as target-property molecules and labeled 1. The training data comprise both the drug-molecule SMILES and the property labels, and are saved in SMI format.
The third step: and establishing a brand-new molecular generation model based on the multitask capsule self-encoder by using the training data. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. In the training stage, the learning rate, the number of neurons and the training step number of the model are adjusted manually for multiple times, and the optimal model in the multiple training process is reserved as a training model. The present example is debugged from the following aspects:
batch size candidate range for training phase: 128 256, 512 and 1028; network iteration number candidate range: from 100 to 1000, each change is increased by 100;
the encoder uses a bidirectional long short-term memory network to directly encode the drug-molecule SMILES into fixed-length vectors. Candidate range for neurons per encoder layer: 128, 192 and 256; the number of encoder layers is set to 1;
the multitask capsule classifier is composed of double capsule layers, and classification over multiple target properties is achieved by optimizing and adjusting the number of Routing iterations. Candidate range for neurons per capsule layer: 128, 192 and 256; candidate range for the number of routing iterations in the capsule part: 1, 2, 3, 4 and 5; the loss weight of the capsule classifier is set to 10; AdamOptimizer is selected as the optimizer; learning-rate candidate range for the capsule classifier: 0.001 to 0.01, in increments of 0.001;
the decoder decodes the hidden-layer vector using a long short-term memory network. Candidate range for neurons per decoder layer: 256, 384 and 512; the number of decoder layers is set to 1.
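The manual tuning over the candidate ranges above can be organized as a simple grid search that keeps the configuration with the lowest loss. The sketch below uses three of the quoted ranges; `train_and_eval` is a hypothetical placeholder for one full training run returning the cross-entropy loss, not a function defined by the patent.

```python
from itertools import product

# Candidate ranges quoted from the embodiment (batch size, encoder neurons
# per layer, routing iterations).
batch_sizes = [128, 256, 512, 1028]
encoder_neurons = [128, 192, 256]
routing_iters = [1, 2, 3, 4, 5]

def train_and_eval(batch, neurons, routing):
    # Placeholder objective: a real run would train the multitask capsule
    # self-encoder with this configuration and return its validation
    # cross-entropy loss.
    return (abs(batch - 256) / 1000
            + abs(neurons - 192) / 100
            + abs(routing - 3))

# Keep the configuration with the minimum loss, mirroring "keep the
# training model with the minimum cross-entropy loss value".
best = min(product(batch_sizes, encoder_neurons, routing_iters),
           key=lambda cfg: train_and_eval(*cfg))
```

With the placeholder objective, the grid search selects the configuration closest to (256, 192, 3); in practice each candidate would require a full training run, so the grid is usually pruned by hand as the patent describes.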
The fourth step: in the reconstruction stage, the molecules are reconstructed through the training model, and reconstructed molecule files are stored.
Batch size candidate range for the reconstruction phase: 500, 1000, 1500 and 2000; the number of batches is set to 10.
The fifth step: in the generation stage, molecule generation is carried out through a training model, a compound which simultaneously meets the requirements of molecular weight, a lipid-water distribution coefficient, a hydrogen bond donor, a hydrogen bond acceptor, the number of rotatable bonds, polar surface area and synthesizability is generated, and a generated molecule file is stored.
Batch size candidate range for the generation phase: 500, 1000, 1500 and 2000; the number of batches is set to 10; the standard deviation of the normal distribution used in the data enhancement process is set to 0.2.
Drug small-molecule design methods play a key role in the active-drug discovery process. Traditional drug design methods such as virtual screening and pharmacophore models are mainly used to search libraries of known virtual compounds. Because of the enormous number of drug molecules in chemical space and the limitations of current computing performance, searching the entire chemical space is impractical, and the analysis and processing of search results also requires a great deal of expertise. The de novo molecular design method based on deep neural networks, as a novel artificial intelligence technology, can be used to generate molecules with required properties; its advantage is that new molecules with optimized properties can be generated without enumerating virtual compound libraries.
However, the existing molecular generation method only considers the generation of a molecule with one target property, and the method has low efficiency, cannot limit various properties of the molecule, and is difficult to meet the requirement of new drug molecular design. The multitask capsule self-encoder provided by the invention can effectively acquire the relationship which is difficult to quantify between the molecular structure of the medicine and the properties of various medicines, learn the related information among the data of various properties of the medicine molecules, realize the molecules which simultaneously meet the physical, chemical and biological properties, and improve the effectiveness and novelty of the generated molecules.
Claims (9)
1. A brand-new molecule generation method based on a multitask capsule self-encoder neural network, characterized by comprising: representing drug molecules as SMILES (simplified molecular-input line-entry system) strings and annotating their target-property labels; establishing, on a self-encoder framework, a brand-new molecule generation model comprising an encoder, a multitask capsule classifier and a decoder; the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network; the multitask capsule classifier adopts a double capsule layer, optimizes a Margin Loss, and encodes and predicts the property labels of the drug molecules; the decoder decodes the hidden-layer vector using a long short-term memory network, realizing the reconstruction from input to output;
the method comprises the following steps:
step 1: collecting training data, extracting a one-hot encoding table of the molecules, and calculating property labels;
step 2: learning the characteristics of known drug molecules through a training phase to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, molecules with specific properties are generated by using a training model.
2. The completely new molecule generating method based on the multitask capsule self-encoder neural network according to claim 1, wherein the step 1 specifically comprises the following steps:
collecting drug molecules and establishing a specific data set;
drug molecules are represented using SMILES (simplified molecular linear input specification);
calculating or collecting the target-property data of the drug molecules; if the data are quantitative, selecting a reasonable threshold to convert them into qualitative representations, i.e. target property = 1 and non-target property = 0; all molecular descriptors are calculated with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both the drug molecule SMILES and a specific property label.
3. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 2 specifically comprises:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually adjusting the model's hyperparameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the training model with the minimum cross-entropy loss value;
and reserving the optimal model in the multiple training processes as a pre-training model.
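The manual tuning described above amounts to a small grid search that keeps the run with the lowest cross-entropy loss. In the sketch below, `train_once` is a mock stand-in (the real function trains the full network and returns its final loss), and the candidate values are assumptions.

```python
def train_once(learning_rate, n_neurons, n_steps):
    """Mock stand-in for one full training run of the multitask capsule
    autoencoder; returns a synthetic 'cross-entropy loss' for illustration."""
    return abs(learning_rate - 1e-3) * 100 + abs(n_neurons - 256) / 256 + 1.0 / n_steps

best_loss, best_config = float("inf"), None
for lr in (1e-2, 1e-3, 1e-4):
    for neurons in (128, 256, 512):
        for steps in (1_000, 5_000):
            loss = train_once(lr, neurons, steps)
            if loss < best_loss:          # keep the minimum-loss model
                best_loss, best_config = loss, (lr, neurons, steps)
```

The configuration in `best_config` plays the role of the pre-trained model retained in the claim.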
4. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 1, wherein step 3 specifically comprises:
running the trained model: the encoder encodes the training data in batches into fixed-length vectors;
the decoder decodes the fixed-length vectors into reconstructed molecular data;
computing the reconstruction rate from the reconstructed molecular data;
and saving the reconstructed molecular data.
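One reasonable reading of the reconstruction rate (the patent does not spell out the formula) is the fraction of molecules whose decoded SMILES matches the input exactly:

```python
def reconstruction_rate(inputs, reconstructions):
    """Fraction of molecules reconstructed exactly (SMILES string match).

    This exact-match definition is an assumption; a softer per-character
    accuracy would also fit the claim.
    """
    matches = sum(1 for a, b in zip(inputs, reconstructions) if a == b)
    return matches / len(inputs)
```

For example, `reconstruction_rate(["CCO", "CCN"], ["CCO", "CCC"])` gives 0.5.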
5. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 1, wherein step 4 specifically comprises:
running the trained model: the encoder encodes the training data in batches into fixed-length vectors;
the multitask capsule classifier encodes and predicts the properties of the training molecules;
performing data enhancement on the vector representations of target-property molecules to obtain a new vector distribution;
the decoder decodes the new vector distribution into molecular data;
manually tuning the hyperparameters of the data-enhancement process over multiple runs, keeping the best generation result;
and storing the generated molecular data once the number of generated molecules reaches a predetermined count.
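The data-enhancement step is most naturally read as perturbing the fixed-length latent vectors of target-property molecules to sample nearby points; a Gaussian-noise version might look like the sketch below, where `sigma` and `n_samples` are assumed to be among the manually tuned hyperparameters.

```python
import random

def enhance(latent, sigma=0.05, n_samples=100, seed=0):
    """Sample new latent vectors around a target-property molecule's vector
    by adding Gaussian noise (one plausible data-enhancement scheme)."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in latent] for _ in range(n_samples)]
```

Each sampled vector is then passed to the decoder to produce a candidate molecule.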
6. The de novo molecule generation method based on the multitask capsule autoencoder neural network as claimed in claim 5, wherein the multitask capsule autoencoder neural network comprises an encoder, a multitask capsule classifier, and a decoder; the training data serve as the input of the encoder, and the output of the encoder serves as the input of the multitask capsule classifier; the output of the multitask capsule classifier serves as the input of the decoder.
7. The de novo molecule generation method based on the multitask capsule autoencoder neural network as claimed in claim 6, wherein:
the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional LSTM network, in 3 parts:
1) the forward recurrent network reads the input sequence from x_1 to x_Tx and computes the forward hidden states h_t^f = f(x_t, h_(t-1)^f); the backward recurrent network reads the input sequence from x_Tx to x_1 and computes the backward hidden states h_t^b = f(x_t, h_(t+1)^b);
where x_Tx denotes the character at time Tx, h_Tx^f denotes the forward hidden state at time Tx, h_Tx^b denotes the backward hidden state at time Tx, and f denotes a nonlinear function;
2) the hidden state h_t at time t is the concatenation of the two directional states: h_t = [h_t^f; h_t^b];
3) the hidden-layer vector is generated from the sequence of hidden states: c = q({h_1, ..., h_Tx}),
where c denotes the vector generated from the hidden-state sequence and q denotes a nonlinear function.
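The three encoder parts can be sketched with a toy recurrent cell standing in for the LSTM cell; the cell `f`, the zero initial states, and the choice of q as "take the last concatenated state" are all illustrative assumptions.

```python
import math

def f(x_t, h_prev):
    """Toy recurrent cell standing in for a full LSTM cell."""
    return [math.tanh(x + h) for x, h in zip(x_t, h_prev)]

def encode(xs):
    """Bidirectional pass: h_t = [h_t^f ; h_t^b], then c = q(h_1 .. h_Tx)."""
    dim = len(xs[0])
    fwd, h = [], [0.0] * dim
    for x in xs:                       # forward read: x_1 .. x_Tx
        h = f(x, h)
        fwd.append(h)
    bwd, h = [], [0.0] * dim
    for x in reversed(xs):             # backward read: x_Tx .. x_1
        h = f(x, h)
        bwd.append(h)
    bwd.reverse()
    hidden = [hf + hb for hf, hb in zip(fwd, bwd)]   # per-step concatenation
    return hidden[-1]                  # q(.) taken here as the last hidden state
```

In the real model each `x_t` is a one-hot character vector and `f` is an LSTM cell.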
8. The de novo molecule generation method based on the multitask capsule autoencoder neural network as claimed in claim 6, wherein:
the multitask capsule classifier uses two stacked capsule layers optimized with margin loss to predict the property labels of the drug molecules;
the hidden-layer vector is mapped into the two capsule layers, and the number of routing iterations is tuned, specifically comprising:
1) matrix transformation: computing the prediction vectors from the hidden-layer mapping:
û_(j|i) = W_ij u_i
where i indexes capsules in the first layer, j indexes capsules in the second layer, û_(j|i) denotes the prediction vector, W_ij denotes the weight matrix learned by back-propagation, and u_i denotes the output of capsule i;
2) computing the total input vector s_j of capsule j as the weighted sum of all prediction vectors:
s_j = Σ_i c_ij û_(j|i)
where s_j denotes the total input vector and c_ij denotes the coupling coefficient;
3) computing the coupling coefficients c_ij with a softmax activation function:
c_ij = exp(b_ij) / Σ_k exp(b_ik)
where b_ij denotes the log probability of the connection strength between capsule i and capsule j; b is initialized to 0 before routing, and c_ij is updated over the routing iterations;
4) computing the vector output of capsule j with the nonlinear squash function:
v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)
where v_j denotes the vector output of capsule j;
5) computing the loss L_k of each capsule in the capsule layer:
L_k = T_k max(0, m^+ - ||v_k||)^2 + λ(1 - T_k) max(0, ||v_k|| - m^-)^2
where L_k denotes the loss of each capsule, T_k denotes an indicator function, and m^+ and m^- denote the upper and lower margins, respectively.
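Steps 1)–5) correspond to standard dynamic routing between capsules. A NumPy sketch follows; the tensor shapes, routing-iteration count, and the margins m^+ = 0.9, m^- = 0.1, λ = 0.5 are the usual capsule-network defaults, which the patent does not confirm.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Step 4: v_j = (||s||^2 / (1 + ||s||^2)) * s / ||s||."""
    norm2 = (s ** 2).sum(axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def route(u_hat, n_iter=3):
    """Dynamic routing. u_hat[i, j] is the prediction vector W_ij u_i (step 1)."""
    b = np.zeros(u_hat.shape[:2])                            # logits, initialized to 0
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # step 3: softmax
        s = (c[..., None] * u_hat).sum(axis=0)                # step 2: s_j
        v = squash(s)
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)          # agreement update
    return v

def margin_loss(v, T, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Step 5: L_k per output capsule, summed over capsules."""
    norm = np.sqrt((v ** 2).sum(axis=-1))
    L = (T * np.maximum(0.0, m_plus - norm) ** 2
         + lam * (1 - T) * np.maximum(0.0, norm - m_minus) ** 2)
    return L.sum()
```

The squash function keeps every output capsule's length below 1, so it can be read as the probability that the capsule's property is present.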
9. The de novo molecule generation method based on the multitask capsule autoencoder neural network as claimed in claim 6, wherein the decoder decodes the hidden-layer vector with an LSTM network, specifically comprising:
1) generating the predicted character at time t from the hidden vector c and the characters y_1, ..., y_(t-1) predicted before time t:
p(y_t | {y_1, ..., y_(t-1)}, c) = g(y_(t-1), s_t, c)
s_t = f'(s_(t-1), y_(t-1), c)
where y_t denotes the predicted character at time t, s_t denotes the hidden state of the decoder, and g and f' denote nonlinear functions;
2) computing the probability of the predicted sequence Y as the product of the probabilities of all predicted characters:
p(Y) = Π_(t=1..T) p(y_t | {y_1, ..., y_(t-1)}, c)
where p(Y) denotes the probability of the predicted sequence Y.
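The two decoder steps can be sketched as a greedy autoregressive loop. Here `step` stands in for one LSTM step (the g and f' functions combined) and returns the next character, the next state, and that character's probability; the start/end markers are assumptions.

```python
import math

def decode(c, step, start="^", end="$", max_len=50):
    """Greedy autoregressive decoding: accumulates
    log p(Y) = sum_t log p(y_t | y_<t, c) while emitting characters."""
    y, state, log_p, chars = start, None, 0.0, []
    for _ in range(max_len):
        y, state, p = step(y, state, c)   # one LSTM step: g(y_{t-1}, s_t, c)
        log_p += math.log(p)
        if y == end:
            break
        chars.append(y)
    return "".join(chars), log_p
```

In the real model, `step` runs the LSTM cell on the previous character and the hidden vector c and samples or argmaxes the next character from the softmax output.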
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011247808.1A CN112270951B (en) | 2020-11-10 | 2020-11-10 | Brand-new molecule generation method based on multitask capsule self-encoder neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112270951A (en) | 2021-01-26 |
CN112270951B (en) | 2022-11-01 |
Family
ID=74339427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011247808.1A Active CN112270951B (en) | 2020-11-10 | 2020-11-10 | Brand-new molecule generation method based on multitask capsule self-encoder neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112270951B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112562869A (en) * | 2021-02-24 | 2021-03-26 | 北京中医药大学东直门医院 | Drug combination safety evaluation system, method and device |
CN113223637B (en) * | 2021-05-07 | 2023-07-25 | 中国科学院自动化研究所 | Medicine molecular generator training method based on domain knowledge and deep reinforcement learning |
CN113488119B (en) * | 2021-06-18 | 2024-02-02 | 重庆医科大学 | Drug small molecule numerical value characteristic structured database and establishment method thereof |
CN113470740A (en) * | 2021-06-30 | 2021-10-01 | 中国石油大学(华东) | Medicine recommendation system, computer equipment and storage medium based on fully-connected network integrated deep learning model |
CN114049922B (en) * | 2021-11-09 | 2022-06-03 | 四川大学 | Molecular design method based on small-scale data set and generation model |
CN114496112B (en) * | 2022-01-21 | 2023-10-31 | 内蒙古工业大学 | Intelligent quantification method for anti-breast cancer drug ingredients based on multi-objective optimization |
CN114446414B (en) * | 2022-01-24 | 2023-05-23 | 电子科技大学 | Reverse synthetic analysis method based on quantum circulation neural network |
CN114937478B (en) * | 2022-05-18 | 2023-03-10 | 北京百度网讯科技有限公司 | Method for training a model, method and apparatus for generating molecules |
CN114913938B (en) * | 2022-05-27 | 2023-04-07 | 中南大学 | Small molecule generation method, equipment and medium based on pharmacophore model |
GB2621108A (en) * | 2022-07-08 | 2024-02-07 | Topia Life Sciences Ltd | An automated system for generating novel molecules |
CN117334271A (en) * | 2023-09-25 | 2024-01-02 | 江苏运动健康研究院 | Method for generating molecules based on specified attributes |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108073780A (en) * | 2016-11-14 | 2018-05-25 | 王�忠 | A kind of method of the relatively clinical efficacy of Chinese medicine compound prescription |
CN111126554A (en) * | 2018-10-31 | 2020-05-08 | 深圳市云网拜特科技有限公司 | Drug lead compound screening method and system based on generation of confrontation network |
CN111584010A (en) * | 2020-04-01 | 2020-08-25 | 昆明理工大学 | Key protein identification method based on capsule neural network and ensemble learning |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106146670B (en) * | 2015-04-24 | 2019-01-15 | 宜明昂科生物医药技术(上海)有限公司 | A kind of new recombination double functions fusion protein and its preparation and application |
US10776712B2 (en) * | 2015-12-02 | 2020-09-15 | Preferred Networks, Inc. | Generative machine learning systems for drug design |
KR102574256B1 (en) * | 2017-04-25 | 2023-09-01 | 더 보드 어브 트러스티스 어브 더 리랜드 스탠포드 주니어 유니버시티 | Dose Reduction for Medical Imaging Using Deep Convolutional Neural Networks |
WO2019070978A1 (en) * | 2017-10-06 | 2019-04-11 | Mayo Foundation For Medical Education And Research | Ecg-based cardiac ejection-fraction screening |
KR102587959B1 (en) * | 2018-01-17 | 2023-10-11 | 삼성전자주식회사 | Method and apparatus for generating chemical structure using neural network |
GB2573102A (en) * | 2018-04-20 | 2019-10-30 | Drugai Ltd | Interaction property prediction system and method |
CN111128314B (en) * | 2018-10-30 | 2023-07-21 | 深圳市云网拜特科技有限公司 | Drug discovery method and system |
CN109979541B (en) * | 2019-03-20 | 2021-06-22 | 四川大学 | Method for predicting pharmacokinetic property and toxicity of drug molecules based on capsule network |
CN110473595A (en) * | 2019-07-04 | 2019-11-19 | 四川大学 | A kind of capsule network Relation extraction model in the most short interdependent path of combination |
CN110634539A (en) * | 2019-09-12 | 2019-12-31 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based drug molecule processing method and device and storage medium |
CN110970099B (en) * | 2019-12-10 | 2023-04-28 | 北京大学 | Drug molecule generation method based on regularized variation automatic encoder |
CN111508568B (en) * | 2020-04-20 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Molecule generation method, molecule generation device, computer readable storage medium and terminal device |
CN111785326B (en) * | 2020-06-28 | 2024-02-06 | 西安电子科技大学 | Gene expression profile prediction method after drug action based on generation of antagonism network |
CN111814460B (en) * | 2020-07-06 | 2021-02-09 | 四川大学 | External knowledge-based drug interaction relation extraction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112270951A (en) | 2021-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112270951B (en) | Brand-new molecule generation method based on multitask capsule self-encoder neural network | |
CN113707235B (en) | Drug micromolecule property prediction method, device and equipment based on self-supervision learning | |
JP2023082017A (en) | computer system | |
CN112561064B (en) | Knowledge base completion method based on OWKBC model | |
Xiao et al. | History-based attention in Seq2Seq model for multi-label text classification | |
US11354582B1 (en) | System and method for automated retrosynthesis | |
KR102491346B1 (en) | Method, apparatus and computer program for generating formalized research record data automatically for learning artificial intelligence model | |
CN114999565A (en) | Drug target affinity prediction method based on representation learning and graph neural network | |
WO2022188653A1 (en) | Molecular scaffold hopping processing method and apparatus, medium, electronic device and computer program product | |
Yuan et al. | DeCban: prediction of circRNA-RBP interaction sites by using double embeddings and cross-branch attention networks | |
Zhao et al. | Exploiting multiple question factors for knowledge tracing | |
Wu et al. | Semi-supervised cross-modal hashing via modality-specific and cross-modal graph convolutional networks | |
Bhardwaj et al. | Computational biology in the lens of CNN | |
Fan et al. | Surrogate-assisted evolutionary neural architecture search with network embedding | |
Wang et al. | Artificial Intelligence in | |
Luo et al. | A Caps-UBI model for protein ubiquitination site prediction | |
CN116843995A (en) | Method and device for constructing cytographic pre-training model | |
Chen et al. | Personalized expert recommendation systems for optimized nutrition | |
Wang et al. | Molecular property prediction by contrastive learning with attention-guided positive sample selection | |
US11568961B2 (en) | System and method for accelerating FEP methods using a 3D-restricted variational autoencoder | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
CN115240787A (en) | Brand-new molecule generation method based on deep conditional recurrent neural network | |
US20220198286A1 (en) | System and method for molecular reconstruction from molecular probability distributions | |
Song | Distilling knowledge from user information for document level sentiment classification | |
Ma et al. | Target-Embedding Autoencoder With Knowledge Distillation for Multi-Label Classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||