CN112270951B - Brand-new molecule generation method based on multitask capsule self-encoder neural network - Google Patents

Brand-new molecule generation method based on multitask capsule self-encoder neural network

Info

Publication number
CN112270951B
CN112270951B
Authority
CN
China
Prior art keywords
capsule
multitask
molecules
encoder
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011247808.1A
Other languages
Chinese (zh)
Other versions
CN112270951A (en)
Inventor
邹俊
杨胜勇
李侃
杨欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011247808.1A priority Critical patent/CN112270951B/en
Publication of CN112270951A publication Critical patent/CN112270951A/en
Application granted granted Critical
Publication of CN112270951B publication Critical patent/CN112270951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a brand-new molecule generation method based on a multitask capsule self-encoder neural network. The method represents drug molecules as SMILES (simplified molecular-input line-entry system) strings, annotates target-property labels, and learns the characteristics of known drug molecules in a training phase to obtain a training model; the training model reconstructs molecules in a reconstruction phase; and in a generation phase the training model generates molecules that simultaneously possess multiple preset target properties, among which are a large number of new molecules with new scaffolds. The invention can be used to generate many kinds of molecules such as drugs or other compounds: a single training run learns the characteristics and properties of known drugs, after which molecules that simultaneously satisfy the required physical, chemical and biological properties can be generated. The molecules generated by the method show higher validity and better properties.

Description

Brand-new molecule generation method based on multitask capsule self-encoder neural network
Technical Field
The invention relates to the interdisciplinary field of computer artificial intelligence and de novo molecular design, and in particular to a brand-new molecule generation method using a multitask capsule self-encoder neural network, that is, a method for de novo molecular design built on a self-encoder framework combined with a multitask capsule classifier framework, suitable for generating molecules that simultaneously satisfy multiple physical, chemical and biological properties.
Background
The design of small-molecule drugs plays a key role in the development of active drugs. Traditional drug design methods such as virtual screening and pharmacophore modeling mainly search libraries of known virtual compounds. Owing to the enormous number of potentially synthesizable molecules in chemical space (10^23 to 10^60) and the limits of current computing performance, a global search of the entire chemical space is difficult, and analyzing and processing the search results requires a great deal of expertise. As a data-driven computational approach, artificial intelligence can automatically learn knowledge such as the chemical structures and structure-activity relationships of drug molecules from a data set, help scientists design molecules with target properties, and bring new hope to drug discovery and development. De novo molecular design methods based on deep neural networks, as a novel artificial intelligence technique, can be used to generate molecules with desired properties. Their advantage is that new molecules with optimized properties can be generated without enumerating virtual compound libraries. However, existing molecule generation methods only consider generating molecules with a single target property; they struggle to learn characteristics beyond that property, cannot optimize multiple molecular properties simultaneously, compromise the final generation quality, and cannot meet the needs of new drug molecular design. A key difficulty in molecule generation lies in the choice of classifier: a common support vector machine cannot be trained jointly with a deep neural network, while a convolutional neural network classifies poorly and is difficult to apply to the classification and generation of molecules with multiple target properties.
Disclosure of Invention
The invention aims to provide a method for generating molecules that simultaneously satisfy multiple target properties such as molecular weight, lipid-water partition coefficient, hydrogen-bond donors, hydrogen-bond acceptors, number of rotatable bonds, polar surface area, synthesizability, and activity against a specific target.
The invention provides a new model, which takes a self-encoder as a basic frame and integrates a multi-task capsule classifier in a hidden layer. By adopting the method, various drug molecules with optimized target properties can be effectively generated.
The technical scheme of the method is as follows:
a brand new molecule generation method based on a multitask capsule self-encoder neural network is characterized by comprising the following steps: the drug molecules are denoted as SMILES (simplified molecule linear input specification), labeling the target property labels of the molecules. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. The encoder encodes drug molecules SMILES into vectors with fixed length by using a bidirectional long-short term memory network; the multitask capsule classifier utilizes double-layer capsule layer codes to represent vectors of the drug molecular property labels; the decoder directly decodes the hidden layer vector based on the long-term and short-term memory network to generate SMILES of molecules, and the reconstruction of the molecules or the generation of brand new molecules are realized.
The method comprises the following steps:
step 1: collecting training data, extracting the one-hot encoding table of the molecular SMILES, and calculating the property labels;
step 2: learning the characteristics of known drug molecules SMILES through a training stage to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, molecules with specific properties are generated by using a training model.
Further, the step 1 specifically includes:
collecting drug molecules and establishing a specific training data set;
using SMILES to represent drug molecules;
determining the target properties of the drug molecules by experiment or calculation; if a property is quantitative, selecting a reasonable threshold to convert it into qualitative data, i.e. target property = 1 and non-target property = 0; the calculation of molecular target properties is completed with the PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both drug molecules SMILES and specific property labels;
further, the step 2 specifically includes:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually tuning the model's hyperparameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the training model with the smallest cross-entropy loss value;
and keeping the best model in multiple training processes as a training model.
Further, the step 3 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
a decoder decodes the fixed length vectors into reconstructed molecular data;
calculating a reconstruction rate by reconstructing the molecular data;
and saving the reconstructed molecular data.
Further, the step 4 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
the multitask capsule classifier calculates the property features of the molecules, which are used to generate molecules having the target properties;
performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution;
the decoder decodes the new vector distribution into the sub-data;
manually debugging the hyper-parameters of the data enhancement process for multiple times, and keeping the best generation result;
when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
Further, the multitask capsule self-encoder neural network comprises an encoder, a multitask capsule classifier and a decoder, wherein the training data serves as the input of the encoder, and the output of the encoder serves as the input of the multitask capsule classifier; the output of the multitask capsule classifier is used as the input of the decoder.
Further, the encoder directly encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network, in 3 parts:
1) The forward recurrent network reads the input sequence from $x_1$ to $x_{T_x}$ and computes the forward hidden states $\overrightarrow{h}_t$; the backward recurrent network reads the input sequence from $x_{T_x}$ to $x_1$ and computes the backward hidden states $\overleftarrow{h}_t$:
$$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$$
where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h}_{T_x}$ and $\overleftarrow{h}_{T_x}$ denote the forward and backward hidden states at time $T_x$, and $f$ is a nonlinear function;
2) The hidden state $h_t$ at time $t$ is obtained from the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:
$$h_t = \left[\overrightarrow{h}_t;\ \overleftarrow{h}_t\right]$$
3) The hidden-layer vector is generated from the sequence of hidden states:
$$c = q(\{h_1, \dots, h_{T_x}\})$$
where $c$ denotes the vector generated from the hidden-state sequence and $q$ is a nonlinear function.
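The three encoder parts above reduce, after the two recurrent passes, to a concatenation and a summary function. A minimal NumPy sketch follows, with random stand-ins for the LSTM hidden states and a mean over time as one illustrative choice for the unspecified function q:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # sequence length, per-direction hidden size

# Stand-ins for the forward and backward LSTM hidden states (assumed shapes;
# a real model would compute these from the one-hot SMILES input).
h_fwd = rng.normal(size=(T, d))  # forward states, computed left to right
h_bwd = rng.normal(size=(T, d))  # backward states, computed right to left

# h_t = [h_fwd_t ; h_bwd_t]: concatenate the two directions at each step.
h = np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2d)

# c = q({h_1..h_T}): the patent leaves q abstract; a mean over time is one
# simple illustrative way to obtain a fixed-length hidden-layer vector.
c = h.mean(axis=0)                            # shape (2d,)
```

Whatever the sequence length, `c` has fixed dimension 2d, which is what "fixed-length vector" requires.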
Further, the multitask capsule classifier adopts double-layer capsule layers optimized with a margin loss to encode and predict the property labels of the drug molecules. The hidden-layer vector is mapped to the two capsule layers and the number of routing iterations is tuned, specifically:
1) Matrix transformation: the prediction vectors are computed from the hidden-layer vector mapping:
$$\hat{u}_{j|i} = W_{ij}\, u_i$$
where $i$ indexes the first capsule layer, $j$ indexes the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ is a weight matrix learned by backpropagation, and $u_i$ is the output of capsule $i$;
2) The total input vector $s_j$ of capsule $j$ is the weighted sum of all prediction vectors:
$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i}$$
where $s_j$ denotes the total input vector and $c_{ij}$ the coupling coefficients;
3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$
where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; it is initialized to 0 before routing, and $c_{ij}$ is updated over the routing iterations;
4) The vector output of capsule $j$ is computed by a nonlinear squashing function:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|}$$
where $v_j$ denotes the vector output of capsule $j$;
5) The loss $L_k$ of each capsule in the capsule layer is:
$$L_k = T_k \max(0,\, m^+ - \|v_k\|)^2 + \lambda\,(1 - T_k)\max(0,\, \|v_k\| - m^-)^2$$
where $L_k$ denotes the loss of each capsule, $T_k$ is an indicator (1 or 0 according to the property label), $m^+$ and $m^-$ are the upper and lower margins, and $\lambda$ is a scaling factor weighting the two terms.
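Steps 4) and 5) above can be sketched directly from the formulas. In this NumPy illustration the margins m+ = 0.9, m- = 0.1 and λ = 0.5 are the commonly used capsule-network defaults, assumed here because the patent leaves them unspecified:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Nonlinear squashing: v = (|s|^2 / (1 + |s|^2)) * s / |s|.
    Shrinks short vectors toward 0 and long vectors toward unit length."""
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def margin_loss(v, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_k = T_k max(0, m+ - |v_k|)^2 + lam (1 - T_k) max(0, |v_k| - m-)^2,
    summed over the property capsules."""
    norms = np.linalg.norm(v, axis=-1)
    loss = labels * np.maximum(0.0, m_pos - norms) ** 2 \
         + lam * (1 - labels) * np.maximum(0.0, norms - m_neg) ** 2
    return loss.sum()

# Two toy capsule inputs s_j: one long (confident "yes"), one short ("no").
s = np.array([[3.0, 4.0], [0.1, 0.0]])
v = squash(s)
loss = margin_loss(v, labels=np.array([1.0, 0.0]))
```

The squashed norm serves as the predicted probability that the molecule has property k, which is what the margin loss penalizes against the 0/1 label.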
Further, the decoder decodes the hidden-layer vector encoded by the multitask capsule classifier using a long short-term memory network, specifically:
1) From the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$, the predicted character at time $t$ is generated:
$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c), \qquad s_t = f'(s_{t-1}, y_{t-1}, c)$$
where $y_t$ denotes the predicted character at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:
$$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$$
where $p(Y)$ denotes the probability of the predicted sequence $Y$.
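The sequence probability of step 2) is a plain product of per-character probabilities. A minimal sketch with illustrative stand-in probabilities follows; summing log probabilities is shown as the numerically safer equivalent used in practice for long SMILES:

```python
import math

# Per-character conditional probabilities p(y_t | y_<t, c) emitted by the
# decoder for one predicted SMILES string; the values are stand-ins.
char_probs = [0.9, 0.8, 0.95, 0.7]

# p(Y) = product over t of p(y_t | y_<t, c).
p_Y = math.prod(char_probs)

# Equivalent log-domain form, which avoids underflow on long sequences.
log_p_Y = sum(math.log(p) for p in char_probs)
```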
The invention has the positive effects that: the invention provides a brand-new model which consists of an encoder, a multitask capsule classifier and a decoder. The invention has the innovation points that the multitask capsule classifier is utilized to effectively acquire the relationship which is difficult to quantify between the structure of the medicine molecule and the properties of various medicines, the relevant information among the data of various properties of the medicine molecule is learned, the various properties of the generated molecule can be predicted in advance, and the generation of the molecules with various properties is realized through the encoder and the decoder. Compared with other previous methods for generating molecules based on machine learning, the method has the advantages that:
first, the method of the present invention produces better results than the traditional machine learning method. Traditional generative models can only generate one target property, and multiple training is needed to generate molecules with multiple properties. The invention realizes the simultaneous classification and generation of various properties by applying the multitask capsule classifier, the molecules generated by the multitask capsule self-encoder can simultaneously meet various target properties such as molecular weight, lipid-water distribution coefficient, hydrogen bond donor, hydrogen bond acceptor, rotatable bond quantity, polar surface area, synthesizability, specific target activity and the like, and meanwhile, the generated molecules have new frameworks.
Second, the method of the present invention uses a multitask capsule classifier that performs better than a single task capsule classifier and other conventional machine learning methods. The multi-task capsule classifier can effectively extract the characteristics of molecules by utilizing the related information among different property data, and improve the prediction effect.
Drawings
FIG. 1 is a block flow diagram of the novel molecular generation method based on a multitask capsule self-encoder.
FIG. 2 is a model diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 3 shows the detailed steps of the novel molecule generation method based on the multitask capsule self-encoder.
FIG. 4 is a schematic diagram of the training of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 5 is a reconstruction diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 6 is a schematic diagram of the novel molecular generation method based on the multitask capsule self-encoder.
Detailed Description
The figures show specific processes for achieving molecular generation of multiple target properties using the present invention.
The invention provides a brand new molecule generation method based on a multitask capsule self-encoder, relating to the interdisciplinary field of computer artificial intelligence and new drug molecular design.
Target properties of the generated molecules of the invention include: (1) molecular weight; (2) a lipid-water partition coefficient; (3) a hydrogen bond donor; (4) a hydrogen bond acceptor; (5) number of rotatable keys; (6) polar surface area; (7) synthesizability; (8) PDGF, renin, bcl-2 and other target activities.
See fig. 1.
The method comprises the steps of constructing an effective drug molecule database, measuring or calculating drug molecule property labels, constructing a self-encoder frame, constructing a multi-task capsule classifier frame, designing and realizing a data enhancement module, executing a generation process and the like.
See fig. 2.
The method is a brand-new molecule generation method based on a multitask capsule self-encoder. The model takes a self-encoder as a basic framework, and a multitask capsule classifier is configured in a hidden layer. The encoder directly encodes the drug molecules SMILES into vectors with fixed length by using a bidirectional long-short term memory network; the multi-task capsule classifier adopts a double-layer capsule layer to analyze and extract vector characteristics and predicts a property label of a drug molecule; the decoder decodes the hidden layer vector by using a long-short term memory network to realize output and molecule generation.
See fig. 3.
The method comprises the following specific operation steps:
step 1: collecting training data, extracting the one-hot encoding table of the molecules, and calculating the property labels;
step 2: learning the characteristics of known drug molecules through a training phase to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, the target property molecules are generated using the training model.
In the present invention, the step 1 specifically comprises: collecting drug molecules in a database, and representing the drug molecules by adopting SMILES; testing or calculating physical, chemical or biological properties of the drug molecule; selecting a reasonable threshold value to convert the quantitative data into a qualitative category label; the training data includes both drug molecules SMILES and property labels. The training data will be used in the training process of the model.
See fig. 4.
In the present invention, the step 2 specifically includes: inputting the training data into the multitask capsule self-encoder for training, manually tuning the model's hyperparameters (learning rate, number of neurons, number of training steps) over multiple runs, keeping the training model with the smallest cross-entropy loss value, and retaining the best model from the multiple training runs as the training model.
See fig. 5.
In the present invention, the step 3 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; a decoder decodes the fixed length vectors into reconstructed molecular data; calculating a reconstruction rate by reconstructing the molecular data; and saving the reconstructed molecular data.
See fig. 6.
In the present invention, the step 4 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; predicting the property of the training molecule by the multi-task capsule classifier, and reserving the target property molecule; performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution; the decoder decodes the new vector distribution into new generated molecular data; manually debugging the hyper-parameters of the data enhancement process for multiple times, and keeping the best generation result; when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
The encoder directly encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network, as follows:
1) The forward recurrent network reads the input sequence from $x_1$ to $x_{T_x}$ and computes the forward hidden states $\overrightarrow{h}_t$; the backward recurrent network reads the input sequence from $x_{T_x}$ to $x_1$ and computes the backward hidden states $\overleftarrow{h}_t$:
$$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$$
where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h}_{T_x}$ and $\overleftarrow{h}_{T_x}$ denote the forward and backward hidden states at time $T_x$, and $f$ is a nonlinear function;
2) The hidden state $h_t$ at time $t$ is obtained from the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:
$$h_t = \left[\overrightarrow{h}_t;\ \overleftarrow{h}_t\right]$$
3) The hidden-layer vector is generated from the sequence of hidden states:
$$c = q(\{h_1, \dots, h_{T_x}\})$$
where $c$ denotes the vector generated from the hidden-state sequence and $q$ is a nonlinear function.
The hidden-layer vector is mapped to the double-layer capsule layers, the number of routing iterations is optimized and tuned, the multitask capsule classifier predicts the properties of the training molecules, and the target-property molecules are retained, specifically:
1) Matrix transformation: the prediction vectors are computed from the hidden-layer vector mapping:
$$\hat{u}_{j|i} = W_{ij}\, u_i$$
where $i$ indexes the first capsule layer, $j$ indexes the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ is a weight matrix learned by backpropagation, and $u_i$ is the output of capsule $i$;
2) The total input vector $s_j$ of capsule $j$ is the weighted sum of all prediction vectors:
$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i}$$
where $s_j$ denotes the total input vector and $c_{ij}$ the coupling coefficients;
3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$
where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; it is initialized to 0 before routing, and $c_{ij}$ is updated over the routing iterations;
4) The vector output of capsule $j$ is computed by a nonlinear squashing function:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|}$$
where $v_j$ denotes the vector output of capsule $j$;
5) The loss $L_k$ of each capsule in the capsule layer is:
$$L_k = T_k \max(0,\, m^+ - \|v_k\|)^2 + \lambda\,(1 - T_k)\max(0,\, \|v_k\| - m^-)^2$$
where $L_k$ denotes the loss of each capsule, $T_k$ is an indicator function, $m^+$ and $m^-$ are the upper and lower margins, and $\lambda$ is a scaling factor weighting the two terms.
The decoder decodes the hidden-layer vector using a long short-term memory network, specifically:
1) From the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$, the predicted character at time $t$ is generated:
$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c), \qquad s_t = f'(s_{t-1}, y_{t-1}, c)$$
where $y_t$ denotes the predicted character at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:
$$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$$
where $p(Y)$ denotes the probability of the predicted sequence $Y$.
Example:
Generating molecules which simultaneously satisfy the properties of molecular weight, lipid-water partition coefficient, hydrogen bond donor, hydrogen bond acceptor, number of rotatable bonds, polar surface area, synthesizability and the like. The implementation process is as follows:
the first step is as follows: drug molecules (1757517 compounds) were collected from the source database of ChEMBL provenance (https:// www.ebi.ac.uk/ChEMBL /), and indicated with SMILES.
The second step is as follows: the molecular weight, lipid-water partition coefficient, hydrogen-bond donors, hydrogen-bond acceptors, number of rotatable bonds, polar surface area and synthesizability of the drug molecules are calculated with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs; the selected target properties are molecular weight ≤ 500, lipid-water partition coefficient between 0 and 5, hydrogen-bond donors ≤ 5, hydrogen-bond acceptors ≤ 10, number of rotatable bonds ≤ 20, polar surface area ≤ 200 and synthesizability ≤ 6, labeled 1; the training data contain both the drug-molecule SMILES and the property labels, and are saved in SMI format.
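The thresholding of the second step can be sketched as follows. In practice the descriptor values would come from RDKit, PaDEL-Descriptor or Discovery Studio; here precomputed, aspirin-like illustrative values are used so the labeling logic stands alone, and the 0-to-5 range for the partition coefficient follows the thresholds listed above:

```python
# Threshold checks for each target property; quantitative descriptor values
# are mapped to the 0/1 labels used as training targets.
THRESHOLDS = {
    "mw":   lambda v: v <= 500,      # molecular weight
    "logp": lambda v: 0 <= v <= 5,   # lipid-water partition coefficient
    "hbd":  lambda v: v <= 5,        # hydrogen-bond donors
    "hba":  lambda v: v <= 10,       # hydrogen-bond acceptors
    "rotb": lambda v: v <= 20,       # rotatable bonds
    "tpsa": lambda v: v <= 200,      # polar surface area
    "sa":   lambda v: v <= 6,        # synthesizability score
}

def property_labels(descriptors):
    """Convert a dict of quantitative descriptors to 0/1 property labels."""
    return {name: int(check(descriptors[name]))
            for name, check in THRESHOLDS.items()}

# Illustrative descriptor values roughly matching aspirin (assumed, not
# computed here).
aspirin = {"mw": 180.2, "logp": 1.2, "hbd": 1, "hba": 4,
           "rotb": 3, "tpsa": 63.6, "sa": 1.5}
labels = property_labels(aspirin)
```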
The third step: and establishing a brand-new molecular generation model based on the multitask capsule self-encoder by using the training data. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. In the training stage, the learning rate, the number of neurons and the training step number of the model are adjusted manually for multiple times, and the optimal model in the multiple training process is reserved as a training model. The present example is debugged from the following aspects:
batch size candidate range for training phase: 128 256, 512 and 1028; network iteration number candidate range: from 100 to 1000, each change is increased by 100;
the encoder uses a bidirectional long-short term memory network to directly encode the drug molecules SMILES into fixed-length vectors. Encoder per layer neuron number candidate range: 128 192 and 256; the number of encoder neuron layers is set to be 1;
the multi-task capsule classifier is composed of double-layer capsule layers, and classification of various target properties is achieved by optimizing and adjusting Routing iteration times. Per capsule layer neuron candidate range: 128 192 and 256; candidate range of number of times of route iteration of capsule part: 1,2,3,4 and 5; the loss weight of the capsule classifier is set to 10; the optimizer selects AdamaOptimizer; learning rate candidate range of capsule classifier: from 0.001 to 0.01, with each increase varying by 0.001;
the decoder decodes the hidden layer vector using a long-short term memory based network. Decoder per layer neuron candidate range: 256 384 and 512; the number of encoder neuron layers is set to 1.
The fourth step: in the reconstruction stage, the molecules are reconstructed through the training model, and reconstructed molecule files are stored.
Batch size candidate range for the reconstruction phase: 500, 1000, 1500 and 2000; the number of batches is set to 10.
The fifth step: in the generation stage, molecule generation is carried out through a training model, a compound which simultaneously meets the requirements of molecular weight, a lipid-water distribution coefficient, a hydrogen bond donor, a hydrogen bond acceptor, the number of rotatable bonds, polar surface area and synthesizability is generated, and a generated molecule file is stored.
Batch size candidate range for the generation phase: 500, 1000, 1500 and 2000; the number of batches is set to 10; the standard deviation of the normal distribution used in the data enhancement process is set to 0.2.
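The data enhancement used in the generation phase, Gaussian perturbation of the retained latent vectors with standard deviation 0.2, can be sketched as follows; the batch and hidden-vector sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Latent vectors of molecules the classifier kept as having the target
# properties (illustrative shape: 1000 molecules, 256-dim hidden vectors).
z = rng.normal(size=(1000, 256))

# Data enhancement: add zero-mean Gaussian noise with standard deviation 0.2
# (the value used in this example) to obtain a new vector distribution for
# the decoder to decode into brand-new molecules.
std = 0.2
z_new = z + rng.normal(scale=std, size=z.shape)
```

Each perturbed vector stays close to a known target-property molecule in latent space, which is why the decoded molecules tend to retain the target properties while differing in structure.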
Methods for designing small-molecule drugs play a key role in active-drug discovery. Traditional drug design methods such as virtual screening and pharmacophore modeling mainly search libraries of known virtual compounds. Owing to the enormous number of drug molecules in chemical space and the limits of current computing performance, searching the entire chemical space is impractical, and analyzing and processing the search results requires a great deal of expertise. De novo molecular design based on deep neural networks, as a novel artificial intelligence technology, can be used to generate molecules with desired properties, with the advantage that new molecules with optimized properties can be generated without enumerating virtual compound libraries.
However, the existing molecular generation method only considers the generation of a molecule with one target property, and the method has low efficiency, cannot limit various properties of the molecule, and is difficult to meet the requirement of new drug molecular design. The multitask capsule self-encoder provided by the invention can effectively acquire the relationship which is difficult to quantify between the molecular structure of the medicine and the properties of various medicines, learn the related information among the data of various properties of the medicine molecules, realize the molecules which simultaneously meet the physical, chemical and biological properties, and improve the effectiveness and novelty of the generated molecules.

Claims (9)

1. A brand new molecule generation method based on a multitask capsule self-encoder neural network, characterized by comprising: representing drug molecules as SMILES (simplified molecular-input line-entry system) strings, annotating target-property labels, and establishing, on a self-encoder framework, a brand-new molecule generation model comprising an encoder, a multitask capsule classifier and a decoder; the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network; the multitask capsule classifier adopts double-layer capsule layers optimized with a margin loss to encode and predict the property labels of the drug molecules; the decoder decodes the hidden-layer vector using a long short-term memory network to realize reconstruction of the input at the output;
the method comprises the following steps:
step 1: collecting training data, building a one-hot encoding table of the molecules, and computing the property labels;
step 2: learning the features of known drug molecules in a training phase to obtain a trained model;
step 3: reconstructing molecules with the trained model in a reconstruction phase;
step 4: generating molecules with specified properties with the trained model in a generation phase.
2. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 1, wherein step 1 specifically comprises:
collecting drug molecules and building a dedicated data set;
representing the drug molecules as SMILES (Simplified Molecular Input Line Entry System) strings;
computing or collecting target-property data for the drug molecules; if the data are quantitative, choosing a reasonable threshold to convert them into a qualitative label, i.e. target property = 1, non-target property = 0; all molecular descriptors are computed with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contain both the drug molecule SMILES strings and the specific property labels.
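As a concrete illustration of step 1, the following sketch builds a character-level one-hot table for SMILES strings and binarizes a quantitative property into the 0/1 labels described in claim 2. All names, molecules, property values and the threshold are hypothetical, not taken from the patent:

```python
def build_vocab(smiles_list):
    """Collect the character set used by the dataset, plus a padding token."""
    chars = sorted(set("".join(smiles_list)))
    return {ch: i for i, ch in enumerate([" "] + chars)}  # " " = padding

def one_hot(smiles, vocab, max_len):
    """Encode one SMILES string as a (max_len, vocab_size) 0/1 matrix."""
    mat = [[0] * len(vocab) for _ in range(max_len)]
    for pos in range(max_len):
        ch = smiles[pos] if pos < len(smiles) else " "  # pad with spaces
        mat[pos][vocab[ch]] = 1
    return mat

def binarize(value, threshold):
    """Qualitative label as in claim 2: target property = 1, otherwise 0."""
    return 1 if value >= threshold else 0

smiles_data = ["CCO", "c1ccccc1", "CC(=O)O"]   # ethanol, benzene, acetic acid
logp_values = [-0.14, 1.69, 0.09]              # illustrative values only
vocab = build_vocab(smiles_data)
X = [one_hot(s, vocab, max_len=10) for s in smiles_data]
y = [binarize(v, threshold=1.0) for v in logp_values]
print(y)  # [0, 1, 0]
```

In practice the descriptor values would come from PaDEL-Descriptor, RDKit or Discovery Studio as the claim states; the threshold choice is the modeler's judgment call.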
3. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 1, wherein step 2 specifically comprises:
inputting the training data into the multitask capsule autoencoder neural network to be trained;
manually tuning hyperparameters such as the learning rate, the number of neurons and the number of training steps over multiple runs, and keeping the trained model with the smallest cross-entropy loss;
retaining the best model across the training runs as the pre-trained model.
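The training protocol of claim 3 — sweep hyperparameters manually, train each candidate, keep the lowest-loss model — can be sketched with a toy model standing in for the full capsule autoencoder. The scalar least-squares fit, data, and hyperparameter grid below are purely illustrative:

```python
def train(data, lr, steps):
    """Gradient descent on mean squared error for a 1-parameter model y = w*x."""
    w, loss = 0.0, float("inf")
    for _ in range(steps):
        grad = sum(2 * (w * x - t) * x for x, t in data) / len(data)
        w -= lr * grad
        loss = sum((w * x - t) ** 2 for x, t in data) / len(data)
    return w, loss

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]    # toy (x, target) pairs, roughly t = 2x
best = None
for lr in (0.001, 0.01, 0.05):                 # manual hyperparameter sweep
    for steps in (10, 100):
        w, loss = train(data, lr, steps)
        if best is None or loss < best[0]:     # keep the minimum-loss model
            best = (loss, lr, steps, w)
print(best[0] < 0.01)  # True: the best candidate fits w close to 2
```

The same keep-the-best bookkeeping applies unchanged when the inner `train` is a full neural-network training run with a cross-entropy loss.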
4. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 1, wherein step 3 specifically comprises:
running the trained model; the encoder encodes the training data into fixed-length vectors in batches;
the decoder decodes the fixed-length vectors into reconstructed molecular data;
computing the reconstruction rate from the reconstructed molecular data;
saving the reconstructed molecular data.
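The reconstruction rate in step 3 is naturally read as the fraction of molecules whose decoded SMILES exactly matches the input; a minimal sketch under that assumption, with made-up molecules:

```python
def reconstruction_rate(inputs, reconstructions):
    """Fraction of molecules reconstructed exactly (string-identical SMILES)."""
    matches = sum(1 for a, b in zip(inputs, reconstructions) if a == b)
    return matches / len(inputs)

inputs          = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
reconstructions = ["CCO", "c1ccccc1", "CC(=O)C", "CCN"]  # one decoding error
print(reconstruction_rate(inputs, reconstructions))  # 0.75
```

A stricter variant could canonicalize both SMILES strings (e.g. with RDKit) before comparing, so that chemically identical but textually different outputs still count as matches.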
5. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 1, wherein step 4 specifically comprises:
running the trained model; the encoder encodes the training data into fixed-length vectors in batches;
the multitask capsule classifier encodes and predicts the properties of the training molecules;
applying data enhancement to the vector representations of the target-property molecules to obtain a new vector distribution;
the decoder decodes the new vector distribution into molecular data;
manually tuning the hyperparameters of the data-enhancement process over multiple runs, and keeping the best generation result;
when the number of generated molecules reaches a predetermined number, saving the generated molecular data.
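Claim 5's "data enhancement" of latent vectors can be read as sampling new points around the encodings of target-property molecules. The sketch below assumes Gaussian perturbation with a hypothetical noise scale `sigma`, which would be tuned over multiple runs as the claim describes; the patent does not specify the enhancement technique:

```python
import random

def enhance(latent_vectors, n_samples, sigma, seed=0):
    """Sample new latent points around known target-property encodings."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        base = rng.choice(latent_vectors)                      # pick a seed molecule
        out.append([z + rng.gauss(0.0, sigma) for z in base])  # Gaussian jitter
    return out

latents = [[0.1, -0.4, 0.7], [0.2, -0.3, 0.9]]   # toy fixed-length encodings
samples = enhance(latents, n_samples=5, sigma=0.05)
print(len(samples), len(samples[0]))  # 5 3
```

Each perturbed vector would then be fed to the decoder to produce a candidate SMILES string, repeated until the predetermined number of molecules is reached.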
6. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 5, wherein the multitask capsule autoencoder neural network comprises an encoder, a multitask capsule classifier and a decoder; the training data serve as the input of the encoder, the output of the encoder serves as the input of the multitask capsule classifier, and the output of the multitask capsule classifier serves as the input of the decoder.
7. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 6, wherein:
the encoder encodes the drug molecule SMILES into a fixed-length vector using a bidirectional long short-term memory network, in three parts:
1) the forward recurrent network reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states:

$$\overrightarrow{h_t} = f(x_t, \overrightarrow{h_{t-1}})$$

the backward recurrent network reads the input sequence in order from $x_{T_x}$ to $x_1$ and computes the backward hidden states:

$$\overleftarrow{h_t} = f(x_t, \overleftarrow{h_{t+1}})$$

where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h_{T_x}}$ the forward hidden state at time $T_x$, $\overleftarrow{h_{T_x}}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) the hidden state $h_t$ is obtained by concatenating the forward hidden state $\overrightarrow{h_t}$ and the backward hidden state $\overleftarrow{h_t}$:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $h_t$ denotes the hidden state at time $t$;
3) the hidden-layer vector is generated from the sequence of hidden states:

$$c = q(\{h_1, \cdots, h_{T_x}\})$$

where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
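A numeric sketch of the bidirectional recurrence above, with scalar toy states and tanh standing in for the nonlinearity f (a real encoder uses vector-valued LSTM cells; the weights and inputs here are illustrative):

```python
import math

def step(x, h_prev, w_x=0.5, w_h=0.8):
    """One recurrent step h = f(x_t, h_prev), with f = tanh as a toy cell."""
    return math.tanh(w_x * x + w_h * h_prev)

x = [1.0, -0.5, 0.25]                 # toy input sequence x_1 .. x_Tx

h_fwd, h = [], 0.0
for xt in x:                          # forward pass: read x_1 -> x_Tx
    h = step(xt, h)
    h_fwd.append(h)

h_bwd, h = [], 0.0
for xt in reversed(x):                # backward pass: read x_Tx -> x_1
    h = step(xt, h)
    h_bwd.insert(0, h)

hidden = [[fw, bw] for fw, bw in zip(h_fwd, h_bwd)]  # h_t = [h_fwd; h_bwd]
c = hidden[-1]                        # one simple choice of q: the last state
print(len(hidden), len(hidden[0]))    # 3 2
```

Note that each concatenated state $h_t$ sees context from both directions of the SMILES string, which is why the bidirectional encoder is preferred over a one-directional one here.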
8. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 6, wherein:
the multitask capsule classifier uses a double capsule layer, optimizes a margin loss, and predicts the property labels of the drug molecules;
the hidden-layer vector is mapped to the double capsule layers, and the number of routing iterations is tuned, specifically comprising:
1) matrix transformation: prediction vectors are computed from the hidden-layer mapping:

$$\hat{u}_{j|i} = W_{ij} u_i$$

where $i$ indexes the first capsule layer, $j$ indexes the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ the weight matrix learned by back-propagation, and $u_i$ the output of capsule $i$;
2) the total input vector $s_j$ of capsule $j$ is computed as the weighted sum of all prediction vectors:

$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i}$$

where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) the coupling coefficients $c_{ij}$ are computed with a softmax activation function:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

where $b_{ij}$ is the log probability of the connection strength between capsule $i$ and capsule $j$; $b_{ij}$ is initialized to 0 before routing, and the $c_{ij}$ values are updated over the routing iterations;
4) the vector output of capsule $j$ is computed with the nonlinear squash function:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|}$$

where $v_j$ denotes the vector output of capsule $j$;
5) the loss $L_k$ of each capsule in the capsule layer is computed as:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$$

where $L_k$ denotes the loss of each capsule, $T_k$ an indicator function, and $m^+$ and $m^-$ the upper and lower margins, respectively.
9. The de novo molecule generation method based on the multitask capsule autoencoder neural network according to claim 6, wherein:
the decoder decodes the hidden-layer vector using a long short-term memory network, specifically comprising:
1) the predicted character at time $t$ is generated from the hidden-layer vector $c$ and the characters $y_1, \cdots, y_{t-1}$ predicted before time $t$:

$$p(y_t \mid \{y_1, \cdots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$
$$s_t = f'(s_{t-1}, y_{t-1}, c)$$

where $y_t$ denotes the character predicted at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) the probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:

$$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \cdots, y_{t-1}\}, c)$$

where $p(Y)$ denotes the probability of the predicted sequence $Y$.
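The factorization of p(Y) can be illustrated with made-up per-character probabilities. A real decoder conditions each probability on the preceding characters and on c; the context-free table below is a deliberate simplification:

```python
per_char_probs = {"C": 0.8, "c": 0.05, "O": 0.6}   # hypothetical decoder outputs

def sequence_probability(sequence, probs):
    """p(Y) as the product of per-character probabilities p(y_t | y_<t, c)."""
    p = 1.0
    for ch in sequence:
        p *= probs[ch]
    return p

print(round(sequence_probability("CCO", per_char_probs), 3))  # 0.384
```

Because the product of many probabilities underflows quickly for long SMILES strings, practical implementations sum log-probabilities instead.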
CN202011247808.1A 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network Active CN112270951B (en)


Publications (2)

Publication Number Publication Date
CN112270951A CN112270951A (en) 2021-01-26
CN112270951B true CN112270951B (en) 2022-11-01




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant