CN112270951A - Brand-new molecule generation method based on multitask capsule self-encoder neural network - Google Patents

Brand-new molecule generation method based on multitask capsule self-encoder neural network

Info

Publication number
CN112270951A
Authority
CN
China
Prior art keywords
capsule
multitask
molecules
encoder
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011247808.1A
Other languages
Chinese (zh)
Other versions
CN112270951B (en)
Inventor
邹俊
杨胜勇
李侃
杨欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011247808.1A priority Critical patent/CN112270951B/en
Publication of CN112270951A publication Critical patent/CN112270951A/en
Application granted granted Critical
Publication of CN112270951B publication Critical patent/CN112270951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Medicinal Preparation (AREA)

Abstract

The invention discloses a brand-new molecule generation method based on a multitask capsule self-encoder (autoencoder) neural network. The method represents drug molecules as SMILES (simplified molecular-input line-entry system) strings and labels them with target-property tags; a training phase learns the characteristics of known drug molecules to obtain a trained model; a reconstruction phase uses the trained model to reconstruct molecules; and a generation phase uses the trained model to generate molecules that simultaneously possess multiple preset target properties, a large fraction of which are novel molecules with novel scaffolds. The invention can be used to generate various kinds of molecules, such as drugs or other compounds: after a single training run on known drugs, it can generate molecules that simultaneously meet the required physical, chemical and biological properties. Molecules generated by the method show higher validity and better properties.

Description

Brand-new molecule generation method based on multitask capsule self-encoder neural network
Technical Field
The invention relates to the interdisciplinary field of computer artificial intelligence and de novo molecular design, and in particular to a brand-new molecule generation method using a multitask capsule self-encoder neural network: a method that performs de novo molecular design based on an autoencoder framework combined with a multitask capsule classifier, suitable for generating molecules that simultaneously satisfy multiple physical, chemical and biological properties.
Background
The design of small-molecule drugs plays a key role in the development of active drugs. Traditional drug design methods, such as virtual screening and pharmacophore models, mainly search libraries of known virtual compounds. Because of the enormous number of potentially synthesizable molecules in chemical space (estimated at 10^23 to 10^60) and the limits of current computing performance, an exhaustive search of the whole chemical space is infeasible, and analysing the search results requires extensive professional experience. As a data-driven computational approach, artificial intelligence can automatically learn chemical structures, structure-activity relationships and related knowledge from a data set of drug molecules, helping scientists design molecules with target properties and bringing new hope to drug discovery and development. De novo molecular design based on deep neural networks, a recent artificial intelligence technique, can generate molecules with desired properties; its advantage is that new molecules with optimized properties can be generated without enumerating virtual compound libraries. However, existing molecule generation methods consider only a single target property: they struggle to learn characteristics beyond that property, cannot optimize several properties of a molecule at once, and therefore fall short of the requirements of new-drug design. A key difficulty in molecule generation lies in the choice of classifier: a conventional support vector machine cannot be trained jointly with a deep neural network, and a convolutional neural network classifies poorly here, making both hard to apply to classifying and generating molecules with multiple target properties.
Disclosure of Invention
The invention aims to provide a method for generating molecules that simultaneously satisfy multiple target properties, such as molecular weight, lipid-water partition coefficient, hydrogen-bond donor count, hydrogen-bond acceptor count, number of rotatable bonds, polar surface area, synthesizability, and activity against a specific target.
The invention provides a new model that takes an autoencoder as its basic framework and integrates a multitask capsule classifier in the hidden layer. With this method, drug molecules with multiple optimized target properties can be generated effectively.
The technical scheme of the method is as follows:
a brand new molecule generation method based on a multitask capsule self-encoder neural network is characterized by comprising the following steps: the drug molecules are denoted as SMILES (simplified molecule linear input specification), labeling the target property labels of the molecules. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. The encoder encodes drug molecules SMILES into vectors with fixed length by using a bidirectional long-short term memory network; the multitask capsule classifier utilizes double-layer capsule layer codes to represent vectors of the drug molecular property labels; the decoder directly decodes the hidden layer vector based on the long-term and short-term memory network to generate SMILES of molecules, and the reconstruction of the molecules or the generation of brand new molecules are realized.
The method comprises the following steps:
step 1: collecting training data, building a one-hot encoding table for the molecular SMILES characters, and computing property labels;
step 2: learning the characteristics of known drug molecules SMILES through a training stage to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, molecules with specific properties are generated by using a training model.
Further, the step 1 specifically includes:
collecting drug molecules and establishing a specific training data set;
using SMILES to represent drug molecules;
determining the target properties of the drug molecules by experiment or by calculation; if a property is quantitative, choosing a reasonable threshold to convert it into a qualitative label, namely 1 for the target property and 0 otherwise; molecular target properties are computed with the PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both drug molecules SMILES and specific property labels;
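The one-hot encoding table described in step 1 can be sketched as follows; the vocabulary construction and fixed-length padding shown here are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal sketch of step 1's one-hot encoding of SMILES strings.
# The vocabulary and max_len padding scheme are illustrative assumptions.

def build_vocab(smiles_list):
    """Collect the distinct characters appearing in the training SMILES."""
    chars = sorted(set("".join(smiles_list)))
    return {ch: i for i, ch in enumerate(chars)}

def one_hot_encode(smiles, vocab, max_len):
    """Encode one SMILES string as a (max_len x vocab_size) 0/1 matrix."""
    size = len(vocab)
    matrix = [[0] * size for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][vocab[ch]] = 1
    return matrix

vocab = build_vocab(["CCO", "c1ccccc1O"])   # ethanol, phenol
encoded = one_hot_encode("CCO", vocab, max_len=10)
```

Rows beyond the string's length stay all-zero, so every molecule maps to a matrix of the same shape regardless of its SMILES length.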
further, the step 2 specifically includes:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually tuning the model hyper-parameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the trained model with the smallest cross-entropy loss;
and keeping the best model in multiple training processes as a training model.
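The model-selection rule of step 2 (keep the run with the smallest cross-entropy loss across hyper-parameter settings) can be sketched as follows; `fake_loss` is a hypothetical stand-in for one full training run, and all candidate values are illustrative:

```python
import itertools

# Toy stand-in for one training run of the multitask capsule autoencoder;
# a real run would train the network and return its cross-entropy loss.
def fake_loss(lr, n_units, steps):
    return abs(lr - 0.005) + 1.0 / n_units + 10.0 / steps

# Try every hyper-parameter combination and keep the best run,
# mirroring "keep the training model with the minimum loss".
candidates = itertools.product([0.001, 0.01],   # learning rate
                               [128, 256],      # neurons per layer
                               [100, 200])      # training steps
best = min(
    ({"lr": lr, "units": u, "steps": s, "loss": fake_loss(lr, u, s)}
     for lr, u, s in candidates),
    key=lambda run: run["loss"],
)
```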
Further, the step 3 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
a decoder decodes the fixed length vectors into reconstructed molecular data;
calculating a reconstruction rate by reconstructing the molecular data;
and saving the reconstructed molecular data.
Further, the step 4 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
the multitask capsule classifier computes the property features of the molecules, which are used to select molecules carrying the target properties;
performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution;
the decoder decodes the new vector distribution into generated molecular data;
manually tuning the hyper-parameters of the data-enhancement process over multiple runs, and keeping the best generation result;
when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
Further, the multitask capsule self-encoder neural network comprises an encoder, a multitask capsule classifier and a decoder, wherein the training data serves as the input of the encoder, and the output of the encoder serves as the input of the multitask capsule classifier; the output of the multitask capsule classifier is used as the input of the decoder.
Further, the encoder uses a bidirectional long short-term memory network to encode the drug molecule's SMILES directly into a fixed-length vector, in 3 parts:
1) A forward recurrent neural network $\overrightarrow{f}$ reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x}$; a backward recurrent neural network $\overleftarrow{f}$ reads the input sequence in reverse order from $x_{T_x}$ to $x_1$ and computes the backward hidden states $\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x}$:
$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1})$
$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$
where $x_{T_x}$ denotes the character at time step $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) The hidden state $h_t$ is formed by concatenating the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:
$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
where $h_t$ denotes the hidden state at time $t$;
3) The hidden-layer vector is generated from the sequence of hidden states:
$c = q(\{h_1, \dots, h_{T_x}\})$
where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
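As a concrete illustration of the bidirectional encoding, the following pure-Python sketch runs a toy recurrence forwards and backwards over a sequence and concatenates the two hidden states at each step; `step_f` is a made-up stand-in for the LSTM cell, not the patent's network:

```python
# Illustrative sketch of the bidirectional encoder: a forward pass and a
# backward pass over the input, with each step's hidden state the
# concatenation of the two directions.

def step_f(x, h_prev):
    # toy nonlinear-free recurrence standing in for an LSTM cell f
    return [0.5 * x + 0.5 * h for h in h_prev]

def bidirectional_encode(xs, hidden_size=2):
    fwd, h = [], [0.0] * hidden_size
    for x in xs:                      # read x_1 ... x_Tx
        h = step_f(x, h)
        fwd.append(h)
    bwd, h = [], [0.0] * hidden_size
    for x in reversed(xs):            # read x_Tx ... x_1
        h = step_f(x, h)
        bwd.append(h)
    bwd.reverse()
    # h_t = [forward h_t ; backward h_t]
    return [f + b for f, b in zip(fwd, bwd)]

states = bidirectional_encode([1.0, 2.0, 3.0])
```

A real encoder would then reduce the state sequence to the single fixed-length vector c (for instance by taking the final states of both directions).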
Further, the multitask capsule classifier uses two capsule layers (Capsule Layers) and is trained by optimizing a margin loss to encode and predict the property labels of the drug molecules: the hidden-layer vector is mapped onto the two capsule layers, and the number of routing iterations is tuned. Specifically:
1) Matrix transformation: a prediction vector is computed from the hidden-layer mapping:
$\hat{u}_{j|i} = W_{ij} u_i$
where $i$ indexes the first capsule layer, $j$ the second capsule layer, $\hat{u}_{j|i}$ denotes the prediction vector, $W_{ij}$ a weight matrix learned by backpropagation, and $u_i$ the output of capsule $i$;
2) The total input vector $s_j$ of capsule $j$ is the weighted sum of all prediction vectors:
$s_j = \sum_i c_{ij} \hat{u}_{j|i}$
where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:
$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$
where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; before routing begins, $b_{ij}$ is initialized to 0, and the $c_{ij}$ values are updated over the routing iterations;
4) The vector output of capsule $j$ is computed with a nonlinear squash function:
$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$
where $v_j$ denotes the vector output of capsule $j$;
5) The loss $L_k$ of each capsule in the capsule layer is computed as:
$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$
where $L_k$ denotes the loss of capsule $k$, $T_k$ an indicator function (1 or 0 according to the property label), $m^+$ and $m^-$ the upper and lower margins, and $\lambda$ a scaling factor weighting the two terms.
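The routing quantities above can be illustrated numerically. The sketch below uses plain Python lists, two-dimensional toy vectors, and the commonly used capsule-network margin parameters m+ = 0.9, m- = 0.1, λ = 0.5; these values and the toy inputs are assumptions, not the patent's settings:

```python
import math

# Softmax over the routing logits b_ij gives the coupling coefficients c_ij.
def softmax(logits):
    exps = [math.exp(b) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Nonlinear squash: shrinks s_j so that ||v_j|| lies in [0, 1).
def squash(s):
    norm2 = sum(v * v for v in s)
    scale = norm2 / (1.0 + norm2) / math.sqrt(norm2)
    return [scale * v for v in s]

# Margin loss of one output capsule against its 0/1 property label.
def margin_loss(v, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    length = math.sqrt(sum(x * x for x in v))
    return (target * max(0.0, m_pos - length) ** 2
            + lam * (1 - target) * max(0.0, length - m_neg) ** 2)

# One routing step for capsule j, given prediction vectors u_hat from two
# lower-level capsules i, with logits b_ij initialized to zero.
u_hat = [[1.0, 0.0], [0.0, 1.0]]
c = softmax([0.0, 0.0])                       # c_ij = [0.5, 0.5]
s_j = [sum(c[i] * u_hat[i][d] for i in range(2)) for d in range(2)]
v_j = squash(s_j)
```

In full dynamic routing, the logits b_ij would then be updated from the agreement between v_j and each prediction vector, and the step repeated for the chosen number of iterations.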
Further, the decoder uses a long short-term memory network to decode the hidden-layer vector encoded by the multitask capsule classifier. Specifically:
1) From the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$, the character at time $t$ is predicted:
$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
$s_t = f'(s_{t-1}, y_{t-1}, c)$
where $y_t$ denotes the character predicted at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:
$p(Y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$
where $p(Y)$ denotes the probability of the predicted sequence $Y$.
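The sequence probability is simply the product of the per-character probabilities; a tiny numeric illustration (the probabilities below are made-up values, not model outputs):

```python
# p(Y) = product over t of p(y_t | y_<t, c): the decoder's probability for
# a whole SMILES string is the product of its per-character probabilities.

def sequence_probability(char_probs):
    prob = 1.0
    for p in char_probs:
        prob *= p
    return prob

# e.g. a 4-character SMILES prediction
p_y = sequence_probability([0.9, 0.8, 0.95, 0.7])
```

Because the product shrinks quickly with sequence length, practical implementations usually work with the sum of log-probabilities instead.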
Beneficial effects of the invention: the invention provides a brand-new model consisting of an encoder, a multitask capsule classifier and a decoder. Its innovation is to use the multitask capsule classifier to capture the hard-to-quantify relationship between a drug molecule's structure and its many drug properties, to learn the correlations among the data for those properties, to predict multiple properties of a generated molecule in advance, and, through the encoder and decoder, to generate molecules with multiple properties at once. Compared with previous machine-learning methods for molecule generation, the method has the following advantages:
first, the method of the present invention produces better results than the traditional machine learning method. Traditional generative models can only generate one target property, and multiple training is needed to generate molecules with multiple properties. The invention realizes the simultaneous classification and generation of various properties by applying the multitask capsule classifier, the molecules generated by the multitask capsule self-encoder can simultaneously meet various target properties such as molecular weight, lipid-water distribution coefficient, hydrogen bond donor, hydrogen bond acceptor, rotatable bond quantity, polar surface area, synthesizability, specific target activity and the like, and meanwhile, the generated molecules have new frameworks.
Second, the multitask capsule classifier used by the method performs better than a single-task capsule classifier and other conventional machine-learning methods: by exploiting the correlated information among different property data, it extracts molecular features effectively and improves prediction.
Drawings
FIG. 1 is a block flow diagram of the novel molecular generation method based on a multitask capsule self-encoder.
FIG. 2 is a model diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 3 shows the detailed steps of the novel molecule generation method based on the multitask capsule self-encoder.
FIG. 4 is a schematic diagram of the training of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 5 is a reconstruction diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 6 is a schematic diagram of the novel molecular generation method based on the multitask capsule self-encoder.
Detailed description of the invention
The figures show specific processes for achieving molecular generation of multiple target properties using the present invention.
The invention provides a brand new molecule generation method based on a multitask capsule self-encoder, which relates to the cross technical field of computer artificial intelligence and new drug molecule design.
Target properties of the molecules generated by the invention include: (1) molecular weight; (2) lipid-water partition coefficient; (3) hydrogen-bond donor count; (4) hydrogen-bond acceptor count; (5) number of rotatable bonds; (6) polar surface area; (7) synthesizability; (8) activity against targets such as PDGF, Renin and Bcl-2.
See fig. 1.
The method comprises the steps of constructing an effective drug molecule database, measuring or calculating drug molecule property labels, constructing a self-encoder frame, constructing a multi-task capsule classifier frame, designing and realizing a data enhancement module, executing a generation process and the like.
See fig. 2.
The method is a brand-new molecule generation approach based on a multitask capsule self-encoder. The model takes an autoencoder as its basic framework, with a multitask capsule classifier configured in the hidden layer. The encoder uses a bidirectional long short-term memory network to encode a drug molecule's SMILES directly into a fixed-length vector; the multitask capsule classifier uses two capsule layers to analyse and extract features from the vector and predict the drug molecule's property labels; the decoder uses a long short-term memory network to decode the hidden-layer vector, realizing output and molecule generation.
See fig. 3.
The method comprises the following specific operation steps:
step 1: collecting training data, building a one-hot encoding table for the molecular characters, and computing property labels;
step 2: learning the characteristics of known drug molecules through a training phase to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, target property molecules are generated using a training model.
In the present invention, the step 1 specifically comprises: collecting drug molecules in a database, and representing the drug molecules by adopting SMILES; testing or calculating physical, chemical or biological properties of the drug molecule; selecting a reasonable threshold value to convert quantitative data into qualitative category labels; the training data includes both drug molecules SMILES and property labels. The training data will be used in the training process of the model.
See fig. 4.
In the present invention, the step 2 specifically includes: training data is input into a multi-task capsule self-encoder to be trained, the hyper-parameters (learning rate, neuron number and training step number) of the model are manually adjusted for multiple times, the training model with the minimum cross entropy loss function value is reserved, and the optimal model in the multiple training process is reserved as the training model.
See fig. 5.
In the present invention, the step 3 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; a decoder decodes the fixed length vectors into reconstructed molecular data; calculating a reconstruction rate by reconstructing the molecular data; and saving the reconstructed molecular data.
See fig. 6.
In the present invention, the step 4 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; predicting the property of the training molecule by the multi-task capsule classifier, and reserving the target property molecule; performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution; the decoder decodes the new vector distribution into new generated molecular data; manually debugging the hyper-parameters of the data enhancement process for multiple times, and keeping the best generation result; when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
The encoder uses a bidirectional long short-term memory network to encode a drug molecule's SMILES directly into a fixed-length vector, as follows:
1) A forward recurrent neural network $\overrightarrow{f}$ reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x}$; a backward recurrent neural network $\overleftarrow{f}$ reads the input sequence in reverse order from $x_{T_x}$ to $x_1$ and computes the backward hidden states $\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x}$:
$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1})$
$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$
where $x_{T_x}$ denotes the character at time step $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) The hidden state $h_t$ is formed by concatenating the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:
$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
where $h_t$ denotes the hidden state at time $t$;
3) The hidden-layer vector is generated from the sequence of hidden states:
$c = q(\{h_1, \dots, h_{T_x}\})$
where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
The hidden-layer vector is mapped onto the two capsule layers and the number of routing iterations is tuned; the multitask capsule classifier predicts the properties of the training molecules, and the molecules with the target properties are retained. Specifically:
1) Matrix transformation: a prediction vector is computed from the hidden-layer mapping:
$\hat{u}_{j|i} = W_{ij} u_i$
where $i$ indexes the first capsule layer, $j$ the second capsule layer, $\hat{u}_{j|i}$ denotes the prediction vector, $W_{ij}$ a weight matrix learned by backpropagation, and $u_i$ the output of capsule $i$;
2) The total input vector $s_j$ of capsule $j$ is the weighted sum of all prediction vectors:
$s_j = \sum_i c_{ij} \hat{u}_{j|i}$
where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:
$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$
where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; before routing begins, $b_{ij}$ is initialized to 0, and the $c_{ij}$ values are updated over the routing iterations;
4) The vector output of capsule $j$ is computed with a nonlinear squash function:
$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$
where $v_j$ denotes the vector output of capsule $j$;
5) The loss $L_k$ of each capsule in the capsule layer is computed as:
$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$
where $L_k$ denotes the loss of capsule $k$, $T_k$ an indicator function, $m^+$ and $m^-$ the upper and lower margins respectively, and $\lambda$ a scaling factor.
The decoder uses a long short-term memory network to decode the hidden-layer vector. Specifically:
1) From the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$, the character at time $t$ is predicted:
$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
$s_t = f'(s_{t-1}, y_{t-1}, c)$
where $y_t$ denotes the character predicted at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:
$p(Y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$
where $p(Y)$ denotes the probability of the predicted sequence $Y$.
Example:
Generating molecules which simultaneously satisfy the properties of molecular weight, lipid-water partition coefficient, hydrogen bond donor, hydrogen bond acceptor, number of rotatable bonds, polar surface area, synthesizability and the like. The implementation process is as follows:
the first step is as follows: drug molecules (1757517 compounds) were collected from the ChEMBL open source database (https:// www.ebi.ac.uk/ChEMBL /), and indicated with SMILES.
The second step: the molecular weight, lipid-water partition coefficient, hydrogen-bond donor count, hydrogen-bond acceptor count, number of rotatable bonds, polar surface area and synthesizability of the drug molecules are computed with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs; molecules with molecular weight ≤ 500, lipid-water partition coefficient between 0 and 5, hydrogen-bond donors ≤ 5, hydrogen-bond acceptors ≤ 10, number of rotatable bonds ≤ 20, polar surface area ≤ 200 and synthesizability score ≤ 6 are selected as having the target properties and labelled 1; the training data contain both the drug molecules' SMILES and the property labels, and are saved in SMI format.
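The labelling rule of this step (label 1 only when every property falls inside its threshold) can be sketched as follows; the property dictionaries are illustrative values, not computed descriptors:

```python
# Sketch of the second step's labelling rule: a molecule is labelled 1 only
# if every computed property satisfies its threshold. In practice the
# property values would come from PaDEL-Descriptor, RDKit or Discovery
# Studio; the dictionaries here are made up for illustration.

def target_label(props):
    ok = (props["mw"] <= 500
          and 0 <= props["logp"] <= 5          # lipid-water partition coeff.
          and props["hbd"] <= 5                # hydrogen-bond donors
          and props["hba"] <= 10               # hydrogen-bond acceptors
          and props["rotatable"] <= 20
          and props["psa"] <= 200              # polar surface area
          and props["sa"] <= 6)                # synthesizability score
    return 1 if ok else 0

drug_like = {"mw": 320.4, "logp": 2.1, "hbd": 1, "hba": 4,
             "rotatable": 5, "psa": 78.0, "sa": 3.2}
too_big = dict(drug_like, mw=812.9)
```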
The third step: and establishing a brand-new molecular generation model based on the multitask capsule self-encoder by using the training data. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. In the training stage, the hyper-parameters (learning rate, neuron number and training step number) of the model are manually adjusted for multiple times, and the optimal model in the multiple training process is reserved as a training model. The present example is debugged from the following aspects:
batch size candidate range for training phase: 128, 256, 512, and 1028; network iteration number candidate range: from 100 to 1000, each change increases by 100;
the encoder uses a bidirectional long-short term memory network to directly encode the drug molecules SMILES into fixed-length vectors. Encoder per layer neuron number candidate range: 128, 192 and 256; the number of encoder neuron layers is set to be 1;
the multi-task capsule classifier is composed of double-layer capsule layers, and classification of various target properties is achieved by optimizing and adjusting Routing iteration times. Per capsule layer neuron candidate range: 128, 192 and 256; candidate range of number of times of route iteration of capsule part: 1,2,3,4 and 5; the loss weight of the capsule classifier is set to 10; the optimizer selects AdamaOptizer; learning rate candidate range of capsule classifier: from 0.001 to 0.01, with each increase varying by 0.001;
the decoder decodes the hidden layer vector using a long-short term memory based network. Decoder per layer neuron candidate range: 256, 384 and 512; the number of encoder neuron layers is set to 1.
The fourth step: in the reconstruction stage, the molecules are reconstructed through the training model, and reconstructed molecule files are stored.
Batch size candidate range for reconstruction phase: 500, 1000, 1500 and 2000; the number of batch processes was set to 10.
The fifth step: in the generation stage, molecule generation is carried out through a training model, a compound which simultaneously meets the requirements of molecular weight, a lipid-water distribution coefficient, a hydrogen bond donor, a hydrogen bond acceptor, the number of rotatable bonds, polar surface area and synthesizability is generated, and a generated molecule file is stored.
Batch size candidate range of generation phase: 500, 1000, 1500 and 2000; the batch processing times are set to 10; the standard deviation of the normal distribution during the data enhancement process was set to 0.2.
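The data-enhancement step above can be sketched as follows: each latent vector of a target-property molecule is perturbed with zero-mean Gaussian noise of standard deviation 0.2 to obtain new latent points for the decoder; the vector values and helper names are illustrative:

```python
import random

# Hedged sketch of the data-enhancement process: sample new latent points
# around a target-property molecule's fixed-length hidden vector by adding
# N(0, 0.2) noise, as the example sets the normal distribution's standard
# deviation to 0.2. A fixed seed keeps the sketch reproducible.

def enhance(latent, n_samples, std=0.2, seed=42):
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, std) for v in latent]
            for _ in range(n_samples)]

latent = [0.1, -0.4, 0.7, 0.2]          # toy fixed-length hidden vector
samples = enhance(latent, n_samples=3)
```

Each perturbed vector would then be passed to the decoder to produce a new SMILES string, which is kept only if it parses as a valid molecule.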
Methods for designing small-molecule drugs play a key role in active-drug discovery. Traditional drug design methods, such as virtual screening and pharmacophore models, mainly search libraries of known virtual compounds. Because of the enormous number of drug molecules in chemical space and the limits of current computing performance, searching the whole chemical space is impractical, and analysing the search results requires extensive professional expertise. De novo molecular design based on deep neural networks, a recent artificial intelligence technique, can generate molecules with desired properties; among its advantages, new molecules with optimized properties can be generated without enumerating virtual compound libraries.
However, existing molecule generation methods consider only a single target property at a time; they are inefficient, cannot constrain multiple molecular properties simultaneously, and hardly meet the needs of new drug design. The multitask capsule self-encoder proposed by the invention can effectively capture the hard-to-quantify relationships between a drug's molecular structure and its various properties, learn the correlations among the property data of drug molecules, generate molecules that simultaneously satisfy physical, chemical and biological property constraints, and improve the validity and novelty of the generated molecules.

Claims (9)

1. A brand-new molecule generation method based on a multitask capsule self-encoder neural network, characterized in that: drug molecules are expressed as SMILES (simplified molecular-input line-entry system) strings and labeled with their target properties, and a de novo molecule generation model comprising an encoder, a multitask capsule classifier and a decoder is built on an autoencoder framework; the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network; the multitask capsule classifier uses two capsule layers, optimizes the Margin Loss, and encodes and predicts the property labels of the drug molecules; the decoder decodes the hidden-layer vector using a long short-term memory network, reconstructing the input at the output;
the method comprises the following steps:
step 1: collecting training data, extracting a molecular one-hot encoding table, and calculating the property labels;
step 2: learning the features of known drug molecules in a training stage to obtain a trained model;
step 3: reconstructing the molecules with the trained model in a reconstruction stage;
step 4: generating molecules with specific properties with the trained model in a generation stage.
2. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein the step 1 specifically comprises:
collecting drug molecules and establishing a specific data set;
drug molecules are represented using SMILES (simplified molecular-input line-entry system);
calculating or collecting target-property data of the drug molecules; if the data are quantitative, choosing a reasonable threshold to convert them into a qualitative label, i.e. target property = 1 and non-target property = 0; all molecular descriptors are calculated with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both the drug molecule SMILES and a specific property label.
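As a rough illustration of the one-hot encoding table of step 1, the sketch below builds a character vocabulary from a toy SMILES list and encodes one string; the helper names and the right-padding scheme are assumptions, not the patent's implementation:

```python
import numpy as np

def build_onehot_table(smiles_list):
    """Map each character seen in a SMILES corpus to an index -- a minimal
    stand-in for the patent's one-hot encoding table."""
    return {c: i for i, c in enumerate(sorted(set("".join(smiles_list))))}

def onehot_encode(smiles, table, max_len):
    """Encode one SMILES string as a (max_len, vocab_size) one-hot matrix,
    zero-padded on the right (padding scheme is an assumption)."""
    mat = np.zeros((max_len, len(table)))
    for pos, ch in enumerate(smiles[:max_len]):
        mat[pos, table[ch]] = 1.0
    return mat

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]   # ethanol, benzene, acetic acid
table = build_onehot_table(smiles)
x = onehot_encode("CCO", table, max_len=10)
print(x.shape)
```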
3. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 2 specifically comprises:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually tuning the model hyper-parameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the trained model with the smallest cross-entropy loss;
and reserving the optimal model in the multiple training processes as a pre-training model.
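The keep-the-best-model loop of step 2 can be sketched as follows; `train_once` is a hypothetical stand-in that returns a pseudo-random loss instead of actually training the multitask capsule autoencoder:

```python
import random

def train_once(seed):
    """Hypothetical stand-in for one training run: a real run would train
    the multitask capsule autoencoder and report its cross-entropy loss."""
    random.seed(seed)
    return {"seed": seed}, random.uniform(0.1, 1.0)

best_model, best_loss = None, float("inf")
for seed in range(5):                       # several manual tuning rounds
    model, loss = train_once(seed)
    if loss < best_loss:                    # keep the minimum-loss model
        best_model, best_loss = model, loss
print(round(best_loss, 3))
```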
4. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 3 specifically comprises:
running the trained model, wherein the encoder encodes the training data in batches into fixed-length vectors;
decoding the fixed-length vectors into reconstructed molecular data by the decoder;
calculating the reconstruction rate from the reconstructed molecular data;
and saving the reconstructed molecular data.
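The patent does not spell out how the reconstruction rate is computed; a common convention, assumed here, is the fraction of molecules whose decoded SMILES exactly matches the input:

```python
def reconstruction_rate(originals, reconstructed):
    """Fraction of exact SMILES matches between inputs and decoder
    outputs (an assumed definition, not quoted from the patent)."""
    hits = sum(a == b for a, b in zip(originals, reconstructed))
    return hits / len(originals)

orig = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
recon = ["CCO", "c1ccccc1", "CC(=O)C", "CCN"]   # one decoding error
print(reconstruction_rate(orig, recon))          # 0.75
```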
5. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 4 specifically comprises:
running the trained model, wherein the encoder encodes the training data in batches into fixed-length vectors;
the multitask capsule classifier encodes and predicts the properties of the training molecules;
performing data enhancement on the vector representations of the target-property molecules to obtain a new vector distribution;
decoding the new vector distribution into molecular data by the decoder;
manually tuning the hyper-parameters of the data-enhancement process over multiple runs, and keeping the best generation results;
when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
6. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 5, wherein said multitask capsule self-encoder neural network comprises an encoder, a multitask capsule classifier and a decoder, said training data is used as input of said encoder, and output of said encoder is used as input of said multitask capsule classifier; the output of the multitask capsule classifier is used as the input of the decoder.
7. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 6, wherein:
the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network, in three steps:
1) the forward recurrent neural network $\overrightarrow{f}$ reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states

$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \quad t = 1, \dots, T_x$

the backward recurrent neural network $\overleftarrow{f}$ reads the input sequence in reverse from $x_{T_x}$ to $x_1$ and computes the backward hidden states

$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1}), \quad t = T_x, \dots, 1$

where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) the hidden state $h_t$ is obtained from the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$ by concatenation:

$h_t = \left[\overrightarrow{h}_t^{\top}; \overleftarrow{h}_t^{\top}\right]^{\top}$

where $h_t$ denotes the hidden state at time $t$;
3) the hidden-layer vector is generated from the sequence of hidden states:

$c = q(\{h_1, \dots, h_{T_x}\})$

where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
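The bidirectional encoding of claim 7 can be sketched with a generic tanh recurrence standing in for the LSTM cell $f$; the dimensions and random weights below are illustrative only:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    """One generic recurrent step h_t = f(x_t, h_prev), with tanh playing
    the role of the nonlinear function f (an LSTM cell in the patent)."""
    return np.tanh(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
xs = rng.normal(size=(T, d_in))
W, U = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

fwd, h = [], np.zeros(d_h)
for t in range(T):                      # read x_1 .. x_T forward
    h = rnn_step(xs[t], h, W, U)
    fwd.append(h)

bwd, h = [None] * T, np.zeros(d_h)
for t in reversed(range(T)):            # read x_T .. x_1 backward
    h = rnn_step(xs[t], h, W, U)
    bwd[t] = h

# h_t = [forward; backward] concatenation at each time step
hs = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(hs[0].shape)                      # (8,) = 2 * d_h
```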
8. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 6, wherein:
the multitask capsule classifier uses two capsule layers, optimizes the Margin Loss, and predicts the property labels of the drug molecules;
the hidden-layer vector is mapped onto the two capsule layers, and the number of routing iterations is tuned; specifically:
1) matrix transformation: the prediction vectors are computed from the hidden-layer vector mapping:

$\hat{u}_{j|i} = W_{ij} u_i$

where $i$ indexes the first capsule layer, $j$ indexes the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ the weight matrix learned by back-propagation, and $u_i$ the output of capsule $i$;
2) the total input vector $s_j$ of capsule $j$ is computed as the weighted sum of all prediction vectors:

$s_j = \sum_i c_{ij} \hat{u}_{j|i}$

where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) the coupling coefficients $c_{ij}$ are computed by a softmax activation function:

$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$

where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; before routing, $b_{ij}$ is initialized to 0, and the values of $c_{ij}$ are updated over the routing iterations;
4) the vector output of capsule $j$ is computed by a nonlinear squashing function:

$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$

where $v_j$ denotes the vector output of capsule $j$;
5) the loss $L_k$ of each capsule in the capsule layer is computed as:

$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$

where $L_k$ denotes the loss of each capsule, $T_k$ an indicator function, and $m^+$ and $m^-$ the upper and lower margins, respectively.
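The routing and loss computations of claim 8 can be sketched in NumPy; the array shapes and the margin parameters (m+ = 0.9, m- = 0.1, lambda = 0.5) follow the capsule-network literature, not values given in the patent:

```python
import numpy as np

def squash(s):
    """Nonlinear squashing: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    n2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def route(u_hat, iters=3):
    """Dynamic routing-by-agreement over prediction vectors u_hat of
    shape (n_in, n_out, dim); returns the output capsule vectors v."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                 # routing logits start at 0
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = (c[..., None] * u_hat).sum(axis=0)  # s_j = sum_i c_ij * u_hat_j|i
        v = squash(s)
        b = b + (u_hat * v).sum(axis=-1)        # agreement updates the logits
    return v

def margin_loss(v, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_k = T_k max(0, m+ - |v_k|)^2 + lam (1 - T_k) max(0, |v_k| - m-)^2."""
    norms = np.linalg.norm(v, axis=-1)
    return (T * np.maximum(0, m_pos - norms) ** 2
            + lam * (1 - T) * np.maximum(0, norms - m_neg) ** 2)

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 2, 4))   # 8 input capsules, 2 output capsules
v = route(u_hat)
loss = margin_loss(v, np.array([1.0, 0.0]))
print(loss.shape)                    # one loss per output capsule
```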
9. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 6, wherein:
the decoder decodes the hidden-layer vector using a long short-term memory network; specifically:
1) the predicted character at time $t$ is generated from the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$:

$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$

$s_t = f'(s_{t-1}, y_{t-1}, c)$

where $y_t$ denotes the predicted character at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) the probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:

$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$

where $p(Y)$ denotes the probability of the predicted sequence $Y$.
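The product over per-character probabilities in claim 9 underflows quickly for long SMILES strings, so it is usually accumulated in log space; the probabilities below are made up for illustration:

```python
import math

def sequence_log_prob(step_probs):
    """log p(Y) = sum_t log p(y_t | y_<t, c); summing logs avoids the
    numerical underflow of the raw product."""
    return sum(math.log(p) for p in step_probs)

# Made-up per-character decoder probabilities for a 3-character string
probs = [0.9, 0.8, 0.7]
p_Y = math.exp(sequence_log_prob(probs))
print(round(p_Y, 3))   # 0.504 = 0.9 * 0.8 * 0.7
```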
CN202011247808.1A 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network Active CN112270951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011247808.1A CN112270951B (en) 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network


Publications (2)

Publication Number Publication Date
CN112270951A true CN112270951A (en) 2021-01-26
CN112270951B CN112270951B (en) 2022-11-01

Family

ID=74339427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011247808.1A Active CN112270951B (en) 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network

Country Status (1)

Country Link
CN (1) CN112270951B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562869A (en) * 2021-02-24 2021-03-26 北京中医药大学东直门医院 Drug combination safety evaluation system, method and device
CN113223637A (en) * 2021-05-07 2021-08-06 中国科学院自动化研究所 Drug molecule generator training method based on domain knowledge and deep reinforcement learning
CN113470740A (en) * 2021-06-30 2021-10-01 中国石油大学(华东) Medicine recommendation system, computer equipment and storage medium based on fully-connected network integrated deep learning model
CN113488119A (en) * 2021-06-18 2021-10-08 重庆医科大学 Medicine small molecule numerical value feature structured database and establishing method thereof
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114446414A (en) * 2022-01-24 2022-05-06 电子科技大学 Reverse synthetic analysis method based on quantum circulating neural network
CN114496112A (en) * 2022-01-21 2022-05-13 内蒙古工业大学 Multi-objective optimization-based breast cancer resistant drug component intelligent quantification method
CN114913938A (en) * 2022-05-27 2022-08-16 中南大学 Small molecule generation method, equipment and medium based on pharmacophore model
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN117334271A (en) * 2023-09-25 2024-01-02 江苏运动健康研究院 Method for generating molecules based on specified attributes
WO2024009110A1 (en) * 2022-07-08 2024-01-11 Topia Life Sciences Limited An automated system for generating novel molecules

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106146670A (en) * 2015-04-24 2016-11-23 宜明昂科生物医药技术(上海)有限公司 A novel recombinant bifunctional fusion protein, and preparation and application thereof
US20170161635A1 (en) * 2015-12-02 2017-06-08 Preferred Networks, Inc. Generative machine learning systems for drug design
CN108073780A (en) * 2016-11-14 2018-05-25 王�忠 A method for comparing the clinical efficacy of traditional Chinese medicine compound prescriptions
CN109979541A (en) * 2019-03-20 2019-07-05 四川大学 Medicament molecule pharmacokinetic property and toxicity prediction method based on capsule network
US20190220573A1 (en) * 2018-01-17 2019-07-18 Samsung Electronics Co., Ltd. Method and apparatus for generating a chemical structure using a neural network
WO2019202292A1 (en) * 2018-04-20 2019-10-24 DrugAI Limited Interaction property prediction system and method
CN110473595A (en) * 2019-07-04 2019-11-19 四川大学 A capsule network relation extraction model combining shortest dependency paths
CN110634539A (en) * 2019-09-12 2019-12-31 腾讯科技(深圳)有限公司 Artificial intelligence-based drug molecule processing method and device and storage medium
CN110970099A (en) * 2019-12-10 2020-04-07 北京大学 Medicine molecule generation method based on regularization variational automatic encoder
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generation of confrontation network
CN111128314A (en) * 2018-10-30 2020-05-08 深圳市云网拜特科技有限公司 Drug discovery method and system
CN111432720A (en) * 2017-10-06 2020-07-17 梅约医学教育与研究基金会 ECG-based cardiac ejection fraction screening
CN111508568A (en) * 2020-04-20 2020-08-07 腾讯科技(深圳)有限公司 Molecule generation method and device, computer readable storage medium and terminal equipment
CN111584010A (en) * 2020-04-01 2020-08-25 昆明理工大学 Key protein identification method based on capsule neural network and ensemble learning
US20200311914A1 (en) * 2017-04-25 2020-10-01 The Board Of Trustees Of Leland Stanford University Dose reduction for medical imaging using deep convolutional neural networks
CN111785326A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Method for predicting gene expression profile after drug action based on generation of confrontation network
CN111814460A (en) * 2020-07-06 2020-10-23 四川大学 External knowledge-based drug interaction relation extraction method and system


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ARPIT SRIVASTAVA等: "Computational Drug Discovery Approach for Drug Design against Zika Virus", 《2018 INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND SYSTEMS BIOLOGY (BSB)》 *
GUPTA ANVITA等: "Generative Recurrent Networks for De Novo Drug Design", 《MOLECULAR INFORMATICS》 *
HONGMING CHEN等: "The rise of deep learning in drug discovery", 《DRUG DISCOVERY TODAY》 *
RAWAT, ANIL KUMAR等: "Analysis on Latest Technologies in Medical Imaging for Early Diagnosis and Treatment", 《JOURNAL OF COMPUTATIONAL AND THEORETICAL NANOSCIENCE》 *
WANG Y等: "Capsule Networks Showed Excellent Performance in the Classification of hERG Blockers/Nonblockers", 《FRONTIERS IN PHARMACOLOGY》 *
XIN YANG等: "Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery", 《CHEMICAL REVIEWS》 *
LIAO JUN et al.: "Research Progress of Deep Learning in Drug Research and Development", 《PROGRESS IN PHARMACEUTICAL SCIENCES》 *
TAN XIAOQIN et al.: "Forty Years of Achievements in Drug Molecular Design in China", 《SCIENTIA SINICA VITAE》 *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant