CN112270951A - Brand-new molecule generation method based on multitask capsule self-encoder neural network - Google Patents

Brand-new molecule generation method based on multitask capsule self-encoder neural network

Info

Publication number
CN112270951A
Authority
CN
China
Prior art keywords
capsule
multitask
molecules
encoder
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011247808.1A
Other languages
Chinese (zh)
Other versions
CN112270951B (en)
Inventor
邹俊
杨胜勇
李侃
杨欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011247808.1A priority Critical patent/CN112270951B/en
Publication of CN112270951A publication Critical patent/CN112270951A/en
Application granted granted Critical
Publication of CN112270951B publication Critical patent/CN112270951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Medicinal Preparation (AREA)

Abstract

The invention discloses a brand-new molecule generation method based on a multitask capsule self-encoder (autoencoder) neural network. The method represents drug molecules as SMILES (simplified molecular-input line-entry system) strings and labels them with target-property tags; a training phase learns the characteristics of known drug molecules to obtain a trained model; a reconstruction phase uses the trained model to reconstruct molecules; and a generation phase uses the trained model to generate molecules that simultaneously possess multiple preset target properties, a large fraction of which are novel molecules with novel scaffolds. The invention can be used to generate various kinds of molecules, such as drugs or other compounds: after a single training run on known drugs, it can generate molecules that simultaneously meet the required physical, chemical and biological properties. Molecules generated by the method show higher validity and better properties.

Description

Brand-new molecule generation method based on multitask capsule self-encoder neural network
Technical Field
The invention relates to the interdisciplinary field of computer artificial intelligence and de novo molecular design, and in particular to a brand-new molecule generation method using a multitask capsule self-encoder neural network: a method that performs de novo molecular design based on an autoencoder framework combined with a multitask capsule classifier, suitable for generating molecules that simultaneously satisfy multiple physical, chemical and biological properties.
Background
The design of small-molecule drugs plays a key role in the development of active drugs. Traditional drug design methods, such as virtual screening and pharmacophore models, mainly search libraries of known virtual compounds. Because of the enormous number of potentially synthesizable molecules in chemical space (estimated at 10^23 to 10^60) and the limits of current computing performance, an exhaustive search of the whole chemical space is infeasible, and analysing the search results requires extensive professional experience. As a data-driven computational approach, artificial intelligence can automatically learn chemical structures, structure-activity relationships and related knowledge from a data set of drug molecules, helping scientists design molecules with target properties and bringing new hope to drug discovery and development. De novo molecular design based on deep neural networks, a recent artificial intelligence technique, can generate molecules with desired properties; its advantage is that new molecules with optimized properties can be generated without enumerating virtual compound libraries. However, existing molecule generation methods consider only a single target property: they struggle to learn characteristics beyond that property, cannot optimize several properties of a molecule at once, and therefore fall short of the requirements of new-drug design. A key difficulty in molecule generation lies in the choice of classifier: a conventional support vector machine cannot be trained jointly with a deep neural network, and a convolutional neural network classifies poorly here, making both hard to apply to classifying and generating molecules with multiple target properties.
Disclosure of Invention
The invention aims to provide a method for generating molecules that simultaneously satisfy multiple target properties, such as molecular weight, lipid-water partition coefficient, hydrogen-bond donor count, hydrogen-bond acceptor count, number of rotatable bonds, polar surface area, synthesizability, and activity against a specific target.
The invention provides a new model that takes an autoencoder as its basic framework and integrates a multitask capsule classifier in the hidden layer. With this method, drug molecules with multiple optimized target properties can be generated effectively.
The technical scheme of the method is as follows:
a brand new molecule generation method based on a multitask capsule self-encoder neural network is characterized by comprising the following steps: the drug molecules are denoted as SMILES (simplified molecule linear input specification), labeling the target property labels of the molecules. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. The encoder encodes drug molecules SMILES into vectors with fixed length by using a bidirectional long-short term memory network; the multitask capsule classifier utilizes double-layer capsule layer codes to represent vectors of the drug molecular property labels; the decoder directly decodes the hidden layer vector based on the long-term and short-term memory network to generate SMILES of molecules, and the reconstruction of the molecules or the generation of brand new molecules are realized.
The method comprises the following steps:
step 1: collecting training data, building a one-hot encoding table for the molecular SMILES characters, and computing property labels;
step 2: learning the characteristics of known drug molecules SMILES through a training stage to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, molecules with specific properties are generated by using a training model.
Further, the step 1 specifically includes:
collecting drug molecules and establishing a specific training data set;
using SMILES to represent drug molecules;
determining the target properties of the drug molecules by experiment or by calculation; if a property is quantitative, choosing a reasonable threshold to convert it into a qualitative label, namely 1 for the target property and 0 otherwise; molecular target properties are computed with the PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both drug molecules SMILES and specific property labels;
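The one-hot encoding table described in step 1 can be sketched as follows; the vocabulary construction and fixed-length padding shown here are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal sketch of step 1's one-hot encoding of SMILES strings.
# The vocabulary and max_len padding scheme are illustrative assumptions.

def build_vocab(smiles_list):
    """Collect the distinct characters appearing in the training SMILES."""
    chars = sorted(set("".join(smiles_list)))
    return {ch: i for i, ch in enumerate(chars)}

def one_hot_encode(smiles, vocab, max_len):
    """Encode one SMILES string as a (max_len x vocab_size) 0/1 matrix."""
    size = len(vocab)
    matrix = [[0] * size for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][vocab[ch]] = 1
    return matrix

vocab = build_vocab(["CCO", "c1ccccc1O"])   # ethanol, phenol
encoded = one_hot_encode("CCO", vocab, max_len=10)
```

Rows beyond the string's length stay all-zero, so every molecule maps to a matrix of the same shape regardless of its SMILES length.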
further, the step 2 specifically includes:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually tuning the model hyper-parameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the trained model with the smallest cross-entropy loss;
and keeping the best model in multiple training processes as a training model.
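The model-selection rule of step 2 (keep the run with the smallest cross-entropy loss across hyper-parameter settings) can be sketched as follows; `fake_loss` is a hypothetical stand-in for one full training run, and all candidate values are illustrative:

```python
import itertools

# Toy stand-in for one training run of the multitask capsule autoencoder;
# a real run would train the network and return its cross-entropy loss.
def fake_loss(lr, n_units, steps):
    return abs(lr - 0.005) + 1.0 / n_units + 10.0 / steps

# Try every hyper-parameter combination and keep the best run,
# mirroring "keep the training model with the minimum loss".
candidates = itertools.product([0.001, 0.01],   # learning rate
                               [128, 256],      # neurons per layer
                               [100, 200])      # training steps
best = min(
    ({"lr": lr, "units": u, "steps": s, "loss": fake_loss(lr, u, s)}
     for lr, u, s in candidates),
    key=lambda run: run["loss"],
)
```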
Further, the step 3 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
a decoder decodes the fixed length vectors into reconstructed molecular data;
calculating a reconstruction rate by reconstructing the molecular data;
and saving the reconstructed molecular data.
Further, the step 4 specifically includes:
operating a training model, and coding training data into vectors with fixed length in batch by an encoder;
the multitask capsule classifier computes the property features of the molecules, which are used to select molecules carrying the target properties;
performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution;
the decoder decodes the new vector distribution into generated molecular data;
manually tuning the hyper-parameters of the data-enhancement process over multiple runs, and keeping the best generation result;
when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
Further, the multitask capsule self-encoder neural network comprises an encoder, a multitask capsule classifier and a decoder, wherein the training data serves as the input of the encoder, and the output of the encoder serves as the input of the multitask capsule classifier; the output of the multitask capsule classifier is used as the input of the decoder.
Further, the encoder uses a bidirectional long short-term memory network to encode the drug molecule's SMILES directly into a fixed-length vector, in 3 parts:
1) A forward recurrent neural network $\overrightarrow{f}$ reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x}$; a backward recurrent neural network $\overleftarrow{f}$ reads the input sequence in reverse order from $x_{T_x}$ to $x_1$ and computes the backward hidden states $\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x}$:
$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1})$
$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$
where $x_{T_x}$ denotes the character at time step $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) The hidden state $h_t$ is formed by concatenating the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:
$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
where $h_t$ denotes the hidden state at time $t$;
3) The hidden-layer vector is generated from the sequence of hidden states:
$c = q(\{h_1, \dots, h_{T_x}\})$
where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
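As a concrete illustration of the bidirectional encoding, the following pure-Python sketch runs a toy recurrence forwards and backwards over a sequence and concatenates the two hidden states at each step; `step_f` is a made-up stand-in for the LSTM cell, not the patent's network:

```python
# Illustrative sketch of the bidirectional encoder: a forward pass and a
# backward pass over the input, with each step's hidden state the
# concatenation of the two directions.

def step_f(x, h_prev):
    # toy nonlinear-free recurrence standing in for an LSTM cell f
    return [0.5 * x + 0.5 * h for h in h_prev]

def bidirectional_encode(xs, hidden_size=2):
    fwd, h = [], [0.0] * hidden_size
    for x in xs:                      # read x_1 ... x_Tx
        h = step_f(x, h)
        fwd.append(h)
    bwd, h = [], [0.0] * hidden_size
    for x in reversed(xs):            # read x_Tx ... x_1
        h = step_f(x, h)
        bwd.append(h)
    bwd.reverse()
    # h_t = [forward h_t ; backward h_t]
    return [f + b for f, b in zip(fwd, bwd)]

states = bidirectional_encode([1.0, 2.0, 3.0])
```

A real encoder would then reduce the state sequence to the single fixed-length vector c (for instance by taking the final states of both directions).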
Further, the multitask capsule classifier uses two capsule layers (Capsule Layers) and is trained by optimizing a margin loss to encode and predict the property labels of the drug molecules: the hidden-layer vector is mapped onto the two capsule layers, and the number of routing iterations is tuned. Specifically:
1) Matrix transformation: a prediction vector is computed from the hidden-layer mapping:
$\hat{u}_{j|i} = W_{ij} u_i$
where $i$ indexes the first capsule layer, $j$ the second capsule layer, $\hat{u}_{j|i}$ denotes the prediction vector, $W_{ij}$ a weight matrix learned by backpropagation, and $u_i$ the output of capsule $i$;
2) The total input vector $s_j$ of capsule $j$ is the weighted sum of all prediction vectors:
$s_j = \sum_i c_{ij} \hat{u}_{j|i}$
where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:
$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$
where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; before routing begins, $b_{ij}$ is initialized to 0, and the $c_{ij}$ values are updated over the routing iterations;
4) The vector output of capsule $j$ is computed with a nonlinear squash function:
$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$
where $v_j$ denotes the vector output of capsule $j$;
5) The loss $L_k$ of each capsule in the capsule layer is computed as:
$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$
where $L_k$ denotes the loss of capsule $k$, $T_k$ an indicator function (1 or 0 according to the property label), $m^+$ and $m^-$ the upper and lower margins, and $\lambda$ a scaling factor weighting the two terms.
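The routing quantities above can be illustrated numerically. The sketch below uses plain Python lists, two-dimensional toy vectors, and the commonly used capsule-network margin parameters m+ = 0.9, m- = 0.1, λ = 0.5; these values and the toy inputs are assumptions, not the patent's settings:

```python
import math

# Softmax over the routing logits b_ij gives the coupling coefficients c_ij.
def softmax(logits):
    exps = [math.exp(b) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Nonlinear squash: shrinks s_j so that ||v_j|| lies in [0, 1).
def squash(s):
    norm2 = sum(v * v for v in s)
    scale = norm2 / (1.0 + norm2) / math.sqrt(norm2)
    return [scale * v for v in s]

# Margin loss of one output capsule against its 0/1 property label.
def margin_loss(v, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    length = math.sqrt(sum(x * x for x in v))
    return (target * max(0.0, m_pos - length) ** 2
            + lam * (1 - target) * max(0.0, length - m_neg) ** 2)

# One routing step for capsule j, given prediction vectors u_hat from two
# lower-level capsules i, with logits b_ij initialized to zero.
u_hat = [[1.0, 0.0], [0.0, 1.0]]
c = softmax([0.0, 0.0])                       # c_ij = [0.5, 0.5]
s_j = [sum(c[i] * u_hat[i][d] for i in range(2)) for d in range(2)]
v_j = squash(s_j)
```

In full dynamic routing, the logits b_ij would then be updated from the agreement between v_j and each prediction vector, and the step repeated for the chosen number of iterations.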
Further, the decoder uses a long short-term memory network to decode the hidden-layer vector encoded by the multitask capsule classifier. Specifically:
1) From the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$, the character at time $t$ is predicted:
$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
$s_t = f'(s_{t-1}, y_{t-1}, c)$
where $y_t$ denotes the character predicted at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:
$p(Y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$
where $p(Y)$ denotes the probability of the predicted sequence $Y$.
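The sequence probability is simply the product of the per-character probabilities; a tiny numeric illustration (the probabilities below are made-up values, not model outputs):

```python
# p(Y) = product over t of p(y_t | y_<t, c): the decoder's probability for
# a whole SMILES string is the product of its per-character probabilities.

def sequence_probability(char_probs):
    prob = 1.0
    for p in char_probs:
        prob *= p
    return prob

# e.g. a 4-character SMILES prediction
p_y = sequence_probability([0.9, 0.8, 0.95, 0.7])
```

Because the product shrinks quickly with sequence length, practical implementations usually work with the sum of log-probabilities instead.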
Beneficial effects of the invention: the invention provides a brand-new model consisting of an encoder, a multitask capsule classifier and a decoder. Its innovation is to use the multitask capsule classifier to capture the hard-to-quantify relationship between a drug molecule's structure and its many drug properties, to learn the correlations among the data for those properties, to predict multiple properties of a generated molecule in advance, and, through the encoder and decoder, to generate molecules with multiple properties at once. Compared with previous machine-learning methods for molecule generation, the method has the following advantages:
first, the method of the present invention produces better results than the traditional machine learning method. Traditional generative models can only generate one target property, and multiple training is needed to generate molecules with multiple properties. The invention realizes the simultaneous classification and generation of various properties by applying the multitask capsule classifier, the molecules generated by the multitask capsule self-encoder can simultaneously meet various target properties such as molecular weight, lipid-water distribution coefficient, hydrogen bond donor, hydrogen bond acceptor, rotatable bond quantity, polar surface area, synthesizability, specific target activity and the like, and meanwhile, the generated molecules have new frameworks.
Second, the multitask capsule classifier used by the method performs better than a single-task capsule classifier and other conventional machine-learning methods: by exploiting the correlated information among different property data, it extracts molecular features effectively and improves prediction.
Drawings
FIG. 1 is a block flow diagram of the novel molecular generation method based on a multitask capsule self-encoder.
FIG. 2 is a model diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 3 shows the detailed steps of the novel molecule generation method based on the multitask capsule self-encoder.
FIG. 4 is a schematic diagram of the training of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 5 is a reconstruction diagram of the novel molecular generation method based on the multitask capsule self-encoder.
FIG. 6 is a schematic diagram of the novel molecular generation method based on the multitask capsule self-encoder.
Detailed description of the invention
The figures show specific processes for achieving molecular generation of multiple target properties using the present invention.
The invention provides a brand new molecule generation method based on a multitask capsule self-encoder, which relates to the cross technical field of computer artificial intelligence and new drug molecule design.
Target properties of the molecules generated by the invention include: (1) molecular weight; (2) lipid-water partition coefficient; (3) hydrogen-bond donor count; (4) hydrogen-bond acceptor count; (5) number of rotatable bonds; (6) polar surface area; (7) synthesizability; (8) activity against targets such as PDGF, Renin and Bcl-2.
See fig. 1.
The method comprises the steps of constructing an effective drug molecule database, measuring or calculating drug molecule property labels, constructing a self-encoder frame, constructing a multi-task capsule classifier frame, designing and realizing a data enhancement module, executing a generation process and the like.
See fig. 2.
The method is a brand-new molecule generation approach based on a multitask capsule self-encoder. The model takes an autoencoder as its basic framework, with a multitask capsule classifier configured in the hidden layer. The encoder uses a bidirectional long short-term memory network to encode a drug molecule's SMILES directly into a fixed-length vector; the multitask capsule classifier uses two capsule layers to analyse and extract features from the vector and predict the drug molecule's property labels; the decoder uses a long short-term memory network to decode the hidden-layer vector, realizing output and molecule generation.
See fig. 3.
The method comprises the following specific operation steps:
step 1: collecting training data, building a one-hot encoding table for the molecular characters, and computing property labels;
step 2: learning the characteristics of known drug molecules through a training phase to obtain a training model;
and step 3: reconstructing the molecules by using the training model through a reconstruction stage;
and 4, step 4: through the generation phase, target property molecules are generated using a training model.
In the present invention, the step 1 specifically comprises: collecting drug molecules in a database, and representing the drug molecules by adopting SMILES; testing or calculating physical, chemical or biological properties of the drug molecule; selecting a reasonable threshold value to convert quantitative data into qualitative category labels; the training data includes both drug molecules SMILES and property labels. The training data will be used in the training process of the model.
See fig. 4.
In the present invention, the step 2 specifically includes: training data is input into a multi-task capsule self-encoder to be trained, the hyper-parameters (learning rate, neuron number and training step number) of the model are manually adjusted for multiple times, the training model with the minimum cross entropy loss function value is reserved, and the optimal model in the multiple training process is reserved as the training model.
See fig. 5.
In the present invention, the step 3 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; a decoder decodes the fixed length vectors into reconstructed molecular data; calculating a reconstruction rate by reconstructing the molecular data; and saving the reconstructed molecular data.
See fig. 6.
In the present invention, the step 4 specifically includes: reading training data, operating a training model, and encoding the training data into vectors with fixed length in batch by an encoder; predicting the property of the training molecule by the multi-task capsule classifier, and reserving the target property molecule; performing data enhancement on the vector representation of the target property molecules to obtain new vector distribution; the decoder decodes the new vector distribution into new generated molecular data; manually debugging the hyper-parameters of the data enhancement process for multiple times, and keeping the best generation result; when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
The encoder uses a bidirectional long short-term memory network to encode a drug molecule's SMILES directly into a fixed-length vector, as follows:
1) A forward recurrent neural network $\overrightarrow{f}$ reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x}$; a backward recurrent neural network $\overleftarrow{f}$ reads the input sequence in reverse order from $x_{T_x}$ to $x_1$ and computes the backward hidden states $\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x}$:
$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1})$
$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$
where $x_{T_x}$ denotes the character at time step $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) The hidden state $h_t$ is formed by concatenating the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:
$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
where $h_t$ denotes the hidden state at time $t$;
3) The hidden-layer vector is generated from the sequence of hidden states:
$c = q(\{h_1, \dots, h_{T_x}\})$
where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
The hidden-layer vector is mapped onto the two capsule layers and the number of routing iterations is tuned; the multitask capsule classifier predicts the properties of the training molecules, and the molecules with the target properties are retained. Specifically:
1) Matrix transformation: a prediction vector is computed from the hidden-layer mapping:
$\hat{u}_{j|i} = W_{ij} u_i$
where $i$ indexes the first capsule layer, $j$ the second capsule layer, $\hat{u}_{j|i}$ denotes the prediction vector, $W_{ij}$ a weight matrix learned by backpropagation, and $u_i$ the output of capsule $i$;
2) The total input vector $s_j$ of capsule $j$ is the weighted sum of all prediction vectors:
$s_j = \sum_i c_{ij} \hat{u}_{j|i}$
where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) The coupling coefficients $c_{ij}$ are computed with a softmax activation function:
$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$
where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; before routing begins, $b_{ij}$ is initialized to 0, and the $c_{ij}$ values are updated over the routing iterations;
4) The vector output of capsule $j$ is computed with a nonlinear squash function:
$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$
where $v_j$ denotes the vector output of capsule $j$;
5) The loss $L_k$ of each capsule in the capsule layer is computed as:
$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$
where $L_k$ denotes the loss of capsule $k$, $T_k$ an indicator function, $m^+$ and $m^-$ the upper and lower margins respectively, and $\lambda$ a scaling factor.
The decoder uses a long short-term memory network to decode the hidden-layer vector. Specifically:
1) From the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$, the character at time $t$ is predicted:
$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
$s_t = f'(s_{t-1}, y_{t-1}, c)$
where $y_t$ denotes the character predicted at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) The probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:
$p(Y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$
where $p(Y)$ denotes the probability of the predicted sequence $Y$.
Example:
Generating molecules which simultaneously satisfy the properties of molecular weight, lipid-water partition coefficient, hydrogen bond donor, hydrogen bond acceptor, number of rotatable bonds, polar surface area, synthesizability and the like. The implementation process is as follows:
the first step is as follows: drug molecules (1757517 compounds) were collected from the ChEMBL open source database (https:// www.ebi.ac.uk/ChEMBL /), and indicated with SMILES.
The second step: the molecular weight, lipid-water partition coefficient, hydrogen-bond donor count, hydrogen-bond acceptor count, number of rotatable bonds, polar surface area and synthesizability of the drug molecules are computed with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs; molecules with molecular weight ≤ 500, lipid-water partition coefficient between 0 and 5, hydrogen-bond donors ≤ 5, hydrogen-bond acceptors ≤ 10, number of rotatable bonds ≤ 20, polar surface area ≤ 200 and synthesizability score ≤ 6 are selected as having the target properties and labelled 1; the training data contain both the drug molecules' SMILES and the property labels, and are saved in SMI format.
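The labelling rule of this step (label 1 only when every property falls inside its threshold) can be sketched as follows; the property dictionaries are illustrative values, not computed descriptors:

```python
# Sketch of the second step's labelling rule: a molecule is labelled 1 only
# if every computed property satisfies its threshold. In practice the
# property values would come from PaDEL-Descriptor, RDKit or Discovery
# Studio; the dictionaries here are made up for illustration.

def target_label(props):
    ok = (props["mw"] <= 500
          and 0 <= props["logp"] <= 5          # lipid-water partition coeff.
          and props["hbd"] <= 5                # hydrogen-bond donors
          and props["hba"] <= 10               # hydrogen-bond acceptors
          and props["rotatable"] <= 20
          and props["psa"] <= 200              # polar surface area
          and props["sa"] <= 6)                # synthesizability score
    return 1 if ok else 0

drug_like = {"mw": 320.4, "logp": 2.1, "hbd": 1, "hba": 4,
             "rotatable": 5, "psa": 78.0, "sa": 3.2}
too_big = dict(drug_like, mw=812.9)
```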
The third step: and establishing a brand-new molecular generation model based on the multitask capsule self-encoder by using the training data. Structurally, the model includes three parts, an encoder, a multitask capsule classifier and a decoder. In the training stage, the hyper-parameters (learning rate, neuron number and training step number) of the model are manually adjusted for multiple times, and the optimal model in the multiple training process is reserved as a training model. The present example is debugged from the following aspects:
batch size candidate range for training phase: 128, 256, 512, and 1028; network iteration number candidate range: from 100 to 1000, each change increases by 100;
the encoder uses a bidirectional long-short term memory network to directly encode the drug molecules SMILES into fixed-length vectors. Encoder per layer neuron number candidate range: 128, 192 and 256; the number of encoder neuron layers is set to be 1;
the multi-task capsule classifier is composed of double-layer capsule layers, and classification of various target properties is achieved by optimizing and adjusting Routing iteration times. Per capsule layer neuron candidate range: 128, 192 and 256; candidate range of number of times of route iteration of capsule part: 1,2,3,4 and 5; the loss weight of the capsule classifier is set to 10; the optimizer selects AdamaOptizer; learning rate candidate range of capsule classifier: from 0.001 to 0.01, with each increase varying by 0.001;
the decoder decodes the hidden layer vector using a long-short term memory based network. Decoder per layer neuron candidate range: 256, 384 and 512; the number of encoder neuron layers is set to 1.
The fourth step: in the reconstruction stage, the molecules are reconstructed through the training model, and reconstructed molecule files are stored.
Batch size candidate range for reconstruction phase: 500, 1000, 1500 and 2000; the number of batch processes was set to 10.
The fifth step: in the generation stage, molecule generation is carried out through a training model, a compound which simultaneously meets the requirements of molecular weight, a lipid-water distribution coefficient, a hydrogen bond donor, a hydrogen bond acceptor, the number of rotatable bonds, polar surface area and synthesizability is generated, and a generated molecule file is stored.
Batch size candidate range of generation phase: 500, 1000, 1500 and 2000; the batch processing times are set to 10; the standard deviation of the normal distribution during the data enhancement process was set to 0.2.
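The data-enhancement step above can be sketched as follows: each latent vector of a target-property molecule is perturbed with zero-mean Gaussian noise of standard deviation 0.2 to obtain new latent points for the decoder; the vector values and helper names are illustrative:

```python
import random

# Hedged sketch of the data-enhancement process: sample new latent points
# around a target-property molecule's fixed-length hidden vector by adding
# N(0, 0.2) noise, as the example sets the normal distribution's standard
# deviation to 0.2. A fixed seed keeps the sketch reproducible.

def enhance(latent, n_samples, std=0.2, seed=42):
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, std) for v in latent]
            for _ in range(n_samples)]

latent = [0.1, -0.4, 0.7, 0.2]          # toy fixed-length hidden vector
samples = enhance(latent, n_samples=3)
```

Each perturbed vector would then be passed to the decoder to produce a new SMILES string, which is kept only if it parses as a valid molecule.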
Methods for designing small-molecule drugs play a key role in active-drug discovery. Traditional drug design methods, such as virtual screening and pharmacophore models, mainly search libraries of known virtual compounds. Because of the enormous number of drug molecules in chemical space and the limits of current computing performance, searching the whole chemical space is impractical, and analysing the search results requires extensive professional expertise. De novo molecular design based on deep neural networks, a recent artificial intelligence technique, can generate molecules with desired properties; among its advantages, new molecules with optimized properties can be generated without enumerating virtual compound libraries.
However, existing molecule generation methods consider only a single target property at a time; they are inefficient, cannot constrain multiple molecular properties simultaneously, and hardly meet the needs of new drug design. The multitask capsule self-encoder proposed by the invention can effectively capture the hard-to-quantify relationships between a drug's molecular structure and its various properties, learn the correlations among the property data of drug molecules, generate molecules that simultaneously satisfy physical, chemical and biological property constraints, and improve the validity and novelty of the generated molecules.

Claims (9)

1. A brand-new molecule generation method based on a multitask capsule self-encoder neural network, characterized in that: drug molecules are expressed as SMILES (simplified molecular-input line-entry system) strings and labeled with their target properties, and a de novo molecule generation model comprising an encoder, a multitask capsule classifier and a decoder is built on an autoencoder framework; the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network; the multitask capsule classifier uses two capsule layers, optimizes the Margin Loss, and encodes and predicts the property labels of the drug molecules; the decoder decodes the hidden-layer vector using a long short-term memory network, reconstructing the input at the output;
the method comprises the following steps:
step 1: collecting training data, extracting a molecular one-hot encoding table, and calculating the property labels;
step 2: learning the features of known drug molecules in a training stage to obtain a trained model;
step 3: reconstructing the molecules with the trained model in a reconstruction stage;
step 4: generating molecules with specific properties with the trained model in a generation stage.
2. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein the step 1 specifically comprises:
collecting drug molecules and establishing a specific data set;
drug molecules are represented using SMILES (simplified molecular-input line-entry system);
calculating or collecting target-property data of the drug molecules; if the data are quantitative, choosing a reasonable threshold to convert them into a qualitative label, i.e. target property = 1 and non-target property = 0; all molecular descriptors are calculated with the open-source PaDEL-Descriptor, RDKit or Discovery Studio programs;
the training data contains both the drug molecule SMILES and a specific property label.
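As a rough illustration of the one-hot encoding table of step 1, the sketch below builds a character vocabulary from a toy SMILES list and encodes one string; the helper names and the right-padding scheme are assumptions, not the patent's implementation:

```python
import numpy as np

def build_onehot_table(smiles_list):
    """Map each character seen in a SMILES corpus to an index -- a minimal
    stand-in for the patent's one-hot encoding table."""
    return {c: i for i, c in enumerate(sorted(set("".join(smiles_list))))}

def onehot_encode(smiles, table, max_len):
    """Encode one SMILES string as a (max_len, vocab_size) one-hot matrix,
    zero-padded on the right (padding scheme is an assumption)."""
    mat = np.zeros((max_len, len(table)))
    for pos, ch in enumerate(smiles[:max_len]):
        mat[pos, table[ch]] = 1.0
    return mat

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]   # ethanol, benzene, acetic acid
table = build_onehot_table(smiles)
x = onehot_encode("CCO", table, max_len=10)
print(x.shape)
```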
3. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 2 specifically comprises:
inputting training data into a neural network of a multitask capsule self-encoder to be trained;
manually tuning the model hyper-parameters (learning rate, number of neurons, number of training steps) over multiple runs, and keeping the trained model with the smallest cross-entropy loss;
and reserving the optimal model in the multiple training processes as a pre-training model.
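The keep-the-best-model loop of step 2 can be sketched as follows; `train_once` is a hypothetical stand-in that returns a pseudo-random loss instead of actually training the multitask capsule autoencoder:

```python
import random

def train_once(seed):
    """Hypothetical stand-in for one training run: a real run would train
    the multitask capsule autoencoder and report its cross-entropy loss."""
    random.seed(seed)
    return {"seed": seed}, random.uniform(0.1, 1.0)

best_model, best_loss = None, float("inf")
for seed in range(5):                       # several manual tuning rounds
    model, loss = train_once(seed)
    if loss < best_loss:                    # keep the minimum-loss model
        best_model, best_loss = model, loss
print(round(best_loss, 3))
```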
4. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 3 specifically comprises:
running the trained model, wherein the encoder encodes the training data in batches into fixed-length vectors;
decoding the fixed-length vectors into reconstructed molecular data by the decoder;
calculating the reconstruction rate from the reconstructed molecular data;
and saving the reconstructed molecular data.
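The patent does not spell out how the reconstruction rate is computed; a common convention, assumed here, is the fraction of molecules whose decoded SMILES exactly matches the input:

```python
def reconstruction_rate(originals, reconstructed):
    """Fraction of exact SMILES matches between inputs and decoder
    outputs (an assumed definition, not quoted from the patent)."""
    hits = sum(a == b for a, b in zip(originals, reconstructed))
    return hits / len(originals)

orig = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
recon = ["CCO", "c1ccccc1", "CC(=O)C", "CCN"]   # one decoding error
print(reconstruction_rate(orig, recon))          # 0.75
```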
5. The brand-new molecule generation method based on the multitask capsule self-encoder neural network as claimed in claim 1, wherein said step 4 specifically comprises:
running the trained model, wherein the encoder encodes the training data in batches into fixed-length vectors;
the multitask capsule classifier encodes and predicts the properties of the training molecules;
performing data enhancement on the vector representations of the target-property molecules to obtain a new vector distribution;
decoding the new vector distribution into molecular data by the decoder;
manually tuning the hyper-parameters of the data-enhancement process over multiple runs, and keeping the best generation results;
when the number of molecules generated reaches a predetermined number, the generated molecular data is stored.
6. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 5, wherein said multitask capsule self-encoder neural network comprises an encoder, a multitask capsule classifier and a decoder, said training data is used as input of said encoder, and output of said encoder is used as input of said multitask capsule classifier; the output of the multitask capsule classifier is used as the input of the decoder.
7. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 6, wherein:
the encoder encodes the drug-molecule SMILES into fixed-length vectors using a bidirectional long short-term memory network, in three steps:
1) the forward recurrent neural network $\overrightarrow{f}$ reads the input sequence in order from $x_1$ to $x_{T_x}$ and computes the forward hidden states

$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \quad t = 1, \dots, T_x$

the backward recurrent neural network $\overleftarrow{f}$ reads the input sequence in reverse from $x_{T_x}$ to $x_1$ and computes the backward hidden states

$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1}), \quad t = T_x, \dots, 1$

where $x_{T_x}$ denotes the character at time $T_x$, $\overrightarrow{h}_{T_x}$ the forward hidden state at time $T_x$, $\overleftarrow{h}_{T_x}$ the backward hidden state at time $T_x$, and $f$ a nonlinear function;
2) the hidden state $h_t$ is obtained from the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$ by concatenation:

$h_t = \left[\overrightarrow{h}_t^{\top}; \overleftarrow{h}_t^{\top}\right]^{\top}$

where $h_t$ denotes the hidden state at time $t$;
3) the hidden-layer vector is generated from the sequence of hidden states:

$c = q(\{h_1, \dots, h_{T_x}\})$

where $c$ denotes the vector generated from the hidden-state sequence and $q$ a nonlinear function.
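The bidirectional encoding of claim 7 can be sketched with a generic tanh recurrence standing in for the LSTM cell $f$; the dimensions and random weights below are illustrative only:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    """One generic recurrent step h_t = f(x_t, h_prev), with tanh playing
    the role of the nonlinear function f (an LSTM cell in the patent)."""
    return np.tanh(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
xs = rng.normal(size=(T, d_in))
W, U = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

fwd, h = [], np.zeros(d_h)
for t in range(T):                      # read x_1 .. x_T forward
    h = rnn_step(xs[t], h, W, U)
    fwd.append(h)

bwd, h = [None] * T, np.zeros(d_h)
for t in reversed(range(T)):            # read x_T .. x_1 backward
    h = rnn_step(xs[t], h, W, U)
    bwd[t] = h

# h_t = [forward; backward] concatenation at each time step
hs = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(hs[0].shape)                      # (8,) = 2 * d_h
```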
8. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 6, wherein:
the multitask capsule classifier uses two capsule layers, optimizes the Margin Loss, and predicts the property labels of the drug molecules;
the hidden-layer vector is mapped onto the two capsule layers, and the number of routing iterations is tuned; specifically:
1) matrix transformation: the prediction vectors are computed from the hidden-layer vector mapping:

$\hat{u}_{j|i} = W_{ij} u_i$

where $i$ indexes the first capsule layer, $j$ indexes the second capsule layer, $\hat{u}_{j|i}$ denotes a prediction vector, $W_{ij}$ the weight matrix learned by back-propagation, and $u_i$ the output of capsule $i$;
2) the total input vector $s_j$ of capsule $j$ is computed as the weighted sum of all prediction vectors:

$s_j = \sum_i c_{ij} \hat{u}_{j|i}$

where $s_j$ denotes the total input vector and $c_{ij}$ a coupling coefficient;
3) the coupling coefficients $c_{ij}$ are computed by a softmax activation function:

$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$

where $b_{ij}$ is the log prior probability of the connection strength between capsule $i$ and capsule $j$; before routing, $b_{ij}$ is initialized to 0, and the values of $c_{ij}$ are updated over the routing iterations;
4) the vector output of capsule $j$ is computed by a nonlinear squashing function:

$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$

where $v_j$ denotes the vector output of capsule $j$;
5) the loss $L_k$ of each capsule in the capsule layer is computed as:

$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$

where $L_k$ denotes the loss of each capsule, $T_k$ an indicator function, and $m^+$ and $m^-$ the upper and lower margins, respectively.
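The routing and loss computations of claim 8 can be sketched in NumPy; the array shapes and the margin parameters (m+ = 0.9, m- = 0.1, lambda = 0.5) follow the capsule-network literature, not values given in the patent:

```python
import numpy as np

def squash(s):
    """Nonlinear squashing: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    n2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def route(u_hat, iters=3):
    """Dynamic routing-by-agreement over prediction vectors u_hat of
    shape (n_in, n_out, dim); returns the output capsule vectors v."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                 # routing logits start at 0
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = (c[..., None] * u_hat).sum(axis=0)  # s_j = sum_i c_ij * u_hat_j|i
        v = squash(s)
        b = b + (u_hat * v).sum(axis=-1)        # agreement updates the logits
    return v

def margin_loss(v, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_k = T_k max(0, m+ - |v_k|)^2 + lam (1 - T_k) max(0, |v_k| - m-)^2."""
    norms = np.linalg.norm(v, axis=-1)
    return (T * np.maximum(0, m_pos - norms) ** 2
            + lam * (1 - T) * np.maximum(0, norms - m_neg) ** 2)

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 2, 4))   # 8 input capsules, 2 output capsules
v = route(u_hat)
loss = margin_loss(v, np.array([1.0, 0.0]))
print(loss.shape)                    # one loss per output capsule
```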
9. The novel molecular generation method based on the multitask capsule self-encoder neural network as claimed in claim 6, wherein:
the decoder decodes the hidden-layer vector using a long short-term memory network; specifically:
1) the predicted character at time $t$ is generated from the hidden-layer vector $c$ and the characters $y_1, \dots, y_{t-1}$ predicted before time $t$:

$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$

$s_t = f'(s_{t-1}, y_{t-1}, c)$

where $y_t$ denotes the predicted character at time $t$, $s_t$ the hidden state of the decoder, and $g$ and $f'$ nonlinear functions;
2) the probability of the predicted sequence $Y$ is computed from the probabilities of all predicted characters:

$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$

where $p(Y)$ denotes the probability of the predicted sequence $Y$.
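The product over per-character probabilities in claim 9 underflows quickly for long SMILES strings, so it is usually accumulated in log space; the probabilities below are made up for illustration:

```python
import math

def sequence_log_prob(step_probs):
    """log p(Y) = sum_t log p(y_t | y_<t, c); summing logs avoids the
    numerical underflow of the raw product."""
    return sum(math.log(p) for p in step_probs)

# Made-up per-character decoder probabilities for a 3-character string
probs = [0.9, 0.8, 0.7]
p_Y = math.exp(sequence_log_prob(probs))
print(round(p_Y, 3))   # 0.504 = 0.9 * 0.8 * 0.7
```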
CN202011247808.1A 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network Active CN112270951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011247808.1A CN112270951B (en) 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network


Publications (2)

Publication Number Publication Date
CN112270951A true CN112270951A (en) 2021-01-26
CN112270951B CN112270951B (en) 2022-11-01

Family

ID=74339427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011247808.1A Active CN112270951B (en) 2020-11-10 2020-11-10 Brand-new molecule generation method based on multitask capsule self-encoder neural network

Country Status (1)

Country Link
CN (1) CN112270951B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562869A (en) * 2021-02-24 2021-03-26 北京中医药大学东直门医院 Drug combination safety evaluation system, method and device
CN113223637A (en) * 2021-05-07 2021-08-06 中国科学院自动化研究所 Drug molecule generator training method based on domain knowledge and deep reinforcement learning
CN113470740A (en) * 2021-06-30 2021-10-01 中国石油大学(华东) Medicine recommendation system, computer equipment and storage medium based on fully-connected network integrated deep learning model
CN113488119A (en) * 2021-06-18 2021-10-08 重庆医科大学 Medicine small molecule numerical value feature structured database and establishing method thereof
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114446414A (en) * 2022-01-24 2022-05-06 电子科技大学 Reverse synthetic analysis method based on quantum circulating neural network
CN114496112A (en) * 2022-01-21 2022-05-13 内蒙古工业大学 Multi-objective optimization-based breast cancer resistant drug component intelligent quantification method
CN114913938A (en) * 2022-05-27 2022-08-16 中南大学 Small molecule generation method, equipment and medium based on pharmacophore model
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN117334271A (en) * 2023-09-25 2024-01-02 江苏运动健康研究院 Method for generating molecules based on specified attributes
WO2024009110A1 (en) * 2022-07-08 2024-01-11 Topia Life Sciences Limited An automated system for generating novel molecules

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106146670A (en) * 2015-04-24 2016-11-23 宜明昂科生物医药技术(上海)有限公司 A novel recombinant bifunctional fusion protein, and preparation and application thereof
US20170161635A1 (en) * 2015-12-02 2017-06-08 Preferred Networks, Inc. Generative machine learning systems for drug design
CN108073780A (en) * 2016-11-14 2018-05-25 王�忠 A method for comparing the clinical efficacy of traditional Chinese medicine compound prescriptions
CN109979541A (en) * 2019-03-20 2019-07-05 四川大学 Medicament molecule pharmacokinetic property and toxicity prediction method based on capsule network
US20190220573A1 (en) * 2018-01-17 2019-07-18 Samsung Electronics Co., Ltd. Method and apparatus for generating a chemical structure using a neural network
WO2019202292A1 (en) * 2018-04-20 2019-10-24 DrugAI Limited Interaction property prediction system and method
CN110473595A (en) * 2019-07-04 2019-11-19 四川大学 A capsule network relation extraction model combining shortest dependency paths
CN110634539A (en) * 2019-09-12 2019-12-31 腾讯科技(深圳)有限公司 Artificial intelligence-based drug molecule processing method and device and storage medium
CN110970099A (en) * 2019-12-10 2020-04-07 北京大学 Medicine molecule generation method based on regularization variational automatic encoder
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generation of confrontation network
CN111128314A (en) * 2018-10-30 2020-05-08 深圳市云网拜特科技有限公司 Drug discovery method and system
CN111432720A (en) * 2017-10-06 2020-07-17 梅约医学教育与研究基金会 ECG-based cardiac ejection fraction screening
CN111508568A (en) * 2020-04-20 2020-08-07 腾讯科技(深圳)有限公司 Molecule generation method and device, computer readable storage medium and terminal equipment
CN111584010A (en) * 2020-04-01 2020-08-25 昆明理工大学 Key protein identification method based on capsule neural network and ensemble learning
US20200311914A1 (en) * 2017-04-25 2020-10-01 The Board Of Trustees Of Leland Stanford University Dose reduction for medical imaging using deep convolutional neural networks
CN111785326A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Method for predicting gene expression profile after drug action based on generation of confrontation network
CN111814460A (en) * 2020-07-06 2020-10-23 四川大学 External knowledge-based drug interaction relation extraction method and system


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ARPIT SRIVASTAVA等: "Computational Drug Discovery Approach for Drug Design against Zika Virus", 《2018 INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND SYSTEMS BIOLOGY (BSB)》 *
GUPTA ANVITA等: "Generative Recurrent Networks for De Novo Drug Design", 《MOLECULAR INFORMATICS》 *
HONGMING CHEN等: "The rise of deep learning in drug discovery", 《DRUG DISCOVERY TODAY》 *
RAWAT, ANIL KUMAR等: "Analysis on Latest Technologies in Medical Imaging for Early Diagnosis and Treatment", 《JOURNAL OF COMPUTATIONAL AND THEORETICAL NANOSCIENCE》 *
WANG Y等: "Capsule Networks Showed Excellent Performance in the Classification of hERG Blockers/Nonblockers", 《FRONTIERS IN PHARMACOLOGY》 *
XIN YANG等: "Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery", 《CHEMICAL REVIEWS》 *
LIAO JUN et al.: "Research Progress of Deep Learning in Drug Research and Development", 《PROGRESS IN PHARMACEUTICAL SCIENCES》 *
TAN XIAOQIN et al.: "Forty Years of Achievements in Drug Molecular Design in China", 《SCIENTIA SINICA VITAE》 *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant