WO2022047677A1 - Drug molecule screening method and system - Google Patents

Info

Publication number: WO2022047677A1
Application number: PCT/CN2020/113085
Authority: WIPO (PCT)
Prior art keywords: layer, drug, model, data, encoding
Other languages: French (fr), Chinese (zh)
Inventors: 汪念, 吴楚楠, 徐旻, 温书豪, 马健, 赖力鹏
Original assignee: 深圳晶泰科技有限公司
Application filed by 深圳晶泰科技有限公司
Priority to PCT/CN2020/113085
Publication of WO2022047677A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics

Definitions

  • the invention relates to a screening method, in particular to a drug molecule screening method and system.
  • the most widely used methods by researchers are based on autoencoder models, such as VAE and AAE models.
  • models based on the idea of adversarial networks can explore potential molecules similar to existing drug molecules.
  • their disadvantage is that the effectiveness and accuracy of the generated drug molecules are not high; at the same time, the generated potential drug molecules are too similar to the training-set molecules, which weakens diversity. As a result, the generated potential drug molecules offer little breakthrough beyond the existing field, or their drug-like properties are too weak for them to enter the real drug-testing stage.
  • the properties of drug molecules are rarely considered in existing AI models, and the input layer data of these models is relatively homogeneous, which reduces the effectiveness of the drug molecules they generate.
  • a drug molecule screening method comprising:
  • Preprocessing: collect molecular data of drugs related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties of each molecule, form structured data, and store it in a database;
  • Building and training a model: build and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures.
  • the model loss function is minimized through a gradient descent algorithm, iteratively updating the weight parameters of the neural network structures of the encoding layer and decoding layer;
  • Generating potential drug molecules: generate potential drug molecules for curing a specific disease based on the trained conditional variational autoencoder model.
  • the encoding vector is a SMILES encoding vector
  • the preprocessing includes: counting all the characters in the SMILES formulas, converting each character of a SMILES formula into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a specified dimension.
  • the calculation of the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the data of three indicators, molecular mass, lipid-water partition coefficient, and molecular topological polar surface area, are normalized and uniformly mapped to the range -1.0 to 1.0; the five drug physicochemical properties of each drug molecule form a 5-dimensional vector.
  • the SMILES formula data and the drug physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1; each SMILES formula is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing different drug physicochemical properties to form 125-dimensional vector data used as the model's input layer.
  • the AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer.
  • the encoding layer takes the output data of the input layer as input and outputs the data to the hidden layer
  • the encoding layer is an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes; the decoding layer takes the output data of the hidden layer as input and outputs to the output layer
  • the decoding layer is likewise an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes; a softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function

    H(y, p) = -∑_{i=1}^{K} y_i · log(p_i)

  • where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i; through the softmax layer, the probability distribution over the specific character categories is estimated for each position of the SMILES encoding vector, and finally, using the direct correspondence between one-hot values and specific coded characters established in preprocessing, the output sample is reconstructed and the SMILES formula is output.
  • the input layer generates the hidden layer through the encoding layer, forming the encoder; the hidden layer generates the output layer through the decoding layer, forming the decoder
  • the encoder converts the high-dimensional input into a low-dimensional latent vector; the loss function is

    Loss = -E[log P(X|z, c)] + D_KL(Q(z|X, c) ‖ P(z|c))

  • the loss function includes two parts: the first part is the reconstruction term, the (negative) log-likelihood of X under the probability distribution P(X'|z, c); the second part is the KL divergence, which represents the distance between Q(z|X, c) and P(z|c).
  • the model is constructed and trained based on TensorFlow. The training data set is used for model training, and the test data set is used to calculate the test-set error, i.e., the loss function, to prevent the model from overfitting. After a certain number of training epochs, the training data set error is compared with the test data set error: when the test data set error is basically unchanged and the decrease in the training data set error has weakened, the parameters of the encoding layer and decoding layer have been optimized to their best values, so training is stopped and the model is saved.
  • a drug molecule screening system comprising:
  • Preprocessing module: collect molecular data of drugs related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties of each molecule, form structured data, and store it in the database;
  • Model building and training module: build and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures.
  • the model loss function is minimized by the gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and decoding layer are iteratively updated to improve model training;
  • Potential drug molecule generation module: generate potential drug molecules for curing specific diseases based on the trained conditional variational autoencoder model.
  • the calculation of the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the above drug molecule screening method and system also utilize the drug physicochemical property data of compound molecules, because these properties are strongly correlated with whether a compound can finally be developed into a drug; compounds with poor physicochemical properties, or properties outside the required ranges, generally have an extremely low probability of druggability. The physicochemical properties of molecules are therefore taken into account in the model's input layer data, and the physicochemical properties of the molecules obtained after training are controlled within reasonable ranges to improve druggability. In this way, the effectiveness and accuracy of the drug molecules generated by the model are effectively improved, and the generated molecules can be made more diverse by regulating the numerical range of specific drug physicochemical properties.
  • the combination of the SMILES encoding vector and the molecule's drug physicochemical properties is used as the model's input layer, so that the drug molecules generated by the model are expected to show better values for these properties.
  • the property data are abstracted as the conditional vector of the input layer, which is directly introduced into the calculations of both the encoding layer and the decoding layer.
  • the input layer data is converted into the hidden layer encoding vector after passing through the encoder, and then the possible drug molecular structure is generated after passing through the decoding layer.
  • the encoding layer and decoding layer use the LSTM cyclic neural network structure.
  • the model loss function is minimized through the gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and decoding layer are iteratively updated to improve model training.
  • a drug molecule screening method based on the conditional variational autoencoder is adopted, which can effectively improve the effectiveness of the generated potential molecules.
  • the physicochemical property data of the target drug are introduced as a conditional vector, and the conditional variational autoencoder model is used for multivariate control.
  • the drug physicochemical properties of drug molecules are input into the encoding layer, and the conditional vectors are then processed in the decoding layer, so that the generated potential drug molecules perform better on these target physicochemical properties, improving the effectiveness and accuracy of the drug molecules.
  • FIG. 1 is a flowchart of a drug molecular screening method according to an embodiment of the present invention.
  • a drug molecule screening method includes:
  • Step S101, preprocessing: collect drug molecule data related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties, form structured data, and store it in a database;
  • Step S103, constructing and training a model: construct and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures. The model loss function is minimized by the gradient descent algorithm, iteratively updating the weight parameters of the neural network structures of the encoding layer and decoding layer to improve model training;
  • Step S105, generating potential drug molecules: generate potential drug molecules for curing the specific disease according to the trained conditional variational autoencoder model.
  • the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) formula encoding vector.
  • the preprocessing includes: counting all the characters in the SMILES formula, converting each character in the SMILES formula into a one-hot vector, and processing the SMILES formula data of each drug molecule into a coding vector with a set dimension.
  • the specific preprocessing process includes: calculating the SMILES-encoded vector of the drug molecule. First, count all the characters that appear in the SMILES formula, convert each character in the SMILES formula into a one-hot vector, and add an 'E' character at the end of the string to indicate the end.
  • the SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is less than this, the remaining positions are filled with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
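The counting, one-hot conversion, end-marker, and padding steps above can be sketched as follows (a minimal illustration; the helper names and the example character set are assumptions, not from the patent):

```python
import numpy as np

MAX_LEN = 120  # fixed SMILES encoding length used by the method

def build_charset(smiles_list):
    """Collect every character appearing in the SMILES data set,
    plus the 'E' end/padding marker."""
    chars = {"E"}
    for s in smiles_list:
        chars.update(s)
    return sorted(chars)

def encode_smiles(smiles, charset):
    """Append 'E' to mark the end, pad with 'E' up to 120 positions,
    and one-hot encode each character."""
    index = {c: i for i, c in enumerate(charset)}
    padded = (smiles + "E").ljust(MAX_LEN, "E")
    onehot = np.zeros((MAX_LEN, len(charset)), dtype=np.float32)
    for pos, ch in enumerate(padded):
        onehot[pos, index[ch]] = 1.0
    return onehot
```

Decoding reverses the mapping: an argmax over each of the 120 positions recovers the character, and everything from the first 'E' on is padding.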
  • SMILES (Simplified Molecular Input Line Entry Specification) is a specification that explicitly describes molecular structure with ASCII strings.
  • the collected drug molecule data set contains the molecular structural formula data of drug compounds, generally as SMI files, with each drug molecule represented in the SMILES format.
  • Calculating the drug physicochemical properties of drug molecules includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the methods provided in the RDKit software library are used, which mainly include:
  • HBD, molecular H-bond donors: rdkit.Chem.rdMolDescriptors.CalcNumHBD();
  • HBA, molecular H-bond acceptors: rdkit.Chem.rdMolDescriptors.CalcNumHBA();
  • TPSA, molecular topological polar surface area: rdkit.Chem.rdMolDescriptors.CalcTPSA().
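Assuming the RDKit library named above, all five properties can be computed in one pass. The `MolWt` and `MolLogP` calls are standard RDKit descriptors chosen here for illustration; the text only names the `rdMolDescriptors` functions explicitly:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors

def drug_properties(smiles):
    """Compute the five drug physicochemical properties used as the
    conditional vector: MW, LogP, HBD, HBA, and TPSA."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MW": Descriptors.MolWt(mol),             # molecular mass
        "LogP": Crippen.MolLogP(mol),             # lipid-water partition coefficient
        "HBD": rdMolDescriptors.CalcNumHBD(mol),  # H-bond donors
        "HBA": rdMolDescriptors.CalcNumHBA(mol),  # H-bond acceptors
        "TPSA": rdMolDescriptors.CalcTPSA(mol),   # topological polar surface area
    }
```

For ethanol (`CCO`), for example, the hydroxyl group yields one donor, one acceptor, and a TPSA of about 20.2 Å².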
  • the data of three indicators, namely the calculated molecular mass, lipid-water partition coefficient, and molecular topological polar surface area, are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the physicochemical properties of the drug form a 5-dimensional vector.
  • the SMILES data and the drug physicochemical property data are combined to form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1; each SMILES formula is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing different physicochemical properties to form 125-dimensional vector data used as the model's input layer.
  • the calculation results of the physicochemical properties of the drug can be obtained after statistical analysis: the numerical range of MW of all molecules in the data set is 0-500, the numerical range of LogP is 0-5, and the numerical range of TPSA is 0-150.
  • the data of the three indicators MW, LogP, and TPSA are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the numerical values of the molecular H-bond donor and the molecular H-bond acceptor are integers, which can be directly represented by their numerical values.
  • the molecular SMILES formula data and the 5 kinds of drug physicochemical property data together form the drug molecule data set.
  • the total data set is divided into training data set and test data set according to the ratio of 4:1.
  • the division process is random, that is, the training data set accounts for 80%, and the test data accounts for 20%.
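The 4:1 random split can be sketched with a seeded shuffle (a minimal illustration; the helper name and the fixed seed are assumptions):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split the total data set into a training set (80%)
    and a test set (20%), i.e. the 4:1 ratio described above."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```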
  • each drug molecule sample in the drug dataset has an encoded vector in SMILES format, and 5 vectors representing different drug physicochemical properties (ie, conditional vectors in this method).
  • the two data vectors are directly concatenated to form 125-dimensional vector data. All drug molecules in the drug data set are represented in this form, which serves as the model's input layer data.
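Using the value ranges reported below (MW 0-500, LogP 0-5, TPSA 0-150), the normalization and splicing steps might look like this sketch (function names are illustrative; HBD and HBA are used directly as integer counts, per the description):

```python
import numpy as np

# Ranges observed over the data set, per the statistical analysis in the text.
RANGES = {"MW": (0.0, 500.0), "LogP": (0.0, 5.0), "TPSA": (0.0, 150.0)}

def normalize(value, lo, hi):
    """Uniformly map value from [lo, hi] onto [-1.0, 1.0]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def condition_vector(props):
    """5-dimensional conditional vector: MW, LogP and TPSA normalized
    to [-1, 1]; HBD and HBA integer counts used directly."""
    return np.array([
        normalize(props["MW"], *RANGES["MW"]),
        normalize(props["LogP"], *RANGES["LogP"]),
        float(props["HBD"]),
        float(props["HBA"]),
        normalize(props["TPSA"], *RANGES["TPSA"]),
    ], dtype=np.float32)

def input_vector(smiles_vec_120, cond_5):
    """Concatenate the 120-dimensional SMILES encoding with the
    5-dimensional conditional vector into 125-dimensional input data."""
    return np.concatenate([smiles_vec_120, cond_5])
```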
  • compared with the VAE and AAE models, the Conditional Variational Autoencoder (CVAE) model proposed in this application also utilizes the drug physicochemical property data of compound molecules, because these properties are strongly related to whether a compound can finally be developed into a drug: compounds with poor physicochemical properties, or properties outside the required ranges, have a very low probability of druggability. The drug physicochemical properties of the molecules are therefore included in the model's input layer data, and after training, the physicochemical properties of the generated molecules are controlled within reasonable ranges, improving their druggability; in this way, the effectiveness and accuracy of the drug molecules generated by the model are effectively improved, and the generated molecules are more diverse.
  • this method calculates the drug physicochemical properties of the drug molecule data set in advance, mainly selecting five drug physicochemical properties: molecular mass (MW), lipid-water partition coefficient (LogP), number of molecular H-bond donors (HBD), number of molecular H-bond acceptors (HBA), and molecular topological polar surface area (TPSA). These five properties were selected mainly based on the following considerations:
  • the lipid-water partition coefficient LogP represents the logarithm of the ratio of a compound's concentrations in n-octanol (oil) and water, reflecting the distribution of the substance between the oil and water phases.
  • the number of H-bond donors (HBD) and the number of H-bond acceptors (HBA) represent a molecule's capacity to donate and accept intermolecular hydrogen bonds, and are generally used as basic indicators for evaluating the effectiveness of molecules.
  • molecular topological polar surface area (TPSA) is also a parameter commonly used in medicinal chemistry, defined as the total surface area of the polar atoms in a compound, mostly oxygen and nitrogen atoms, including the hydrogen atoms attached to them. In medicinal chemistry, polar surface area is a descriptive indicator for evaluating a drug's transport properties within cells; for a good druggable compound, the topological polar surface area should fall within a certain range of values.
  • the drug physicochemical property data of drug molecules are also treated as auxiliary condition data forming part of the model input layer data.
  • the advantage of this is that the physicochemical properties of the drugs in the dataset can be fully utilized, so that after the model is trained, the physicochemical properties of the drug molecules generated by the model can also be within a reasonable range, and the effectiveness of the drug molecules generated by the model can be improved.
  • the AI model structure includes: input layer, encoding layer, hidden layer, decoding layer, and output layer.
  • the encoding layer of this embodiment takes the output data of the input layer as input and outputs the data to the hidden layer.
  • the encoding layer is an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes. The decoding layer takes the output data of the hidden layer as input and outputs to the output layer; it is likewise an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes. A softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function

    H(y, p) = -∑_{i=1}^{K} y_i · log(p_i)

  • where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i. Through the softmax layer, the probability distribution over the specific character categories is estimated for each position of the SMILES encoding vector; finally, using the direct correspondence between one-hot values and specific coded characters established in preprocessing, the output sample is reconstructed and the SMILES formula is output.
  • the input layer data includes: the SMILES data of the drug molecule set and the 5 drug physicochemical property data of the drug molecule set.
  • Input layer: the preprocessed vector data of the drug molecule data set (SMILES encoding vector + conditional vector composed of drug physicochemical properties); that is, the input layer is X1, X2, ..., Xn, where each Xi is represented by a set of vector data.
  • Encoding layer: the input is the input layer and the output is the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM (Long Short-Term Memory) units, where each LSTM unit includes an input gate, a forget gate, and an output gate, and each layer has 512 hidden nodes.
  • Hidden layer: the hidden layer vector dimension is set to 200.
  • Decoding layer: the input is the hidden layer and the output is the output layer. The decoding layer is an RNN structure containing 3 recurrent neural network layers, also using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function.
  • the cross-entropy function is as follows:

    H(y, p) = -∑_{i=1}^{K} y_i · log(p_i)

  • where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i; each position in the SMILES encoding vector thus yields a probability distribution over the specific character categories.
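The layer dimensions listed above (3 LSTM layers of 512 units in both encoder and decoder, a 200-dimensional hidden vector, a per-position softmax output) can be sketched in Keras. The charset size of 35 is an assumed placeholder, and the wiring is a simplified reading of the text rather than the patent's exact graph:

```python
import tensorflow as tf

CHARSET = 35   # assumed number of SMILES character categories
MAX_LEN = 120  # SMILES encoding length
COND = 5       # conditional vector of drug physicochemical properties
LATENT = 200   # hidden layer vector dimension
UNITS = 512    # hidden nodes per LSTM layer

# Encoder: input layer -> 3 LSTM layers -> mean and log-variance of Q(z|X, c).
enc_in = tf.keras.Input(shape=(MAX_LEN, CHARSET + COND))
h = tf.keras.layers.LSTM(UNITS, return_sequences=True)(enc_in)
h = tf.keras.layers.LSTM(UNITS, return_sequences=True)(h)
h = tf.keras.layers.LSTM(UNITS)(h)
z_mean = tf.keras.layers.Dense(LATENT)(h)
z_logvar = tf.keras.layers.Dense(LATENT)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_logvar])

# Decoder: hidden vector z concatenated with the conditional vector c
# -> 3 LSTM layers -> per-position softmax over the character categories,
# trained with the cross-entropy cost above.
dec_in = tf.keras.Input(shape=(LATENT + COND,))
d = tf.keras.layers.RepeatVector(MAX_LEN)(dec_in)
d = tf.keras.layers.LSTM(UNITS, return_sequences=True)(d)
d = tf.keras.layers.LSTM(UNITS, return_sequences=True)(d)
d = tf.keras.layers.LSTM(UNITS, return_sequences=True)(d)
dec_out = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(CHARSET, activation="softmax"))(d)
decoder = tf.keras.Model(dec_in, dec_out)
```

Note how the conditional vector enters both modules, as the text requires: appended to each input position for the encoder, and concatenated with the sampled hidden vector for the decoder.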
  • the input layer generates the hidden layer through the encoding layer, forming the encoder; the hidden layer generates the output layer through the decoding layer, forming the decoder. The encoder converts the high-dimensional input into a low-dimensional hidden vector.
  • the loss function includes two parts: the first part represents the log-likelihood of X under the probability distribution P(X'|z, c); the second part is the KL divergence, which represents the distance between Q(z|X, c) and P(z|c).
  • the AI model based on the conditional variational autoencoder in this embodiment is a CVAE model, and the CVAE consists of two modules, an encoder and a decoder.
  • the input layer passes through the encoding layer to generate the hidden layer as the encoder.
  • the hidden layer passes through the decoding layer to generate the output layer as the decoder.
  • the encoder converts the high-dimensional input into a low-dimensional latent vector Z, while the decoder restores the latent vector to an output that is as close to the input as possible.
  • the encoder maps the training sample X into two sets of parameters, which determine the conditional probability distribution Q(z|X, c) of the output z given the input X and the conditional vector c; the two sets of parameters represent the mean and the variance, respectively. The hidden vector z obeying this distribution is sampled from it, and the decoder maps the hidden vector z to a new set of parameters. The conditional vector c, which is associated with the sample X, is also used in the decoding layer: a probability distribution P(X'|z, c) based on z and c is determined, so that the probability distribution of the decoding layer is constrained by both the hidden vector and the conditional vector. Q(z|X, c) should be as close as possible to the standard normal distribution N(0, 1).
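A sampling step consistent with this description, in the "reparameterized" form commonly used with variational autoencoders (a sketch under that assumption, not taken verbatim from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(z_mean, z_logvar):
    """Sample the hidden vector z from Q(z|X, c), given the two sets of
    encoder outputs (mean and log-variance):
    z = mean + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_logvar) * eps
```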
  • the model uses the stochastic gradient descent algorithm to minimize the loss function value, and the process is to update and optimize the network structure parameters in the encoding layer and the decoding layer.
  • the loss function of the CVAE model (the loss function is the cost function) is:

    Loss = -E[log P(X|z, c)] + D_KL(Q(z|X, c) ‖ P(z|c))

  • the loss function consists of two parts: the first part is the reconstruction term, the (negative) log-likelihood of X under the probability distribution P(X'|z, c); the second part is the Kullback-Leibler (KL) divergence, which represents the distance between Q(z|X, c) and P(z|c).
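With the encoder producing the mean and log-variance of Q(z|X, c), and the prior taken as the standard normal N(0, 1) as described, the two loss terms can be written as this sketch (the reconstruction term is passed in abstractly as a log-likelihood value):

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_logvar):
    """Closed-form D_KL( N(mean, exp(logvar)) || N(0, 1) ):
    -0.5 * sum(1 + logvar - mean^2 - exp(logvar))."""
    return -0.5 * np.sum(1.0 + z_logvar - z_mean**2 - np.exp(z_logvar))

def cvae_loss(recon_log_likelihood, z_mean, z_logvar):
    """Loss = -E[log P(X|z, c)] + D_KL(Q(z|X, c) || P(z|c))."""
    return -recon_log_likelihood + kl_to_standard_normal(z_mean, z_logvar)
```

Minimizing this loss with gradient descent simultaneously improves reconstruction and pulls Q(z|X, c) toward the prior, matching the two-part description above.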
  • the model is constructed and trained based on TensorFlow. The training data set is used for model training, and the test data set is used to calculate the test-set error, i.e., the loss function, to prevent the model from overfitting. After a certain number of training epochs, the training data set error is compared with the test data set error: when the test data set error is basically unchanged and the decrease in the training data set error has weakened, the parameters of the encoding layer and decoding layer have been optimized to their best values, so training is stopped and the model is saved.
  • the construction and training of the model are carried out based on TensorFlow. The training data set is used for model training, and the test data set is used to calculate the test-set error (the loss function) to prevent model overfitting. After a certain number of training epochs, the training data set error and the test data set error are compared: when the test data set error is basically unchanged and the decrease in the training data set error has weakened, the model can be considered to have high accuracy on both the training and test data sets, and the parameters of the encoding layer and decoding layer have been optimized to their best values; training is therefore stopped and the model is saved. This is the final trained CVAE model.
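The stopping criterion described above (stop once the test-set error has essentially stopped improving) can be sketched as a generic loop; `train_epoch` and `test_loss` are hypothetical callbacks, and `patience`/`tol` are assumed tuning knobs not specified in the text:

```python
def train_with_early_stopping(train_epoch, test_loss,
                              max_epochs=500, patience=5, tol=1e-4):
    """Run training epochs while tracking the test data set error; stop
    when it has not improved by more than tol for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    history = []
    for epoch in range(max_epochs):
        train_epoch()
        loss = test_loss()
        history.append(loss)
        if loss < best_loss - tol:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # test error basically unchanged: stop and save the model
    return best_loss, history
```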
  • in the drug molecule screening method of this embodiment, the five drug physicochemical properties of each drug molecule are first calculated as selected, and the selection of these properties is directly related to the effectiveness of the drug molecule. The calculated drug physicochemical property values are then abstracted into vectors and used as conditional vectors.
  • the uniqueness of the conditional variational autoencoder model is that the conditional vector is additionally input.
  • the conditional vector contains 5 kinds of drug physicochemical properties rather than just one property, so multivariate control can be performed; the specific control process is implemented by the model structure and logic.
  • the hidden layer space is the core of the autoencoder method.
  • the advantage is that the encoder converts the high-dimensional input into a low-dimensional hidden layer vector, and the decoder then restores from the hidden layer vector an output as close to the input data as possible, thereby achieving the goal.
  • the drug physicochemical properties of drug molecules will be abstracted into conditional vectors, which will be used as the input layer data of the model and entered into the encoder.
  • the input of the decoder is the hidden layer vector, and the hidden layer vector contains the conditional vector, that is, the data information of the physicochemical properties of the drug. Therefore, the conditional vector data will also enter the network structure calculation in the decoder.
  • a drug molecule screening system comprising:
  • Preprocessing module: collect molecular data of drugs related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties of each molecule, form structured data, and store it in the database;
  • Model building and training module: build and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures.
  • the model loss function is minimized by the gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and decoding layer are iteratively updated to improve model training;
  • Potential drug molecule generation module: generate potential drug molecules for curing specific diseases based on the trained conditional variational autoencoder model.
  • the calculation of the drug physicochemical properties of the drug molecule in this embodiment includes one or more of: calculation of molecular mass, calculation of the lipid-water partition coefficient, calculation of the number of molecular H-bond donors, calculation of the number of molecular H-bond acceptors, and calculation of the molecular topological polar surface area.
  • the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) formula encoding vector.
  • the preprocessing module of this embodiment counts all the characters in the SMILES formulas, converts each character of a SMILES formula into a one-hot vector, and processes the SMILES formula data of each drug molecule into an encoding vector of a set dimension.
  • the specific preprocessing module includes: calculating the SMILES encoding vector of the drug molecule. First, count all the characters that appear in the SMILES formula, convert each character in the SMILES formula into a one-hot vector, and add an 'E' character at the end of the string to indicate the end.
  • the SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is less than this, the remaining positions are filled with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
  • Calculating the drug physicochemical properties of drug molecules includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the data of three indicators, namely the calculated molecular mass, lipid-water partition coefficient, and molecular topological polar surface area, are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the physicochemical properties of the drug form a 5-dimensional vector.
  • the SMILES data and the drug physicochemical property data are combined to form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1; each SMILES formula is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing different physicochemical properties to form 125-dimensional vector data used as the model's input layer.
  • after statistical analysis of the calculated drug physicochemical properties: the MW of all molecules in the data set ranges from 0 to 500, LogP from 0 to 5, and TPSA from 0 to 150.
  • the data of the three indicators MW, LogP, and TPSA are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the molecular H-bond donor and acceptor counts are integers, which can be represented directly by their numerical values.
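The normalization step above can be sketched as a min-max mapping of each indicator's statistical range onto [-1.0, 1.0]. The patent states only the ranges and the target interval; the linear mapping below is an assumption consistent with that description, and the sample property values are illustrative only.

```python
# Sketch of the normalization described above (assumed min-max linear mapping).
# Ranges come from the statistical analysis in the text: MW 0-500, LogP 0-5, TPSA 0-150.

RANGES = {"MW": (0.0, 500.0), "LogP": (0.0, 5.0), "TPSA": (0.0, 150.0)}

def normalize(name, value):
    """Map a raw property value linearly into [-1.0, 1.0]."""
    lo, hi = RANGES[name]
    return 2.0 * (value - lo) / (hi - lo) - 1.0

# Illustrative values only (roughly aspirin-like):
mw_n = normalize("MW", 180.0)    # -> -0.28
logp_n = normalize("LogP", 1.3)
tpsa_n = normalize("TPSA", 63.6)
```

H-bond donor and acceptor counts stay as plain integers, as the text states, so only MW, LogP, and TPSA pass through this mapping.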
  • the molecular SMILES data and the 5 kinds of drug physicochemical property data together form the drug molecule data set.
  • the total data set is randomly divided into a training data set and a test data set at a 4:1 ratio, i.e., the training data set accounts for 80% and the test data set for 20%.
  • each drug molecule sample in the drug data set has an encoded vector in SMILES format and 5 vectors representing different drug physicochemical properties (i.e., the conditional vectors in this method).
  • the two data vectors are directly concatenated to form a 125-dimensional vector; all drug molecules in the drug data set are represented in this form, which serves as the input-layer data of the model.
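The assembly of the 125-dimensional input vector and the 4:1 random split can be sketched as below. The data are placeholders, and the 120-dimensional SMILES vector is taken as given; the patent specifies only the concatenation and the split ratio.

```python
import random

# Sketch of forming the 125-dim model input (120-dim SMILES encoding + 5 property
# values) and the random 4:1 train/test split described above. Data is synthetic.

def make_input_vector(smiles_vec, prop_vec):
    """Concatenate a 120-dim SMILES encoding with 5 property values -> 125 dims."""
    assert len(smiles_vec) == 120 and len(prop_vec) == 5
    return smiles_vec + prop_vec

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle and split: 80% training, 20% test (the 4:1 ratio in the text)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

dataset = [make_input_vector([0.0] * 120, [0.1, 0.2, 0.3, 0.4, 0.5])
           for _ in range(100)]
train, test = split_dataset(dataset)
```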
  • the AI model structure includes: input layer, encoding layer, hidden layer, decoding layer, and output layer.
  • the encoding layer of this embodiment takes the output data of the input layer as its input and outputs to the hidden layer.
  • the encoding layer is an RNN structure with 3 recurrent neural network layers using LSTM units, each with 512 hidden nodes; the decoding layer takes the output of the hidden layer and outputs to the output layer, and is likewise an RNN structure with 3 recurrent layers of LSTM units, each with 512 hidden nodes; a softmax layer follows the decoding layer, and its cost function is the cross-entropy function
  • where K is the number of categories, y is the label, and p is the network output, i.e., the probability that the category is i; through the softmax layer, the probability distribution of each character category at each position of the SMILES encoding vector is estimated, and finally the output sample is reconstructed via the direct correspondence between one-hot values and specific encoded characters established in preprocessing, outputting the SMILES formula.
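The softmax and cross-entropy computation described above can be sketched for a single character position as follows. This is a pure-Python illustration of the standard formulas (L = -Σ y_i·log p_i over K categories), not the patent's implementation.

```python
import math

# Sketch of the per-position softmax and cross-entropy described above,
# where K is the number of character categories and y is a one-hot label.

def softmax(logits):
    """Convert raw scores into a probability distribution over K categories."""
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, y):
    """L = -sum_i y_i * log(p_i) for one-hot label y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

p = softmax([2.0, 1.0, 0.1])     # toy logits for K = 3 character categories
loss = cross_entropy(p, [1, 0, 0])  # true character is category 0
```

At generation time, taking the argmax (or sampling) over each position's distribution and mapping the index back through the one-hot character table recovers the SMILES character, as the text describes.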
  • the input layer generates a hidden layer through an encoding layer to form an encoder
  • the hidden layer generates an output layer through a decoding layer to form a decoder.
  • the encoder converts the high-dimensional input into a low-dimensional latent vector.
  • the loss function E[logP(X|z,c)] - D_KL[Q(z|X,c)||P(z|c)] of this embodiment includes two parts: the first part is the log-likelihood of P(X) under the probability distribution P(X'|z,c), characterizing the distance between the output of the encoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X,c) and its reference probability distribution N(0,1).
  • the model is constructed and trained with TensorFlow; during training, the training data set is used for model training while the test data set is used to compute the test-set error (the loss function) to prevent overfitting. After a certain number of training epochs, the training-set error is compared with the test-set error; when the test-set error is essentially unchanged and the decrease in the training-set error has weakened, the parameters of the model's encoding and decoding layers have been optimized to their best values, so training is stopped and the model is saved.
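The stopping criterion above — test error plateauing while the training error's decrease weakens — can be sketched as a small helper. The window size and tolerance are assumptions for illustration; the patent does not fix exact values.

```python
# Toy sketch of the early-stopping rule described above. The window and
# tolerance are illustrative assumptions, not values from the patent.

def should_stop(train_errors, test_errors, window=3, tol=1e-3):
    """Stop when test error is essentially flat and train error barely decreases."""
    if len(test_errors) < window + 1:
        return False
    test_flat = abs(test_errors[-1] - test_errors[-1 - window]) < tol
    train_slowing = (train_errors[-1 - window] - train_errors[-1]) < tol
    return test_flat and train_slowing

# Synthetic error histories: both curves have converged by the last epochs.
train_hist = [1.0, 0.5, 0.30, 0.299, 0.2989, 0.2988, 0.2988]
test_hist  = [1.1, 0.6, 0.40, 0.400, 0.4001, 0.4000, 0.4000]
stop = should_stop(train_hist, test_hist)
```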
  • the drug molecule screening method based on conditional variational autoencoder of the present invention comprises the following steps:
  • the structured drug molecule data is obtained through data collection; after analysis and processing, the SMILES encoding vector and the related drug physicochemical properties are calculated, and the structured data is formed and stored in the database; the data set is divided into a model training data set and a test data set at a certain ratio.
  • the drug physicochemical properties of drug molecules can be selected from indicators such as molecular mass, LogP, molecular H-bond donor and acceptor counts, and TPSA (molecular topological polar surface area).
  • Conditional variational autoencoder-based AI models were constructed and trained based on the aforementioned drug molecule datasets.
  • the main steps of establishing the model are: the combination of the SMILES encoding vector and the molecular drug physicochemical properties serves as the model's input layer, where the properties express the expectation that the drug molecules generated by the model will perform well on the above property indicators.
  • the selected drug physicochemical property data is abstracted as the conditional vector of the input layer, which will be directly introduced into the calculation in both the encoding layer and the decoding layer.
  • the input-layer data is converted into the hidden-layer encoding vector by the encoder, and possible drug molecular structures are then generated by the decoding layer.
  • the encoding and decoding layers use an LSTM recurrent neural network structure.
  • the model loss function is minimized through the gradient descent algorithm, and the weight parameters of the encoding- and decoding-layer network structures are continuously updated and iterated, improving model training.
  • the invention adopts a drug molecule screening method based on conditional variational autoencoder, which can effectively improve the effectiveness of the generated potential molecules.
  • the target drug physicochemical property data is introduced into the input layer as a conditional vector, and the conditional variational autoencoder model is used for multivariate control.
  • the drug physicochemical properties of drug molecules are input into the encoding layer, and the conditional vectors are then processed and manipulated in the decoding layer.
  • the generated potential drug molecules will perform better on these target drug physicochemical properties, which improves the effectiveness and accuracy of the drug molecules.
  • the data sets are selected from open-source databases such as PubChem, CCDC, etc.
  • pancreatic cancer is selected as an example, and related drug molecule data are collected and extracted from the open-source databases.
  • the SMILES formulas of the drug molecules serve as the initial data set, which then enters the data preprocessing stage to calculate the SMILES encoding vector of each drug molecule: first, all characters appearing in the SMILES formulas are counted, each character is converted into a one-hot vector, and an 'E' character is appended at the end of the string to indicate the end.
  • the SMILES encoding vector of each drug molecule is fixed at 120 dimensions, and the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
  • five relatively representative drug physicochemical properties are selected: molecular mass, LogP, molecular H-bond donor count, molecular H-bond acceptor count, and TPSA (molecular topological polar surface area); these properties are calculated for each drug molecule in the above data set and abstracted into a vector representation according to their numerical values, serving as the conditional vector input of the model.
  • an AI model based on conditional variational autoencoder is trained, wherein the model mainly includes input layer, encoding layer, hidden layer, decoding layer and output layer.
  • the conditional variation vector obtained by combining the molecular fingerprint vector and the drug properties is used as input-layer data and proceeds to the encoding layer.
  • the encoding and decoding layers use a recurrent neural network structure with LSTM units; the number of network layers is 3, and each layer has 512 hidden nodes.
  • the output vector produced by the decoding layer is finally converted into SMILES molecular encoding form.
  • the network structure parameters in the encoding and decoding layers are updated and optimized, and the decoding layer generates samples for reconstruction output, so the effectiveness of the drug molecules generated by the output layer is improved; they will also perform better on the target drug physicochemical properties, thus becoming potential drug molecules that can cure pancreatic cancer.
  • a certain number of potential drug molecules that can cure pancreatic cancer are generated.
  • the model can generate a large number of potential drug molecules; after property analysis, the generated data set can be further screened or ranked, and an appropriate number of highly effective drug molecules can be selected for subsequent drug computational and experimental procedures.
  • the generated drug molecule data set is used to calculate the five drug physicochemical properties proposed above.
  • the ranges of the five drug physicochemical properties are computed statistically.
  • screening mainly refers to filtering out generated molecules whose physicochemical properties clearly fall outside the ranges of the statistical data.
  • ranking refers to comprehensively considering the values of the five physicochemical properties to rank the molecules by drug-forming possibility; a certain number of generated molecules with the highest druggability can then be selected as follow-up research targets.
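The screening and ranking steps above can be sketched as follows. The property ranges reuse the statistical ranges stated earlier in the text (plus assumed ranges for HBD/HBA), and the combined scoring rule is a hypothetical stand-in — the patent says only that the five values are "comprehensively considered."

```python
# Sketch of screening (range filter) and ranking (combined score) of generated
# molecules. The score below is an illustrative assumption, not the patent's rule.

RANGES = {"MW": (0, 500), "LogP": (0, 5), "TPSA": (0, 150),
          "HBD": (0, 5), "HBA": (0, 10)}   # HBD/HBA ranges are assumed

def in_range(props):
    """Screening: keep molecules whose every property lies in its range."""
    return all(RANGES[k][0] <= v <= RANGES[k][1] for k, v in props.items())

def rank_score(props):
    """Hypothetical ranking score: normalized distance of each property from
    its range midpoint; lower is treated as more drug-like here."""
    total = 0.0
    for k, v in props.items():
        lo, hi = RANGES[k]
        total += abs(v - (lo + hi) / 2) / (hi - lo)
    return total

candidates = [
    {"MW": 320, "LogP": 2.1, "TPSA": 70, "HBD": 2, "HBA": 4},
    {"MW": 650, "LogP": 7.0, "TPSA": 200, "HBD": 6, "HBA": 12},  # out of range
]
kept = [c for c in candidates if in_range(c)]
ranked = sorted(kept, key=rank_score)
```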
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Abstract

A drug molecule screening method and system. The method comprises: collecting drug molecule data related to a specific disease, pre-processing the data, and calculating an encoding vector and a drug physicochemical property of the data; constructing and training an AI model based on a conditional variational auto-encoder, combining the encoding vector with the drug physicochemical property of molecules to serve as an input layer of the model, converting an encoding layer of the model into a hidden layer encoding vector, generating a possible drug molecular structure by means of a decoding layer of the model, minimizing, during a model training process, a model loss function by means of a gradient descent algorithm, and continuously updating and iterating weight parameters of neural network structures of the encoding layer and the decoding layer; and according to the trained model based on the conditional variational auto-encoder, generating potential drug molecules for curing the specific disease. According to the drug molecule screening method and system, drug physicochemical property data of compound molecules are also utilized, drug physicochemical properties have good correlation with whether a compound can finally be made into a drug, and the druggability is improved.

Description

Drug molecule screening method and system

Technical Field

The present invention relates to screening methods, and in particular to a drug molecule screening method and system.

Background Art
In the field of drug research and development, the traditional approach is computer-simulated screening followed by drug synthesis. With the rapid development of AI in medicine, researchers have begun applying various AI algorithm models to pharmaceutical R&D in order to shorten the long cycle of new drug development. At present, the target information of many diseases is unknown, which makes finding effective drug molecules in vast compound libraries extremely difficult and costly, while the fast computing power and innovative theoretical foundations of AI bring a new mode of research to the drug molecule screening process. For example, generative adversarial networks, convolutional neural networks, recurrent neural networks, and reinforcement learning have all been tried for drug molecule generation. These AI models can quickly find drug molecules similar to target molecules in large chemical libraries, greatly reducing the molecular search space while generating reasonably effective drug molecules for subsequent screening and experimental work.
Among current techniques that use AI models to generate drug molecules, the most widely used are autoencoder-based models such as VAE and AAE. Models based on the adversarial-network idea can discover potential drug molecules similar to existing ones, but the effectiveness and accuracy of the generated molecules are not high; moreover, the generated molecules tend to be homogeneous with the training-set molecules, weakening diversity. As a result, the generated candidate molecules offer little breakthrough over the existing field, or their low drug-property effectiveness makes it difficult for them to reach the real drug-trial stage. In addition, existing AI models rarely take drug molecular properties into account; the model's input-layer data is relatively one-dimensional, which reduces the effectiveness of the generated molecules.
SUMMARY OF THE INVENTION

Based on this, it is necessary to provide a drug molecule screening method that can improve druggability.

At the same time, a drug molecule screening system that can be optimized to improve druggability is provided.
A drug molecule screening method, comprising:

Preprocessing: collecting drug molecule data related to a specific disease, preprocessing the data, computing its encoding vector and related drug physicochemical properties, and forming structured data stored in a database;

Building and training a model: constructing and training an AI model based on a conditional variational autoencoder, with the combination of the encoding vector and the molecule's drug physicochemical properties as the model's input layer; the model's encoding layer converts this into a hidden-layer encoding vector, and the decoding layer then generates possible drug molecular structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the neural network structures of the encoding and decoding layers;

Generating potential drug molecules: according to the trained conditional variational autoencoder model, generating potential drug molecules for curing the specific disease.
In a preferred embodiment, the encoding vector is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector, and the preprocessing includes: counting all characters appearing in the SMILES strings, converting each character into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a set dimension.

In a preferred embodiment, calculating the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.

In a preferred embodiment, the data of the three indicators (molecular mass, lipid-water partition coefficient, and molecular topological polar surface area) are normalized and uniformly mapped to the range -1.0 to 1.0, and the 5 drug physicochemical properties of each molecule form a 5-dimensional vector.

In a preferred embodiment, the SMILES data and the drug physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a 4:1 ratio; each SMILES entry is processed into a 120-dimensional encoding vector and concatenated with the 5 vectors representing different drug physicochemical properties to form a 125-dimensional vector, which serves as the model's input layer.
In a preferred embodiment, the AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. The encoding layer takes the output of the input layer and outputs to the hidden layer; it is an RNN structure with 3 recurrent neural network layers using LSTM units, each with 512 hidden nodes. The decoding layer takes the output of the hidden layer and outputs to the output layer; it is likewise an RNN structure with 3 recurrent layers of LSTM units, each with 512 hidden nodes. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function
L = -Σ_{i=1}^{K} y_i · log(p_i)
where K is the number of categories, y is the label, and p is the network output, i.e., the probability that the category is i. Through the softmax layer, the probability distribution of each character category at each position of the SMILES encoding vector is estimated; finally, using the direct correspondence between one-hot values and specific encoded characters established in data preprocessing, the output sample is reconstructed and the SMILES formula is output.
In a preferred embodiment, the input layer passes through the encoding layer to generate the hidden layer, forming the encoder; the hidden layer passes through the decoding layer to generate the output layer, forming the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector. The loss function is E[logP(X|z,c)] - D_KL[Q(z|X,c)||P(z|c)]

The loss function includes two parts: the first part is the log-likelihood of P(X) under the probability distribution P(X'|z,c), characterizing the distance between the output of the encoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X,c) and its reference probability distribution N(0,1).
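The KL-divergence term above has a well-known closed form when Q(z|X,c) is a diagonal Gaussian N(μ, σ²) and the reference is N(0, 1): D_KL = -0.5 · Σ(1 + log σ² - μ² - σ²). This closed form is a standard VAE result, not stated explicitly in the patent; the sketch below illustrates it.

```python
import math

# Closed-form KL divergence D_KL(N(mu, sigma^2) || N(0, 1)) for a diagonal
# Gaussian posterior, as used in standard (C)VAE training. Standard result,
# shown here for illustration; the patent leaves the derivation implicit.

def kl_to_standard_normal(mu, log_var):
    """-0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2) over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

kl = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # Q == N(0,1), so KL = 0
```

The term vanishes exactly when the posterior equals the prior and grows as μ moves from 0 or σ² from 1, which is the regularizing pressure described in the text.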
In a preferred embodiment, the model is built and trained with TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to compute the test-set error (the loss function) to prevent overfitting. After a certain number of training epochs, the training-set error is compared with the test-set error; when the test-set error is essentially unchanged and the decrease in the training-set error has weakened, the parameters of the model's encoding and decoding layers have been optimized to their best values, so training is stopped and the model is saved.
A drug molecule screening system, comprising:

a preprocessing module: collecting drug molecule data related to a specific disease, preprocessing the data, computing its encoding vector and related drug physicochemical properties, and forming structured data stored in a database;

a model building and training module: constructing and training an AI model based on a conditional variational autoencoder, with the combination of the encoding vector and the molecule's drug physicochemical properties as the model's input layer; the encoding layer converts this into a hidden-layer encoding vector, and the decoding layer then generates possible drug molecular structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the neural network structures of the encoding and decoding layers for better training;

a potential drug molecule generation module: according to the trained conditional variational autoencoder model, generating potential drug molecules for curing the specific disease.

In a preferred embodiment, calculating the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
The above drug molecule screening method and system also make use of the drug physicochemical property data of compound molecules, because these properties correlate strongly with whether a compound can ultimately become a drug; compounds with poor physicochemical properties, or properties outside the required ranges, have an extremely low probability of druggability. Therefore, the drug physicochemical properties of the molecules are incorporated into the model's input-layer data, and the physicochemical properties of the molecules obtained after model training are controlled within reasonable ranges, improving their druggability. This effectively improves the effectiveness and accuracy of the drug molecules produced by the model; by adjusting the numerical ranges of specific physicochemical property indicators, the generated molecules can also be made more diverse.

In addition, the SMILES encoding vector and the molecular drug physicochemical properties are combined as the model's input layer, where the properties express the expectation that the molecules generated by the model will perform well on the above property indicators. The selected property data are abstracted as the conditional vector of the input layer, which is directly introduced into the computation in both the encoding and decoding layers. The input-layer data is converted into the hidden-layer encoding vector by the encoder, and possible drug molecular structures are then generated by the decoding layer. The encoding and decoding layers use LSTM recurrent neural network structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the encoding- and decoding-layer networks for better training.

Furthermore, this drug molecule screening method based on a conditional variational autoencoder can effectively improve the effectiveness of the generated candidate molecules. Target drug physicochemical property data are introduced into the input layer as a conditional vector, and the conditional variational autoencoder model is used for multivariate control: besides exploiting the latent space, the drug physicochemical properties are fed into the encoding layer, and the conditional vector is then processed and manipulated in the decoding layer, so the generated candidate molecules perform better on these target properties, improving the effectiveness and accuracy of the drug molecules.
Description of Drawings

FIG. 1 is a flowchart of a drug molecule screening method according to an embodiment of the present invention.

Detailed Description

As shown in FIG. 1, a drug molecule screening method according to an embodiment of the present invention includes:
Step S101, preprocessing: collecting drug molecule data related to a specific disease, preprocessing the data, computing its encoding vector and related drug physicochemical properties, and forming structured data stored in a database;

Step S103, building and training a model: constructing and training an AI model based on a conditional variational autoencoder, with the combination of the encoding vector and the molecule's drug physicochemical properties as the model's input layer; the encoding layer converts this into a hidden-layer encoding vector, and the decoding layer then generates possible drug molecular structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the encoding- and decoding-layer networks for better training;

Step S105, generating potential drug molecules: according to the trained conditional variational autoencoder model, generating potential drug molecules for curing the specific disease.
Further, the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector. Preprocessing includes: counting all characters in the SMILES strings, converting each character into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a set dimension.

The specific preprocessing process includes computing the SMILES encoding vector of each drug molecule. First, all characters appearing in the SMILES strings are counted, each character is converted into a one-hot vector, and an 'E' character is appended at the end of the string to mark termination. The SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is less than this value, the remaining positions are filled with the one-hot value of 'E', so the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.

SMILES (Simplified Molecular Input Line Entry Specification) is a specification that unambiguously describes molecular structure with ASCII strings.

The collected drug molecule data set consists of molecular structural formula data of drug compounds, generally SMI files, with each drug molecule represented in SMILES format; for example, the SMILES formula of aspirin is CC(=O)OC1=CC=CC=C1C(=O)O.
Calculating the physicochemical properties of the drug molecules includes one or more of: calculating the molecular weight, the lipid-water partition coefficient, the number of H-bond donors, the number of H-bond acceptors, and the topological polar surface area.
Specifically, the physicochemical properties of the drug molecules are calculated with methods provided by the RDKit software library, mainly including:
Molecular weight (MW): rdkit.Chem.Descriptors.ExactMolWt();
Lipid-water partition coefficient (LogP): rdkit.Chem.Crippen.MolLogP();
Number of H-bond donors (HBD): rdkit.Chem.rdMolDescriptors.CalcNumHBD();
Number of H-bond acceptors (HBA): rdkit.Chem.rdMolDescriptors.CalcNumHBA();
Topological polar surface area (TPSA): rdkit.Chem.rdMolDescriptors.CalcTPSA().
Further, the data of the molecular weight, lipid-water partition coefficient, and topological polar surface area indicators are normalized and mapped uniformly into the range -1.0 to 1.0, and the 5 physicochemical properties of each drug molecule form a 5-dimensional vector.
Further, the SMILES data and the physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1. Each SMILES entry is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing the different physicochemical properties to form a 125-dimensional vector that serves as the input layer of the model.
Specifically, statistical analysis of the computed physicochemical properties shows that, for all molecules in the data set, MW ranges from 0 to 500, LogP from 0 to 5, and TPSA from 0 to 150. The data of the MW, LogP, and TPSA indicators are therefore normalized and mapped uniformly into the range -1.0 to 1.0. The numbers of H-bond donors and H-bond acceptors are integers and are used directly as their numerical values. Through this logic, the 5 physicochemical properties of each drug molecule are abstracted into a 5-dimensional vector.
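The normalization above can be sketched as follows. The text does not give the exact mapping formula, so a linear min-max mapping into [-1.0, 1.0] is assumed, using the stated value ranges:

```python
# Normalize MW, LogP and TPSA into [-1.0, 1.0] (assumed linear min-max mapping)
# and combine them with the integer HBD/HBA counts into a 5-dimensional vector.

RANGES = {"MW": (0.0, 500.0), "LogP": (0.0, 5.0), "TPSA": (0.0, 150.0)}

def to_unit_range(value, lo, hi):
    # Map value from [lo, hi] linearly into [-1.0, 1.0].
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def property_vector(mw, logp, hbd, hba, tpsa):
    return [
        to_unit_range(mw, *RANGES["MW"]),
        to_unit_range(logp, *RANGES["LogP"]),
        float(hbd),  # donor/acceptor counts are integers, used directly
        float(hba),
        to_unit_range(tpsa, *RANGES["TPSA"]),
    ]

# Aspirin with approximate property values: MW ~180.04, LogP ~1.31,
# HBD = 1, HBA = 3, TPSA ~63.6.
vec = property_vector(180.04, 1.31, 1, 3, 63.6)
print(len(vec))  # 5
```

The range endpoints map to -1.0 and 1.0 respectively, so all three normalized indicators stay inside the stated interval.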
The molecular SMILES data and the 5 kinds of physicochemical property data above together form the drug molecule data set. The total data set is divided into a training data set and a test data set at a ratio of 4:1; the division is random, i.e. the training data set accounts for 80% and the test data for 20%.
After data preprocessing, each drug molecule sample in the drug data set has one SMILES-format encoding vector and 5 values representing the different physicochemical properties (i.e. the condition vector of this method). For each drug molecule sample, the two data vectors are directly concatenated to form one 125-dimensional vector; all drug molecules in the drug data set are represented in this form, which serves as the input-layer data of the model.
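The random 4:1 split and the vector concatenation described above can be sketched as follows (shapes simplified; each sample here is already an encoded pair of vectors, and the seed is an illustrative choice):

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    # Randomly divide the total data set into training (80%) and test (20%) parts.
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def input_vector(smiles_vec, condition_vec):
    # Concatenate the 120-dim SMILES encoding with the 5-dim condition vector
    # to obtain the 125-dim input-layer vector.
    return smiles_vec + condition_vec

samples = [([0.0] * 120, [0.1] * 5) for _ in range(100)]
train, test = split_dataset(samples)
x = input_vector(*samples[0])
print(len(train), len(test), len(x))  # 80 20 125
```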
Compared with the VAE and AAE models, the conditional variational autoencoder (CVAE) model proposed in this application also exploits the physicochemical property data of the compound molecules, because the physicochemical properties of a compound are strongly correlated with whether it can ultimately become a drug: compounds with poor physicochemical properties, or whose properties fall outside the required ranges, generally have an extremely low probability of druggability. The physicochemical properties of the molecules are therefore included in the input-layer data of the model, and the model is trained so that the physicochemical properties of the molecules it produces remain within reasonable ranges, improving their druggability. This effectively improves the validity and accuracy of the drug molecules generated by the model; by adjusting the target value ranges of specific physicochemical property indicators, the generated molecules can also be made more diverse.
For this reason, the method computes the physicochemical properties of the drug molecule data set in advance, mainly selecting 5 properties: molecular weight (MW), lipid-water partition coefficient (LogP), number of H-bond donors (HBD), number of H-bond acceptors (HBA), and topological polar surface area (TPSA). These 5 physicochemical properties were chosen mainly for the following reasons:
1. Molecular weight is the most basic descriptor of a molecule.
2. The lipid-water partition coefficient LogP is the logarithm of the ratio of the partition coefficients of the compound in n-octanol (oil) and in water, reflecting how the substance partitions between the oil and water phases. The larger the LogP value, the more lipophilic the substance; conversely, the smaller the value, the more hydrophilic the substance, i.e. the better its water solubility. Water solubility is a key indicator of druggability.
3. The number of H-bond donors (HBD) and the number of H-bond acceptors (HBA) represent the numbers of donors and acceptors available for hydrogen bonding between molecules, and are also commonly used as basic indicators for evaluating druggability.
4. The topological polar surface area (TPSA) is another parameter commonly used in medicinal chemistry. It is defined as the total surface area of the polar atoms in a compound, mostly oxygen and nitrogen atoms, including the hydrogen atoms attached to them. In medicinal chemistry applications, polar surface area is a descriptor for evaluating the intracellular transport properties of a drug. For a compound with good druggability, its topological polar surface area should lie within a certain range of values.
The method therefore also takes the physicochemical property data of the drug molecules as auxiliary condition data forming part of the model's input-layer data. The benefit is that the physicochemical property data in the data set is fully exploited, so that after training the physicochemical properties of the drug molecules generated by the model also fall within reasonable ranges, improving the druggability of the generated molecules.
The AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In this embodiment, the encoding layer takes the output data of the input layer as its input and outputs to the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. The decoding layer takes the output data of the hidden layer as its input and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function

H(y, p) = -Σ_{i=1}^{K} y_i · log(p_i)

where K is the number of categories, y is the label, and p is the output of the network, i.e. the probability that the category is i. Through the softmax layer, the probability distribution over character categories at each position of the SMILES encoding vector is estimated; finally, through the direct correspondence between one-hot values and specific characters established in data preprocessing, the output sample is reconstructed and the SMILES string is output.
The input-layer data includes: the SMILES data of the drug molecule set and the 5 physicochemical property data of the drug molecule set. Specifically, the input layer consists of the preprocessed vector data of the drug molecule data set (the SMILES encoding vector plus the condition vector composed of the physicochemical properties), i.e. the input layer is X1, X2, ..., Xn, where each Xn is represented by one such set of vector data.
Encoding layer: the input is the input layer and the output is the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM (Long Short-Term Memory) units, each of which contains an input gate, a forget gate, and an output gate, with 512 hidden nodes per layer.
Hidden layer: the hidden-layer vector dimension is set to 200.
Decoding layer: the input is the hidden layer and the output is the output layer. The decoding layer is an RNN structure containing 3 recurrent neural network layers, also using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function:

H(y, p) = -Σ_{i=1}^{K} y_i · log(p_i)

where K is the number of categories, y is the label, and p is the output of the network, i.e. the probability that the category is i.

Through the softmax layer, the probability distribution over character categories at each position of the SMILES encoding vector is estimated; finally, through the direct correspondence between one-hot values and specific characters established in data preprocessing, the output sample is reconstructed and the SMILES string is output.
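The softmax and cross-entropy computation at one character position can be sketched as:

```python
import math

def softmax(logits):
    # Convert raw decoder outputs into a probability distribution over K classes.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, probs):
    # H(y, p) = -sum_i y_i * log(p_i); for a one-hot label this is -log(p_true).
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)

K = 4
probs = softmax([1.0, 1.0, 1.0, 1.0])      # uniform logits give a uniform distribution
loss = cross_entropy([0, 1, 0, 0], probs)  # equals log(K) for a uniform distribution
print(round(loss, 6))  # 1.386294
```

In the model this loss is summed over the 120 character positions of the SMILES encoding vector, one K-way classification per position.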
Output layer: the SMILES formula.
The input layer passing through the encoding layer to produce the hidden layer forms the encoder; the hidden layer passing through the decoding layer to produce the output layer forms the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector.
The loss function of this embodiment is

E[log P(X|z, c)] - D_KL[Q(z|X, c) || P(z|c)]

This loss function comprises two parts. The first part is the log-likelihood of X under the probability distribution P(X'|z, c), characterizing the distance between the reconstructed output of the decoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
The AI model based on a conditional variational autoencoder in this embodiment is the CVAE model, which consists of two modules, an encoder and a decoder. The input layer passing through the encoding layer to produce the hidden layer is the encoder; the hidden layer passing through the decoding layer to produce the output layer is the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector Z, while the decoder restores the latent vector to an output as close to the input as possible. During training, the training samples X and the control conditions are input, and the parameters of the combined encoder-decoder network are optimized by gradient descent.
First, the encoder maps a training sample X into two sets of parameters, which determine a conditional probability distribution Q(z|X, c) of the output z given the input X and the condition vector c; assuming this is a normal distribution, the two sets of parameters are its mean and variance. A latent vector z obeying this distribution is sampled from it, and the decoder maps z to a new set of parameters. During this mapping, the condition vector c, which is associated with sample X, is also used in the decoding layer, determining a conditional probability distribution P(X'|z, c) based on z and c, so that the probability distribution of the decoding layer is constrained by both the latent vector and the condition vector. To make the distance between the generated X' and the input X as small as possible while preserving the generative ability of the model, Q(z|X, c) should be kept as close as possible to the standard normal distribution N(0, 1). The model minimizes the loss function value with a stochastic gradient descent algorithm; this process updates and optimizes the network structure parameters in the encoding layer and the decoding layer.
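The sampling step described above is typically implemented with the reparameterization trick (not spelled out in the text; assumed here), which keeps the sampling differentiable with respect to the encoder outputs:

```python
import math
import random

def sample_latent(mu, log_var, rng):
    # z = mu + sigma * eps, with eps ~ N(0, 1): draws a latent vector obeying the
    # encoder's Gaussian Q(z|X, c) while keeping the path from mu and log_var to z
    # differentiable (the reparameterization trick).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
z = sample_latent([0.0] * 200, [0.0] * 200, rng)  # 200-dim latent, as in the hidden layer
print(len(z))  # 200
```

With a very small variance the sample collapses onto the mean, which is one way to check the implementation.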
The loss function (the loss function is the cost function) Loss of the CVAE model is:

E[log P(X|z, c)] - D_KL[Q(z|X, c) || P(z|c)]

This loss function comprises two parts. The first part is the log-likelihood of X under the probability distribution P(X'|z, c), characterizing the distance between the reconstructed output of the decoding layer and the input sample X. The second part is the Kullback-Leibler (KL) divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
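When Q(z|X, c) is a diagonal Gaussian N(μ, σ²) and the reference distribution is the standard normal N(0, 1), the KL term above has a well-known closed form; a minimal sketch under that assumption:

```python
import math

def kl_to_standard_normal(mu, log_var):
    # Closed form of D_KL[N(mu, diag(sigma^2)) || N(0, I)], summed over dimensions:
    # -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# The KL term vanishes when Q equals the standard normal reference...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ...and grows as the encoder's distribution drifts away from N(0, 1).
print(kl_to_standard_normal([1.0, 1.0], [0.0, 0.0]))  # 1.0
```

Minimizing this term is what pulls Q(z|X, c) toward N(0, 1), as the text requires for preserving the generative ability of the model.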
Further, in this embodiment, the model is built and trained with TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to compute the test-set error, i.e. the loss function, to prevent overfitting. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease of the training-set error has levelled off, the encoding-layer and decoding-layer parameters of the model have been optimized to their best values, so training is stopped and the model is saved.
Specifically, the model construction and training process is as follows. According to the model structure above, the model is built and trained with TensorFlow; during training, the training data set is used for model training while the test data set is used to compute the test-set error (loss function) to prevent overfitting. At the start of training both errors are large and drop rapidly. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease of the training-set error has levelled off, the model can be considered to have high accuracy on both the training data set and the test data set, and the encoding-layer and decoding-layer parameters have been optimized to their best values. Training is therefore stopped and the model is saved, yielding the final trained CVAE model.
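The stopping criterion described above (test error flat, training error no longer decreasing meaningfully) can be sketched as a simple check over recorded per-epoch losses; the window and tolerance values are illustrative assumptions:

```python
def should_stop(train_losses, test_losses, window=3, tol=1e-3):
    # Stop when, over the last `window` epochs, the test loss is essentially
    # unchanged and the training loss is no longer decreasing meaningfully.
    if len(train_losses) < window + 1 or len(test_losses) < window + 1:
        return False
    test_flat = abs(test_losses[-1] - test_losses[-1 - window]) < tol
    train_flat = (train_losses[-1 - window] - train_losses[-1]) < tol
    return test_flat and train_flat

# Early in training both errors are still dropping quickly: keep going.
print(should_stop([2.0, 1.0, 0.5, 0.3], [2.1, 1.1, 0.6, 0.4]))  # False
# Later both curves have levelled off: stop and save the model.
print(should_stop([0.30, 0.30, 0.30, 0.30], [0.40, 0.40, 0.40, 0.40]))  # True
```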
In the drug molecule screening method of this embodiment, the 5 selected physicochemical properties of the drug molecules are first computed; each of these properties is directly correlated with the druggability of the drug molecules. The computed physicochemical property values are then abstracted into a vector that serves as the condition vector.
What makes the conditional variational autoencoder model distinctive is the additional condition vector input. In this method the condition vector contains 5 physicochemical properties rather than a single property, so multivariate control is possible; the specific control process is realized by the model structure and logic.
The latent (hidden-layer) space is the core of the autoencoder approach. Its advantage is that the encoder converts the high-dimensional input into a low-dimensional hidden-layer vector, and the decoder then restores the hidden-layer vector into an output as close as possible to the input data, thereby achieving the goal.
The physicochemical properties of the drug molecules are abstracted into the condition vector, which enters the encoder as part of the model's input-layer data.
The input of the decoder is the hidden-layer vector, which contains the condition vector, i.e. the physicochemical property information; the condition vector data therefore also enters the network computation in the decoder.
A drug molecule screening system, comprising:
a preprocessing module, which collects drug molecule data related to a specific disease, preprocesses the data, computes its encoding vectors and the related physicochemical properties, and forms structured data to be stored in a database;
a model construction and training module, which builds and trains an AI model based on a conditional variational autoencoder, taking the combination of the encoding vector and the physicochemical properties of the molecule as the input layer of the model, converting it into a hidden-layer encoding vector through the encoding layer of the model, and then generating possible drug molecular structures through the decoding layer of the model; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and the decoding layer are iteratively updated so that the model trains better;
a potential drug molecule generation module, which, based on the trained conditional variational autoencoder model, generates potential drug molecules for curing the specific disease.
Further, in this embodiment, calculating the physicochemical properties of the drug molecules includes one or more of: calculating the molecular weight, the lipid-water partition coefficient, the number of H-bond donors, the number of H-bond acceptors, and the topological polar surface area.
Further, the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector.
Further, the preprocessing module of this embodiment enumerates all characters that appear in the SMILES strings, converts each character in the SMILES strings into a one-hot vector, and processes the SMILES data of each drug molecule into an encoding vector of a set dimension.
Specifically, the preprocessing module computes the SMILES encoding vector of each drug molecule. First, all characters appearing in the SMILES strings are enumerated, each character in a SMILES string is converted into a one-hot vector, and an 'E' character is appended to the end of the string to mark its termination. The SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is smaller than this value, the remaining positions are padded with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
Calculating the physicochemical properties of the drug molecules includes one or more of: calculating the molecular weight, the lipid-water partition coefficient, the number of H-bond donors, the number of H-bond acceptors, and the topological polar surface area.
Further, the data of the molecular weight, lipid-water partition coefficient, and topological polar surface area indicators are normalized and mapped uniformly into the range -1.0 to 1.0, and the 5 physicochemical properties of each drug molecule form a 5-dimensional vector.
Further, the SMILES data and the physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1. Each SMILES entry is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing the different physicochemical properties to form a 125-dimensional vector that serves as the input layer of the model.
Specifically, statistical analysis of the computed physicochemical properties shows that, for all molecules in the data set, MW ranges from 0 to 500, LogP from 0 to 5, and TPSA from 0 to 150. The data of the MW, LogP, and TPSA indicators are therefore normalized and mapped uniformly into the range -1.0 to 1.0. The numbers of H-bond donors and H-bond acceptors are integers and are used directly as their numerical values. Through this logic, the 5 physicochemical properties of each drug molecule are abstracted into a 5-dimensional vector.
The molecular SMILES data and the 5 kinds of physicochemical property data above together form the drug molecule data set. The total data set is divided into a training data set and a test data set at a ratio of 4:1; the division is random, i.e. the training data set accounts for 80% and the test data for 20%.
After data preprocessing, each drug molecule sample in the drug data set has one SMILES-format encoding vector and 5 values representing the different physicochemical properties (i.e. the condition vector of this method). For each drug molecule sample, the two data vectors are directly concatenated to form one 125-dimensional vector; all drug molecules in the drug data set are represented in this form, which serves as the input-layer data of the model.
The AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In this embodiment, the encoding layer takes the output data of the input layer as its input and outputs to the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. The decoding layer takes the output data of the hidden layer as its input and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function

H(y, p) = -Σ_{i=1}^{K} y_i · log(p_i)

where K is the number of categories, y is the label, and p is the output of the network, i.e. the probability that the category is i. Through the softmax layer, the probability distribution over character categories at each position of the SMILES encoding vector is estimated; finally, through the direct correspondence between one-hot values and specific characters established in data preprocessing, the output sample is reconstructed and the SMILES string is output.
The input layer passing through the encoding layer to produce the hidden layer forms the encoder; the hidden layer passing through the decoding layer to produce the output layer forms the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector.
The loss function of this embodiment is E[log P(X|z, c)] - D_KL[Q(z|X, c) || P(z|c)]. This loss function comprises two parts. The first part is the log-likelihood of X under the probability distribution P(X'|z, c), characterizing the distance between the reconstructed output of the decoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
Further, in this embodiment, the model is built and trained with TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to compute the test-set error, i.e. the loss function, to prevent overfitting. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease of the training-set error has levelled off, the encoding-layer and decoding-layer parameters of the model have been optimized to their best values, so training is stopped and the model is saved.
The drug molecule screening method based on a conditional variational autoencoder of the present invention comprises the following steps.
Collecting a drug molecule data set related to a specific disease and performing data preprocessing. Structured data of the drug molecules is obtained through data collection; after data analysis and processing, the SMILES encoding vectors and the related physicochemical properties are computed, and the resulting structured data is stored in a database. The data set is divided into a model training data set and a test data set at a certain ratio. The physicochemical properties of the drug molecules may be chosen from indicator properties such as molecular weight, LogP, the numbers of molecular H-bond donors and acceptors, and TPSA (topological polar surface area).
Building and training an AI model based on a conditional variational autoencoder from the drug molecule data set above. The main steps of building the model are: taking the combination of the SMILES encoding vector and the physicochemical properties of the molecule as the input layer of the model, where the physicochemical properties are those in which the drug molecules generated by the model are expected to perform well; and abstracting the selected physicochemical property data as the condition vector of the input layer, which is introduced directly into the computation in both the encoding layer and the decoding layer. The input-layer data is converted into the hidden-layer encoding vector through the encoder, and possible drug molecular structures are then generated through the decoding layer. The encoding layer and the decoding layer use LSTM recurrent neural network structures; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and the decoding layer are iteratively updated so that the model trains better.
Using the trained conditional variational autoencoder model, potential drug molecules capable of curing the specific disease are generated and passed to the subsequent computational and experimental drug-discovery workflows.
The invention adopts a drug molecule screening method based on a conditional variational autoencoder, which effectively improves the validity of the generated candidate molecules. Target drug physicochemical property data are introduced into the input layer as a conditional vector, and the conditional variational autoencoder is used for multivariate control: in addition to exploiting the latent space, the physicochemical properties of the drug molecules are fed into the encoding layer, and the conditional vector is then processed and manipulated in the decoding layer. The generated candidate molecules therefore perform better on these target physicochemical properties, improving the validity and accuracy of the generated drug molecules.
A specific embodiment of the conditional-variational-autoencoder-based drug molecule screening method of the present invention is as follows:
A dataset of drug molecules related to a specific disease is collected; the dataset is drawn from open-source databases such as PubChem and CCDC. This embodiment takes pancreatic cancer as an example: the SMILES formulas of drug molecules that currently have a therapeutic effect on, or potential relevance to, pancreatic cancer are extracted from the open-source databases to form the initial dataset, which then enters the data preprocessing stage, where the SMILES encoding vector of each drug molecule is computed. First, all characters appearing in the SMILES strings are collected, each character in a SMILES string is converted into a one-hot vector, and an 'E' character is appended to the end of each string to mark its end. The SMILES encoding vector of each drug molecule is fixed at 120 positions; when a string is shorter than this, the remaining positions are padded with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector. At the same time, five relatively representative physicochemical properties are selected for each drug molecule: molecular mass, LogP, number of H-bond donors, number of H-bond acceptors, and TPSA (topological polar surface area). These properties are computed for every molecule in the dataset and, according to their numerical representation, abstracted into a vector form that serves as the conditional vector input of the model.
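The one-hot encoding with 'E' padding described above can be sketched in plain Python: each molecule becomes 120 positions, each position a one-hot vector over the collected character set (the toy SMILES strings are illustrative):

```python
def encode_smiles(smiles_list, max_len=120):
    """One-hot encode SMILES strings, terminated and padded with 'E'."""
    # Build the character set from the data, plus the 'E' terminator.
    charset = sorted(set("".join(smiles_list)) | {"E"})
    index = {ch: i for i, ch in enumerate(charset)}

    def one_hot(ch):
        v = [0] * len(charset)
        v[index[ch]] = 1
        return v

    encoded = []
    for s in smiles_list:
        padded = (s + "E").ljust(max_len, "E")[:max_len]
        encoded.append([one_hot(ch) for ch in padded])
    return encoded, charset

vectors, charset = encode_smiles(["CCO", "c1ccccc1"])
print(len(vectors[0]), len(vectors[0][0]))  # 120 5
```

In practice the character set is fixed once over the whole dataset so that train-time and generation-time encodings agree.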
Based on the above drug molecule dataset, an AI model based on a conditional variational autoencoder is trained; the model mainly comprises an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In the input layer, the molecular fingerprint vector and the conditional vector obtained by combining the drug properties together serve as the input data and proceed to the encoding layer. The encoding and decoding layers use a recurrent neural network structure with LSTM units, with 3 network layers of 512 hidden nodes each. Each decoding-layer unit is followed by a softmax layer, and the cost function is the cross-entropy function. The output vector produced by the decoding layer is finally converted back into SMILES form. Through iterative training, the network parameters of the encoding and decoding layers are updated and optimized, and the decoding layer reconstructs generated samples as output, so that the validity of the drug molecules produced by the output layer is improved; they also perform better on the selected target physicochemical properties, making them potential drug molecules for curing pancreatic cancer.
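On the output side, the decoder's per-position softmax distributions are mapped back to characters and the 'E' padding is stripped. A NumPy sketch with a toy three-character vocabulary (the real logits and character set come from the trained network):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decode(logits, charset):
    """Map per-position softmax distributions to a SMILES string."""
    probs = softmax(logits)                        # (T, V) probabilities
    chars = [charset[i] for i in probs.argmax(axis=-1)]
    return "".join(chars).split("E", 1)[0]         # stop at the 'E' terminator

charset = ["C", "O", "E"]                          # toy vocabulary
logits = np.array([[5.0, 0.0, 0.0],                # 'C'
                   [5.0, 0.0, 0.0],                # 'C'
                   [0.0, 5.0, 0.0],                # 'O'
                   [0.0, 0.0, 5.0]])               # 'E' terminator
print(decode(logits, charset))  # CCO
```

Greedy argmax decoding is the simplest choice; sampling from the softmax distribution instead would yield more diverse generated molecules at the cost of fidelity.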
Using the trained conditional variational autoencoder model, and according to the subsequent research approach and goals, a certain number of potential drug molecules capable of curing pancreatic cancer are generated. The model can generate a large number of candidate molecules; after property analysis, the generated set can be further screened or ranked so that an appropriate number of highly valid drug molecules is selected for the subsequent computational and experimental workflows.
The five physicochemical properties proposed above are computed for the generated drug molecule dataset, and the ranges of these five properties are compiled from the statistics of existing drug molecules known to treat pancreatic cancer. Screening mainly means filtering out generated molecules whose physicochemical properties clearly fall outside the statistical ranges; ranking means considering the values of the five properties jointly to order the molecules by their likelihood of being druggable. According to the follow-up research goals, a certain number of the top-ranked generated molecules can be selected for further study.
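This filter-then-rank step can be sketched as follows; the property ranges and the composite score are hypothetical placeholders (real ranges come from statistics over known pancreatic-cancer drugs, and the scoring rule is a design choice):

```python
# Hypothetical acceptable range per property (from known-drug statistics).
RANGES = {"mw": (150.0, 600.0), "logp": (-1.0, 5.0),
          "hbd": (0, 5), "hba": (0, 10), "tpsa": (20.0, 140.0)}

def in_range(mol):
    """Keep only molecules whose five properties all fall inside the ranges."""
    return all(RANGES[k][0] <= mol[k] <= RANGES[k][1] for k in RANGES)

def score(mol):
    """Toy composite score: normalized distance from each range midpoint
    (lower is better)."""
    return sum(abs(mol[k] - (lo + hi) / 2) / (hi - lo)
               for k, (lo, hi) in RANGES.items())

candidates = [
    {"id": "g1", "mw": 320.0, "logp": 2.1, "hbd": 2, "hba": 5, "tpsa": 80.0},
    {"id": "g2", "mw": 900.0, "logp": 2.0, "hbd": 1, "hba": 4, "tpsa": 70.0},
    {"id": "g3", "mw": 400.0, "logp": 4.5, "hbd": 4, "hba": 9, "tpsa": 130.0},
]
kept = sorted((m for m in candidates if in_range(m)), key=score)
print([m["id"] for m in kept])  # ['g1', 'g3']  (g2 filtered: mw out of range)
```

Any monotone combination of the five properties could replace `score`; the point is only that filtering removes clear outliers while ranking orders the survivors.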
Taking the above ideal embodiments of the present application as inspiration, and guided by the above description, those skilled in the art can make various changes and modifications without departing from the technical idea of the present application. The technical scope of the present application is not limited to the content of the description; it must be determined according to the scope of the claims.
As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims (10)

  1. A drug molecule screening method, characterized in that it comprises:
    preprocessing: collecting molecular data of drugs related to a specific disease, preprocessing the data, computing the encoding vectors and related drug physicochemical properties, and forming structured data stored in a database;
    building and training a model: constructing and training an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of the molecule as the input layer of the model, converting it through the model's encoding layer into a hidden-layer encoding vector, and then generating possible drug molecular structures through the model's decoding layer; during model training, minimizing the model loss function by a gradient descent algorithm and iteratively updating the weight parameters of the neural network structures of the encoding layer and the decoding layer;
    generating potential drug molecules: generating, from the trained conditional variational autoencoder model, potential drug molecules for curing the specific disease.
  2. The drug molecule screening method according to claim 1, characterized in that the encoding vector is a SMILES encoding vector, and the preprocessing comprises: collecting all characters appearing in the SMILES formulas, converting each character in a SMILES formula into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a set dimension.
  3. The drug molecule screening method according to claim 1, characterized in that computing the drug physicochemical properties of a drug molecule comprises one or more of: computing the molecular mass, computing the lipid-water partition coefficient, computing the number of molecular H-bond donors, computing the number of molecular H-bond acceptors, and computing the molecular topological polar surface area.
  4. The drug molecule screening method according to claim 3, characterized in that the data of the three indicators of molecular mass, lipid-water partition coefficient, and molecular topological polar surface area are normalized and uniformly mapped into the range −1.0 to 1.0, and the five drug physicochemical properties of each drug molecule are formed into a 5-dimensional vector.
  5. The drug molecule screening method according to claim 2, characterized in that the SMILES data and the drug physicochemical property data together form a total drug molecule dataset, which is randomly divided into a training dataset and a test dataset at a ratio of 4:1; each SMILES datum is processed into a 120-dimensional encoding vector and concatenated with the 5 vectors representing different drug physicochemical properties to form a 125-dimensional vector, which serves as the input layer of the model.
  6. The drug molecule screening method according to any one of claims 1 to 5, characterized in that the AI model structure comprises: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer; the encoding layer receives the output data of the input layer and outputs to the hidden layer, the encoding layer being an RNN structure comprising 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer; the decoding layer receives the output data of the hidden layer and outputs to the output layer, the decoding layer being an RNN structure comprising 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer; a softmax layer follows the decoding layer, and its cost function is the cross-entropy function

    H(y, p) = −Σ_{i=1}^{K} y_i log(p_i)

    where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i; through the softmax layer, the probability distribution over the character categories at each position of the SMILES encoding vector is estimated, and finally, via the direct correspondence between one-hot values and the specific encoded characters established in data preprocessing, the output sample is reconstructed and the SMILES formula is output.
  7. The drug molecule screening method according to claim 6, characterized in that the input layer passes through the encoding layer to generate the hidden layer, forming an encoder, and the hidden layer passes through the decoding layer to generate the output layer, forming a decoder; the encoder converts the high-dimensional input into a low-dimensional latent vector; the loss function is E[log P(X|z, c)] − D_KL[Q(z|X, c) ‖ P(z|c)], which comprises two parts: the first part is the log-likelihood of X under the probability distribution P(X′|z, c), characterizing the distance between the output of the decoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X, c) and its reference distribution N(0, 1).
  8. The drug molecule screening method according to any one of claims 1 to 5, characterized in that, following the model structure, the model is built and trained using TensorFlow; during training, the training dataset is used for model training while the test dataset is used to compute the test-set error, i.e., the loss function, to prevent overfitting; after a certain number of training epochs, the training-set error and the test-set error are compared, and when the test-set error remains essentially unchanged and the decrease of the training-set error weakens, the encoding-layer and decoding-layer parameters of the model have been optimized to their best values, training is stopped, and the model is saved.
  9. A drug molecule screening system, characterized in that it comprises:
    a preprocessing module, which collects molecular data of drugs related to a specific disease, preprocesses the data, computes the encoding vectors and related drug physicochemical properties, and forms structured data stored in a database;
    a model building and training module, which constructs and trains an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of the molecule as the input layer of the model, converting it through the model's encoding layer into a hidden-layer encoding vector, and then generating possible drug molecular structures through the model's decoding layer; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and the decoding layer are iteratively updated;
    a potential-drug-molecule generation module, which generates, from the trained conditional variational autoencoder model, potential drug molecules for curing the specific disease.
  10. The drug molecule screening system according to claim 9, characterized in that computing the drug physicochemical properties of a drug molecule comprises one or more of: computing the molecular mass, computing the lipid-water partition coefficient, computing the number of molecular H-bond donors, computing the number of molecular H-bond acceptors, and computing the molecular topological polar surface area.
PCT/CN2020/113085 2020-09-02 2020-09-02 Drug molecule screening method and system WO2022047677A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/113085 WO2022047677A1 (en) 2020-09-02 2020-09-02 Drug molecule screening method and system


Publications (1)

Publication Number Publication Date
WO2022047677A1 2022-03-10

Family

ID=80492382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113085 WO2022047677A1 (en) 2020-09-02 2020-09-02 Drug molecule screening method and system

Country Status (1)

Country Link
WO (1) WO2022047677A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN115567582A (en) * 2022-11-09 2023-01-03 山东恒远智能科技有限公司 Intelligent industrial internet data service system and method
CN116130036A (en) * 2023-01-09 2023-05-16 四川大学 Reverse design method of metal organic frame based on graph representation
CN116189809A (en) * 2023-01-06 2023-05-30 东南大学 Drug molecule important node prediction method based on challenge resistance
CN117334271A (en) * 2023-09-25 2024-01-02 江苏运动健康研究院 Method for generating molecules based on specified attributes
CN117594157A (en) * 2024-01-19 2024-02-23 烟台国工智能科技有限公司 Method and device for generating molecules of single system based on reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970099A (en) * 2019-12-10 2020-04-07 北京大学 Medicine molecule generation method based on regularization variational automatic encoder
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generation of confrontation network
CN111128314A (en) * 2018-10-30 2020-05-08 深圳市云网拜特科技有限公司 Drug discovery method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JAECHANG LIM, SEONGOK RYU, JIN WOO KIM, WOO YOUN KIM: "Molecular generative model based on conditional variational autoencoder for de novo molecular design", JOURNAL OF CHEMINFORMATICS, vol. 10, no. 1, 1 December 2018 (2018-12-01), XP055635609, DOI: 10.1186/s13321-018-0286-7 *



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20951924

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05/07/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20951924

Country of ref document: EP

Kind code of ref document: A1