CN112071373A - Drug molecule screening method and system - Google Patents

Drug molecule screening method and system

Info

Publication number
CN112071373A
Authority
CN
China
Prior art keywords
layer, model, data, drug, calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010912085.6A
Other languages
Chinese (zh)
Inventor
汪念
吴楚楠
徐旻
温书豪
马健
赖力鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Jingtai Technology Co Ltd
Original Assignee
Shenzhen Jingtai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Jingtai Technology Co Ltd filed Critical Shenzhen Jingtai Technology Co Ltd
Priority to CN202010912085.6A
Publication of CN112071373A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A drug molecule screening method and system comprises the following steps: collecting drug molecule data related to a specific disease, preprocessing the data, and calculating a coding vector and the drug physicochemical properties for each molecule; constructing and training an AI model based on a conditional variational autoencoder, in which the coding vector and the drug physicochemical properties of each molecule are combined as the model input layer, the encoding layer converts the input into a hidden-layer coding vector, and the decoding layer generates candidate drug molecular structures; during training, the model loss function is minimized by a gradient descent algorithm, iteratively updating the weight parameters of the encoding-layer and decoding-layer neural networks; finally, potential drug molecules for treating the specific disease are generated from the trained conditional variational autoencoder model. The drug molecule screening method and system utilize the drug physicochemical property data of compound molecules; because these properties are strongly correlated with whether a compound can ultimately be developed into a drug, incorporating them improves the druggability of the generated compounds.

Description

Drug molecule screening method and system
Technical Field
The invention relates to a screening method, in particular to a drug molecule screening method and a drug molecule screening system.
Background
In the field of drug research and development, the traditional approach is to synthesize drugs after computer-simulated screening. With the rapid development of AI-driven drug discovery, people have begun to apply various AI algorithm models in drug R&D to address the long development cycle of new drugs. The target information of many diseases is currently unknown, so the difficulty and cost of finding effective drug molecules in large compound libraries are extremely high; the fast computing power and innovative theoretical foundations of AI bring a new research mode to the drug molecule screening process. For example, in attempts to generate drug molecules, models such as generative adversarial networks, convolutional neural networks, recurrent neural networks, and reinforcement learning can quickly find drug molecules similar to target molecules among the vast molecules of chemical libraries, greatly reducing the molecular search space while generating drug molecules that are effective to some extent for subsequent drug screening and experimental processes.
In current technology for generating drug molecules with AI models, the most widely applied methods are based on autoencoder models, such as the VAE and AAE models; models based on the adversarial-network idea can search out potential drug molecules similar to existing drug molecules. Their defects are that the validity and accuracy of the generated drug molecules are not high, and the generated candidates closely resemble the training-set molecules, weakening diversity. This makes the generated candidate drug molecules less of a breakthrough in the current field, or makes it harder for them to enter the actual drug-testing stage because their druggable properties are poor. In addition, conventional AI models give little consideration to drug-molecule properties, and the model input-layer data are single-sourced, reducing the validity of the drug molecules generated by the model.
Disclosure of Invention
Accordingly, there is a need for a drug molecule screening method that can improve druggability.
Also provided is a drug molecule screening system that optimizes generation and improves druggability.
A method of screening for a drug molecule comprising:
Preprocessing: collecting drug molecule data related to a specific disease, preprocessing the data, calculating for each molecule a coding vector and the relevant drug physicochemical properties, forming structured data, and storing them in a database;
Constructing and training a model: constructing and training an AI model based on a conditional variational autoencoder; the coding vector and the drug physicochemical properties of each molecule are combined as the model input layer; the encoding layer converts the input into a hidden-layer coding vector, and the decoding layer generates candidate drug molecular structures; during training, the model loss function is minimized by a gradient descent algorithm, iteratively updating the weight parameters of the encoding-layer and decoding-layer neural networks;
Generating potential drug molecules: generating potential drug molecules for treating the specific disease from the trained conditional variational autoencoder model.
In a preferred embodiment, the coding vector is a SMILES-format coding vector, and the preprocessing includes: counting all characters appearing in the SMILES strings, converting each character into a one-hot vector, and processing the SMILES data of each drug molecule into a coding vector of a set dimension.
In a preferred embodiment, calculating the drug physicochemical properties of the drug molecules includes one or more of: calculating the molecular mass, the lipid-water partition coefficient, the number of molecular H-bond donors, the number of molecular H-bond acceptors, and the molecular topological polar surface area.
In a preferred embodiment, the data of three indexes (molecular mass, lipid-water partition coefficient, and molecular topological polar surface area) are normalized and mapped uniformly into the range -1.0 to 1.0, and the 5 drug physicochemical properties of each drug molecule form a 5-dimensional vector.
In a preferred embodiment, the SMILES data and the drug physicochemical property data together form a total drug molecule data set, which is randomly divided into a training data set and a test data set in a 4:1 ratio; each SMILES entry is processed into a 120-dimensional coding vector, which is concatenated with the 5 values representing the different drug physicochemical properties to form 125-dimensional vector data used as the model input layer.
In a preferred embodiment, the AI model structure includes: the encoding layer takes the input-layer data and outputs to the hidden layer; the encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. The decoding layer takes the hidden-layer output and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function

$$ L = -\sum_{i=1}^{K} y_i \log(p_i) $$

where K is the number of categories, y is the label, and p is the network output, i.e. the probability that the category is i. The softmax layer estimates, for each position in the SMILES coding vector, the probability distribution over character categories; the output sample is then reconstructed through the direct correspondence between one-hot values and coding characters established in data preprocessing, and the SMILES string is output.
In a preferred embodiment, the input layer passes through the encoding layer to produce the hidden layer, forming the encoder; the hidden layer passes through the decoding layer to produce the output layer, forming the decoder. The encoder converts the high-dimensional input into a lower-dimensional hidden vector, and the loss function is

$$ \mathrm{Loss} = -\mathbb{E}_{z \sim Q(z|X,c)}\left[\log P(X'|z,c)\right] + D_{KL}\left(Q(z|X,c) \,\|\, N(0,1)\right) $$

The loss function comprises two parts: the first is the expected log-likelihood under the probability distribution P(X'|z, c), representing the distance between the reconstructed output and the input training sample X; the second is the KL divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
In a preferred embodiment, the model is constructed and trained based on TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to calculate the test-set error, i.e. the loss function, to prevent overfitting. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease in the training-set error has flattened, the parameters of the encoding and decoding layers have been optimized to their best values, training is stopped, and the model is saved.
A drug molecule screening system comprising:
a preprocessing module: collecting drug molecule data related to a specific disease, preprocessing the data, calculating for each molecule a coding vector and the relevant drug physicochemical properties, forming structured data, and storing them in a database;
a model construction and training module: constructing and training an AI model based on a conditional variational autoencoder; the coding vector and the drug physicochemical properties of each molecule are combined as the model input layer; the encoding layer converts the input into a hidden-layer coding vector, and the decoding layer generates candidate drug molecular structures; during training, the model loss function is minimized by a gradient descent algorithm, iteratively updating the weight parameters of the encoding-layer and decoding-layer neural networks for better training;
a potential-drug-molecule generation module: generating potential drug molecules for treating the specific disease from the trained conditional variational autoencoder model.
In a preferred embodiment, calculating the drug physicochemical properties of the drug molecules includes one or more of: calculating the molecular mass, the lipid-water partition coefficient, the number of molecular H-bond donors, the number of molecular H-bond acceptors, and the molecular topological polar surface area.
According to the drug molecule screening method and system, the drug physicochemical property data of compound molecules are utilized. Because the drug physicochemical properties of a compound are strongly correlated with whether it can ultimately be developed into a drug, and a compound with poor physicochemical properties, or with properties outside the required ranges, has an extremely low probability of becoming a drug, incorporating the drug physicochemical properties of the molecules into the model input-layer data keeps the physicochemical properties of the molecules finally obtained after training within a reasonable range and improves their druggability. This effectively improves the validity and accuracy of the drug molecules generated by the model, and regulating the numerical ranges of specific drug physicochemical property indexes gives the generated molecules diversity.
The SMILES-format coding vector and the drug physicochemical properties of each molecule are combined as the model input layer; the drug physicochemical properties express the expectation that molecules generated by the model perform well on those property indexes. The selected drug physicochemical property data are abstracted into a condition vector of the input layer and introduced directly into the computation in both the encoding layer and the decoding layer. The input-layer data are converted into a hidden-layer coding vector by the encoder, and candidate drug molecular structures are generated by the decoding layer. The encoding and decoding layers adopt an LSTM recurrent neural network structure; during training, the model loss function is minimized by a gradient descent algorithm, iteratively updating the weight parameters of the encoding-layer and decoding-layer networks for better training.
In addition, the drug molecule screening method based on the conditional variational autoencoder effectively improves the validity of the generated candidate molecules. Target drug physicochemical property data are introduced into the input layer as a condition vector, and the conditional variational autoencoder model performs multivariable control: beyond exploiting the advantages of the hidden-layer space, the drug physicochemical properties are fed into the encoding layer, and the condition vector is also processed in the decoding layer, so that the generated candidate drug molecules perform better on the target drug physicochemical properties, improving their validity and accuracy.
Drawings
FIG. 1 is a flow chart of a drug molecule screening method according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, a method for screening a drug molecule according to an embodiment of the present invention includes:
step S101, preprocessing: collecting drug molecule data related to a specific disease, preprocessing the data, calculating for each molecule a coding vector and the relevant drug physicochemical properties, forming structured data, and storing them in a database;
step S103, constructing and training the model: constructing and training an AI model based on a conditional variational autoencoder; the coding vector and the drug physicochemical properties of each molecule are combined as the model input layer; the encoding layer converts the input into a hidden-layer coding vector, and the decoding layer generates candidate drug molecular structures; during training, the model loss function is minimized by a gradient descent algorithm, iteratively updating the weight parameters of the encoding-layer and decoding-layer neural networks for better training;
step S105, generating potential drug molecules: generating potential drug molecules for treating the specific disease from the trained conditional variational autoencoder model.
Further, the coding vector of the present embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) format coding vector. The preprocessing comprises: counting all characters appearing in the SMILES strings, converting each character into a one-hot vector, and processing the SMILES data of each drug molecule into a coding vector of a set dimension.
The specific preprocessing process is as follows: calculate the SMILES-format coding vector of each drug molecule. First, count all characters appearing in the SMILES formulas, convert each character into a one-hot vector, and append an 'E' character to the end of each string to mark its end. The SMILES coding vector of each drug molecule is fixed to 120 dimensions; when a string has fewer characters than this, the remaining positions are filled with the one-hot value of 'E'. The SMILES data of each drug molecule is thus processed into a 120-dimensional coding vector.
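The counting, one-hot conversion, and 'E'-padding steps above can be sketched in plain Python. The helper names `build_charset` and `encode_smiles` are illustrative, not from the patent, and here the character set is derived from a single example molecule rather than a full data set:

```python
# Illustrative sketch of the SMILES one-hot preprocessing described above.

def build_charset(smiles_list):
    """Collect every character appearing in the SMILES strings, plus 'E' (end/pad)."""
    chars = sorted(set("".join(smiles_list)) | {"E"})
    return {c: i for i, c in enumerate(chars)}

def encode_smiles(smiles, char_to_idx, max_len=120):
    """One-hot encode a SMILES string, padding with 'E' up to max_len positions."""
    padded = (smiles + "E" * max_len)[:max_len]  # pad (or truncate) to max_len
    vec = []
    for ch in padded:
        row = [0] * len(char_to_idx)
        row[char_to_idx[ch]] = 1  # one-hot: a single 1 at the character's index
        vec.append(row)
    return vec  # max_len rows, one per position

charset = build_charset(["CC(=O)Oc1ccccc1C(=O)O"])  # aspirin as the example
encoded = encode_smiles("CC(=O)Oc1ccccc1C(=O)O", charset)
```

A real pipeline would build the character set over the whole data set so every molecule shares one vocabulary.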
SMILES, short for Simplified Molecular Input Line Entry Specification, is a specification for describing molecular structure unambiguously using ASCII character strings.
The collected drug molecule data set contains the molecular structural formulas of drug compounds, generally as an SMI file, with each drug molecule represented in SMILES format; for example, the SMILES formula of aspirin is CC(=O)Oc1ccccc1C(=O)O.
Calculating the pharmaco-physicochemical properties of the drug molecules included: calculating one or more of molecular mass, calculating a lipid-water distribution coefficient, calculating the number of molecular H bond donors, calculating the number of molecular H bond acceptors and calculating the molecular topological polar surface area.
Specifically, the drug physicochemical properties of the drug molecules are calculated using methods provided in the RDKit software library, mainly comprising:
calculating the molecular mass (MW): rdkit.Chem.Descriptors.ExactMolWt();
calculating the lipid-water partition coefficient (LogP): rdkit.Chem.Crippen.MolLogP();
calculating the number of molecular H-bond donors (HBD): rdkit.Chem.rdMolDescriptors.CalcNumHBD();
calculating the number of molecular H-bond acceptors (HBA): rdkit.Chem.rdMolDescriptors.CalcNumHBA();
calculating the molecular topological polar surface area (TPSA): rdkit.Chem.rdMolDescriptors.CalcTPSA().
Further, the data of the three indexes (molecular mass, lipid-water partition coefficient, and molecular topological polar surface area) are normalized and mapped uniformly into the range -1.0 to 1.0, and the 5 drug physicochemical properties of each drug molecule form a 5-dimensional vector.
Further, the SMILES data and the drug physicochemical property data together form a total drug molecule data set, which is randomly divided into a training data set and a test data set in a 4:1 ratio; each SMILES entry is processed into a 120-dimensional coding vector, which is concatenated with the 5 values representing the different drug physicochemical properties to form 125-dimensional vector data used as the model input layer.
Specifically, statistical analysis of the calculated drug physicochemical properties shows that across the data set the MW values range over 0 to 500, the LogP values over 0 to 5, and the TPSA values over 0 to 150. The data of the MW, LogP, and TPSA indexes are normalized and mapped uniformly into the range -1.0 to 1.0. The numbers of molecular H-bond donors and H-bond acceptors are integers and are used directly. Through this logic, the 5 drug physicochemical properties of each drug molecule are abstracted into a 5-dimensional vector.
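The normalization just described can be illustrated with a minimal sketch, assuming a simple linear mapping of the stated ranges (MW 0 to 500, LogP 0 to 5, TPSA 0 to 150) onto [-1.0, 1.0]; the function names and the example property values are illustrative, not from the patent:

```python
# Illustrative sketch of building the 5-dimensional condition vector.

def normalize(value, lo, hi):
    """Linearly map a value in [lo, hi] into [-1.0, 1.0]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def condition_vector(mw, logp, hbd, hba, tpsa):
    """Form the 5-dimensional drug physicochemical property vector."""
    return [
        normalize(mw, 0.0, 500.0),    # molecular mass
        normalize(logp, 0.0, 5.0),    # lipid-water partition coefficient
        float(hbd),                   # integer counts used directly, per the text
        float(hba),
        normalize(tpsa, 0.0, 150.0),  # topological polar surface area
    ]

# Aspirin-like values (illustrative numbers only)
cvec = condition_vector(mw=180.2, logp=1.3, hbd=1, hba=4, tpsa=63.6)
```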
The drug molecule data set consists of the molecular SMILES data and the 5 kinds of drug physicochemical property data. The total data set is divided randomly into a training data set and a test data set in a 4:1 ratio, i.e. the training data set accounts for 80% and the test data for 20%.
After data preprocessing, each drug molecule sample in the data set has one SMILES-format coding vector and 5 values representing different drug physicochemical properties (i.e. the condition vector in this method). For each sample, the two data vectors are concatenated directly to form 125-dimensional vector data; all drug molecules in the data set are characterized in this form, and these data serve as the model input-layer data.
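The 4:1 random split and the 120 + 5 = 125-dimensional concatenation can be sketched as follows; all function names are illustrative, not from the patent:

```python
import random

# Illustrative sketch of the data-set split and input-vector concatenation.

def train_test_split(samples, train_frac=0.8, seed=0):
    """Randomly split the total data set 4:1 into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # random but reproducible split
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def make_input_vector(smiles_vec_120, condition_vec_5):
    """Concatenate the SMILES coding vector with the 5 property values."""
    return list(smiles_vec_120) + list(condition_vec_5)

samples = ["mol%d" % i for i in range(10)]
train, test = train_test_split(samples)
x = make_input_vector([0.0] * 120, [0.1, -0.2, 1.0, 4.0, 0.3])
```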
Compared with the VAE and AAE models, the conditional variational autoencoder (CVAE) model provided by this application utilizes the drug physicochemical property data of compound molecules. These properties are strongly correlated with whether a compound can ultimately be developed into a drug, and a compound with poor physicochemical properties, or with properties outside the required ranges, has an extremely low probability of becoming a drug. Bringing the drug physicochemical properties into the model input-layer data therefore keeps the physicochemical properties of the molecules finally obtained after training within a reasonable range, improves druggability, and effectively improves the validity and accuracy of the drug molecules generated by the model; regulating the numerical ranges of specific drug physicochemical property indexes also gives the generated molecules greater diversity.
Therefore, the method calculates the drug physicochemical properties of the drug molecule data set in advance, mainly selecting 5 properties: molecular mass (MW), lipid-water partition coefficient (LogP), number of molecular H-bond donors (HBD), number of molecular H-bond acceptors (HBA), and molecular topological polar surface area (TPSA). These 5 drug physicochemical properties are selected mainly on the basis of the following considerations:
1. molecular mass is the most fundamental descriptor of a molecule;
2. The lipid-water partition coefficient logP is the logarithm of the partition coefficient of a compound between n-octanol (oil) and water, reflecting how the substance distributes between the oil and water phases. The larger the logP value, the more lipophilic the substance; conversely, the smaller the logP value, the more hydrophilic the substance, i.e. the better its water solubility. Water solubility is a key index of a compound in pharmacy.
3. The number of molecular H-bond donors HBD and the number of molecular H-bond acceptors HBA represent the number of donors and acceptors, respectively, that form H-bonds between molecules, and are also generally used as basic indicators for assessing the efficacy of molecules.
4. Molecular topology polar surface area TPSA, which is also a parameter commonly used in pharmaceutical chemistry, is defined as the total surface area of polar molecules within a compound, mostly oxygen and nitrogen atoms, also including the hydrogen atoms to which they are attached. In medicinal chemistry applications, polar surface area is a descriptive indicator for evaluating the transport properties of a drug within a cell. For a good pharmaceutical compound, the topological polar surface area of the compound is within a certain value range.
Therefore, the method takes the drug physicochemical property data of the drug molecules as auxiliary condition data forming part of the model input-layer data. The advantage is that the physicochemical property data in the data set are fully utilized, and through model training the physicochemical properties of the drug molecules generated by the model also fall within a reasonable range, improving the effectiveness of drug molecule generation.
The AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. The encoding layer of the present embodiment takes the input-layer data and outputs to the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. The decoding layer takes the hidden-layer output and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, with the cross-entropy function as its cost function:

$$ L = -\sum_{i=1}^{K} y_i \log(p_i) $$

where K is the number of categories, y is the label, and p is the network output, i.e. the probability that the category is i. The softmax layer estimates, for each position in the SMILES coding vector, the probability distribution over character categories; the output sample is then reconstructed through the direct correspondence between one-hot values and coding characters established in data preprocessing, and the SMILES string is output.
The input-layer data include: the SMILES data of the drug molecule data set and its 5 kinds of drug physicochemical property data. Specifically, the input layer consists of the preprocessed vector data of the drug molecule data set (the SMILES-format coding vector plus the condition vector composed of the drug physicochemical properties), i.e. the input layer is X1, X2, …, Xn, where each sample is characterized by a group of vector data.
Encoding layer: its input is the input layer and its output is the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM (Long Short-Term Memory) units; each LSTM unit comprises an input gate, a forget gate, and an output gate, and each layer has 512 hidden nodes.
Hidden layer: the hidden-layer vector dimension is set to 200.
Decoding layer: its input is the hidden layer and its output is the output layer. The decoding layer is an RNN structure containing 3 recurrent neural network layers, likewise using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding-layer units, and its cost function is the cross-entropy function shown below:

$$ L = -\sum_{i=1}^{K} y_i \log(p_i) $$

where K is the number of categories, y is the label, and p is the network output, i.e. the probability that the category is i.

The softmax layer estimates, for each position in the SMILES coding vector, the probability distribution over character categories; the output sample is then reconstructed through the direct correspondence between one-hot values and coding characters established in data preprocessing, and the SMILES string is output.
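The softmax layer and its cross-entropy cost can be illustrated with a small stdlib-only sketch; in the actual model these would be computed inside the network framework, and the names here are illustrative:

```python
import math

# Minimal sketch of the softmax layer and cross-entropy cost described above.

def softmax(logits):
    """Convert raw network outputs into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_onehot, p):
    """L = -sum_i y_i * log(p_i) over the K character categories."""
    return -sum(y * math.log(max(pi, 1e-12)) for y, pi in zip(y_onehot, p))

probs = softmax([2.0, 1.0, 0.1])     # one position's scores over 3 categories
loss = cross_entropy([1, 0, 0], probs)  # true category is the first one
```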
Output layer: the SMILES formula of the generated molecule.
The input layer passes through the encoding layer to produce the hidden layer, forming the encoder; the hidden layer passes through the decoding layer to produce the output layer, forming the decoder. The encoder converts the high-dimensional input into a lower-dimensional hidden vector.
The loss function of the present embodiment is

$$ \mathrm{Loss} = -\mathbb{E}_{z \sim Q(z|X,c)}\left[\log P(X'|z,c)\right] + D_{KL}\left(Q(z|X,c) \,\|\, N(0,1)\right) $$

The loss function comprises two parts: the first is the expected log-likelihood under the probability distribution P(X'|z, c), representing the distance between the reconstructed output and the input training sample X; the second is the KL divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
The AI model based on the conditional variational auto-encoder of the embodiment is a CVAE model, and the CVAE consists of two modules, namely an encoder module and a decoder module. The input layer generates a hidden layer through an encoding layer, and the hidden layer is an encoder. The hidden layer generates an output layer as a decoder through a decoding layer. The encoder converts the high dimensional input to a low dimensional hidden vector Z, and the decoder reduces the hidden vector to an output as close to the input as possible. During training, training samples X and control conditions L need to be input, and parameters in a combined network of an encoder and a decoder are optimized through gradient descent.
First, the encoder maps the training sample X into two sets of parameters that determine a conditional probability distribution Q(z|X,c) of the output z given the input X and the condition vector c; assuming a normal distribution, the two sets of parameters represent its mean and variance, respectively. A hidden vector z obeying this distribution is obtained by sampling and is mapped by the decoder into a new set of parameters; the condition vector c, which is associated with the sample X, is also used in the decoding layer during this mapping. A conditional probability distribution P(X'|z,c) based on z and c is thereby determined, so the probability distribution of the decoding layer is constrained by both the hidden vector and the condition vector. To minimize the distance between the generated X' and the input X while preserving the model's generative capability, Q(z|X,c) should be as close as possible to the normal distribution N(0, 1). The model minimizes the loss function value by a stochastic gradient descent algorithm, i.e., the network structure parameters in the encoding layer and decoding layer are iteratively updated and optimized.
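The sampling step and the KL constraint toward N(0, 1) described above can be sketched as follows (an illustrative approximation assuming a diagonal-Gaussian Q parameterized by mean and log-variance; not the patent's code):

```python
import math
import random

def reparameterize(mu, log_var, rng=random.Random(0)):
    # z = mu + sigma * eps with eps ~ N(0, 1): the sampled z follows Q(z|X, c)
    # while remaining differentiable with respect to mu and log_var.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions.
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

z = reparameterize([0.0, 0.0], [0.0, 0.0])
```

When mu = 0 and log_var = 0, Q already equals N(0, 1) and the KL term vanishes, which is exactly the state the constraint pushes the encoder toward.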
The loss (cost) function Loss of the CVAE model is as follows:

Loss = -E_{z~Q(z|X,c)}[ log P(X'|z,c) ] + KL( Q(z|X,c) || N(0,1) )

The loss function comprises two parts: the first part, the negative log-likelihood under the probability distribution P(X'|z,c), characterizes the distance between the decoder output and the input sample X; the second part is the Kullback-Leibler (KL) divergence, representing the distance between Q(z|X,c) and its reference probability distribution N(0, 1).
Further, in this embodiment, the model is constructed and trained with TensorFlow according to the model structure. The training data set is used for model training, while the test data set is used to calculate the test-set error, i.e., the loss function, so as to prevent overfitting of the model. After a certain number of training epochs, the training-set error and test-set error are compared; when the test-set error is basically unchanged and the decline of the training-set error has flattened, the parameters of the encoding layer and decoding layer have been optimized to their optimal values, training is stopped, and the model is saved.
Specifically, the model construction and training process is as follows. According to the model structure, the model is constructed and trained with TensorFlow; the training data set is used for model training, while the test data set is used to calculate the test-set error (the loss function) to prevent overfitting of the model. At the start of training, both the training-set error and the test-set error are large and fall rapidly. After a certain number of training epochs, the two errors are compared; when the test-set error is basically unchanged while the training-set error is still slowly decreasing, the model can be considered to have high accuracy on both data sets, and the parameters of the encoding layer and decoding layer have been optimized to their optimal values. Training is therefore stopped and the model is saved; this is the finally trained CVAE model.
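The stopping criterion described above — halt when the test-set error is basically unchanged — can be sketched generically as follows (an illustration under our own assumptions; `train_step` and `eval_test_error` stand in for the TensorFlow training and evaluation calls):

```python
def train_with_early_stopping(train_step, eval_test_error,
                              max_epochs=100, patience=3, min_delta=1e-4):
    # Stop once the test-set error has been "basically unchanged" for
    # `patience` consecutive epochs, keeping the best error seen so far.
    best, stale, epochs_run = float("inf"), 0, 0
    for _ in range(max_epochs):
        train_step()                 # one pass over the training data set
        err = eval_test_error()      # loss on the held-out test data set
        epochs_run += 1
        if best - err > min_delta:   # test error still falling: keep training
            best, stale = err, 0
        else:                        # test error flat: stop before overfitting
            stale += 1
            if stale >= patience:
                break
    return best, epochs_run

# Toy error curve: improves, then plateaus; training halts on the plateau.
errors = iter([1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.1])
best, n = train_with_early_stopping(lambda: None, lambda: next(errors))
```

With the toy curve above, training stops after the plateau at 0.4 and never reaches the final entry, which is the behavior the embodiment describes.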
In the drug molecule screening method of this embodiment, 5 selected drug physicochemical properties of the drug molecules are calculated; the selection of these properties is directly related to the efficacy of the drug molecules. The calculated drug physicochemical property values are abstracted into a vector serving as the condition vector.
The conditional variational autoencoder model is characterized by the additional input of a condition vector. In this method the condition vector comprises 5 drug physicochemical properties rather than a single property, so multivariable control can be performed; the specific control process is realized by the model structure and logic.
The hidden-layer (latent) space is the core of the autoencoder method: the encoder converts the high-dimensional input into a low-dimensional hidden vector, and the decoder then restores the hidden vector into an output that approximates the input data as closely as possible, thereby achieving the goal.
The drug physicochemical properties of the drug molecules are abstracted into the condition vector and enter the encoder as part of the model's input-layer data.
The input of the decoder is the hidden vector together with the condition vector, i.e., the drug physicochemical property data, so the condition-vector data also enters the network computation in the decoder.
A drug molecule screening system comprising:
a preprocessing module: collecting drug molecule data related to a specific disease, preprocessing the data, calculating a coding vector and related physicochemical properties of drugs of the data, forming structural data and storing the structural data in a database;
constructing a training model module: constructing and training an AI model based on the conditional variational autoencoder, combining the encoding vectors and the drug physicochemical properties of the molecules as the input layer of the model, converting the input through the encoding layer of the model into a hidden-layer encoding vector, and generating possible drug molecular structures through the decoding layer of the model; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the encoding-layer and decoding-layer neural network structures are continuously updated and iterated until the model is well trained;
a potential drug molecule generation module: generating potential drug molecules for curing the specific disease according to the trained conditional variational autoencoder model.
Further, the calculation of the physicochemical properties of the drug molecules of the present example includes: calculating one or more of molecular mass, calculating a lipid-water distribution coefficient, calculating the number of molecular H bond donors, calculating the number of molecular H bond acceptors and calculating the molecular topological polar surface area.
Further, the encoding vector of the present embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) type encoding vector.
Further, the preprocessing module of the embodiment includes: counting all characters in the SMILES, converting each character in the SMILES into one-hot vectors, and processing SMILES data of each drug molecule into coding vectors with set dimensions.
The specific preprocessing module comprises: calculating the SMILES encoding vectors of the drug molecules. First, all characters appearing in the SMILES formulas are counted, each character is converted into a one-hot vector, and an 'E' character is appended to the end of each string to mark its end. The SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when a string is shorter than this, the remaining positions are filled with the one-hot value of 'E', so the SMILES data of each drug molecule is processed into a 120-dimensional encoding vector.
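The described 120-dimensional one-hot encoding with 'E' padding can be sketched as follows (illustrative only; the charset here is a toy subset of the characters counted from the data set):

```python
def smiles_to_onehot(smiles, charset, max_len=120):
    # One-hot encode a SMILES string: append 'E' to mark the end, then pad
    # the remaining positions with 'E' up to the fixed length of 120.
    idx = {ch: i for i, ch in enumerate(charset)}
    padded = (smiles + "E")[:max_len]
    padded += "E" * (max_len - len(padded))
    vec = []
    for ch in padded:
        row = [0] * len(charset)
        row[idx[ch]] = 1
        vec.append(row)
    return vec

charset = ["E", "C", "O", "N", "(", ")", "=", "1"]  # toy subset of SMILES characters
v = smiles_to_onehot("CCO", charset)
```

Each of the 120 positions holds exactly one one-hot row, which is also what the decoder's per-position softmax later reconstructs.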
Calculating the pharmaco-physicochemical properties of the drug molecules included: calculating one or more of molecular mass, calculating a lipid-water distribution coefficient, calculating the number of molecular H bond donors, calculating the number of molecular H bond acceptors and calculating the molecular topological polar surface area.
Further, the data of the three indexes (molecular mass, lipid-water partition coefficient, and molecular topological polar surface area) are normalized and uniformly mapped into the range -1.0 to 1.0, and the 5 drug physicochemical properties of each drug molecule form a 5-dimensional vector.
Further, SMILES type data and medicine physicochemical property data jointly form a medicine molecule total data set, the total data set is randomly divided into a training data set and a testing data set according to the proportion of 4:1, each SMILES type data is processed into a 120-dimensional coding vector, and the 120-dimensional coding vector and 5 vectors representing different medicine physicochemical properties are spliced and combined to form 125-dimensional vector data which is used as an input layer of the model.
Specifically, statistical analysis of the drug physicochemical property calculations shows that across the data set the MW values range over 0-500, the LogP values over 0-5, and the TPSA values over 0-150; the data of these three indexes are normalized and uniformly mapped into the range -1.0 to 1.0. The counts of molecular H-bond donors and H-bond acceptors are integers and are expressed directly. Through this logic, the 5 drug physicochemical properties of each drug molecule are abstracted into a 5-dimensional vector.
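A sketch of the normalization logic above (assuming linear scaling into [-1.0, 1.0]; the patent states only the observed ranges and the target interval):

```python
def normalize(value, lo, hi):
    # Map value from [lo, hi] linearly into [-1.0, 1.0].
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def property_vector(mw, logp, h_donors, h_acceptors, tpsa):
    # MW in [0, 500], LogP in [0, 5], TPSA in [0, 150] per the statistics above;
    # H-bond donor/acceptor counts are integers and are used directly.
    return [normalize(mw, 0, 500), normalize(logp, 0, 5),
            float(h_donors), float(h_acceptors), normalize(tpsa, 0, 150)]

v = property_vector(250.0, 2.5, 1, 3, 75.0)
```

A molecule sitting at the midpoint of each normalized range maps to 0.0, with the integer-valued donor/acceptor counts passed through unchanged.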
The drug molecule data set consists of the molecular SMILES data and the 5 kinds of drug physicochemical property data. The total data set is randomly divided into a training data set and a test data set at a ratio of 4:1, i.e., the training data set accounts for 80% and the test data for 20%.
After data preprocessing, each drug molecule sample in the drug data set has one SMILES-format encoding vector and a 5-dimensional vector representing the different drug physicochemical properties (i.e., the condition vector in this method). For each sample these two vectors are directly spliced into 125-dimensional vector data; all drug molecules in the data set are characterized in this form, which serves as the input-layer data of the model.
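The 4:1 random split and the 120 + 5 = 125-dimensional concatenation can be sketched as follows (illustrative; each sample is represented here as a flat 120-dimensional code plus a 5-dimensional property vector):

```python
import random

def split_and_assemble(samples, seed=0):
    # Concatenate each 120-dim SMILES code with its 5-dim property (condition)
    # vector into a 125-dim row, then split 4:1 at random into train/test.
    rows = [list(code) + list(props) for code, props in samples]
    rng = random.Random(seed)
    rng.shuffle(rows)
    cut = int(len(rows) * 0.8)       # 80% training, 20% test
    return rows[:cut], rows[cut:]

samples = [([0.0] * 120, [0.0] * 5) for _ in range(10)]
train, test = split_and_assemble(samples)
```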
The AI model structure comprises: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. The encoding layer of this embodiment takes the input-layer data and outputs to the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers of LSTM units, with 512 hidden nodes per layer. The decoding layer takes the output data of the hidden layer and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers of LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function
L = -∑_{i=1}^{K} y_i · log(p_i)
wherein K is the number of categories, y_i is the label, and p_i is the output of the network, i.e., the probability that the category is i. Through the softmax layer, the probability distribution over character categories is estimated for each position of the SMILES encoding vector; the output sample is then reconstructed via the one-to-one correspondence between one-hot values and encoding characters established during data preprocessing, and the SMILES formula is output.
The input layer generates the hidden layer through the encoding layer, forming the encoder; the hidden layer generates the output layer through the decoding layer, forming the decoder. The encoder converts the high-dimensional input into a low-dimensional hidden vector.
The loss function of the present embodiment is

Loss = -E_{z~Q(z|X,c)}[ log P(X'|z,c) ] + KL( Q(z|X,c) || N(0,1) )

The loss function comprises two parts: the first part is the negative log-likelihood of the reconstruction under the probability distribution P(X'|z,c), representing the distance between the decoder output and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X,c) and its reference probability distribution N(0, 1).
Further, in this embodiment, the model is constructed and trained with TensorFlow according to the model structure. The training data set is used for model training, while the test data set is used to calculate the test-set error, i.e., the loss function, so as to prevent overfitting of the model. After a certain number of training epochs, the training-set error and test-set error are compared; when the test-set error is basically unchanged and the decline of the training-set error has flattened, the parameters of the encoding layer and decoding layer have been optimized to their optimal values, training is stopped, and the model is saved.
The invention relates to a drug molecule screening method based on a conditional variation self-encoder, which comprises the following steps:
Drug molecule data sets relevant to specific diseases are collected and subjected to data preprocessing. Data collection yields the structural data of the drug molecules; after analysis and processing, the SMILES encoding vectors and the related drug physicochemical properties are calculated to form structured data stored in a database. The data set is divided into a model training data set and a test data set according to a certain proportion. The physicochemical properties of the drug molecules may be selected from index properties such as molecular mass, LogP, number of molecular H-bond donors, number of molecular H-bond acceptors, and TPSA (molecular topological polar surface area).
An AI model based on a conditional variational autoencoder is constructed and trained from the drug molecule data set. The main steps for establishing the model are as follows. The SMILES encoding vector and the molecular drug physicochemical properties are combined as the input layer of the model, where the drug physicochemical properties are the indexes on which the drug molecules generated by the model are expected to perform well; the selected property data are abstracted into the condition vector of the input layer and introduced directly into the computation in the encoding layer and decoding layer. The input-layer data are converted by the encoder into a hidden-layer encoding vector, and possible drug molecular structures are generated through the decoding layer. The encoding layer and decoding layer adopt an LSTM recurrent neural network structure; during training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the encoding-layer and decoding-layer networks are continuously updated and iterated so that the model is well trained.
And generating potential drug molecules capable of curing the specific disease according to the trained condition variation self-encoder model for the subsequent drug calculation process and experimental process.
The invention adopts a drug molecule screening method based on a conditional variational autoencoder, which can effectively improve the validity of the generated potential molecules. Target drug physicochemical property data are introduced as condition vectors in the input layer, and the conditional variational autoencoder model is used for multivariable control. Beyond exploiting the advantages of the hidden-layer space, the drug physicochemical properties of the molecules are fed into the encoding layer, and the condition vector is further processed and exerts control in the decoding layer, so that the generated potential drug molecules perform better on the target physicochemical properties, improving the validity and accuracy of the drug molecules.
The invention relates to a drug molecule screening method based on a conditional variation self-encoder, which comprises the following specific embodiments:
A drug molecule data set related to a specific disease is collected; the data are drawn from open-source databases such as PubChem and the CCDC. In this embodiment, taking pancreatic cancer as an example, the SMILES formulas of drug molecules with a therapeutic effect on, or potential relevance to, pancreatic cancer are gathered and extracted from the open-source databases as the initial data set. The data preprocessing stage then calculates the SMILES encoding vectors of the drug molecules. First, all characters appearing in the SMILES formulas are counted, each character is converted into a one-hot vector, and an 'E' character is appended to the end of each string to mark its end. The SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when a string is shorter than this, the remaining positions are filled with the one-hot value of 'E', so the SMILES data of each drug molecule is processed into a 120-dimensional encoding vector. At the same time, 5 relatively representative drug physicochemical properties of the drug molecules are selected: molecular mass, LogP, number of molecular H-bond donors, number of molecular H-bond acceptors, and TPSA (molecular topological polar surface area). These properties are calculated for each drug molecule in the data set and abstracted into a vector according to their numerical values, serving as the condition-vector input of the model.
An AI model based on the conditional variational autoencoder is trained on the drug molecule data set; the model mainly comprises an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In the input layer, the molecular encoding vector and the condition vector obtained from the drug properties are jointly transmitted to the encoding layer as input-layer data. The encoding layer and decoding layer adopt a recurrent neural network structure with LSTM units: 3 network layers, each with 512 hidden nodes. A softmax layer follows the decoding layer, and the cost function adopts the cross-entropy function. The output vector produced by the decoding layer is finally converted into SMILES molecular encoding form. Through iterative training, the network structure parameters in the encoding layer and decoding layer are updated and optimized, and the decoding layer reconstructs and outputs generated samples, so that the validity of the drug molecules produced at the output layer improves, better performance is achieved on the selected target drug physicochemical properties, and the results become potential drug molecules for curing pancreatic cancer.
According to the trained conditional variational autoencoder model, a certain number of potential drug molecules for curing pancreatic cancer are generated in line with the subsequent research modes and targets. The model can generate a large number of potential drug molecules; after certain property analysis, the generated drug molecule data set can be further screened or ranked, so that an appropriate number of highly valid drug molecules are selected for the subsequent drug calculation and experimental processes.
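The generation step — sampling hidden vectors from N(0, 1) and decoding them together with a target condition vector — can be sketched as follows (the toy `decode` stands in for the trained CVAE decoder; the names are ours):

```python
import random

def generate_candidates(decode, condition, n=5, latent_dim=2, seed=0):
    # Sample hidden vectors z ~ N(0, 1) and decode each together with the
    # target condition vector c to obtain candidate molecules.
    rng = random.Random(seed)
    return [decode([rng.gauss(0.0, 1.0) for _ in range(latent_dim)], condition)
            for _ in range(n)]

# Toy decoder that just echoes its inputs; a trained CVAE decoder would
# return a SMILES string here.
cands = generate_candidates(lambda z, c: (z, c), condition=[0.0] * 5, n=3)
```

Because the same condition vector c accompanies every sampled z, all candidates are steered toward the same target physicochemical profile.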
For the generated drug molecule data set, the 5 drug physicochemical properties are calculated, and their ranges are tallied from known drug molecules used against pancreatic cancer. Screening mainly means filtering out molecules whose physicochemical property values clearly fall outside the statistically observed ranges; ranking means jointly considering the values of the 5 properties to order the molecules by drug-likeness, and a certain number of top-ranked generated molecules can be selected for subsequent research according to the research targets.
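The screening and ranking logic described above can be sketched as follows (the scoring formula is an assumption of ours; the patent only says the 5 property values are considered jointly):

```python
def screen_and_rank(molecules, ranges, score):
    # Filter: drop molecules with any property outside its statistical range.
    kept = [(name, props) for name, props in molecules.items()
            if all(lo <= p <= hi for p, (lo, hi) in zip(props, ranges))]
    # Rank: a higher score means more drug-like.
    return sorted(kept, key=lambda kv: score(kv[1]), reverse=True)

ranges = [(0, 500), (0, 5), (0, 5), (0, 10), (0, 150)]  # MW, LogP, HBD, HBA, TPSA
mols = {"a": [250, 2.0, 1, 3, 70],
        "b": [900, 2.0, 1, 3, 70],   # MW outside the statistical range
        "c": [300, 4.0, 2, 5, 90]}
ranked = screen_and_rank(mols, ranges, score=lambda p: -abs(p[1] - 2.5))
```

Here molecule "b" is filtered out on molecular mass, and the survivors are ordered by closeness of LogP to a hypothetical target value of 2.5.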
In light of the foregoing description of the preferred embodiments according to the present application, it is to be understood that various changes and modifications may be made without departing from the spirit and scope of the invention. The technical scope of the present application is not limited to the contents of the specification, and must be determined according to the scope of the claims.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method for screening a drug molecule, comprising:
pretreatment: collecting drug molecule data related to a specific disease, preprocessing the data, calculating a coding vector and related physicochemical properties of drugs of the data, forming structural data and storing the structural data in a database;
constructing a training model: constructing and training an AI model based on a condition variation self-encoder, combining coding vectors and the medicinal physicochemical properties of molecules as an input layer of the model, converting the coding layer of the model into a hidden layer coding vector, generating a possible medicinal molecular structure through a decoding layer of the model, minimizing a model loss function through a gradient descent algorithm in the model training process, and continuously updating weight parameters of a neural network structure of an iterative coding layer and the decoding layer;
generation of potential drug molecules: and generating potential drug molecules for curing specific diseases according to the trained condition variation self-encoder model.
2. The method of claim 1, wherein the encoding vector is a SMILES-like encoding vector, and the pre-processing comprises: counting all characters in the SMILES, converting each character in the SMILES into one-hot vectors, and processing SMILES data of each drug molecule into coding vectors with set dimensions.
3. The method of claim 1, wherein the calculating the physicochemical properties of the drug molecules comprises: calculating one or more of molecular mass, calculating a lipid-water distribution coefficient, calculating the number of molecular H bond donors, calculating the number of molecular H bond acceptors and calculating the molecular topological polar surface area.
5. The drug molecule screening method of claim 3, wherein the data of the three indexes of calculating the molecular mass, calculating the lipid-water distribution coefficient and calculating the molecular topological polar surface area are normalized, the data being uniformly mapped into the range of -1.0 to 1.0, and the 5 drug physicochemical properties of each drug molecule form a 5-dimensional vector.
5. The drug molecule screening method of claim 2, wherein the SMILES-type data and the physicochemical property data of the drug are combined to form a total data set of the drug molecule, the total data set is randomly divided into a training data set and a testing data set according to a ratio of 4:1, each SMILES-type data is processed into a 120-dimensional coding vector, and 5 vectors representing different physicochemical properties of the drug are spliced and combined to form 125-dimensional vector data which is used as an input layer of the model.
6. The method for screening drug molecules according to any one of claims 1 to 5, wherein the AI model structure comprises: the encoding layer inputs input layer output data and outputs the output data to the hidden layer, the encoding layer is of an RNN network structure and comprises 3 layers of cyclic neural network layers, LSTM units are adopted, each layer is provided with 512 hidden nodes, the output data of the hidden layer is input to the decoding layer and output to the output layer, the decoding layer is of an RNN network structure and comprises 3 layers of cyclic neural network layers, LSTM units are adopted, each layer is provided with 512 hidden nodes, a softmax layer is further arranged behind the decoding layer, and a cross entropy function is adopted as a cost function of the softmax layer
L = -∑_{i=1}^{K} y_i · log(p_i)
Wherein, K is the number of categories, y is the label, and p is the output of the network, meaning the probability that the category is i; the probability distribution over character categories at each position in the SMILES encoding vector is estimated through the softmax layer, the output sample is finally reconstructed via the direct correspondence between one-hot values and specific encoding characters in data preprocessing, and the SMILES formula is output.
7. The method of claim 6, wherein the input layer generates a hidden layer through an encoding layer to form an encoder, the hidden layer generates an output layer through a decoding layer to form a decoder, the encoder converts a high-dimensional input into a low-dimensional hidden vector, and the loss function is
Loss = -E_{z~Q(z|X,c)}[ log P(X'|z,c) ] + KL( Q(z|X,c) || N(0,1) )
The loss function comprises two parts, wherein the first part represents the log-likelihood of P (X) under the probability distribution P (X' | z, c) and represents the distance between the output of the coding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q (z | X, c) and its reference probability distribution N (0, 1).
8. The drug molecule screening method according to any one of claims 1 to 5, wherein the model is constructed and trained with TensorFlow according to the model structure; the training data set is used for model training, and the test data set is used to calculate the test-set error, namely the loss function, to prevent overfitting of the model; after a certain number of training epochs, the training-set error and the test-set error are compared, and when the test-set error is basically unchanged while the decline of the training-set error has weakened, the parameters of the encoding layer and decoding layer of the model have been optimized to their optimal values, training is stopped, and the model is saved.
9. A drug molecule screening system, comprising:
a preprocessing module: collecting drug molecule data related to a specific disease, preprocessing the data, calculating a coding vector and related physicochemical properties of drugs of the data, forming structural data and storing the structural data in a database;
constructing a training model module: constructing and training an AI model based on a condition variation self-encoder, combining coding vectors and the medicinal physicochemical properties of molecules as an input layer of the model, converting the coding layer of the model into a hidden layer coding vector, generating a possible medicinal molecular structure through a decoding layer of the model, minimizing a model loss function through a gradient descent algorithm in the model training process, and continuously updating weight parameters of a neural network structure of an iterative coding layer and the decoding layer;
generation of potential drug molecule modules: and generating potential drug molecules for curing specific diseases according to the trained condition variation self-encoder model.
10. The drug molecule screening system of claim 9, wherein the calculating the pharmaco-physicochemical property of the drug molecule comprises: calculating one or more of molecular mass, calculating a lipid-water distribution coefficient, calculating the number of molecular H bond donors, calculating the number of molecular H bond acceptors and calculating the molecular topological polar surface area.
CN202010912085.6A 2020-09-02 2020-09-02 Drug molecule screening method and system Pending CN112071373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010912085.6A CN112071373A (en) 2020-09-02 2020-09-02 Drug molecule screening method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010912085.6A CN112071373A (en) 2020-09-02 2020-09-02 Drug molecule screening method and system

Publications (1)

Publication Number Publication Date
CN112071373A true CN112071373A (en) 2020-12-11

Family

ID=73665419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010912085.6A Pending CN112071373A (en) 2020-09-02 2020-09-02 Drug molecule screening method and system

Country Status (1)

Country Link
CN (1) CN112071373A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101506660A (en) * 2006-08-17 2009-08-12 霍夫曼-拉罗奇有限公司 The use of BNP-type peptides for the stratification of therapy with erythropoietic stimulating agents
CN110970099A (en) * 2019-12-10 2020-04-07 北京大学 Drug molecule generation method based on regularized variational autoencoder
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generative adversarial network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JAECHANG LIM et al.: "Molecular generative model based on conditional variational autoencoder for de novo molecular design", Journal of Cheminformatics *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627983A (en) * 2020-12-14 2022-06-14 国际商业机器公司 Interpretable molecular generative model
CN112735540A (en) * 2020-12-18 2021-04-30 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
CN112735540B (en) * 2020-12-18 2024-01-05 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
CN112331261A (en) * 2021-01-05 2021-02-05 北京百度网讯科技有限公司 Drug prediction method, model training method, device, electronic device, and medium
CN113178234A (en) * 2021-02-23 2021-07-27 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN113178234B (en) * 2021-02-23 2023-10-31 北京亿药科技有限公司 Compound function prediction method based on neural network and connection graph algorithm
CN112992289A (en) * 2021-03-24 2021-06-18 北京晶派科技有限公司 Construction method and system of small molecule kinase inhibitor screening molecule library
CN112992289B (en) * 2021-03-24 2023-06-23 北京晶泰科技有限公司 Method and system for constructing small molecule kinase inhibitor screening molecular library
CN113223655A (en) * 2021-05-07 2021-08-06 西安电子科技大学 Drug-disease association prediction method based on variational autoencoder
CN113223655B (en) * 2021-05-07 2023-05-12 西安电子科技大学 Drug-disease association prediction method based on variational autoencoder
CN113488119A (en) * 2021-06-18 2021-10-08 重庆医科大学 Medicine small molecule numerical value feature structured database and establishing method thereof
CN113488119B (en) * 2021-06-18 2024-02-02 重庆医科大学 Drug small molecule numerical value characteristic structured database and establishment method thereof
CN113707234B (en) * 2021-08-27 2023-09-05 中南大学 Lead compound druggability optimization method based on machine translation model
CN113707234A (en) * 2021-08-27 2021-11-26 中南大学 Lead compound druggability optimization method based on machine translation model
CN114530205A (en) * 2021-08-31 2022-05-24 天津工业大学 Organ-on-a-chip database vectorization scheme for artificial intelligence algorithms
WO2023077522A1 (en) * 2021-11-08 2023-05-11 深圳晶泰科技有限公司 Compound design method and apparatus, device, and computer readable storage medium
CN114049922B (en) * 2021-11-09 2022-06-03 四川大学 Molecular design method based on small-scale data set and generation model
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114038516A (en) * 2021-11-25 2022-02-11 中国石油大学(华东) Molecule generation and optimization based on variational autoencoder
CN114038516B (en) * 2021-11-25 2024-04-19 中国石油大学(华东) Molecule generation and optimization method based on variational autoencoder
CN114121178A (en) * 2021-12-07 2022-03-01 中国计量科学研究院 Chromatogram retention index prediction method and device based on graph convolution network
CN114429799A (en) * 2021-12-30 2022-05-03 深圳晶泰科技有限公司 Virtual molecule screening system, method, electronic device and computer-readable storage medium
CN114429799B (en) * 2021-12-30 2024-09-27 深圳晶泰科技有限公司 Virtual molecular screening system, method, electronic device, and computer-readable storage medium
WO2023185357A1 (en) * 2022-03-31 2023-10-05 华为云计算技术有限公司 Molecule generation method and related device
CN114913938A (en) * 2022-05-27 2022-08-16 中南大学 Small molecule generation method, equipment and medium based on pharmacophore model
WO2023226351A1 (en) * 2022-05-27 2023-11-30 中南大学 Small-molecule generation method based on pharmacophore model, and device and medium
CN117334271A (en) * 2023-09-25 2024-01-02 江苏运动健康研究院 Method for generating molecules based on specified attributes
CN117423379A (en) * 2023-12-19 2024-01-19 合肥微观纪元数字科技有限公司 Molecular screening method and related device adopting quantum computation
CN117423379B (en) * 2023-12-19 2024-03-15 合肥微观纪元数字科技有限公司 Molecular screening method and related device adopting quantum computation
CN118506901A (en) * 2024-07-19 2024-08-16 烟台国工智能科技有限公司 Conditional molecule generation method and device based on attribute values

Similar Documents

Publication Publication Date Title
CN112071373A (en) Drug molecule screening method and system
WO2022047677A1 (en) Drug molecule screening method and system
CN110945495B (en) Conversion of natural language queries to database queries based on neural networks
CN109887540A (en) Drug-target interaction prediction method based on heterogeneous network embedding
CN113327644A (en) Medicine-target interaction prediction method based on deep embedding learning of graph and sequence
US11532378B2 (en) Protein database search using learned representations
Yu et al. LSTM-based end-to-end framework for biomedical event extraction
US20240055071A1 (en) Artificial intelligence-based compound processing method and apparatus, device, storage medium, and computer program product
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN113571125A (en) Drug target interaction prediction method based on multilayer network and graph coding
WO2023226351A1 (en) Small-molecule generation method based on pharmacophore model, and device and medium
CN110008482A (en) Text handling method, device, computer readable storage medium and computer equipment
US20220208540A1 (en) System for Identifying Structures of Molecular Compounds from Mass Spectrometry Data
Yan et al. Insights into deep learning framework for molecular property prediction based on different tokenization algorithms
CN117116383A (en) Drug molecule optimization method and device based on pretraining and fine-tuning
CN116453617A (en) Multi-target optimization molecule generation method and system combining active learning
Caetano-Anollés et al. The compressed vocabulary of the proteins of archaea
CN114842924A (en) Optimized de novo drug design method
CN110689919B (en) Pharmaceutical protein binding rate prediction method and system based on structure and grade classification
CN115376658A (en) Artificial intelligence evaluation method for traditional Chinese medicine prescriptions based on deep neural network fusion of phenotype and molecular information
CN113454726A (en) Biological information processing
Bian The Research and Development of an Artificial Intelligence Integrated Fragment-Based Drug Design Platform for Small Molecule Drug Discovery
Shao et al. TBPM-DDIE: Transformer Based Pretrained Method for predicting Drug-Drug Interactions Events
Xu et al. OdinDTA: Combining Mutual Attention and Pre-training for Drug-target Affinity Prediction
CN117198382A (en) Chemical modification siRNA activity prediction method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201211