WO2022047677A1 - Drug molecule screening method and system - Google Patents

Info

Publication number: WO2022047677A1
Application number: PCT/CN2020/113085
Authority: WIPO (PCT)
Prior art keywords: layer, drug, model, data, encoding
Other languages: French (fr), Chinese (zh)
Inventors: 汪念, 吴楚楠, 徐旻, 温书豪, 马健, 赖力鹏
Original assignee: 深圳晶泰科技有限公司
Application filed by 深圳晶泰科技有限公司
Priority to PCT/CN2020/113085
Publication of WO2022047677A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics

Definitions

  • the invention relates to a screening method, in particular to a drug molecule screening method and system.
  • the most widely used methods by researchers are based on autoencoder models, such as VAE and AAE models.
  • models based on the idea of adversarial networks can explore potential molecules similar to existing drug molecules.
  • their disadvantage is that the effectiveness and accuracy of the generated drug molecules are not high; at the same time, the generated potential drug molecules are too similar to the training-set molecules, which weakens diversity. As a result, the generated potential drug molecules offer little breakthrough beyond the existing field, or their drug-like properties are too weak for them to enter the real drug-testing stage.
  • the properties of drug molecules are rarely considered in existing AI models, and the input layer data of these models is relatively homogeneous, which reduces the effectiveness of the drug molecules they generate.
  • a drug molecule screening method comprising:
  • Preprocessing: collect molecular data of drugs related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties of each molecule, form structured data, and store it in a database;
  • Building and training a model: build and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures.
  • the model loss function is minimized through a gradient descent algorithm, iteratively updating the weight parameters of the neural network structures of the encoding layer and decoding layer;
  • Generating potential drug molecules: generate potential drug molecules for curing a specific disease based on the trained conditional variational autoencoder model.
  • the encoding vector is a SMILES encoding vector
  • the preprocessing includes: counting all the characters in the SMILES formulas, converting each character of a SMILES formula into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a specified dimension.
  • the calculation of the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the data of three indicators, molecular mass, lipid-water partition coefficient, and molecular topological polar surface area, are normalized and uniformly mapped to the range -1.0 to 1.0; the five drug physicochemical properties of each drug molecule form a 5-dimensional vector.
  • the SMILES formula data and the drug physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1; each SMILES formula is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing different drug physicochemical properties to form 125-dimensional vector data used as the model's input layer.
  • the AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer.
  • the encoding layer takes the output data of the input layer as input and outputs the data to the hidden layer
  • the encoding layer is an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes; the decoding layer takes the output data of the hidden layer as input and outputs to the output layer
  • the decoding layer is likewise an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes; a softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function

    H(y, p) = -∑_{i=1}^{K} y_i · log(p_i)

  • where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i; through the softmax layer, the probability distribution over the specific character categories is estimated for each position of the SMILES encoding vector, and finally, using the direct correspondence between one-hot values and specific coded characters established in preprocessing, the output sample is reconstructed and the SMILES formula is output.
  • the input layer generates the hidden layer through the encoding layer, forming the encoder; the hidden layer generates the output layer through the decoding layer, forming the decoder
  • the encoder converts the high-dimensional input into a low-dimensional latent vector; the loss function is

    Loss = -E[log P(X|z, c)] + D_KL(Q(z|X, c) ‖ P(z|c))

  • the loss function includes two parts: the first part is the reconstruction term, the (negative) log-likelihood of X under the probability distribution P(X'|z, c); the second part is the KL divergence, which represents the distance between Q(z|X, c) and P(z|c).
  • the model is constructed and trained based on TensorFlow. The training data set is used for model training, and the test data set is used to calculate the test-set error, i.e., the loss function, to prevent the model from overfitting. After a certain number of training epochs, the training data set error is compared with the test data set error: when the test data set error is basically unchanged and the decrease in the training data set error has weakened, the parameters of the encoding layer and decoding layer have been optimized to their best values, so training is stopped and the model is saved.
  • a drug molecule screening system comprising:
  • Preprocessing module: collect molecular data of drugs related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties of each molecule, form structured data, and store it in the database;
  • Model building and training module: build and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures.
  • the model loss function is minimized by the gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and decoding layer are iteratively updated to improve model training;
  • Potential drug molecule generation module: generate potential drug molecules for curing specific diseases based on the trained conditional variational autoencoder model.
  • the calculation of the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the above drug molecule screening method and system also utilize the drug physicochemical property data of compound molecules, because these properties are strongly correlated with whether a compound can finally be developed into a drug; compounds with poor physicochemical properties, or properties outside the required ranges, generally have an extremely low probability of druggability. The physicochemical properties of molecules are therefore taken into account in the model's input layer data, and the physicochemical properties of the molecules obtained after training are controlled within reasonable ranges to improve druggability. In this way, the effectiveness and accuracy of the drug molecules generated by the model are effectively improved, and the generated molecules can be made more diverse by regulating the numerical range of specific drug physicochemical properties.
  • the combination of the SMILES encoding vector and the molecule's drug physicochemical properties is used as the model's input layer, so that the drug molecules generated by the model are expected to show better values for these properties.
  • the property data are abstracted as the conditional vector of the input layer, which is directly introduced into the calculations of both the encoding layer and the decoding layer.
  • the input layer data is converted into the hidden layer encoding vector after passing through the encoder, and then the possible drug molecular structure is generated after passing through the decoding layer.
  • the encoding layer and decoding layer use the LSTM cyclic neural network structure.
  • the model loss function is minimized through the gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and decoding layer are iteratively updated to improve model training.
  • a drug molecule screening method based on the conditional variational autoencoder is adopted, which can effectively improve the effectiveness of the generated potential molecules.
  • the physicochemical property data of the target drug are introduced as a conditional vector, and the conditional variational autoencoder model is used for multivariate control.
  • the drug physicochemical properties of drug molecules are input into the encoding layer, and the conditional vectors are then processed in the decoding layer, so that the generated potential drug molecules perform better on these target physicochemical properties, improving the effectiveness and accuracy of the drug molecules.
  • FIG. 1 is a flowchart of a drug molecular screening method according to an embodiment of the present invention.
  • a drug molecule screening method includes:
  • Step S101, preprocessing: collect drug molecule data related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties, form structured data, and store it in a database;
  • Step S103, constructing and training a model: construct and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures. The model loss function is minimized by the gradient descent algorithm, iteratively updating the weight parameters of the neural network structures of the encoding layer and decoding layer to improve model training;
  • Step S105, generating potential drug molecules: generate potential drug molecules for curing the specific disease according to the trained conditional variational autoencoder model.
  • the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) formula encoding vector.
  • the preprocessing includes: counting all the characters in the SMILES formula, converting each character in the SMILES formula into a one-hot vector, and processing the SMILES formula data of each drug molecule into a coding vector with a set dimension.
  • the specific preprocessing process includes: calculating the SMILES-encoded vector of the drug molecule. First, count all the characters that appear in the SMILES formula, convert each character in the SMILES formula into a one-hot vector, and add an 'E' character at the end of the string to indicate the end.
  • the SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is less than this, the remaining positions are filled with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
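The counting, one-hot conversion, end-marker, and padding steps above can be sketched as follows (a minimal illustration; the helper names and the example character set are assumptions, not from the patent):

```python
import numpy as np

MAX_LEN = 120  # fixed SMILES encoding length used by the method

def build_charset(smiles_list):
    """Collect every character appearing in the SMILES data set,
    plus the 'E' end/padding marker."""
    chars = {"E"}
    for s in smiles_list:
        chars.update(s)
    return sorted(chars)

def encode_smiles(smiles, charset):
    """Append 'E' to mark the end, pad with 'E' up to 120 positions,
    and one-hot encode each character."""
    index = {c: i for i, c in enumerate(charset)}
    padded = (smiles + "E").ljust(MAX_LEN, "E")
    onehot = np.zeros((MAX_LEN, len(charset)), dtype=np.float32)
    for pos, ch in enumerate(padded):
        onehot[pos, index[ch]] = 1.0
    return onehot
```

Decoding reverses the mapping: an argmax over each of the 120 positions recovers the character, and everything from the first 'E' on is padding.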
  • SMILES (Simplified Molecular Input Line Entry Specification) is a specification that explicitly describes molecular structure with ASCII strings.
  • the collected drug molecule data set contains the molecular structural formula data of drug compounds, generally as SMI files, with each drug molecule represented in the SMILES format.
  • Calculating the drug physicochemical properties of drug molecules includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the methods provided in the RDKit software library are used, which mainly include:
  • HBD, molecular H-bond donors: rdkit.Chem.rdMolDescriptors.CalcNumHBD();
  • HBA, molecular H-bond acceptors: rdkit.Chem.rdMolDescriptors.CalcNumHBA();
  • TPSA, molecular topological polar surface area: rdkit.Chem.rdMolDescriptors.CalcTPSA().
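Assuming the RDKit library named above, all five properties can be computed in one pass. The `MolWt` and `MolLogP` calls are standard RDKit descriptors chosen here for illustration; the text only names the `rdMolDescriptors` functions explicitly:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors

def drug_properties(smiles):
    """Compute the five drug physicochemical properties used as the
    conditional vector: MW, LogP, HBD, HBA, and TPSA."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MW": Descriptors.MolWt(mol),             # molecular mass
        "LogP": Crippen.MolLogP(mol),             # lipid-water partition coefficient
        "HBD": rdMolDescriptors.CalcNumHBD(mol),  # H-bond donors
        "HBA": rdMolDescriptors.CalcNumHBA(mol),  # H-bond acceptors
        "TPSA": rdMolDescriptors.CalcTPSA(mol),   # topological polar surface area
    }
```

For ethanol (`CCO`), for example, the hydroxyl group yields one donor, one acceptor, and a TPSA of about 20.2 Å².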
  • the data of three indicators, namely the calculated molecular mass, lipid-water partition coefficient, and molecular topological polar surface area, are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the physicochemical properties of the drug form a 5-dimensional vector.
  • the SMILES data and the drug physicochemical property data are combined to form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1; each SMILES formula is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing different physicochemical properties to form 125-dimensional vector data used as the model's input layer.
  • the calculation results of the physicochemical properties of the drug can be obtained after statistical analysis: the numerical range of MW of all molecules in the data set is 0-500, the numerical range of LogP is 0-5, and the numerical range of TPSA is 0-150.
  • the data of the three indicators MW, LogP, and TPSA are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the numerical values of the molecular H-bond donor and the molecular H-bond acceptor are integers, which can be directly represented by their numerical values.
  • the molecular SMILES formula data and the 5 kinds of drug physicochemical property data together form the drug molecule data set.
  • the total data set is divided into training data set and test data set according to the ratio of 4:1.
  • the division process is random, that is, the training data set accounts for 80%, and the test data accounts for 20%.
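The 4:1 random split can be sketched with a seeded shuffle (a minimal illustration; the helper name and the fixed seed are assumptions):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split the total data set into a training set (80%)
    and a test set (20%), i.e. the 4:1 ratio described above."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```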
  • each drug molecule sample in the drug dataset has an encoded vector in SMILES format, and 5 vectors representing different drug physicochemical properties (ie, conditional vectors in this method).
  • the two data vectors are directly concatenated to form 125-dimensional vector data. All drug molecules in the drug data set are represented in this form, which serves as the model's input layer data.
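Using the value ranges reported below (MW 0-500, LogP 0-5, TPSA 0-150), the normalization and splicing steps might look like this sketch (function names are illustrative; HBD and HBA are used directly as integer counts, per the description):

```python
import numpy as np

# Ranges observed over the data set, per the statistical analysis in the text.
RANGES = {"MW": (0.0, 500.0), "LogP": (0.0, 5.0), "TPSA": (0.0, 150.0)}

def normalize(value, lo, hi):
    """Uniformly map value from [lo, hi] onto [-1.0, 1.0]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def condition_vector(props):
    """5-dimensional conditional vector: MW, LogP and TPSA normalized
    to [-1, 1]; HBD and HBA integer counts used directly."""
    return np.array([
        normalize(props["MW"], *RANGES["MW"]),
        normalize(props["LogP"], *RANGES["LogP"]),
        float(props["HBD"]),
        float(props["HBA"]),
        normalize(props["TPSA"], *RANGES["TPSA"]),
    ], dtype=np.float32)

def input_vector(smiles_vec_120, cond_5):
    """Concatenate the 120-dimensional SMILES encoding with the
    5-dimensional conditional vector into 125-dimensional input data."""
    return np.concatenate([smiles_vec_120, cond_5])
```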
  • compared with the VAE and AAE models, the Conditional Variational Autoencoder (CVAE) model proposed in this application also utilizes the drug physicochemical property data of compound molecules, because these properties are strongly related to whether a compound can finally be developed into a drug: compounds with poor physicochemical properties, or properties outside the required ranges, have a very low probability of druggability. The drug physicochemical properties of the molecules are therefore included in the model's input layer data, and after training, the physicochemical properties of the generated molecules are controlled within reasonable ranges, improving their druggability; in this way, the effectiveness and accuracy of the drug molecules generated by the model are effectively improved, and the generated molecules are more diverse.
  • this method calculates the drug physicochemical properties of the drug molecule data set in advance, mainly selecting five drug physicochemical properties: molecular mass (MW), lipid-water partition coefficient (LogP), number of molecular H-bond donors (HBD), number of molecular H-bond acceptors (HBA), and molecular topological polar surface area (TPSA). These five properties were selected mainly based on the following considerations:
  • the lipid-water partition coefficient LogP represents the logarithm of the ratio of a compound's concentrations in n-octanol (oil) and water, reflecting the distribution of the substance between the oil and water phases.
  • the number of H-bond donors (HBD) and the number of H-bond acceptors (HBA) represent a molecule's capacity to donate and accept intermolecular hydrogen bonds, and are generally used as basic indicators for evaluating the effectiveness of molecules.
  • molecular topological polar surface area (TPSA) is also a parameter commonly used in medicinal chemistry, defined as the total surface area of the polar atoms in a compound, mostly oxygen and nitrogen atoms, including the hydrogen atoms attached to them. In medicinal chemistry, polar surface area is a descriptive indicator for evaluating a drug's transport properties within cells; for a good druggable compound, the topological polar surface area should fall within a certain range of values.
  • the drug physicochemical property data of drug molecules are also treated as auxiliary condition data forming part of the model input layer data.
  • the advantage of this is that the physicochemical properties of the drugs in the dataset can be fully utilized, so that after the model is trained, the physicochemical properties of the drug molecules generated by the model can also be within a reasonable range, and the effectiveness of the drug molecules generated by the model can be improved.
  • the AI model structure includes: input layer, encoding layer, hidden layer, decoding layer, and output layer.
  • the encoding layer of this embodiment takes the output data of the input layer as input and outputs the data to the hidden layer.
  • the encoding layer is an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes. The decoding layer takes the output data of the hidden layer as input and outputs to the output layer; it is likewise an RNN structure comprising 3 recurrent neural network layers using LSTM units, each layer with 512 hidden nodes. A softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function

    H(y, p) = -∑_{i=1}^{K} y_i · log(p_i)

  • where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i. Through the softmax layer, the probability distribution over the specific character categories is estimated for each position of the SMILES encoding vector; finally, using the direct correspondence between one-hot values and specific coded characters established in preprocessing, the output sample is reconstructed and the SMILES formula is output.
  • the input layer data includes: the SMILES data of the drug molecule set and the 5 drug physicochemical property data of the drug molecule set.
  • Input layer: the preprocessed vector data of the drug molecule data set (SMILES encoding vector + conditional vector composed of drug physicochemical properties); that is, the input layer is X1, X2, ..., Xn, where each Xi is represented by a set of vector data.
  • Encoding layer: the input is the input layer and the output is the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM (Long Short-Term Memory) units, where each LSTM unit includes an input gate, a forget gate, and an output gate, and each layer has 512 hidden nodes.
  • Hidden layer: the hidden layer vector dimension is set to 200.
  • Decoding layer: the input is the hidden layer and the output is the output layer. The decoding layer is an RNN structure containing 3 recurrent neural network layers, also using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function adopts the cross-entropy function.
  • the cross-entropy function is as follows:

    H(y, p) = -∑_{i=1}^{K} y_i · log(p_i)

  • where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i; each position in the SMILES encoding vector thus yields a probability distribution over the specific character categories.
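The layer dimensions listed above (3 LSTM layers of 512 units in both encoder and decoder, a 200-dimensional hidden vector, a per-position softmax output) can be sketched in Keras. The charset size of 35 is an assumed placeholder, and the wiring is a simplified reading of the text rather than the patent's exact graph:

```python
import tensorflow as tf

CHARSET = 35   # assumed number of SMILES character categories
MAX_LEN = 120  # SMILES encoding length
COND = 5       # conditional vector of drug physicochemical properties
LATENT = 200   # hidden layer vector dimension
UNITS = 512    # hidden nodes per LSTM layer

# Encoder: input layer -> 3 LSTM layers -> mean and log-variance of Q(z|X, c).
enc_in = tf.keras.Input(shape=(MAX_LEN, CHARSET + COND))
h = tf.keras.layers.LSTM(UNITS, return_sequences=True)(enc_in)
h = tf.keras.layers.LSTM(UNITS, return_sequences=True)(h)
h = tf.keras.layers.LSTM(UNITS)(h)
z_mean = tf.keras.layers.Dense(LATENT)(h)
z_logvar = tf.keras.layers.Dense(LATENT)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_logvar])

# Decoder: hidden vector z concatenated with the conditional vector c
# -> 3 LSTM layers -> per-position softmax over the character categories,
# trained with the cross-entropy cost above.
dec_in = tf.keras.Input(shape=(LATENT + COND,))
d = tf.keras.layers.RepeatVector(MAX_LEN)(dec_in)
d = tf.keras.layers.LSTM(UNITS, return_sequences=True)(d)
d = tf.keras.layers.LSTM(UNITS, return_sequences=True)(d)
d = tf.keras.layers.LSTM(UNITS, return_sequences=True)(d)
dec_out = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(CHARSET, activation="softmax"))(d)
decoder = tf.keras.Model(dec_in, dec_out)
```

Note how the conditional vector enters both modules, as the text requires: appended to each input position for the encoder, and concatenated with the sampled hidden vector for the decoder.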
  • the input layer generates the hidden layer through the encoding layer, forming the encoder; the hidden layer generates the output layer through the decoding layer, forming the decoder. The encoder converts the high-dimensional input into a low-dimensional hidden vector.
  • the loss function includes two parts: the first part represents the log-likelihood of X under the probability distribution P(X'|z, c); the second part is the KL divergence, which represents the distance between Q(z|X, c) and P(z|c).
  • the AI model based on the conditional variational autoencoder in this embodiment is a CVAE model, and the CVAE consists of two modules, an encoder and a decoder.
  • the input layer passes through the encoding layer to generate the hidden layer as the encoder.
  • the hidden layer passes through the decoding layer to generate the output layer as the decoder.
  • the encoder converts the high-dimensional input into a low-dimensional latent vector Z, while the decoder restores the latent vector to an output that is as close to the input as possible.
  • the encoder maps the training sample X into two sets of parameters, which determine the conditional probability distribution Q(z|X, c) of the output z given the input X and the conditional vector c; the two sets of parameters represent the mean and the variance, respectively. The hidden vector z obeying this distribution is sampled from it, and the decoder maps the hidden vector z to a new set of parameters. The conditional vector c, which is associated with the sample X, is also used in the decoding layer: a probability distribution P(X'|z, c) based on z and c is determined, so that the probability distribution of the decoding layer is constrained by both the hidden vector and the conditional vector. Q(z|X, c) should be as close as possible to the standard normal distribution N(0, 1).
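A sampling step consistent with this description, in the "reparameterized" form commonly used with variational autoencoders (a sketch under that assumption, not taken verbatim from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(z_mean, z_logvar):
    """Sample the hidden vector z from Q(z|X, c), given the two sets of
    encoder outputs (mean and log-variance):
    z = mean + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_logvar) * eps
```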
  • the model uses the stochastic gradient descent algorithm to minimize the loss function value, and the process is to update and optimize the network structure parameters in the encoding layer and the decoding layer.
  • the loss function of the CVAE model (the loss function is the cost function) is:

    Loss = -E[log P(X|z, c)] + D_KL(Q(z|X, c) ‖ P(z|c))

  • the loss function consists of two parts: the first part is the reconstruction term, the (negative) log-likelihood of X under the probability distribution P(X'|z, c); the second part is the Kullback-Leibler (KL) divergence, which represents the distance between Q(z|X, c) and P(z|c).
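With the encoder producing the mean and log-variance of Q(z|X, c), and the prior taken as the standard normal N(0, 1) as described, the two loss terms can be written as this sketch (the reconstruction term is passed in abstractly as a log-likelihood value):

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_logvar):
    """Closed-form D_KL( N(mean, exp(logvar)) || N(0, 1) ):
    -0.5 * sum(1 + logvar - mean^2 - exp(logvar))."""
    return -0.5 * np.sum(1.0 + z_logvar - z_mean**2 - np.exp(z_logvar))

def cvae_loss(recon_log_likelihood, z_mean, z_logvar):
    """Loss = -E[log P(X|z, c)] + D_KL(Q(z|X, c) || P(z|c))."""
    return -recon_log_likelihood + kl_to_standard_normal(z_mean, z_logvar)
```

Minimizing this loss with gradient descent simultaneously improves reconstruction and pulls Q(z|X, c) toward the prior, matching the two-part description above.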
  • the model is constructed and trained based on TensorFlow. The training data set is used for model training, and the test data set is used to calculate the test-set error, i.e., the loss function, to prevent the model from overfitting. After a certain number of training epochs, the training data set error is compared with the test data set error: when the test data set error is basically unchanged and the decrease in the training data set error has weakened, the parameters of the encoding layer and decoding layer have been optimized to their best values, so training is stopped and the model is saved.
  • the construction and training of the model are carried out based on TensorFlow. The training data set is used for model training, and the test data set is used to calculate the test-set error (the loss function) to prevent model overfitting. After a certain number of training epochs, the training data set error and the test data set error are compared: when the test data set error is basically unchanged and the decrease in the training data set error has weakened, the model can be considered to have high accuracy on both the training and test data sets, and the parameters of the encoding layer and decoding layer have been optimized to their best values; training is therefore stopped and the model is saved. This is the final trained CVAE model.
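The stopping criterion described above (stop once the test-set error has essentially stopped improving) can be sketched as a generic loop; `train_epoch` and `test_loss` are hypothetical callbacks, and `patience`/`tol` are assumed tuning knobs not specified in the text:

```python
def train_with_early_stopping(train_epoch, test_loss,
                              max_epochs=500, patience=5, tol=1e-4):
    """Run training epochs while tracking the test data set error; stop
    when it has not improved by more than tol for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    history = []
    for epoch in range(max_epochs):
        train_epoch()
        loss = test_loss()
        history.append(loss)
        if loss < best_loss - tol:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # test error basically unchanged: stop and save the model
    return best_loss, history
```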
  • in the drug molecule screening method of this embodiment, the five drug physicochemical properties of each drug molecule are first calculated as selected, and the selection of these properties is directly related to the effectiveness of the drug molecule. The calculated drug physicochemical property values are then abstracted into vectors and used as conditional vectors.
  • the uniqueness of the conditional variational autoencoder model is that the conditional vector is additionally input.
  • the conditional vector contains 5 kinds of drug physicochemical properties rather than just one property, so multivariate control can be performed; the specific control process is implemented by the model structure and logic.
  • the hidden layer space is the core of the autoencoder method.
  • the advantage is that the encoder converts the high-dimensional input into a low-dimensional hidden layer vector, and the decoder then restores from the hidden layer vector an output as close to the input data as possible, thereby achieving the goal.
  • the drug physicochemical properties of drug molecules will be abstracted into conditional vectors, which will be used as the input layer data of the model and entered into the encoder.
  • the input of the decoder is the hidden layer vector, and the hidden layer vector contains the conditional vector, that is, the data information of the physicochemical properties of the drug. Therefore, the conditional vector data will also enter the network structure calculation in the decoder.
  • a drug molecule screening system comprising:
  • Preprocessing module: collect molecular data of drugs related to a specific disease, preprocess the data, calculate the encoding vector and related drug physicochemical properties of each molecule, form structured data, and store it in the database;
  • Model building and training module: build and train an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of each molecule as the model's input layer; the encoding layer of the model converts the input into a hidden layer encoding vector, and the decoding layer of the model then generates possible drug molecular structures.
  • the model loss function is minimized by the gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and decoding layer are iteratively updated to improve model training;
  • Potential drug molecule generation module: generate potential drug molecules for curing specific diseases based on the trained conditional variational autoencoder model.
  • the calculation of the drug physicochemical properties of the drug molecule in this embodiment includes one or more of: calculation of molecular mass, calculation of the lipid-water partition coefficient, calculation of the number of molecular H-bond donors, calculation of the number of molecular H-bond acceptors, and calculation of the molecular topological polar surface area.
  • the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) formula encoding vector.
  • the preprocessing module of this embodiment counts all the characters in the SMILES formulas, converts each character of a SMILES formula into a one-hot vector, and processes the SMILES formula data of each drug molecule into an encoding vector of a set dimension.
  • the specific preprocessing module includes: calculating the SMILES encoding vector of the drug molecule. First, count all the characters that appear in the SMILES formula, convert each character in the SMILES formula into a one-hot vector, and add an 'E' character at the end of the string to indicate the end.
  • the SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is less than this, the remaining positions are filled with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
  • Calculating the drug physicochemical properties of drug molecules includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
  • the data of three indicators, namely the calculated molecular mass, lipid-water partition coefficient, and molecular topological polar surface area, are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the physicochemical properties of the drug form a 5-dimensional vector.
  • the SMILES data and the drug physicochemical property data are combined to form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1; each SMILES formula is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing different physicochemical properties to form 125-dimensional vector data used as the model's input layer.
  • after statistical analysis of the calculated drug physicochemical properties: the MW of all molecules in the data set ranges from 0 to 500, LogP from 0 to 5, and TPSA from 0 to 150.
  • the data of the three indicators MW, LogP, and TPSA are normalized and uniformly mapped to the range -1.0 to 1.0.
  • the molecular H-bond donor and acceptor counts are integers, which can be represented directly by their numerical values.
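The normalization step above can be sketched as a min-max mapping of each indicator's statistical range onto [-1.0, 1.0]. The patent states only the ranges and the target interval; the linear mapping below is an assumption consistent with that description, and the sample property values are illustrative only.

```python
# Sketch of the normalization described above (assumed min-max linear mapping).
# Ranges come from the statistical analysis in the text: MW 0-500, LogP 0-5, TPSA 0-150.

RANGES = {"MW": (0.0, 500.0), "LogP": (0.0, 5.0), "TPSA": (0.0, 150.0)}

def normalize(name, value):
    """Map a raw property value linearly into [-1.0, 1.0]."""
    lo, hi = RANGES[name]
    return 2.0 * (value - lo) / (hi - lo) - 1.0

# Illustrative values only (roughly aspirin-like):
mw_n = normalize("MW", 180.0)    # -> -0.28
logp_n = normalize("LogP", 1.3)
tpsa_n = normalize("TPSA", 63.6)
```

H-bond donor and acceptor counts stay as plain integers, as the text states, so only MW, LogP, and TPSA pass through this mapping.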
  • the molecular SMILES data and the 5 kinds of drug physicochemical property data together form the drug molecule data set.
  • the total data set is randomly divided into a training data set and a test data set at a 4:1 ratio, i.e., the training data set accounts for 80% and the test data set for 20%.
  • each drug molecule sample in the drug data set has an encoded vector in SMILES format and 5 vectors representing different drug physicochemical properties (i.e., the conditional vectors in this method).
  • the two data vectors are directly concatenated to form a 125-dimensional vector; all drug molecules in the drug data set are represented in this form, which serves as the input-layer data of the model.
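The assembly of the 125-dimensional input vector and the 4:1 random split can be sketched as below. The data are placeholders, and the 120-dimensional SMILES vector is taken as given; the patent specifies only the concatenation and the split ratio.

```python
import random

# Sketch of forming the 125-dim model input (120-dim SMILES encoding + 5 property
# values) and the random 4:1 train/test split described above. Data is synthetic.

def make_input_vector(smiles_vec, prop_vec):
    """Concatenate a 120-dim SMILES encoding with 5 property values -> 125 dims."""
    assert len(smiles_vec) == 120 and len(prop_vec) == 5
    return smiles_vec + prop_vec

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle and split: 80% training, 20% test (the 4:1 ratio in the text)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

dataset = [make_input_vector([0.0] * 120, [0.1, 0.2, 0.3, 0.4, 0.5])
           for _ in range(100)]
train, test = split_dataset(dataset)
```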
  • the AI model structure includes: input layer, encoding layer, hidden layer, decoding layer, and output layer.
  • the encoding layer of this embodiment takes the output data of the input layer as its input and outputs to the hidden layer.
  • the encoding layer is an RNN structure with 3 recurrent neural network layers using LSTM units, each with 512 hidden nodes; the decoding layer takes the output of the hidden layer and outputs to the output layer, and is likewise an RNN structure with 3 recurrent layers of LSTM units, each with 512 hidden nodes; a softmax layer follows the decoding layer, and its cost function is the cross-entropy function
  • where K is the number of categories, y is the label, and p is the network output, i.e., the probability that the category is i; through the softmax layer, the probability distribution of each character category at each position of the SMILES encoding vector is estimated, and finally the output sample is reconstructed via the direct correspondence between one-hot values and specific encoded characters established in preprocessing, outputting the SMILES formula.
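The softmax and cross-entropy computation described above can be sketched for a single character position as follows. This is a pure-Python illustration of the standard formulas (L = -Σ y_i·log p_i over K categories), not the patent's implementation.

```python
import math

# Sketch of the per-position softmax and cross-entropy described above,
# where K is the number of character categories and y is a one-hot label.

def softmax(logits):
    """Convert raw scores into a probability distribution over K categories."""
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, y):
    """L = -sum_i y_i * log(p_i) for one-hot label y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

p = softmax([2.0, 1.0, 0.1])     # toy logits for K = 3 character categories
loss = cross_entropy(p, [1, 0, 0])  # true character is category 0
```

At generation time, taking the argmax (or sampling) over each position's distribution and mapping the index back through the one-hot character table recovers the SMILES character, as the text describes.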
  • the input layer generates a hidden layer through an encoding layer to form an encoder
  • the hidden layer generates an output layer through a decoding layer to form a decoder.
  • the encoder converts the high-dimensional input into a low-dimensional latent vector.
  • the loss function E[logP(X|z,c)] - D_KL[Q(z|X,c)||P(z|c)] of this embodiment includes two parts: the first part is the log-likelihood of P(X) under the probability distribution P(X'|z,c), characterizing the distance between the output of the encoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X,c) and its reference probability distribution N(0,1).
  • the model is constructed and trained with TensorFlow; during training, the training data set is used for model training while the test data set is used to compute the test-set error (the loss function) to prevent overfitting. After a certain number of training epochs, the training-set error is compared with the test-set error; when the test-set error is essentially unchanged and the decrease in the training-set error has weakened, the parameters of the model's encoding and decoding layers have been optimized to their best values, so training is stopped and the model is saved.
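The stopping criterion above — test error plateauing while the training error's decrease weakens — can be sketched as a small helper. The window size and tolerance are assumptions for illustration; the patent does not fix exact values.

```python
# Toy sketch of the early-stopping rule described above. The window and
# tolerance are illustrative assumptions, not values from the patent.

def should_stop(train_errors, test_errors, window=3, tol=1e-3):
    """Stop when test error is essentially flat and train error barely decreases."""
    if len(test_errors) < window + 1:
        return False
    test_flat = abs(test_errors[-1] - test_errors[-1 - window]) < tol
    train_slowing = (train_errors[-1 - window] - train_errors[-1]) < tol
    return test_flat and train_slowing

# Synthetic error histories: both curves have converged by the last epochs.
train_hist = [1.0, 0.5, 0.30, 0.299, 0.2989, 0.2988, 0.2988]
test_hist  = [1.1, 0.6, 0.40, 0.400, 0.4001, 0.4000, 0.4000]
stop = should_stop(train_hist, test_hist)
```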
  • the drug molecule screening method based on conditional variational autoencoder of the present invention comprises the following steps:
  • the structured drug molecule data is obtained through data collection; after analysis and processing, the SMILES encoding vector and the related drug physicochemical properties are calculated, and the structured data is formed and stored in the database; the data set is divided into a model training data set and a test data set at a certain ratio.
  • the drug physicochemical properties of drug molecules can be selected from indicators such as molecular mass, LogP, molecular H-bond donor and acceptor counts, and TPSA (molecular topological polar surface area).
  • Conditional variational autoencoder-based AI models were constructed and trained based on the aforementioned drug molecule datasets.
  • the main steps of establishing the model are: the combination of the SMILES encoding vector and the molecular drug physicochemical properties serves as the model's input layer, where the properties express the expectation that the drug molecules generated by the model will perform well on the above property indicators.
  • the selected drug physicochemical property data is abstracted as the conditional vector of the input layer, which will be directly introduced into the calculation in both the encoding layer and the decoding layer.
  • the input-layer data is converted into the hidden-layer encoding vector by the encoder, and possible drug molecular structures are then generated by the decoding layer.
  • the encoding and decoding layers use an LSTM recurrent neural network structure.
  • the model loss function is minimized through the gradient descent algorithm, and the weight parameters of the encoding- and decoding-layer network structures are continuously updated and iterated, improving model training.
  • the invention adopts a drug molecule screening method based on conditional variational autoencoder, which can effectively improve the effectiveness of the generated potential molecules.
  • the target drug physicochemical property data is introduced into the input layer as a conditional vector, and the conditional variational autoencoder model is used for multivariate control.
  • the drug physicochemical properties of drug molecules are input into the encoding layer, and the conditional vectors are then processed and manipulated in the decoding layer.
  • the generated potential drug molecules will perform better on these target drug physicochemical properties, which improves the effectiveness and accuracy of the drug molecules.
  • the data sets are selected from open-source databases such as PubChem, CCDC, etc.
  • pancreatic cancer is selected as an example, and related drug molecule data are collected and extracted from the open-source databases.
  • the SMILES formulas of the drug molecules serve as the initial data set, which then enters the data preprocessing stage to calculate the SMILES encoding vector of each drug molecule: first, all characters appearing in the SMILES formulas are counted, each character is converted into a one-hot vector, and an 'E' character is appended at the end of the string to indicate the end.
  • the SMILES encoding vector of each drug molecule is fixed at 120 dimensions, and the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
  • five relatively representative drug physicochemical properties are selected: molecular mass, LogP, molecular H-bond donor count, molecular H-bond acceptor count, and TPSA (molecular topological polar surface area); these properties are calculated for each drug molecule in the above data set and abstracted into a vector representation according to their numerical values, serving as the conditional vector input of the model.
  • an AI model based on conditional variational autoencoder is trained, wherein the model mainly includes input layer, encoding layer, hidden layer, decoding layer and output layer.
  • the conditional variation vector obtained by combining the molecular fingerprint vector and the drug properties is used as input-layer data and proceeds to the encoding layer.
  • the encoding and decoding layers use a recurrent neural network structure with LSTM units; the number of network layers is 3, and each layer has 512 hidden nodes.
  • the output vector produced by the decoding layer is finally converted into SMILES molecular encoding form.
  • the network structure parameters in the encoding and decoding layers are updated and optimized, and the decoding layer generates samples for reconstruction output, so the effectiveness of the drug molecules generated by the output layer is improved; they will also perform better on the target drug physicochemical properties, thus becoming potential drug molecules that can cure pancreatic cancer.
  • a certain number of potential drug molecules that can cure pancreatic cancer are generated.
  • the model can generate a large number of potential drug molecules; after property analysis, the generated data set can be further screened or ranked, and an appropriate number of highly effective drug molecules can be selected for subsequent drug computational and experimental procedures.
  • the generated drug molecule data set is used to calculate the five drug physicochemical properties proposed above.
  • the ranges of the five drug physicochemical properties are computed statistically.
  • screening mainly refers to filtering out generated molecules whose physicochemical properties clearly fall outside the ranges of the statistical data.
  • ranking refers to comprehensively considering the values of the five physicochemical properties to rank the molecules by drug-forming possibility; a certain number of generated molecules with the highest druggability can then be selected as follow-up research targets.
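The screening and ranking steps above can be sketched as follows. The property ranges reuse the statistical ranges stated earlier in the text (plus assumed ranges for HBD/HBA), and the combined scoring rule is a hypothetical stand-in — the patent says only that the five values are "comprehensively considered."

```python
# Sketch of screening (range filter) and ranking (combined score) of generated
# molecules. The score below is an illustrative assumption, not the patent's rule.

RANGES = {"MW": (0, 500), "LogP": (0, 5), "TPSA": (0, 150),
          "HBD": (0, 5), "HBA": (0, 10)}   # HBD/HBA ranges are assumed

def in_range(props):
    """Screening: keep molecules whose every property lies in its range."""
    return all(RANGES[k][0] <= v <= RANGES[k][1] for k, v in props.items())

def rank_score(props):
    """Hypothetical ranking score: normalized distance of each property from
    its range midpoint; lower is treated as more drug-like here."""
    total = 0.0
    for k, v in props.items():
        lo, hi = RANGES[k]
        total += abs(v - (lo + hi) / 2) / (hi - lo)
    return total

candidates = [
    {"MW": 320, "LogP": 2.1, "TPSA": 70, "HBD": 2, "HBA": 4},
    {"MW": 650, "LogP": 7.0, "TPSA": 200, "HBD": 6, "HBA": 12},  # out of range
]
kept = [c for c in candidates if in_range(c)]
ranked = sorted(kept, key=rank_score)
```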
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Abstract

A drug molecule screening method and system. The method comprises: collecting drug molecule data related to a specific disease, pre-processing the data, and calculating an encoding vector and a drug physicochemical property of the data; constructing and training an AI model based on a conditional variational auto-encoder, combining the encoding vector with the drug physicochemical property of molecules to serve as an input layer of the model, converting an encoding layer of the model into a hidden layer encoding vector, generating a possible drug molecular structure by means of a decoding layer of the model, minimizing, during a model training process, a model loss function by means of a gradient descent algorithm, and continuously updating and iterating weight parameters of neural network structures of the encoding layer and the decoding layer; and according to the trained model based on the conditional variational auto-encoder, generating potential drug molecules for curing the specific disease. According to the drug molecule screening method and system, drug physicochemical property data of compound molecules are also utilized, drug physicochemical properties have good correlation with whether a compound can finally be made into a drug, and the druggability is improved.

Description

Drug molecule screening method and system

Technical Field

The present invention relates to screening methods, and in particular to a drug molecule screening method and system.

Background Art
In the field of drug research and development, the traditional approach is computer-simulated screening followed by drug synthesis. With the rapid development of AI in medicine, researchers have begun applying various AI algorithm models to pharmaceutical R&D in order to shorten the long cycle of new drug development. At present, the target information of many diseases is unknown, which makes finding effective drug molecules in vast compound libraries extremely difficult and costly, while the fast computing power and innovative theoretical foundations of AI bring a new mode of research to the drug molecule screening process. For example, generative adversarial networks, convolutional neural networks, recurrent neural networks, and reinforcement learning have all been tried for drug molecule generation. These AI models can quickly find drug molecules similar to target molecules in large chemical libraries, greatly reducing the molecular search space while generating reasonably effective drug molecules for subsequent screening and experimental work.
Among current techniques that use AI models to generate drug molecules, the most widely used are autoencoder-based models such as VAE and AAE. Models based on the adversarial-network idea can discover potential drug molecules similar to existing ones, but the effectiveness and accuracy of the generated molecules are not high; moreover, the generated molecules tend to be homogeneous with the training-set molecules, weakening diversity. As a result, the generated candidate molecules offer little breakthrough over the existing field, or their low drug-property effectiveness makes it difficult for them to reach the real drug-trial stage. In addition, existing AI models rarely take drug molecular properties into account; the model's input-layer data is relatively one-dimensional, which reduces the effectiveness of the generated molecules.
SUMMARY OF THE INVENTION

Based on this, it is necessary to provide a drug molecule screening method that can improve druggability.

At the same time, a drug molecule screening system that can be optimized to improve druggability is provided.
A drug molecule screening method, comprising:

Preprocessing: collecting drug molecule data related to a specific disease, preprocessing the data, computing its encoding vector and related drug physicochemical properties, and forming structured data stored in a database;

Building and training a model: constructing and training an AI model based on a conditional variational autoencoder, with the combination of the encoding vector and the molecule's drug physicochemical properties as the model's input layer; the model's encoding layer converts this into a hidden-layer encoding vector, and the decoding layer then generates possible drug molecular structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the neural network structures of the encoding and decoding layers;

Generating potential drug molecules: according to the trained conditional variational autoencoder model, generating potential drug molecules for curing the specific disease.
In a preferred embodiment, the encoding vector is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector, and the preprocessing includes: counting all characters appearing in the SMILES strings, converting each character into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a set dimension.

In a preferred embodiment, calculating the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.

In a preferred embodiment, the data of the three indicators (molecular mass, lipid-water partition coefficient, and molecular topological polar surface area) are normalized and uniformly mapped to the range -1.0 to 1.0, and the 5 drug physicochemical properties of each molecule form a 5-dimensional vector.

In a preferred embodiment, the SMILES data and the drug physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a 4:1 ratio; each SMILES entry is processed into a 120-dimensional encoding vector and concatenated with the 5 vectors representing different drug physicochemical properties to form a 125-dimensional vector, which serves as the model's input layer.
In a preferred embodiment, the AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. The encoding layer takes the output of the input layer and outputs to the hidden layer; it is an RNN structure with 3 recurrent neural network layers using LSTM units, each with 512 hidden nodes. The decoding layer takes the output of the hidden layer and outputs to the output layer; it is likewise an RNN structure with 3 recurrent layers of LSTM units, each with 512 hidden nodes. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function
L = -Σ_{i=1}^{K} y_i · log(p_i)
where K is the number of categories, y is the label, and p is the network output, i.e., the probability that the category is i. Through the softmax layer, the probability distribution of each character category at each position of the SMILES encoding vector is estimated; finally, using the direct correspondence between one-hot values and specific encoded characters established in data preprocessing, the output sample is reconstructed and the SMILES formula is output.
In a preferred embodiment, the input layer passes through the encoding layer to generate the hidden layer, forming the encoder; the hidden layer passes through the decoding layer to generate the output layer, forming the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector. The loss function is E[logP(X|z,c)] - D_KL[Q(z|X,c)||P(z|c)]

The loss function includes two parts: the first part is the log-likelihood of P(X) under the probability distribution P(X'|z,c), characterizing the distance between the output of the encoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X,c) and its reference probability distribution N(0,1).
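The KL-divergence term above has a well-known closed form when Q(z|X,c) is a diagonal Gaussian N(μ, σ²) and the reference is N(0, 1): D_KL = -0.5 · Σ(1 + log σ² - μ² - σ²). This closed form is a standard VAE result, not stated explicitly in the patent; the sketch below illustrates it.

```python
import math

# Closed-form KL divergence D_KL(N(mu, sigma^2) || N(0, 1)) for a diagonal
# Gaussian posterior, as used in standard (C)VAE training. Standard result,
# shown here for illustration; the patent leaves the derivation implicit.

def kl_to_standard_normal(mu, log_var):
    """-0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2) over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

kl = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # Q == N(0,1), so KL = 0
```

The term vanishes exactly when the posterior equals the prior and grows as μ moves from 0 or σ² from 1, which is the regularizing pressure described in the text.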
In a preferred embodiment, the model is built and trained with TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to compute the test-set error (the loss function) to prevent overfitting. After a certain number of training epochs, the training-set error is compared with the test-set error; when the test-set error is essentially unchanged and the decrease in the training-set error has weakened, the parameters of the model's encoding and decoding layers have been optimized to their best values, so training is stopped and the model is saved.
A drug molecule screening system, comprising:

a preprocessing module: collecting drug molecule data related to a specific disease, preprocessing the data, computing its encoding vector and related drug physicochemical properties, and forming structured data stored in a database;

a model building and training module: constructing and training an AI model based on a conditional variational autoencoder, with the combination of the encoding vector and the molecule's drug physicochemical properties as the model's input layer; the encoding layer converts this into a hidden-layer encoding vector, and the decoding layer then generates possible drug molecular structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the neural network structures of the encoding and decoding layers for better training;

a potential drug molecule generation module: according to the trained conditional variational autoencoder model, generating potential drug molecules for curing the specific disease.

In a preferred embodiment, calculating the drug physicochemical properties of a drug molecule includes one or more of: molecular mass, lipid-water partition coefficient, number of molecular H-bond donors, number of molecular H-bond acceptors, and molecular topological polar surface area.
The above drug molecule screening method and system also make use of the drug physicochemical property data of compound molecules, because these properties correlate strongly with whether a compound can ultimately become a drug; compounds with poor physicochemical properties, or properties outside the required ranges, have an extremely low probability of druggability. Therefore, the drug physicochemical properties of the molecules are incorporated into the model's input-layer data, and the physicochemical properties of the molecules obtained after model training are controlled within reasonable ranges, improving their druggability. This effectively improves the effectiveness and accuracy of the drug molecules produced by the model; by adjusting the numerical ranges of specific physicochemical property indicators, the generated molecules can also be made more diverse.

In addition, the SMILES encoding vector and the molecular drug physicochemical properties are combined as the model's input layer, where the properties express the expectation that the molecules generated by the model will perform well on the above property indicators. The selected property data are abstracted as the conditional vector of the input layer, which is directly introduced into the computation in both the encoding and decoding layers. The input-layer data is converted into the hidden-layer encoding vector by the encoder, and possible drug molecular structures are then generated by the decoding layer. The encoding and decoding layers use LSTM recurrent neural network structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the encoding- and decoding-layer networks for better training.

Furthermore, this drug molecule screening method based on a conditional variational autoencoder can effectively improve the effectiveness of the generated candidate molecules. Target drug physicochemical property data are introduced into the input layer as a conditional vector, and the conditional variational autoencoder model is used for multivariate control: besides exploiting the latent space, the drug physicochemical properties are fed into the encoding layer, and the conditional vector is then processed and manipulated in the decoding layer, so the generated candidate molecules perform better on these target properties, improving the effectiveness and accuracy of the drug molecules.
Description of Drawings

FIG. 1 is a flowchart of a drug molecule screening method according to an embodiment of the present invention.

Detailed Description

As shown in FIG. 1, a drug molecule screening method according to an embodiment of the present invention includes:
Step S101, preprocessing: collecting drug molecule data related to a specific disease, preprocessing the data, computing its encoding vector and related drug physicochemical properties, and forming structured data stored in a database;

Step S103, building and training a model: constructing and training an AI model based on a conditional variational autoencoder, with the combination of the encoding vector and the molecule's drug physicochemical properties as the model's input layer; the encoding layer converts this into a hidden-layer encoding vector, and the decoding layer then generates possible drug molecular structures; during training, the model loss function is minimized by gradient descent, continuously updating the weight parameters of the encoding- and decoding-layer networks for better training;

Step S105, generating potential drug molecules: according to the trained conditional variational autoencoder model, generating potential drug molecules for curing the specific disease.
Further, the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector. Preprocessing includes: counting all characters in the SMILES strings, converting each character into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a set dimension.

The specific preprocessing process includes computing the SMILES encoding vector of each drug molecule. First, all characters appearing in the SMILES strings are counted, each character is converted into a one-hot vector, and an 'E' character is appended at the end of the string to mark termination. The SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is less than this value, the remaining positions are filled with the one-hot value of 'E', so the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.

SMILES (Simplified Molecular Input Line Entry Specification) is a specification that unambiguously describes molecular structure with ASCII strings.

The collected drug molecule data set consists of molecular structural formula data of drug compounds, generally SMI files, with each drug molecule represented in SMILES format; for example, the SMILES formula of aspirin is CC(=O)OC1=CC=CC=C1C(=O)O.
Calculating the physicochemical properties of the drug molecules includes one or more of: calculating the molecular weight, the lipid-water partition coefficient, the number of H-bond donors, the number of H-bond acceptors, and the topological polar surface area.
Specifically, the physicochemical properties of the drug molecules are calculated with methods provided by the RDKit software library, mainly including:
Molecular weight (MW): rdkit.Chem.Descriptors.ExactMolWt();
Lipid-water partition coefficient (LogP): rdkit.Chem.Crippen.MolLogP();
Number of H-bond donors (HBD): rdkit.Chem.rdMolDescriptors.CalcNumHBD();
Number of H-bond acceptors (HBA): rdkit.Chem.rdMolDescriptors.CalcNumHBA();
Topological polar surface area (TPSA): rdkit.Chem.rdMolDescriptors.CalcTPSA().
Further, the data of the molecular weight, lipid-water partition coefficient, and topological polar surface area indicators are normalized and mapped uniformly into the range -1.0 to 1.0, and the 5 physicochemical properties of each drug molecule form a 5-dimensional vector.
Further, the SMILES data and the physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1. Each SMILES entry is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing the different physicochemical properties to form a 125-dimensional vector that serves as the input layer of the model.
Specifically, statistical analysis of the computed physicochemical properties shows that, for all molecules in the data set, MW ranges from 0 to 500, LogP from 0 to 5, and TPSA from 0 to 150. The data of the MW, LogP, and TPSA indicators are therefore normalized and mapped uniformly into the range -1.0 to 1.0. The numbers of H-bond donors and H-bond acceptors are integers and are used directly as their numerical values. Through this logic, the 5 physicochemical properties of each drug molecule are abstracted into a 5-dimensional vector.
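The normalization above can be sketched as follows. The text does not give the exact mapping formula, so a linear min-max mapping into [-1.0, 1.0] is assumed, using the stated value ranges:

```python
# Normalize MW, LogP and TPSA into [-1.0, 1.0] (assumed linear min-max mapping)
# and combine them with the integer HBD/HBA counts into a 5-dimensional vector.

RANGES = {"MW": (0.0, 500.0), "LogP": (0.0, 5.0), "TPSA": (0.0, 150.0)}

def to_unit_range(value, lo, hi):
    # Map value from [lo, hi] linearly into [-1.0, 1.0].
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def property_vector(mw, logp, hbd, hba, tpsa):
    return [
        to_unit_range(mw, *RANGES["MW"]),
        to_unit_range(logp, *RANGES["LogP"]),
        float(hbd),  # donor/acceptor counts are integers, used directly
        float(hba),
        to_unit_range(tpsa, *RANGES["TPSA"]),
    ]

# Aspirin with approximate property values: MW ~180.04, LogP ~1.31,
# HBD = 1, HBA = 3, TPSA ~63.6.
vec = property_vector(180.04, 1.31, 1, 3, 63.6)
print(len(vec))  # 5
```

The range endpoints map to -1.0 and 1.0 respectively, so all three normalized indicators stay inside the stated interval.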
The molecular SMILES data and the 5 kinds of physicochemical property data above together form the drug molecule data set. The total data set is divided into a training data set and a test data set at a ratio of 4:1; the division is random, i.e. the training data set accounts for 80% and the test data for 20%.
After data preprocessing, each drug molecule sample in the drug data set has one SMILES-format encoding vector and 5 values representing the different physicochemical properties (i.e. the condition vector of this method). For each drug molecule sample, the two data vectors are directly concatenated to form one 125-dimensional vector; all drug molecules in the drug data set are represented in this form, which serves as the input-layer data of the model.
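The random 4:1 split and the vector concatenation described above can be sketched as follows (shapes simplified; each sample here is already an encoded pair of vectors, and the seed is an illustrative choice):

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    # Randomly divide the total data set into training (80%) and test (20%) parts.
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def input_vector(smiles_vec, condition_vec):
    # Concatenate the 120-dim SMILES encoding with the 5-dim condition vector
    # to obtain the 125-dim input-layer vector.
    return smiles_vec + condition_vec

samples = [([0.0] * 120, [0.1] * 5) for _ in range(100)]
train, test = split_dataset(samples)
x = input_vector(*samples[0])
print(len(train), len(test), len(x))  # 80 20 125
```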
Compared with the VAE and AAE models, the conditional variational autoencoder (CVAE) model proposed in this application also exploits the physicochemical property data of the compound molecules, because the physicochemical properties of a compound are strongly correlated with whether it can ultimately become a drug: compounds with poor physicochemical properties, or whose properties fall outside the required ranges, generally have an extremely low probability of druggability. The physicochemical properties of the molecules are therefore included in the input-layer data of the model, and the model is trained so that the physicochemical properties of the molecules it produces remain within reasonable ranges, improving their druggability. This effectively improves the validity and accuracy of the drug molecules generated by the model; by adjusting the target value ranges of specific physicochemical property indicators, the generated molecules can also be made more diverse.
For this reason, the method computes the physicochemical properties of the drug molecule data set in advance, mainly selecting 5 properties: molecular weight (MW), lipid-water partition coefficient (LogP), number of H-bond donors (HBD), number of H-bond acceptors (HBA), and topological polar surface area (TPSA). These 5 physicochemical properties were chosen mainly for the following reasons:
1. Molecular weight is the most basic descriptor of a molecule.
2. The lipid-water partition coefficient LogP is the logarithm of the ratio of the partition coefficients of the compound in n-octanol (oil) and in water, reflecting how the substance partitions between the oil and water phases. The larger the LogP value, the more lipophilic the substance; conversely, the smaller the value, the more hydrophilic the substance, i.e. the better its water solubility. Water solubility is a key indicator of druggability.
3. The number of H-bond donors (HBD) and the number of H-bond acceptors (HBA) represent the numbers of donors and acceptors available for hydrogen bonding between molecules, and are also commonly used as basic indicators for evaluating druggability.
4. The topological polar surface area (TPSA) is another parameter commonly used in medicinal chemistry. It is defined as the total surface area of the polar atoms in a compound, mostly oxygen and nitrogen atoms, including the hydrogen atoms attached to them. In medicinal chemistry applications, polar surface area is a descriptor for evaluating the intracellular transport properties of a drug. For a compound with good druggability, its topological polar surface area should lie within a certain range of values.
The method therefore also takes the physicochemical property data of the drug molecules as auxiliary condition data forming part of the model's input-layer data. The benefit is that the physicochemical property data in the data set is fully exploited, so that after training the physicochemical properties of the drug molecules generated by the model also fall within reasonable ranges, improving the druggability of the generated molecules.
The AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In this embodiment, the encoding layer takes the output data of the input layer as its input and outputs to the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. The decoding layer takes the output data of the hidden layer as its input and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function

H(y, p) = -Σ_{i=1}^{K} y_i · log(p_i)

where K is the number of categories, y is the label, and p is the output of the network, i.e. the probability that the category is i. Through the softmax layer, the probability distribution over character categories at each position of the SMILES encoding vector is estimated; finally, through the direct correspondence between one-hot values and specific characters established in data preprocessing, the output sample is reconstructed and the SMILES string is output.
The input-layer data includes: the SMILES data of the drug molecule set and the 5 physicochemical property data of the drug molecule set. Specifically, the input layer consists of the preprocessed vector data of the drug molecule data set (the SMILES encoding vector plus the condition vector composed of the physicochemical properties), i.e. the input layer is X1, X2, ..., Xn, where each Xn is represented by one such set of vector data.
Encoding layer: the input is the input layer and the output is the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM (Long Short-Term Memory) units, each of which contains an input gate, a forget gate, and an output gate, with 512 hidden nodes per layer.
Hidden layer: the hidden-layer vector dimension is set to 200.
Decoding layer: the input is the hidden layer and the output is the output layer. The decoding layer is an RNN structure containing 3 recurrent neural network layers, also using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function:

H(y, p) = -Σ_{i=1}^{K} y_i · log(p_i)

where K is the number of categories, y is the label, and p is the output of the network, i.e. the probability that the category is i.

Through the softmax layer, the probability distribution over character categories at each position of the SMILES encoding vector is estimated; finally, through the direct correspondence between one-hot values and specific characters established in data preprocessing, the output sample is reconstructed and the SMILES string is output.
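The softmax and cross-entropy computation at one character position can be sketched as:

```python
import math

def softmax(logits):
    # Convert raw decoder outputs into a probability distribution over K classes.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, probs):
    # H(y, p) = -sum_i y_i * log(p_i); for a one-hot label this is -log(p_true).
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)

K = 4
probs = softmax([1.0, 1.0, 1.0, 1.0])      # uniform logits give a uniform distribution
loss = cross_entropy([0, 1, 0, 0], probs)  # equals log(K) for a uniform distribution
print(round(loss, 6))  # 1.386294
```

In the model this loss is summed over the 120 character positions of the SMILES encoding vector, one K-way classification per position.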
Output layer: the SMILES formula.
The input layer passing through the encoding layer to produce the hidden layer forms the encoder; the hidden layer passing through the decoding layer to produce the output layer forms the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector.
The loss function of this embodiment is

E[log P(X|z, c)] - D_KL[Q(z|X, c) || P(z|c)]

This loss function comprises two parts. The first part is the log-likelihood of X under the probability distribution P(X'|z, c), characterizing the distance between the reconstructed output of the decoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
The AI model based on a conditional variational autoencoder in this embodiment is the CVAE model, which consists of two modules, an encoder and a decoder. The input layer passing through the encoding layer to produce the hidden layer is the encoder; the hidden layer passing through the decoding layer to produce the output layer is the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector Z, while the decoder restores the latent vector to an output as close to the input as possible. During training, the training samples X and the control conditions are input, and the parameters of the combined encoder-decoder network are optimized by gradient descent.
First, the encoder maps a training sample X into two sets of parameters, which determine a conditional probability distribution Q(z|X, c) of the output z given the input X and the condition vector c; assuming this is a normal distribution, the two sets of parameters are its mean and variance. A latent vector z obeying this distribution is sampled from it, and the decoder maps z to a new set of parameters. During this mapping, the condition vector c, which is associated with sample X, is also used in the decoding layer, determining a conditional probability distribution P(X'|z, c) based on z and c, so that the probability distribution of the decoding layer is constrained by both the latent vector and the condition vector. To make the distance between the generated X' and the input X as small as possible while preserving the generative ability of the model, Q(z|X, c) should be kept as close as possible to the standard normal distribution N(0, 1). The model minimizes the loss function value with a stochastic gradient descent algorithm; this process updates and optimizes the network structure parameters in the encoding layer and the decoding layer.
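The sampling step described above is typically implemented with the reparameterization trick (not spelled out in the text; assumed here), which keeps the sampling differentiable with respect to the encoder outputs:

```python
import math
import random

def sample_latent(mu, log_var, rng):
    # z = mu + sigma * eps, with eps ~ N(0, 1): draws a latent vector obeying the
    # encoder's Gaussian Q(z|X, c) while keeping the path from mu and log_var to z
    # differentiable (the reparameterization trick).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
z = sample_latent([0.0] * 200, [0.0] * 200, rng)  # 200-dim latent, as in the hidden layer
print(len(z))  # 200
```

With a very small variance the sample collapses onto the mean, which is one way to check the implementation.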
The loss function (the loss function is the cost function) Loss of the CVAE model is:

E[log P(X|z, c)] - D_KL[Q(z|X, c) || P(z|c)]

This loss function comprises two parts. The first part is the log-likelihood of X under the probability distribution P(X'|z, c), characterizing the distance between the reconstructed output of the decoding layer and the input sample X. The second part is the Kullback-Leibler (KL) divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
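When Q(z|X, c) is a diagonal Gaussian N(μ, σ²) and the reference distribution is the standard normal N(0, 1), the KL term above has a well-known closed form; a minimal sketch under that assumption:

```python
import math

def kl_to_standard_normal(mu, log_var):
    # Closed form of D_KL[N(mu, diag(sigma^2)) || N(0, I)], summed over dimensions:
    # -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# The KL term vanishes when Q equals the standard normal reference...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ...and grows as the encoder's distribution drifts away from N(0, 1).
print(kl_to_standard_normal([1.0, 1.0], [0.0, 0.0]))  # 1.0
```

Minimizing this term is what pulls Q(z|X, c) toward N(0, 1), as the text requires for preserving the generative ability of the model.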
Further, in this embodiment, the model is built and trained with TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to compute the test-set error, i.e. the loss function, to prevent overfitting. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease of the training-set error has levelled off, the encoding-layer and decoding-layer parameters of the model have been optimized to their best values, so training is stopped and the model is saved.
Specifically, the model construction and training process is as follows. According to the model structure above, the model is built and trained with TensorFlow; during training, the training data set is used for model training while the test data set is used to compute the test-set error (loss function) to prevent overfitting. At the start of training both errors are large and drop rapidly. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease of the training-set error has levelled off, the model can be considered to have high accuracy on both the training data set and the test data set, and the encoding-layer and decoding-layer parameters have been optimized to their best values. Training is therefore stopped and the model is saved, yielding the final trained CVAE model.
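The stopping criterion described above (test error flat, training error no longer decreasing meaningfully) can be sketched as a simple check over recorded per-epoch losses; the window and tolerance values are illustrative assumptions:

```python
def should_stop(train_losses, test_losses, window=3, tol=1e-3):
    # Stop when, over the last `window` epochs, the test loss is essentially
    # unchanged and the training loss is no longer decreasing meaningfully.
    if len(train_losses) < window + 1 or len(test_losses) < window + 1:
        return False
    test_flat = abs(test_losses[-1] - test_losses[-1 - window]) < tol
    train_flat = (train_losses[-1 - window] - train_losses[-1]) < tol
    return test_flat and train_flat

# Early in training both errors are still dropping quickly: keep going.
print(should_stop([2.0, 1.0, 0.5, 0.3], [2.1, 1.1, 0.6, 0.4]))  # False
# Later both curves have levelled off: stop and save the model.
print(should_stop([0.30, 0.30, 0.30, 0.30], [0.40, 0.40, 0.40, 0.40]))  # True
```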
In the drug molecule screening method of this embodiment, the 5 selected physicochemical properties of the drug molecules are first computed; each of these properties is directly correlated with the druggability of the drug molecules. The computed physicochemical property values are then abstracted into a vector that serves as the condition vector.
What makes the conditional variational autoencoder model distinctive is the additional condition vector input. In this method the condition vector contains 5 physicochemical properties rather than a single property, so multivariate control is possible; the specific control process is realized by the model structure and logic.
The latent (hidden-layer) space is the core of the autoencoder approach. Its advantage is that the encoder converts the high-dimensional input into a low-dimensional hidden-layer vector, and the decoder then restores the hidden-layer vector into an output as close as possible to the input data, thereby achieving the goal.
The physicochemical properties of the drug molecules are abstracted into the condition vector, which enters the encoder as part of the model's input-layer data.
The input of the decoder is the hidden-layer vector, which contains the condition vector, i.e. the physicochemical property information; the condition vector data therefore also enters the network computation in the decoder.
A drug molecule screening system, comprising:
a preprocessing module, which collects drug molecule data related to a specific disease, preprocesses the data, computes its encoding vectors and the related physicochemical properties, and forms structured data to be stored in a database;
a model construction and training module, which builds and trains an AI model based on a conditional variational autoencoder, taking the combination of the encoding vector and the physicochemical properties of the molecule as the input layer of the model, converting it into a hidden-layer encoding vector through the encoding layer of the model, and then generating possible drug molecular structures through the decoding layer of the model; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and the decoding layer are iteratively updated so that the model trains better;
a potential drug molecule generation module, which, based on the trained conditional variational autoencoder model, generates potential drug molecules for curing the specific disease.
Further, in this embodiment, calculating the physicochemical properties of the drug molecules includes one or more of: calculating the molecular weight, the lipid-water partition coefficient, the number of H-bond donors, the number of H-bond acceptors, and the topological polar surface area.
Further, the encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector.
Further, the preprocessing module of this embodiment enumerates all characters that appear in the SMILES strings, converts each character in the SMILES strings into a one-hot vector, and processes the SMILES data of each drug molecule into an encoding vector of a set dimension.
Specifically, the preprocessing module computes the SMILES encoding vector of each drug molecule. First, all characters appearing in the SMILES strings are enumerated, each character in a SMILES string is converted into a one-hot vector, and an 'E' character is appended to the end of the string to mark its termination. The SMILES encoding vector of each drug molecule is fixed at 120 dimensions; when the number of characters is smaller than this value, the remaining positions are padded with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector.
Calculating the physicochemical properties of the drug molecules includes one or more of: calculating the molecular weight, the lipid-water partition coefficient, the number of H-bond donors, the number of H-bond acceptors, and the topological polar surface area.
Further, the data of the molecular weight, lipid-water partition coefficient, and topological polar surface area indicators are normalized and mapped uniformly into the range -1.0 to 1.0, and the 5 physicochemical properties of each drug molecule form a 5-dimensional vector.
Further, the SMILES data and the physicochemical property data together form the total drug molecule data set, which is randomly divided into a training data set and a test data set at a ratio of 4:1. Each SMILES entry is processed into a 120-dimensional encoding vector, which is concatenated with the 5 values representing the different physicochemical properties to form a 125-dimensional vector that serves as the input layer of the model.
Specifically, statistical analysis of the computed physicochemical properties shows that, for all molecules in the data set, MW ranges from 0 to 500, LogP from 0 to 5, and TPSA from 0 to 150. The data of the MW, LogP, and TPSA indicators are therefore normalized and mapped uniformly into the range -1.0 to 1.0. The numbers of H-bond donors and H-bond acceptors are integers and are used directly as their numerical values. Through this logic, the 5 physicochemical properties of each drug molecule are abstracted into a 5-dimensional vector.
The molecular SMILES data and the 5 kinds of physicochemical property data above together form the drug molecule data set. The total data set is divided into a training data set and a test data set at a ratio of 4:1; the division is random, i.e. the training data set accounts for 80% and the test data for 20%.
After data preprocessing, each drug molecule sample in the drug data set has one SMILES-format encoding vector and 5 values representing the different physicochemical properties (i.e. the condition vector of this method). For each drug molecule sample, the two data vectors are directly concatenated to form one 125-dimensional vector; all drug molecules in the drug data set are represented in this form, which serves as the input-layer data of the model.
The AI model structure includes: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In this embodiment, the encoding layer takes the output data of the input layer as its input and outputs to the hidden layer. The encoding layer is an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. The decoding layer takes the output data of the hidden layer as its input and outputs to the output layer; it is likewise an RNN structure containing 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer. A softmax layer follows the decoding layer, and its cost function is the cross-entropy function

H(y, p) = -Σ_{i=1}^{K} y_i · log(p_i)

where K is the number of categories, y is the label, and p is the output of the network, i.e. the probability that the category is i. Through the softmax layer, the probability distribution over character categories at each position of the SMILES encoding vector is estimated; finally, through the direct correspondence between one-hot values and specific characters established in data preprocessing, the output sample is reconstructed and the SMILES string is output.
The input layer passing through the encoding layer to produce the hidden layer forms the encoder; the hidden layer passing through the decoding layer to produce the output layer forms the decoder. The encoder converts the high-dimensional input into a low-dimensional latent vector.
The loss function of this embodiment is E[log P(X|z, c)] - D_KL[Q(z|X, c) || P(z|c)]. This loss function comprises two parts. The first part is the log-likelihood of X under the probability distribution P(X'|z, c), characterizing the distance between the reconstructed output of the decoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X, c) and its reference probability distribution N(0, 1).
Further, in this embodiment, the model is built and trained with TensorFlow according to the model structure. During training, the training data set is used for model training while the test data set is used to compute the test-set error, i.e. the loss function, to prevent overfitting. After a certain number of training epochs, the training-set error and the test-set error are compared; when the test-set error is essentially unchanged and the decrease of the training-set error has levelled off, the encoding-layer and decoding-layer parameters of the model have been optimized to their best values, so training is stopped and the model is saved.
The drug molecule screening method based on a conditional variational autoencoder of the present invention comprises the following steps.
Collecting a drug molecule data set related to a specific disease and performing data preprocessing. Structured data of the drug molecules is obtained through data collection; after data analysis and processing, the SMILES encoding vectors and the related physicochemical properties are computed, and the resulting structured data is stored in a database. The data set is divided into a model training data set and a test data set at a certain ratio. The physicochemical properties of the drug molecules may be chosen from indicator properties such as molecular weight, LogP, the numbers of molecular H-bond donors and acceptors, and TPSA (topological polar surface area).
Building and training an AI model based on a conditional variational autoencoder from the drug molecule data set above. The main steps of building the model are: taking the combination of the SMILES encoding vector and the physicochemical properties of the molecule as the input layer of the model, where the physicochemical properties are those in which the drug molecules generated by the model are expected to perform well; and abstracting the selected physicochemical property data as the condition vector of the input layer, which is introduced directly into the computation in both the encoding layer and the decoding layer. The input-layer data is converted into the hidden-layer encoding vector through the encoder, and possible drug molecular structures are then generated through the decoding layer. The encoding layer and the decoding layer use LSTM recurrent neural network structures; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and the decoding layer are iteratively updated so that the model trains better.
Using the trained conditional variational autoencoder model, potential drug molecules capable of curing the specific disease are generated and passed to the subsequent computational and experimental drug-discovery workflows.
The invention adopts a drug molecule screening method based on a conditional variational autoencoder, which effectively improves the validity of the generated candidate molecules. Target drug physicochemical property data are introduced into the input layer as a conditional vector, and the conditional variational autoencoder is used for multivariate control: in addition to exploiting the latent space, the physicochemical properties of the drug molecules are fed into the encoding layer, and the conditional vector is then processed and manipulated in the decoding layer. The generated candidate molecules therefore perform better on these target physicochemical properties, improving the validity and accuracy of the generated drug molecules.
A specific embodiment of the conditional-variational-autoencoder-based drug molecule screening method of the present invention is as follows:
A dataset of drug molecules related to a specific disease is collected; the dataset is drawn from open-source databases such as PubChem and CCDC. This embodiment takes pancreatic cancer as an example: the SMILES formulas of drug molecules that currently have a therapeutic effect on, or potential relevance to, pancreatic cancer are extracted from the open-source databases to form the initial dataset, which then enters the data preprocessing stage, where the SMILES encoding vector of each drug molecule is computed. First, all characters appearing in the SMILES strings are collected, each character in a SMILES string is converted into a one-hot vector, and an 'E' character is appended to the end of each string to mark its end. The SMILES encoding vector of each drug molecule is fixed at 120 positions; when a string is shorter than this, the remaining positions are padded with the one-hot value of 'E', so that the SMILES data of every drug molecule is processed into a 120-dimensional encoding vector. At the same time, five relatively representative physicochemical properties are selected for each drug molecule: molecular mass, LogP, number of H-bond donors, number of H-bond acceptors, and TPSA (topological polar surface area). These properties are computed for every molecule in the dataset and, according to their numerical representation, abstracted into a vector form that serves as the conditional vector input of the model.
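The one-hot encoding with 'E' padding described above can be sketched in plain Python: each molecule becomes 120 positions, each position a one-hot vector over the collected character set (the toy SMILES strings are illustrative):

```python
def encode_smiles(smiles_list, max_len=120):
    """One-hot encode SMILES strings, terminated and padded with 'E'."""
    # Build the character set from the data, plus the 'E' terminator.
    charset = sorted(set("".join(smiles_list)) | {"E"})
    index = {ch: i for i, ch in enumerate(charset)}

    def one_hot(ch):
        v = [0] * len(charset)
        v[index[ch]] = 1
        return v

    encoded = []
    for s in smiles_list:
        padded = (s + "E").ljust(max_len, "E")[:max_len]
        encoded.append([one_hot(ch) for ch in padded])
    return encoded, charset

vectors, charset = encode_smiles(["CCO", "c1ccccc1"])
print(len(vectors[0]), len(vectors[0][0]))  # 120 5
```

In practice the character set is fixed once over the whole dataset so that train-time and generation-time encodings agree.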
Based on the above drug molecule dataset, an AI model based on a conditional variational autoencoder is trained; the model mainly comprises an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer. In the input layer, the molecular fingerprint vector and the conditional vector obtained by combining the drug properties together serve as the input data and proceed to the encoding layer. The encoding and decoding layers use a recurrent neural network structure with LSTM units, with 3 network layers of 512 hidden nodes each. Each decoding-layer unit is followed by a softmax layer, and the cost function is the cross-entropy function. The output vector produced by the decoding layer is finally converted back into SMILES form. Through iterative training, the network parameters of the encoding and decoding layers are updated and optimized, and the decoding layer reconstructs generated samples as output, so that the validity of the drug molecules produced by the output layer is improved; they also perform better on the selected target physicochemical properties, making them potential drug molecules for curing pancreatic cancer.
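On the output side, the decoder's per-position softmax distributions are mapped back to characters and the 'E' padding is stripped. A NumPy sketch with a toy three-character vocabulary (the real logits and character set come from the trained network):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decode(logits, charset):
    """Map per-position softmax distributions to a SMILES string."""
    probs = softmax(logits)                        # (T, V) probabilities
    chars = [charset[i] for i in probs.argmax(axis=-1)]
    return "".join(chars).split("E", 1)[0]         # stop at the 'E' terminator

charset = ["C", "O", "E"]                          # toy vocabulary
logits = np.array([[5.0, 0.0, 0.0],                # 'C'
                   [5.0, 0.0, 0.0],                # 'C'
                   [0.0, 5.0, 0.0],                # 'O'
                   [0.0, 0.0, 5.0]])               # 'E' terminator
print(decode(logits, charset))  # CCO
```

Greedy argmax decoding is the simplest choice; sampling from the softmax distribution instead would yield more diverse generated molecules at the cost of fidelity.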
Using the trained conditional variational autoencoder model, and according to the subsequent research approach and goals, a certain number of potential drug molecules capable of curing pancreatic cancer are generated. The model can generate a large number of candidate molecules; after property analysis, the generated set can be further screened or ranked so that an appropriate number of highly valid drug molecules is selected for the subsequent computational and experimental workflows.
The five physicochemical properties proposed above are computed for the generated drug molecule dataset, and the ranges of these five properties are compiled from the statistics of existing drug molecules known to treat pancreatic cancer. Screening mainly means filtering out generated molecules whose physicochemical properties clearly fall outside the statistical ranges; ranking means considering the values of the five properties jointly to order the molecules by their likelihood of being druggable. According to the follow-up research goals, a certain number of the top-ranked generated molecules can be selected for further study.
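This filter-then-rank step can be sketched as follows; the property ranges and the composite score are hypothetical placeholders (real ranges come from statistics over known pancreatic-cancer drugs, and the scoring rule is a design choice):

```python
# Hypothetical acceptable range per property (from known-drug statistics).
RANGES = {"mw": (150.0, 600.0), "logp": (-1.0, 5.0),
          "hbd": (0, 5), "hba": (0, 10), "tpsa": (20.0, 140.0)}

def in_range(mol):
    """Keep only molecules whose five properties all fall inside the ranges."""
    return all(RANGES[k][0] <= mol[k] <= RANGES[k][1] for k in RANGES)

def score(mol):
    """Toy composite score: normalized distance from each range midpoint
    (lower is better)."""
    return sum(abs(mol[k] - (lo + hi) / 2) / (hi - lo)
               for k, (lo, hi) in RANGES.items())

candidates = [
    {"id": "g1", "mw": 320.0, "logp": 2.1, "hbd": 2, "hba": 5, "tpsa": 80.0},
    {"id": "g2", "mw": 900.0, "logp": 2.0, "hbd": 1, "hba": 4, "tpsa": 70.0},
    {"id": "g3", "mw": 400.0, "logp": 4.5, "hbd": 4, "hba": 9, "tpsa": 130.0},
]
kept = sorted((m for m in candidates if in_range(m)), key=score)
print([m["id"] for m in kept])  # ['g1', 'g3']  (g2 filtered: mw out of range)
```

Any monotone combination of the five properties could replace `score`; the point is only that filtering removes clear outliers while ranking orders the survivors.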
Taking the above ideal embodiments of the present application as inspiration, and guided by the above description, those skilled in the art can make various changes and modifications without departing from the technical idea of the present application. The technical scope of the present application is not limited to the content of the description; it must be determined according to the scope of the claims.
As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims (10)

  1. A drug molecule screening method, characterized in that it comprises:
    preprocessing: collecting molecular data of drugs related to a specific disease, preprocessing the data, computing the encoding vectors and related drug physicochemical properties, and forming structured data stored in a database;
    building and training a model: constructing and training an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of the molecule as the input layer of the model, converting it through the model's encoding layer into a hidden-layer encoding vector, and then generating possible drug molecular structures through the model's decoding layer; during model training, minimizing the model loss function by a gradient descent algorithm and iteratively updating the weight parameters of the neural network structures of the encoding layer and the decoding layer;
    generating potential drug molecules: generating, from the trained conditional variational autoencoder model, potential drug molecules for curing the specific disease.
  2. The drug molecule screening method according to claim 1, characterized in that the encoding vector is a SMILES encoding vector, and the preprocessing comprises: collecting all characters appearing in the SMILES formulas, converting each character in a SMILES formula into a one-hot vector, and processing the SMILES data of each drug molecule into an encoding vector of a set dimension.
  3. The drug molecule screening method according to claim 1, characterized in that computing the drug physicochemical properties of a drug molecule comprises one or more of: computing the molecular mass, computing the lipid-water partition coefficient, computing the number of molecular H-bond donors, computing the number of molecular H-bond acceptors, and computing the molecular topological polar surface area.
  4. The drug molecule screening method according to claim 3, characterized in that the data of the three indicators of molecular mass, lipid-water partition coefficient, and molecular topological polar surface area are normalized and uniformly mapped into the range −1.0 to 1.0, and the five drug physicochemical properties of each drug molecule are formed into a 5-dimensional vector.
  5. The drug molecule screening method according to claim 2, characterized in that the SMILES data and the drug physicochemical property data together form a total drug molecule dataset, which is randomly divided into a training dataset and a test dataset at a ratio of 4:1; each SMILES datum is processed into a 120-dimensional encoding vector and concatenated with the 5 vectors representing different drug physicochemical properties to form a 125-dimensional vector, which serves as the input layer of the model.
  6. The drug molecule screening method according to any one of claims 1 to 5, characterized in that the AI model structure comprises: an input layer, an encoding layer, a hidden layer, a decoding layer, and an output layer; the encoding layer receives the output data of the input layer and outputs to the hidden layer, the encoding layer being an RNN structure comprising 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer; the decoding layer receives the output data of the hidden layer and outputs to the output layer, the decoding layer being an RNN structure comprising 3 recurrent neural network layers using LSTM units, with 512 hidden nodes per layer; a softmax layer follows the decoding layer, and its cost function is the cross-entropy function

    H(y, p) = −Σ_{i=1}^{K} y_i log(p_i)

    where K is the number of categories, y is the label, and p_i is the output of the network, i.e., the probability that the category is i; through the softmax layer, the probability distribution over the character categories at each position of the SMILES encoding vector is estimated, and finally, via the direct correspondence between one-hot values and the specific encoded characters established in data preprocessing, the output sample is reconstructed and the SMILES formula is output.
  7. The drug molecule screening method according to claim 6, characterized in that the input layer passes through the encoding layer to generate the hidden layer, forming an encoder, and the hidden layer passes through the decoding layer to generate the output layer, forming a decoder; the encoder converts the high-dimensional input into a low-dimensional latent vector; the loss function is E[log P(X|z, c)] − D_KL[Q(z|X, c) ‖ P(z|c)], which comprises two parts: the first part is the log-likelihood of X under the probability distribution P(X′|z, c), characterizing the distance between the output of the decoding layer and the input training sample X; the second part is the KL divergence, representing the distance between Q(z|X, c) and its reference distribution N(0, 1).
  8. The drug molecule screening method according to any one of claims 1 to 5, characterized in that, following the model structure, the model is built and trained using TensorFlow; during training, the training dataset is used for model training while the test dataset is used to compute the test-set error, i.e., the loss function, to prevent overfitting; after a certain number of training epochs, the training-set error and the test-set error are compared, and when the test-set error remains essentially unchanged and the decrease of the training-set error weakens, the encoding-layer and decoding-layer parameters of the model have been optimized to their best values, training is stopped, and the model is saved.
  9. A drug molecule screening system, characterized in that it comprises:
    a preprocessing module, which collects molecular data of drugs related to a specific disease, preprocesses the data, computes the encoding vectors and related drug physicochemical properties, and forms structured data stored in a database;
    a model building and training module, which constructs and trains an AI model based on a conditional variational autoencoder, using the combination of the encoding vector and the drug physicochemical properties of the molecule as the input layer of the model, converting it through the model's encoding layer into a hidden-layer encoding vector, and then generating possible drug molecular structures through the model's decoding layer; during model training, the model loss function is minimized by a gradient descent algorithm, and the weight parameters of the neural network structures of the encoding layer and the decoding layer are iteratively updated;
    a potential-drug-molecule generation module, which generates, from the trained conditional variational autoencoder model, potential drug molecules for curing the specific disease.
  10. The drug molecule screening system according to claim 9, characterized in that computing the drug physicochemical properties of a drug molecule comprises one or more of: computing the molecular mass, computing the lipid-water partition coefficient, computing the number of molecular H-bond donors, computing the number of molecular H-bond acceptors, and computing the molecular topological polar surface area.
PCT/CN2020/113085 2020-09-02 2020-09-02 Drug molecule screening method and system WO2022047677A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/113085 WO2022047677A1 (en) 2020-09-02 2020-09-02 Drug molecule screening method and system


Publications (1)

Publication Number Publication Date
WO2022047677A1 2022-03-10

Family

ID=80492382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113085 WO2022047677A1 (en) 2020-09-02 2020-09-02 Drug molecule screening method and system

Country Status (1)

Country Link
WO (1) WO2022047677A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN115567582A (en) * 2022-11-09 2023-01-03 山东恒远智能科技有限公司 Intelligent industrial internet data service system and method
CN116130036A (en) * 2023-01-09 2023-05-16 四川大学 Reverse design method of metal organic frame based on graph representation
CN116189809A (en) * 2023-01-06 2023-05-30 东南大学 Drug molecule important node prediction method based on challenge resistance
CN117334271A (en) * 2023-09-25 2024-01-02 江苏运动健康研究院 Method for generating molecules based on specified attributes
CN117594157A (en) * 2024-01-19 2024-02-23 烟台国工智能科技有限公司 Method and device for generating molecules of single system based on reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970099A (en) * 2019-12-10 2020-04-07 北京大学 Medicine molecule generation method based on regularization variational automatic encoder
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generation of confrontation network
CN111128314A (en) * 2018-10-30 2020-05-08 深圳市云网拜特科技有限公司 Drug discovery method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JAECHANG LIM, SEONGOK RYU, JIN WOO KIM, WOO YOUN KIM: "Molecular generative model based on conditional variational autoencoder for de novo molecular design", JOURNAL OF CHEMINFORMATICS, vol. 10, no. 1, 1 December 2018 (2018-12-01), XP055635609, DOI: 10.1186/s13321-018-0286-7 *



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20951924

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05/07/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20951924

Country of ref document: EP

Kind code of ref document: A1