CN117334271B - Method for generating molecules based on specified attributes - Google Patents
- Publication number: CN117334271B
- Application number: CN202311238924.0A
- Authority: CN (China)
- Prior art keywords: molecular, model, molecules, generation, data
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to the technical field of drug molecule generation models, and in particular to a method for generating molecules based on specified attributes. The method comprises: S1, collecting drug small-molecule data and preparing a dataset for the molecule generation task; S2, establishing a molecular generation model based on pre-training and model fine-tuning; S3, fine-tuning the molecular generation model with the dataset prepared in S1 to obtain two different models suited to the downstream molecular generation tasks; S4, adjusting the structure of the DPMG model separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes; S5, screening the obtained molecules; S6, providing indexes for evaluating molecular generation and quantifying the quality of the model. The invention introduces a pre-trained model into drug molecule generation and, by fine-tuning that pre-trained model, brings deep learning into the field of molecular generation, thereby significantly accelerating the drug development process.
Description
Technical Field
The invention relates to the technical field of drug molecule generation models, in particular to a method for generating molecules based on specified attributes.
Background
Over the last few decades, computer science has gradually entered the field of drug development, moving from initial data entry to assisting drug design, which gave rise to computer-aided drug design (CADD). Although CADD performs well on certain tasks, such as virtual screening of drug molecules, challenges remain in the design and optimization of drug molecules; as technology has progressed, artificial intelligence has become the preferred approach to this problem.
AI pharmacy builds on pharmaceutical big data and uses AI techniques such as machine learning and deep learning to replace a large number of experiments and to analyze drug structure and efficacy rapidly, so that new drugs can be developed in less time and at lower cost. Compared with traditional computer-aided drug design, AI techniques can rapidly identify drug targets, match suitable molecules from a database, design and synthesize compounds, and predict drug metabolism and physicochemical properties, thereby greatly shortening development time, reducing development cost, and improving the success rate. The emergence and application of pre-trained models, against the broader background of AI development, shortens this process further.
Labeled datasets, algorithm models, and computing power are indispensable components of AI pharmacy, and they are also the sources of the main challenges that molecular generation currently faces:
Data: high-quality data is hard to obtain, and this is a clear constraint. The data sources of drug development enterprises can be divided into public and non-public data. Public data is easy to obtain, but its quality is hard to guarantee, so models built on it are not sufficiently reliable. Non-public data is mainly accumulated from each pharmaceutical company's earlier projects; it is highly accurate, but because it is a core asset of the company it is extremely difficult to obtain.
Algorithms: the algorithm must match the application scenario closely. The strengths of an algorithm model in AI drug development show up in the accuracy of its results, its computation speed, its model size, its generalization performance, and so on. Different pre-trained models emphasize different aspects and therefore have different strengths, so a pre-trained model with the appropriate strengths must be chosen for the specific task and application scenario.
Computing power: fine-tuning the model may require a significant amount of computing resources, especially when the model architecture and parameters need to be adjusted. This may limit the practical application of the method.
At present, AI drug research and development in China is applied mainly in the drug discovery and preclinical research stages. Limited by the inherent complexity of biological systems and the heterogeneity of disease, AI technology has not yet brought revolutionary changes to the efficiency and success rate of drug development, and the field as a whole is still in an exploratory stage. In the future, as algorithms are updated, computing power advances, and big data develops, AI technology is expected to be applied in depth to every stage of new drug development and to play an increasingly important role in compound synthesis, drug-effect prediction, automated development, and other stages.
Disclosure of Invention
The invention aims to solve the problems described in the background and provides a method for generating molecules based on specified attributes. The method introduces a pre-trained model into drug molecule generation and, by fine-tuning that pre-trained model, brings deep learning into the field of molecular generation, thereby significantly accelerating the drug development process.
The technical scheme of the invention is a method for generating molecules based on specified attributes, comprising the following specific steps:
S1, collecting drug small-molecule data and preparing a dataset for the molecule generation task;
S2, establishing a molecular generation model based on pre-training and model fine-tuning;
S3, fine-tuning the molecular generation model with the dataset prepared in S1 to obtain two different models suited to the downstream molecular generation tasks;
S4, adjusting the structure of the DPMG model separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes;
S5, screening the obtained molecules;
S6, providing indexes for evaluating molecular generation and quantifying the quality of the model.
The small-molecule structure is represented either as a linear SMILES string or as a two-dimensional undirected cyclic graph;
The drug small-molecule data collected in S1 include public data and experimentally produced data, and the collected data reach a scale of tens of millions;
The collected drug molecule data are organized into a table comprising the SMILES expression of each molecule, its lipid partition coefficient, its drug-likeness, and its synthetic accessibility; the RDKit tool is used to filter out reasonable molecules.
Preferably, the present invention uses textual SMILES as the input data for the model;
The properties collected together with the SMILES expressions of the corresponding molecules also include solubility and permeability estimates, molecular mass, topological polar surface area, numbers of hydrogen-bond donors and acceptors, number of structural alerts, and number of rotatable bonds;
wherein molecules are generated according to the three molecular attributes of lipid partition coefficient, quantitative estimate of drug-likeness, and synthetic accessibility, the target being a molecule whose values for these three attributes equal the specified values;
or one or more of the six attributes of solubility and permeability estimates, molecular mass, topological polar surface area, numbers of hydrogen-bond donors and acceptors, number of structural alerts, and number of rotatable bonds are specified in combination.
Preferably, the model needs to generate drug molecules conditioned on the specified attributes. To ensure that the neural network can recognize the relationships among the attributes, the molecular structures, and the SMILES text, different datasets are prepared in S1 for different attributes of the same molecular structure, which ensures that the neural network is able to learn the associated information.
In the S3 pre-training process, a Transformer with 12 layers, a hidden size of 768, and 12 attention heads is used to encode the text data within a single shared semantic space;
wherein the text sequence information and the molecular structure information are learned by specifying the direction of attention.
S4, in the molecular attribute prediction process: after molecule generation is finished, the generated molecules are input into the DPMG model, the remaining attributes of the molecules are predicted, and all attributes of the molecules are completed; the result is compared with the previously given attributes, and the prediction deviation is calculated;
In the molecular attribute prediction process, a regression model is used as the output stage to obtain a numerical value, and the mean square error (MSE) is used as the loss function:
For the mean square error loss, the general formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
In the above formula, $y_i$ is the true measurement value and $\hat{y}_i$ is the model's predicted value. In the regression model, let $\hat{y}_i = a x_i + b$; the loss function is then expressed as:
$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
The training objective is to find the values of $a$ and $b$ that minimize this function:
$$(a^{*}, b^{*}) = \arg\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
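As a minimal sketch of the mean-square-error objective, the following Python computes the MSE and, assuming the predictor is the straight line a*x + b, finds the minimizing a and b in closed form. The function names and the toy data are illustrative assumptions; the patent's regression stage sits on top of a Transformer and is trained by back-propagation rather than by this closed-form solution.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def fit_line(xs, ys):
    """Closed-form least-squares values of a and b for y ~ a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Toy data lying exactly on y = 2x + 1 (illustrative values, not patent data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
loss = mse(ys, [a * x + b for x in xs])
```

Because the toy points lie on a line, the fitted parameters recover that line and the loss is zero; with noisy data the same code returns the a and b with the smallest MSE.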
S4, in the process of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;
The numerical values of the molecular attributes are input into an encoder, and the word embeddings obtained from the encoder are input into a decoder to obtain the output SMILES file;
A teacher-forcing mechanism is introduced in the output process: each time an atom is generated, the SMILES portion generated so far is compared with a correct reference SMILES file, thereby carrying out the fine-tuning of the model;
After each atom is compared with the reference SMILES file, the reference SMILES file is used as the precondition for generating the next atom; each time an atom is generated, a loss value is calculated from the comparison and the parameters are fine-tuned through back-propagation.
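The training scheme described above, in which the reference SMILES rather than the model's own output is fed in as context for the next step, can be sketched as follows. This is a simplified illustration under stated assumptions: it works per character instead of per atom, it scores each step with the negative log-probability of the reference character instead of the binary cross entropy the text specifies, and `uniform_model` is a hypothetical stand-in for the decoder.

```python
import math


def teacher_forced_loss(reference, next_char_probs, bos="[bos]"):
    """Sum of -log p(reference[t] | reference[:t]) over the reference string.

    The prefix fed to the model is always taken from the reference, never
    from the model's own earlier outputs.
    """
    total = 0.0
    prefix = [bos]
    for target in reference:
        probs = next_char_probs(prefix)      # dict: character -> probability
        total += -math.log(max(probs.get(target, 0.0), 1e-12))
        prefix.append(target)                # feed the correct character
    return total


def uniform_model(prefix, vocab="CO=()1"):
    """Hypothetical stand-in for the decoder: uniform over a tiny vocabulary."""
    return {ch: 1.0 / len(vocab) for ch in vocab}


loss = teacher_forced_loss("CCO", uniform_model)
```

In a real fine-tuning loop the accumulated loss would be back-propagated to update the decoder parameters after each step or each sequence.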
Binary cross entropy is used as the loss function when calculating the loss:
$$l(\theta) = \sum_{i=1}^{n}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$
The dataset has the form data = $(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots$, where $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ takes the value 0 or 1 here. When the probability of $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$, the log-likelihood of observing these data points is given by the above equation, where the likelihood function $l(\theta)$ is the objective function;
If a negative sign is placed in front of the log-likelihood, it becomes the cross entropy of $y_i$ and $\theta$;
The loss function in cross-entropy form for a single sample is:
$$\mathrm{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$
where $y_i$ is the observed value of the i-th sample and $p$ is the predicted probability.
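The single-sample cross-entropy loss and its relation to the log-likelihood can be checked numerically with a short sketch. The function names, the clipping constant `eps`, and the toy labels are assumptions added for illustration.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -[y*log(p) + (1-y)*log(1-p)] for one sample; eps avoids log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def log_likelihood(ys, theta):
    """Sum over samples of y*log(theta) + (1-y)*log(1-theta)."""
    return -sum(binary_cross_entropy(y, theta) for y in ys)


labels = [1, 0, 1, 1]                           # illustrative observations
loss_confident = binary_cross_entropy(1, 0.9)   # correct and confident
loss_wrong = binary_cross_entropy(1, 0.1)       # wrong and confident
```

A confident correct prediction gives a small loss and a confident wrong one a large loss; for the toy labels, the log-likelihood is higher at theta = 0.75 (the observed fraction of ones) than at theta = 0.5.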
Preferably, in S5, the generated molecules are preliminarily filtered for basic drug-like properties based on QED and SAscore.
Preferably, in S6, MOSES and GuacaMol are used to score the drug molecules and to screen the molecular structures for rationality;
Wherein,
Validity of generated molecules: the rationality of a molecule refers to whether its structure, properties, and design are consistent with the expected biological activity or pharmaceutical properties, whether it obeys the basic laws of chemistry, biology, and pharmacology, and whether it corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is used to check whether it can be converted from SMILES format into an rdmol object, and if so, the molecule is considered reasonable.
Compared with the prior art, the invention has the following beneficial technical effects:
1. The invention generates drug molecules according to their required attributes; it is simple and clear, and easy to understand and operate.
2. In traditional methods, the screening and optimization stages require many rounds of experiments and computational simulation. By adding, after generation is complete, a small neural network that screens according to specified conditions, the invention accelerates the screening process and reduces the workload of simulation and experiment.
3. Pre-training targets massive unlabeled data, so it places high demands on model parameters and increases the resources and time required for training. The invention adapts to downstream tasks by fine-tuning an already trained pre-trained model, thereby reducing the computational cost of training a large model.
Drawings
FIG. 1 is a schematic diagram of a frame structure of a model of the present invention;
FIG. 2 is a schematic diagram of the pre-training of the model of the present invention;
FIG. 3 is a schematic representation of text data encoding of the model of the present invention;
FIG. 4 is a fine-tuning model of the molecular property prediction task of the model of the present invention;
FIG. 5 is a fine-tuning model of a target molecule generation task for specified properties of the model of the present invention.
Detailed Description
Example 1
The invention provides a method for generating molecules based on specified attributes, which comprises the following specific steps:
1. Collecting drug small-molecule data and preparing a dataset for the molecule generation task;
A table is built that contains the molecular SMILES structure text and the molecular attribute values (SAS, QED, etc.). The resulting data volumes are shown in the following table:
Dataset | Data volume |
Training set | 207493 |
Test set | 50000 |
The attributes involved in the dataset of the present model include the lipid partition coefficient (LogP), the quantitative estimate of drug-likeness (QED), and the synthetic accessibility score (SAS); the generation target is a molecule whose values for these three attributes equal the specified values. Other molecular properties may also be specified, such as solubility and permeability estimates (Lipinski), molecular mass (MW), topological polar surface area (TPSA), numbers of hydrogen-bond donors and acceptors (HBD & HBA), number of structural alerts (ALERT), and number of rotatable bonds (ROTB).
It should be noted that although the model can implement two different downstream tasks through fine-tuning, the fine-tuning dataset is the same for both; only the organization of the data differs. For molecular attribute prediction, each entry pairs a molecular SMILES with its corresponding attributes; for molecular generation, each entry pairs a given attribute with the corresponding molecular SMILES.
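The two organizations of the same fine-tuning data can be sketched in a few lines. The record layout, the key names, and the numeric values below are hypothetical and serve only to show how one table yields both (SMILES, attributes) pairs for prediction and (attribute, SMILES) pairs for generation.

```python
# Illustrative rows; the attribute values are made up for the example.
records = [
    {"smiles": "CCO", "logp": -0.1, "qed": 0.41, "sas": 1.0},
    {"smiles": "c1ccccc1", "logp": 1.7, "qed": 0.44, "sas": 1.0},
]
ATTRS = ("logp", "qed", "sas")


def prediction_view(rows):
    """Property prediction: input is SMILES, target is the attribute vector."""
    return [(r["smiles"], [r[a] for a in ATTRS]) for r in rows]


def generation_view(rows, attr):
    """Generation: input is one specified attribute, target is SMILES."""
    return [({attr: r[attr]}, r["smiles"]) for r in rows]


pred_pairs = prediction_view(records)
gen_pairs = generation_view(records, "qed")
```

Calling `generation_view` once per attribute gives the separate per-attribute datasets for the same molecular structures.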
2. Establishing a molecular generation model based on pre-training and model fine-tuning. As shown in FIG. 1, the overall model framework is divided into two stages, molecular generation and molecular screening, and involves three technical points: the pre-trained model, fine-tuning of the pre-trained model, and molecular screening.
3. Fine-tuning the molecular generation model with the dataset prepared in step S1 to obtain a model suited to the downstream molecular generation task. As shown in FIG. 2, computational simulation and analysis of small molecules greatly speeds up drug development. Characterizing and understanding molecules is an essential step toward this goal. Various molecular representations, such as molecular descriptors and fingerprints, have been proposed. Traditionally, these descriptors are designed by domain experts on the basis of chemical and pharmaceutical knowledge in order to represent molecules qualitatively or quantitatively. Various shallow machine learning models are used to derive quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR) that predict the activity and properties of molecules. With the rise of deep learning and representation learning in recent years, automatically representing and understanding molecules by learning high-level features from low-level data has become an effective approach to molecular modeling, making it possible to feed raw molecules directly into subsequent molecular analysis.
For encoding and decoding, following sequence models used for text, such as the recurrent neural network (RNN) and the long short-term memory network (LSTM), a Transformer is adopted to process the character sequence of the SMILES. The model uses a Transformer with 12 layers, a hidden size of 768, and 12 attention heads to encode the text data. To ensure that differing codebooks do not cause confusion in the generated structures, the encoding and decoding of the model are confined to the same semantic space.
As shown in FIG. 3, in the pre-trained model some of the attention arrows (the solid lines) are bidirectional and can connect to information both before and after the current position, while the connections drawn with dotted lines are not bidirectional and realize contrastive learning between molecular structure and text information, i.e. a causal model within the attention mechanism. The attention arrows of the model are modified, and the pre-trained model is then fine-tuned with the dataset to obtain a model suited to the downstream molecular generation task.
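The difference between bidirectional and causal attention can be made concrete with two small mask builders. This is a sketch under assumptions: the masks are plain 0/1 lists rather than tensors, and the stated model sizes are read as 12 layers, a hidden size of 768, and 12 attention heads, which gives a per-head width of 64.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Sizes as read from the text (an interpretation, see the note above).
NUM_LAYERS, HIDDEN_SIZE, NUM_HEADS = 12, 768, 12
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS   # width of each attention head

mask = causal_mask(4)
```

Row i of a mask lists which positions token i may attend to, so the causal mask is lower triangular while the bidirectional mask is all ones.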
4. Adjusting the structure of the DPMG model separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes;
As shown in FIG. 5, the fine-tuned model for the task of generating target molecules with specified attributes is used: in the molecule generation downstream task, the model generates molecules, and the generated molecules are scored in the molecular screening step; the scoring criteria are the validity (Validity) and the novelty (Novelty) of the generated molecules.
During the execution of this downstream task, [bos] is used as the start token, [seq] is the separator, and [seq-len] is used to control the text length of the generated molecular SMILES file. The input format of the molecular attributes is: attribute 1 [seq] attribute 2 [seq] attribute 3. It should additionally be noted that once the model has been fine-tuned, no loss is needed to update the parameters, and each output of the SMILES file is fed directly back into the model as input.
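A conditioning string in the described format can be assembled as follows. The helper name, the placement of [bos] before the first attribute, and the two-decimal number formatting are assumptions; the text only fixes the tokens and the order attribute 1 [seq] attribute 2 [seq] attribute 3.

```python
def build_condition_prompt(attributes, bos="[bos]", sep="[seq]"):
    """Join attribute values as: [bos] attr1 [seq] attr2 [seq] attr3."""
    values = [f"{v:.2f}" for v in attributes]
    return f"{bos} " + f" {sep} ".join(values)


# Example values standing in for LogP, QED, and SAS (illustrative only).
prompt = build_condition_prompt([2.5, 0.8, 3.1])
```

The [seq-len] control over output length is not modeled here, since the text does not say how it is encoded.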
The test benchmarks for this task are:
Test index | Validity | Novelty | Attribute bias |
Expected value of test index | 0.87 | 0.99 | Less than or equal to 5% |
As shown in FIG. 4, the fine-tuned model for the molecular property prediction task is used:
In this downstream task, fine-tuning is performed by changing the attention arrows and using differently labeled datasets. It should be noted that although the molecular property prediction model and the molecular generation model are both called DPMG, they are essentially two different models, each suited to its own downstream task, and they differ in both structure and parameters. The input to this downstream task is the drug molecule generated from the specified attributes in the preceding downstream task.
In this downstream task, the Transformer acts as an encoder; the encoded vector produced by the Transformer layers is finally converted to text output by a machine-learning classification model, and during training the MSE loss is back-propagated to update the model parameters.
In this downstream task, the model predicts and completes the attributes of the generated molecules; a molecule is regarded as qualified if the prediction deviation of its attributes is within 5%, and as unqualified otherwise.
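The 5% acceptance rule can be sketched as a simple check. The text does not define how the deviation is computed, so the relative deviation used here, |predicted - specified| / |specified|, is an assumption, as are the function names and the example values.

```python
def relative_deviation(specified, predicted):
    """|predicted - specified| / |specified|; specified must be non-zero."""
    return abs(predicted - specified) / abs(specified)


def is_qualified(specified, predicted, tolerance=0.05):
    """A molecule passes when every attribute deviates by at most 5%."""
    return all(
        relative_deviation(s, p) <= tolerance
        for s, p in zip(specified, predicted)
    )


ok = is_qualified([2.0, 0.80], [2.06, 0.79])    # about 3% and 1.25% deviation
bad = is_qualified([2.0, 0.80], [2.30, 0.79])   # about 15% on the first value
```

A relative measure breaks down when a specified value is zero, which can happen for LogP; an absolute tolerance would be needed in that case.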
5. Screening the obtained molecules;
Preliminary filtering based on QED and SAscore ensures, at the initial stage of molecular generation, that the resulting molecules possess basic drug-like properties; similarity calculation based on molecular fingerprints helps eliminate structurally redundant molecules with low intellectual-property value. In this way the resulting drug molecules are preliminarily screened.
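The two screening stages, a drug-likeness filter followed by removal of structurally redundant molecules, can be sketched as below. The thresholds, the record layout, and the representation of fingerprints as sets of on-bit indices are assumptions for illustration; Tanimoto similarity is used as one common choice of fingerprint similarity, and the text does not name a specific measure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def screen(candidates, min_qed=0.5, max_sas=4.0, max_similarity=0.8):
    """Keep drug-like molecules, then drop near-duplicates of kept ones."""
    kept = []
    for mol in candidates:
        if mol["qed"] < min_qed or mol["sas"] > max_sas:
            continue                      # fails the drug-likeness filter
        if any(tanimoto(mol["fp"], k["fp"]) > max_similarity for k in kept):
            continue                      # structurally redundant
        kept.append(mol)
    return kept


candidates = [  # toy records with made-up scores and fingerprint bits
    {"id": "a", "qed": 0.7, "sas": 2.5, "fp": {1, 2, 3, 4}},
    {"id": "b", "qed": 0.7, "sas": 2.5, "fp": {1, 2, 3, 4}},  # same bits as a
    {"id": "c", "qed": 0.3, "sas": 2.5, "fp": {7, 8}},        # low QED
    {"id": "d", "qed": 0.6, "sas": 3.0, "fp": {1, 9, 10}},
]
kept_ids = [m["id"] for m in screen(candidates)]
```

In practice the QED value and the fingerprints would come from a cheminformatics toolkit such as RDKit; they are supplied directly here to keep the sketch dependency-free.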
6. Providing indexes for evaluating molecular generation and quantifying the quality of the model.
To screen the resulting molecular structures for rationality, the molecules generated by the neural network need to be scored by a small neural network. MOSES and GuacaMol are two mainstream tools for scoring generated drug molecules: the former emphasizes testing general indexes such as the validity, novelty, and scaffold diversity of the molecules a model generates, while the latter evaluates a model's multi-objective optimization capability by defining a series of tasks.
Validity of generated molecules (Validity): the rationality of a molecule refers to whether its structure, properties, and design are consistent with the expected biological activity or pharmaceutical properties, whether it obeys the basic laws of chemistry, biology, and pharmacology, and whether it corresponds (at least theoretically) to a real molecule. If the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is generally used to check whether it can be converted from SMILES format into an rdmol object; if so, it is a reasonable molecule.
Novelty of generated molecules (Novelty): the novelty of a molecule refers to whether its structure is unique with respect to known compound libraries, or whether it is innovative. The way this index is calculated can be set manually according to task requirements.
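The validity, uniqueness, and novelty indexes can be computed with a few lines of Python. The definitions below follow the common MOSES-style convention (validity over all generated strings, uniqueness over valid ones, novelty over unique valid ones not in the training set), which is an assumption because the text leaves the exact calculation open. The `parse` argument is a hypothetical hook: by default it uses RDKit's `Chem.MolFromSmiles`, which returns `None` for an unparsable SMILES, but any callable with the same behavior can be passed. Strings are compared as written; canonicalizing SMILES before comparison would be needed for a faithful benchmark.

```python
def generation_metrics(generated, training_set, parse=None):
    """Return (validity, uniqueness, novelty) for a list of SMILES strings."""
    if parse is None:
        from rdkit import Chem          # imported lazily; requires RDKit
        parse = Chem.MolFromSmiles      # returns None for unparsable SMILES
    valid = [s for s in generated if parse(s) is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty
```

With RDKit installed, `generation_metrics(samples, train_smiles)` uses the real parser; passing a stub parser makes the function usable in environments without RDKit.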
Table 1 ROC-AUC comparison of models on the molecular property prediction task:
Dataset | Logreg | KernelSVM | XGBoost | IRV | Multitask | GC | Weave | DPMG |
HIV | 0.702 | 0.792 | 0.756 | 0.737 | 0.698 | 0.763 | 0.703 | 0.798 |
BACE | 0.781 | 0.862 | 0.85 | 0.838 | 0.698 | 0.763 | 0.703 | 0.872 |
BBBP | 0.699 | 0.729 | 0.696 | 0.7 | 0.688 | 0.69 | 0.671 | 0.962 |
CLINTOX | 0.722 | 0.669 | 0.799 | 0.77 | 0.778 | 0.807 | 0.832 | 0.984 |
Table 2 model performance comparison of molecular generation tasks:
Model | Validity | Uniqueness | Novelty |
JT-VAE | 62% | 100% | 100% |
GCPN | 20% | 99.97% | 100% |
MRNN | 65% | 99.89% | 100% |
GraphNVP | 55% | 94.80% | 100% |
GraphAF | 68% | 99.10% | 100% |
DPMG | 85.28% | 99.91% | 100% |
As can be seen from Tables 1 and 2 above, the model built according to the invention can generate molecules when only the values of one or more attributes are specified, and it achieves high validity, uniqueness, and novelty when generating molecules based on specified attributes.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited thereto, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (5)
1. A method for generating molecules based on specified properties, comprising the specific steps of:
S1, collecting drug small-molecule data and preparing a dataset for the molecule generation task;
S2, establishing a molecular generation model based on pre-training and model fine-tuning;
S3, fine-tuning the molecular generation model with the dataset prepared in S1 to obtain two different DPMG models suited to the downstream molecular generation tasks;
S4, adjusting the structure of the DPMG model separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes;
After molecule generation is finished, the generated molecules are input into the DPMG model, the remaining properties of the molecules are predicted, and all properties of the molecules are completed; the result is compared with the previously given attributes, and the prediction deviation is calculated;
In the molecular attribute prediction process, a regression model is used as the output stage to obtain a numerical value, and the mean square error (MSE) is used as the loss function:
For the mean square error loss, the general formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
In the above formula, $y_i$ is the true measurement value and $\hat{y}_i$ is the model's predicted value. In the regression model, let $\hat{y}_i = a x_i + b$; the loss function is then expressed as:
$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
The training objective is to find the values of $a$ and $b$ that minimize this function:
$$(a^{*}, b^{*}) = \arg\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
In the process of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;
The numerical values of the molecular attributes are input into an encoder, and the word embeddings obtained from the encoder are input into a decoder to obtain the output SMILES file;
A teacher-forcing mechanism is introduced in the output process: each time an atom is generated, the SMILES portion generated so far is compared with a correct reference SMILES file, thereby carrying out the fine-tuning of the model;
After each atom is compared with the reference SMILES file, the reference SMILES file is used as the precondition for generating the next atom; each time an atom is generated, a loss value is calculated from the comparison and the parameters are fine-tuned through back-propagation;
binary cross entropy is adopted as the loss function when calculating the loss:
$$\ell(\theta) = \sum_{i=1}^{n}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$
the dataset takes the form $\text{data} = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots\}$, wherein $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ takes the value 0 or 1; when the probability of $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$, the log-likelihood of observing these data points is expressed by the above equation, where the likelihood function $\ell(\theta)$ is the objective function;
if a negative sign is added in front of this function, it becomes the cross entropy of $y_i$ and $\theta$;
the loss function in cross entropy form for a single sample is:
$$\mathrm{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$
wherein $y_i$ is the observed value of the i-th sample and $p$ is the predicted probability;
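The single-sample cross entropy loss can be evaluated directly, as in the sketch below. Averaging over samples and clipping probabilities away from 0 and 1 (the `eps` parameter) are assumptions added for numerical safety and are not part of the text. The function name `binary_cross_entropy` is hypothetical.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A maximally uncertain prediction for a positive sample gives log(2).
uncertain = binary_cross_entropy([1], [0.5])
# Confident and correct predictions give a small loss.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

The loss shrinks as the predicted probability moves toward the observed label, which is what back propagation exploits during fine-tuning.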
S5, screening the obtained molecules: a preliminary filter on the basic drug properties of the generated molecules is performed based on QED and SAscore;
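A preliminary filter of this kind could look like the sketch below, which assumes the QED and SA scores have already been computed for each generated molecule. QED lies between 0 and 1 with higher values being more drug-like, and the SA score runs from about 1 (easy to synthesize) to 10 (hard). The thresholds `min_qed=0.5` and `max_sa=4.0`, the record layout, and the scores are hypothetical, since the text gives no cut-off values.

```python
def prefilter(records, min_qed=0.5, max_sa=4.0):
    """Keep records with QED >= min_qed and SA score <= max_sa."""
    kept = []
    for record in records:
        if record["qed"] >= min_qed and record["sa"] <= max_sa:
            kept.append(record)
    return kept


# Placeholder molecules with made-up scores (illustration only).
candidates = [
    {"id": "mol_a", "qed": 0.82, "sa": 2.4},  # passes both checks
    {"id": "mol_b", "qed": 0.31, "sa": 2.1},  # QED too low
    {"id": "mol_c", "qed": 0.77, "sa": 6.3},  # too hard to synthesize
]
kept_ids = [r["id"] for r in prefilter(candidates)]
```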
S6, providing indexes for evaluating molecular generation and quantifying the quality of the model;
scoring the drug molecules using MOSES and GuacaMol and screening for rationality of the molecular structure;
wherein,
rationality of generated molecules: the rationality of a molecule refers to whether the structure, properties, and design of the molecule are consistent with the expected biological activity or pharmaceutical properties, whether the basic laws of chemistry, biology, and pharmacology are met, and whether the molecule corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is used to check whether the structure can be converted from SMILES format into an RDKit Mol object, and if so, the molecular structure is considered reasonable.
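The rationality check can be sketched as a validity ratio over a batch of generated SMILES strings. By default the sketch relies on RDKit's `Chem.MolFromSmiles`, which returns `None` when a string cannot be parsed. The `parse` parameter and the function name `validity_ratio` are hypothetical additions so that the counting logic can be exercised without RDKit installed.

```python
def validity_ratio(smiles_list, parse=None):
    """Fraction of SMILES strings that `parse` converts to a non-None object.

    If `parse` is None, RDKit's Chem.MolFromSmiles is used.
    """
    if parse is None:
        from rdkit import Chem  # requires RDKit to be installed
        parse = Chem.MolFromSmiles
    if not smiles_list:
        return 0.0
    valid = sum(1 for smiles in smiles_list if parse(smiles) is not None)
    return valid / len(smiles_list)


def balanced_parentheses_stub(smiles):
    """Toy stand-in parser: accepts strings with balanced parentheses.

    This is NOT a chemistry check; it only demonstrates the counting.
    """
    return object() if smiles.count("(") == smiles.count(")") else None


ratio = validity_ratio(["CCO", "C(C", "c1ccccc1"], parse=balanced_parentheses_stub)
```

With RDKit available, calling `validity_ratio(generated_smiles)` without a `parse` argument would apply the MolFromSmiles check named in the text.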
2. The method of claim 1, wherein the representation of the small molecule structure comprises a linear input SMILES and a two-dimensional undirected cyclic graph;
the drug small-molecule data collected in S1 comprise public data and experimental output data, and the scale of the collected data reaches tens of millions;
the collected drug molecule data are made into a table comprising the SMILES expression of each molecule, lipid partition coefficient, drug affinity, and synthesis accessibility; and the RDKit tool is used to screen for reasonable molecules.
3. The method of generating molecules based on specified attributes according to claim 2, wherein in S1, SMILES in text form is used as the input data of the model;
the properties collected together with the SMILES expressions of the corresponding molecules also include solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds;
wherein molecules are generated according to three molecular attributes, namely lipid partition coefficient, quantitative estimate of drug-likeness (QED), and synthesis accessibility, the target being a molecule whose values for these three attributes equal the specified values;
or one or more combinations of the six attributes of solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds are specified.
4. The method for generating molecules based on specified attributes according to claim 1, wherein in S1 different datasets are created for different properties of the same molecular structure.
5. The method for generating molecules based on specified attributes according to claim 1, wherein 12 Transformer layers, a hidden size of 768, and 12 attention heads are adopted to encode text data in the same semantic space in the training process of S3;
wherein the text sequence information and the molecular structure information are learned by specifying the direction of the attention heads.
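The figures in this claim can be read as 12 Transformer layers, a hidden size of 768, and 12 attention heads, which gives 64 dimensions per head. The sketch below records that arithmetic and shows one possible meaning of specifying the attention direction, namely the choice between a bidirectional and a left-to-right attention mask. Both readings are assumptions, and the names `attention_mask`, `"bidirectional"`, and `"left_to_right"` are hypothetical.

```python
NUM_LAYERS = 12
HIDDEN_SIZE = 768
NUM_HEADS = 12
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # dimensions handled by each head


def attention_mask(seq_len, direction):
    """Return mask[i][j], True if position i may attend to position j."""
    if direction == "bidirectional":
        # Every position sees the whole sequence (encoder-style).
        return [[True] * seq_len for _ in range(seq_len)]
    if direction == "left_to_right":
        # Each position sees itself and earlier positions (decoder-style).
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    raise ValueError(f"unknown direction: {direction!r}")


causal = attention_mask(3, "left_to_right")
full = attention_mask(3, "bidirectional")
```

A left-to-right mask suits step-by-step SMILES generation, while a bidirectional mask suits encoding a complete input.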
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311238924.0A CN117334271B (en) | 2023-09-25 | Method for generating molecules based on specified attributes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117334271A (en) | 2024-01-02 |
CN117334271B (en) | 2024-07-12 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110459274A (en) * | 2019-08-01 | 2019-11-15 | 南京邮电大学 | Small-molecule drug virtual screening method based on deep transfer learning and its application |
CN110534164A (en) * | 2019-09-26 | 2019-12-03 | 广州费米子科技有限责任公司 | Drug molecule generation method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |