CN113707234B - Lead compound patent drug property optimization method based on machine translation model - Google Patents

Lead compound patent drug property optimization method based on machine translation model

Info

Publication number
CN113707234B
Authority
CN
China
Prior art keywords
optimization
character string
score
calculation
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110992135.0A
Other languages
Chinese (zh)
Other versions
CN113707234A (en
Inventor
曹东升
付丽
杨梓宜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202110992135.0A
Publication of CN113707234A
Application granted
Publication of CN113707234B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The embodiment of the disclosure provides a method for optimizing the druggability of lead compounds based on a machine translation model, belonging to the technical field of medical and healthcare informatics. The method comprises: training a translation model; establishing, with a machine learning algorithm, calculation models corresponding to a plurality of pharmacokinetic endpoints to form a prediction model group; inputting an initial molecular character string into an encoder to generate a target vector; inputting the target vector into the prediction model group according to a received optimization instruction to obtain the optimization prediction indices corresponding to that instruction; performing a weighted average calculation over the optimization prediction indices and the calculation indices corresponding to the initial molecular character string to obtain a score for the initial molecular character string; iterating an optimization algorithm a preset number of times according to the target vector and the score to obtain an optimization score set; and inputting the optimization score set into a decoder and calculating the character string corresponding to each optimization vector with a preset algorithm to form a target molecule character string set. The scheme improves optimization efficiency and adaptability.

Description

Lead compound patent drug property optimization method based on machine translation model
Technical Field
The embodiments of the present disclosure relate to the technical field of medical and healthcare informatics, and in particular to a method for optimizing the druggability of lead compounds based on a machine translation model.
Background
One of the biggest challenges in drug development, and a major one for medicinal chemists, is how to optimize lead compounds efficiently. More than 50% of compounds fail during drug development because they lack suitable absorption, distribution, metabolism, excretion and toxicity (ADMET) properties. Optimizing ADMET properties is a very difficult multi-objective task, because the druggability of the molecule must be improved while its activity is maintained. In addition, the large chemical space, limited experience, high cost and long timelines make efficient optimization of a compound's pharmacokinetic properties and safety very difficult. In the prior art, either new molecules are generated computationally and then filtered with a virtual screening program to obtain candidate compounds, which requires a huge amount of computation, or a prediction model is applied to the molecule as a whole, in which case specific druggability indices cannot be optimized and the optimized molecule still has low druggability. Both approaches suffer from poor optimization efficiency and adaptability.
Therefore, a method for optimizing the druggability of lead compounds based on a machine translation model, with high optimization efficiency and adaptability, is needed.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method for optimizing the druggability of lead compounds based on a machine translation model, which at least partially solves the problems of poor optimization efficiency and adaptability in the prior art.
In a first aspect, embodiments of the present disclosure provide a method for optimizing the druggability of lead compounds based on a machine translation model, comprising:
training a translation model by using a preset number of sample molecule strings, wherein the translation model comprises an encoder and a decoder;
establishing a plurality of calculation models corresponding to the pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group;
inputting the initial molecular character string into the encoder to generate a target vector;
inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction;
performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular character string to obtain the score of the initial molecular character string;
iterating an optimization algorithm a preset number of times according to the target vector and the score to obtain an optimization score set, wherein the optimization score set comprises a plurality of optimization vectors and the optimization score corresponding to each optimization vector;
inputting the optimization score set into the decoder, and calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set.
According to a specific implementation manner of the embodiment of the present disclosure, the step of training the translation model by using a preset number of sample molecule strings includes:
inputting each sample molecule character string into the encoder, and inputting the output result of the encoder into the decoder;
and computing a loss between each output of the decoder and the true label of the corresponding sample molecule character string, and performing a gradient update.
According to a specific implementation manner of an embodiment of the disclosure, the step of establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model set includes:
extracting a sample dataset from within an initial database;
extracting data corresponding to each pharmacokinetic endpoint from the sample data set to train an XGBoost algorithm, and obtaining a calculation model corresponding to each pharmacokinetic endpoint;
and forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints.
According to a specific implementation manner of the embodiment of the present disclosure, the step of inputting the target vector into a prediction model group according to a received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction includes:
analyzing the pharmacokinetic endpoint contained in the optimization instructions;
selecting a corresponding calculation model from the prediction model group according to the pharmacokinetic endpoint contained in the optimization instruction;
and respectively inputting the target vector into each calculation model to obtain a prediction index corresponding to each pharmacokinetic endpoint, and forming the optimized prediction index.
According to a specific implementation manner of the embodiment of the present disclosure, before the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular string, the method further includes:
setting a corresponding weight for each of the pharmacokinetic endpoints and the computational metrics;
setting a common property range and a preset property range corresponding to each pharmacokinetic endpoint and the calculation index, wherein the common property range is larger than the preset property range.
According to a specific implementation manner of the embodiment of the present disclosure, the step of performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular string to obtain the score of the initial molecular string includes:
calculating a predicted value for each pharmacokinetic endpoint according to its weight and prediction index, and calculating a predicted value for each calculation index according to its value and weight;
and determining the predicted score corresponding to each predicted value according to the property range in which that predicted value falls, to form the score of the initial molecular character string.
According to a specific implementation manner of the embodiment of the present disclosure, the step of calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set includes:
predicting each character according to the Beam Search algorithm and the optimization vector until a character string is formed;
and forming the target molecule character string set according to the character strings corresponding to all the optimization vectors.
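The character-by-character Beam Search decoding named above can be sketched as follows. This is a minimal illustration, not the patent's decoder: `next_char_probs` is a hypothetical stand-in for a decoder conditioned on an optimization vector, and the end marker `$`, the beam width and the toy probability table are all assumptions made for the example.

```python
import math

def beam_search(next_char_probs, beam_width=3, max_len=10, end="$"):
    """Return the highest-probability string found by beam search.

    next_char_probs(prefix) -> dict mapping each candidate next character
    to its probability given the prefix (a stand-in for the decoder).
    """
    beams = [("", 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for ch, p in next_char_probs(prefix).items():
                if p <= 0.0:
                    continue
                item = (prefix + ch, logp + math.log(p))
                # Strings that emit the end marker stop growing.
                (finished if ch == end else candidates).append(item)
        if not candidates:
            break
        # Keep only the beam_width most probable unfinished prefixes.
        candidates.sort(key=lambda t: t[1], reverse=True)
        beams = candidates[:beam_width]
    pool = finished or beams
    best, _ = max(pool, key=lambda t: t[1])
    return best.rstrip(end)

def toy_model(prefix):
    """Hypothetical next-character table; unknown prefixes end the string."""
    table = {
        "": {"C": 0.9, "N": 0.1},
        "C": {"C": 0.8, "O": 0.1, "$": 0.1},
        "CC": {"O": 0.9, "$": 0.1},
    }
    return table.get(prefix, {"$": 1.0})

decoded = beam_search(toy_model, beam_width=2)
```

Applying the same routine to every optimization vector and collecting the results would give a set of strings corresponding to the target molecule character string set.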
The scheme for optimizing the druggability of lead compounds based on a machine translation model in the embodiments of the present disclosure comprises the following steps: training a translation model by using a preset number of sample molecule character strings, wherein the translation model comprises an encoder and a decoder; establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group; inputting the initial molecular character string into the encoder to generate a target vector; inputting the target vector into the prediction model group according to the received optimization instruction to obtain the optimization prediction indices corresponding to the optimization instruction; performing a weighted average calculation over the optimization prediction indices and the calculation indices corresponding to the initial molecular character string to obtain the score of the initial molecular character string; iterating an optimization algorithm a preset number of times according to the target vector and the score to obtain an optimization score set, wherein the optimization score set comprises a plurality of optimization vectors and the optimization score corresponding to each optimization vector; and inputting the optimization score set into the decoder, and calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set.
The beneficial effects of the embodiments of the present disclosure are as follows: a calculation model is built for each pharmacokinetic endpoint to be optimized, so each index of the initial molecule can be optimized individually; iterative optimization is performed after the weighted average calculation; and the iteration results are collated and output as a fixed target molecule character string set, which improves optimization efficiency and adaptability.
Drawings
The drawings needed for the embodiments are briefly described below to illustrate the technical solutions of the embodiments of the present disclosure more clearly. The drawings described below show only some embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the present disclosure;
FIG. 2 is a schematic partial flow chart of a method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the present disclosure;
FIG. 3 is a schematic partial flow chart of another method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a specific implementation process of a method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The embodiments of the present disclosure provide a method for optimizing the druggability of lead compounds based on a machine translation model, which can be applied to lead compound druggability optimization in computer-aided drug design.
Referring to FIG. 1, a schematic flow chart of a method for optimizing the druggability of lead compounds based on a machine translation model is provided according to an embodiment of the present disclosure. As shown in FIG. 1, the method mainly comprises the following steps:
S101, training a translation model by using a preset number of sample molecule character strings, wherein the translation model comprises an encoder and a decoder;
In implementation, the translation model can be built on a neural network language model and then trained with a preset number of sample molecule character strings. This improves the translation accuracy and the richness of the chemical space the model covers, so that the subsequent optimization process is more accurate.
S102, establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group;
In implementation, the optimization process needs to make targeted improvements to the pharmacokinetic (ADMET) properties of the molecule. The pharmacokinetic endpoints that generally affect druggability mainly comprise 9 important ADMET endpoints: LogD7.4, LogS, Caco-2, MDCK cells, plasma protein binding rate (PPB), AMES toxicity, cardiotoxicity (hERG), hepatotoxicity, and median lethal dose (LD50) toxicity. Calculation models corresponding to a plurality of pharmacokinetic endpoints can be established with a machine learning algorithm to form the prediction model group. The models can be built for these 9 important ADMET endpoints or for any other ADMET endpoints, which are not listed here.
S103, inputting the initial molecular character string into the encoder to generate a target vector;
In implementation, the SMILES character string corresponding to the lead compound molecule to be optimized may be used as the initial molecular character string, which is then input into the encoder to generate the target vector.
For example, to avoid the vanishing or exploding gradients that affect plain recurrent neural networks (RNNs), both the encoder and the decoder employ a 3-layer stack of gated recurrent units (GRUs), with the layers containing 256, 512 and 1024 units respectively. The last layer of the encoder is a fully connected layer (the information bottleneck) with 512 units and a hyperbolic tangent activation function, which generates a 512-dimensional vector as the target vector. The 512-dimensional vector obtained after passing through the information bottleneck represents the most salient statistical features of the SMILES string. Of course, the specific configuration of the encoder and the dimension of the generated target vector may be set according to actual needs.
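For reference, one GRU layer is commonly written as below. This is the standard textbook formulation rather than anything stated in the patent, and references differ on which of the two terms in the last state update is weighted by z_t. The final line is the tanh bottleneck described above; the symbols W_b, b_b and h_enc (whatever final encoder state is fed to the bottleneck) are names assumed here for illustration.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
v &= \tanh(W_b h_{\mathrm{enc}} + b_b), \qquad v \in \mathbb{R}^{512}
\end{aligned}
```

The update gate z_t and reset gate r_t let the state h_t carry information across many characters, which is the property that mitigates vanishing gradients compared with a plain RNN.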
S104, inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction;
In a specific implementation, the optimization instruction may target specific pharmacokinetic endpoints of the initial molecular character string. For example, when the optimization instruction is to optimize the plasma protein binding rate (PPB), cardiotoxicity (hERG), hepatotoxicity and median lethal dose (LD50) of the initial molecular character string, the target vector is input into the prediction model group, and the calculation models corresponding to PPB, hERG, hepatotoxicity and LD50 in the prediction model group analyze the target vector to obtain the optimization prediction indices corresponding to the optimization instruction.
S105, carrying out weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular character string to obtain the score of the initial molecular character string;
The calculation indices corresponding to the initial molecular character string can be computed directly from that string. After the optimization prediction indices are obtained, a weighted average can be calculated according to the influence of each index on druggability to obtain the score of the initial molecular character string. This prevents the optimization from focusing only on improving properties while ignoring important structural information, which would produce unexpected molecules.
S106, according to the target vector and the score, utilizing an optimization algorithm to iterate for preset times to obtain an optimization score set, wherein the optimization score set comprises a plurality of optimization vectors and optimization scores corresponding to the optimization vectors;
In implementation, to further improve optimization efficiency, after the score of the initial molecular character string is obtained, an optimization algorithm can be iterated a preset number of times according to the target vector and the score to obtain a plurality of optimization vectors and the optimization score corresponding to each, which together form the optimization score set.
For example, the optimization vectors and optimization scores may be computed by combining the particle swarm optimization (PSO) algorithm with the translation model, which achieves efficient molecular optimization. PSO is a stochastic optimization method that simulates swarm intelligence: a number of particles search the space, and the optimum is found by recording and comparing the information they gather. The state of each particle in the swarm is defined by its position x and velocity v, and the score f is used to explore the space and guide the optimization. Here, the position x is a 512-dimensional vector and the score f is the optimization score. The motion of the i-th particle at iteration step k is influenced by its own historical best position and by the historical best position of all particles. After each iteration, each particle updates its velocity and position according to the collected information and its own state, and the optimization vectors in the optimization score set may then be sorted by optimization score.
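The particle update described above can be illustrated with a minimal sketch. It assumes the widely used PSO velocity rule with an inertia weight `w` and acceleration coefficients `c1` and `c2`; these coefficient values, the function name `pso_maximize` and the toy score function are illustrative assumptions and do not come from the patent. A 4-dimensional position stands in for the 512-dimensional vector.

```python
import random

def pso_maximize(score, start, n_particles=20, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize score(x) with a basic particle swarm seeded around `start`.

    Returns (score, position) pairs for each particle's best position,
    sorted from highest to lowest score.
    """
    rng = random.Random(seed)
    dim = len(start)
    # Particles begin as small random perturbations of the starting vector.
    xs = [[s + rng.uniform(-0.5, 0.5) for s in start] for _ in range(n_particles)]
    xs[0] = list(start)  # keep the unperturbed starting point in the swarm
    vs = [[0.0] * dim for _ in range(n_particles)]
    p_best = [list(x) for x in xs]          # each particle's historical best
    p_best_f = [score(x) for x in xs]
    g = max(range(n_particles), key=lambda i: p_best_f[i])
    g_best, g_best_f = list(p_best[g]), p_best_f[g]  # best over all particles

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward swarm best.
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (p_best[i][d] - xs[i][d])
                            + c2 * r2 * (g_best[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            f = score(xs[i])
            if f > p_best_f[i]:
                p_best[i], p_best_f[i] = list(xs[i]), f
                if f > g_best_f:
                    g_best, g_best_f = list(xs[i]), f
    return sorted(zip(p_best_f, p_best), key=lambda t: t[0], reverse=True)

def toy_score(x):
    """Hypothetical score with its maximum (0.0) at x = (1, 1, 1, 1)."""
    return -sum((xi - 1.0) ** 2 for xi in x)

ranked = pso_maximize(toy_score, start=[0.0, 0.0, 0.0, 0.0])
```

In the setting described here, `score` would wrap the prediction model group and the weighted scoring, and the ranked positions would play the role of the optimization score set.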
S107, inputting the optimization score set into the decoder, and calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set.
In implementation, after the optimization score set is obtained, it may be input to the decoder, which decodes each optimization vector in the set into a canonical SMILES character string. These strings form the target molecule character string set, which contains a plurality of molecules optimized according to the optimization instruction and facilitates subsequent verification and application.
According to the method for optimizing the druggability of lead compounds based on a machine translation model provided in this embodiment, a calculation model is built for each pharmacokinetic endpoint to be optimized, so each index of the initial molecule can be optimized individually; iterative optimization is performed after the weighted average calculation; and the iteration results are collated and output as a fixed target molecule character string set, which improves optimization efficiency and adaptability.
Based on the above embodiment, training the translation model by using a preset number of sample molecule character strings in step S101 includes:
inputting each sample molecule character string into the encoder, and inputting the output of the encoder into the decoder;
and computing a loss between each output of the decoder and the true label of the corresponding sample molecule character string, and performing a gradient update.
In implementation, each sample molecule character string may be input to the encoder, which generates the corresponding 512-dimensional vector. Each 512-dimensional vector is then input to the decoder, a loss is computed between the decoder output and the true label, and a gradient update is performed to improve the translation accuracy of the translation model.
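The loss between a decoder output and its true label can be illustrated with a small sketch. The patent does not name the loss function; per-character cross-entropy is assumed here because it is a common choice for sequence models, and the function name and input layout are hypothetical.

```python
import math

def sequence_cross_entropy(predicted, target):
    """Mean negative log-likelihood of the true characters.

    predicted: one dict per position mapping character -> probability,
               standing in for the decoder's output distribution.
    target:    the true string (the label), same length as `predicted`.
    """
    if not target or len(predicted) != len(target):
        raise ValueError("prediction and target must be non-empty and equal in length")
    eps = 1e-12  # avoids log(0) when the true character gets zero probability
    nll = [-math.log(max(dist.get(ch, 0.0), eps))
           for dist, ch in zip(predicted, target)]
    return sum(nll) / len(nll)

# A perfect prediction gives zero loss; a 50/50 guess gives log(2) per character.
perfect = sequence_cross_entropy([{"C": 1.0}, {"O": 1.0}], "CO")
uncertain = sequence_cross_entropy([{"C": 0.5, "N": 0.5}, {"O": 0.5, "S": 0.5}], "CO")
```

During training, the gradient of such a loss with respect to the encoder and decoder weights would drive the gradient update mentioned above.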
On the basis of the above embodiment, as shown in FIG. 2, step S102 of establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group includes:
S201, extracting a sample data set from an initial database;
For example, the sample data set may be formed by searching the ChEMBL, EPA and DrugBank databases and collecting data from the literature to obtain an ADMET data set, and then preprocessing that data set to filter out interfering and invalid data.
S202, extracting data corresponding to each pharmacokinetic endpoint from the sample data set to train an XGBoost algorithm, and obtaining a calculation model corresponding to each pharmacokinetic endpoint;
For example, the corresponding data may be extracted from the sample data set for a total of 9 important ADMET endpoints: LogD7.4, LogS, Caco-2, MDCK cells, plasma protein binding rate (PPB), AMES toxicity, cardiotoxicity (hERG), hepatotoxicity, and median lethal dose (LD50) toxicity. The data for each ADMET endpoint may then be learned with the XGBoost algorithm to establish a calculation model corresponding to each pharmacokinetic endpoint. Of course, other machine learning algorithms may also be used for learning and modeling.
S203, forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints.
After the calculation model corresponding to each pharmacokinetic endpoint is obtained, the prediction model group is formed from the calculation models corresponding to all the pharmacokinetic endpoints.
Further, the step of inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction includes:
analyzing the pharmacokinetic endpoint contained in the optimization instructions;
selecting a corresponding calculation model from the prediction model group according to the pharmacokinetic endpoint contained in the optimization instruction;
and respectively inputting the target vector into each calculation model to obtain a prediction index corresponding to each pharmacokinetic endpoint, and forming the optimized prediction index.
For example, when the optimization instruction specifies optimizing the plasma protein binding rate (PPB), cardiotoxicity (hERG), hepatotoxicity and median lethal dose (LD50) of the initial molecular character string, the calculation models corresponding to PPB, hERG, hepatotoxicity and LD50 are selected from the prediction model group, and the target vector is input into each of them to obtain the prediction index corresponding to each pharmacokinetic endpoint. Together these form the optimization prediction indices.
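The parse, select and predict steps above can be sketched with a dictionary of models keyed by endpoint name. Everything in the sketch is illustrative: the function name `predict_endpoints` is hypothetical, and the lambdas are placeholders for the trained calculation models (XGBoost models in the example given in this description), not real predictors.

```python
def predict_endpoints(vector, model_group, instruction):
    """Run only the calculation models named in the optimization instruction.

    vector:      the target vector produced by the encoder.
    model_group: dict mapping endpoint name -> callable(vector) -> float.
    instruction: list of endpoint names to optimize.
    """
    missing = [name for name in instruction if name not in model_group]
    if missing:
        raise KeyError(f"no calculation model for endpoints: {missing}")
    return {name: model_group[name](vector) for name in instruction}

# Placeholder models; real ones would be trained per endpoint.
model_group = {
    "PPB": lambda v: sum(v) / len(v),
    "hERG": lambda v: max(v),
    "hepatotoxicity": lambda v: min(v),
    "LD50": lambda v: v[0],
    "LogS": lambda v: -v[0],
}

indices = predict_endpoints([0.2, 0.4, 0.6], model_group, ["PPB", "hERG"])
```

Keeping one model per endpoint means an instruction can name any subset of endpoints without retraining or touching the other models.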
Optionally, before the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimized prediction index and the calculation index corresponding to the initial molecular string, the method further includes:
setting a corresponding weight for each of the pharmacokinetic endpoints and the computational metrics;
setting a common property range and a preset property range corresponding to each pharmacokinetic endpoint and the calculation index, wherein the common property range is larger than the preset property range.
In a specific implementation, to support multi-objective optimization tasks and to quantify the expected values of the optimized molecule, a corresponding weight can be set for each pharmacokinetic endpoint and each calculation index, together with a common property range and a preset property range for each of them, so that the patent drug properties of the lead compound are optimized and the expected molecules are generated.
Further, the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimized prediction index and the calculation index corresponding to the initial molecular string includes:
calculating a first predicted value according to the weight and the prediction index of each pharmacokinetic endpoint, and calculating a second predicted value according to the value and the weight of the calculation index;
and determining a predicted score corresponding to each predicted value according to the property range of each predicted value, and forming the score of the initial molecular character string.
For example, a first predicted value may be calculated from the weight and the prediction index of each pharmacokinetic endpoint, and a second predicted value from the value and the weight of the calculation index. The range in which each predicted value falls is then determined: if the predicted value is within the preset property range, its property score is 1; if it is outside the preset property range but still within the common property range, its property score is a value in (0, 1) determined by its distance from the target range; and if it is outside the common property range, its property score is 0.
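The range-based scoring can be sketched as follows. Several details are assumptions rather than statements from the text: the score between the two ranges is taken to fall off linearly with distance from the preset range, each property is scored first and the weighted average is taken afterwards, and the property names, ranges and weights are invented for the example.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]


def property_score(value: float, preset: Range, common: Range) -> float:
    """Score one property.

    1 inside the preset range, 0 outside the common range, and a value that
    decays linearly with distance in between (the linear form is an assumption).
    """
    lo, hi = preset
    c_lo, c_hi = common
    if lo <= value <= hi:
        return 1.0
    if value < c_lo or value > c_hi:
        return 0.0
    if value < lo:
        return (value - c_lo) / (lo - c_lo)
    return (c_hi - value) / (c_hi - hi)


def molecule_score(
    values: Dict[str, float],
    weights: Dict[str, float],
    presets: Dict[str, Range],
    commons: Dict[str, Range],
) -> float:
    """Weighted average of the per-property scores."""
    total_weight = sum(weights[name] for name in values)
    weighted = sum(
        weights[name] * property_score(values[name], presets[name], commons[name])
        for name in values
    )
    return weighted / total_weight


# Hypothetical settings: two predicted endpoints and one directly calculated index.
presets = {"PPB": (0.0, 0.9), "hERG": (0.0, 0.3), "calc_index": (2.0, 3.0)}
commons = {"PPB": (0.0, 1.0), "hERG": (0.0, 0.7), "calc_index": (0.0, 5.0)}
weights = {"PPB": 1.0, "hERG": 2.0, "calc_index": 1.0}
values = {"PPB": 0.8, "hERG": 0.5, "calc_index": 6.0}

score = molecule_score(values, weights, presets, commons)
```

In this example PPB lies in its preset range (score 1), hERG lies halfway between the preset and common bounds (score about 0.5), and the calculated index lies outside its common range (score 0), so the weighted average is about 0.5.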
On the basis of the above embodiment, as shown in fig. 3, in step S107, a preset algorithm is used to calculate a string corresponding to each of the optimization vectors, so as to form a target molecule string set, including:
S301, predicting each character according to a Beam Search algorithm and the optimization vector until a character string is formed;
In particular, Beam Search is a heuristic search algorithm that explores the best combination of characters by expanding only the most promising nodes in a limited set at each step. Each optimization vector in the optimization score set can be substituted into the Beam Search algorithm, which iteratively predicts one character at a time for that vector until a complete string sequence is formed. Of course, other algorithms may also be used for decoding.
S302, forming the target molecule character string set according to character strings corresponding to all the optimization vectors.
In a specific implementation, the same steps are repeated until a corresponding character string has been generated for each optimization vector, and the character strings corresponding to all the optimization vectors then form the target molecule character string set. The specific optimization flow of the lead compound patent drug property optimization method based on the machine translation model provided by the embodiment of the disclosure is shown in fig. 4, and the target molecule character string set is finally generated.
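Steps S301 and S302 can be sketched as a simplified beam search. The sketch rests on assumptions that are not in the text: `toy_decoder` is a hypothetical stand-in for the trained decoder's next-character distribution, `END` is an invented end-of-string token, and decoding simply runs until every surviving hypothesis has ended or `max_len` is reached.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
NextCharFn = Callable[[Vector, str], Dict[str, float]]
END = "$"  # hypothetical end-of-string token


def beam_search(vector: Vector, next_char_probs: NextCharFn,
                beam_width: int = 3, max_len: int = 20) -> str:
    """Decode one string, keeping only the `beam_width` best prefixes per step."""
    beams: List[Tuple[float, str]] = [(0.0, "")]  # (log-probability, prefix)
    finished: List[Tuple[float, str]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, str]] = []
        for log_p, prefix in beams:
            for ch, p in next_char_probs(vector, prefix).items():
                if p <= 0.0:
                    continue
                cand = (log_p + math.log(p), prefix + ch)
                # Hypotheses that emit END are complete and leave the beam.
                (finished if ch == END else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    best = max(finished or beams, key=lambda c: c[0])
    return best[1].rstrip(END)


def toy_decoder(vector: Vector, prefix: str) -> Dict[str, float]:
    """Stand-in decoder (assumption): the vector picks a preferred character."""
    preferred = "C" if vector[0] > 0 else "N"
    if len(prefix) < 3:
        return {preferred: 0.7, "O": 0.2, END: 0.1}
    return {END: 1.0}


# S302: decode every optimization vector and collect the resulting strings.
optimization_vectors = [(1.0,), (-1.0,)]
target_strings = [beam_search(v, toy_decoder) for v in optimization_vectors]
```

A wider beam explores more alternative prefixes at a higher cost; with `beam_width=1` the procedure reduces to greedy decoding.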
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (4)

1. A method for optimizing the patent drug properties of a lead compound based on a machine translation model, comprising the steps of:
training a translation model by using a preset number of sample molecule strings, wherein the translation model comprises an encoder and a decoder;
establishing a plurality of calculation models corresponding to the pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group;
the step of establishing a plurality of calculation models corresponding to the pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group comprises the following steps:
extracting a sample dataset from within an initial database;
extracting data corresponding to each pharmacokinetic endpoint from the sample data set to train an XGBoost algorithm, and obtaining a calculation model corresponding to each pharmacokinetic endpoint;
forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints;
inputting the initial molecular character string into the encoder to generate a target vector;
inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction;
the step of inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction comprises the following steps:
analyzing the pharmacokinetic endpoint contained in the optimization instructions;
selecting a corresponding calculation model from the prediction model group according to the pharmacokinetic endpoint contained in the optimization instruction;
inputting the target vector into each calculation model respectively to obtain a prediction index corresponding to each pharmacokinetic endpoint, and forming the optimized prediction index;
performing weighted average calculation according to the optimization prediction index and a calculation index corresponding to the initial molecular character string to obtain a score of the initial molecular character string, wherein the calculation index is directly calculated according to the initial molecular character string;
the step of obtaining the score of the initial molecular character string by carrying out weighted average calculation according to the optimized prediction index and the calculation index corresponding to the initial molecular character string comprises the following steps:
calculating a first predicted value according to the weight and the predicted index of each pharmacokinetic endpoint, and calculating a second predicted value according to the value and the weight of the calculated index;
determining a predicted score corresponding to each predicted value according to the property range of each predicted value, and forming a score of the initial molecular character string;
the step of determining a prediction score corresponding to each predicted value according to the property range of each predicted value and forming the score of the initial molecular character string comprises the following steps:
judging the range of each predicted value, wherein if the predicted value is within the preset property range, the property score corresponding to the predicted value is 1; if the predicted value is outside the preset property range but still within the common property range, the property score corresponding to the predicted value is a value in (0, 1) determined according to the distance from the target range; and if the predicted value is outside the common property range, the property score corresponding to the predicted value is 0;
according to the target vector and the score, an optimization score set is obtained by utilizing an optimization algorithm to iterate for preset times, wherein the optimization score set comprises a plurality of optimization vectors and optimization scores corresponding to the optimization vectors;
inputting the optimized score set into the decoder, and calculating the character strings corresponding to each optimized vector by using a preset algorithm to form a target molecule character string set.
2. The method of claim 1, wherein the step of training a translation model using a predetermined number of sample molecule strings comprises:
inputting each sample molecule character string into the encoder, and inputting the output result of the encoder into the decoder;
and computing a loss between each output result of the decoder and the true label of the corresponding sample molecule character string, and performing a gradient update.
3. The method according to claim 1, wherein before the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular string, the method further comprises:
setting a corresponding weight for each of the pharmacokinetic endpoints and the computational metrics;
setting a common property range and a preset property range corresponding to each pharmacokinetic endpoint and the calculation index, wherein the common property range is larger than the preset property range.
4. The method according to claim 1, wherein the step of calculating the character string corresponding to each of the optimization vectors by using a preset algorithm to form a target molecule character string set includes:
predicting each character according to the Beam Search algorithm and the optimization vector until a character string is formed;
and forming the target molecule character string set according to the character strings corresponding to all the optimization vectors.
CN202110992135.0A 2021-08-27 2021-08-27 Lead compound patent drug property optimization method based on machine translation model Active CN113707234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110992135.0A CN113707234B (en) 2021-08-27 2021-08-27 Lead compound patent drug property optimization method based on machine translation model


Publications (2)

Publication Number Publication Date
CN113707234A CN113707234A (en) 2021-11-26
CN113707234B true CN113707234B (en) 2023-09-05

Family

ID=78655608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110992135.0A Active CN113707234B (en) 2021-08-27 2021-08-27 Lead compound patent drug property optimization method based on machine translation model

Country Status (1)

Country Link
CN (1) CN113707234B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034687A (en) * 2012-11-29 2013-04-10 中国科学院自动化研究所 Correlation module identifying method based on 2-type heterogeneous network
CN103294933A (en) * 2013-05-10 2013-09-11 司宏宗 Drug screening method
WO2019018780A1 (en) * 2017-07-20 2019-01-24 The University Of North Carolina At Chapel Hill Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence
JP2019020791A (en) * 2017-07-12 2019-02-07 国立大学法人岐阜大学 Toxicity predicting method and utilization thereof
WO2020016579A2 (en) * 2018-07-17 2020-01-23 Gtn Ltd Machine learning based methods of analysing drug-like molecules
WO2020051714A1 (en) * 2018-09-13 2020-03-19 Cyclica Inc. Method and system for predicting properties of chemical structures
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generation of confrontation network
CN111402967A (en) * 2020-03-12 2020-07-10 中南大学 Method for improving virtual screening capability of docking software based on machine learning algorithm
CN111755078A (en) * 2020-07-30 2020-10-09 腾讯科技(深圳)有限公司 Drug molecule attribute determination method, device and storage medium
CN112071373A (en) * 2020-09-02 2020-12-11 深圳晶泰科技有限公司 Drug molecule screening method and system
CN112116963A (en) * 2020-09-24 2020-12-22 深圳智药信息科技有限公司 Automated drug design method, system, computing device and computer-readable storage medium
CN112133447A (en) * 2020-08-14 2020-12-25 中南大学 Construction method of colloid screening model and colloid screening method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3712897A1 (en) * 2019-03-22 2020-09-23 Tata Consultancy Services Limited Automated prediction of biological response of chemical compounds based on chemical information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
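Cao Dongsheng et al. Research on semi-supervised manifold learning algorithms based on the Markov property. Scientia Sinica Mathematica. 2015, Vol. 45 (No. 5), 703-712. *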

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant