CN113707234B

CN113707234B - Method for optimizing the druggability of lead compounds based on a machine translation model

Info

Publication number: CN113707234B
Application number: CN202110992135.0A
Authority: CN
Inventors: 曹东升; 付丽; 杨梓宜
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2023-09-05
Anticipated expiration: 2041-08-27
Also published as: CN113707234A

Abstract

Embodiments of the disclosure provide a method for optimizing the druggability of lead compounds based on a machine translation model, which belongs to the technical field of healthcare informatics. The method comprises: training a translation model; establishing, by means of a machine learning algorithm, calculation models corresponding to a plurality of pharmacokinetic endpoints to form a prediction model group; inputting an initial molecular character string into an encoder to generate a target vector; inputting the target vector into the prediction model group according to a received optimization instruction to obtain the optimization prediction index corresponding to the instruction; performing a weighted average calculation over the optimization prediction index and the calculation index corresponding to the initial molecular character string to obtain a score for the initial molecular character string; iterating an optimization algorithm a preset number of times, according to the target vector and the score, to obtain an optimization score set; and inputting the optimization score set into a decoder and calculating, with a preset algorithm, the character string corresponding to each optimization vector to form a target molecule character string set. The scheme improves optimization efficiency and adaptability.

Description

Method for optimizing the druggability of lead compounds based on a machine translation model

Technical Field

Embodiments of the disclosure relate to the technical field of healthcare informatics, and in particular to a method for optimizing the druggability of lead compounds based on a machine translation model.

Background

Optimizing lead compounds efficiently is one of the biggest challenges in drug development, and a major challenge for medicinal chemists. More than 50% of compounds fail during drug development because they lack suitable absorption, distribution, metabolism, excretion and toxicity (ADMET) properties. ADMET optimization is a difficult multi-objective task: the druggability of the molecule must be improved while its activity is maintained. In addition, the size of chemical space, the scarcity of prior experience, and the high cost and long duration of the work make it hard to optimize the pharmacokinetic properties and safety of a compound efficiently. In the prior art, either new molecules are generated computationally and then filtered with a virtual screening program to obtain candidate compounds, which requires a very large amount of computation, or a prediction model is applied to the whole molecule, which cannot optimize specific druggability indices. The optimized molecules therefore have low druggability, and the optimization efficiency and adaptability are poor.

Therefore, there is a need for a method, based on a machine translation model, for optimizing the druggability of lead compounds with high optimization efficiency and adaptability.

Disclosure of Invention

In view of the above, embodiments of the present disclosure provide a method for optimizing the druggability of lead compounds based on a machine translation model, which at least partially solves the problems of poor optimization efficiency and adaptability in the prior art.

In a first aspect, embodiments of the present disclosure provide a method for optimizing the druggability of lead compounds based on a machine translation model, comprising:

training a translation model by using a preset number of sample molecule strings, wherein the translation model comprises an encoder and a decoder;

establishing a plurality of calculation models corresponding to the pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group;

inputting the initial molecular character string into the encoder to generate a target vector;

inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction;

performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular character string to obtain the score of the initial molecular character string;

iterating an optimization algorithm a preset number of times according to the target vector and the score to obtain an optimization score set, wherein the optimization score set comprises a plurality of optimization vectors and the optimization score corresponding to each optimization vector;

inputting the optimization score set into the decoder, and calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set.

According to a specific implementation manner of the embodiment of the present disclosure, the step of training the translation model by using a preset number of sample molecule strings includes:

inputting each sample molecule character string into the encoder, and inputting the output result of the encoder into the decoder;

and computing a loss between each output of the decoder and the true label of the corresponding sample molecule character string, and performing a gradient update.

According to a specific implementation manner of an embodiment of the disclosure, the step of establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model set includes:

extracting a sample dataset from within an initial database;

extracting data corresponding to each pharmacokinetic endpoint from the sample data set to train an XGBoost algorithm, and obtaining a calculation model corresponding to each pharmacokinetic endpoint;

and forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints.

According to a specific implementation manner of the embodiment of the present disclosure, the step of inputting the target vector into a prediction model group according to a received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction includes:

analyzing the pharmacokinetic endpoint contained in the optimization instructions;

selecting a corresponding calculation model from the prediction model group according to the pharmacokinetic endpoint contained in the optimization instruction;

and respectively inputting the target vector into each calculation model to obtain a prediction index corresponding to each pharmacokinetic endpoint, and forming the optimized prediction index.

According to a specific implementation manner of the embodiment of the present disclosure, before the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular string, the method further includes:

setting a corresponding weight for each of the pharmacokinetic endpoints and the computational metrics;

setting a common property range and a preset property range corresponding to each pharmacokinetic endpoint and the calculation index, wherein the common property range is larger than the preset property range.

According to a specific implementation manner of the embodiment of the present disclosure, the step of performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular string to obtain the score of the initial molecular string includes:

calculating a predicted value for each pharmacokinetic endpoint according to its weight and prediction index, and calculating a predicted value for the calculation index according to its value and weight;

and determining a predicted score corresponding to each predicted value according to the property range of each predicted value, and forming the score of the initial molecular character string.

According to a specific implementation manner of the embodiment of the present disclosure, the step of calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set includes:

predicting each character according to the Beam Search algorithm and the optimization vector until a character string is formed;

and forming the target molecule character string set according to the character strings corresponding to all the optimization vectors.

The scheme for optimizing the druggability of lead compounds based on a machine translation model in the embodiments of the disclosure comprises: training a translation model by using a preset number of sample molecule strings, wherein the translation model comprises an encoder and a decoder; establishing a plurality of calculation models corresponding to the pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group; inputting the initial molecular character string into the encoder to generate a target vector; inputting the target vector into the prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction; performing a weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular character string to obtain the score of the initial molecular character string; iterating an optimization algorithm a preset number of times according to the target vector and the score to obtain an optimization score set, wherein the optimization score set comprises a plurality of optimization vectors and the optimization score corresponding to each optimization vector; and inputting the optimization score set into the decoder, and calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set.

The beneficial effects of the embodiments of the disclosure are as follows: a calculation model is built for each pharmacokinetic endpoint to be optimized, so that each index of the initial molecule is optimized individually; iterative optimization is performed after the weighted average calculation; and the iteration results are collated and output as a fixed target molecule character string set, thereby improving optimization efficiency and adaptability.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a schematic flow chart of a method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the disclosure;

FIG. 2 is a schematic partial flow chart of the method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the disclosure;

FIG. 3 is another schematic partial flow chart of the method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the disclosure;

FIG. 4 is a schematic diagram of a specific implementation process of the method for optimizing the druggability of lead compounds based on a machine translation model according to an embodiment of the disclosure.

Detailed Description

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.

It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.

It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.

In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.

The embodiment of the disclosure provides a method for optimizing the druggability of lead compounds based on a machine translation model, which can be applied to lead compound druggability optimization in computer-aided drug design scenarios.

Referring to FIG. 1, a schematic flow chart of a method for optimizing the druggability of lead compounds based on a machine translation model is provided in an embodiment of the disclosure. As shown in FIG. 1, the method mainly comprises the following steps:

S101, training a translation model by using a preset number of sample molecule strings, wherein the translation model comprises an encoder and a decoder;

In implementation, the translation model may be built on a neural language model and then trained with a preset number of sample molecule character strings. This improves the translation accuracy and the coverage of chemical space, so that the subsequent optimization process is more accurate.

S102, establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group;

In implementation, the optimization process must make targeted improvements to the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of the molecule. The endpoints that generally affect druggability mainly comprise nine important ADMET endpoints: LogD7.4, LogS, Caco-2 and MDCK cell permeability, plasma protein binding rate (PPB), AMES toxicity, cardiotoxicity (hERG), hepatotoxicity, and median lethal dose (LD50) toxicity. Calculation models corresponding to a plurality of pharmacokinetic endpoints can be established with a machine learning algorithm to form the prediction model group. Models may be built for the nine endpoints above or for any other ADMET endpoints, which are not enumerated here.

S103, inputting the initial molecular character string into the encoder to generate a target vector;

In implementation, the SMILES string corresponding to the lead compound molecule to be optimized may be used as the initial molecular character string, which is then input into the encoder to generate the target vector.

For example, to avoid the vanishing-gradient or exploding-gradient problems of recurrent neural networks (RNNs), both the encoder and the decoder use three stacked gated recurrent unit (GRU) layers containing 256, 512 and 1024 units respectively. The last layer of the encoder is a fully connected layer (the information bottleneck) with 512 units and a hyperbolic tangent activation function, which produces a 512-dimensional vector as the target vector. This 512-dimensional vector, obtained after filtering through the information bottleneck, represents the most salient statistical features of the SMILES string. Of course, the specific configuration of the encoder and the dimension of the generated target vector may be set according to actual needs.

S104, inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction;

In implementation, the optimization instruction may target specific pharmacokinetic endpoints of the initial molecular character string. For example, when the instruction is to optimize the plasma protein binding rate (PPB), cardiotoxicity (hERG), hepatotoxicity and median lethal dose (LD50) of the initial molecular character string, the target vector is input into the prediction model group, and the calculation models corresponding to those four endpoints analyze the target vector to obtain the optimization prediction index corresponding to the optimization instruction.

S105, carrying out weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular character string to obtain the score of the initial molecular character string;

The calculation index corresponding to the initial molecular character string can be calculated directly from the string. After the optimization prediction index is obtained, a weighted average can be computed according to the influence of each index on druggability to obtain the score of the initial molecular character string. This avoids the situation in which the optimization attends only to improving properties, ignores important structural information, and generates undesired molecules.
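The weighted average described above can be sketched as follows. This is a minimal illustration only, assuming that every index has already been converted to a score between 0 and 1; the function name, the property names (including `calc_index`), the scores and the weights are hypothetical and are not taken from the disclosure.

```python
def weighted_score(property_scores, weights):
    """Weighted average of per-property scores, each assumed to lie in [0, 1]."""
    if set(property_scores) != set(weights):
        raise ValueError("scores and weights must cover the same properties")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(property_scores[k] * weights[k] for k in weights) / total_weight

# Illustrative values: two predicted endpoints plus one directly calculated index.
scores = {"PPB": 0.8, "hERG": 0.5, "calc_index": 1.0}
weights = {"PPB": 1.0, "hERG": 2.0, "calc_index": 1.0}
molecule_score = weighted_score(scores, weights)
print(round(molecule_score, 3))
```

Giving the directly calculated index its own weight is what lets the score penalize candidates that improve an endpoint while drifting away from the starting structure.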

S106, according to the target vector and the score, utilizing an optimization algorithm to iterate for preset times to obtain an optimization score set, wherein the optimization score set comprises a plurality of optimization vectors and optimization scores corresponding to the optimization vectors;

In implementation, to further improve optimization efficiency, after the score of the initial molecular character string is obtained, an optimization algorithm may be iterated a preset number of times according to the target vector and the score to obtain a plurality of optimization vectors and the optimization score corresponding to each optimization vector, which together form the optimization score set.

For example, particle swarm optimization (PSO) may be combined with the translation model to compute the optimization vectors and optimization scores, achieving efficient molecular optimization. PSO is a stochastic optimization method that simulates swarm intelligence: a plurality of particles search the space, and the optimum is found by recording and comparing the information they gather. The state of each particle in the swarm is defined by its position x and velocity v, and the score f is used to explore the space and guide the optimization. Here, the position x is a 512-dimensional vector and the score f is the optimization score. The motion of the i-th particle at iteration step k is influenced by its own historical best and by the historical best of all particles. After each iteration, every particle updates its velocity and position according to the collected information and its current state. The optimization vectors in the optimization score set can then be ranked by optimization score.
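The particle update described above can be sketched with a basic PSO loop. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the inertia and acceleration coefficients, the swarm size, the initial spread, and the toy score function are all hypothetical, and a 4-dimensional vector stands in for the 512-dimensional one.

```python
import random

def pso_maximize(score, start, n_particles=20, n_iters=50,
                 inertia=0.7, c_self=1.5, c_swarm=1.5, spread=0.5, seed=0):
    """Maximize score(x) with a basic particle swarm started around `start`."""
    rng = random.Random(seed)
    dim = len(start)
    # Particle 0 sits exactly on the start vector; the rest are perturbed copies.
    xs = [list(start)] + [[s + rng.uniform(-spread, spread) for s in start]
                          for _ in range(n_particles - 1)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    best_xs = [x[:] for x in xs]              # each particle's historical best
    best_fs = [score(x) for x in xs]
    g = max(range(n_particles), key=best_fs.__getitem__)
    g_x, g_f = best_xs[g][:], best_fs[g]      # historical best of all particles

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to own best + pull to swarm best.
                vs[i][d] = (inertia * vs[i][d]
                            + c_self * r1 * (best_xs[i][d] - xs[i][d])
                            + c_swarm * r2 * (g_x[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            f = score(xs[i])
            if f > best_fs[i]:
                best_xs[i], best_fs[i] = xs[i][:], f
                if f > g_f:
                    g_x, g_f = xs[i][:], f
    # Return (vector, score) pairs ordered from best to worst score.
    return sorted(zip(best_xs, best_fs), key=lambda pair: pair[1], reverse=True)

def toy_score(x):
    """Hypothetical score that peaks at 1.0 when every coordinate equals 1.0."""
    return 1.0 / (1.0 + sum((xi - 1.0) ** 2 for xi in x))

start = [0.0, 0.0, 0.0, 0.0]
optimized = pso_maximize(toy_score, start)
best_vector, best_score = optimized[0]
print(round(best_score, 3))
```

In the described method, `score` would correspond to encoding-space scoring by the prediction model group, and the returned list plays the role of the optimization score set.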

S107, inputting the optimization score set into the decoder, and calculating the character string corresponding to each optimization vector by using a preset algorithm to form a target molecule character string set.

In implementation, after the optimization score set is obtained, it may be input into the decoder, which decodes each optimization vector in the set into a canonical SMILES string. These strings form the target molecule character string set, which contains a plurality of molecules optimized according to the optimization instruction and is available for subsequent verification and application.

In the method of this embodiment for optimizing the druggability of lead compounds based on a machine translation model, a calculation model is built for each pharmacokinetic endpoint to be optimized, so that each index of the initial molecule is optimized individually; iterative optimization is performed after the weighted average calculation; and the iteration results are collated and output as a fixed target molecule character string set, thereby improving optimization efficiency and adaptability.

Based on the above embodiment, the training the translation model using the predetermined number of sample molecule strings in step S101 includes:

In implementation, each sample molecule string may be input into the encoder, which generates a 512-dimensional vector for each sample molecule string. The 512-dimensional vector of each sample molecule string is then input into the decoder, a loss is computed between the decoder output and the true label, and a gradient update is performed to improve the translation accuracy of the translation model.
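The disclosure does not name the loss function. As an illustration only, the sketch below assumes a token-level cross-entropy loss, a common choice for sequence models, computed between the decoder's predicted character probabilities and the true string. The function name and the probability values are hypothetical.

```python
import math

def reconstruction_loss(predicted_probs, target_tokens):
    """Mean token-level cross-entropy between decoder outputs and the true string.

    predicted_probs: one dict per position mapping token -> predicted probability.
    target_tokens:   the true token at each position (the true label).
    """
    if len(predicted_probs) != len(target_tokens):
        raise ValueError("prediction and target must have the same length")
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for probs, token in zip(predicted_probs, target_tokens):
        total += -math.log(max(probs.get(token, 0.0), eps))
    return total / len(target_tokens)

# Hypothetical decoder output for the SMILES string "CCO" (ethanol).
predicted = [
    {"C": 0.9, "O": 0.1},
    {"C": 0.8, "O": 0.2},
    {"C": 0.3, "O": 0.7},
]
loss = reconstruction_loss(predicted, list("CCO"))
print(f"loss = {loss:.4f}")
```

The loss falls toward zero as the decoder assigns higher probability to the correct character at each position, which is the quantity a gradient update would reduce.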

On the basis of the above embodiment, as shown in fig. 2, in step S102, a step of establishing a plurality of calculation models corresponding to pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model set includes:

S201, extracting a sample data set from an initial database;

For example, the ChEMBL, EPA and DrugBank databases may be searched and the literature surveyed to obtain ADMET data sets, which are then preprocessed to filter out interfering and invalid data, forming the sample data set.

S202, extracting data corresponding to each pharmacokinetic endpoint from the sample data set to train an XGBoost algorithm, and obtaining a calculation model corresponding to each pharmacokinetic endpoint;

For example, the data corresponding to the nine important ADMET endpoints (LogD7.4, LogS, Caco-2 and MDCK cell permeability, plasma protein binding rate (PPB), AMES toxicity, cardiotoxicity (hERG), hepatotoxicity, and median lethal dose (LD50) toxicity) may be extracted from the sample data set, the data for each ADMET endpoint may be learned with the XGBoost algorithm, and a calculation model corresponding to each pharmacokinetic endpoint may be established. Of course, other machine learning algorithms may also be used for learning and modeling.

S203, forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints.

And after obtaining the calculation model corresponding to each pharmacokinetic endpoint, forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints.

Further, the step of inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction includes:

In implementation, the pharmacokinetic endpoints contained in the optimization instruction are first parsed. For example, when the instruction is to optimize the plasma protein binding rate (PPB), cardiotoxicity (hERG), hepatotoxicity and median lethal dose (LD50) of the initial molecular character string, the calculation models corresponding to these four endpoints are selected from the prediction model group. The target vector is then input into each selected calculation model to obtain the prediction index corresponding to each pharmacokinetic endpoint, and these prediction indices together form the optimization prediction index.
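The selection step can be sketched as a lookup in a dictionary of per-endpoint models. This is a minimal sketch under stated assumptions: the stub models are hypothetical stand-ins for trained calculation models (they are not XGBoost models), and the endpoint keys, stub weights and the 3-dimensional toy vector are illustrative only.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]

def make_stub_model(weight: float) -> Model:
    """Stand-in for a trained per-endpoint model (hypothetical; not XGBoost)."""
    return lambda vec: weight * sum(vec) / len(vec)

# One model per endpoint; the names and weights are illustrative only.
prediction_model_group: Dict[str, Model] = {
    "PPB": make_stub_model(0.5),
    "hERG": make_stub_model(1.0),
    "hepatotoxicity": make_stub_model(1.5),
    "LD50": make_stub_model(2.0),
    "LogS": make_stub_model(-1.0),
}

def predict_for_instruction(endpoints: List[str],
                            target_vector: Vector) -> Dict[str, float]:
    """Select the models named in the optimization instruction and apply each."""
    unknown = [name for name in endpoints if name not in prediction_model_group]
    if unknown:
        raise KeyError(f"no model for endpoints: {unknown}")
    return {name: prediction_model_group[name](target_vector) for name in endpoints}

instruction = ["PPB", "hERG"]        # endpoints parsed from the instruction
target_vector = [0.2, 0.4, 0.6]      # toy stand-in for the 512-dimensional vector
indices = predict_for_instruction(instruction, target_vector)
print(indices)
```

Only the models named in the instruction are evaluated, so adding a new endpoint amounts to registering one more entry in the group.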

Optionally, before the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimized prediction index and the calculation index corresponding to the initial molecular string, the method further includes:

In implementation, to ensure that the multi-objective optimization task can be carried out and to quantify the desired values of the optimized molecule, a corresponding weight may be set for each pharmacokinetic endpoint and each calculation index, together with a common property range and a preset property range for each of them. This ensures that the druggability of the lead compound is optimized and that the desired molecules are generated.

Further, the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimized prediction index and the calculation index corresponding to the initial molecular string includes:

For example, a predicted value may be calculated for each pharmacokinetic endpoint from its weight and prediction index, and a predicted value may be calculated for the calculation index from its value and weight. The range in which each predicted value falls is then determined: if the predicted value is within the preset property range, its property score is 1; if it is outside the preset property range but still within the common property range, its property score is a value in (0, 1) determined by its distance from the target range; and if it is outside the common property range, its property score is 0.
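The range-based scoring can be sketched as a piecewise function. The disclosure states only that the intermediate score depends on the distance from the target range; the linear decay used here is an assumption, and the function name and numeric ranges are hypothetical.

```python
def property_score(value, preset, common):
    """Score a value against a preset (target) range nested in a common range.

    Returns 1.0 inside the preset range and 0.0 outside the common range.
    In between, the score shrinks with distance from the preset range;
    linear decay is an assumption, not something the source specifies.
    """
    p_lo, p_hi = preset
    c_lo, c_hi = common
    if not (c_lo <= p_lo <= p_hi <= c_hi):
        raise ValueError("preset range must lie within the common range")
    if p_lo <= value <= p_hi:
        return 1.0
    if value < c_lo or value > c_hi:
        return 0.0
    if value < p_lo:
        return (value - c_lo) / (p_lo - c_lo)
    return (c_hi - value) / (c_hi - p_hi)

# Hypothetical ranges for a single property.
preset_range = (1.0, 3.0)
common_range = (0.0, 5.0)
for v in (2.0, 0.5, 4.0, 6.0):
    print(v, property_score(v, preset_range, common_range))
```

Because every property is mapped to the same 0 to 1 scale, properties with different units can be combined in one weighted average.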

On the basis of the above embodiment, as shown in fig. 3, in step S107, a preset algorithm is used to calculate a string corresponding to each of the optimization vectors, so as to form a target molecule string set, including:

S301, predicting each character according to a Beam Search algorithm and the optimization vector until a character string is formed;

In implementation, Beam Search is a heuristic search algorithm that explores the best combination of characters by expanding only the most promising nodes in a limited set. Each optimization vector in the optimization score set can be passed to the Beam Search algorithm, which iteratively predicts the characters for that vector one at a time until a complete string is formed. Of course, other algorithms may also be used for decoding.
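A character-level Beam Search can be sketched as follows. This is a minimal sketch, assuming a callable that returns next-character probabilities for a given prefix; in the described method that role would be played by the decoder conditioned on an optimization vector. The toy probability table, the end marker `$`, and all function names are hypothetical.

```python
import math

def beam_search(next_char_probs, beam_width=3, max_len=20, end="$"):
    """Decode a string by keeping the `beam_width` most probable prefixes."""
    beams = [("", 0.0)]          # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for ch, p in next_char_probs(prefix).items():
                if p <= 0.0:
                    continue
                item = (prefix + ch, logp + math.log(p))
                if ch == end:
                    finished.append(item)    # complete string
                else:
                    candidates.append(item)  # still being extended
        if not candidates:
            break
        # Keep only the most promising partial strings.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    pool = finished or beams
    best, _ = max(pool, key=lambda b: b[1])
    return best.rstrip(end)

def toy_decoder(prefix):
    """Hypothetical next-character table standing in for the trained decoder."""
    table = {
        "": {"C": 0.9, "O": 0.1},
        "C": {"C": 0.7, "O": 0.2, "$": 0.1},
        "CC": {"O": 0.8, "$": 0.2},
        "CCO": {"$": 1.0},
    }
    return table.get(prefix, {"$": 1.0})

print(beam_search(toy_decoder, beam_width=2))  # prints "CCO"
```

Keeping several prefixes alive, rather than only the single most probable one, lets the search recover strings whose early characters are not the locally best choice.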

S302, forming the target molecule character string set according to character strings corresponding to all the optimization vectors.

In implementation, the same steps are repeated until a corresponding character string has been generated for every optimization vector, and the character strings corresponding to all the optimization vectors then form the target molecule character string set. The specific optimization flow of the method for optimizing the druggability of lead compounds based on a machine translation model provided by the embodiment of the disclosure is shown in FIG. 4, and ends with the generation of the target molecule character string set.

The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A method for optimizing the druggability of a lead compound based on a machine translation model, comprising the steps of:

the step of establishing a plurality of calculation models corresponding to the pharmacokinetic endpoints according to a machine learning algorithm to form a prediction model group comprises the following steps:

extracting a sample dataset from within an initial database;

forming the prediction model group according to the calculation models corresponding to all the pharmacokinetic endpoints;

the step of inputting the target vector into a prediction model group according to the received optimization instruction to obtain an optimization prediction index corresponding to the optimization instruction comprises the following steps:

inputting the target vector into each calculation model respectively to obtain a prediction index corresponding to each pharmacokinetic endpoint, and forming the optimized prediction index;

performing weighted average calculation according to the optimization prediction index and a calculation index corresponding to the initial molecular character string to obtain a score of the initial molecular character string, wherein the calculation index is directly calculated according to the initial molecular character string;

the step of obtaining the score of the initial molecular character string by carrying out weighted average calculation according to the optimized prediction index and the calculation index corresponding to the initial molecular character string comprises the following steps:

calculating a first predicted value according to the weight and the predicted index of each pharmacokinetic endpoint, and calculating a second predicted value according to the value and the weight of the calculated index;

determining a predicted score corresponding to each predicted value according to the property range of each predicted value, and forming a score of the initial molecular character string;

the step of determining a prediction score corresponding to each predicted value according to the property range of each predicted value and forming the score of the initial molecular character string comprises the following steps:

determining the range in which each predicted value falls: if the predicted value is within the preset property range, the property score corresponding to the predicted value is 1; if the predicted value is outside the preset property range but still within the common property range, the property score is a value in (0, 1) determined according to the distance from the target range; and if the predicted value is outside the common property range, the property score corresponding to the predicted value is 0;

2. The method of claim 1, wherein the step of training a translation model using a predetermined number of sample molecule strings comprises:

3. The method according to claim 1, wherein before the step of obtaining the score of the initial molecular string by performing weighted average calculation according to the optimization prediction index and the calculation index corresponding to the initial molecular string, the method further comprises:

4. The method according to claim 1, wherein the step of calculating the character string corresponding to each of the optimization vectors by using a preset algorithm to form a target molecule character string set includes:

predicting each character according to the Beam Search algorithm and the optimization vector until a character string is formed;