CN114842924A

CN114842924A - Optimized de novo drug design method

Info

Publication number: CN114842924A
Application number: CN202210396113.2A
Authority: CN
Inventors: 刘奇磊; 张磊; 赵雨靓; 都健; 孟庆伟
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2022-04-15
Filing date: 2022-04-15
Publication date: 2022-08-02

Abstract

The invention relates to an optimization-based de novo drug design method and belongs to the fields of computer-aided drug design, bioinformatics, and pharmaceutical informatics. A computer-aided molecular design method based on a mixed integer nonlinear programming model is used to design drug candidates and is coupled with a deep learning model that predicts target-ligand binding affinity. As one of the most important properties in drug design, binding affinity serves as the objective function of the mixed integer nonlinear programming model. Complex drug structures are converted into drug scaffolds using the Bemis-Murcko algorithm, and the scaffolds are incorporated into the computer-aided molecular design method to ensure that the designed drug candidates are structurally reasonable. In addition, a scaffold-based similarity algorithm is provided to greatly reduce the scale of the drug design problem, so that the mixed integer nonlinear programming model can design drug candidates efficiently.

Description

Optimized de novo drug design method

Technical Field

The invention relates to the fields of computer-aided drug design, bioinformatics, and pharmaceutical informatics, and in particular to a de novo drug design method.

Background

Small molecule drugs, as high value-added chemical products, are often used to prevent diseases and protect public health. However, numerous challenges hinder the development of innovative drugs, including the high cost of bringing a drug to market, low success rates in clinical trials, and long development cycles. The average cost of developing a new drug is estimated to exceed $2.5 billion, with a development cycle of about 10-15 years. It is therefore desirable to accelerate the early stages of drug discovery with computer-aided techniques and models, in order to reduce the time and cost of experimental analysis and to improve the hit rate of subsequent clinical studies.

Virtual screening methods typically employ molecular docking or molecular similarity algorithms to identify new potential drugs from large chemical libraries containing thousands to hundreds of millions of candidate compounds. Despite the high efficiency of virtual screening, finding promising drug candidates by exhaustive enumeration remains costly: the number of molecules that can be considered drug-like is estimated at between 10³⁰ and 10⁶⁰, so a complete search of this chemical space is computationally infeasible. To avoid this problem, optimization-based de novo drug design methods have received a great deal of attention from researchers.

In contrast, de novo drug design methods can design molecular structures from the beginning according to given requirements, with the advantage of exploring a broader chemical space beyond existing compound libraries. In general, optimization-based de novo drug design methods consist of three parts, including generative models for creating the chemical space of a drug candidate, property prediction models for evaluating the properties of a drug candidate, and optimization algorithms for finding the best drug candidate with desirable properties. For example, a drug design method based on a genetic algorithm randomly combines groups by heuristic rules to create a chemical space, and then optimizes a population (drug candidate) according to a fitness function (objective function) by a non-gradient optimization algorithm. The drug design method can quickly identify candidate drugs meeting the property requirements in a wide chemical space outside the existing compound library. However, genetic algorithm-based drug design methods tend to fall into locally optimal solutions, thereby limiting the search for more promising drug candidates.

Recently, generative models based on deep learning have been developed for de novo drug design. In contrast to methods based on genetic algorithms, deep generative models do not require heuristic rules. They are trained and make predictions in a fully data-driven, high-throughput manner, with very low requirements on expert knowledge. Widely used deep learning architectures include variational autoencoders and generative adversarial networks. Nevertheless, deep generative models are still in their infancy, since structurally infeasible molecules are often observed among the generated molecules. Furthermore, the extrapolation capability of deep generative models is still unsatisfactory. For example, generative adversarial networks have been shown to be unstable in performance and prone to generating a smaller chemical space (i.e., the model generates only a single molecular structure or a small set of similar molecular structures). Furthermore, deep generative models are typically presented as "black boxes" with low interpretability. If the computational cost of a "white-box" model is as low as that of a deep generative model, it is desirable to develop a "white-box" model for drug design using mathematical formulas with explicit interpretability.

Computer-aided molecular design (CAMD) techniques have long been used to design optimal molecular structures that meet target properties. They are commonly used to design small molecule solvents such as absorbents and extractants. The CAMD problem is usually expressed as a mixed integer nonlinear programming (MINLP) model consisting of an objective function, molecular structure constraints, and molecular property constraints. The MINLP model optimizes the combination of groups using highly interpretable mathematical formulas, creating a chemical space that contains reasonable molecules. By solving the MINLP model with a gradient-based optimization algorithm, all feasible solutions (i.e., molecules represented by sets of groups) that satisfy the structural and property constraints can be obtained, and the optimal solution with the largest or smallest target property can be determined. Mathematical programming models are thus more efficient than deep generative models at producing molecules with desired properties, because deep generative models typically generate molecules using fingerprint-based similarity indicators and must then evaluate the properties of the generated molecules one by one. However, when the system of nonlinear equations in the MINLP model is too complex, solving the MINLP model directly is practically infeasible. To address this problem, some researchers have proposed solution algorithms for MINLP models with strong non-convexity.

Although the CAMD approach has been largely successful in designing small molecule solvents, applying the traditional CAMD approach directly to drug candidate design still presents two challenges. First, drug structures (especially ring structures) are larger and more complex than those of small molecule solvents, so a larger number of ring groups is required to build MINLP models for drug candidate design, which increases the problem scale and the solution difficulty of the MINLP models. Second, even if the MINLP model is successfully solved, the traditional CAMD approach tends to produce ring structures that are structurally feasible but anomalous. For example, if the cyclic group "aC-C#CH" is chosen for the MINLP model, the traditional CAMD method will design a molecule similar to "CC(C)C1C(C#C)C1C#C" (in SMILES notation).

Disclosure of Invention

In view of the above-mentioned problems of the prior art and in order to solve the above-mentioned two challenges, the present invention aims to establish an optimization-based de novo drug design method, wherein a CAMD method based on MINLP model is proposed for candidate drug design coupled with a deep learning model for predicting target-ligand binding affinity. As one of the most important properties in the field of drug design, the binding affinity property is considered as an objective function in the subsequent MINLP model. The Bemis-Murcko algorithm is used to transform complex drug structures into drug backbones, which are considered in our customized CAMD approach to ensure the rationality of the designed drug candidates. In addition, a skeleton-based similarity algorithm is proposed to greatly reduce the scale of drug design problems and enable MINLP models to efficiently design drug candidates. Finally, a case study involving anti-tumor drug candidate design was introduced to highlight the versatility and effectiveness of the established optimization-based de novo drug design framework.

In order to achieve the purpose, the invention adopts the following technical scheme, which comprises the following specific steps:

an optimized de novo drug design method comprising the following specific steps:

Step 1: establishing a drug database;

Step 2: establishing a database of Bemis-Murcko scaffolds: Bemis-Murcko scaffolds are extracted from the drug structures in the drug database using the Bemis-Murcko algorithm in RDKit;

Step 3: for the scaffold structure of the target drug, searching the scaffold database with a scaffold-based similarity algorithm for a scaffold subset G₁ similar to the scaffold of the target drug, and simultaneously selecting a group of common groups to form a set G₂;

Step 4: designing promising drug candidates: the de novo drug design problem is formulated as a mixed integer nonlinear programming (MINLP) model consisting of an objective function, drug structure constraints, and drug property constraints, including a deep learning model; by solving the MINLP model, the optimal drug candidate structure with a high probability of high binding affinity is identified among the feasible solutions, wherein the feasible solutions are generated by optimally combining drug scaffolds and groups, and the combination process is restricted by the constraints of the MINLP model; the objective function is to maximize the probability of high binding affinity of the target-ligand complex. The MINLP model is formulated as follows:

Objective function:

$\max\; F_{obj} = \mathrm{Prob}_{bind}$

subject to the constraints of steps 5-8:

Step 5: deep learning constraint: general equation (1) represents a deep learning model for identifying target-ligand complexes with high binding affinity.

$\mathrm{Prob}_{bind} = f_{deep}(s)$ (1)

Step 6: drug structure constraints: general equation (2) represents the octet rule m₁, the valence bond rule m₂, and the chemical complexity m₃; by combining scaffolds and groups under these constraints, structurally reasonable molecules can be generated.

Step 7: drug property constraints: general equations (3) and (4) represent the properties of Lipinski's rule of five: relative molecular mass MW, number of hydrogen bond acceptors HBA, number of hydrogen bond donors HBD, octanol-water partition coefficient logP, and number of rotatable bonds ROT (ROT_frag), together with a synthetic accessibility score SA and a synthetic complexity score SC.

Step 8: other constraints: general equation (5) represents an improved SMILES-based isomer generation algorithm for automatically converting the fragment set of a drug candidate, including a scaffold and groups, into the corresponding drug SMILES strings.

$f_{SMILES}(n_i, s) = 0$ (5)

In the above general equations, F_obj is the objective function, Prob_bind is the probability of high binding affinity of the target-ligand complex, i denotes a fragment involved in the drug candidate, n_i is the number of fragments i involved in the drug candidate, s is the SMILES representation of the molecule, m is the type of structural constraint, each structural constraint m has an upper and a lower bound, p is a drug property, k is the property type, and each property p_k has an upper and a lower bound.
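For orientation, the general model described in steps 4 to 8 can be written compactly as below. This is a hedged reconstruction, not the patent's own typesetting: only equations (1) and (5) survive in the text, so the function symbol $h_m$ and the superscripts $L$ and $U$ for lower and upper bounds are assumed notation. The split of the property constraints into a fragment-based form (3) and a SMILES-based form (4) follows the statement that ROT_frag appears in equation (3) and ROT in equation (4).

```latex
\begin{aligned}
\max_{n_i,\, s} \quad & F_{obj} = \mathrm{Prob}_{bind} \\
\text{s.t.} \quad
 & \mathrm{Prob}_{bind} = f_{deep}(s) && (1) \\
 & h_m^{L} \le h_m(n_i) \le h_m^{U}, \quad m \in \{m_1, m_2, m_3\} && (2) \\
 & p_k^{L} \le p_k(n_i) \le p_k^{U}, \quad k \in \{MW,\ HBA,\ HBD,\ ROT_{frag}\} && (3) \\
 & p_k^{L} \le p_k(s) \le p_k^{U}, \quad k \in \{\log P,\ ROT,\ SA,\ SC\} && (4) \\
 & f_{SMILES}(n_i, s) = 0 && (5)
\end{aligned}
```

Read this way, constraints (2) and (3) depend only on the integer fragment counts $n_i$, while (1), (4), and (5) depend on the SMILES string $s$, which is what makes a decomposition into one linear and several nonlinear subproblems natural.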

Step 9: solving the MINLP model using a decomposition-based solution algorithm. If no optimal drug candidate satisfies all constraints, return to step 4 and relax the constraints.

Step 10: the optimal solution of the MINLP model is further validated by physics-based molecular docking and molecular dynamics simulation.

Further, step 1 specifically includes:

Step 1.1: small molecule drugs with CAS numbers are collected from the DrugBank database.

Step 1.2: the CID number of each drug is looked up in PubChem using a crawler script and the CAS number of the drug; after drugs without a CID number are deleted, the remaining drugs are used for further screening.

Step 1.3: drugs with good pharmacokinetic properties are screened using the properties of Lipinski's rule of five: relative molecular mass MW ≤ 500, number of hydrogen bond donors HBD ≤ 5, number of hydrogen bond acceptors HBA ≤ 10, octanol-water partition coefficient logP ≤ 5, and number of rotatable bonds ROT ≤ 10; the rule-of-five properties and the isomeric SMILES strings of all drugs are obtained from their CID numbers through the official web interface of the PubChem database.

After the criteria of steps 1.1 to 1.3 are applied, a drug database is established that contains the small molecule drugs together with their CAS numbers, CID numbers, isomeric SMILES strings, and rule-of-five properties.
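The screening in step 1.3 amounts to a simple threshold filter over per-drug property records. The following is a minimal sketch of that filter, assuming the properties have already been retrieved; the `DrugRecord` class, the field names, and the two sample records are hypothetical illustrations, not data from the patent.

```python
from dataclasses import dataclass


@dataclass
class DrugRecord:
    """One row of the drug database (field names are illustrative)."""
    name: str
    mw: float    # relative molecular mass
    hbd: int     # number of hydrogen bond donors
    hba: int     # number of hydrogen bond acceptors
    logp: float  # octanol-water partition coefficient
    rot: int     # number of rotatable bonds


# Upper limits stated in step 1.3.
LIMITS = {"mw": 500, "hbd": 5, "hba": 10, "logp": 5, "rot": 10}


def passes_screen(drug: DrugRecord) -> bool:
    """True if every screened property is at or below its upper limit."""
    return all(getattr(drug, key) <= limit for key, limit in LIMITS.items())


def build_database(records):
    """Keep only the records that pass the property screen."""
    return [drug for drug in records if passes_screen(drug)]


# Hypothetical records, made up for this example only.
records = [
    DrugRecord("drug_A", mw=386.5, hbd=2, hba=4, logp=3.5, rot=5),
    DrugRecord("drug_B", mw=612.0, hbd=6, hba=12, logp=5.8, rot=14),
]
database = build_database(records)
```

Here only `drug_A` is retained, because `drug_B` exceeds several limits.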

Further, in step 3, the scaffold-based similarity algorithm combines six similarity algorithms with four molecular representation methods to form 24 combinations; each combination is used to identify scaffolds similar to the scaffold of the target drug, the three most similar scaffolds obtained from each combination are selected, and duplicate scaffolds are removed to obtain the final similar scaffold subset G₁. The six similarity algorithms are Tanimoto, Dice, Cosine, Sokal, Russel, and Kulczynski, and the four molecular representation methods are topological fingerprints, MACCS keys, ECFP fingerprints, and FCFP fingerprints.
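The selection logic of step 3 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the patented implementation: fingerprints are modeled as sets of on-bit indices, the six coefficients use their common bit-vector definitions (with a = |A|, b = |B|, c = |A ∩ B|, n = fingerprint length), non-empty fingerprints are assumed, and the function names and toy data are hypothetical. In practice the fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
from math import sqrt

# a = |A|, b = |B|, c = |A & B|, n = fingerprint length (used by Russel only).
METRICS = {
    "tanimoto":   lambda a, b, c, n: c / (a + b - c),
    "dice":       lambda a, b, c, n: 2 * c / (a + b),
    "cosine":     lambda a, b, c, n: c / sqrt(a * b),
    "sokal":      lambda a, b, c, n: c / (2 * a + 2 * b - 3 * c),
    "russel":     lambda a, b, c, n: c / n,
    "kulczynski": lambda a, b, c, n: c * (a + b) / (2 * a * b),
}


def similarity(metric, fp_a, fp_b, n_bits):
    """Similarity of two non-empty fingerprints given as sets of on-bits."""
    return METRICS[metric](len(fp_a), len(fp_b), len(fp_a & fp_b), n_bits)


def similar_scaffold_subset(target, library, n_bits, top_k=3):
    """Union of the top_k scaffolds over every (representation, metric) pair.

    target:  {representation: set_of_on_bits}
    library: {scaffold_id: {representation: set_of_on_bits}}
    """
    selected = set()
    for rep, target_fp in target.items():
        for metric in METRICS:
            ranked = sorted(
                library,
                key=lambda sid: similarity(metric, target_fp,
                                           library[sid][rep], n_bits),
                reverse=True,
            )
            selected.update(ranked[:top_k])  # the set removes duplicates
    return sorted(selected)


# Toy data with one made-up representation and 16-bit fingerprints.
target = {"toy": {1, 2, 3, 4}}
library = {
    "A": {"toy": {1, 2, 3, 4}},
    "B": {"toy": {1, 2, 3}},
    "C": {"toy": {1, 9}},
    "D": {"toy": {7, 8}},
}
subset = similar_scaffold_subset(target, library, n_bits=16, top_k=2)
```

With four representations and `top_k=3`, the same function would collect up to 24 × 3 = 72 scaffolds before duplicates are removed.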

Further, step 6 specifically includes:

The specific drug structure constraints of the MINLP model are given by equations (6)-(12). The structural constraint

$\sum_{i \in G_1} n_i = 1$

means that a drug candidate selects exactly one scaffold from the scaffold set.

The structural constraints of chemical complexity include:

$n_i \le 3,\ i \in G_2$ (8)

The structural constraint of the octet rule and the structural constraints of the valence bond rule make up the remainder of equations (6)-(12); their equation images are not preserved in this text.

In equations (6) to (12), n_i is the number of fragments i involved in the drug candidate, v_i is the number of bonds of fragment i, G is the set of fragments, G₁ is the scaffold subset, G₂ is the set of groups, and G₃ is a subset of G₂, G₃ = {-CH₃, -CH₂, -CH, CH₂=CH-, -CH=CH-, CH₂=C<, -CH=C<}; n_{i₁} is the number of fragments i₁, n_{i₂} is the number of fragments i₂, and v_{i₁} is the number of bonds of fragment i₁.
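To make the role of these constraints concrete, the sketch below checks a fragment set against the octet rule and valence bond rule in the form commonly used in the CAMD literature for acyclic assemblies: Σ(2 − v_i)·n_i = 2, and Σ n_{i₂} ≥ n_{i₁}(v_{i₁} − 2) + 2 for every fragment i₁. These forms are an assumption and may differ from the patent's own equations. Treating the scaffold as a single fragment whose bond number equals its number of open attachment sites is likewise an assumption, and all names and valences are illustrative.

```python
def one_scaffold(counts, scaffolds):
    """Exactly one scaffold from G1 is selected."""
    return sum(n for i, n in counts.items() if i in scaffolds) == 1


def complexity_ok(counts, groups, max_per_group=3):
    """Equation (8): each group from G2 is used at most three times."""
    return all(n <= max_per_group for i, n in counts.items() if i in groups)


def octet_ok(counts, valence):
    """Assumed acyclic octet rule: sum_i (2 - v_i) * n_i == 2."""
    return sum((2 - valence[i]) * n for i, n in counts.items()) == 2


def valence_ok(counts, valence):
    """Assumed valence rule: sum_j n_j >= n_i * (v_i - 2) + 2 for used i."""
    total = sum(counts.values())
    return all(total >= n * (valence[i] - 2) + 2
               for i, n in counts.items() if n > 0)


def structurally_feasible(counts, valence, scaffolds, groups):
    return (one_scaffold(counts, scaffolds)
            and complexity_ok(counts, groups)
            and octet_ok(counts, valence)
            and valence_ok(counts, valence))


# Illustrative fragments: a scaffold with two open sites and two groups.
valence = {"scaffold_A": 2, "-CH3": 1, "-CH2-": 2}
scaffolds = {"scaffold_A"}
groups = {"-CH3", "-CH2-"}

closed = {"scaffold_A": 1, "-CH3": 2}              # both sites capped
dangling = {"scaffold_A": 1, "-CH3": 1}            # one site left open
extended = {"scaffold_A": 1, "-CH3": 2, "-CH2-": 1}  # one chain extended

results = [structurally_feasible(c, valence, scaffolds, groups)
           for c in (closed, dangling, extended)]
```

The `dangling` set fails the octet check because one free bond is left unpaired, while the other two sets close all bonds.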

Further, step 7 specifically includes:

Step 7.1: in the MINLP model, the properties of Lipinski's rule of five (MW, HBA, HBD, logP, ROT) are calculated with the RDKit-based quantitative estimate of drug-likeness (QED) method; the upper limits of the rule-of-five properties are MW ≤ 500, HBD ≤ 5, HBA ≤ 10, logP ≤ 5, and ROT (ROT_frag) ≤ 10; ROT_frag in equation (3) is the linear sum of the numbers of rotatable bonds of the fragments involved in the drug candidate, whereas ROT in equation (4) is the number of rotatable bonds of the entire molecule calculated from the SMILES string of the drug candidate, which is a nonlinear property; since ROT_frag ≤ ROT, a drug with ROT_frag > 10 cannot satisfy the constraint ROT ≤ 10; therefore, when the MINLP model is solved with the decomposition algorithm, introducing ROT_frag makes it possible to discard some infeasible solutions before ROT is calculated, which improves the solution efficiency of the MINLP model.

Step 7.2: the SA and SC properties are used to ensure that the designed drug candidate molecules are easy to synthesize, wherein low SA and SC values indicate that a molecule is easy to synthesize.
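The pruning argument in step 7.1 can be illustrated with a short sketch: because ROT_frag is a linear sum over fragment counts and never exceeds ROT, any fragment set with ROT_frag above the limit can be discarded before a SMILES string is even built. The per-fragment rotatable bond numbers and the candidate sets below are hypothetical values chosen only for the example.

```python
def rot_frag(counts, rot_per_fragment):
    """Linear estimate: sum of n_i times the rotatable bonds of fragment i."""
    return sum(n * rot_per_fragment[i] for i, n in counts.items())


def prefilter(candidates, rot_per_fragment, limit=10):
    """Drop fragment sets that can never satisfy ROT <= limit.

    Relies on ROT_frag <= ROT, so ROT_frag > limit implies ROT > limit.
    """
    return [c for c in candidates if rot_frag(c, rot_per_fragment) <= limit]


# Hypothetical per-fragment rotatable bond contributions.
rot_per_fragment = {"scaffold_A": 3, "-CH2-": 1, "-O-": 1, "-NH-": 1}

candidates = [
    {"scaffold_A": 1, "-CH2-": 3},                        # ROT_frag = 6
    {"scaffold_A": 1, "-CH2-": 3, "-O-": 3, "-NH-": 2},   # ROT_frag = 11
]
survivors = prefilter(candidates, rot_per_fragment)
```

Only the first candidate survives; the second is removed without computing its whole-molecule ROT.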

Further, step 8 specifically includes:

The SMILES string of a drug candidate is the model input for predicting Prob_bind and the ROT, logP, SA, and SC properties; a bridge must therefore be established in the structural constraints to associate SMILES strings with fragment sets, otherwise the MINLP model cannot be solved; by using an improved SMILES-based isomer generation algorithm, the fragment set of a drug candidate can be converted automatically into its corresponding SMILES strings, so that the MINLP model can be solved, and the algorithm can also identify drug candidate isomers that contain the same fragment set.

The improved SMILES-based isomer generation algorithm is implemented with the RWMol class in the RDKit library: a molecule is represented by a fragment set, comprising a scaffold and groups, and this information is fed into the isomer generation algorithm. The algorithm proceeds as follows:

(1) The scaffold is selected as the "seed", and groups are then attached to the scaffold as "leaves" according to a predetermined order of group addition. Check whether a "+" symbol is present in the seed-leaf structure. If so, add another group to the seed-leaf structure in the predetermined order of group addition. If not, go to the next step.

(2) Check whether all groups have been added to the seed-leaf structure. If not, surplus groups could not be attached; delete the current seed-leaf structure and repeat step (1) with the next predetermined order of group addition. If all groups have been added, proceed to the next step.

(3) Check whether a SMILES string can be generated from the seed-leaf structure by the RDKit library; if not, repeat steps (1) and (2) with the next predetermined order of group addition; if so, save the SMILES result and repeat steps (1) and (2) with the next predetermined order of group addition; after all predetermined group addition orders have been tried, all possible SMILES structures are obtained.
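The control flow of steps (1) to (3) can be mimicked with a toy string-level sketch. It is not the RDKit RWMol implementation described above, and it rests on several assumptions labeled here and in the comments: the checked symbol is taken to mark an open attachment site and is written as `(*)`, every group is monovalent, each group-addition order is a permutation of the groups, and the validity check is a simple parenthesis balance test standing in for SMILES generation by RDKit.

```python
from itertools import permutations

OPEN_SITE = "(*)"  # assumed marker for an open attachment site on the seed


def grow(seed, order):
    """Attach leaves to the seed in the given order (steps 1 and 2)."""
    structure, pending = seed, list(order)
    # Step (1): keep attaching while an open site and a group remain.
    while OPEN_SITE in structure and pending:
        structure = structure.replace(OPEN_SITE, f"({pending.pop(0)})", 1)
    # Step (2): surplus groups could not be attached, discard this structure.
    if pending:
        return None
    # Assumption: unused open sites are implicitly capped with hydrogen.
    return structure.replace(OPEN_SITE, "")


def looks_valid(structure):
    """Stand-in for step (3); the source validates through RDKit instead."""
    return structure.count("(") == structure.count(")")


def enumerate_isomers(seed, groups):
    """Try every distinct group-addition order and collect the results."""
    found = set()
    for order in sorted(set(permutations(groups))):
        structure = grow(seed, order)
        if structure is not None and looks_valid(structure):
            found.add(structure)
    return sorted(found)


# A pyridine-like seed with two non-equivalent open sites (illustrative).
isomers = enumerate_isomers("n1cc(*)ccc1(*)", ["F", "Cl"])
too_many = enumerate_isomers("n1cc(*)ccc1(*)", ["F", "Cl", "Br"])
```

The two addition orders give two positional isomers, and a fragment set with more groups than open sites yields none. Duplicates are removed only at the string level here, so a symmetric seed could produce two strings for the same molecule; a real implementation would compare canonical SMILES.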

Further, step 9 specifically includes:

Step 9.1: the MINLP optimization model is solved using a decomposition-based solution algorithm: the MINLP model is decomposed into one mixed integer linear programming (MILP) subproblem and three nonlinear programming (NLP) subproblems.

Step 9.2: subproblem 1 (MILP): first, taking into account the structural constraints of the octet rule, the valence bond rule, and chemical complexity, together with the linear property constraints on MW, HBA, HBD, and ROT_frag, N₁ feasible solutions are generated in GAMS using the BARON solver, each feasible solution being a drug candidate represented by a fragment set; a parameter N₁^max needs to be set in GAMS to specify the maximum number of solutions of the MILP model.

Step 9.3: subproblem 2 (NLP): using the constraint of the improved SMILES-based isomer generation algorithm, the SMILES strings of N₂ drug candidates are generated from the N₁ fragment sets, where N₂ ≥ N₁.

Step 9.4: subproblem 3 (NLP): considering the nonlinear property constraints on ROT, logP, SA, and SC, the corresponding properties of the N₂ SMILES strings are calculated using nonlinear property prediction models, and the SMILES strings that violate the constraints are eliminated; the remaining N₃ SMILES strings are retained for further evaluation.

Step 9.5: subproblem 4 (NLP): considering the deep learning model constraint, the objective function Prob_bind of the N₃ target-ligand complexes is calculated using the nonlinear deep learning model, and the SMILES strings of the drug candidates are ranked according to the objective function.
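The four subproblems form a generate, expand, filter, rank pipeline, which the sketch below illustrates. It is a schematic under stated assumptions: brute-force enumeration of integer fragment counts stands in for the MILP solved in GAMS with BARON, and the callables `linear_ok`, `to_smiles`, `nonlinear_ok`, and `score` are hypothetical placeholders for the linear constraints, the isomer generation algorithm, the nonlinear property models, and the deep learning model. The toy stubs operate on letter strings, not real SMILES.

```python
from itertools import permutations, product


def solve_by_decomposition(fragments, max_count, linear_ok, to_smiles,
                           nonlinear_ok, score, n1_max):
    # Subproblem 1 (MILP stand-in): enumerate integer fragment counts that
    # satisfy the linear constraints, stopping after n1_max solutions.
    feasible = []
    for counts in product(range(max_count + 1), repeat=len(fragments)):
        candidate = dict(zip(fragments, counts))
        if linear_ok(candidate):
            feasible.append(candidate)
            if len(feasible) == n1_max:
                break
    # Subproblem 2: each fragment set may yield several strings (N2 >= N1).
    strings = [s for candidate in feasible for s in to_smiles(candidate)]
    # Subproblem 3: discard strings violating the nonlinear constraints.
    kept = [s for s in strings if nonlinear_ok(s)]
    # Subproblem 4: rank the remaining N3 strings by the objective.
    return feasible, strings, sorted(kept, key=score, reverse=True)


# Toy stand-ins, invented for this example only.
def toy_linear_ok(candidate):
    return candidate["A"] + candidate["B"] == 2


def toy_to_smiles(candidate):
    letters = "A" * candidate["A"] + "B" * candidate["B"]
    return sorted({"".join(p) for p in permutations(letters)})


def toy_nonlinear_ok(string):
    return string != "AA"


def toy_score(string):
    return string.count("B")


feasible, strings, ranked = solve_by_decomposition(
    ["A", "B"], 2, toy_linear_ok, toy_to_smiles,
    toy_nonlinear_ok, toy_score, n1_max=10,
)
```

In this toy run N₁ = 3 fragment sets expand to N₂ = 4 strings, one is filtered out so N₃ = 3, and the ranked list starts with the highest-scoring string. The cheap linear checks run first, so the expensive scoring function is evaluated on as few candidates as possible.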

Compared with the prior art, the invention mainly has the following beneficial effects:

(1) First, the present invention applies the CAMD method to the design of drug candidates. Applying the traditional CAMD method directly to drug candidates is not easy. The first challenge is that drug structures (especially ring structures) are larger and more complex than those of small molecule solvents, which makes the MINLP model difficult to solve. For example, a solvent generally contains 0 to 2 ring structures, while a drug generally contains 2 to 5. This means that a larger number of ring groups is needed to build the MILP model (subproblem 1 of the decomposition algorithm), which increases the scale of the MILP model and the difficulty of solving it when the CAMD method is applied to drug candidates.

(2) Second, even if the MILP model is successfully solved, the second challenge is that traditional CAMD methods tend to produce ring structures that are structurally feasible but anomalous. For example, if the cyclic group "aC-C#CH" is chosen for the MILP model, the traditional CAMD method will design a molecule similar to "CC(C)C1C(C#C)C1C#C" (in SMILES notation).

To address both of the above challenges, we use the Bemis-Murcko algorithm to obtain drug scaffolds from the approved drugs and drug candidates in the DrugBank database and incorporate these scaffolds into the traditional CAMD method to ensure that the designed drug candidates are structurally reasonable. However, if all 2,898 generated drug scaffolds were used to design drug candidates, the computational efficiency of the MILP model would remain unacceptable. Given that drug candidates with similar scaffolds may have similar characteristics, we use a scaffold-based similarity algorithm to identify a subset of scaffolds similar to that of the target marketed drug before building the MILP model; we therefore need to focus only on these similar scaffolds when designing drug candidates, which greatly reduces the size of the MILP model.

(3) Again, we used two of the most advanced deep learning techniques, the gate-enhanced attention mechanism and the convolutional neural network, to build a reasonable deep learning model for predicting the binding affinity of target-ligand complexes and to integrate with the CAMD method.

In summary, the present invention establishes an optimization-based framework that combines deep learning models with a customized MINLP-based CAMD approach for candidate drug design. The invention can greatly reduce the time and cost related to experimental analysis, improve the hit rate of subsequent clinical research, design a new chemical structure and improve the efficiency and diversity of novel drug design.

Drawings

FIG. 1 is a flow chart of the optimization-based de novo drug design framework of the present invention;

FIG. 2 is a schematic diagram of a drug structure of the present invention, consisting of a scaffold (ring structures and linkers) and side chains;

FIG. 3 shows the chemical space of the drug candidate design results of the present invention, plotted using ECFP fingerprints and principal component analysis (PCA). The integers (0 to 14) in the right-hand legend represent 15 scaffolds, and dot "7" represents axitinib.

Detailed Description

The flow and effect of the present invention will be explained below with reference to the drawings and examples. The embodiments of the present invention are implemented on the premise of the technical solution of the present invention, and detailed embodiments and specific operation procedures are given, but the scope of the present invention is not limited to the following embodiments.

Referring to fig. 1, an embodiment of the present invention specifically discloses an optimization-based de novo drug design framework, comprising the following steps:

Step 1: a drug database is built with reference to the DrugBank V5.0 database (https://go.drugbank.com/) for subsequent drug scaffold generation.

Step 2: a drug scaffold was generated for subsequent MINLP-based drug design models. By using the backbone (as shown in fig. 2) instead of the traditional cyclic group in the CAMD method, a more structurally rational drug candidate can be designed. Thus, a database containing 2,898 Bemis-Murcko scaffolds was established for subsequent drug candidate design. These skeletons were generated using the Bemis-Murcko algorithm in RDKit for 4,781 drugs in the drug database.

Step 3: to reduce unnecessary search space and save computation time in the drug candidate design problem, a scaffold subset G₁ similar to the scaffold of axitinib, an FDA-approved antineoplastic drug, is identified using a scaffold-based similarity algorithm before the MINLP model is established. A total of 72 scaffolds are obtained (three from each of the 24 combinations), and 19 scaffolds remain after duplicates are deleted. Common groups are also selected to form the set G₂.

Step 4: an optimization-based mathematical programming method is used to design promising drug candidates in a high-throughput and intelligent manner. The de novo drug design problem is formulated as a MINLP model consisting of an objective function (maximizing the probability of high binding affinity of the target-ligand complex), drug structure constraints, and drug property constraints (including a deep learning model). By solving the MINLP model, the optimal drug candidates with high binding affinity can be identified among the feasible solutions generated by optimally combining drug scaffolds and groups under the model constraints. Note that each MINLP model applies to one target, for which promising drug candidates can be designed. The MINLP model is formulated as follows:

Objective function:

$\max\; F_{obj} = \mathrm{Prob}_{bind}$

subject to the constraints of steps 5-8:

Step 5: deep learning constraint: general equation (1) represents a deep learning model for identifying target-ligand complexes with high binding affinity.

$\mathrm{Prob}_{bind} = f_{deep}(s)$ (1)

Step 6: drug structure constraints: general equation (2) represents the octet rule (m₁), the valence bond rule (m₂), and the chemical complexity (m₃); by combining scaffolds and groups under these constraints, structurally reasonable molecules can be generated.

Step 7: drug property constraints: general equations (3) and (4) represent the properties of Lipinski's rule of five (relative molecular mass MW, number of hydrogen bond acceptors HBA, number of hydrogen bond donors HBD, octanol-water partition coefficient logP, and number of rotatable bonds ROT (ROT_frag)), together with a synthetic accessibility score (SA) and a synthetic complexity score (SC).

Step 8: other constraints: general equation (5) represents an improved SMILES-based isomer generation algorithm for automatically converting the fragment set (scaffold and groups) of a drug candidate into the corresponding drug SMILES strings.

$f_{SMILES}(n_i, s) = 0$ (5)

In the above general equations, F_obj is the objective function, Prob_bind is the probability of high binding affinity of the target-ligand complex, i denotes a fragment (scaffold or group) involved in the drug candidate, n_i is the number of fragments i involved in the drug candidate, s is the SMILES representation of the molecule, m is the type of structural constraint, each structural constraint m has an upper and a lower bound, and each property p_k has an upper and a lower bound.

Step 9: because the MINLP model involves a large number of nonlinear constraints, a decomposition-based solution algorithm is used to solve it. If no optimal drug candidate satisfies all constraints, return to step 4 and relax the constraints.

Step 10: the optimal solution of the MINLP model is further validated by physics-based molecular docking and molecular dynamics simulation. Finally, a case study on the design of anti-tumor drug candidates is presented to highlight the effectiveness of the de novo drug design framework based on the MINLP model.

Further, step 1 specifically includes:

Step 1.1: 7,746 small molecule drugs with CAS numbers are first collected from the DrugBank database V5.0 (https://go.drugbank.com/).

Step 1.2: the PubChem database is then searched for the CID numbers of the 7,746 drugs (the CID is the unique identifier of a chemical in the PubChem database, https://pubchem.ncbi.nlm.nih.gov/search/) using a crawler script and the CAS numbers. After drugs without a CID number are deleted, 7,474 drugs remain for further screening.

Step 1.3: drugs with good pharmacokinetic properties are screened using the properties of Lipinski's rule of five (relative molecular mass (MW) ≤ 500, number of hydrogen bond donors (HBD) ≤ 5, number of hydrogen bond acceptors (HBA) ≤ 10, octanol-water partition coefficient (logP) ≤ 5, number of rotatable bonds (ROT) ≤ 10). The rule-of-five properties and isomeric SMILES strings of all 7,474 drugs are obtained from their CID numbers through the official web interface of the PubChem database.

Finally, after the above criteria are applied, a drug database is created containing 4,781 small molecule drugs together with their CAS numbers, CID numbers, isomeric SMILES strings, and rule-of-five properties.

Further, step 3 specifically includes:

Step 3.1: if all 2,898 scaffolds were used in a single MINLP-based drug design model, the excessive model size would prevent the MINLP model from being solved efficiently. We therefore propose a scaffold-based similarity algorithm for identifying, from the 2,898 scaffolds, a subset G₁ of scaffolds similar to that of the target drug (determined by the case study). The scaffold-based similarity algorithm combines six similarity algorithms with four molecular representation methods to form 24 combinations in total; each combination is used to identify scaffolds similar to the scaffold of the target drug, the three most similar scaffolds from each combination are taken, and duplicates are removed to obtain the final similar scaffold subset G₁. The six similarity algorithms are Tanimoto, Dice, Cosine, Sokal, Russel, and Kulczynski, and the four molecular representation methods are topological fingerprints, MACCS keys, ECFP fingerprints, and FCFP fingerprints.

Step 3.2: after duplicate scaffolds are deleted, a problem-specific number of similar scaffolds is obtained. In addition, 29 common groups (G₂) are selected for our MINLP model, as listed in Table 1.

Table 1 selected groups

Further, step 6 specifically includes:

The specific drug structure constraints of the MINLP model are given by equations (6)-(12). The structural constraint

$\sum_{i \in G_1} n_i = 1$

indicates that a drug candidate selects exactly one scaffold from the scaffold set.

$n_i \le 3,\ i \in G_2$ (8)

In equations (6) to (12), n_i is the number of fragments (scaffolds and groups) i involved in the drug candidate, v_i is the number of bonds of fragment i, G is the set of fragments, G₁ is the scaffold set, G₂ is the set of groups, and G₃ is a subset of G₂ (G₃ = {-CH₃, -CH₂, -CH, CH₂=CH-, -CH=CH-, CH₂=C<, -CH=C<}); n_{i₁} is the number of fragments i₁, n_{i₂} is the number of fragments i₂, and v_{i₁} is the number of bonds of fragment i₁.

Further, step 7 specifically includes:

Step 7.1: in the MINLP model, the properties of Lipinski's rule of five (MW, HBA, HBD, logP, ROT) are calculated with the RDKit-based quantitative estimate of drug-likeness (QED) method; the upper limits of the rule-of-five properties are MW ≤ 500, HBD ≤ 5, HBA ≤ 10, logP ≤ 5, and ROT (ROT_frag) ≤ 10; ROT_frag in equation (3) is the linear sum of the numbers of rotatable bonds of the fragments involved in the drug candidate, whereas ROT in equation (4) is the number of rotatable bonds of the entire molecule calculated from the SMILES string of the drug candidate, which is a nonlinear property; since ROT_frag ≤ ROT, a drug with ROT_frag > 10 cannot satisfy the constraint ROT ≤ 10; therefore, when the MINLP model is solved with the decomposition algorithm, introducing ROT_frag makes it possible to eliminate some infeasible solutions before ROT is calculated, which improves the solution efficiency of the MINLP model.

Step 7.2: SA (1-10) and SC (1-5) properties are used to ensure that the designed drug candidates are easy to synthesize, wherein low SA and SC values indicate that the molecules are easy to synthesize. Their constraint ranges are set to SA ≦ 6 and SC ≦ 3.4, according to our empirical knowledge.
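The pruning role of ROT_frag and the synthesizability bounds can be sketched in Python as follows. The sketch assumes that a per-fragment rotatable-bond count is available in a lookup table and that SA and SC scores are supplied by external predictors; the function names and the toy fragment labels are hypothetical.

```python
ROT_MAX = 10               # upper bound on ROT and ROT_frag
SA_MAX, SC_MAX = 6.0, 3.4  # empirical bounds on SA and SC from Step 7.2

def rot_frag(counts, rot_per_fragment):
    """Linear sum of rotatable-bond counts over the fragments used.

    counts:           {fragment: n_i}
    rot_per_fragment: {fragment: rotatable bonds contributed by one copy}
    """
    return sum(n * rot_per_fragment[frag] for frag, n in counts.items())

def passes_rot_prefilter(counts, rot_per_fragment):
    """Cheap linear test: since ROT_frag <= ROT, ROT_frag > 10 implies ROT > 10."""
    return rot_frag(counts, rot_per_fragment) <= ROT_MAX

def passes_synthesis_constraints(sa, sc):
    """SA <= 6 and SC <= 3.4; the scores themselves come from external models."""
    return sa <= SA_MAX and sc <= SC_MAX
```

Because the prefilter works on fragment counts alone, it can be applied before any SMILES string is generated, which is where the efficiency gain described above comes from.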

Further, step 8 specifically includes:

The improved SMILES-based isomer generation algorithm is implemented with the RWMol module of the RDKit library. A molecule is represented by a set of fragments comprising a skeleton and groups, and this information is fed into the isomer generation algorithm. The algorithm proceeds as follows:

(1) Select a skeleton as the seed, then attach a group to the skeleton as a leaf according to a preset group-addition order. Check whether "+" is present in the seed-leaf structure; if so, add another group to the seed-leaf structure according to the preset group-addition order; if not, go to the next step.

(2) Check whether all groups have been added to the seed-leaf structure. If not, the leftover groups cannot be added to the seed-leaf structure; delete the seed-leaf structure and repeat step (1) with the next preset group-addition order. If all groups have been added, go to the next step.
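Steps (1) and (2) can be mimicked with a toy Python model that tracks only the number of open attachment sites instead of real molecular graphs. It assumes, as an interpretation not confirmed by the text, that the "+" check corresponds to testing whether the seed-leaf structure still has an open attachment site. The function names and the bond counts in the usage note are hypothetical, and the real algorithm operates on RDKit molecules rather than integers.

```python
from itertools import permutations

def grow(seed_open_sites, order, group_bonds):
    """Attach groups in the given order; return the open sites left, or None.

    Attaching a group with v bonds uses one open site on the structure and one
    bond of the group, so the number of open sites changes by v - 2.
    """
    open_sites = seed_open_sites
    for group in order:
        if open_sites == 0:
            # step (2): leftover groups cannot be attached, discard this order
            return None
        open_sites += group_bonds[group] - 2
    return open_sites

def feasible_orders(seed_open_sites, groups, group_bonds):
    """Return every distinct addition order in which all groups get attached."""
    orders = sorted(set(permutations(groups)))  # drop duplicate orders
    return [o for o in orders if grow(seed_open_sites, o, group_bonds) is not None]
```

For a seed with two open sites and the groups CH3 (1 bond), OH (1 bond), and CH2 (2 bonds), the order (CH3, OH, CH2) is rejected because both sites are closed before CH2 can be attached, whereas (CH2, CH3, OH) succeeds. This is why the algorithm iterates over preset group-addition orders.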

Further, step 9 specifically includes:

Step 9.1: In general, if the nonlinear equations in the MINLP model are not particularly complex, the MINLP model can be solved directly using the BARON solver in the GAMS software (https://www.gams.com/).

Step 9.2: However, the SMILES-based property prediction models (e.g., the deep learning model) in our MINLP model contain extremely complex nonlinear equations, which make the MINLP model extremely difficult to solve directly. For this reason, we use a decomposition-based solution algorithm to solve our complex MINLP optimization model. It decomposes the MINLP model into one mixed integer linear programming (MILP) subproblem and three nonlinear programming (NLP) subproblems.

Step 9.3: Subproblem 1 (MILP): First, subject to the structural constraints of the octet rule, valence rule, and chemical complexity, and to the linear property constraints on MW, HBA, HBD, and ROT_frag, a certain number (N₁) of feasible solutions (drug candidates represented by fragment sets) is generated in GAMS using the BARON solver. Note that a parameter N₁^max must be set in GAMS to specify the maximum number of solutions of the MILP model. Under the structural and linear property constraints, the mathematical programming method yielded N₁ = 33,759 feasible solutions (represented by fragment sets) in 210 seconds on our desktop computer (Intel(R) Core(TM) i7-10700F CPU @ 2.90 GHz, 24.0 GB RAM).

Step 9.4: Subproblem 2 (NLP): Subject to the constraint of the improved SMILES-based isomer generation algorithm, N₂ = 262,741 SMILES strings of drug candidates are generated from the N₁ fragment sets (3,113 seconds).

Step 9.5: Subproblem 3 (NLP): Subject to the nonlinear property constraints on ROT, logP, SA, and SC, the nonlinear property prediction models are used to calculate the corresponding properties of the N₂ drug candidate SMILES strings, and those that do not satisfy the constraints are removed. The remaining N₃ = 105,164 drug candidate SMILES strings are used for further analysis.

First, the 105,164 compound SMILES strings we designed were searched in the PubChem database. It was found that 4,902 (4.64%) of the designed molecular structures were already present in PubChem, indicating that our MINLP-based drug design model was able to not only find existing drug candidates, but also to design new drug candidates (95.36%).

Next, a chemical space was created using the ECFP fingerprinting and Principal Component Analysis (PCA) method (see FIG. 3) to characterize the structural diversity of 105,164 designed drug candidates, where the x-axis and y-axis are the two principal components PC1 and PC2, respectively, the integers (0-14) in the right legend represent 15 backbones, and point "7" represents axitinib. Figure 3 demonstrates the broad distribution of the designed drug candidates in chemical space, demonstrating the powerful ability of our MINLP-based drug design model to design structurally diverse drug candidates similar to axitinib.

Step 9.6: Subproblem 4 (NLP): Subject to the deep learning model constraint, the nonlinear deep learning model is used to compute Prob_bind (the objective function) for the N₃ target-ligand complexes, and the SMILES strings of the drug candidates are ranked by the objective function. The ranking showed that 433 designed drug candidates exceeded axitinib (97.96%) in the probability of high binding affinity. Some of the top-ranked drug candidates were further validated by physics-based methods such as molecular docking and molecular dynamics simulation.
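The funnel formed by subproblems 2 to 4 can be summarized in a short Python sketch. It assumes that subproblem 1 has already produced the fragment sets (for example with a MILP solver) and that the isomer generator, the nonlinear property check, and the binding-probability model are supplied as callables. The function name `solve_by_decomposition` and its parameters are hypothetical stand-ins, not the implementation used here.

```python
def solve_by_decomposition(fragment_sets, to_smiles, satisfies_nonlinear, prob_bind):
    """Run subproblems 2-4 on the feasible fragment sets of subproblem 1.

    fragment_sets:       feasible solutions of the MILP subproblem (N1 items)
    to_smiles:           fragment set -> list of isomer SMILES strings
    satisfies_nonlinear: SMILES -> bool (ROT, logP, SA, SC constraints)
    prob_bind:           SMILES -> predicted probability of high binding affinity
    """
    # Subproblem 2: expand every fragment set into its isomer SMILES strings
    smiles = []
    for fragment_set in fragment_sets:
        smiles.extend(to_smiles(fragment_set))
    # Subproblem 3: drop candidates violating the nonlinear property constraints
    feasible = [s for s in smiles if satisfies_nonlinear(s)]
    # Subproblem 4: rank the survivors by the objective function, best first
    ranked = sorted(feasible, key=prob_bind, reverse=True)
    return {"N1": len(fragment_sets), "N2": len(smiles),
            "N3": len(feasible), "ranked": ranked}
```

Ordering the stages this way means the most expensive model, the deep learning predictor, is evaluated only on the N₃ candidates that survive the cheaper checks.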

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. An optimized de novo drug design method, characterized by comprising the following specific steps:

Step 1: establishing a drug database;

Step 2: establishing a database of Bemis-Murcko skeletons: extracting the Bemis-Murcko skeleton from each drug structure in the drug database using the Bemis-Murcko algorithm in RDKit;

Step 3: for the skeleton structure of the target drug, using a skeleton-based similarity algorithm to search the skeleton database for a skeleton subset G₁ similar to the target drug skeleton, and selecting a set of common groups to form the set G₂;

Step 4: designing promising drug candidates: the de novo drug design problem is expressed as a mixed integer nonlinear programming (MINLP) model consisting of an objective function, drug structure constraints, and drug property constraints including a deep learning model; the optimal drug candidate structure with a high probability of high binding affinity is identified among the feasible solutions by solving the MINLP model, wherein the feasible solutions are generated by optimally combining drug skeletons and groups, and the combination process is restricted by the constraints of the MINLP model; the objective function is to maximize the probability of high binding affinity of the target-ligand complex; the MINLP model is formulated as follows:

an objective function:

subject to steps 5-8:

Step 5: deep learning constraint: general equation (1) represents a deep learning model for identifying target-ligand complexes with high binding affinity;

Prob_bind = f_deep(s) (1)

Step 6: drug structure constraints: general equation (2) represents the octet rule m₁, the valence rule m₂, and the chemical complexity m₃, which ensure that combining skeletons and groups generates molecules with reasonable structures;

Step 7: drug property constraints: general equations (3) and (4) represent the properties of Lipinski's rule of five, namely the relative molecular mass MW, the number of hydrogen bond acceptors HBA, the number of hydrogen bond donors HBD, the octanol-water partition coefficient logP, and the number of rotatable bonds ROT or ROT_frag, together with the synthetic accessibility score SA and the synthetic complexity score SC;

Step 8: other constraints: general equation (5) represents an improved SMILES-based isomer generation algorithm for automatically converting the fragment set of a drug candidate, comprising a skeleton and groups, into the corresponding drug SMILES strings;

f_SMILES(n_i, s) = 0 (5)

in the above general equations, F_obj is the objective function, Prob_bind is the probability of high binding affinity of the target-ligand complex, i denotes a fragment involved in the drug candidate, n_i is the number of fragments i involved in the drug candidate, s is the SMILES representation of the molecule, m is the type of structural constraint, which is limited by its lower and upper bounds, p is the drug property, k is the property type, and p_k is likewise limited by its lower and upper bounds;

Step 9: solving the MINLP model with a decomposition-based solution algorithm; if no optimal drug candidate satisfies all constraints, returning to step 4 to relax the constraint ranges.

2. The optimized de novo drug design method of claim 1, wherein step 1 specifically comprises:

Step 1.1: collecting small-molecule drugs that have CAS numbers from the DrugBank database;

Step 1.2: searching PubChem for the CID number of each drug using a crawler script and the CAS number of the drug; after the drugs without a CID number are deleted, the remaining drugs are screened further;

Step 1.3: using the properties of Lipinski's rule of five (relative molecular mass MW ≤ 500, number of hydrogen bond donors HBD ≤ 5, number of hydrogen bond acceptors HBA ≤ 10, octanol-water partition coefficient logP ≤ 5, and number of rotatable bonds ROT ≤ 10) to screen for drugs with good pharmacokinetic properties; the rule-of-five properties and the isomeric SMILES strings of all drugs are obtained through the official web interface of the PubChem database according to the CID numbers of the drugs.

3. The method of claim 1, wherein in step 3 the skeleton-based similarity algorithm combines six similarity measures with four molecular representations to form 24 combinations, uses each combination to identify the skeletons similar to the target drug skeleton, takes the three most similar skeletons obtained by each combination, and removes duplicate skeletons to obtain the final similar-skeleton subset G₁; the six similarity measures are Tanimoto, Dice, Cosine, Sokal, Russel, and Kulczynski, and the four molecular representations are topological fingerprints, MACCS keys, ECFP fingerprints, and FCFP fingerprints.

4. The method of claim 1, wherein step 6 comprises:

Meaning that a drug candidate selects only one scaffold from the set of scaffolds;

wherein the structural constraints of chemical complexity are:

n_i ≤ 3, i ∈ G₂ (8)

the structural constraint of the octet rule is:

the structural constraints of the valence rule are:

in equations (6)-(12), n_i is the number of fragments i involved in the drug candidate, v_i is the number of bonds of fragment i, G is the set of fragments, G₁ is the skeleton subset, G₂ is the group set, and G₃ is a subset of G₂, G₃ = {-CH₃, -CH₂, -CH, CH₂=CH-, -CH=CH-, CH₂=C<, -CH=C<}; n_{i₁} is the number of fragments i₁, n_{i₂} is the number of fragments i₂, and v_{i₁} is the number of bonds of fragment i₁.

5. The method of claim 1, wherein step 7 comprises:

Step 7.1: in the MINLP model, the properties of Lipinski's rule of five are calculated by the RDKit-based quantitative estimate of drug-likeness (QED) method; the upper limits of these properties are MW < 500, HBD < 5, HBA < 10, logP < 5, and ROT or ROT_frag ≤ 10; ROT_frag in equation (3) is the linear sum of the numbers of rotatable bonds of the fragments involved in the drug candidate, whereas ROT in equation (4) is the number of rotatable bonds of the whole molecule, calculated from the SMILES string of the drug candidate, and is a nonlinear property; since ROT_frag ≤ ROT, a drug with ROT_frag > 10 cannot satisfy the constraint ROT ≤ 10; therefore, when the MINLP model is solved with the decomposition algorithm, introducing ROT_frag removes some infeasible solutions before ROT is calculated, which improves the solution efficiency of the MINLP model;

Step 7.2: the SA and SC properties are used to ensure that the designed drug candidates are easy to synthesize, wherein low SA and SC values indicate that a molecule is easy to synthesize.

6. The method of claim 1, wherein step 8 comprises:

the SMILES string of the drug candidate is the input to the models that predict Prob_bind and the ROT, logP, SA, and SC properties; a bridge therefore needs to be established in the structural constraints to link the SMILES string to the fragment set, otherwise the MINLP model cannot be solved; thus, by using the improved SMILES-based isomer generation algorithm, the fragment set of a drug candidate is automatically converted into its corresponding SMILES strings so that the MINLP model can be solved, and the algorithm can also identify drug candidate isomers that contain the same fragment set.

7. The method of claim 1, wherein step 9 comprises:

Step 9.1: solving the MINLP optimization model with a decomposition-based solution algorithm: decomposing the MINLP model into one mixed integer linear programming (MILP) subproblem and three nonlinear programming (NLP) subproblems;

Step 9.2: Subproblem 1 (MILP): first, subject to the structural constraints of the octet rule, valence rule, and chemical complexity, and to the linear property constraints on MW, HBA, HBD, and ROT_frag, generating N₁ feasible solutions in GAMS using the BARON solver, each feasible solution being a drug candidate represented by a fragment set; a parameter N₁^max needs to be set in GAMS to specify the maximum number of solutions of the MILP model;

Step 9.3: Subproblem 2 (NLP): subject to the constraint of the improved SMILES-based isomer generation algorithm, generating N₂ SMILES strings of drug candidates from the N₁ fragment sets, where N₂ ≥ N₁;

Step 9.4: Subproblem 3 (NLP): subject to the nonlinear property constraints on ROT, logP, SA, and SC, calculating the corresponding properties of the N₂ drug candidate SMILES strings using the nonlinear property prediction models, and removing the SMILES strings that do not satisfy the constraints; the remaining N₃ drug candidate SMILES strings are selected for further evaluation;

Step 9.5: Subproblem 4 (NLP): subject to the deep learning model constraint, computing the objective function Prob_bind of the N₃ target-ligand complexes using the nonlinear deep learning model, and ranking the SMILES strings of the drug candidates according to the objective function.