CN116508106A - Drug optimization through active learning - Google Patents

Publication number
CN116508106A
Authority
CN
China
Prior art keywords
compounds
subset
score
compound
population
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180072416.2A
Other languages
Chinese (zh)
Inventor
威廉·保罗·万·霍恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aix Saianxia Artificial Intelligence Co ltd
Original Assignee
Aix Saianxia Artificial Intelligence Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aix Saianxia Artificial Intelligence Co ltd
Priority claimed from PCT/GB2021/052753 (WO2022084696A1)
Publication of CN116508106A

Abstract

The invention provides a method for drug optimization through active learning. The method includes defining a population of a plurality of compounds, each compound having one or more molecular characteristics, and defining a training set of compounds from the population for which one or more biological characteristics are known. The method includes selecting a subset of compounds from the population that are not within the training set, determining a subset score for the selected subset based on the molecular characteristics present in the compounds of the selected subset, and evaluating the selected subset based on the determined subset score. The subset score is determined from the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising the training set and the selected subset.

Description

Drug optimization through active learning
Technical Field
The present invention relates to methods and systems for computational design of compounds such as drugs. In particular, the invention relates to methods for optimizing computational models by active learning for designing drugs that interact with selected target molecules, and to drugs designed using these systems and methods.
Background
Drug discovery is the process of identifying candidate compounds to progress to the next stage of drug development, e.g. preclinical testing; these candidate compounds must meet certain criteria to be developed further. Modern drug discovery involves the identification and optimization of initial screening "hit" compounds. In particular, these compounds need to be optimized with respect to the desired criteria, which may involve optimization of many different properties. The characteristics to be optimized may include, for example: efficacy/potency against desired targets; selectivity against unintended targets; a low probability of toxicity; and good drug metabolism and pharmacokinetic properties (ADME). Only compounds meeting specific requirements can be candidates for continuing the drug development process.
The drug discovery process may involve the preparation/synthesis of large numbers of compounds during the optimization from initial screening hits to candidate compounds. In particular, the synthesized compounds are assayed to determine their properties, such as biological activity. However, the number of compounds that could in principle be synthesized far exceeds, possibly by several orders of magnitude, the number that can actually be synthesized and tested as part of a particular drug discovery program. Thus, the assay results for the synthesized compounds are analyzed and used as a basis for deciding which compounds to synthesize next, in order to maximize the likelihood of obtaining compounds with further improved characteristics with respect to the various criteria required of a candidate compound.
The synthesis of one or more compounds at a particular stage and subsequent determination of biological activity is referred to as the design cycle (or iteration) of the drug discovery process. Typically, a set of compounds will be synthesized and tested at each design cycle of the process, as this is more efficient than synthesizing and testing one compound at a time. However, the level of available resources generally means that there is an upper limit on the number of compounds synthesized in a group during any given design cycle.
In a drug discovery project, hundreds or even thousands of compounds are typically synthesized over several design cycles before a candidate compound is discovered. This is a lengthy, expensive and inefficient process: thousands of pounds may be required to synthesize a single compound, and on average three to five years may be required to obtain a candidate compound.
The use of computational methods greatly increases the level of analysis that can be applied to the compounds that have been synthesized, compared with analysis performed by a medicinal chemist alone. In particular, machine learning (ML), artificial intelligence (AI), or other mathematical methods may be used to evaluate a large number of design parameters in parallel, at a level beyond human capability, to identify relationships between the parameters and desired characteristics (e.g. levels of biological activity). These relationships can then be used to better predict which compounds are more likely to exhibit a greater number/level of desired characteristics with respect to the criteria required of a candidate compound. Such mathematical methods can therefore reduce the number of design cycles, and thus the number of compounds that need to be synthesized, to obtain compounds with the combination of properties required of a candidate compound, and so reduce the cost and time associated with drug discovery projects.
Only those compounds that have been synthesized and tested can be used to train an ML model that aims to predict which compounds are most likely to exhibit the desired properties, such as the highest biological activity. Thus, the accuracy of predicting which compounds are most suitable to synthesize in the next design cycle depends entirely on the data available for training the ML model (i.e. the previously synthesized compounds). In particular, the ML model is only likely to make accurate predictions if: there is a sufficient number of compounds in the set used to train the ML model; and the compounds of the training set are sufficiently representative of the library of compounds from which the compounds to be synthesized are selected.
As noted above, the chemical space of compounds in any given drug discovery project may be enormous. In order not to waste resources, it is therefore important to choose for synthesis the compounds that are most effective at improving the ML model; these compounds become part of the training set for the next design cycle, so that a better ML model is available for subsequent iterations. Computational methods can also be used to suggest which compounds to add to the training set to provide the greatest improvement in the predictive power of the ML model. Such computational methods, used alone or in combination with the expertise of a medicinal chemist, can improve the ML model to a greater degree than relying solely on that expertise. However, prior art methods for selecting the best data points to add to the training set may not be optimal for drug discovery projects. One reason for this may be that, unlike other physical or theoretical spaces, chemical space is not uniformly spaced, so metrics based on that assumption may not be as effective.
The present invention has been made in view of this background.
Disclosure of Invention
The present invention relates generally to computational methods and systems for designing and automatically selecting compounds in a chemical space to optimize training of ML models, wherein the final trained model can be used to design and automatically select with greater accuracy compounds optimized with respect to desired criteria.
According to one aspect of the present invention, a method for computational drug design is provided. The method includes defining a population of a plurality of compounds, each compound having one or more molecular characteristics. The method includes defining a training set of compounds from the population for which one or more biological characteristics are known. The method includes selecting a subset of one or more compounds from the population that are not within the training set. The method includes determining a subset score for the selected subset based on molecular characteristics of the one or more compounds within the selected subset, and evaluating the selected subset based on the determined subset score. The subset score is determined based on the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising the training set and the selected subset.
The determining step may include determining a compound score for each of the one or more compounds of the selected subset based on the one or more molecular characteristics of that compound, wherein the subset score is determined based on the determined compound scores of the compounds within the selected subset.
The subset score may be determined as the sum of the corresponding compound scores for compounds within the selected subset.
Determining a compound score for a compound within the selected subset may include determining a molecular property score for each of the one or more molecular properties of the compound based on the frequency of the corresponding molecular property in the population and the frequency of the corresponding molecular property in the sample set.
The compound score for the compound may be determined as the sum of the determined molecular property scores for one or more molecular properties of the compound.
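The additive structure described above can be sketched in Python as follows. This is a minimal illustration only, assuming each compound is represented as a set of feature identifiers and that per-feature scores are already available; the function names and the feature labels are hypothetical. In the method itself the per-feature scores depend on the sample set, so they would be recomputed for each candidate subset.

```python
from typing import Dict, Iterable, Set


def compound_score(features: Set[str], feature_scores: Dict[str, float]) -> float:
    """Sum the scores of the molecular features present in one compound."""
    return sum(feature_scores[f] for f in features)


def subset_score(subset: Iterable[Set[str]], feature_scores: Dict[str, float]) -> float:
    """Sum the compound scores over every compound in the selected subset."""
    return sum(compound_score(c, feature_scores) for c in subset)


# Hypothetical per-feature scores and a two-compound subset (illustrative values).
feature_scores = {"amide": 0.5, "phenyl": 0.25, "nitrile": 1.0}
subset = [{"amide", "phenyl"}, {"nitrile"}]
total = subset_score(subset, feature_scores)
```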
A molecular property score for each of one or more molecular properties may be determined based on a normalized probability of the molecular property within a sample set, the normalized probability being determined based on a frequency of the molecular property within a population and within the sample set.
The normalized probability may be determined from the number of compounds within the sample set relative to the number of compounds in the population.
The normalized probability may be a Laplacian-corrected normalized probability.
The Laplacian-corrected normalized probability $P_{\mathrm{corr}}$ can be given by
[equation not reproduced in this text]
where $F_{\mathrm{sampled}}$ is the frequency of the molecular characteristic in the sample set, $F_{\mathrm{set}}$ is the frequency of the molecular characteristic in the population, and $P_{\mathrm{base}}$ is the number of compounds in the sample set divided by the number of compounds in the population.
The molecular property score for each of the one or more molecular properties may be determined from the number of compounds within the sample set in which the molecular property is present, relative to the total number of compounds within the sample set.
A molecular property score may be determined from the normalized Shannon entropy value of the molecular property within the sample set.
The normalized Shannon entropy value can be given by
[equation not reproduced in this text]
where $f$ is the number of compounds in the sample set in which the molecular property is present divided by the number of compounds in the sample set.
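Since the entropy expression itself is not shown here, the following Python sketch uses the standard binary Shannon entropy in bits, which already lies between 0 and 1, as one plausible reading of a "normalized" entropy of a feature present in a fraction f of the sample set. That choice, and the name `binary_entropy`, are assumptions rather than the formula of the patent.

```python
import math


def binary_entropy(f: float) -> float:
    """Binary Shannon entropy (in bits) of a feature present in fraction f.

    ASSUMPTION: H(f) = -f*log2(f) - (1-f)*log2(1-f); the source text does not
    reproduce its own expression.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if f in (0.0, 1.0):
        # A feature present in none or all of the compounds carries no information.
        return 0.0
    return -(f * math.log2(f) + (1.0 - f) * math.log2(1.0 - f))
```

Under this reading the value peaks when the feature is present in exactly half of the sample set and falls to zero when the feature is present in none or all of it.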
The molecular property score $Cov_{\mathrm{final}}$ can be given by
[equation not reproduced in this text; the surviving fragment refers to the case where $f$ is greater than 0.5]
where
$Cov = -\ln\left(P_{\mathrm{corr}} / P_{\mathrm{base}}\right)$
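As a rough illustration of the Cov term, the sketch below computes -ln(P_corr / P_base) for a single feature in Python. The expression for P_corr does not appear in this text, so the sketch assumes the commonly used Laplacian-corrected estimator (F_sampled + 1) / (F_set + 1 / P_base); that form, together with the function and parameter names, is an assumption and should not be read as the formula of the patent.

```python
import math


def coverage_score(f_sampled: int, f_set: int, n_sampled: int, n_population: int) -> float:
    """Cov = -ln(P_corr / P_base) for one molecular feature.

    f_sampled: number of sample-set compounds containing the feature
    f_set: number of population compounds containing the feature
    """
    # P_base as defined in the text: sample-set size over population size.
    p_base = n_sampled / n_population
    # ASSUMPTION: standard Laplacian-corrected estimator, not taken from the source.
    p_corr = (f_sampled + 1) / (f_set + 1 / p_base)
    return -math.log(p_corr / p_base)
```

Under this assumed form, a feature sampled at exactly the base rate scores close to zero, an under-sampled feature scores positive, and an over-sampled feature scores negative.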
The subset may include a specified number of compounds.
The method may include defining the number of compounds to be selected within the subset.
The evaluating step may include determining whether the subset score meets a specified condition.
The prescribed condition may be that the subset score is greater than a prescribed minimum threshold score.
If the prescribed conditions are met, the method may include synthesizing at least some of the compounds within the selected subset to determine one or more biological characteristics of the compounds.
The method may include adding the synthesized compounds to the training set.
The selected subset may be an initially selected subset, and the method may include: selecting a second subset, different from the initially selected subset, comprising one or more compounds from the population that are not within the training set; and determining a subset score for the selected second subset, and evaluating the selected second subset based on the determined score.
If the prescribed condition is not met, the steps of selecting the second subset and determining its score may be performed.
Selecting the second subset may include replacing one or more compounds within the initially selected subset with one or more new compounds from the population that are not within the training set.
The method may include identifying one or more compounds within the initially selected subset to be replaced based on the respective determined compound scores for the one or more compounds within the initially selected subset.
One or more compounds having the lowest determined compound score within the initially selected subset may be identified for replacement.
The method may comprise iteratively performing the following steps until a termination condition is met: selecting a new subset, different from the subset selected in the previous iteration, comprising one or more compounds from the population that are not within the training set; and determining a subset score for the selected new subset and evaluating the selected new subset based on the determined score.
The termination condition may include at least one of: the maximum number of iterations has been performed; the subset scores of the subsets selected in one iteration meet a prescribed condition; and, the difference between the respective subset scores of the selected subsets at successive iterations is less than a prescribed difference threshold.
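The iterative selection and its three termination conditions can be sketched in Python as below. This is a hedged illustration, not the patented procedure: the scoring callables are supplied by the caller, the replacement compound is drawn at random, and a proposed subset is kept only if it improves the score. Those choices, and all names, are assumptions. The pool is assumed to contain unique compounds that are already outside the training set.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def refine_subset(
    pool: Sequence[int],
    k: int,
    subset_score: Callable[[List[int]], float],
    compound_score: Callable[[int, List[int]], float],
    max_iters: int = 50,
    target: Optional[float] = None,
    min_delta: float = 0.0,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Iteratively replace the lowest-scoring compound of a size-k subset."""
    rng = random.Random(seed)
    subset = rng.sample(pool, k)
    score = subset_score(subset)
    for _ in range(max_iters):  # termination: maximum number of iterations
        if target is not None and score > target:
            break  # termination: subset score meets the prescribed condition
        outside = [c for c in pool if c not in subset]
        if not outside:
            break  # nothing left to swap in
        # Identify the compound with the lowest compound score for replacement.
        worst = min(subset, key=lambda c: compound_score(c, subset))
        candidate = [c for c in subset if c != worst] + [rng.choice(outside)]
        new_score = subset_score(candidate)
        converged = abs(new_score - score) < min_delta
        if new_score > score:
            subset, score = candidate, new_score
        if converged:
            break  # termination: successive scores differ by less than min_delta
    return subset, score
```

For example, `refine_subset(list(range(20)), 3, lambda s: float(sum(s)), lambda c, s: float(c))` treats integers as stand-in compounds and their values as stand-in scores.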
The method may include synthesizing a selected subset of compounds in an iteration that satisfies a termination condition to determine one or more biological characteristics of the compounds.
The method may include selecting a plurality of new subsets in each iteration, identifying one of the plurality of selected subsets in an iteration satisfying a termination condition based on the determined subset scores of the respective plurality of selected subsets, and synthesizing compounds of the one identified subset to determine one or more biological properties of the compounds.
The identified subset may be the subset having the highest subset score among the plurality of subsets at the iteration in which the termination condition is satisfied.
The selected subset may be a first subset, and the method may include: selecting a plurality of subsets from the population, each subset comprising a plurality of compounds that are not within the training set; determining a subset score for each subset; and selecting the first subset from among the plurality of subsets based on the determined subset scores for the respective subsets.
The first subset may be selected as the subset having the highest subset score among the plurality of subsets.
Multiple subsets may each have the same number of compounds.
The step of evaluating may include evaluating the selected subset based on activity scores of the selected subset obtained from an activity model for predicting activity levels of the compounds in the population.
The evaluating step may include evaluating the selected subset based on the determined subset score and an activity score relative to a desired balance of the scores.
The plurality of new subsets may each include a different balance between the determined score and the activity score.
The plurality of new subsets may form a Pareto front of the determined subset scores and activity scores at the iteration that satisfies the termination condition.
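A Pareto front over two scores can be extracted as in the following Python sketch. It assumes each candidate subset has been reduced to a pair (subset score, activity score) and that both scores are to be maximized; the name `pareto_front` and the example values are illustrative only.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the points not dominated by any other point (both axes maximized)."""
    front = []
    for i, (s1, a1) in enumerate(points):
        dominated = any(
            # Another point dominates if it is at least as good on both scores
            # and strictly better on at least one of them.
            (s2 >= s1 and a2 >= a1) and (s2 > s1 or a2 > a1)
            for j, (s2, a2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((s1, a1))
    return front


# Hypothetical (subset score, activity score) pairs for five candidate subsets.
candidates = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (2.0, 2.0), (1.0, 1.0)]
front = pareto_front(candidates)
```

Each point on the front represents a different balance between the two scores, so no member of the front can be improved on one score without giving up some of the other.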
The training set may be initially empty.
The molecular characteristics of each of the plurality of compounds in the population may include structural features of the compounds.
The structural features of each of the plurality of compounds in the population may correspond to fragments present in the compound.
Fragments present in each of the plurality of compounds may be represented as molecular fingerprints.
The molecular fingerprint may be an Extended Connectivity Fingerprint (ECFP), optionally ECFP0, ECFP2, ECFP4, ECFP6, ECFP8, ECFP10 or ECFP12.
The molecular characteristics of each of the plurality of compounds in the population may include chemical characteristics of the compound.
The molecular characteristics of each of the plurality of compounds in the population may include structural features and chemical characteristics of the compound.
The chemical property may correspond to the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
The chemical characteristics of at least some of the compounds in the population may correspond to a prediction of the type of interaction exhibited when the respective compound binds to the predetermined target molecule.
The prediction may include predicting which of one or more predetermined types of interactions will be exhibited upon binding of the corresponding compound to the predetermined target molecule.
The method may comprise obtaining a prediction of the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
Obtaining a prediction for each compound may include: generating a three-dimensional representation of the compound; and performing a docking process using the generated three-dimensional representation to predict a preferred docking pose when the compound binds to the predetermined target molecule, wherein the type of interaction exhibited is predicted based on the result of the docking process.
The type of interaction exhibited when the corresponding compound binds to the predetermined target molecule may be expressed as an interaction fingerprint. Optionally, the interaction fingerprint is a protein-ligand interaction fingerprint (PLIF).
The type of interaction may include one or more of: hydrogen bonding interactions, weak hydrogen bonding interactions, ionic interactions, hydrophobic interactions, face-to-face aromatic interactions, edge-to-face aromatic interactions, pi-cation interactions, and metal complexation interactions.
Each compound in the population may be a ligand and the predetermined target molecule is a protein.
The molecular characteristics of each of the plurality of compounds in the population may include the physical characteristics of the compound.
The one or more biological characteristics may include one or more of the following: activity, selectivity, toxicity, absorption, distribution, metabolism, and excretion.
One or more biological properties may be defined relative to the respective desired biological properties.
The method may include: defining a machine learning model for modeling one or more biological properties of a compound in a population based on one or more molecular properties of the compound; and training a machine learning model using a training set of compounds.
The method may include performing a training step each time one or more compounds are added to the training set.
The machine learning model may be at least one of the following models: bayesian optimization models, regression models, cluster models, decision tree models, random forest models, and neural network models.
The method may include executing a machine learning model after the training step to predict one or more compounds in the population having one or more desired biological properties.
The method may further comprise synthesizing at least one of the one or more predicted compounds.
The one or more predicted compounds may be drug candidates or therapeutic molecules that have a desired biological, biochemical, physiological and/or pharmacological activity against a predetermined target molecule.
The predetermined target molecule may be a therapeutic, diagnostic or experimentally determined target in vitro and/or in vivo.
Candidate drugs or therapeutic molecules may be used in medicine; for example, for treating animals such as humans or non-human animals.
According to another aspect of the present invention there is provided a compound identified by the method described above.
According to another aspect of the present invention, there is provided a non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a computer processor, cause the computer processor to perform the above-described method.
According to another aspect of the present invention, a computing device for computational drug design is provided. The computing device includes an input configured to receive data indicative of a population of a plurality of compounds, each compound having one or more molecular characteristics, and to receive data indicative of a training set of compounds from the population for which one or more biological characteristics are known. The computing device includes a processor configured to select a subset of one or more compounds from the population that are not within the training set, determine a subset score for the selected subset based on molecular characteristics of the one or more compounds within the selected subset, and evaluate the selected subset based on the determined subset score. The computing device includes an output arranged to output the evaluation result. The subset score is determined based on the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising the training set and the selected subset.
The processor may be configured to perform the method as described above.
Drawings
Embodiments of the present invention will now be described with reference to the following drawings, in which
FIG. 1 schematically illustrates ECFP2 fingerprint of aspirin molecules;
FIG. 2 shows a table of Shannon entropy scores for structural features present in at least some compounds within an exemplary set of compounds;
FIG. 3 shows a plot of Shannon entropy score versus frequency for different structural features present in the exemplary set of compounds of FIG. 2;
FIG. 4 lists structural features of compounds in an exemplary set of compounds in order of Shannon entropy score;
FIG. 5 schematically illustrates a relationship between a previous set of compounds and a selected subset of compounds in an exemplary set or population of compounds;
FIG. 6 summarizes the steps of a method according to the present invention;
FIGS. 7(a) and 7(b) show selections of subsets of compounds from a population of compounds in a chemical interaction space ("interaction space") and a chemical structure space ("chemical space"), respectively, these selections being made according to an example of the method of FIG. 6; and
FIGS. 8(a) and 8(b) show one of the subset selections of FIGS. 7(a) and 7(b) in the chemical interaction space ("interaction space") and the chemical structure space ("chemical space"), respectively, compared with a subset selection according to a prior method.
Detailed Description
Molecular or drug design can be regarded as a multi-dimensional optimization problem that uses a cycle of hypothesis formation and experimentation to build knowledge. Each compound design can be considered a hypothesis that is tested by experiment. Experimental results are expressed as structure-activity relationships, which shape the hypotheses about which chemical structures might have the desired features. The drug design process is also an optimization problem because each project begins with a product profile (i.e. an objective function) of particular desired attributes. However, while the goal can be described precisely, finding the optimal solution remains an expensive and difficult challenge. One particular difficulty with such problems is effectively constructing a hypothesis landscape that spans a broad space of viable solutions from a relatively limited knowledge base of experimental results.
The drug discovery process is typically performed in iterations called design cycles. In each iteration, a set of molecules or compounds is synthesized and their biological properties are determined. These properties are analyzed and, based on experience from previous iterations, a new set of compounds is proposed. This process is repeated until a clinical candidate is found. In addition to activity, the measured biological properties may include one or more of selectivity, toxicity, affinity, absorption, distribution, metabolism, and excretion.
At any particular stage of the process, a set of compounds has been synthesized or prepared and their biological activity is known. The aim of the method is to find one or more optimal compounds within a large population or library of compounds that could be synthesized, but synthesizing any subset of compounds from that population costs resources and/or time.
An automated or computational drug design process uses mathematical models (e.g. machine learning (ML) models) to predict or hypothesize which compounds in the population of compounds that can be prepared are the best compounds, e.g. those that maximize biological activity. The ML model is trained using the structure-activity relationships available from experimental results (i.e. from those compounds in the population that have already been synthesized and tested). The strategy of using the ML model to select for synthesis the compounds with the highest predicted activity from the population of possible compounds is called "exploitation". The exploitation strategy can be regarded as the use phase of the method.
This approach will only succeed if the predictive power of the ML model is sufficiently accurate, i.e. if the ML model is trained sufficiently well. Each compound from the population that is synthesized and tested is added to the training set of compounds used to train the ML model. The number of molecules or compounds that can be added to the training set in a particular iteration is typically limited by resources. That is, the number of compounds within the subset synthesized in each iteration is limited to a specified maximum number.
The predictive power of the ML model is sufficiently accurate only when there is a sufficient number of compounds within the training set. Thus, a certain number of iterations or design cycles may need to be performed before the ML model is adequately trained, e.g. with the specified maximum number of compounds being added to the training set in each iteration.
Furthermore, the predictive power of the ML model is sufficiently accurate only if the compounds within the training set sufficiently represent the total population of compounds that can be selected for synthesis. It is therefore important that the compounds that are most conducive to improving the ML model (i.e., the most representative compounds) be included in the subset to be synthesized for any given iteration before the ML model is fully trained. The selection of the compound to be synthesized on this basis is called "exploration". The exploration strategy may be considered as a learning phase or training phase of the method.
Thus, exploitation and exploration strategies compete when selecting the subset of compounds to be synthesized in a particular iteration of the drug discovery process. In practice, which strategy is appropriate may vary with the stage of the drug discovery process. For example, at an early stage of a drug discovery project it is not yet possible to build a well-trained model. The exploration strategy may therefore be the most appropriate at this stage, as the return on exploration is ultimately a better trained, and thus more accurate, model. At this stage an exploitation strategy does not make the best use of limited resources, as exploitation is not a particularly good way of improving the representativeness of the training set. On the other hand, if the ML model has been sufficiently trained (e.g. at a later stage of a drug discovery project), exploitation is an appropriate strategy, as the model selects for synthesis a subset of compounds that are more likely to be the best compounds with respect to the desired characteristics (e.g. a high level of biological activity). At this stage an exploration strategy does not make the best use of limited resources, as exploration is not the best way of selecting compounds that are likely to have the desired characteristics.
As described above, the ML model is only likely to make accurate predictions if: there is a sufficient number of compounds in the set used to train the ML model; and the compounds within the training set are sufficiently representative of the library of compounds from which the compounds to be synthesized are selected. The first of these means that a certain number of design cycles may be required to obtain a sufficient number of synthesized compounds (unless sufficient data on previously synthesized compounds is already available). The second means that, for the initial design cycles in the early stages of a drug discovery project, it may not be desirable to decide which compounds to include in the set to be synthesized using (only) the ML model. This is because the ML model would predict which compounds are highly active using a model that has not yet been trained to a sufficient level, meaning that the predictions may be inaccurate. Furthermore, synthesizing compounds on the basis of such predictions would do little to improve the ML model for subsequent design cycles, as the predictions focus further on relationships/information already identified from the compounds within the training set. In particular, the predictions of the ML model do not help to suggest which compounds to synthesize in order to refine the ML model for the next design cycle.
To reduce the time and cost associated with drug discovery projects, the number of iterations or design cycles required to discover candidate or optimal compounds with the desired characteristics should be minimized. It is therefore crucial that a trained model for predicting compounds with the desired properties can be built as quickly as possible, i.e. with as few compounds as possible in the training set. It is thus important to select the most representative compounds for synthesis at an early stage of the project, so as to minimize (at least to some extent) the number of exploration iterations needed, as candidate compounds are unlikely to emerge from iterations that employ this strategy.
Furthermore, each iteration of the drug discovery process may implement a combination of exploration and exploitation. That is, within the subset of compounds selected for synthesis in a given iteration, some compounds may be selected according to an exploration strategy and some according to an exploitation strategy. For example, the number of compounds selected for the subset according to the exploration strategy may decrease as the number of iterations already performed increases, as the accuracy of the ML model may increase with each successive iteration. Conversely, the number of compounds selected for the subset according to the exploitation strategy may increase as the number of iterations already performed increases.
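One way to picture this shifting balance is the Python sketch below, which splits a fixed batch of compounds between exploration and exploitation. The text only states that the exploration share may fall as iterations accumulate; the linear schedule, the function name `split_budget` and its parameters are assumptions made for illustration.

```python
from typing import Tuple


def split_budget(iteration: int, n_iterations: int, batch_size: int) -> Tuple[int, int]:
    """Return (n_explore, n_exploit) for a zero-based design-cycle index.

    ASSUMPTION: the exploration fraction decays linearly from 1 to 0 over the
    planned number of design cycles.
    """
    frac_explore = max(0.0, 1.0 - iteration / max(1, n_iterations - 1))
    n_explore = round(batch_size * frac_explore)
    return n_explore, batch_size - n_explore


# Hypothetical project: five design cycles, ten compounds synthesized per cycle.
schedule = [split_budget(i, 5, 10) for i in range(5)]
```

Other schedules, for example one driven by a measured estimate of model accuracy, would fit the same description equally well.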
It is an advantage of the present invention to provide an improved computational drug design method for selecting compounds for synthesis as part of an exploration strategy, optionally in combination with a development strategy, thereby reducing the time and cost of training the ML model to a sufficient level.
According to the present invention, the first step of the computational drug design method is to define a population of a plurality of compounds or molecules. In particular, the population is the collection of compounds that can be selected for synthesis during a particular drug discovery project. The population may be defined or obtained in any suitable manner, such as by known computational methods and/or using human input. For example, a population may be a group of compounds obtained from a generative design algorithm or an evolutionary design algorithm. In particular, an evolutionary design algorithm can generate a number of new compounds based on an initial set of one or more known compounds (e.g., existing drugs) that have at least some of the desired characteristics of the optimal compound for the particular project in which the method will be used. Alternatively, the new compounds may be generated in any other suitable manner. Those generated compounds having at least some of the desired characteristics may be retained for further analysis. In one example, a starting set of compounds (e.g., comprising millions of compounds) can be reduced using known methods so as to retain those compounds having at least some of the desired characteristics for the particular project at hand. One or more filters may be applied to the retained compounds to remove any unwanted compounds. A filter may be defined according to any suitable criterion for separating desired compounds from undesired compounds. For example, one filter may remove duplicate compounds. Another filter may remove compounds having a certain level of toxicity. The filtered set of compounds may then form the population from which a selection may be made for synthesis.
The population may include any suitable number of compounds. In general, the population will contain more compounds, and possibly significantly more, than the number of compounds that will be synthesized as part of a particular drug discovery project, for example because of the resources available. However, the population typically does not include so many compounds that computational analysis of the population according to the present invention becomes infeasible. For example, the number of compounds in a population may typically be on the order of hundreds or thousands, but it should be understood that a population may be larger or smaller than this for any given project.
The next step of the computational drug design method is to define a training set of compounds from the population for which one or more biological properties are known. The training set will be used to train the ML model to evaluate the structure-activity relationship for the particular drug discovery project. The training set includes those compounds from the population that have been synthesized and tested experimentally to determine one or more biological properties. Thus, as the drug discovery project progresses, i.e., as more iterations or design cycles are performed, the number of compounds within the training set increases. At the beginning of the drug design method, i.e. before any compounds in the population have been tested, the training set may (initially) be empty, i.e. it may comprise zero compounds. Optionally, the training set may include compounds whose biological properties are known a priori, such as compounds that have previously been tested as part of different projects and that have at least some of the desired properties of the optimal compound for the particular project under consideration.
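As a rough illustration of the filtering step, the sketch below builds a population from candidate compounds written as SMILES strings. The function names, the example strings and the nitro-group test standing in for a toxicity filter are all assumptions made for illustration; the text does not specify how any filter is implemented.

```python
from typing import Callable, Iterable


def build_population(candidates: Iterable[str],
                     filters: list[Callable[[str], bool]]) -> list[str]:
    """Drop duplicates, then keep only compounds that pass every filter."""
    seen: set[str] = set()
    population: list[str] = []
    for smiles in candidates:
        if smiles in seen:          # duplicate filter (exact string match only)
            continue
        seen.add(smiles)
        if all(keep(smiles) for keep in filters):
            population.append(smiles)
    return population


def no_nitro_group(smiles: str) -> bool:
    """Hypothetical stand-in for a toxicity filter: reject nitro groups."""
    return "[N+](=O)[O-]" not in smiles


candidates = ["CCO", "CCO", "c1ccccc1[N+](=O)[O-]", "CC(=O)Oc1ccccc1C(=O)O"]
population = build_population(candidates, [no_nitro_group])
print(population)
```

Exact string comparison only catches duplicates written identically; in practice the structures would first need to be brought into a canonical form, which this sketch does not attempt.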
The next step of the computational design method includes selecting a subset of at least one compound from the population, wherein the compounds within the subset are not within the training set. The number of compounds selected is based on the number of compounds that can be tested in any given iteration or design cycle of the drug design project, taking into account the available resources. Thus, the number of compounds to be selected within the subset, or at least an upper limit on that number, may be predetermined. Generally, the method includes specifying the number of compounds to be selected within the subset. The manner in which the subset of one or more compounds is selected will be discussed in more detail below. The size of the selected subset, i.e., the number of compounds within the selected subset, is likely to be significantly smaller than the size of the population. For example, the number of compounds within the selected subset may be at least one order of magnitude lower than the number of compounds in the population, and optionally lower by more than one order of magnitude.
In its broadest sense, the computational design method of the present invention can be seen as providing a method for selecting (a subset of) compounds according to an improved exploration strategy in a given iteration or design cycle of a drug design project, as described below. However, it should be appreciated that this may be combined with compound selection according to different strategies (e.g., different exploration or exploitation strategies) in any given design cycle.
The exploration strategy of the method is based on information theory. In particular, the selection of compounds should be based on which compounds will provide the greatest amount of information at the time of the assay (e.g., regarding structure-activity relationships in the population). The amount or type of information provided by a compound or subset of compounds is determined based on the characteristics of the compound.
Each compound in the population has a number of structural features that combine to form its chemical structure. These structural features may be represented in any suitable manner. For example, one way to describe the structure of a compound or molecule is by means of a fingerprint. In particular, a fingerprint of a particular compound may be expressed as a mathematical object, such as a series of bits or a list of integers, reflecting which particular structural features or substructures are present or absent in the compound.
There are several different categories of fingerprints, such as topological, structural, and circular fingerprints. One common circular fingerprint is the Extended Connectivity Fingerprint (ECFP). A variety of ECFP variants are known, such as ECFP0, ECFP2, ECFP4 and ECFP6. As is well known in the art, determining a fingerprint of a compound generally includes assigning an identifier to each atom in the compound, updating the identifiers based on adjacent atoms, removing duplicate entries, and then forming a vector from the list of identifiers.
For ease of illustration only, FIG. 1 shows an example of a molecule and its ECFP2 fingerprint features. In particular, the molecule chosen for the example is aspirin. It can be seen that the compound comprises 17 fingerprint features, each feature representing a fragment or portion of the compound, and each feature being stored as a (positive or negative) integer.
The ML model constructed as part of the drug discovery project can correlate structural features that may be present in a compound with desired properties (biological activity). For example, in projects where high activity against a target is desired and a particular compound exhibits a high level of activity, one problem is determining which features of the compound result in a high level of activity. The goal of this method may therefore be to select compounds for testing so as to maximize the amount of information associated with the structure-activity relationship.
Shannon entropy (or "information entropy", or simply "entropy") in information theory is a measure of the content of information. In general, when questions about a dataset are asked, some questions are more informative than others. Shannon entropy can be used to determine which questions are most appropriate for questioning in order to maximize information extraction. The best (binary) questions to ask are those that divide the data set equally into two parts.
In the context of the present method, the entropy of individual features (e.g., fingerprint features) of the compounds in the population may be determined. In particular, the entropy of a particular feature depends on the number of compounds in the population in which that particular feature is present. Features that are present in one half of the compounds in the population but not in the other half will have the highest entropy. On the other hand, features that are present in each compound in the population or that are not present in each compound in the population will have the lowest entropy value (in fact, zero value). For example, by testing compounds with relatively high entropy characteristics, it is easier to infer which characteristics contribute to high activity levels.
The Shannon entropy H of a feature x in a population of compounds can be expressed as:
$$H(x) = -\sum_i p(x_i) \ln p(x_i)$$
where $p(x_i)$ is the probability of each of the different states of the feature in a compound of the population, i.e., present in the compound or absent from the compound.
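As an illustrative example, consider a population of 2400 compounds. A first feature (e.g., a fingerprint feature) is present in 1200 of the 2400 compounds. Thus, the Shannon entropy of this first feature can be calculated as:
$$H = -\left(\frac{1200}{2400}\ln\frac{1200}{2400} + \frac{1200}{2400}\ln\frac{1200}{2400}\right) = \ln 2 \approx 0.693$$
A second feature is present in 500 of the 2400 compounds. Thus, the Shannon entropy of the second feature is:
$$H = -\left(\frac{500}{2400}\ln\frac{500}{2400} + \frac{1900}{2400}\ln\frac{1900}{2400}\right) \approx 0.512$$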
It can be seen that the Shannon entropy of the first feature is greater than the Shannon entropy of the second feature. In fact, the first feature provides the greatest amount of information at the feature level, since it is present in exactly half of the compounds in the population.
Thus, the Shannon entropy of a feature in a population depends on the frequency of that particular feature in the compounds of that population, i.e., the number of compounds in the population in which that particular feature is present. For example, using the data set of 2500 compounds provided in Pickett et al. (2011), 'Automated lead optimization of MMP-12 inhibitors using a genetic algorithm' (ACS Medicinal Chemistry Letters, 2(1), 28-33), the Shannon entropy scores or values for different features in the population (expressed as ECFP fingerprints) are shown in FIG. 2. FIG. 2 shows that the structural feature of an amide attached to an aromatic ring is relatively rare in this collection or population, being present in only 50 of the 2500 compounds. Thus, the Shannon entropy score for this feature is relatively low, at 0.098. At the other end of the scale, the hydroxyl group (a substructure of the carboxylic acid) appears very frequently in the collection and is in fact present in each of the 2500 compounds, which means that its Shannon entropy score is 0. Among the features shown in FIG. 2, the ether oxygen has the highest Shannon entropy score because it comes closest to dividing the group of compounds equally in two according to whether the feature is present. That is, a feature with a frequency of 0.5 would be optimal and would maximize the Shannon entropy score.
FIG. 3 shows the distribution of the Shannon entropy scores of the ECFP6 features in the above group of compounds from Pickett et al. In particular, it can be seen that those fingerprint features (shown as points) that are present in about half of the compounds in the collection have the highest Shannon entropy scores, while those features present in very few or in almost all compounds have the lowest Shannon entropy scores. It can also be seen that most of the fingerprint features occur in fewer than half of the compounds in the set. Note that in FIG. 3, a single point may represent more than one feature.
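The comparison between the two features can be checked numerically. The following is a small sketch of the two-state (present/absent) entropy using the natural logarithm, as in the formula above; the function name `feature_entropy` is chosen here for illustration.

```python
import math


def feature_entropy(n_with_feature: int, n_total: int) -> float:
    """Two-state Shannon entropy (natural log) of a feature in a population."""
    p = n_with_feature / n_total
    # A state with zero probability contributes nothing to the sum.
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


h_first = feature_entropy(1200, 2400)   # feature present in half of the population
h_second = feature_entropy(500, 2400)   # feature present in 500 of 2400 compounds
print(round(h_first, 3), round(h_second, 3))
```

A feature present in every compound, or in none, gives an entropy of zero with this function, matching the limiting cases described in the text.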
In an exploration strategy utilizing information theory, one option is to use the Shannon entropy values of the features to select the compounds to be tested. In particular, a Shannon entropy score for each compound in the population can be calculated based on the Shannon entropy values of the features of that compound. For example, the Shannon entropy score of a compound may be the sum of the Shannon entropy values of the features present in the compound. However, the Shannon entropy values of the features of a compound may be combined in any suitable manner to obtain the entropy value of the compound. Thus, the first compound selected as part of the first iteration or design cycle according to the exploration strategy may be the compound in the population having the highest Shannon entropy score, so as to maximize the information content extracted. However, selecting subsequent compounds for testing using the same method (whether as other compounds to be tested in the same iteration, i.e., as part of the selected subset, or as compounds to be tested in subsequent iterations) does not produce the same beneficial results. In particular, selecting the second compound by maximizing the Shannon entropy score may mean that the first and second compounds are selected on the basis of the same factors, i.e. the factors that cause the first compound to have a high score may be the same factors that cause the second compound to have a high score. This may be likened to asking a similar question multiple times, which reduces the information content that can be extracted. Thus, there is a need for a compound selection strategy that balances maximizing Shannon entropy scores on the one hand against minimizing the overlap of features present in the selected compounds on the other.
A method is described below for providing an index or metric to determine the degree of "undersampling" of different features during selection. An undersampled feature is a feature that combines a high Shannon entropy score with a low degree of overlap with the features already selected. FIG. 4 illustrates the selection of a subset of five compounds from a collection or population of compounds. Specifically, FIG. 4 lists the fingerprint features (ECFP4 features in this example) in order of Shannon entropy score, i.e., based on how many compounds in the collection have the feature relative to the total number of compounds in the collection. FIG. 4 also shows how many of the five selected compounds have each feature. A score is to be defined that balances the Shannon entropy score against the number of times the feature is sampled within the subset. However, a score derived simply from the ratio between the two ignores some of the available information. For example, referring to FIG. 4, the ratio between the number of times a feature is sampled within the subset and the total number of compounds in which that feature is present does not indicate which of, for example, 2/200 or 3/300 is the more significant in the present case.
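The redundancy problem can be seen in a toy example. The sketch below scores each compound as the sum of the Shannon entropies of its features and ranks the population; the four compounds and their integer feature identifiers are invented for illustration and do not come from the text or figures.

```python
import math
from collections import Counter

# Hypothetical population: compound name -> set of fingerprint feature ids.
population = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
    "C": {3, 5},
    "D": {5, 6},
}


def feature_entropies(pop: dict[str, set[int]]) -> dict[int, float]:
    """Two-state Shannon entropy (natural log) of every feature in the population."""
    n = len(pop)
    counts = Counter(f for feats in pop.values() for f in feats)
    entropies = {}
    for feature, count in counts.items():
        p = count / n
        entropies[feature] = -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)
    return entropies


entropies = feature_entropies(population)
compound_scores = {name: sum(entropies[f] for f in feats)
                   for name, feats in population.items()}
ranked = sorted(compound_scores, key=compound_scores.get, reverse=True)
shared = population[ranked[0]] & population[ranked[1]]
print(ranked, shared)
```

Here the two highest-scoring compounds share two of their three features, so testing both would largely repeat the same question, which is the behaviour a static entropy ranking cannot avoid.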
Thus, an index may be defined that integrates the ratio and significance into a score or metric. In particular, the defined index is referred to as the "coverage score" of a (fingerprint) feature in a population, as it provides an indication of the coverage of information about that feature extracted across the chemical space of the population. For example, during the exploration phase of the project, extracting information with extensive coverage for the population facilitates reducing the time or number of data points to fully train the ML model describing or predicting the structure-activity relationship.
The coverage score of a feature is calculated (as described below) for use in subsequently calculating the coverage score of a particular compound, and ultimately the coverage score of a subset of compounds selected from the population (e.g., for testing). Referring to FIG. 5, in a given compound population or compound set 50, there is a previous set or training set 51 of compounds that have (previously) been selected and tested, for example in a previous iteration of the relevant drug discovery project. The previous set or training set 51 is separate from the subset 52 selected as part of the current iteration, i.e. the compounds in the selected subset 52 are different from the compounds in the previous set or training set 51. The number of sampled structures or compounds 51, 52, denoted $N_{\mathrm{sampled}}$, is then the sum of the sizes of the previous set and the selected subset, i.e.
$$N_{\mathrm{sampled}} = N_{\mathrm{prior}} + N_{\mathrm{subset}}$$
where $N_{\mathrm{prior}}$ is the number of compounds ($\geq 0$) that were selected before the current subset, e.g. in previous design cycles, and $N_{\mathrm{subset}}$ is the size of the desired subset. The chance or probability $P_{\mathrm{base}}$ that a compound drawn at random from the collection or population is a sampled compound is then given by
$$P_{\mathrm{base}} = \frac{N_{\mathrm{sampled}}}{N_{\mathrm{total}}}$$
where $N_{\mathrm{total}}$ is the number of compounds in the collection or population.
The features (fingerprints) of each compound in the collection or population are determined, and the frequency of occurrence of each feature i in each group is defined as follows: $F_{\mathrm{set},i}$ is the frequency of feature i in the (whole) set or population, i.e. the number of compounds in the set in which feature i is present; $F_{\mathrm{prior},i}$ is the frequency of feature i within the previous or training (sub)set; $F_{\mathrm{subset},i}$ is the frequency of feature i within the (currently) selected subset; and $F_{\mathrm{sampled},i} = F_{\mathrm{subset},i} + F_{\mathrm{prior},i}$ is the frequency of feature i in the so-called "sample set", which is the combination of the previous set and the selected subset.
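These four frequencies are simple counts. A minimal sketch is given below, assuming each compound is represented by a set of integer feature identifiers; the compounds, the chosen prior set and subset, and the helper name `feature_counts` are illustrative only.

```python
from collections import Counter

# Hypothetical population: compound name -> set of fingerprint feature ids.
fingerprints = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
    "C": {3, 5},
    "D": {5, 6},
    "E": {1, 6},
}
prior = ["A"]          # previously selected and tested compounds
subset = ["D", "E"]    # compounds selected in the current iteration


def feature_counts(compounds, fps) -> Counter:
    """Number of the given compounds in which each feature is present."""
    return Counter(f for c in compounds for f in fps[c])


F_set = feature_counts(fingerprints, fingerprints)   # whole population
F_prior = feature_counts(prior, fingerprints)
F_subset = feature_counts(subset, fingerprints)
F_sampled = F_prior + F_subset                       # sample set = prior + subset
print(F_set[1], F_sampled[1], F_sampled[4])
```

A `Counter` returns 0 for features that never occur in a group, which is convenient for features present in the population but absent from the sample set.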
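The (Laplacian-corrected) normalized probability $P_{\mathrm{corr},i}$ of feature i within the sample set can then be calculated as
$$P_{\mathrm{corr},i} = \frac{F_{\mathrm{sampled},i} + 1}{F_{\mathrm{set},i} + 1/P_{\mathrm{base}}}$$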
The "uncorrected" coverage score $Cov_i$ for feature i can then be defined as
$$Cov_i = -\ln\left(\frac{P_{\mathrm{corr},i}}{P_{\mathrm{base}}}\right)$$
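The behaviour of the uncorrected coverage score can be sketched numerically. The block below assumes that P_base is the sampled fraction N_sampled / N_total and that the Laplacian-corrected probability takes the commonly used form (F_sampled,i + 1) / (F_set,i + 1/P_base); both forms are assumptions of this sketch and should be checked against the original equations.

```python
import math


def uncorrected_coverage(f_sampled: int, f_set: int,
                         n_sampled: int, n_total: int) -> float:
    """Cov_i = -ln(P_corr,i / P_base) under the assumed forms described above.

    Requires n_sampled > 0 so that P_base is non-zero.
    """
    p_base = n_sampled / n_total                       # assumed: sampled fraction
    p_corr = (f_sampled + 1) / (f_set + 1 / p_base)    # assumed Laplacian correction
    return -math.log(p_corr / p_base)


# A feature present in 1200 of 2400 compounds, with 24 compounds sampled in total.
cov_never_sampled = uncorrected_coverage(0, 1200, 24, 2400)
cov_proportional = uncorrected_coverage(12, 1200, 24, 2400)
cov_always_sampled = uncorrected_coverage(24, 1200, 24, 2400)
print(cov_never_sampled, cov_proportional, cov_always_sampled)
```

Under these assumptions the score is positive for a feature sampled less often than its population frequency would suggest, close to zero when it is sampled in proportion, and negative when it is sampled more often, which matches the "undersampled" and "oversampled" language used in the text.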
In this way, a measure is provided of the number of times a feature is sampled relative to the number of times the feature appears in the set or population. However, the uncorrected coverage score needs to be corrected to take into account the information content that may be provided by the different features. In this way, it can be ensured that features are not "oversampled" (in the group of sampled compounds). In particular, the correction is based on the Shannon entropy score of feature i in the sampled group or collection of compounds. The frequency or fraction of feature i in the sampled group of compounds is denoted $f_i$.
The "Shannon correction" $SC_i$ of feature i (for its normalized probability) can then be given by
where the denominator normalizes the Shannon entropy correction such that $0 < SC_i \le 1$. Note that since the Shannon entropy correction depends on the frequency at which features are sampled, it will vary depending on the particular compounds within the previous set and the selected subset (e.g., it will vary between iterations of the drug discovery project). This is different from the Shannon entropy score described above for features of the population, which is constant because the specific compounds in the population remain unchanged over the different iterations of the drug discovery project.
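One plausible reading of the Shannon correction is the two-state entropy of a feature within the sample set, divided by its maximum value ln 2. The sketch below implements that reading only; the definition of the sampled fraction as F_sampled,i / N_sampled and the normalization by ln 2 are assumptions, not forms taken from the text.

```python
import math


def shannon_correction(f_sampled: int, n_sampled: int) -> float:
    """Normalized two-state entropy of a feature within the sample set.

    Assumed form: H(f) / ln 2 with f = f_sampled / n_sampled, so that the
    maximum value of 1 is reached when half of the sampled compounds carry
    the feature.
    """
    f = f_sampled / n_sampled
    entropy = -sum(q * math.log(q) for q in (f, 1.0 - f) if q > 0.0)
    return entropy / math.log(2)


sc_half = shannon_correction(12, 24)      # feature in half of the sampled compounds
sc_quarter = shannon_correction(6, 24)    # feature in a quarter of them
print(sc_half, sc_quarter)
```

With this reading a feature present in none or in all of the sampled compounds gives a correction of exactly 0, whereas the text states a strictly positive range, so the original definition may treat those boundary cases differently.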
In some examples, the Shannon correction may be applied slightly differently depending on whether the uncorrected coverage score is greater or less than zero. In some examples, the final (corrected) coverage score $Cov_{\mathrm{final},i}$ for feature i can thus be defined as
and $f_i > 0.5$
In some examples, the coverage score for a compound may then be calculated as the sum of the coverage scores of its features, and the coverage score for the selected subset may be calculated as the sum of the coverage scores of the compounds within the selected subset. Since the (feature) coverage score of a feature depends on the frequency of features within the sample set (which includes the selected subset and the previous set), the (compound) coverage score of compounds within the selected subset depends on which other compounds (particularly their features) are within the selected subset (and within the previous set). That is, if the selected subset includes a plurality of compounds, and one of these compounds is replaced by another compound in the population that is not in the previous set, then the (compound) coverage score for each compound within the (updated) selected subset needs to be recalculated in order to subsequently determine the (subset) coverage score for the (updated) selected subset.
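The dependence of a compound's score on the rest of the subset can be illustrated as follows. This sketch sums feature scores into compound scores and compound scores into a subset score, but it uses only the uncorrected coverage score with an assumed Laplacian-corrected form, and omits the Shannon correction entirely; the toy fingerprints are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical population: compound name -> set of fingerprint feature ids.
fingerprints = {
    "A": {1, 2, 3}, "B": {1, 2, 4}, "C": {3, 5},
    "D": {5, 6}, "E": {1, 6}, "F": {7, 8},
}


def subset_coverage(subset, prior, fps):
    """Return (subset score, per-compound scores) as plain sums of feature scores."""
    sampled = list(prior) + list(subset)
    p_base = len(sampled) / len(fps)                  # assumed: sampled fraction
    f_set = Counter(f for feats in fps.values() for f in feats)
    f_sampled = Counter(f for c in sampled for f in fps[c])

    def feature_cov(f):
        # Assumed Laplacian-corrected form; the Shannon correction is omitted.
        p_corr = (f_sampled[f] + 1) / (f_set[f] + 1 / p_base)
        return -math.log(p_corr / p_base)

    compound_scores = {c: sum(feature_cov(f) for f in fps[c]) for c in subset}
    return sum(compound_scores.values()), compound_scores


score_ab, scores_ab = subset_coverage(["A", "B"], [], fingerprints)
score_af, scores_af = subset_coverage(["A", "F"], [], fingerprints)
print(score_ab, score_af, scores_ab["A"], scores_af["A"])
```

Compound A receives a different score in each subset because its features are sampled a different number of times, so replacing B with F requires every compound score to be recomputed. The scores are all negative here only because a third of this very small population is sampled.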
If a feature becomes "oversampled" within the sample set, its coverage score may decrease, thereby making it less likely that a compound in which the feature is present is selected into the subset. If a compound is selected (i.e., the selected subset comprises that compound), the coverage scores of the features present in the compound will change and may become negative. In this sense, the features present in the selected compounds may now be considered "oversampled" relative to other features present in the compounds of the population, and their coverage scores are therefore reduced, because selecting these features again (in preference to other features) may not be optimal for extracting information content in the context of structure-activity relationships. The first compound may be selected in any suitable manner, for example as the compound having the highest coverage score, which for the selection of the first compound may be equivalent to the highest Shannon entropy score over the whole population, wherein the coverage score of a compound may be determined as the sum of the coverage scores of its features.
The Shannon entropy score of a compound at population level is static or constant between iterations, while the coverage score of a compound is dynamic or variable between iterations, depending on the number of times each of its features has been sampled. In particular, compounds having "oversampled" features will have a lower coverage score relative to other compounds, so that at each iteration the compound that maximizes the information gain can be selected, taking into account the previously sampled compounds.
After many compounds have been sampled, for example after multiple iterations of a drug discovery project or after testing a selected subset of compounds, many features with higher Shannon scores at the population level may have been sampled multiple times. At this stage or iteration, these features may therefore tend to have lower coverage scores than features that have lower Shannon scores at the population level but that have so far been sampled less frequently. In particular, the rarer features of the collection or population, i.e., features that are present in a relatively small number of compounds in the population, become more attractive, which may be reflected in relatively high coverage scores at this stage, meaning that compounds containing these rarer features become more likely to be selected.
Thus, in the broadest sense, the steps of the invention comprise selecting from the population a subset of one or more compounds that have not previously been sampled, and that are therefore not within the training set of compounds of known biological activity, based on both: which compounds are in the population; and which compounds are within a sample set comprising the training set and the selected subset. More specifically, the selection of a subset of one or more compounds (e.g., at a particular iteration) is based on the frequencies with which the structural features present in the compounds of the subset appear in the compounds of the population, and on the frequencies with which these structural features appear within the sample set. In other words, the selection of the subset of one or more (selected) compounds depends on a consideration, for each structural feature of the one or more selected compounds, of the number of compounds in the population that comprise the corresponding structural feature and the number of compounds in the sample set that comprise the corresponding structural feature. In general, a compound becomes less likely to be selected if its features are present in a relatively large number of compounds within the sample set.
As described above, one way in which a subset of compounds may be selected according to the above considerations is to assign a score to the subset that quantifies those considerations. In particular, the score may balance the frequency with which the features of the selected compounds are found in the population against the frequency with which those features are found within the sample set. Where a subset of compounds is selected so as to maximize the (subset) coverage score, compounds having structural features that are "undersampled" within the sample set relative to their frequency in the population, but that will provide a relatively high level of information content (according to the Shannon correction defined above), will have higher (compound) coverage scores, thereby increasing the chance that such compounds remain within the selected subset. The greater the number of compounds within the sample set that exhibit a particular structural feature, the more likely it is that the score of compounds containing that structural feature will decrease. Although the score is described above as being calculated according to the above formulae using normalized probabilities, it should be understood that this is merely one example of how the score may be determined taking into account the factors and considerations described herein, namely proportional sampling of features in the population while maximizing the information content extracted, and that other equations or methods may also be used.
According to the steps of the present invention, since the coverage score of a compound within a selected subset depends on the other compounds within the selected subset, the (subset) coverage score of the selected subset may be determined based on the (compound) coverage scores of the selected compounds within the selected subset. The selected subset is then evaluated based on the determined subset coverage score. The subset score is determined based on the frequencies of the structural features of the compounds of the selected subset within the population and within the sample set. The method of the invention may comprise determining a (feature) coverage score for each of one or more structural features (e.g. fingerprint features) of the selected compounds within the subset according to the frequencies of the corresponding structural features within the population and the sample set, wherein the (compound) coverage score for each selected compound may be based on the determined scores of the one or more structural features of that compound. For example, the score for each selected compound may be determined as the sum of the determined scores of the one or more structural features of the selected compound. Optionally, the sum may be a weighted sum of feature scores, e.g., where the weights are based on the particular features and/or the size (e.g., number of features) of the compound.
The step of evaluating may comprise evaluating whether the selected subset of one or more compounds is suitable for a particular purpose, e.g. to be proposed for synthesis in order to determine the biological properties of the compounds within the subset, or whether a different subset of compounds is to be selected. For example, the evaluating step may include determining whether the determined score of the selected subset (e.g., the coverage score as described above) meets a specified condition, e.g., whether the score is greater than a specified minimum threshold score. If the specified condition is met, or if it is determined from the evaluating step that a different/updated subset of compounds is not to be selected, the method may include synthesizing the selected compounds within the subset to determine one or more biological properties of the selected compounds. The one or more synthesized compounds may then be added to the training set.
In the case where the selected subset is the initial selected subset, the method may include selecting a second subset from the population that is different from the initial selected subset, wherein compounds within the second subset are also not within the training set. The score of the selected second subset may then be determined in a manner corresponding to the first initial subset (note that the score of any compound common to the initial and second subsets needs to be recalculated), and the selected second subset may then be evaluated. For example, the evaluation may determine whether a prescribed condition is satisfied. As described above, the step of selecting the second subset and determining the score thereof may be performed only in the case where the prescribed condition is not satisfied with respect to the initial subset. The initial subset may be selected randomly from the population, or may be selected using any suitable alternative method.
The selection of the first (initial), second and subsequent subsets may be part of an iterative process to obtain a subset of compounds that meet the desired conditions, and thus may be suitable for synthesis in a particular iteration or design cycle of the drug discovery project. Such a method or process may include iteratively selecting and evaluating (based on the scores determined thereof) a new subset of one or more compounds until a termination condition is met. Each selected new subset is different from the selected subset in the last iteration, wherein compounds within the selected new subset are from the population and are not within the training set. The termination condition may be any suitable condition that causes further selection of the new subset to be no longer performed. For example, the termination condition may be that the iterative process has selected a maximum number of new subsets, i.e. a maximum number of iterations has been performed. Alternatively, the termination condition may be that the scores of the subset selected in one iteration meet a prescribed condition. The termination condition may also be that the difference between the respective scores of the selected subset at successive iterations is less than a prescribed difference threshold. The termination condition may include any combination of these example conditions, and/or include any other suitable condition. Thus, the method may comprise synthesizing some or all of the compounds within the selected subset in an iteration that satisfies the termination condition to determine one or more biological properties of the selected compounds.
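The iterative selection with termination conditions can be sketched as a simple loop. The example below draws random subsets and stops either when a prescribed minimum score is reached or when a maximum number of iterations has been performed. The scoring function (the number of distinct features covered) is a toy stand-in for the coverage score, and all names and data are hypothetical.

```python
import random

# Hypothetical population: compound name -> set of fingerprint feature ids.
fingerprints = {
    "A": {1, 2, 3}, "B": {1, 2, 4}, "C": {3, 5},
    "D": {5, 6}, "E": {1, 6}, "F": {7, 8},
}
training_set = {"A"}


def iterate_subsets(candidates, subset_size, score_fn, min_score,
                    max_iterations=1000, seed=0):
    """Select new subsets until a termination condition is met.

    Assumes there are more candidates than subset_size, so that a subset
    different from the previous one can always be drawn.
    """
    rng = random.Random(seed)
    best, best_score, previous = None, float("-inf"), None
    for iteration in range(1, max_iterations + 1):    # condition 1: iteration cap
        subset = frozenset(rng.sample(candidates, subset_size))
        while subset == previous:                     # new subset must differ from the last
            subset = frozenset(rng.sample(candidates, subset_size))
        previous = subset
        score = score_fn(subset)
        if score > best_score:
            best, best_score = subset, score
        if best_score >= min_score:                   # condition 2: prescribed score reached
            break
    return best, best_score, iteration


def toy_score(subset):
    """Toy stand-in for the subset score: number of distinct features covered."""
    return len(set().union(*(fingerprints[c] for c in subset)))


candidates = sorted(c for c in fingerprints if c not in training_set)
best, best_score, n_iter = iterate_subsets(candidates, 2, toy_score, min_score=5)
print(sorted(best), best_score, n_iter)
```

A threshold on the difference between scores at successive iterations, also mentioned in the text, could be added as a third check in the same place.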
In general, there may be a prescribed upper limit on the number of compounds that can be tested during an iteration or design cycle of the drug discovery process, which may inform the number of compounds to be included in the selected subset. The selection of the subset may be performed in any suitable manner. As part of the exploration strategy, it may be desirable to select a subset with a high coverage score. However, it is difficult to determine a subset to optimize coverage scores for the entire population. This is because the coverage score of a single compound within a subset depends on the other compounds within that subset, and because it is generally possible to form a plurality of different combinations of compounds of the subset from the population. For example only, the subset may include about 10, 20, or 30 compounds; however, it should be understood that the subset may include any suitable number of compounds from the population. The number of unique subsets increases exponentially with increasing subset size or population size, so it is not always possible to enumerate all possible subsets and select the best (highest scoring) subset.
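The scale of the problem is easy to check. The short sketch below counts the distinct subsets of 10, 20 and 30 compounds that can be drawn from a population of 2500 compounds, the size of the data set mentioned earlier; the variable names are illustrative.

```python
import math

population_size = 2500
subset_counts = {size: math.comb(population_size, size) for size in (10, 20, 30)}

for size, count in subset_counts.items():
    # Print the order of magnitude rather than the full integer.
    print(size, f"about 10^{len(str(count)) - 1}")
```

Even the smallest of these counts is far beyond what could be scored exhaustively, which is why a search heuristic is needed.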
One option is to generate or select one or more initial subsets of compounds, each of which is then modified by replacing one or more of its compounds to increase its coverage score. For example, if one or more initial subsets are selected, the evaluating step of the method may include determining whether the score of any of the selected subsets meets a specified condition. The specified condition may be that the score is greater than a specified minimum threshold score. If the specified condition is met, the method may include synthesizing the compounds within the selected subset that meets the condition to determine one or more biological properties of the compounds. The synthesized compounds may then be added to the training set. This process may be performed using a genetic algorithm, which is a good way to find near-optimal solutions when a full scan of all options is not possible or feasible.
If the one or more initially selected subsets do not meet the prescribed conditions, one or more second subsets may be selected and checked for satisfaction of the prescribed conditions. In practice, one or more subsets may be iteratively generated until a desired subset is obtained. In particular, multiple subsets may be generated initially (in parallel), e.g. randomly or using evolutionary or genetic algorithms. It should be appreciated that any suitable number of subsets may be generated, such as less than 100, less than 50, or less than 10. The coverage score for each of these generated subsets may then be determined, and one or more of the subsets with the highest determined score may then be iterated to attempt to further increase their score. Note that at this stage, a particular compound may be included within more than one subset of the plurality of subsets. Further, note that since the coverage score for one compound within a subset depends on the other compounds within the subset, when one or more compounds within a subset are replaced during an iteration to maximize the coverage score within the subset, the coverage score for the remaining compounds within the subset will change at each iteration, thus requiring recalculation at each iteration to determine the score within the subset. That is, if a high scoring compound is selected within the subset, then if a similar compound is added to the training set or subset (i.e., the sampling set), its score will decrease because similar compounds will have common features, which will therefore be sampled more, resulting in a decrease in value. Thus, simply replacing one compound within a subset with another compound having a higher coverage score does not necessarily increase the overall coverage score of that subset. Therefore, what is needed to optimize is the score of the subset, not the score of the individual compounds. For example, a genetic algorithm may be used to optimize the subset in this way.
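The dependence of a compound's score on the other members of the sampling set can be illustrated with a small sketch. The clauses below define Cov = -ln(P_corr / P_base), but the equation for P_corr is not reproduced in this text, so the sketch assumes the commonly used Laplacian-corrected estimate P_corr = (F_sampled + 1) / (F_set + 1 / P_base). That form, the omission of the Shannon entropy weighting, and all function and variable names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, Iterable

Fingerprint = FrozenSet[int]  # indices of the features present in a compound


def feature_scores(population: Dict[str, Fingerprint],
                   sample_ids: Iterable[str]) -> Dict[int, float]:
    """Cov = -ln(P_corr / P_base) per feature, under the ASSUMED Laplacian form."""
    sample = set(sample_ids)
    p_base = len(sample) / len(population)
    f_set = Counter(f for fp in population.values() for f in fp)
    f_sampled = Counter(f for cid in sample for f in population[cid])
    # P_corr / P_base simplifies to (F_sampled + 1) / (F_set * P_base + 1).
    return {f: -math.log((f_sampled[f] + 1) / (f_set[f] * p_base + 1))
            for f in f_set}


def subset_score(population: Dict[str, Fingerprint],
                 training_ids: Iterable[str],
                 subset_ids: Iterable[str]) -> float:
    """Sum of compound scores, each being the sum of its feature scores."""
    subset = list(subset_ids)
    scores = feature_scores(population, set(training_ids) | set(subset))
    return sum(scores[f] for cid in subset for f in population[cid])


# Toy population: "a" and "b" share all features, as do "c"/"d" and "e"/"f".
population = {
    "a": frozenset({0, 1}), "b": frozenset({0, 1}),
    "c": frozenset({2, 3}), "d": frozenset({2, 3}),
    "e": frozenset({4}), "f": frozenset({4}),
}
similar_pair = subset_score(population, ["e"], ["a", "b"])
diverse_pair = subset_score(population, ["e"], ["a", "c"])
```

With this toy data the subset containing two similar compounds scores lower than the subset containing two dissimilar compounds, because the shared features are sampled twice. This is why the feature scores are recomputed for every candidate subset instead of being cached per compound.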
Such iteration of the subset may be performed in any suitable manner. For example, one or more compounds within the subset may be replaced or substituted with one or more new compounds from the population that are not within the training set. In one example, the method may include identifying one or more compounds to be replaced from within the initially selected subset based on the respective determined scores for the compounds within the initially selected subset. Optionally, the one or more compounds with the lowest determined scores within the initially selected subset are identified for replacement.
At a given design cycle of the drug discovery project, the subset may be iterated until a termination condition is met, where the termination condition may be one of the conditions described above. Where a plurality of subsets are generated per iteration when maximizing the coverage score, the method may include identifying one of the plurality of selected subsets in the iteration that satisfies the termination condition, based on the determined scores for the respective selected subsets. The identified subset may be the subset having the highest score among the plurality of subsets in that iteration. The compounds of the identified subset can then be synthesized.
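One possible replacement step, in which the lowest-scoring compound of a subset is swapped for a compound not yet in the subset, might look like the sketch below. The function name `replace_lowest`, the random choice of the replacement, and the `pool` argument (assumed to hold the population compounds that are not in the training set) are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence


def replace_lowest(subset: Sequence[str],
                   compound_score: Callable[[str], float],
                   pool: Sequence[str],
                   rng: random.Random) -> List[str]:
    """Swap the lowest-scoring member of `subset` for an unused compound from `pool`."""
    worst = min(subset, key=compound_score)
    unused = [c for c in pool if c not in subset]
    if not unused:
        return list(subset)  # nothing available to swap in
    replacement = rng.choice(unused)
    return [replacement if c == worst else c for c in subset]
```

Because a swap changes the scores of the remaining compounds, the new subset should be rescored as a whole after each replacement and kept only if its subset score improves.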
The above describes optimizing a subset of compounds for coverage score, i.e., a purely exploratory strategy. However, a degree of exploitation may be incorporated into the subset selection. For this purpose, an activity model is defined which predicts the activity of a compound. For example, a Bayesian model or regression model may be used for this purpose. The activity of a compound can be defined with reference to its half maximal inhibitory concentration (IC50). For example, a compound may be simply classified as active or inactive depending on whether its IC50 value is above or below a threshold level of activity. Alternatively, a specified number of compounds with the highest IC50 values may be classified as active compounds, with the remainder being inactive compounds. The activity score of a selected subset, e.g., a Bayesian model score obtained from the activity model's scores for each compound within the subset, is then balanced against the (subset) coverage score of the selected subset to obtain a subset with the desired combination of exploration and exploitation. Likewise, evolutionary or genetic algorithms can be used to optimize subsets according to a desired mix of exploration and exploitation. In particular, where multiple subsets are generated in parallel, each individual subset may be optimized for a given design cycle based on a different balance of exploration and exploitation. After sufficient iterations of the evolutionary or genetic algorithm in a given design cycle, a Pareto front of optimized subsets will emerge, ranging from the subset with the highest exploration weight to the subset with the highest exploitation weight. The particular subset with the desired balance of exploitation (higher model score at the expense of coverage score) and exploration (higher coverage score at the expense of model score) can then be selected as needed, for example, for synthesis.
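The Pareto front mentioned above can be extracted from a set of candidate subsets once each has a coverage score and an activity-model score. The sketch below assumes both scores are to be maximized; the function name `pareto_front` and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Indices of (coverage, activity) pairs that no other pair dominates."""
    front = []
    for i, (cov_i, act_i) in enumerate(points):
        dominated = any(
            cov_j >= cov_i and act_j >= act_i and (cov_j > cov_i or act_j > act_i)
            for j, (cov_j, act_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

A subset on the front with a high coverage score and a low model score corresponds to the exploratory end of the trade-off, and one with a high model score and a low coverage score to the opposite end; a project team could pick any point in between.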
The training set of compounds is used to train a machine learning (ML) model that is used to predict or determine compounds in the population that are more likely to exhibit a desired property relative to a target. In particular, the invention may include defining a machine learning model for modeling one or more biological properties of compounds in the population based on one or more structural features of those compounds. The ML model may be a Bayesian optimization model, a regression model, a clustering model, a decision tree model, a random forest model, a neural network model, or any other suitable type of ML model. The training set of compounds may then be used to train the ML model. The number of compounds within the training set will increase during each iteration or design cycle of the drug discovery project. The training phase of the ML model may be performed each time one or more compounds are added to the training set. As the number of compounds within the training set increases, a better trained model can be obtained, i.e., a model that can more accurately predict which compounds have the desired characteristics required for a particular project, e.g., a high activity level. In particular, when at least some of the compounds added to the training set are selected using the above-described exploration method, an ML model is obtained that trains in a shorter time and/or provides more accurate predictions. The ML model may be executed to predict one or more compounds in the population that have one or more desired biological properties. The ML model may be executed after each design cycle or iteration, or only after the model has been trained to a certain level. It should be noted that during a particular design cycle of a project, the ML model may select one or more compounds to be synthesized and tested in a given iteration as part of an exploitation strategy. For example, at an early stage of the project, i.e. when relatively few iterations have been performed, only a few of the compounds in the subset selected for synthesis may be chosen using the ML model, since the model may not be particularly well trained at this time, with the remainder of the subset being selected by the exploration strategy described above in order to refine the ML model. However, at a later stage of the project, once the ML model is trained to a better level, the ML model may select most or all of the compounds within the subset for synthesis. These compounds can then be synthesized to provide candidate drug compounds having the desired biological, physiological, or pharmacological activity.
Fig. 6 summarizes the steps of a method 60 for computational drug design in accordance with the present invention. In step 61, a population 50 of a plurality of compounds is defined, wherein each compound has one or more structural features, for example described as fingerprint features. In step 62, a training set 51 of compounds from population 50 is defined, wherein one or more biological properties of the compounds in training set 51 are known. In step 63, a subset 52 of one or more compounds is selected from the population 50, wherein the one or more compounds in the subset 52 are not yet within the training set. In step 64, a coverage score for the selected subset 52 is determined from the structural features of the one or more compounds in the selected subset 52, and the selected subset 52 is evaluated or analyzed based on the determined subset score. The subset score is determined from the frequency of each structural feature in the population 50 and the frequency of each structural feature in the sample set 51, 52, wherein the sample set 51, 52 comprises the training set 51 and the selected subset 52. The selection and evaluation of the subset of compounds may be part of an iterative process that continues, for example, until a predetermined condition is met, such as the selected subset having a sufficiently high score.
The methods of the present invention can be implemented on any suitable computing device, for example, by one or more functional units or modules implemented on one or more computer processors. Such functional units may be provided by suitable software running on any suitable computing substrate using a generic or custom processor and memory. One or more of the functional units may use a common computing substrate (e.g., they may run on the same server) or separate substrates, or one or more of the substrates themselves may be distributed among multiple computing devices. The computer memory may store instructions for performing the method, and the processor may execute the stored instructions to perform the method.
Many modifications may be made to the examples described above without departing from the spirit and scope of the invention, which is defined herein with particular reference to the appended claims and clauses.
Embodiments of the present invention are advantageous in that they provide a more efficient method of identifying compounds or molecules that are optimized for a target as part of a drug discovery project. In particular, the present invention provides an improved technique for identifying the most representative molecules within a population or collection, which are therefore best suited for training a machine learning model that will be used to predict one or more molecules of the population that exhibit a particular desired characteristic for a particular project. The present invention advantageously uses information theory to select molecules having structural features that provide the greatest amount of information about the population of molecules. For example, by focusing on molecules that have features that are not "oversampled" relative to their prevalence in the population, but that provide a relatively high level of information content, it may be easier to determine whether a particular feature contributes to, or is associated with, one or more desired characteristics exhibited by certain molecules. Thus, examples of the present invention can advantageously be considered as a compromise between maximizing the information, or Shannon entropy, of a subset of molecules, the frequency with which features of these molecules have already been selected/tested (i.e., not repeatedly asking the same question), and how many features each molecule has (i.e., not asking too many questions at once). That is, the present invention provides a way to identify which features of a molecule are important without over-sampling those features, as over-sampling is equivalent to asking the same (good) question more than once. Thus, the number of iterations or design cycles required to obtain clinical candidate molecules may be advantageously reduced using the methods of the present invention, thereby saving time and/or cost. 
The methods of the invention can also reduce the number of compounds that must be selected, synthesized, and tested in order to generate a training set and obtain one or more suitable clinical candidates. In this way, the method of the present invention uses active learning or machine learning to optimize the drug.
Unlike some other approaches for implementing exploration strategies, the present invention does not rely on clustering "similar" molecules in the (unequal) chemical space in an attempt to select a diverse range of molecules based on a certain distance metric. In contrast, the present invention advantageously provides a metric for optimizing the coverage of the information provided by the selected subset of molecules, i.e. it provides a mechanism for identifying the best questions to ask. While some clustering methods will pick outliers from the population, examples of the invention will focus on differences within the chemical series under test. It is also an advantage of the present invention that the described method is applicable to any population or group of molecules and allows for variable R groups; it is not limited to modifying molecules while maintaining a static core, as is the case for some other known methods.
In the above description, an index (i.e. "coverage score") is defined that provides an indication of the coverage of the extracted information about structural features (fragments) in the chemical space of the compound population. For example, during the exploration phase of the project, extracting information with extensive coverage for the population facilitates reducing the time or number of data points to fully train the ML model describing or predicting the structure-activity relationship. In different examples of the invention, it may be desirable to obtain an indication of the coverage of the extracted information about the characteristics or parameters of the compound population other than the structural characteristics of the compound. The above examples focus on applying the coverage score indicator to structural features (fragments) present in the population; alternatively or additionally, however, the coverage score indicator may be applied to other molecular characteristics, such as chemical or physical characteristics of the compounds in the population. In particular, the coverage score indicator may be used in association with a plurality of different molecular characteristics of compounds in a population in order to construct a better ML model for describing a possible relationship between such molecular characteristics and the activity of compounds in a population having said molecular characteristics.
In one example, when a compound binds to or otherwise interacts with a target molecule, it may be desirable to determine coverage information regarding the type of interaction that the compound in the population exhibits (or is expected/predicted to exhibit). In a manner corresponding to the above example, in which it is not desirable to sample too many compounds in the population that have the same structural features during the exploration phase of the project, in this example, it is not desirable to sample too many compounds in the population that have the same interactions with the target molecules during the exploration phase of the project. For example, sampling compounds that exhibit broad binding interactions may help reduce the time or number of data points to fully train the ML model that describes or predicts the interaction-activity relationship.
In order to apply the coverage scoring method described above to a particular molecular property of a population, it may be desirable to represent the molecular property in an appropriate form for analysis. In the above examples, the structural features of the compounds in the population are represented as respective fingerprints, i.e. a list of numbers or one-dimensional vectors. In particular, in the above examples, each compound is represented as a list of binary numbers, wherein a 1 or 0 at each entry of the list indicates whether a particular structural feature (fragment) is present in the corresponding compound. In examples where different types of interactions are molecular characteristics of the population under consideration, this information may similarly be represented in fingerprint form on a single compound basis (as described below), allowing coverage score indicators to be applied in a manner corresponding to examples related to structural features of the compounds described above.
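The binary fingerprint representation described above can be sketched in a few lines. The feature names used in the usage note are placeholders, and the helper names `to_bit_vector` and `feature_frequencies` are hypothetical; in a real project the features would typically come from a cheminformatics toolkit.

```python
from typing import Iterable, List, Sequence


def to_bit_vector(features_present: Iterable[str],
                  feature_order: Sequence[str]) -> List[int]:
    """1 at each position whose feature is present in the compound, else 0."""
    present = set(features_present)
    return [1 if name in present else 0 for name in feature_order]


def feature_frequencies(fingerprints: Sequence[Sequence[int]]) -> List[int]:
    """Number of compounds in which each feature (bit position) is set."""
    return [sum(column) for column in zip(*fingerprints)]
```

For instance, with the placeholder feature order `["amide", "phenyl", "sulfonyl"]`, a compound containing only a phenyl fragment maps to `[0, 1, 0]`. The column sums over a population or a sampling set give the feature frequencies from which the coverage score is computed.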
To analyze different types of interactions in a population using the coverage scoring methods of the present disclosure, it is first necessary to obtain interaction data indicative of the different types of interactions exhibited by the different compounds in the population. One method of obtaining such data includes the application of molecular docking methods. Docking is a method of predicting the conformation of a ligand at a target binding site in order to provide an accurate model of the alignment of the ligand in the binding pocket. In other words, docking provides a prediction of the preferred orientation and conformation of a compound or molecule relative to another (target) molecule when the compounds or molecules bind to each other to form a stable complex. Thus, docking may be considered an optimization problem to describe the "best match" of a ligand binding to a particular target protein, where both the ligand and the protein are flexible. In some cases, some or all of the interaction data may be obtained in a different manner. For example, interaction data for certain compounds may be obtained from experimental results or other sources.
For docking, a three-dimensional image or description of the target may be generated to simulate how different compounds in the population fit into the binding pocket of the target. For each compound in the population, a plurality of docking poses may be generated, each corresponding to a snapshot of the orientation and conformation of the ligand-protein pair. Poses may be scored to determine whether a particular pose is likely to represent a favorable binding interaction. Different methods for generating and scoring docking poses as part of the docking process are known in the art. The docked compound may have three-dimensional coordinates in the reference frame of the receptor.
Thus, three-dimensional binding interaction information for different ligand-protein complexes can be obtained through the docking process. The three-dimensional information may then be converted into a one-dimensional binary string, i.e. a fingerprint, to allow the application of the coverage scoring method. These fingerprints may be referred to as interaction fingerprints or protein-ligand interaction fingerprints (PLIF). In a manner corresponding to the molecular fingerprints described above, each site of the interaction fingerprint may indicate the presence or absence of a particular binding interaction when the relevant compound binds to the target molecule of interest in the particular drug discovery project being undertaken. Target molecules, such as proteins, may be in vivo molecules that are identified as being intrinsically associated with a particular target disease and may be targeted by a drug (e.g., a compound from a population) to produce a therapeutic effect. Thus, an interaction fingerprint is a way of describing how a given compound interacts with a receptor, i.e., how the interaction occurs and with which residues.
An interaction fingerprint may be defined to include any desired number and combination of specific interactions that may be exhibited when a compound in the population binds to a predetermined target molecule. The specific interactions contained in the fingerprint may include one or more of the following: hydrogen bonding interactions, weak hydrogen bonding interactions, ionic interactions, hydrophobic interactions, face-to-face aromatic interactions, edge-to-face aromatic interactions, π-cation interactions, and metal complexation interactions.
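A simple way to turn the contacts observed in a docking pose into an interaction fingerprint is to reserve one bit per (residue, interaction type) pair. The sketch below assumes the contacts have already been extracted by some docking or analysis tool; the residue labels, the type identifiers in `INTERACTION_TYPES` and the function name are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical identifiers for the interaction types listed above.
INTERACTION_TYPES = [
    "hydrogen_bond", "weak_hydrogen_bond", "ionic", "hydrophobic",
    "aromatic_face_to_face", "aromatic_edge_to_face", "pi_cation", "metal_complex",
]


def interaction_fingerprint(contacts: Iterable[Tuple[str, str]],
                            residues: Sequence[str]) -> List[int]:
    """One bit per (residue, interaction type); 1 if the pose shows that contact."""
    index = {
        (residue, kind): position
        for position, (residue, kind) in enumerate(
            (r, k) for r in residues for k in INTERACTION_TYPES
        )
    }
    bits = [0] * len(index)
    for contact in contacts:
        if contact in index:  # contacts outside the chosen residues are ignored
            bits[index[contact]] = 1
    return bits
```

The resulting bit list has the same form as a structural fingerprint, so the same frequency-based scoring can be applied to it without modification.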
Once the interaction fingerprint is generated, coverage score selection may be used in a manner corresponding to the method described above for structural features to select multiple compounds from a population having different interaction sets (i.e., different fingerprint sets). Note that the calculation of the "feature score" in the above example may be referred to as an "interaction" score (and may be generally referred to as a molecular property score) in this example.
Fig. 7 illustrates examples of the different compounds selected from a population when coverage score selection is applied to structural features and to interaction types in the population. Specifically, Fig. 7 illustrates the case of selecting a subset of 20 compounds from a population of 2258 compounds using coverage scores. Fig. 7 (a) shows the population of compounds in the interaction space and two selected subsets (one based on PLIF, one based on ECFP4 fingerprints). Fig. 7 (b) shows the same population and selected subsets as Fig. 7 (a), but in the chemical structure space. The subsets are selected by iterating on the coverage score until a termination condition is met.
Figs. 8 (a) and 8 (b) illustrate the same PLIF-based coverage score subset as in Figs. 7 (a) and 7 (b), plotted in interaction space and chemical structure space, respectively. Unlike Fig. 7, in Fig. 8 the PLIF coverage score selection is compared to a subset randomly selected from the population and to a subset chosen by an alternative selection method (i.e., a diversity selection method) applied to PLIF.
In addition to the above description, the three-dimensional description of the compound may alternatively or additionally be used to generate a fingerprint to which coverage score selection may be applied. For example, a compound may be described as a three-dimensional pharmacophore or three-dimensional shape that is converted to a fingerprint.
The steps of the computer-implemented method shown in fig. 6 may be generalized to be suitable for analyzing different molecular characteristics present in a defined population of compounds in accordance with the present invention. Each compound in the population has one or more molecular characteristics associated therewith. As described above, these may include structural features, the type of interaction exhibited when the corresponding compound binds to a predetermined target molecule, or other suitable molecular characteristics. In consideration of the type of interaction, it may be necessary to obtain this information by performing a molecular docking process as described above to obtain a predicted binding interaction of the corresponding compound upon interaction with the predetermined target molecule. Regardless of the particular molecular characteristics under consideration, a training set of compounds is defined to include compounds whose biological characteristics are known, and a subset of compounds that are not within the training set are selected from the population. Then, a coverage score for the selected subset is determined based on the particular molecular characteristics considered in the compounds of the selected subset, and the selected subset is evaluated based on the determined subset score. A subset score is determined from the frequency of each molecular property considered in the population and the frequency of each molecular property considered within a sampling set, wherein the sampling set comprises a training set and the selected subset.
It should be noted that the fingerprint of a particular compound may be defined to include information about more than one molecular property of the compound. For example, a first set of sites of a fingerprint may relate to structural features present in a compound, and a subsequent set of sites following the first set may relate to the types of interaction exhibited when the compound binds to a predetermined target molecule. The coverage score selection may be based on some or all of the information included in the fingerprints of the compounds in the population.
Other aspects and embodiments of the disclosure are set forth in the following clauses.
Clauses
1. A method for computational drug design, comprising:
defining a population of a plurality of compounds, each compound having one or more molecular characteristics;
defining a training set of compounds from the population for which one or more biological properties are known;
selecting a subset of one or more compounds from the population that are not within the training set; and,
determining a subset score for the selected subset based on molecular characteristics of one or more compounds within the selected subset, and evaluating the selected subset based on the determined subset score,
wherein the subset score is determined from the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising the training set and the selected subset.
2. The method of clause 1, wherein the determining step comprises determining a compound score for each of the one or more compounds of the selected subset based on the one or more molecular characteristics of the compounds, and wherein the subset score is determined based on the determined compound score for each compound within the selected subset.
3. The method of clause 2, wherein the subset score is determined as the sum of the corresponding compound scores of the compounds within the selected subset.
4. The method of clause 2 or clause 3, wherein determining the compound score for a compound within the selected subset comprises determining a molecular property score for each of the one or more molecular properties of the compound, each molecular property score being determined based on the frequency of the corresponding molecular property in the population and the frequency of the corresponding molecular property in the sample set.
5. The method of clause 4, wherein the compound score for the compound is determined as the sum of the determined molecular property scores for one or more molecular properties of the compound.
6. The method of clause 4 or clause 5, wherein the molecular property score for each of the one or more molecular properties is determined from a normalized probability of the molecular property within the sample set, the normalized probability being determined from the frequency of the molecular property within the population and the sample set.
7. The method of clause 6, wherein the normalized probability is determined from the number of compounds within the sample set relative to the number of compounds in the population.
8. The method of clause 7, wherein the normalized probability is a Laplace-corrected normalized probability.
9. The method of clause 8, wherein the Laplace-corrected normalized probability P_corr is given by
[equation missing from source]
wherein F_sampled is the frequency of the molecular property in the sample set, F_set is the frequency of the molecular property in the population, and P_base is the number of compounds in the sample set divided by the number of compounds in the population.
10. The method of any of clauses 4-9, wherein the molecular property score for each of the one or more molecular properties is determined based on the number of compounds within the sample set in which the molecular property is present relative to the number of compounds within the sample set.
11. The method of clause 10, wherein the molecular property score is determined from a normalized Shannon entropy value of the molecular property within the sample set.
12. The method of clause 11, wherein the normalized Shannon entropy value is given by
[equation missing from source]
where f is the number of compounds in the sample set in which the molecular property is present divided by the number of compounds in the sample set.
13. The method of clause 12, wherein the molecular property score Cov_final is given by
[equation missing from source; the surviving fragment reads "and f is greater than 0.5"]
wherein
Cov = -ln(P_corr / P_base)
14. The method of any preceding clause, wherein the subset comprises a specified number of compounds.
15. The method of clause 14, wherein the method comprises defining the number of compounds to be selected within the subset.
16. The method of any preceding clause, wherein the step of evaluating comprises determining whether the subset score meets a specified condition.
17. The method of clause 16, wherein the specified condition is that the subset score is greater than a specified minimum threshold score.
18. The method of clause 16 or clause 17, wherein if the specified condition is met, the method comprises synthesizing at least some of the compounds within the selected subset to determine one or more biological properties of the compounds.
19. The method of clause 18, comprising adding the synthesized compounds to the training set.
20. The method of any preceding clause, wherein the selected subset is an initially selected subset, and the method comprises:
selecting a second subset, different from the initially selected subset, comprising one or more compounds from the population that are not within the training set; and,
determining a subset score for the selected second subset, and evaluating the selected second subset based on the determined score.
21. The method of clause 20 when dependent on clause 16, wherein the steps of selecting the second subset and determining its score are performed if the specified condition is not met.
22. The method of clause 20 or 21, wherein selecting the second subset comprises replacing one or more compounds within the initially selected subset with one or more new compounds from the population that are not within the training set.
23. The method of clause 22 when dependent on clause 2, comprising identifying one or more compounds within the initially selected subset to be replaced based on the respective determined compound scores for the one or more compounds within the initially selected subset.
24. The method of clause 23, wherein the one or more compounds having the lowest determined compound score within the initially selected subset are identified for replacement.
25. The method of any of clauses 20-24, comprising iteratively performing the steps of:
selecting a new subset, different from the subset selected in the previous iteration, comprising one or more compounds from the population that are not within the training set; and,
determining a subset score for the selected new subset, and evaluating the selected new subset based on the determined score,
until the termination condition is met.
26. The method of clause 25, wherein the termination condition includes at least one of: the maximum number of iterations has been performed; the subset score of the selected subset in one iteration meets a specified condition; and, the difference between the respective subset scores of the selected subsets at successive iterations is less than a prescribed difference threshold.
27. The method of clause 25 or 26, comprising synthesizing a selected subset of compounds in an iteration that satisfies a termination condition to determine one or more biological properties of the compounds.
28. The method of any of clauses 24-27, comprising selecting a plurality of new subsets in each iteration, identifying one of the plurality of selected subsets in an iteration that satisfies a termination condition based on the determined subset scores of the respective plurality of selected subsets, and synthesizing compounds of the one identified subset to determine one or more biological properties of the compounds.
29. The method of clause 28, wherein the identified subset is the subset having the highest subset score among the plurality of subsets at the iteration in which the termination condition is satisfied.
30. A method according to any preceding claim, wherein the selected subset is a first subset, and the method comprises: selecting a plurality of subsets from a population not within the training set, each subset comprising a plurality of compounds; determining a subset score for each subset; and selecting a first subset from among the plurality of subsets based on the determined subset scores for the respective subsets.
31. The method of clause 30, wherein the first subset is selected as the subset having the highest subset score among the plurality of subsets.
32. The method of clause 30 or clause 31, wherein the plurality of subsets each have the same number of compounds.
33. The method of any one of the preceding claims, wherein the step of evaluating comprises evaluating the selected subset based on activity scores of the selected subset obtained from an activity model for predicting the activity level of the compound in the population.
34. The method of clause 33, wherein the evaluating step comprises evaluating the selected subset based on the determined subset score and an activity score relative to a desired balance of the scores.
35. The method of clause 33 or clause 34 when dependent on clause 28, wherein the plurality of new subsets each comprise a different balance between the determined subset score and the activity score.
36. The method of clause 35, wherein the plurality of new subsets form a Pareto front of the determined subset scores and activity scores at the iteration that satisfies the termination condition.
37. A method according to any preceding claim, wherein the training set is initially empty.
38. The method of any preceding clause, wherein the molecular characteristics of each of the plurality of compounds in the population comprise structural features of the compound.
39. The method of any one of the preceding claims, wherein the structural feature of each of the plurality of compounds in the population corresponds to a fragment present in the compound.
40. The method of clause 39, wherein the fragments present in each of the plurality of compounds are represented as molecular fingerprints.
41. The method of clause 40, wherein the molecular fingerprint is an Extended Connectivity Fingerprint (ECFP), optionally ECFP0, ECFP2, ECFP4, ECFP6, ECFP8, ECFP10, or ECFP12.
42a. The method of any one of the preceding clauses, wherein the molecular property of each of the plurality of compounds in the population comprises a chemical property of the compound.
42b. The method of any one of the preceding clauses, wherein the molecular characteristics of each of the plurality of compounds in the population comprise structural features and chemical characteristics of the compound.
43. The method of clause 42a or clause 42b, wherein the chemical property corresponds to the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
44. The method of clause 43, wherein the chemical properties of at least some of the compounds in the population correspond to predictions of the type of interactions exhibited when the corresponding compounds bind to the predetermined target molecule.
45. The method of clause 44, wherein the predicting comprises predicting which of the one or more predetermined types of interactions are exhibited when the corresponding compound binds to the predetermined target molecule.
46. The method of clause 44 or clause 45, comprising obtaining a prediction of the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
47. The method of clause 46, wherein obtaining the prediction of each compound comprises:
generating a three-dimensional image of the compound; and
performing a docking process using the generated three-dimensional image to predict a preferred docking pose when the compound is bound to a predetermined target molecule,
wherein the type of interaction exhibited is predicted based on the results of the docking procedure.
48. The method of any of clauses 43-47, wherein the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule is represented as an interaction fingerprint; optionally expressed as a protein-ligand interaction fingerprint (PLIF).
49. The method of any of clauses 43-48, wherein the type of interaction comprises one or more of: hydrogen bonding interactions, weak hydrogen bonding interactions, ionic interactions, hydrophobic interactions, face-to-face aromatic interactions, edge-to-face aromatic interactions, pi-cation interactions, and metal complexation interactions.
50. The method of any of clauses 43-49, wherein each compound in the population is a ligand and the predetermined target molecule is a protein.
51. The method of any preceding claim, wherein the one or more biological characteristics comprise one or more of: activity, selectivity, toxicity, absorption, distribution, metabolism, and excretion.
52. A method according to any one of the preceding claims, wherein one or more biological properties are defined relative to the respective desired biological properties.
53. A method according to any preceding claim, comprising:
defining a machine learning model for modeling one or more biological properties of a compound in the population based on one or more molecular properties of the compound; and
training the machine learning model using the training set of compounds.
54. The method of clause 53, wherein the method comprises performing a training step each time one or more compounds are added to the training set.
55. The method of clause 53 or clause 54, wherein the machine learning model is at least one of the following models: Bayesian optimization models, regression models, clustering models, decision tree models, random forest models, and neural network models.
56. The method of any of clauses 53-55, comprising executing the machine learning model after the training step to predict one or more compounds in the population having one or more desired biological properties.
57. The method of clause 56, further comprising synthesizing at least one of the one or more predicted compounds.
58. The method of clause 56 or clause 57, wherein the one or more predicted compounds are candidate drugs or therapeutic molecules having the desired biological, biochemical, physiological, and/or pharmacological activity against the predetermined target molecule.
59. The method of clause 58, wherein the predetermined target molecule is an in vitro and/or in vivo therapeutic, diagnostic, or experimentally determined target.
60. The method of clause 58 or clause 59, wherein the candidate drug or therapeutic molecule is used in medicine; for example, for treating animals such as humans or non-human animals.
61. A compound identified by the method of any one of the preceding claims.
62. A non-transitory computer-readable storage medium storing instructions thereon, which when executed by a computer processor, cause the computer processor to perform the method of any of clauses 1-60.
63. A computing device for computing a drug design, comprising:
an input configured to receive data indicative of a population of a plurality of compounds, each compound having one or more molecular characteristics, and to receive data indicative of a training set of compounds from the population for which one or more biological characteristics are known;
a processor configured to select a subset of one or more compounds from the population that are not within the training set, determine a subset score for the selected subset based on molecular characteristics of the one or more compounds within the selected subset, and evaluate the selected subset based on the determined subset score; and
an output configured to output the evaluation result,
wherein the subset score is determined from the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising a training set and a selected subset.
64. The computing device of clause 63, wherein the processor is configured to perform the method of any of clauses 1-60.
65. A computer-implemented method for drug design, comprising:
defining a population of a plurality of compounds;
for each of a plurality of compounds, obtaining interaction data indicative of the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule;
defining a training set of compounds from the population for which one or more biological properties are known;
selecting a subset of one or more compounds from the population that are not within the training set; and
determining a subset score for the selected subset based on the obtained interaction data for the one or more compounds within the selected subset, and evaluating the selected subset based on the determined subset score,
wherein the subset score is determined based on the frequency of the interaction types in the population and the frequency of the interaction types within a sample set comprising the training set and the selected subset.
66. The method of clause 65, wherein the interaction data of at least some of the plurality of compounds in the obtained population is predicted data indicative of the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
67. The method of clause 66, wherein the predicting comprises predicting which of the one or more predetermined types of interactions are exhibited when the corresponding compound binds to the predetermined target molecule.
68. The method of clause 66 or clause 67, comprising obtaining a prediction of the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
69. The method according to clause 68, wherein obtaining the prediction of each compound comprises:
generating a three-dimensional image of the compound; and
performing a docking process using the generated three-dimensional image to predict a preferred docking pose when the compound is bound to a predetermined target molecule,
wherein the type of interaction exhibited is predicted based on the results of the docking procedure.
70. The method of any of clauses 65-69, wherein the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule is represented as an interaction fingerprint; optionally expressed as a protein-ligand interaction fingerprint (PLIF).
71. The method of any one of clauses 65-70, wherein the type of interaction comprises one or more of: hydrogen bonding interactions, weak hydrogen bonding interactions, ionic interactions, hydrophobic interactions, face-to-face aromatic interactions, edge-to-face aromatic interactions, pi-cation interactions, and metal complexation interactions.
72. The method of any one of clauses 65-71, wherein each compound in the population is a ligand and the predetermined target molecule is a protein.
73. The method of any of clauses 65-72, wherein the determining step comprises determining a compound score for each of the one or more compounds of the selected subset according to one or more types of interactions in the interaction data of the compounds, and wherein the subset score is determined based on the determined compound score for each compound within the selected subset.
74. The method of clause 73, wherein the subset score is determined as the sum of the corresponding compound scores of the compounds within the selected subset.
75. The method of clause 73 or clause 74, wherein determining the compound score for a compound within the selected subset comprises determining an interaction score for each of one or more interaction types in the interaction data for the compound as a function of the frequency of the corresponding interaction type in the population and the frequency of the corresponding interaction type in the sample set, the compound score for the compound being based on the determined score for the one or more interaction types in the interaction data for the compound.
76. The method of clause 75, wherein the compound score for the compound is determined as the sum of the determined interaction scores for one or more interaction types in the interaction data for the compound.
77. The method of clause 75 or clause 76, wherein the interaction score for each of the one or more interaction types in the interaction data is determined from a normalized probability of the interaction type within the sample set, the normalized probability being determined from the frequency of the interaction type within the population and within the sample set.
78. The method of clause 77, wherein the normalized probability is determined based on the number of compounds in the sample set relative to the number of compounds in the population.
79. The method of clause 78, wherein the normalized probability is a Laplace-corrected normalized probability.
80. The method of clause 79, wherein the Laplace-corrected normalized probability $P_{\mathrm{corr}}$ is given by
[equation omitted in source]
where $F_{\mathrm{sampled}}$ is the frequency of the interaction type in the sample set, $F_{\mathrm{set}}$ is the frequency of the interaction type in the population, and $P_{\mathrm{base}}$ is the number of compounds in the sample set divided by the number of compounds in the population.
81. The method of any of clauses 75-80, wherein the interaction score for each of the one or more interaction types in the interaction data is determined from the number of compounds of the interaction type present in the sample set relative to the number of compounds in the sample set.
82. The method of clause 81, wherein the interaction score is determined from a normalized Shannon entropy value of the interaction type within the sample set.
83. The method according to clause 82, wherein the normalized Shannon entropy value is given by
[equation omitted in source]
where $f$ is the number of compounds in the sample set in which the molecular property is present divided by the number of compounds in the sample set.
84. The method of clause 83, wherein the interaction score $\mathrm{Cov}_{\mathrm{final}}$ is given by
[piecewise equation omitted in source; the surviving condition is $f$ greater than 0.5]
where
$\mathrm{Cov} = -\ln(P_{\mathrm{corr}} / P_{\mathrm{base}})$
85. The method of any of clauses 65-84, wherein the subset comprises a specified number of compounds.
86. The method of clause 85, wherein the method comprises defining the number of compounds to be selected within the subset.
87. The method of any of clauses 65-86, wherein the evaluating step comprises determining whether the subset score meets a specified condition.
88. The method of clause 87, wherein the specified condition is that the subset score is greater than a specified minimum threshold score.
89. The method of clause 87 or clause 88, wherein if the specified condition is met, the method comprises synthesizing at least some of the compounds within the selected subset to determine one or more biological properties of the compounds.
90. The method of clause 89, comprising adding the synthesized compounds to the training set.
91. The method of any of clauses 65-90, wherein the selected subset is an initial selected subset, and the method comprises:
selecting a second subset, different from the initially selected subset, comprising one or more compounds from the population that are not within the training set; and
determining a subset score for the selected second subset, and evaluating the selected second subset based on the determined score.
92. The method of clause 91 when dependent on clause 87, wherein the steps of selecting the second subset and determining its score are performed if the specified condition is not met.
93. The method of clause 91 or 92, wherein selecting the second subset comprises replacing one or more compounds within the initially selected subset with one or more new compounds from the population that are not within the training set.
94. The method of clause 93 when dependent on clause 73, comprising identifying one or more compounds within the initially selected subset to be replaced based on the respective determined compound scores for the one or more compounds within the initially selected subset.
95. The method of clause 94, wherein the one or more compounds having the lowest determined compound score within the initially selected subset are identified for replacement.
96. The method of any of clauses 91-95, comprising iteratively performing the steps of:
selecting a new subset, different from the subset selected in the previous iteration, comprising one or more compounds from the population that are not within the training set; and
determining a subset score for the selected new subset, and evaluating the selected new subset based on the determined score,
until the termination condition is met.
97. The method of clause 96, wherein the termination conditions include at least one of: the maximum number of iterations has been performed; the subset score of the selected subset in one iteration meets a specified condition; and, the difference between the respective subset scores of the selected subsets at successive iterations is less than a prescribed difference threshold.
98. The method of clause 96 or clause 97, comprising synthesizing the compounds of the selected subset in an iteration that satisfies the termination condition to determine one or more biological properties of the compounds.
99. The method of any of clauses 95-98, comprising selecting a plurality of new subsets in each iteration, identifying one of the plurality of selected subsets in an iteration that satisfies a termination condition based on the determined subset scores of the respective plurality of selected subsets, and synthesizing compounds of the one identified subset to determine one or more biological properties of the compounds.
100. The method of clause 99, wherein the identified subset is the subset having the highest subset score among the plurality of subsets at the iteration in which the termination condition is satisfied.
101. The method of any of clauses 65-100, wherein the selected subset is a first subset, and the method comprises: selecting a plurality of subsets from a population not within the training set, each subset comprising a plurality of compounds; determining a subset score for each subset; and selecting a first subset from among the plurality of subsets based on the determined subset scores for the respective subsets.
102. The method of clause 101, wherein the first subset is selected as the subset having the highest subset score among the plurality of subsets.
103. The method of clause 101 or clause 102, wherein the plurality of subsets each have the same number of compounds.
104. The method of any one of clauses 65-103, wherein the step of evaluating comprises evaluating the selected subset based on activity scores of the selected subset obtained from an activity model for predicting the activity level of the compound in the population.
105. The method of clause 104, wherein the evaluating step comprises evaluating the selected subset based on the determined subset score and an activity score relative to a desired balance of the scores.
106. The method of clause 104 or clause 105 when dependent on clause 101, wherein the plurality of new subsets each comprise a different balance between the determined subset score and the activity score.
107. The method of clause 106, wherein the plurality of new subsets form a Pareto front of the determined subset scores and activity scores at the iteration that satisfies the termination condition.
108. The method of any of clauses 65-107, wherein the training set is initially empty.
109. The method of any one of clauses 65-108, wherein the one or more biological characteristics comprise one or more of: activity, selectivity, toxicity, absorption, distribution, metabolism, and excretion.
110. The method of any one of clauses 65-109, wherein one or more biological properties are defined relative to the respective desired biological properties.
111. The method of any of clauses 65-110, comprising:
defining a machine learning model for modeling one or more biological properties of compounds in the population according to one or more interaction types in the obtained interaction data of the compounds; and
training the machine learning model using the training set of compounds.
112. The method of clause 111, wherein the method comprises performing a training step each time one or more compounds are added to the training set.
113. The method of clause 111 or clause 112, wherein the machine learning model is at least one of the following models: Bayesian optimization models, regression models, clustering models, decision tree models, random forest models, and neural network models.
114. The method of any of clauses 111-113, comprising executing the machine learning model after the training step to predict one or more compounds in the population having one or more desired biological properties.
115. The method of clause 114, further comprising synthesizing at least one of the one or more predicted compounds.
116. The method of clause 114 or clause 115, wherein the one or more predictive compounds are candidate drugs or therapeutic molecules having a desired biological, biochemical, physiological, and/or pharmacological activity against a predetermined target molecule.
117. The method of clause 116, wherein the predetermined target molecule is an in vitro and/or in vivo therapeutic, diagnostic, or experimentally determined target.
118. The method of clause 116 or clause 117, wherein the candidate drug or therapeutic molecule is used in medicine; for example, for treating animals such as humans or non-human animals.
119. A compound identified by the method of any one of clauses 65-118.
120. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a computer processor, cause the computer processor to perform the method of any of clauses 65-118.
121. A computing device for computing a drug design, comprising:
an input configured to receive:
population data indicative of a population of the plurality of compounds;
interaction data for each of the plurality of compounds, the interaction data being indicative of the type of interaction exhibited when the respective compound binds to a predetermined target molecule; and
training set data indicative of a training set of compounds from the population for which one or more biological characteristics are known;
a processor configured to select a subset of one or more compounds from the population that are not within the training set, determine a subset score for the selected subset according to a type of interaction in interaction data for the one or more compounds within the selected subset, and evaluate the selected subset based on the determined subset score; and
an output configured to output the evaluation result,
wherein the subset score is determined from the frequency of interaction types in the interaction data in the population and the frequency of interaction types in the interaction data within a sample set comprising a training set and a selected subset.
122. The computing device of clause 121, wherein the processor is configured to perform the method of any of clauses 65-118.
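
By way of illustration only, the scoring chain of clauses 74 to 84 (compound score as a sum of per-feature scores, subset score as a sum of compound scores, with Cov = -ln(P_corr / P_base)) might be sketched in Python as follows. The equation for P_corr is not reproduced in this text, so the form used here, (F_sampled + 1) / (F_set + 1 / P_base), is an assumption borrowed from the commonly used Laplacian-corrected estimator and may differ from the formula intended. The entropy weighting of clauses 82 to 84 is left out. The names `feature_counts`, `subset_score` and `features_of` are hypothetical.

```python
import math
from collections import Counter


def feature_counts(compounds, features_of):
    """Count, for each feature, how many of the given compounds exhibit it."""
    counts = Counter()
    for compound in compounds:
        counts.update(set(features_of[compound]))
    return counts


def subset_score(population, training, subset, features_of):
    """Return (subset score, per-compound scores) for a candidate subset.

    The sample set is the training set plus the candidate subset.
    """
    sample = set(training) | set(subset)
    p_base = len(sample) / len(population)  # P_base as defined in clause 80
    f_set = feature_counts(population, features_of)  # frequency in population
    f_sampled = feature_counts(sample, features_of)  # frequency in sample set

    def cov(feature):
        # ASSUMPTION: this Laplace-corrected form is not taken from the source.
        p_corr = (f_sampled[feature] + 1) / (f_set[feature] + 1 / p_base)
        # Cov = -ln(P_corr / P_base), as stated in clause 84.
        return -math.log(p_corr / p_base)

    # Compound score: sum of per-feature scores (clause 76).
    per_compound = {
        compound: sum(cov(f) for f in set(features_of[compound]))
        for compound in subset
    }
    # Subset score: sum of compound scores (clause 74).
    return sum(per_compound.values()), per_compound
```

Under this assumed form, a feature that is under-represented in the sample set relative to the population gives a ratio below one and therefore a positive score, so subsets that bring in unsampled features score higher.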

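The iterative selection of clauses 93 to 97 (replace the lowest-scoring compound, re-score, stop on a termination condition) can be sketched as below. This is a minimal sketch under stated assumptions, not the method as claimed: the replacement compound is drawn at random, a trial subset is kept only if it improves the subset score, and the "difference threshold" condition is applied only to accepted improvements. The names `optimise_subset`, `score_fn`, `max_iter` and `min_delta` are hypothetical, and `score_fn` is assumed to return a pair of the subset score and a dictionary of per-compound scores.

```python
import random


def optimise_subset(candidates, size, score_fn, max_iter=100, min_delta=1e-6, seed=0):
    """Greedy subset search; candidates must exclude training-set compounds."""
    rng = random.Random(seed)
    candidates = sorted(candidates)  # fixed order keeps runs reproducible
    subset = rng.sample(candidates, size)  # initially selected subset
    total, per_compound = score_fn(subset)

    for _ in range(max_iter):  # termination: maximum number of iterations
        pool = [c for c in candidates if c not in subset]
        if not pool:
            break
        # Identify the compound with the lowest compound score for replacement.
        worst = min(subset, key=lambda c: per_compound[c])
        # ASSUMPTION: the replacement is chosen at random from the unused pool.
        trial = [c for c in subset if c != worst] + [rng.choice(pool)]
        trial_total, trial_scores = score_fn(trial)
        gain = trial_total - total
        if gain > 0:  # ASSUMPTION: keep the new subset only if it scores higher
            subset, total, per_compound = trial, trial_total, trial_scores
            if gain < min_delta:  # termination: successive scores nearly equal
                break
    return subset, total
```

A threshold-based stop (the subset score exceeding a minimum value, as in clause 88) could be added as a further check inside the loop.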
Claims (65)

1. A method for computing a drug design, comprising:
defining a population of a plurality of compounds, each compound having one or more molecular characteristics;
defining a training set of compounds from the population for which one or more biological properties are known;
selecting a subset of one or more compounds from the population that are not within the training set; and
determining a subset score for the selected subset based on molecular characteristics of one or more compounds within the selected subset, and evaluating the selected subset based on the determined subset score,
wherein the subset score is determined from the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising the training set and the selected subset.
2. The method of claim 1, wherein the determining step comprises determining a compound score for each of the one or more compounds of the selected subset based on the one or more molecular characteristics of the compounds, and wherein the subset score is determined based on the determined compound score for each compound within the selected subset.
3. The method of claim 2, wherein the subset score is determined as a sum of corresponding compound scores for compounds within the selected subset.
4. A method according to claim 2 or claim 3, wherein determining the compound score for a compound within the selected subset comprises determining a molecular property score for each of one or more molecular properties of the compound as a function of the frequency of the corresponding molecular property in the population and the frequency of the corresponding molecular property in the sample set, the compound score for the compound being based on the determined scores for the one or more molecular properties of the compound.
5. The method of claim 4, wherein a compound score for the compound is determined as a sum of determined molecular property scores for one or more molecular properties of the compound.
6. The method of claim 4 or claim 5, wherein a molecular property score for each of the one or more molecular properties is determined from a normalized probability of the molecular property within the sample set, the normalized probability being determined from frequencies of the molecular property within the population and the sample set.
7. The method of claim 6, wherein the normalized probability is determined based on the number of compounds within the sample set relative to the number of compounds in the population.
8. The method of claim 7, wherein the normalized probability is a Laplace-corrected normalized probability.
9. The method of claim 8, wherein the Laplace-corrected normalized probability $P_{\mathrm{corr}}$ is given by
[equation omitted in source]
where $F_{\mathrm{sampled}}$ is the frequency of the molecular characteristic in the sample set, $F_{\mathrm{set}}$ is the frequency of the molecular characteristic in the population, and $P_{\mathrm{base}}$ is the number of compounds in the sample set divided by the number of compounds in the population.
10. The method of any one of claims 4-9, wherein the molecular property score for each of the one or more molecular properties is determined depending on the number of compounds within the sample set for which the molecular property is present relative to the number of compounds within the sample set.
11. The method of claim 10, wherein the molecular property score is determined from a normalized Shannon entropy value of the molecular property within the sample set.
12. The method of claim 11, wherein the normalized Shannon entropy value is given by
[equation omitted in source]
where $f$ is the number of compounds in the sample set in which the molecular property is present divided by the number of compounds in the sample set.
13. The method of claim 12, wherein the molecular property score $\mathrm{Cov}_{\mathrm{final}}$ is given by
[equation omitted in source]
where
$\mathrm{Cov} = -\ln(P_{\mathrm{corr}} / P_{\mathrm{base}})$
14. The method of any one of the preceding claims, wherein the subset comprises a specified number of compounds.
15. The method of claim 14, wherein the method comprises defining the number of compounds to be selected within the subset.
16. A method according to any preceding claim, wherein the step of evaluating comprises determining whether the subset score meets a specified condition.
17. The method of claim 16, wherein the specified condition is that the subset score is greater than a specified minimum threshold score.
18. The method of claim 16 or claim 17, wherein if a specified condition is met, the method comprises synthesizing at least some compounds within the selected subset to determine one or more biological properties of the compounds.
19. The method of claim 18, comprising adding the synthesized compounds to the training set.
20. The method of any of the preceding claims, wherein the selected subset is an initial selected subset, and the method comprises:
selecting a second subset, different from the initially selected subset, comprising one or more compounds from the population that are not within the training set; and
determining a subset score for the selected second subset, and evaluating the selected second subset based on the determined score.
21. A method according to claim 20 when dependent on claim 16, wherein the steps of selecting the second subset and determining the score thereof are performed if the prescribed condition is not met.
22. The method of claim 20 or claim 21, wherein selecting the second subset comprises replacing one or more compounds within the initially selected subset with one or more new compounds from the population that are not within the training set.
23. A method according to claim 22 when dependent on claim 2, comprising identifying one or more compounds within the initially selected subset to be replaced based on the respective determined compound scores of the one or more compounds within the initially selected subset.
24. The method of claim 23, wherein one or more compounds having the lowest determined compound score within the initially selected subset are identified for replacement.
25. The method according to any of claims 20-24, comprising iteratively performing the steps of:
selecting a new subset, different from the subset selected in the previous iteration, comprising one or more compounds from the population that are not within the training set; and
determining a subset score for the selected new subset, and evaluating the selected new subset based on the determined score,
until the termination condition is met.
26. The method of claim 25, wherein the termination condition comprises at least one of: the maximum number of iterations has been performed; the subset score of the selected subset in one iteration meets a specified condition; and, the difference between the respective subset scores of the selected subsets at successive iterations is less than a prescribed difference threshold.
27. The method of claim 25 or claim 26, comprising synthesizing the compounds of the subset selected in the iteration in which the termination condition is satisfied, to determine one or more biological properties of the compounds.
28. The method of any one of claims 24-27, comprising selecting a plurality of new subsets in each iteration, identifying, in the iteration in which the termination condition is satisfied, one of the plurality of selected subsets based on the determined subset scores of the respective subsets, and synthesizing the compounds of the identified subset to determine one or more biological properties of the compounds.
29. The method of claim 28, wherein the identified subset is the subset having the highest subset score among the plurality of subsets in the iteration in which the termination condition is satisfied.
30. The method of any of the preceding claims, wherein the selected subset is a first subset, and the method comprises: selecting a plurality of subsets from a population not within the training set, each subset comprising a plurality of compounds; determining a subset score for each subset; and selecting a first subset from among the plurality of subsets based on the determined subset scores for the respective subsets.
31. The method of claim 30, wherein the first subset is selected as the subset having the highest subset score among the plurality of subsets.
32. The method of claim 30 or claim 31, wherein the plurality of subsets each have the same number of compounds.
33. The method of any one of the preceding claims, wherein the step of evaluating comprises evaluating the selected subset based on an activity score of the selected subset obtained from an activity model for predicting the activity level of compounds in the population.
34. The method of claim 33, wherein the evaluating step comprises evaluating the selected subset based on the determined subset score and the activity score, according to a desired balance between the two scores.
35. A method according to claim 33 or claim 34 when dependent on claim 28, wherein the plurality of new subsets each comprise a different balance between the determined subset score and the activity score.
36. The method of claim 35, wherein the plurality of new subsets form a Pareto front of the determined subset scores and activity scores in the iteration in which the termination condition is satisfied.
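Illustrative sketch (not part of the claims): claims 33 to 36 trade a subset score off against an activity score, with the candidate subsets forming a Pareto front. The Python below is a minimal sketch of that idea. The names `pareto_front` and `pick_by_balance`, the example scores, and the linear weighting used to express the "desired balance" are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Each candidate subset is mapped to (subset_score, activity_score); higher is better.
Scores = Dict[str, Tuple[float, float]]


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if a is at least as good as b on both scores and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(scores: Scores) -> List[str]:
    """Return the candidate subsets that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for t in scores.values())
    ]


def pick_by_balance(scores: Scores, weight: float) -> str:
    """Pick one subset from the front using an assumed linear balance.

    weight = 1.0 considers only the subset score; weight = 0.0 only the activity score.
    """
    front = pareto_front(scores)
    return max(front, key=lambda n: weight * scores[n][0] + (1 - weight) * scores[n][1])


example: Scores = {
    "S1": (0.9, 0.2),
    "S2": (0.6, 0.6),
    "S3": (0.2, 0.9),
    "S4": (0.5, 0.5),  # dominated by S2
    "S5": (0.1, 0.1),  # dominated by every other subset
}
front = pareto_front(example)
balanced = pick_by_balance(example, weight=0.5)
```

Each subset on the front represents a different balance between the two scores, so moving along the front changes the emphasis without ever accepting a subset that is worse on both.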
37. The method of any of the preceding claims, wherein the training set is initially empty.
38. The method of any one of the preceding claims, wherein the molecular characteristics of each of the plurality of compounds in the population comprise structural features of the compound.
39. The method of any one of the preceding claims, wherein the structural characteristics of each of the plurality of compounds in the population correspond to fragments present in the compound.
40. The method of claim 39, wherein fragments present in each of the plurality of compounds are represented as molecular fingerprints.
41. The method of claim 40, wherein the molecular fingerprint is an Extended Connectivity Fingerprint (ECFP), optionally ECFP0, ECFP2, ECFP4, ECFP6, ECFP8, ECFP10 or ECFP12.
42. The method of any one of the preceding claims, wherein the molecular characteristics of each of the plurality of compounds in the population comprise chemical characteristics of the compound.
43. The method of any one of the preceding claims, wherein the molecular characteristics of each of the plurality of compounds in the population comprise structural features and chemical characteristics of the compound.
44. The method of claim 42 or claim 43, wherein the chemical property corresponds to the type of interaction exhibited when the corresponding compound binds to a predetermined target molecule.
45. The method of claim 44, wherein the chemical properties of at least some of the compounds in the population correspond to predictions of the type of interactions exhibited when the corresponding compounds bind to the predetermined target molecules.
46. The method of claim 45, wherein the predicting comprises predicting which of the one or more predetermined types of interactions will be exhibited when the corresponding compound binds to the predetermined target molecule.
47. The method of claim 45 or claim 46, comprising obtaining a prediction of the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule.
48. The method of claim 47, wherein obtaining a prediction of each compound comprises:
generating a three-dimensional image of the compound; and
performing a docking process using the generated three-dimensional image to predict a preferred docking pose when the compound is bound to a predetermined target molecule,
wherein the type of interaction exhibited is predicted based on the results of the docking procedure.
49. The method according to any one of claims 44-48, wherein the type of interaction exhibited when the corresponding compound binds to the predetermined target molecule is represented as an interaction fingerprint; optionally expressed as a protein-ligand interaction fingerprint (PLIF).
50. The method of any one of claims 44-49, wherein the type of interaction comprises one or more of: hydrogen bonding interactions, weak hydrogen bonding interactions, ionic interactions, hydrophobic interactions, face-to-face aromatic interactions, edge-to-face aromatic interactions, π-cation interactions, and metal complexation interactions.
51. The method of any one of claims 44-50, wherein each compound in the population is a ligand and the predetermined target molecule is a protein.
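Illustrative sketch (not part of the claims): an interaction fingerprint of the kind referred to in claim 49 is commonly laid out as one bit per combination of binding-site residue and interaction type. The Python below is a minimal sketch assuming that layout. The residue labels, the `interaction_fingerprint` function, and the contact list are hypothetical; in practice the contacts would come from the docking pose of claim 48.

```python
from typing import List, Sequence, Set, Tuple

# Interaction types listed in claim 50, in a fixed (assumed) order.
INTERACTION_TYPES: List[str] = [
    "hydrogen_bond",
    "weak_hydrogen_bond",
    "ionic",
    "hydrophobic",
    "face_to_face_aromatic",
    "edge_to_face_aromatic",  # the T-shaped, side-on aromatic contact
    "pi_cation",
    "metal_complexation",
]


def interaction_fingerprint(
    contacts: Set[Tuple[str, str]],  # (residue, interaction type) pairs seen in a pose
    residues: Sequence[str],         # binding-site residues, in a fixed order
) -> List[int]:
    """Return one bit per (residue, interaction type) pair, set to 1 if observed."""
    return [
        1 if (residue, kind) in contacts else 0
        for residue in residues
        for kind in INTERACTION_TYPES
    ]


# Hypothetical binding site and contacts for a single ligand pose.
site = ["ASP86", "PHE80"]
pose_contacts = {("ASP86", "ionic"), ("PHE80", "face_to_face_aromatic")}
bits = interaction_fingerprint(pose_contacts, site)
```

Because every ligand is encoded against the same residue order, the set bits can be treated as molecular characteristics and counted across the population in the same way as structural fragments.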
52. The method of any one of the preceding claims, wherein the one or more biological characteristics comprise one or more of: activity, selectivity, toxicity, absorption, distribution, metabolism, and excretion.
53. The method of any one of the preceding claims, wherein one or more biological properties are defined relative to the respective desired biological properties.
54. The method according to any of the preceding claims, comprising:
defining a machine learning model for modeling one or more biological properties of a compound in the population based on one or more molecular properties of the compound; and
training the machine learning model using the training set of compounds.
55. The method of claim 54, wherein the method comprises performing a training step each time one or more compounds are added to the training set.
56. The method of claim 54 or claim 55, wherein the machine learning model is at least one of the following models: Bayesian optimization models, regression models, clustering models, decision tree models, random forest models, and neural network models.
57. The method of any one of claims 54-56, comprising executing the machine learning model after the training step to predict one or more compounds in the population having one or more desired biological properties.
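Illustrative sketch (not part of the claims): claims 54 to 57 describe retraining a model whenever compounds are added to the training set and then using it to predict promising compounds. The Python below is a minimal sketch of that loop. A nearest-neighbour lookup on Tanimoto similarity is used purely as a stand-in; it is not one of the model types listed in claim 56, and all names and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class NearestNeighbourModel:
    """Toy stand-in model: predicts the activity of the most similar tested compound."""

    def __init__(self) -> None:
        self._examples: List[Tuple[Set[int], float]] = []

    def fit(self, features: Dict[str, Set[int]], measured: Dict[str, float]) -> None:
        self._examples = [(features[name], value) for name, value in measured.items()]

    def predict(self, feats: Set[int]) -> float:
        if not self._examples:
            return 0.0
        return max(self._examples, key=lambda ex: tanimoto(ex[0], feats))[1]


def add_results(model, features, measured, new_results) -> None:
    """Add newly measured compounds to the training set and retrain immediately."""
    measured.update(new_results)
    model.fit(features, measured)


def rank_untested(model, features, measured) -> List[str]:
    """Rank compounds outside the training set by predicted activity, best first."""
    untested = [c for c in features if c not in measured]
    return sorted(untested, key=lambda c: model.predict(features[c]), reverse=True)


features = {"A": {1, 2}, "B": {3}, "C": {1, 2, 4}, "D": {3, 5}}
measured: Dict[str, float] = {}  # the training set may start empty
model = NearestNeighbourModel()
add_results(model, features, measured, {"A": 0.9, "B": 0.1})
first_ranking = rank_untested(model, features, measured)
add_results(model, features, measured, {"D": 0.8})
second_ranking = rank_untested(model, features, measured)
```

Retraining inside `add_results` keeps the model in step with the training set, so each ranking reflects every measurement made so far.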
58. The method of claim 57, further comprising synthesizing at least one of the one or more predicted compounds.
59. The method of claim 57 or claim 58, wherein the one or more predicted compounds are drug candidates or therapeutic molecules having a desired biological, biochemical, physiological and/or pharmacological activity against a predetermined target molecule.
60. The method of claim 59, wherein the predetermined target molecule is an in vitro and/or in vivo therapeutic, diagnostic or experimentally determined target.
61. The method of claim 59 or claim 60, wherein the candidate drug or therapeutic molecule is for use in medicine; for example, for treating animals such as humans or non-human animals.
62. A compound identified by the method of any one of the preceding claims.
63. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a computer processor, cause the computer processor to perform the method of any of claims 1-61.
64. A computing device for computational drug design, comprising:
an input configured to receive data indicative of a population of a plurality of compounds, each compound having one or more molecular characteristics, and to receive data indicative of a training set of compounds from the population for which one or more biological characteristics are known;
a processor configured to select a subset of one or more compounds from the population that are not within the training set, determine a subset score for the selected subset based on molecular characteristics of the one or more compounds within the selected subset, and evaluate the selected subset based on the determined subset score; and
an output configured to output the evaluation result,
wherein the subset score is determined from the frequency of the molecular characteristics in the population and the frequency of the molecular characteristics within a sample set comprising a training set and a selected subset.
65. The computing device of claim 64, wherein the processor is configured to perform the method of any of claims 1-61.
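Illustrative sketch (not part of the claims): the subset score is stated to depend on how often each molecular characteristic occurs in the population and how often it occurs in the sample set formed by the training set together with the selected subset. The Python below is a minimal sketch of one score with that dependence. The specific formula (population-frequency weighting with a saturation cap) is an assumption for illustration and is not taken from the claims; `feature_counts`, `subset_score`, and `cap` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def feature_counts(compounds: Iterable[str], features: Dict[str, Set[str]]) -> Counter:
    """Count how many of the given compounds contain each molecular feature."""
    counts: Counter = Counter()
    for compound in compounds:
        counts.update(features[compound])
    return counts


def subset_score(
    features: Dict[str, Set[str]],  # every compound in the population -> its features
    training_set: Iterable[str],
    subset: Iterable[str],
    cap: int = 1,                   # assumed: a feature is fully covered after `cap` samples
) -> float:
    """Fraction of population feature occurrences covered by the sample set."""
    population = feature_counts(features, features)
    sample = feature_counts(set(training_set) | set(subset), features)
    total = sum(population.values())
    if total == 0:
        return 0.0
    covered = sum(
        population[f] * min(sample[f], cap) / cap  # common features weigh more
        for f in population
    )
    return covered / total


toy = {"A": {"f1", "f2"}, "B": {"f1"}, "C": {"f3"}, "D": {"f1", "f3"}}
score_with_c = subset_score(toy, training_set=["B"], subset=["C"])
score_with_a = subset_score(toy, training_set=["B"], subset=["A"])
```

In the toy population, adding C to a training set that already covers `f1` scores higher than adding A, because C brings in the more common uncovered feature `f3` whereas A mostly repeats what B already provides.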
CN202180072416.2A 2020-10-23 2021-10-22 Drug optimization through active learning Pending CN116508106A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB2016884.5 2020-10-23
GBGB2109633.4A GB202109633D0 (en) 2021-07-02 2021-07-02 Drug optimisation by active learning
GB2109633.4 2021-07-02
PCT/GB2021/052753 WO2022084696A1 (en) 2020-10-23 2021-10-22 Drug optimisation by active learning

Publications (1)

Publication Number Publication Date
CN116508106A CN116508106A (en) 2023-07-28

Family

ID=77274515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180072416.2A Pending CN116508106A (en) 2020-10-23 2021-10-22 Drug optimization through active learning

Country Status (2)

Country Link
CN (1) CN116508106A (en)
GB (1) GB202109633D0 (en)

Also Published As

Publication number Publication date
GB202109633D0 (en) 2021-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination