COMPOUND PRIORITIZATION FOR DRUG
DISCOVERY
Alan P. Beresford, and Kenneth R. Korzekwa
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims priority under 35 U.S.C. §119(e) from co-pending US Provisional Patent Application No. 60/338,032, filed November 13, 2001, naming Korzekwa et al. as inventors, and titled "Scoring Functions Using Multiple ADMET/PK Properties." That provisional application is incorporated herein by reference for all purposes.
BACKGROUND
For any molecule to act as a safe and effective medicine, it must fulfill a wide variety of criteria. In addition to having the required pharmacological activity against its intended target, it must reach the appropriate site in the body at a high enough concentration and for a sufficient period of time to be therapeutically beneficial. In order to achieve this, the molecule must overcome a variety of physiological barriers. For example, an ideal drug for the prevention of grand mal epilepsy is one which can be administered by mouth whenever a seizure is anticipated, be rapidly absorbed into the bloodstream, penetrate into the central nervous system (CNS) and efficiently and selectively block the receptors responsible for initiation of convulsions. Once the threat of seizure is passed, the compound should be rapidly removed from the CNS and eliminated from the body. To accomplish these things, the swallowed drug must disperse as it passes from the mouth, through the esophagus and stomach, and be in solution for absorption from the intestine. It will need to be stable to gastric enzymes and the acid environment of the stomach and not be degraded by bacteria in the gut or enzymes in the gut wall. Even if it has suitable physicochemical properties to cross the gut wall into the bloodstream, this may be prevented or attenuated by active transport proteins in the gut epithelium. Once in the bloodstream, a sufficient amount of drug will need to remain free in circulation, avoiding breakdown by metabolic enzymes and excretion by the liver and kidneys, to permit an effective proportion of the dose to cross the epithelium of capillaries in the brain and reach the desired
receptors in the CNS. The drug's total or partial failure at any of these steps could result in reduced or negligible efficacy.
Regardless of whether an epileptic seizure would have occurred, any palliative drug should have caused no detrimental effects to the patient, either directly, or through interactions with other medicines which the patient may have taken. In order to fulfill these criteria, the molecule needs to be selective for its target receptor and concentrations at all other sites in the body must be below those which might result in unwanted side-effects and toxicity. It must also be metabolized to pharmacologically inactive products and/or excreted from the body in order to minimize the potential for accumulation on subsequent dosing, and must not adversely affect these metabolism and excretion mechanisms in the process. Historically, the anti-convulsant market has been awash with compounds which precipitate non-specific adverse reactions, show CNS-related side-effects, contribute to drug-drug interactions and both affect, and are affected by, changes in metabolic enzymes and excretion mechanisms.
In addition to the pharmacological, physiological and safety hurdles which a potential new drug molecule must overcome, it is unlikely to become a medicine unless it can be manufactured and supplied to the patient in a cost-effective mamier. For this to happen, the level of demand and profitability of the final product has to be sufficient for a manufacturer to underwrite the enormous Research and Development (R&D) cost needed, and accept the potential risk of failure in getting the drug to market. To satisfy the requirements of profitable production and marketing, the drug will have to be relatively easy and cheap to synthesize in bulk, be readily amenable to a variety of formulations and be stable on storage under a wide variety of environmental conditions. To be successful, the compound will need to show significant improvement in at least one property compared to other drugs in the same therapeutic area and, ideally, its performance as a medicine should be such that it will not be readily superseded by a competitor compound.
Pharmaceutical R&D is not only an expensive but also a high-risk business, and any process by which the risk can be reduced or rigorously assessed is of value. For every potential new drug molecule there is a risk that it will not have the requisite properties to pass one, several or all of the requirements for it to become a successful medicine. So for each of the potential hurdles it would be useful to assess the likelihood that the compound will succeed. This insight would be particularly valuable at the early, "discovery", stage of the R&D process, where very large numbers of molecules are assessed and many fail due to poor pharmacokinetics resulting from inappropriate absorption, distribution, metabolism and excretion
(ADME) properties. Potential toxicity also claims many molecules at later stages of discovery. Collectively, the pharmacokinetic properties and toxicity are sometimes referred to as ADMET properties.
Methods are now available, not only for the measurement of these properties, but also for their computational prediction. Hence, it is possible to assess the potential ADMET risk for a molecule even before it has been synthesized. Such "in silico" predictions can also be made for other drug properties such as potency.
While such computational tools represent a significant advance in the discovery stage of drug development, additional improvements would be desirable. For example, it would be beneficial to have computational tools that rapidly account for multiple of the various factors important to the success of a molecule in the marketplace.
SUMMARY
The invention improves on the prior art by providing a "score" for each compound under consideration. Using this invention, combinations of predicted, estimated and/or measured properties are combined in generating an overall risk assessment. It then becomes possible to prioritize compounds such that resources are focused on those having the highest potential to succeed.
One aspect of the invention provides methods for determining the likelihood that a compound will become a successful therapeutic product. Such methods may be characterized by the following sequence of operations: (a) providing a success function of properties identified as relevant to the success of the compound as a therapeutic product; (b) receiving values of the properties characterizing the compound; and (c) calculating a value of the success function using the property values for the compound. In this method, the success function relates the property values in a manner specifying their impact on the likely success of the compound. It is typically developed by or for the research team engaged in a drug discovery project.
Typically, operation (c) is repeated many times in order to "score" each of many different compounds associated with a project. Preferably, the operations are not computationally expensive, so the scoring can be conducted in a high throughput manner. In some cases, compounds from two different projects will be separately scored so that the projects can be compared. The different projects will have different success functions. Frequently, the compound scores are employed to classify
individual compounds within a project. The compounds can be prioritized for accelerated consideration or eliminated from consideration in a drug discovery project.
Another aspect of the invention pertains to focused methods for selecting an optimized chemical compound or compounds. These methods may be characterized by the following sequence of operations: (a) generating at least two predicted or experimental properties for each compound; (b) transforming each property into a score using success functions; and (c) combining the scores from (b) by multiplying at least two scores to obtain an overall score. The overall score is intended to be used to select the optimized chemical compound or compounds. The compounds may be optimized for any number of chemical features, with safe effective therapeutic properties being preferred. Thus, in one example, operation (a) involves generating predicted properties using ADMET/PK models. The form of the success functions and the form of their combination to give the overall score can account for (i) errors in predictive models or in experimental data and (ii) the relative importance of at least one predicted or experimental property of the compounds.
The properties of the success function used to characterize compounds should have a significant impact on the value of the compound as a potential drug. General categories of such properties include, for example, physicochemical properties, binding affinities to the target receptor and related receptors, ADME, toxicology and ease of chemical synthesis.
Yet another aspect of the invention pertains to computer program products including machine-readable media on which are provided program instructions for implementing the methods described above, in whole or in part. Frequently, the program instructions are provided as code for performing certain method operations. Any of the methods of this invention may be represented, in whole or in part, as program instructions that can be provided on such machine-readable media. In addition, the invention pertains to various combinations and arrangements of data generated and/or used as described herein.
These and other features of the present invention will be described in more detail below in the detailed description of the invention and in conjunction with the following figures.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a high-level flowchart depicting typical operations employed in using the compound scoring algorithms of this invention.
Figure 2A is an example of a sigmoidal success function component.
Figure 2B is an example of a sigmoidal success function component, as in Figure 2A but with scaled asymptotes.
Figure 2C is an example of a "threshold" success function component.
Figure 2D is an example of a "bandpass" success function component.
Figure 2E is an example of a "linear" success function component.
Figures 3 A and 3B illustrate a computer system suitable for implementing embodiments of the present invention.
Figure 4 is a block diagram of an Internet based system for "scoring" therapeutic compounds in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
INTRODUCTION
This invention provides a mechanism for prioritizing compounds by combining various properties deemed necessary for a molecule to become a successful drug. The measured or predicted values of the individual properties contribute numerically to an overall score of the compound. Those contributions are provided on comparable scales.
The invention provides both methods of creating models that score individual compounds under investigation as well as methods of using such models to evaluate potential compounds as part of a given project and/or compare multiple different projects within an enterprise. Such methods are shown in consolidated fashion in Figure 1, which presents a very general process flow in accordance with a preferred embodiment of this invention.
As illustrated, a general process 101 begins at a block 105 where the system responsible for creating the model receives a success function that defines the
expected success of a compound for its intended purpose within the project as a function of the properties relevant to the project.
As explained in the discussion that follows, the success function specifies the relative importance of the various properties to a compound's overall success. It accomplishes this by defining the mathematical forms of the contributions of individual property values. These forms may be Gaussian, sigmoidal, impulse, step change, etc.
Returning to Figure 1, at a block 107, the system receives the property values for each of the compounds under consideration in the project. These are the values for properties employed in the success function. With the property values in hand, the system can calculate a compound score for each compound in the project. It accomplishes this using the success function. See block 109.
Blocks 111 and 113 depicted in process 101 illustrate two different applications for the compound scores. In block 111, the compound scores are used to compare two or more related or unrelated projects undertaken by an enterprise. If the likelihood of success appears problematic, it may suggest that an enterprise should devote more of its resources to a different project that has a greater likelihood of success. At block 113, a different application employs compound scores in selecting or filtering compounds for further investigation during a drug development project. The selected compounds are likely to have an optimized set of property values conducive to success.
To further understand the invention, consider the following simple example. Assume that a drug discovery enterprise has undertaken a project to develop a neurologically-active drug and is at a point in its research where it has 500 separate compounds to consider. In accordance with block 105 of the process depicted in Figure 1, the enterprise has identified the following three properties as important to the success of its project: blood brain barrier penetration, CYP3A4 turnover (metabolism), and potency against a particular target. For each of the 500 compounds, blood brain barrier penetration values are obtained in units representing the partition of a compound across the blood brain barrier. For purpose of this project, a significant penetration of the barrier is required. CYP3A4 is an enzyme that has been identified as primarily responsible for metabolizing the compounds under consideration. Ideally, a compound will have a metabolic half-life sufficient to require only one or two oral doses each day. The potency against a target is measured in terms of an IC50 value, typically in the μM range.
To develop an appropriate success function, the enterprise generates a success function that employs the three property values as independent variables. The success function is intended to formally represent the criteria for success of the project in the marketplace. So, for example, considering the CYP3A4 turnover, the success function treats a twelve-hour half-life as ideal. This would correspond to a patient taking two doses of the medication per day. Thus, the success function may be constructed so that it has a relatively high value when the CYP3A4 turnover is in the neighborhood of twelve-hours. The enterprise may also conclude that a six-hour half- life is too short and a twenty-four hour half-life is too long. Using these boundaries, the success function is constructed so that, all other variables being equal, it has a maximum value when the CYP3A4 half life is approximately twelve hours it has a very low value when the half life is twenty-four hours or greater and when it is six hours or shorter. In one example, a Gaussian or other mean-centered function may be appropriate for representing the success value as a function of CYP3 A4 turnover.
In considering the potency against the target, the enterprise has identified a range of IC50 values that are optimal and a range that are inadequate. This information may be fit to a sigmoidal function, for example, in the success function. Similarly, the blood brain barrier penetration independent variable may be fit to a step function or a sigmoidal function, for example. The functions for the three properties may be combined in a fashion that accounts for correlation between the individual properties. In the simplest case, there is no correlation between the independent variables (property values) present in the success function. In that case, each property is treated by a separate "sub-function." The overall success function is then represented as a multiplicative combination of the individual sub-functions. More generally, however, one should assume that the success function is a multidimensional function having scalar property values or vectors of property values as independent variables.
In the example at hand, each compound from among the 500 under consideration in the current stage in the project, has its compound score calculated. In many cases, this is merely the success function calculated for the particular compound under consideration.
From the above discussion, the meanings of certain relevant terms should be reasonably clear. However, to further illuminate certain concepts pertinent to this invention, the following definitions are provided. These definitions are provided to assist in understanding the concepts presented in the specification. They do not necessarily limit the scope of this invention.
The term "project" refers to a drag development effort undertaken by an enterprise, a research group and/or an individual researcher. A project is defined by a specific set of criteria for a successful drug. These criteria are manifest numerically as certain "property values." The Background section of this document sets forth criteria for a hypothetical project to bring an epilepsy drug to market. The immediately previous example provides values of BBB penetration, CYP3A4 turnover, and potency as criteria for a successful drug in a project.
A "score" or "compound score" is a calculated numerical value estimating the likelihood of a molecule being a successful drug molecule based on predictions or measurements and the information provided by the user on the compound's desired profile. In a specific embodiment, the compound score is given directly by the value of a success function.
A "success function" (denoted S(X) herein) is a function mapping the true values of a vector of properties to a single numerical value representing an estimate of the likelihood of a molecule with those properties being a successful drug molecule within the goals of a project. Typically, it is created by individuals who understand how particular physical and chemical properties impact the goals of a project.
SUCCESS FUNCTIONS
As indicated a success function maps particular chemical and physical properties to the likely success of a compound for meeting the goals of a drug development project. The success function S(X) represents a project team's estimate of the likelihood of a molecule with properties X=(Xι,X2,Xz, ...) becoming a successful drug, assuming all other properties are ideal. The shape of this function is subjective and may be based on the project team's knowledge of the intended therapeutic class, the properties of competing compounds and the stage of chemotype qualification, among other factors.
In general, the success function may not be separable, i.e. S(X) ≠ S1 (Xl) x S2(X2) x S3:(X3)x ... , where Sι(Xj) is the success function for property one etc., because there may be interdependencies between the properties when determining the score. However, in many cases the separable approximation will be valid.
Generally, the properties that will be relevant to the success of a compound (and therefore are candidate independent variables for a success function) are specific
to individual compounds and allow one compound to be distinguished from another. Typical examples include affinity for the designated target receptor, no significant affinity for other related receptors, ADMET properties suitable for the delivery context of the drug, physicochemical properties and ease of chemical synthesis. Among the specific physicochemical properties that can be considered in a success function are pKa, logPoctanoi/water5 solubility in water and other solvents, thermal stability, photostability and pH stability.
Among the relevant ADMET properties that can be considered in a success function are intestinal, dermal and pulmonary absorption, blood-brain barrier penetration, solubility, binding affinity, metabolic rates and products of metabolism metabolism by drug metabolizing enzymes including cytochrome P450s, UDP- glucuronyl transferases, sulfo transferases, GSH, esterases, aminoacid transferase, cetyl transferase, and combinations of these, microsomal metabolism rate, hepatocyte induction and metabolism rate, active transport by p-glycoprotein and other transporters, gut stability, P450 Type II binding, PXR and CAR mediated induction, membrane binding (non specific), renal clearance, P450 CYP3A4 time dependent inhibition, biliary clearance, oral bioavailability, clearance, volume of distribution, half life, binding to any of human serum albumin, alpha-glyco-protein, aryl hydrocarbon hydroxylase receptor, toxicity of the compound and metabolites.The success function for a single property may take many forms. A common form is a sigmoidal function, an example of which is shown in Figure 2A for affinity for the CYP2D6 enzyme (shown as pKi). A low affinity (low pKi) indicates a low risk of interaction with CYP2D6 and hence a good likelihood of successfully overcoming this hurdle. Conversely a high affinity (high pKi) indicates a strong interaction with CYP2D6, which can give rise to drug-drug interactions due to inhibition of CYP2D6 or a high metabolic turnover of the compound by 2D6. This translates to a significant risk to the success of the molecule as a drug. The consensus is that a pKi of greater than approximately 6 would indicate a problematic interaction with CYP2D6.
The upper and lower asymptotes of a sigmoidal success function need not be 1 and 0 respectively, but in certain embodiments the function may only take values between 0 and 1 inclusive. For example, the success function for logBB, for a project with a non-CNS target could be of the form shown in Figure 2B. Ideally a compound with a non-CNS target would not penetrate the BBB to avoid potential CNS side effects. However, if the compound were to penetrate the BBB, this would not significantly diminish the efficacy of the compound although it would introduce an additional risk of side-effects. Hence, the lower asymptote of the success function would be greater than zero, but still less than one. Other examples of possible success
functions include a "threshold" (Figure 2C), "bandpass" (Figure 2D) and "linear" (Figure 2E) functions.
As indicated, the success function may employ individual scoring functions to transform experimental or predicted property values (often ADMET/PK properties) into a "score" that typically lies within set boundaries (e.g., between 0 and 1). The specification of the scoring function can take into consideration the importance of the property, the "confidence" in the predictions or experimental measurements and the variation of the score with the property value. The scores for multiple properties are provided independently and then combined to give an overall consensus score for the overall set of properties.
In essence, the success function is comprised of constituent scores for each property value of relevance. The constituent scores are provided on a defined scale. In a preferred embodiment the scale ranges between 0 and 1. Each constituent score, which is derived for a separate model of a property, must be separately scaled. For example, a model predicts the binding affinity (K ) of substrate binding with CYP2D6 from sub-micromolar to hundreds of micromolar. In one case, a CYP2D6 binding score is obtained by converting the specific Kj value to a scale of 0 to 1, with 1 being most effective and some minimum value being least effective. In another case, a model classifies blood brain barrier penetration as either positive or negative. A positive classification is given a score of 1. And a negative classification is given a lower (but positive) score. This works for compounds having a CNS target. The roles are reversed for non-CNS targets.
The constituent scores are multiplied together to give the overall score. Other forms of the model are possible. For example, a success function may be partially additive and partially multiplicative. But, at its core, the model multiplies the important constituent scores with one another.
Most drug candidates must meet minimum acceptable requirements for many ADMET properties. For example, many programs have a requirement that a drug is orally bioavailable. Even if all other properties of a compound are perfect, a compound that is not absorbed will not become a drug for a program that requires oral administration. Therefore, a high degree of importance may be placed on predicted intestinal absorption characteristics. In this case, the success function will have a low minimum value associated with poor absorption. Since the functions are multiplicative, any compound predicted to have low absorption will have a low overall score. For additive schemes, a deficiency in an important property can be compensated for by good characteristics in other properties.
As indicated, all underlying properties used as constituent properties for the overall score are provided on a common scale. To this end, raw property values must be transformed or converted by a conversion function. The underlying property may be initially represented as either a discrete classification or a continuous value. These values are converted to a common scale. Preferably, the scale is the same for all constituent properties (e.g., 0 to 100 or 0 to 1). In a most preferred embodiment, the scale ranges from 0 or some minimal non-zero value up to 1.
For each constituent property, the upper bound of the converted function is preferably the same number. For example, each property is given 1 as the maximum allowable value. The lower bound of the scaled property is some value from 0 up to a number below 1. For more important properties, the minimum value is made lower. Thus, a particularly poor value of the property will strongly bias the overall score towards 0. This is because the preferred form of the overall success function is multiplicative. Less important properties may have relatively higher minimum values. Therefore, a particularly poor score associated with such property will not necessarily drive the overall score all the way towards 0.
Also, if there is very high confidence in a model, the minimum value may approach zero. On the other hand, if the confidence in the model is low, the minimum value will approach one. This is because a typical goal of the filter is to remove only those compounds very likely to be poor. Compounds having unknown properties (or known good properties) are preserved. Relatively unreliable models cannot predict with assurance that a given compound is in fact bad. Therefore, a poor score predicted by such model should not drive the overall score strongly towards 0.
The form of the prediction may be used to define and weight the transformation function. For example, if the property takes a sigmoidal form, one will have strong confidence that the horizontal, flat regions (upper and lower) are accurate. Values in these regions are given clearly minimal or clearly maximum scores in the converted scale.
As an example, consider compounds that bind to tightly to a cytochrome P450 can cause drag-drug interactions. It is generally accepted that molecules that have sub-micromolar binding constants have a high risk for causing drug interactions whereas compounds the have binding constants greater than 10 μM are low risk. Therefore a success function for P450 affinity could approach 1.0 above 10 μM and approach a minimum value below 1 μM. A sigmoidal function that transitions between a minimum value and 1.0 over the 1 - 10 μM range would be appropriate for representing the potential for drug interactions for a P450 affinity prediction.
One example of a relevant transformation function is the following:
where bO = 0.98 and bl= -1.96
The success function also compresses the value between an arbitrary min and max:
Compressed score = min + ((max - min) * score);
For example, CYP2D6 affinity may be compressed between 0.6 and 1.0
This equation specifies a sigmoidal curve that transforms a predicted binding constant to a drug interaction score. The minimum value is determined by the importance of the binding constant in the overall drug design effort (0.6 in this case). The value transitions to 1.0 between -1 and 2 log K units. This function represents
(1) compounds with binding constants above 10 are not at risk for drag interaction,
(2) compounds with binding constants below 1 are high risk, and (3) the error in the predictive model flattens the curve such that the minimum value is reached close to 0J and the maximum value is reached close to 100.
In one specific example, scaled values of each of the following five ADMET/PK properties are multiplied together to obtain an overall score: human intestinal absorption, blood brain barrier penetration, solubility, CYP2D6 Kj, and
PREDICTIVE TOOLS
Any reliable model for predicting property values used in the success function may be employed. Individual models may predict individual properties. Or they may predict combinations of relevant properties. Examples of ADMET/PK properties that may be analyzed include absorption (including but not limited to solubility, passive permeation of cell membranes, passive human and animal intestinal absorption, active transport across human and animal cell membranes), distribution (including but not limited to blood-brain-barrier penetration, human serum albumin binding, protein binding, receptor binding, drag transport across cell membranes), metabolism (including but not limited to compound binding affinity to metabolizing enzyme, rate of metabolism, regiolability, regiospecificity), excretion (including but not limited to hepatic and renal excretion), and toxicity (including but not limited to bioactivation of
toxic metabolic intermediates, identification of structural features in the parent compound that are correlated with toxicity).
Regardless of which properties an individual model predicts, it typically makes that prediction based upon the chemical structure of the compound under consideration. In some cases, the model will rely on a full two-dimensional or three- dimensional representation of the compound. In other cases, the model will require only a partial two-dimensional or three-dimensional representation. Other types of models will require only "structural descriptors" of the input compounds. These may describe a molecular fragment, a molecular site, or geometric parameter of the molecule.
Structural descriptors have been found to be particularly useful in generating some models, particularly regression-based models. They represent 'important' structural features of molecules. These features are likely to influence the reactivity (or other relevant property) of a particular site on the compound. Examples of site- specific structural properties captured in descriptors include (a) the classification of a site according to atom type and electronic hybridization, (b) the influence of neighboring atoms and groups, (c) the geometric constraints on the site resulting from participation in a ring, size of the ring, and/or flexibility of the ring, and (d) the partial and/or formal charge on the atom at the site (or elsewhere on the molecule). For further discussion on how certain descriptors that are appropriate to the current invention can be used, see U.S. Patent Application No. 09/811,283.
The influence of neighboring atoms may be captured with descriptors that characterize electron withdrawing properties of neighbors, participation in a conjugation system, participation in a ring system, etc. The geometric state of a site may be captured using descriptors that specify steric factors hindering or facilitating accessibility to a particular site. These factors may result from neighboring structures on the molecule or the relative geometric positioning of a particular site (e.g., at the end of a major axis on an ellipsoid shaped compound). The partial charge on an atom reflects the degree to which the atom donates its electrons to (or receives electrons from) neighboring atoms. More details associated with structural descriptors, particularly as used for CYP metabolism models, are provided in US Patent Application No. 09/811,283, incorporated by reference.
For ADMET/PK properties typically affected by a group of sites (e.g., adsorption), groups of structural descriptors may be employed. In another specific embodiment, the descriptors identify relevant fragments of a molecule. A system generating such fragments takes, as an input, a molecular structure and applies a set of
fragmentation rales. Generally, such rales fragment a molecular structure in a manner that preserves in the resulting fragments the important descriptor information identified above. The fragment sizes are chosen to be relevant to the particular property under consideration. In some cases, the only meaningful way to consider the property of a compound is by considering the entire structure of the compound, via its detailed three-dimensional structure, for example.
The mathematical or algorithmic underpinnings of models can take various forms. This variation is independent of the type of structural representation or descriptor used as an input to the models. Examples of model formats include mathematical expressions (e.g., linear or non-linear single or multi-order equations), neural networks, trees or graphs, etc.
In one embodiment, quantum chemical techniques model the electronic configurations and energies associated with atomic orientations. The essence of a quantum chemical method involves calculating the electronic structure of a given atomic configuration. The electronic configuration of a molecule is obtained by combining atomic orbitals to form molecular orbitals. The equations for the electronic waveforms have been around since the beginning of the twentieth century, but they are not amenable to solution. Therefore different approximations such as semi-empirical methods (using experimental data) and ab initio methods (using a basis set of Gaussian functions to approximate atomic orbitals) are used in the solution to these equations. Approximate geometries are optimized to stable geometries by minimizing the energy with respect to the atomic coordinates. Various properties can be modeled using quantum chemical techniques. In one example, reactions are modeled by transforming the reactant geometry to the product geometry and minimizing all but one degree of freedom. Additional details of quantum chemical modeling of metabolism can be found in U.S. Patent Applications Nos. 09/368,511, and 09/613,875.
As mentioned, the model can take the form of a specific expression for site reactivity. Often such expressions will be comprised of a sum of terms, each representing a different structural descriptor. A separate coefficient may be provided for each such term. Often such expressions derive from a statistical technique such as a multivariate regression analysis.
Generation of a statistical model (or a neural network) requires the use of a training set. This is a sample group of compounds having known or reliably predicted properties. The training set should have a wide range of chemical structures and properties.
In a preferred embodiment of the invention, a set of fragments or geometry descriptors are determined for a training set of molecules. The properties of interest for the molecules are predetermined from an external method, such as actual experimental measurements or more computationally intensive estimates, e.g., quantum chemical modeling. The sites of the molecules are described in terms of the descriptors chosen, and a linear regression analysis or other fitting is done to create a simple relationship between descriptor values and property. In a specific embodiment, the relationsliip is a linear equation with coefficients that match the data of the training set with a least squares fit. The linear equation is then applied to subsequent molecules to model and predict their properties.
Metabolism
Metabolic rates and regioselectivity may be predicted by various techniques. For example, hydrogen atom abstraction, aromatic oxidation and/or other metabolic reactions mediated by CYP enzymes can be reliably modeled using parameterized expressions of quantum chemical descriptors (see US Patent application no. 09/368,511, filed August 5, 1999, US Patent application no. 09/613,875, filed July 10, 2000, and US Patent application no. 09/902,470, filed July 9, 2001). Metabolism can also be predicted by simple expressions of molecular descriptors as described in US Patent application no. 09/811,283, filed March 15, 2001. Binding affinities of arbitrary compounds with respect to enzyme and receptor binding sites can be predicted by various models such as the parametric transformation models described in US Patent application No. 10/034,663, filed December 19, 2001. The mathematical forms of models described in these documents can serve as components of a separable success function or as the overall success function - if appropriate. Each of the documents cited in this paragraph is incorporated herein by reference for all purposes.
Absorption
In a preferred embodiment, absorption, distribution and excretion analysis is carried out using a single software application, since there are common aspects of the analytic models required for each of these properties. However, other embodiments call for these to each have their own software application.
With regard to intestinal absorption, the drug must cross the epithelial cells that line the gastrointestinal tract, and for distribution, they must cross the endothelial cells that line the blood capillaries that perfuse the target organs. Such traversing can occur via the paracellular (in between cells) or the transcellular (through cells) pathway. Since only small hydrophilic molecules (MW less than 350) can use the paracellular route, the majority of compounds cross cell membrane barriers via transcellular routes. Transcellular routes include passive diffusion and carrier- mediated (or active) transport.
Most drugs and drug candidates will fall into the higher MW category, so that the paracellular sub-module will not be run or will return a null value. The transcellular sub-module models transport-independent (passive) processes and involve the use of standard descriptors well-known in the art that represent the key transitions in the permeability process. Modeling of and transport-dependent processes focuses on modeling the transport-protein interactions.
At the intestinal epithelium, compounds are primarily absorbed via passive diffusion across the vast surface area of the densely packed microvilli at the apical brush border membrane, and via transport mediated by selective transporter proteins. However, this import is counter-acted by efflux mechanisms that transport compounds out of the cell back into the intestinal lumen. One such efflux mechanism is mediated by the transporter protein P-glycoprotein (P-gp), also called the multidrag transporter (MDR1). P-gp is the most important transported protein in the transport- dependent model of the transcellular sub-module. P-gp is a member of a highly conserved group of energy-dependent ATP -binding-cassette (ABC) transporters found in a wide range of cell types.
P-gp is an integral membrane protein. It is made up of two homologous halves, each consisting of an N-terminal hydrophobic domain with six transmembrane segments that is separated from a hydrophilic domain containing a nucleotide-binding fold by a flexible linker polypeptide. P-gp's efflux action is powered by ATP- hydrolysis. Other active transport proteins that are modeled according to the current invention include MDR2 and cMRP (also known as cMOAT or MRP2) that translocate phosphohpids, glutathione, glucuronide, and sulfate conjugates of certain drugs across the hepatocyte canalicular membrane and monocarboxylic acid transporter MCT1, which mediates the bi-directional transport of some organic anions at the Blood-Brain-Barrier (BBB) and the efflux of valproic acid and probenecid.
A striking feature of P-pg and these other transporters are their broad substrate specificities. P-gp transports many structurally dissimilar drugs that act on different
intracellular targets, including anticancer drugs, cardiac drugs, anti-fungal and antimicrobial agents, immunosuppressive agents, HIV protease inhibitors, steroids, calcium channel blockers, and other cytotoxic drugs. While not entirely non-specific, these proteins do appear to be highly "versatile" in their substrate interactions.
As discussed above, an important consequence of this versatility is that the computational tools in the current art used for the modeling of classical protein- substrate interactions are not useful for modeling these active transporters. Ordinary modeling of selective protein interactions specifically focuses on the atomic interactions that impart the specificity. Likewise, pharmacophore-based approaches and field-based approaches use the shared three-dimensional characteristics of ligands (and non-ligands) to construct a three-dimensional representation of the binding site.
In a preferred embodiment, the invention employs at least some descriptor- based stracture-activity relationship (SAR) models that are based on the physical and chemical properties of the substrates alone and that disregard the structure of the substrate-binding fold of the transporter. These descriptors represent individual or composite molecular properties and are used to relate structure to activity. They include, but are not limited to, atomic and functional groups-based descriptors, such as the number of aromatic halides, amides, or basic nitrogens, and specific fragment descriptors. Again see U.S. Patent Application No. 09/811,283 for further discussion of such descriptors. They also include whole molecule property descriptors, such as lipophilicity, size, and HOMO/LUMO (Highest Occupied Molecular Orbital and Lowest Unoccupied Molecular Orbital) descriptors.
Modeling of active transport-mediated and passive transport independent movement of drug compounds across cell membranes is carried out in a similar manner to the intestinal processes described above. Aqueous solubility is also an important component of distribution that is taken into account by the current invention.
Distribution and Excretion
As mentioned, modeling of ADMET/PK properties can include modeling distribution and/or excretion or particular compounds. "Distribution" includes Blood- Brain-Barrier (BBB) penetration, protein binding in the blood, receptor-binding and drag transport across cell membranes. The active and passive transport models are similar to those discussed above with respect to absorption models. Modeling of
excretion includes hepatic and renal excretion, which are also mediated by active and passive transport mechanisms, and require similar models. For instance, P-gp is expressed on the brash border and biliary face of proximal tubule cells in the kidney and hepatocytes, respectively, consistent with a role for this active transporter in the excretion of compounds into the urine and bile.
In general, distribution and excretion can be modeled as described above, preferably using a collection of suitable structural descriptors. The descriptors identify key structural motifs that strongly impact a distribution and/or excretion mechanism.
Toxicity
The effect of toxic compounds involves all aspects of their interaction with the biological system, including absorption, distribution, metabolism, elimination, and other biochemical reactions. Metabolism, in particular, plays a complex role in toxicity as it serves both as a mechanism for detoxification and as a process that can create toxic compounds. Detoxification is achieved through the metabolism and removal of exogenous compounds from the biological system. Compounds that are not metabolized rapidly enough or whose metabolism is inhibited by the co- administration of another drug may build up to toxic levels and lead to adverse effects. Besides this detoxification function, metabolism can also cause toxicity problems by transforming harmless compounds into more reactive molecules that are toxic. For example, the oxidation reaction of acetaminophen that is catalyzed by CYP2E1 and CYP3A4 produces electrophilic metabolite(s). Similarly, a compound's rate of absorption, distribution, and elimination can have important implications for its toxicity profile. For example, neurotoxicity is likely to be low for compounds that cannot cross the blood-brain barrier. Genetic polymorphisms in metabolism enzymes and transporter molecules may also affect the local concentration of compounds and their metabolites at target sites and can be at the root of variations in toxicity profiles observed across patient populations.
Computational models in the current art for predicting toxicity can, for example, generally be categorized as either (1) knowledge-based or (2) statistically- based. Knowledge-based approaches codify human expert judgment into generalized rules that can be used to predict the toxicity endpoints of novel compounds. They do not discover new associations, but rather attempt to mimic human expert reasoning derived from past experience. Knowledge-based approaches were used to generate
commercial toxicity programs such as OncoLogic, DEREK, and HazardExpert. Statistically-based approaches use databases of toxicity date and data mining techniques to uncover associations between chemical structures and biological activities. The structures are encoded with molecular descriptors, the distribution of which among active and inactive molecules in the training set is analyzed statistically. Molecular descriptors that are strongly associated with specific toxicity endpoints are incorporated into predictive, computational models. Statistically-based approaches used in commercial programs such as CASE, MULTICASE, and TOPKAT.
One limitation of knowledge-based approaches relates to inadequate rales for a number of toxicity endpoints and chemical classes due to knowledge gaps of the expert developers. Statistically-based approaches, since they do not rely on expert knowledge they can be more broadly applicable. However, since most chemical toxicity databases contain compounds whose toxicities are mediated by a diverse range of biological mechanisms, detection of strong associations between molecular descriptors and toxicity endpoints is a challenging proposition. Subdivision of these databases into specific toxicity pathways generally results in significantly smaller databases. Computational models derived from these smaller databases tends to be applicable to only a limited range of compounds in a narrow range of chemical space. Therefore, the quality and applicability of statistically-based prediction models is critically dependent on the use of large databases of highly diverse compounds with specific toxicity endpoints.
One computational model for use with the current invention involves the use of the two million adverse drag events recorded over the past three decades years through the Adverse Event Reporting System (AERS) by the FDA Center for Drug Evaluation and Research (CDER). This information was collected in the Adverse Event database, a computerized database of drug adverse events reported by health professionals and others on marketed drags. The database is currently growing at a rate of approximately 300,000 reports annually. It contains compound structures with associated adverse events information encoded with either Event Codes in COSTART (Coding Symbols for Thesaurus of Adverse Reaction Terms) (used up until 1997) or with Preferred Term Event Codes in MedDRA (Medical Dictionary for Regulatory Activities) (used after 1997). Both COSTART and MedDRA are hierarchical coding systems, with multiple levels of term grouping, from (in the case of MedDRA) "lower-level terms" to "primary terms" to "higher-level terms" to "higher-level-group terms" to "system-organ-classes."
The AERS database is the accumulation of numerous spontaneous adverse event reports, each of which enumerates one or more suspected drugs and one or more suspected adverse reactions. This data is easily transformed into a large two-way table of counts; i.e. the number of times a given drug and a given event are mentioned together in the same report, for all distinct pairs of drags and events. Numerical strength-of-association scores ("signal scores") can be assigned based only on this raw count data. This generates a Bayesian shrinkage estimate for the ratio of the "observed" to the "expected" count for each combination of drug and event (the "expected" count is the count arising from the separate marginal distributions of drug and event counts along with an assumption of independence). One example of software that generates data under such an approach, known as the DuMouchel algorithm, is available from Lincoln Technologies of Weston Newton, Massachusetts.
In a preferred embodiment, each MedDRA adverse event code is categorized by organ-specific type of toxicity (hepatic, renal, cardiac, neurological, etc.), based on the built-in hierarchical organization of the MedDRA terminology, in which Primary Terms are linked to Higher Level Terms that are in turn linked to Higher Level Group Terms that are finally linked to 26 top-level System Organ Classes. For each set of MedDRA terms making up a "toxicity group", a weighted average of the corresponding "signal scores" is used to produce a corresponding "toxicity score." The resulting database contains a set of organ-specific toxicity scores for each drag appearing in the adverse events database.
A database such as of the MOE (Molecular Operating Environment) software package available from Chemical Computing Group Inc. of Montreal, Canada can store the substructure elements of parent compounds with reactive intermediates as well as their associated toxicity data. The toxicity software module thus works on the basis of a substructure searching functionality and the substructure-toxicity database and will systematically screen query compounds for their reactive intermediate potential. Upon identification of a relevant substructure, the model accesses the predictive metabolism model and determines the relative likelihood for metabolism to occur at that particular site and to give rise to the reactive intermediate responsible for toxicity.
This process is automated, in a preferred embodiment of the invention, with a computational engine that flags compounds with potentially reactive functionalities that are predicted to be significantly metabolized by a P450 major isoform. With toxicity scores computed for each of several organ systems for each of approximately 1,000 compounds, statistical modeling techniques are used to produce QSAR models
that predict each of the types of toxicity from chemical structure. In order to develop the QSAR model, "recursive partitioning" of molecular features is used. Recursive partitioning involves linear and non-linear relationships to describe combinations of substructures and other molecular features that are responsible for activity. Recursive partitioning methods result in a dendrogram, in which structural characteristics responsible for activity (or inactivity) are associated with branches of the dendrogram. The splits to active and inactive branches or terminal nodes give information about the structural requirements for activity. This QSAR information can then be used to identify which metabolites and intermediates produced from the drag (as predicted by the metabolism model) are in fact toxic.
An example of toxicity caused by reactive intermediate formation is the drag troglitazone. Rezulin is one of the highest profile cases of a drag failure due to ADMET/PK properties. A method suitable for use with the current invention has been found successful at predicting and modeling Rezulin' s toxicity. Troglitazone causes time and NADPH-dependent inhibition of CYP3 A4, indicating that a reactive intermediate is formed by CYP3A4 metabolism. At the 2001 ISSX meeting, Tom Baillie (VP, Preclinical Research at Merck) reported that the hepatotoxicity of troglitazone was mediated by the initial CYP3A4-mediated sulfur oxidation reaction. The toxicity model discussed here accurately predicts the susceptibility of troglitazone to metabolism at the site on the molecule that leads to hepatotoxicity from the sulfur oxidation intermediate. The current invention predicts that this intermediate accounts for 60% of metabolism, and that 40% of metabolites are distributed among a subset of other intermediates.
APPLICATIONS
The use of compound scores allows interpretation of the results using a mathematical framework and permits comparison of scores between, as well as within, compound sets. Some example applications of compound scores include, but are not limited to, compound selection within a project, project prioritization, and resource allocation.
For compound selection, those molecules with the highest scores are preferentially selected for future research, as these have the highest likelihood of success. Project prioritization allows an organization to make decisions on how to promote two or more distinct projects. If a project appears to have a high risk of failure, a less risky project might be emphasized. This analysis is based on a
calculation of total likelihood of success of a project, using the compound scores for many or all compounds under consideration in a project. This gives a projection of the likelihood that one or more of the project's compounds will be successful.
Resource allocation identifies the number of molecules that must be progressed in order to achieve a chosen overall likelihood of success for a project. The overall likelihood of success for the best n molecules in a project may be estimated from n
Probability of success = 1 - ]^J (1 - score t )
.=1 where the molecules are ordered in terms of their score (highest score first) and scoret is the score of the t'th compound. Therefore, n is the number of compounds that must be progressed for the project to have a chosen degree of confidence (say 90%) of identifying at least one successful molecule. Thus, decisions may be made regarding the total resource allocation that will be necessary to progress a project.
For many applications, a simplified expression for computing a compound score will allow rapid reliable calculations. For example, it is often acceptable, to assume that the scoring function, S(X), is separable. It appears that, at least for the range of properties that can currently be modeled, the scoring function is in fact separable. Therefore, one can input scoring functions for each property independently and construct the overall scoring function by multiplying together the individual functions. Finally, the form of the success function can be made very simple. In case, the model employs a step function form for the success functions for the individual properties. See Figure 2C for an example.
These simplifications in the form of the scoring function simplify the numerical code for the calculation of the score and design of the user interface. Some models that are widely available for use in scoring are classification models, so this unifies the user's view of the models from the scoring perspective. Continuous models may have an additional parameter in the scoring interface, the value of the threshold.
Another application of this invention converts a virtual library of chemical compounds to an actual library of compounds for synthesis. A scoring function is used to filter less interesting compounds, while preserving chemical/structural diversity. In one approach, the model can remove all compounds having an overall score below a particular value. In another embodiment, different score thresholds are
used to remove compounds in different regions of chemical/stractural space. If a particular region of chemical/stractural space is particularly important or otherwise underrepresented in the pool of available compounds, a lower overall score may be used as the passing threshold for filtering compounds.
EXAMPLE
In one example, approximately 8 million compounds were scored for ADME issues in accordance with this invention. The scored compounds were chosen on the basis of their potential interaction with an ion channel class of targets. They were provided in electronic form and evaluated with models that predict blood brain barrier penetration, human intestinal absorption, solubility, binding affinity (Ki) to CYP2D6 and binding affinity (Ki) to CYP2C9. Transformation functions for each of the properties were developed. The transformation functions converted the predicted property values to numbers indicating the likelihood that an ADME problem would occur.
The 8 million starting compounds were divided into 16 arrays. Each array contained 20,000 to 200,000 compounds sharing some chemical structural basis. The scoring function distinguished between arrays having good overall ADME scores and those not having such good scores.
In another example, a "library" of 952 US Food and Drag Administration approved drags was divided into two smaller libraries: oral route of administration and non-oral route. Compounds giving poor ADME scores (generated as described above) were more likely to be found in the non-oral group than in the oral group. This is likely due to low scoring resulting from predicted low levels of human intestinal absorption.
HARDWARE/SOFTWARE IMPLEMENTATION
Certain embodiments of the present invention employ processes acting under control of instructions and/or data stored in or transferred through one or more computer systems. Embodiments of the present invention also relate to an apparatus for performing these operations. This apparatus may be specially designed and/or constructed for the required purposes, or it may be a general-purpose computer selectively activated or reconfigured by a computer program and/or data structure
stored in the computer. The processes presented herein are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required method steps. A particular structure for a variety of these machines will appear from the description given below.
In addition, embodiments of the present invention relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, magnetic tape; optical media such as CD-ROM devices and holographic devices; magneto-optical media; semiconductor memor - devices, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM), and sometimes application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and signal transmission media for delivering computer-readable instructions, such as local area networks, wide area networks, and the Internet. The data and program instructions of this invention may also be embodied on a carrier wave or other transport medium (e.g., optical lines, electrical lines, and/or airwaves). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter. Further, the program instructions include machine code, source code and any other code that directly or indirectly controls operation of a computing machine in accordance with this invention. The code may specify input, output, calculations, conditionals, branches, iterative loops, etc.
Figures 3 A and 3B illustrate a computer system 300 suitable for implementing embodiments of the present invention. Figure 3 A shows one possible physical form of the computer system. Of course, the computer system may have many physical forms ranging from an integrated circuit, a printed circuit board and a small handheld device up to a very large super computer. Computer system 300 includes a monitor 302, a display 304, a housing 306, a disk drive 308, a keyboard 310 and a mouse 312. Disk 314 is one example of a computer-readable medium used to transfer data to and from computer system 300.
Figure 3B is a block diagram of certain logical components of computer system 300. Attached to system bus 320 are a wide variety of subsystems.
Processor(s) 322 (also referred to as central processing units, or CPUs) are coupled to storage devices including memory 324. Memory 324 includes random access memory (RAM) and read-only memory (ROM). ROM acts to transfer data and instructions uni-directionally to the CPU and RAM is used typically to transfer data and instructions in a bi-directional manner. Both of these types of memories may include any suitable computer-readable medium, including those described above. A fixed disk 326 is also coupled bi-directionally to CPU 322; it provides additional data storage capacity and may also include any of the computer-readable media described below. Fixed disk 326 may be used to store programs, data and the like and is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. It will be appreciated that the information retained within fixed disk 326, may, in appropriate cases, be incorporated in standard fashion as virtual memory in memory 324. Removable disk 314 may take the form of any of the computer- readable media described below.
CPU 322 is also coupled to a variety of input/output devices such as display 304, keyboard 310, mouse 312 and speakers 330. In general, an input/output device may be any of: video displays, track balls, mice, keyboards, microphones, touch- sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. CPU 322 optionally may be coupled to another computer or telecommunications network using network interface 340. With such a network interface, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Furthermore, method embodiments of the present invention may execute solely upon CPU 322 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.
Figure 4 is a schematic illustration of an Internet-based embodiment of the current invention. See 400. According to a specific embodiment, a client 402, at a drug discovery site, for example, sends data 408 identifying organic molecules 408 to a processing server, 406 via the Internet 404. The organic molecules are simply the molecules that the client wishes to have analyzed by the current invention. At the processing server 406, the molecules of interest are analyzed by a model 412, which scores them according to whether they are likely to be successful in terms of a specific project, for example.
After the analysis, predicted success values 410 (and any other appropriate information) are sent via the Internet 404 back to the client 402. The computer
system illustrated in Figures 3A and 3B is suitable both for the client 402 and the processing server 406. In a specific embodiment, standard transmission protocols such as TCP/IP (transmission control protocol/internet protocol) are used to communicate between the client 402 and processing server 406. Security measures such as SSL (secure socket layer), VPN (virtual private network) and encryption methods (e.g., public key encryption) can also be used.
OTHER EMBODIMENTS
Although the above discussion has presented the present invention according to specific processes and apparatus, the present invention has a much broader range of applicability. In particular, the present invention is not limited to drag discovery. Other areas of chemical research and development can profit from the multivariate scoring methods of this invention. Examples of other chemical products that could be evaluated and promoted using this invention include pesticides, catalysts, polymer products, pigments, fertilizers, hygiemc products, fuels, and the like. Of course, one of ordinary skill in the art would recognize other variations, modifications, and alternatives.