WO2000079263A2

WO2000079263A2 - Identifying active molecules using physico-chemical parameters

Info

Publication number: WO2000079263A2
Application number: PCT/EP2000/004338
Authority: WO
Inventors: Roger Lahana; Philippe Clair; Abdelaziz Yasri
Original assignee: Synt:Em S.A.
Priority date: 1999-06-18
Filing date: 2000-05-15
Publication date: 2000-12-28
Also published as: AU4565600A; WO2000079263A3

Abstract

The present invention discloses a method of identifying physico-chemical and/or topological parameters which are associated with biological activity, the method using data relating to a set of lead molecules including active molecules and a predetermined set of physico-chemical parameters of which the values are known or obtainable for each of the lead molecules, the method further using a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said molecules.

Description

Identifying Active Molecules using Physico-chemical Parameters

Field of the invention

The present invention relates to methods and apparatus for identifying physico-chemical or topological parameters of molecules which are associated with biochemical activity (e.g. molecules which are suitable for being used as a pharmaceutical for a certain task). The invention further relates to molecules which are identified as being active or potentially active on the basis of the identified physico-chemical or topological parameters.

Background of the invention

Modern high-speed computational techniques are having an ever greater influence on the design of new drugs. In particular, a variety of techniques have been proposed to identify computationally compounds which are likely to have a certain biochemical activity (that is, to be suitable for a particular biological use, e.g. a pharmaceutical use). The purpose of this is so that subsequent in vitro or in vivo testing can be carried out on those molecules which have a relatively high likelihood of being biologically active. That is, in vitro and in vivo testing (which are both relatively expensive to carry out) can be concentrated on drugs which have the maximal chance of being active.

Some such computational techniques for predicting the activity of molecules require and exploit biochemical understanding of why a certain molecule is active. For example, such a method may create a candidate molecule purely as a representation within a computer. Such a molecule, which does not yet necessarily have physical existence, will be referred to herein as a "virtual molecule". The method may then attempt to predict the activity of the virtual molecule by molecular modelling, taking into account biochemical understanding of the role the molecule has to play in order to be active (for example, what chemical bonds the molecule would have to be capable of forming). By contrast, so-called QSAR (Quantitative Structure-Activity Relationship) methods attempt to infer whether a given molecule is active, without specific biochemical understanding. These methods make use of the fact that certain molecules ("lead compounds") may already be known to exhibit the desired activity, at least to a certain degree. The activity of a new molecule can then be inferred based on a comparison between measured and/or calculated physico-chemical properties of the lead compounds and the new molecule. In other words, in QSAR the activity of a new candidate molecule is predicted not on the basis of biological insight, or at least not exclusively on that basis, but rather by inference by comparing known (or easily derivable) physico-chemical properties of the candidate molecule to the properties of lead compounds.

In this document the term physico-chemical parameters will be used to include any physical or chemical property of a molecule, including topological parameters of the molecule such as its folding conformations. It includes both properties which are "static" (at least in a time-averaged sense) such as the dipole moment of a molecule, and also "dynamic" properties of a molecule, such as ones characterising the range of conformations through which the molecule may flex over a period of time. In the case of some molecules, the flexing of the molecule over time can be determined with high accuracy using modern molecular modelling techniques. Each physico-chemical parameter of a molecule partially describes that molecule, and for this reason in this document the term "descriptor" will be used interchangeably with the term "physico-chemical parameter".

The pioneering work initiating the QSAR concept was by Hansch et al. (1962).¹ This work demonstrated that biological activity can be linked quantitatively to some physico-chemical parameters and it also introduced the idea that the activity can be described by more than one parameter (i.e. multiple regression). Following this, some methods have been developed using regression analysis such as partial least squares and multiple linear regression analysis.²⁻³ Other QSAR multivariate statistical methods use principal component analysis⁴ and discriminant analysis.⁵ These methods proceed by reducing the space of initial variables to a three- or two-dimensional space. They allow the classification of the compounds by synthetic linear variables and link the obtained classes to the observed activity.

Although these techniques appear extremely useful, transparent and easy to interpret, they present two major disadvantages.

First, they assume the continuity of the descriptor space (i.e. information carried by the physico-chemical descriptors) with the biological space (biological activity information).

Second, the activity of the molecules is taken as a linear combination of the molecular physico-chemical descriptors.

The continuity between chemical and biological space is only true if we consider molecules in the same chemical series of compounds. Such a series is based on a definition of a common skeleton, with a general shape and fundamental functionalities. The members of these series are then generated by relatively small variations on this common theme. Thus, the use of chemical series restricts the above QSAR techniques to modelling relatively small and continuous variations of activity over members of the series.⁶ Thus, it is not surprising that QSAR models can predict the activity of these series reasonably well by combining their physico-chemical parameters linearly.

However, in a more general class of compounds (e.g. a biological set of compounds), we usually find a discontinuity between the chemical and biological spaces. This arises from the fact that these series contain a collection of different chemical species sharing the same biological message (i.e. same mechanism of action) but which do not obey the common skeleton rule. Consequently, the classical linear QSAR methods are not able to provide accurate predictions of the activity of biological series of compounds. Furthermore, the information derivable from such approaches is balanced by the low robustness of the methods: the constraints (in terms of the range of molecules) within which the predictions are accurate are very strict. Thus, these known QSAR methods, even if they are adequate for lead optimization, are clearly unsatisfactory for de novo design, especially when molecular diversity exploration is a concern.

To deal with the complexity of relationships between biologically active molecules, some known non-linear QSAR methods are based on optimization algorithms using artificial neural networks or genetic

These methods attempt to produce quantitative models which are more robust (valid over a wider range of compounds) at the expense of transparency.

Recently Grassy et al¹⁰ (a citation which is incorporated herein by reference) have suggested a technique which significantly improves on the above QSAR methods. Whereas in QSAR the function f is derived only from lead compounds which are biologically active, Grassy et al also exploited lead compounds which were known to be inactive. Also, whereas classical QSAR uses a simple linear function, Grassy et al employed a more rigorous understanding of the relationship between activity and the space of physico-chemical parameters.

Specifically, rather than combining a number of different descriptors into a single linear function, they determined for each descriptor a range which was associated with biological activity. For example, for the task of causing immunosuppressive activity in a certain environment, it turned out that having a dipole moment in the range 34.23 to 80.79 was associated with activity (i.e. most molecules (e.g. of a certain class) which were already known to be active had a dipole moment in this range, whereas most molecules which were already known to be inactive had a dipole moment outside this range). Candidate molecules having a descriptor value in this range thus have a higher likelihood of biological activity. A candidate molecule for which the values of each of several descriptors are inside a respective range associated with activity is predicted to be active. Grassy et al only used descriptors which can be calculated by computer. This meant that a very large number of virtual molecules could be screened to determine whether they fall into the descriptor ranges, without it being necessary to actually fabricate them. By analogy with in vitro and in vivo screening, Grassy et al refer to their technique as "in silico" screening.
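The range-based prediction described above can be illustrated with a minimal Python sketch. It is not taken from Grassy et al: the function name `passes_filters`, the descriptor names, and the second range are hypothetical, and only the dipole moment range (34.23 to 80.79) comes from the text.

```python
from typing import Dict, Tuple

# Hypothetical filter table: descriptor name -> (low, high) range associated
# with activity. Only the dipole-moment range is quoted from the text; the
# second entry is an invented placeholder for illustration.
FILTERS: Dict[str, Tuple[float, float]] = {
    "dipole_moment": (34.23, 80.79),
    "example_descriptor": (0.0, 1.0),
}


def passes_filters(descriptors: Dict[str, float],
                   filters: Dict[str, Tuple[float, float]] = FILTERS) -> bool:
    """Predict 'active' only if every descriptor lies inside its range."""
    return all(low <= descriptors[name] <= high
               for name, (low, high) in filters.items())


# Two virtual molecules described only by computed descriptor values.
candidate = {"dipole_moment": 50.0, "example_descriptor": 0.4}
rejected = {"dipole_moment": 90.0, "example_descriptor": 0.4}
```

Because each range has hard limits, a molecule is rejected as soon as one descriptor falls outside its range, however favourable the others are.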

Due to the non-linearity of the structure-activity models of Grassy et al (the descriptor ranges have hard limits), their techniques are particularly applicable to predicting the activity of compounds which are not in the same chemical class as the lead compounds.

Note that in vitro or in vivo screening allow one to derive relatively valuable information about a molecule, but only at the considerable expense of chemically fabricating it. By contrast, a single descriptor value obtained computationally tends to give relatively little information about the activity of the molecule. In other words, to achieve accurate predictions of activity several descriptors must be calculated, and information from them combined. Grassy et al showed that combining computationally-calculated descriptor data relating to both static and dynamic descriptors permits biological activity to be predicted relatively well without the need for chemical fabrication. This therefore permits a large number of chemicals to be screened, compared to in vivo or in vitro screening.

Even if we restrict ourselves to descriptors which can be calculated for virtual molecules, the number of possible descriptors which can be envisaged is almost limitless. If a large number of virtual molecules are to be screened, a selection must be made of which descriptors to use so that the computational time required to calculate the descriptors for each molecule does not rise excessively. In addition to, or instead of, computationally calculable descriptors, many known QSAR methods use descriptors which can only be measured by chemical testing, and again a great variety of such descriptors can be envisaged. Therefore any QSAR technique must choose which descriptors to use. This choice may be made on a case-by-case basis, on the basis of biochemical intuition or entirely at random. Preferably the physico-chemical parameters which are chosen are not highly correlated with each other (e.g. the values of any two descriptors are not highly correlated as measured over a large number of molecules) so that there is maximum information per descriptor value.

Attempts to rationally select descriptors have been made before, for example by So and Karplus²⁴ who employed a "genetic algorithm" (GA). GA is one of the recent optimization methods based on the natural selection concept.⁸,¹³⁻¹⁷ To perform a genetic algorithm, two requirements must be satisfied: the encoding of the possible solution of the problem, and the quantitative evaluation of a given solution by an objective function. So and Karplus²⁴ considered the problem of selecting descriptors from a large space of D descriptors (including both descriptors which can be calculated and descriptors which must be found by chemical testing). They considered a string of D components (i.e. with a component for each respective descriptor), each component being either 1 (to indicate that the descriptor is worth using in QSAR) or -1 (to indicate that the descriptor should not be used). Selection of a set of descriptors is encoded as optimising the string within these constraints. To do this, they defined an objective function of the string (i.e. a measure of the badness of the string), and tried to minimise the objective function with respect to possible strings by a GA.

To define an objective function, they constructed for any string a neural network which uses as inputs the descriptors which are +1 in the string, and which is trained to predict biochemical activity of certain lead compounds ("the learning set"). The objective function is then defined as the error rate of the neural network in predicting the activity of other lead compounds ("the test set"), i.e. a cross-validation. In other words the objective function measured the quality of a certain set of descriptors on the basis of the error rate in predicting biological activity by a neural network using those descriptors. A major disadvantage of this technique is that the time taken to train a neural network is very long. Thus, each evaluation of the objective function is computationally expensive. This means that for the GA to work in a reasonable time the total number of descriptors from which the method selects a subset of descriptors must be small. So and Karplus began with only 10 descriptors (of which 3 were calculable and 7 were experimental) and selected 6 using 57 lead compounds. Even this took A days of CPU time on a fast workstation. Using the 6 selected descriptors (including descriptors which can only be evaluated by experiment), So and Karplus obtained 93% prediction accuracy.

However, as pointed out above, to achieve screening in silico only descriptors which can be determined by calculation can be used, and for reasonable prediction success the number of such descriptors must be greater than 6 (e.g. the best 10 descriptors out of the 100 calculable descriptors which are most easily envisaged). For this reason the technique of So and Karplus is inapplicable to selection of descriptors for in silico screening.

Summary of the present invention

The present invention aims to permit a faster rational selection of descriptors, in particular a selection of descriptors which is suitable for use in in silico screening.

The invention may further aim to identify active candidate molecules on the basis of the selected descriptors.

In its most general terms the present invention proposes that a subset of descriptors (e.g. about 10) is selected from a larger set of possible descriptors (e.g. at least 100) based on the statistical significance of correlations between the values of those descriptors for active lead compounds.

For example, in the space of a certain subset of descriptors, the descriptor values of the active lead compounds may be highly correlated (e.g. in comparison to the inactive molecules). This set of values is said to form a tight cluster in the space. The statistical significance of this cluster can be quantified, and this provides a very efficient method of deciding whether the set of descriptors is associated with biochemical activity (i.e. whether the set of descriptors well encodes the biochemical significance of the lead compounds).

This cluster significance analysis (CSA) can be performed relatively quickly (e.g. much more quickly than building a neural network), and therefore descriptors associated with the activity can be selected quickly, even from a large set of possible descriptors. The concept of considering the clustering of the active molecules among the lead compounds (e.g. correlations between the inactive molecules may optionally just be used to evaluate the statistical significance of the clustering of the active ones) is based on the biochemical realisation that there are a variety of reasons why inactive molecules fail, and thus the clustering of the active molecules is of greater significance than clustering of the inactive ones.
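One way to quantify how tightly the active molecules cluster is sketched below. It is an assumption for illustration rather than the exact procedure of the invention: tightness is measured as the mean squared distance between active molecules in the chosen descriptor subspace, and significance is estimated by comparing it with randomly drawn groups of lead compounds of the same size. The function names and the toy data are hypothetical.

```python
import itertools
import random


def mean_squared_distance(points):
    """Average squared Euclidean distance over all pairs of points."""
    pairs = list(itertools.combinations(points, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs)
    return total / len(pairs)


def cluster_significance(values, is_active, subset, n_trials=2000, seed=0):
    """Estimate p: the chance that a random group of lead compounds, equal in
    size to the active group, is at least as tight as the actives are.

    values    -- one row of (normalised) descriptor values per lead compound
    is_active -- one boolean per lead compound
    subset    -- indices of the descriptors spanning the subspace
    """
    rng = random.Random(seed)
    projected = [[row[j] for j in subset] for row in values]
    actives = [p for p, a in zip(projected, is_active) if a]
    observed = mean_squared_distance(actives)
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(projected, len(actives))
        if mean_squared_distance(sample) <= observed:
            hits += 1
    return hits / n_trials


# Toy leads: actives are tight in descriptor 0 and scattered in descriptor 1.
d0 = [0.0, 0.1, 0.2, 0.1, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0, -4.0]
d1 = [-4.0, 4.0, -2.0, 3.0, 0.0, 0.1, 0.2, 0.3, -0.1, -0.2, -0.3, 0.05]
values = [list(row) for row in zip(d0, d1)]
is_active = [True] * 4 + [False] * 8

p_tight = cluster_significance(values, is_active, subset=[0])
p_loose = cluster_significance(values, is_active, subset=[1])
```

A small p indicates that the actives form a cluster unlikely to arise by chance in that subspace. Each evaluation needs only distance computations, with no model training.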

Specifically, the invention proposes a method of identifying physico-chemical and/or topological parameters which are associated with biological activity, the method using data relating to a set of lead molecules including active molecules, of which said activity is known to be at least a predetermined level, and inactive molecules, of which said activity is known to be below said predetermined level, and a predetermined set of physico-chemical and/or topological parameters of which the values are known or obtainable for each of the lead molecules, the method further using a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said molecules, the method comprising the steps of:

(i) selecting a plurality of first subsets of said parameters from among said set of parameters;

(ii) determining the value of said function for each first subset of parameters; and

(iii) selecting at least one second subset of said parameters from among said set of parameters based on the values of said function for the respective first subsets of parameters, whereby the or each second subset of parameters is more closely associated with said activity than the first subsets of parameters.
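Steps (i) to (iii) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the function f is passed in as an arbitrary callable where a high value is taken to mean a closer association with activity, the first subsets are drawn at random, and the second subset is obtained by simple one-parameter swaps starting from the best first subset. The names `select_second_subset` and `toy_f` are hypothetical.

```python
import random
from typing import Callable, FrozenSet, List, Tuple


def select_second_subset(
    n_params: int,
    f: Callable[[FrozenSet[int]], float],
    n_first: int = 10,
    size: int = 3,
    n_steps: int = 200,
    seed: int = 0,
) -> Tuple[List[FrozenSet[int]], FrozenSet[int]]:
    """Return (first subsets, second subset); requires n_params > size."""
    rng = random.Random(seed)
    # (i) select a plurality of first subsets of the parameters
    first = [frozenset(rng.sample(range(n_params), size))
             for _ in range(n_first)]
    # (ii) determine the value of f for each first subset
    best = max(first, key=f)
    # (iii) derive a second subset guided by those values: swap one parameter
    # at a time and keep the swap only if f increases
    for _ in range(n_steps):
        removed = rng.choice(sorted(best))
        added = rng.choice([p for p in range(n_params) if p not in best])
        trial = (best - {removed}) | {added}
        if f(trial) > f(best):
            best = trial
    return first, best


def toy_f(subset: FrozenSet[int]) -> float:
    """Invented stand-in for f: parameters 0, 1 and 2 are the pertinent ones."""
    return float(len(subset & {0, 1, 2}))


first_subsets, second_subset = select_second_subset(20, toy_f)
```

Since swaps are accepted only when f increases, the second subset never scores below the best first subset. A genetic algorithm or simulated annealing could replace the swap loop without changing the three-step structure.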

For example, when a high (low) value of said function (f) is associated with a high value of said statistical significance (p), the method (e.g. iteratively) may select a second subset of parameters with a high (low) value of (f).

The set of parameters are preferably ones which can be determined for a given molecule computationally (e.g. without in vitro or in vivo testing). It may include for example at least 100, at least 200 or at least 500 descriptors. Each subset of descriptors may for example be no more than 50 or 100 descriptors. In further aspects the invention relates to using the selected descriptors to develop criteria for screening candidate molecules (e.g. by building filters from them based on ranges, in the way described in Grassy et al¹⁰), to ways of generating candidate molecules to test using the criteria, to molecules which have been thus derived (and pharmaceuticals based on them), and to apparatus for carrying out all the methods.

Following identification of a molecule in accordance with the method of the present invention, the substance may be investigated further. Furthermore, it may be manufactured and/or used in preparation, i.e. manufacture or formulation, of a composition such as a medicament, pharmaceutical composition or drug. It may itself be used in the generation of mimetic molecules according to the method disclosed herein or any other suitable technique known to those skilled in the art. The designing of molecules according to the method of the invention might be desirable where it is difficult or expensive to synthesise known molecules having the desired behaviour or where it is unsuitable for a particular method of administration, e.g. peptides are not well suited as active agents for oral compositions as they tend to be quickly degraded by proteases in the alimentary canal. The methods disclosed herein may be used to avoid randomly screening large numbers of molecules for a target property.

The molecule or composition may be used in a variety of contexts depending upon the criteria set (e.g. biological activity, physico-chemical properties) in the method of the invention. By way of example the molecules may be used:

(i) as pharmaceuticals, e.g. anti-bacterial molecules, anti-fungal molecules, anti-viral molecules, antibiotics, immuno-stimulatory molecules, e.g. for use in vaccines or immuno-suppressants;

(ii) as cosmetics, e.g. new molecules with a deodorant effect. Undesirable body odours are caused by bacteria, typically gram positive aerobic bacteria e.g. Corynebacterium xerosis or negative coagulase anaerobic micrococci (S. epidermidis). Using the method of the present invention it is possible to design new antibacterial molecules (e.g. peptide-based) whose antibacterial effect is targeted specifically to the odour-causing bacteria;

(iii) in veterinary applications. New molecules may be designed to protect bodily fluids (e.g. semen) from microbial infection during storage. For example, pig semen is typically stored at a relatively high temperature (approximately 20°C). At such temperatures bacteria proliferate. Some of the bacterial strains are resistant to known antibiotics. Using the method of the present invention antibiotic molecules may be designed which have broad spectrum anti-bacterial activity (including anti-gram negative and anti-gram positive activities) whilst not exhibiting significant spermicidal activity;

(iv) as agrochemicals, e.g. mimics of natural peptides having antifungal, antibacterial or antiviral activity may be designed. Such peptides are preferably non-toxic to humans and may be expressed directly in genetically modified plants;

(v) as biomaterials. Molecules may be designed which favour certain dialysis membrane properties. Starting from a known polymeric membrane which is used for dialysis purposes (e.g. human dialysis), new molecules may be designed with improved permeability or dialysis properties; such molecules can be used as additives to the polymeric membrane.

Thus, the present invention extends in various further aspects to a molecule identified or defined in accordance with a method of the present invention, and also a pharmaceutical composition, medicament, drug or other composition comprising such a molecule, a method comprising administration of such a composition to a patient, e.g. for antibiotic/anti-fungal treatment, which may include preventative treatment, use of such a substance in manufacture of a composition for administration, e.g. for antibiotic/antifungal treatment, and a method of making a pharmaceutical composition comprising admixing such a substance with a pharmaceutically acceptable excipient, vehicle or carrier, and optionally other ingredients.

A substance identified using a method of the present invention may be peptide or non-peptide in nature. Non-peptide "small molecules" are often preferred for many in vivo pharmaceutical uses. A convenient way of producing a polypeptide is to express nucleic acid encoding it. This may conveniently be achieved by growing in culture a host cell containing the nucleic acid under appropriate conditions which cause or allow expression of the polypeptide. The nucleic acid may be introduced alone or as part of a vector, and may be extragenomic or integrated into the genome. Polypeptides may also be expressed in in vitro systems, such as reticulocyte lysate.

Systems for cloning and expression of a polypeptide in a variety of different host cells are well known. Suitable host cells include bacteria, eukaryotic cells such as mammalian and yeast, and baculovirus systems. Mammalian cell lines available in the art for expression of a heterologous polypeptide include Chinese hamster ovary cells, HeLa cells, baby hamster kidney cells, COS cells and many others. A common, preferred bacterial host

Suitable vectors can be chosen or constructed, containing appropriate regulatory sequences, including promoter sequences, terminator fragments, polyadenylation sequences, enhancer sequences, marker genes and other sequences as appropriate. Vectors may be plasmids, viral e.g. 'phage, or phagemid, as appropriate. For further details see, for example, Molecular Cloning: a Laboratory Manual: 2nd edition, Sambrook et al., 1989, Cold Spring Harbor Laboratory Press. Many known techniques and protocols for manipulation of nucleic acid, for example in preparation of nucleic acid constructs, mutagenesis, sequencing, introduction of DNA into cells and gene expression, and analysis of proteins, are described in detail in Current Protocols in Molecular Biology, Ausubel et al. eds., John Wiley & Sons, 1992.

The introduction of DNA may employ any available technique. For eukaryotic cells, suitable techniques may include calcium phosphate transfection, DEAE-Dextran, electroporation, liposome-mediated transfection and transduction using retrovirus or other virus, e.g. vaccinia or, for insect cells, baculovirus. For bacterial cells, suitable techniques may include calcium chloride transformation, electroporation and transfection using bacteriophage. Following production by expression, a polypeptide may be isolated and/or purified from the host cell and/or culture medium, as the case may be.

Peptides can also be generated wholly or partly by chemical synthesis. The well-established, standard liquid or solid-phase peptide synthesis methods can be used, general descriptions of which are broadly available (see, for example, in J.M. Stewart and J.D. Young, Solid Phase Peptide Synthesis, 2nd edition, Pierce Chemical Company, Rockford, Illinois (1984); in M. Bodanszky and A. Bodanszky, The Practice of Peptide Synthesis, Springer Verlag, New York (1984); and Applied Biosystems 430A Users Manual, ABI Inc., Foster City, California), or they may be prepared in solution, by the liquid phase method or by any combination of solid-phase, liquid phase and solution chemistry, e.g. by first completing the respective peptide portion and then, if desired and appropriate, after removal of any protecting groups being present, by introduction of the residue X by reaction of the respective carbonic or sulfonic acid or a reactive derivative thereof.

Pharmaceutical compositions according to the present invention, and for use in accordance with the present invention, may include, in addition to active ingredient, a pharmaceutically acceptable excipient, carrier, buffer, stabiliser or other materials well known to those skilled in the art. Such materials should be non-toxic and should not interfere with the efficacy of the active ingredient. The precise nature of the carrier or other material will depend on the route of administration, which may be oral, or by injection, e.g. cutaneous, subcutaneous or intravenous. Pharmaceutical compositions for oral administration may be in tablet, capsule, powder or liquid form. A tablet may include a solid carrier such as gelatin or an adjuvant. Liquid pharmaceutical compositions generally include a liquid carrier such as water, petroleum, animal or vegetable oils, mineral oil or synthetic oil. Physiological saline solution, dextrose or other saccharide solution or glycols such as ethylene glycol, propylene glycol or polyethylene glycol may be included. For intravenous, cutaneous or subcutaneous injection, or injection at the site of affliction, the active ingredient will be in the form of a parenterally acceptable aqueous solution which is pyrogen-free and has suitable pH, isotonicity and stability. Those of relevant skill in the art are well able to prepare suitable solutions using, for example, isotonic vehicles such as Sodium Chloride Injection, Ringer's Injection, Lactated Ringer's Injection. Preservatives, stabilisers, buffers, antioxidants and/or other additives may be included, as required.

A molecule defined by a method of the present invention, or a composition containing such a molecule may be provided in a kit, e.g. sealed in a suitable container which protects its contents from the external environment. Such a kit may include instructions for use.

The invention may alternatively be expressed as a method of identifying physico-chemical and/or topological parameters which are associated with biological activity, the method using data relating to a set of lead molecules including active molecules, of which said activity is known to be at least a predetermined level, and inactive molecules, of which said activity is known to be below said predetermined level, and a predetermined set of physico-chemical and/or topological parameters of which the values are known or obtainable for each of the lead molecules, the method further using a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said molecules, the method comprising the steps of iteratively determining subsets of said parameters having a respective value of said function (f) which is progressively (i.e. on successive iterations) more highly associated with a high value of said statistical significance.

Although the method has been defined above in relation to the selection of a minimal set of pertinent descriptors (e.g. physico-chemical and/or topological descriptors) characterising a molecule, in fact the invention is in principle applicable more generally (e.g. outside the field of chemistry) for the selection of a minimal set of pertinent independent variables in order to discriminate between populations of individuals (elements) with regard to observed, estimated or calculated features. The method of the invention is thus relevant to numerous fields including econometrics, agronomy, opinion polling, marketing, criminology, etc. That is, the method may be expressed as a method of identifying parameters associated with a feature, the method using data (observed, estimated or calculated) distinguishing, among a plurality of lead individuals, active individuals which have that feature and inactive individuals which do not have that feature, and a set of parameters known or obtainable for each lead individual, the method using a function defined for any subset of parameters and which depends on the statistical significance of correlations between the parameter values of that subset of the active individuals in relation to the inactive individuals, the method comprising determining (e.g. iteratively) a subset of parameters having a value of said function (f) associated with a high value of said statistical significance. The iterative method is preferably by a genetic algorithm as described below, but may alternatively in principle be by any other iterative algorithm (e.g. simulated annealing).

Embodiments of the invention will now be described as non-limiting examples, with reference to the accompanying drawings.

Brief description of the figures

Fig. 1 illustrates the principle of the in silico screening approach. The active and inactive classes of molecules are represented as distributions in a given physico-chemical parameter.

Fig. 2 shows the iterative process in which, once descriptor filters are set, virtual molecules obeying the filters are generated in an embodiment of the invention.

Fig. 3 shows six descriptor maps based on benzodiazepine affinity.

Fig. 4 shows experimental vs. predicted values for benzodiazepine affinity (Fig. 4(a)), and immunosuppressive peptide activity (Fig. 4(b)), for NN models built using descriptor subsets selected by the embodiment.

Fig. 5 illustrates encoding of genetic chromosomes and steps used to select descriptor subsets by a genetic algorithm combined with cluster significance analysis, according to the invention.

Fig. 6 shows evolution of the maximum and minimum fitness (A), and variance and standard deviation (B) during the optimization.

Fig. 7 shows evolution of the cluster significance and normalised mean squared distance NMSD (A) and correlation percentage (NC²) in the descriptor subsets (B) during the optimization.

Fig. 8 shows the correlation matrix of the descriptor subset before (A) and after (B) the GA-CSA selection by the embodiment of the invention.

Fig. 9 shows experimental vs. predicted logIC50 values for NN models built by descriptor subset S-21 selected by the embodiment of the invention for the benzodiazepine data set.

Description of embodiments

Immunosuppressive peptides derived from HLA Class 1 protein, which are flexible molecules, were chosen to illustrate an application of the different capabilities of the invention: descriptor selection, building activity filters, molecular de novo design, and screening of new candidates. As a shorthand we will refer to the embodiment as "Oasis". Oasis involves two principal parts: a first part which treats the rational selection of pertinent descriptors by using a module we will refer to as "VarSelect", and a second part which involves four interconnected modules for lead optimisation and de novo design. We will refer to these modules as the "Generator", "Builder", "Descriptor", and "Evaluator" modules.

Principle of VarSelect Module

The module VarSelect of OASIS was used to select the pertinent subset of descriptors from the initial set. This module is based on a method according to the invention which combines genetic algorithms (GA)¹¹ and cluster significance analysis (CSA).¹²

Descriptors

The method uses N lead compounds, including some (a number N_a) which are known to be active (e.g. in the sense of having an activity level above a predetermined level) and some which are known to be inactive (e.g. in the sense of having an activity level below the predetermined level).

We consider a set of D possible descriptors from which a subset of d descriptors is to be chosen. Preferably, the value of d is not predetermined, but rather is optimised during the algorithm. The values of all descriptors for all N lead compounds are assumed to be known (or determinable) .

For statistical reasons the descriptor values are normalised. That is, for each given descriptor, the descriptor value is preadjusted so that its average over the lead compounds is 0, and its standard deviation is 1.
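The normalisation step can be sketched as follows. This is only an illustration: the function name `normalise` and the use of the population standard deviation are assumptions, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

def normalise(columns):
    """Scale each descriptor column to mean 0 and standard deviation 1.

    `columns` holds one list per descriptor, with one value per lead compound.
    Assumption: the population standard deviation is used.
    """
    out = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A zero-variance descriptor cannot be scaled; map it to zeros here.
        out.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    return out
```

After this step every descriptor contributes on the same scale to the distance calculations used later by the CSA.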

Genetic Algorithm

As mentioned above, GA is one of the recent optimization methods based on the natural selection concept⁸,¹³⁻¹⁷ (these citations are incorporated herein by reference). To perform a genetic algorithm, two requirements must be satisfied: the encoding of the possible solutions of the problem, and the quantitative evaluation of a given solution by an objective function. In this embodiment, the objective function employs cluster significance analysis (CSA). To encode the problem, each descriptor subset is expressed as a binary string of D digits (a "chromosome"). Each digit is either one or zero to indicate the presence or absence of the respective descriptor in the subset. The length of each string is the same and is equal to the total starting number of descriptors. At the beginning of the process, a randomly generated population of descriptor subsets is evaluated by CSA (see below). A pair of chromosomes with high scores is randomly selected by the roulette wheel selection method to serve as parents, and a pair of children is generated by randomly performing a cross-over of the parents' genes so that each child is derived from part of the genes of each parent. Both chromosomes associated with the respective children are subjected to single-point mutation, that is, a randomly selected one (or zero) is changed to zero (or one), and are then evaluated; those that have high scores replace the old chromosomes. The genetic operation is repeated until a predefined maximum number of steps or a predefined convergence criterion is reached. As the convergence criterion, we use the variance of the fitness function in the evolved population: the calculations are terminated when the variance is equal to zero (or a predefined low value).
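The genetic steps described above (binary encoding, roulette wheel selection, cross-over, single-point mutation, replacement and the variance-based stopping rule) can be sketched as below. This is a minimal sketch, not the OASIS implementation: all function names are hypothetical, the `score` callable is assumed to return a positive value to be maximised, and children are assumed to replace the currently worst chromosome.

```python
import random
from statistics import pvariance

def roulette_pick(population, scores, rng):
    """Pick one chromosome with probability proportional to its score."""
    return rng.choices(population, weights=scores, k=1)[0]

def crossover(a, b, rng):
    """Single cut point: each child takes a prefix from one parent."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, rng):
    """Single-point mutation: flip one randomly chosen digit."""
    i = rng.randrange(len(chrom))
    return chrom[:i] + [1 - chrom[i]] + chrom[i + 1:]

def evolve(score, n_bits, pop_size=20, max_steps=500, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(max_steps):
        scores = [score(c) for c in pop]
        if pvariance(scores) == 0:  # convergence: no spread left in fitness
            break
        p1 = roulette_pick(pop, scores, rng)
        p2 = roulette_pick(pop, scores, rng)
        for child in crossover(p1, p2, rng):
            child = mutate(child, rng)
            worst = min(range(pop_size), key=lambda k: scores[k])
            if score(child) > scores[worst]:  # better children replace old ones
                pop[worst] = child
                scores[worst] = score(child)
    return max(pop, key=score)
```

For example, `evolve(lambda c: 1 + sum(c), n_bits=8)` evolves 8-digit chromosomes under a toy score. Because the fitness function F described later is to be minimised, a real run would pass some decreasing transform of F as `score`; which transform is used is not stated in the text.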

In summary, each parent in our method represents a combination of randomly chosen descriptors, and the purpose of the calculation is to evolve the initial set of descriptors into a population with the highest significance value of the CSA.

Objective function

This is a function of any string of D elements, and is evaluated in the following way.

CSA measures the statistical significance p of a classification of data molecules with a given subset of descriptors (i.e. the statistical significance of the classification of the molecules by the d descriptors of the string which have value +1). This is done by calculating the mean squared distance (MSD) among the N_a active molecules, i.e. the sum of the squared distances d between each pair of active molecules, divided by the number C of pairs of active molecules.

$$\mathrm{MSD} = \frac{1}{C} \sum_{i<j} d^{2}(i,j) \qquad (1)$$

where i and j are integers summed over 1,...,N_a with i<j, and d(i,j) is a sum over the d descriptors (i.e. the descriptors which have value +1 in the string) of the differences in the value of that descriptor between the i-th and j-th molecules among the active molecules.

This quantity MSD may then be calculated for all other possible subsets of N_a molecules from the set of N lead compounds (i.e. both active and inactive groups) . The proportion of such subsets which have an MSD higher than the MSD for the active set is called p.

Alternatively (if the number of molecules or the number of descriptors increases, so that this process becomes very lengthy) the quantity MSD is calculated using 1000 randomly selected subsets of N_a molecules from among the N active and inactive molecules. In this case, p is defined as the proportion of randomly selected subsets that have MSDs higher than the MSD of the active group. p measures the statistical significance of the cluster of active molecules in the space defined by the d descriptors (i.e. the probability that a cluster as tight as the one of active molecules has not arisen only by chance). In general we are seeking a set of descriptors for which the value of p is high.
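The random-sampling estimate of p can be sketched as follows. The names `msd` and `csa_p` are hypothetical, and the sketch assumes that the squared distance between two molecules is the squared Euclidean distance over the selected (normalised) descriptor values, as is usual in cluster significance analysis.

```python
import random
from itertools import combinations

def msd(points):
    """Mean squared distance over all pairs of rows of descriptor values."""
    pairs = list(combinations(points, 2))
    total = sum(sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)

def csa_p(molecules, active_idx, n_samples=1000, seed=0):
    """Fraction of random subsets of size N_a whose MSD exceeds the active MSD."""
    rng = random.Random(seed)
    active_msd = msd([molecules[i] for i in active_idx])
    higher = 0
    for _ in range(n_samples):
        # Draw N_a molecules from all N lead compounds, active and inactive.
        subset = rng.sample(range(len(molecules)), len(active_idx))
        if msd([molecules[i] for i in subset]) > active_msd:
            higher += 1
    return higher / n_samples
```

With a tight cluster of actives and scattered inactives, almost every random subset is looser than the active set, so p approaches 1.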

A problem with using a function of p on its own as the objective function is that two subsets of descriptors (each with an equal number d of descriptors) could have the same p value although their MSD quantities are different. To take this into account we added to the objective function a term depending on the MSD value.

This creates a further problem, namely that the quantity MSD for the active molecules depends on the number of descriptors d used in the distance calculations. Thus, when comparing two descriptor subsets for the same classification, we normalised MSD for the active molecules by dividing it by the number of the +1 descriptors in the string. The resultant quantity we called NMSD.

An additional objective of the GA is to reduce the number of correlated descriptor pairs within each descriptor subset. To take this into account, we added to our objective function a term which tended to produce lower correlation, using a variable NC defined as the total number of pairs of descriptors in the subset represented by the string for which the values of the two descriptors are correlated across the lead compounds (e.g. to above a predetermined level).
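Counting the correlated pairs NC for one chromosome might look like this. It is a sketch under assumptions: the Pearson coefficient is taken as the correlation measure, and the `threshold` default of 0.7 is a hypothetical choice, since the text only speaks of "a predetermined level".

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long value lists (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def count_correlated_pairs(columns, chromosome, threshold=0.7):
    """NC: number of selected descriptor pairs with |r| above the threshold."""
    selected = [col for col, bit in zip(columns, chromosome) if bit]
    return sum(1 for a, b in combinations(selected, 2)
               if abs(pearson(a, b)) > threshold)
```

The helper assumes that zero-variance descriptors have already been removed, which is what the worked examples in this document do before running VarSelect.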

Taking all these points into account, our objective function preferably contained 3 parameters: the normalised mean squared distance NMSD (to be minimised); the statistical significance p (to be maximised); and the number of correlated descriptors NC (to be minimised). Consequently, our fitness function (objective function) F is

$$F = \mathrm{NMSD}^{2} + (1/p)^{2} + \mathrm{NC}^{2} \qquad (2)$$

Note that the exact form of this expression may be varied within the scope of the invention.
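As a worked illustration, the fitness can be evaluated as below, assuming equation (2) has the form F = NMSD² + (1/p)² + NC². The function name and the handling of p = 0 are assumptions made for this sketch.

```python
def fitness(msd_active, n_selected, p, nc):
    """F = NMSD^2 + (1/p)^2 + NC^2, to be minimised (assumed form of eq. 2).

    msd_active: MSD of the active molecules for the selected descriptors
    n_selected: number of +1 digits in the chromosome
    p:          CSA significance, between 0 and 1
    nc:         number of correlated descriptor pairs in the subset
    """
    if p == 0:
        return float("inf")  # assumption: a non-significant cluster is rejected
    nmsd = msd_active / n_selected
    return nmsd ** 2 + (1.0 / p) ** 2 + nc ** 2
```

For instance, `fitness(4.0, 2, 0.5, 1)` gives 2² + 2² + 1² = 9. A tight, significant and uncorrelated subset drives all three terms down, with the second term bounded below by 1.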

Principles of In Silico Screening:

The following text explains how, once descriptors have been chosen, they are used to identify candidate molecules predicted to have high activity.

Mapping of the activity

As described above, in silico screening is a qualitative technique consisting of an evaluation of the distribution (global or percentwise) of the active and inactive molecules as a function of the values of given parameters (descriptors). For example, Fig. 1 shows on the horizontal axis the values of a descriptor. The upper shaded area (light shading) shows the range of values for that descriptor of lead compounds which are known to be active. The lower shaded area (dark shading) shows the range of values for that descriptor of lead compounds which are known to be inactive. For each class, the limiting values are shown. Min_a and Max_a are the minimum and maximum values for the active class. Min_in and Max_in are the minimum and maximum values for the inactive class. From Fig. 1 the limiting values of this descriptor associated with activity can be deduced. This is called a "filter". By combining a plurality of filters, activity can be predicted.

This method, which is easy and fast, gives a diagnosis of the qualitative non-linear dependencies between the activity and molecular properties, and so allows the building of physico-chemical filters for the activity of interest.
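The idea of a filter can be sketched in a few lines. This is a simplified illustration in which a filter accepts any value inside the active interval; the function names and the descriptor names in the usage note are hypothetical, and the finer rules for choosing a zone to explore are not modelled here.

```python
def interval(values):
    """Return the (min, max) limiting values of a class for one descriptor."""
    return min(values), max(values)

def build_filter(active_values, inactive_values):
    """Collect the four limiting values shown in Fig. 1 for one descriptor."""
    min_a, max_a = interval(active_values)
    min_in, max_in = interval(inactive_values)
    return {"min_a": min_a, "max_a": max_a, "min_in": min_in, "max_in": max_in}

def passes(value, flt):
    """Simplest case: the value lies inside the active interval."""
    return flt["min_a"] <= value <= flt["max_a"]

def predicted_active(candidate, filters):
    """Combine several filters: every descriptor must pass its own filter."""
    return all(passes(candidate[name], flt) for name, flt in filters.items())
```

Given `filters = {"logP": build_filter([2.0, 3.5], [0.1, 1.0])}`, a candidate `{"logP": 2.8}` is predicted active while `{"logP": 0.5}` is not.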

The experimental activity database is produced by the Descriptor module of OASIS, which associates a numerical vector with each molecule in the database. The vector components are the values of calculated physico-chemical parameters. The activity values of the compounds are then divided into a user-defined number of classes, and the numerical descriptor values associated with each activity class are output or converted to maps by the Evaluator module of OASIS.

Extraction of Activity Filters

After building the activity maps for each activity class of compounds with the various descriptors selected by the VarSelect module, each descriptor is taken separately in order to extract the limiting values of the activity. This extraction is based on the comparison of the point densities associated with active and inactive molecules in the given map. Although Fig. 1 showed an ideal case in which this process is relatively straightforward, different situations are possible according to the distributions of the active and inactive molecules. This is illustrated in Table 1, which shows (in the left column) 10 descriptor maps drawn in a way analogous to Figure 1.

For each map, a label (middle column; the lower row) is derived which represents the way the activity interval (i.e. the range of active molecules) is positioned with regards to the inactivity interval. For example, the symbol ">" is attributed to a map where the activity interval extends to the right past the end of the inactivity interval. The symbol ">|--|<" signifies that the inactivity interval extends at both the right and left ends by more than the activity interval.

The maps and the associated symbols determine segments to explore for each parameter (e.g. the values of the descriptor which a good candidate molecule being screened should have). Information is given by each map indicating the zone to be explored (in Table 1, the right column; upper row), and the expected usefulness ("validity") of the filter in predicting activity (the right column; lower row). For example, when a map possesses the symbol ">", this means that during the screening step, the zone to be explored is at the right of the parameter interval, and that this filter is considered to be a good one ("good"). A descriptor for which both ends of the active range are displaced in the same direction from the inactive range is labelled "best".

In practice, the filter extraction and the symbol attribution are performed by the Evaluator module of OASIS.

Screening New Candidates

The extracted filters are used to screen new candidates for the activity of interest. These new molecules can be generated during a virtual combinatorial explosion (e.g. exploring all possible compounds within a given range, which is the technique used in Grassy et al¹⁰) or by a non-systematic approach (such as a genetic algorithm). OASIS, in its Generator module, uses a genetic algorithm to generate a population of molecules. The same genetic algorithm engine (software section) as that used in the descriptor selection step is used here with a different encoding process. The screening of the GA-proposed molecules is performed in a cyclic way as shown in Fig. 2. A population of new candidate virtual molecules is built by the Builder module of OASIS. This module converts the GA-encoded molecules to SMILES-encoded molecules¹⁸,¹⁹, and then uses this encoding to deduce 3-D structures via Corina software²⁰.

The built population of molecules is transmitted to the Descriptor module, which uses the descriptors contained in the filters to describe the molecules. When this step is completed, the described molecules are evaluated by the Evaluator module using the filters. Based on the filters, a satisfaction ("score") is attributed to each molecule in the population, and this information is returned to the Generator module, which in the next iteration generates new candidate molecules using a GA.

During each screening iteration, the Generator module attempts to maximise the score. The cyclic process stops when the variance in the scores of molecules in the generated population is equal to zero.
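The scoring and stopping rule of the screening cycle can be illustrated as below. The text does not define the score precisely; this sketch assumes it is the percentage of filters a molecule satisfies, and it represents each filter as a hypothetical `(low, high)` pair.

```python
from statistics import pvariance

def satisfaction(descriptors, filters):
    """Percentage of filters whose (low, high) range contains the value."""
    hits = sum(1 for name, (lo, hi) in filters.items()
               if lo <= descriptors[name] <= hi)
    return 100.0 * hits / len(filters)

def converged(scores, tol=0.0):
    """Stop the cycle when the score variance in the population is zero."""
    return pvariance(scores) <= tol
```

With `filters = {"logP": (1.0, 3.0), "mass": (200.0, 400.0)}`, a molecule described by `{"logP": 2.0, "mass": 450.0}` scores 50.0. The cycle would end once every molecule in the generated population has the same score.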

Computation Details

The OASIS program is implemented in ANSI C code and runs on an SGI Iris workstation. The graphical user interface is based on the Xt Intrinsics/Motif libraries. The architecture of OASIS integrates five interconnected modules: the VarSelect, Generator, Builder, Descriptor, and Evaluator modules. As input/output, the program reads and produces ASCII files.

APPLICATIONS OF OASIS

1. Benzodiazepine Data Set

Benzodiazepines are well known as anxiolytics, tranquillisers, and anticonvulsants in epilepsy treatment. In this work, we used a set of 54 benzodiazepine analogues whose biological activities (IC50 values) are derived from the work of Haefely et al. (1985).²¹

A set of 766 descriptors was computed for the benzodiazepine data set by using the MolconnZ 3.15 and TSAR V3.1 software.²²,²³ The descriptors with null variance were removed from the analysis, resulting in 312 descriptors. Fig. 3 shows maps for a selection of 6 of the remaining descriptors. Again, in each map the lower row shows descriptor values for the inactive molecules, while the upper row shows descriptor values for the active molecules. The bright dot in each of the upper rows indicates the active molecule having the highest activity.

In order to apply the VarSelect module in descriptor selection, the benzodiazepine derivatives were divided into two classes according to their logIC50 values: a class with logIC50 values below 0.8 representing the class of active molecules, and a class whose logIC50 values are higher than 0.8 corresponding to the class of inactive molecules. The data set containing the molecules belonging to the two activity classes, described by the 312 descriptors, was submitted to the VarSelect module of OASIS. The GA of the VarSelect module was run with a population size of 100 descriptor subsets. The population was evolved until its fitness variance reached a value of zero.

The optimisation process by the VarSelect module converged after 60 generations evaluated during the GA evolution, resulting in only 53 non-correlated descriptors. These selected parameters were submitted to different QSAR analysis techniques to build a predictive model for the benzodiazepine data set. These techniques were principal component analysis (PCA), multiple linear regression (MLR), partial least squares (PLS), and backpropagation artificial neural networks (NN). In PCA, the first 3 principal components contained only 37.46% of the total variance, with 29.17% explained by the first 2 principal components. The whole variance is explained by the first 44 components.

The use of PLS and MLR techniques to build a QSAR model for the benzodiazepine data set gave regression terms R² of 0.57 and 0.76, and cross-validated terms R²(CV) of 0.42 and 0.31, respectively.

Finally, we used a backpropagation neural network (NN) that contained 53 inputs, 2 hidden layers, and 1 output (logIC50 value). In this NN, weights and biases were optimized by a Monte Carlo algorithm; the training step was achieved in 807 cycles with a best root-mean-square fit of 0.032. The best model had a training term R² of 0.94 and a cross-validated term R²(CV) of 0.85. A plot of predicted vs. experimental logIC50 values for the best model is shown in Fig. 4(a). The predictive power of the model was determined by using 30% of the initial data for the cross-validation.
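The validation scheme (holding out 30% of the data and reporting an R² term) can be sketched as follows. Both helpers are hypothetical: the text does not state how the 30% is drawn, and the sketch assumes R² is the usual coefficient of determination.

```python
import random

def holdout_split(items, test_fraction=0.3, seed=0):
    """Shuffle and split items into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

For the 54 analogues used here, `holdout_split(list(range(54)))` leaves 38 compounds for training and 16 for validation. Evaluating `r_squared` on the training and held-out parts would give quantities analogous to the R² and R²(CV) terms quoted above.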

The attempt to build QSAR models with the descriptors selected by VarSelect module failed when using the linear methods (PCA, PLS, and MLR) . By contrast, we obtained a satisfactory QSAR model by using artificial NNs .

2. Immunosuppressive Peptide Data Set

In Grassy et al¹⁰, we successfully applied the in silico screening method to identify a new immunosuppressive peptide to prevent allograft rejection in mice. Our descriptor selection was based on knowledge of the immunosuppressive peptide biology and chemistry.

By contrast, in the present work, we used the VarSelect module of OASIS to perform the rational choice of QSAR descriptors without any knowledge of peptide biology or chemistry. The biological activities of the peptides used are shown in Table 2.

Initially, the structural models of the immunosuppressive peptides were built by the Builder module of OASIS. Physicochemical and topological descriptors were generated by TSAR V3.1²³, resulting in 83 descriptors representing 19 peptides. The VarSelect module was used to select a pertinent subset of descriptors. A set of five additional peptides (RDP1257, RDP1259, RDP1271, RDP1277 and RDP1258) with known immunosuppressive activities was kept to test the validity of the predictive power of the QSAR model.

The optimisation process of the VarSelect module converged after 254 iterations by evolving a population of 50 chromosomes (i.e. 50 descriptor subsets). The best descriptor subset contained 22 uncorrelated and significantly clustering descriptors.

The 22 VarSelect-selected descriptors were used as static filters to screen the test peptides by using their maps in the Evaluator module of OASIS.

This screening resulted in 5 peptides which satisfy 100% of the filters and which are predicted to be active. The activity prediction by the OASIS static filters is consistent with the experimental activity of the peptides except for the RDP1277 peptide.

Once more, the use of linear methods failed to build a good QSAR model for immunosuppressive peptide activity, whereas non-linear methods provided a powerful predictive model. In fact, PCA on the 22 selected descriptors resulted in only 56.97% of explained variance in the first 3 principal components. MLR and PLS resulted in poor predictive models, since the regression terms R² were about 0.46 and the cross-validated R²(CV) terms were about 0.11. On the other hand, the use of backpropagation NNs with a 22-2-1 architecture resulted in a training term R² of 0.88 and a cross-validated term R²(CV) of 0.77. A plot of calculated vs. observed activities for the final model is shown in Fig. 4(b).

As an additional test of the validity of the NN model using the selected 22 descriptors, we predicted the immunosuppressive activity of the compounds RDP1257, RDP1259, RDP1271, RDP1277, and RDP1258. The result of these predictions is summarised in Table 2. The predicted activities were highly consistent with the experimental activities except for compound RDP1277. Although the predicted value for compound RDP1277 is different from the observed one, it remains higher than the values for the inactive molecules. It is noteworthy that the 22 selected descriptors are of the same nature as those used in our previous study¹⁰, which were selected on the basis of biological and chemical knowledge of the immunosuppressive peptides. These results show that the choice of descriptors by OASIS makes good chemical and biological sense, and can be used to build highly predictive QSAR models in conjunction with artificial neural networks.

3. A second benzodiazepine example

In this example we used a set of 60 benzodiazepine analogues (Table 3) whose biological activity is derived from the work of Haefely et al²¹. Oasis software was used to generate chemical structures in SMILES code form, the generated structures were introduced into TSAR 3.1 software (produced by the company Oxford Molecular), and a set of 105 two-dimensional (2D) molecular descriptors was computed. Descriptors with zero variance were removed, and GA-CSA was then performed to reduce the number of descriptors to 21 (shown in Table 4). The calculations were performed on a Silicon Graphics Origin 200 workstation.

In the genetic process the population size is important for obtaining a good solution. If the population size is too small, there is not enough genetic diversity to make a good solution evolve. A larger population can broaden the genetic diversity, which may evolve into a much higher fitness score, but this will require more time. In this case, the size of the population containing the descriptor subsets was set to 40, and the population was evolved until its fitness variance reached a value of substantially zero.

The GA is illustrated in Fig. 5. The minimum, the maximum, the variance and the standard deviation of the fitness are shown in Fig. 6. The evolution of these parameters indicated that the minimum score remains almost unchanged after the 85th generation. This convergence can be seen in the decay of the fitness variance of the population during the genetic step (Fig. 6(b)). The remaining fluctuations of the variance are due to the use of elitism by the GA, which retains the best descriptor subset and continues to create new subsets randomly at every generation.

Fig. 7 shows the evolution of the statistical significance (darker line in Fig. 7(a)), the normalized mean squared distance (lighter line in Fig. 7(a)), and the correlation percentage (Fig. 7(b)). Here the horizontal axis numbers chromosomes in the order of their generation, 40 per iteration. The inset graphs show the behaviour during the first 800 chromosomes. NMSD undergoes a slow decay and reaches a stable plateau after 2400 solutions, whereas the statistical significance presents an inverse evolution and a rapid stabilization in only 1000 tested solutions. These inverted evolutions suggest that the lower the NMSD and the higher the statistical significance, the better the descriptor subset. The best emergent descriptor subset (S-21) from the GA-CSA algorithm contained 21 uncorrelated descriptors out of 105 initial descriptors, as shown in Fig. 8(b). It is clear that the GA-CSA algorithm converged to a solution containing fewer correlated descriptors (4%) than the initial descriptor set (Fig. 8(a)). The remaining 9 correlations in the best 21-descriptor solution were in the interval ]-0.7, 0.7[.

The S-21 descriptor subset selected by the GA-CSA method was submitted to a backpropagation neural network (NN) that contained 21 inputs, 2 hidden layers, and 1 output (logIC50 value). In this NN, weights and biases were optimized by a Monte Carlo algorithm; the training step was achieved in 1972 cycles with a best root-mean-square fit of 0.031. The best model had a training term R² of 0.933 and a cross-validated term R²(CV) of 0.87. A plot of predicted vs. experimental logIC50 values for the best model is shown in Fig. 9. The predictive power of the model was determined by using 30% of the initial data for the cross-validation.

The selected descriptors contain information on the nature of the substituents, the molecular shape, the charge, the hydrophobicity, the connectivity, the topology, and some atomic elements (Table 4). All these descriptors describe essential interactions such as steric, electronic, and hydrophobic parameters, which are dominant factors in receptor-drug interactions. These results show that the choice of descriptors by GA-CSA makes good chemical sense.

The final NN model was compared to 10 NN models built by a random selection of descriptors with either the same number of descriptors (R21-1, R21-2, R21-3, R21-4 and R21-5) or different numbers of descriptors (R8, R15, R30, R50, and R100). We also compared the final NN model to a model built using all 105 initial descriptors (1-105), and to one built using the 84 descriptors (Rm-84) which were removed during the genetic process. The R² and R²(CV) values of these NN models are shown in Table 5. In this comparison, the best NN model, built using the S-21 descriptor subset selected by GA-CSA, showed the highest R² and R²(CV) values. The randomly selected descriptors did not allow a powerful NN to be built.

In order to test the reproducibility of the GA-CSA, which belongs by its nature to random search techniques, different random sets of initial populations were used (RS1, RS2, RS3, RS4, RS5 and RS6), and evolved until the fitness variance in the population reached a value of zero. The results are summarized in Table 6. The different descriptor subsets are compared on the basis of the fitness value and the R² and R²(CV) terms obtained for QSAR models built by NNs. Similar fitness values, which ranged from 4.98 to 5.12, were observed. The number of descriptors selected by GA-CSA varied between 21 and 25. However, the solution possessing the minimum value of the fitness function corresponded to the best R² and R²(CV) values. The best solution RS1 contained descriptors of the same nature as the S-21 subset analysed above. Descriptors differing between solutions RS1 and S-21 are correlated with a correlation coefficient higher than 0.8.

Many variations are possible on the embodiment described above without departing from the scope of the invention. For example, although in the embodiment after the descriptors have been selected candidate molecules are derived using a genetic algorithm with successive screening of generations, an alternative is to predefine a class of molecules (e.g. by a "combinatorial explosion" in which all possible combinations of possible atomic selections are considered) and use the selected descriptors to screen all of them.

Also, whereas in the embodiment the GA only seeks to optimise the subset of descriptors which is selected, it is possible for additional variables to be simultaneously optimised, for example a numerical parameter of each respective descriptor which indicates the importance of that descriptor in determining or predicting activity.

Furthermore, as described above, once the descriptors are selected there are a variety of ways in which screening can be performed using them, for example using filters derived on the basis of active and inactive intervals (as illustrated in Fig. 1, and as used in Grassy et al), or by using a neural network to predict activity levels of candidate molecules.

REFERENCES

The disclosure of all the following documents is incorporated herein by reference.

1. Hansch, C.; Muir, R.M.; Fujita, T.; Maloney, P.P.; Geiger, F.; Streich, M. The correlation of biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients. J. Am. Chem. Soc. 1963, 85, 2817-2824.
2. Stahle, L.; Wold, S. Progress in Medicinal Chemistry, eds. Ellis, G.P. and West, G.B. Elsevier, 1988, Vol. 25.
3. Draper, N.R.; Smith, H. Applied Regression Analysis, 2nd edition. John Wiley & Sons, 1981.
4. Chatfield, C.; Collins, A.K. Introduction to Multivariate Analysis. Chapman and Hall, London, 1980.
5. Manly, B.F.J. Multivariate Statistical Methods: A Primer. Chapman and Hall, London, 1986.
6. Hansch, C. On the structure of medicinal chemistry. J. Med. Chem. 1976, 19, 1-6.
7. So, S.-S.; Richards, W.G. Application of neural networks: Quantitative structure-activity relationships of the derivatives of 2,4-diamino-5-(substituted-benzyl)pyrimidines as DHFR inhibitors. J. Med. Chem. 1992, 35, 3201.
8. Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA, 1989.
9. Holland, J.H. Genetic Algorithms. Scientific American 1992, 267, 66-72; Forrest, S. Genetic Algorithms: Principles of Natural Selection Applied to Computation. Science 1993, 261, 872-878.
10. Grassy, G.; Calas, B.; Yasri, A.; Lahana, R.; Woo, J.; Iyer, S.; Kaczorek, M.; Floc'h, R.; Buelow, R. Computer-assisted rational design of immunosuppressive compounds. Nat. Biotech. 1998, 16, 748-752.
11. Yasri, A.; Lahana, R. Rational selection of QSAR descriptors by using genetic algorithms combined to cluster significance analysis. Application to benzodiazepine (unpublished).
12. McFarland, J.W.; Gans, D.J. On the significance of clusters in the graphical display of structure-activity data. J. Med. Chem. 1986, 29, 505-514.
13. Rogers, D.; Hopfinger, A.J. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 854-866.
14. Davis, L. Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, 1991.
15. Hibbert, D.B. Generation and display of chemical structures by genetic algorithms. Chemom. Intell. Lab. Syst. 1993, 20, 35-43.
16. Hibbert, D.B. Genetic algorithms in chemistry. Chemom. Intell. Lab. Syst. 1993, 19, 277-293.
17. Leardi, R.; Boggia, R.; Terrile, M. Genetic algorithms as a strategy for feature selection. J. Chemom. 1992, 6, 267-281.
18. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.
19. Daylight Software Manual. Daylight Chemical Information Systems, Santa Fe, NM, USA, 1993.
20. Gasteiger, J.; Rudolph, C.; Sadowski, J. Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp. Method. 1990, 3, 537-547.
21. Haefely, W.; Kyburz, E.; Gerecke, M.; Möhler, H. Recent advances in the molecular pharmacology of benzodiazepine receptors and the structure-activity relationships of their agonists and antagonists. Adv. Drug Res. 1985, 14, 165-322.
22. Molconn-Z, Molconn software version 3.15, Lowell H. Hall, copyright 1998.
23. Oxford Molecular Group, The Medawar Centre, Oxford Science Park, Oxford OX4 4GA, UK, 1997.
24. So, S.-S.; Karplus, M. Genetic Neural Networks for Quantitative Structure-Activity Relationships: Improvements and Applications of Benzodiazepine Affinity for Benzodiazepine/GABA_A Receptors. J. Med. Chem. 1996, 39, 5246-5256.

Table 1: The different maps built in OASIS analysis. Each map represents the distribution of active (□) and inactive (■) molecules in the interval of a given descriptor.

[OASIS activity maps: drawings along the descriptor interval, not reproduced]

| Limit values of □ and ■ classes | Symbol | Zone to be explored ("range") | Validity of filter |
| min(□) > min(■); max(□) < max(■) | > -- < | [min(□), max(□)] | Good |
| min(□) = min(■); max(□) > max(■) | > | > max(■) | Good |
| min(□) < min(■); max(□) = max(■) | | < min(■) | Good |
| min(□) > min(■); max(□) = max(■) | | > max(■) | Explore right |
| min(□) = min(■); max(□) < max(■) | | < min(■) | Explore left |

Table 2: Biological activity of the initial and predicted peptides tested in a heterotopic allograft model of mouse

| # | Name | Peptide sequence* | Experimental MST** | Predicted MST** |
| 0 | untreated | | 79 | |
| 1 | 270275-84 | RENLRIALRY | 114 | 1337 |
| 2 | 270284-75 | YRLAIRLNER | 121 | 1213 |
| 3 | D270275-84 | renlrialry | 114 | 1143 |
| 4 | D270284-75 | yrlairlner | 132 | 133 |
| 5 | P2 | RVNLRIALRY | 115 | 1152 |
| 6 | RP2 | YRLAIRLNVR | 125 | 1247 |
| 7 | D2 | rvnlrialry | 131 | 1302 |
| 8 | RD2 | yrlairlnvr | 122 | 1223 |
| 9 | P15 | NLRIALRYYW | 118 | 1179 |
| 10 | Kk 75-84 | RVNLRTALRY | 85 | 951 |
| 11 | Dk 75-84 | RVDLRTLLRY | 72 | 717 |
| 12 | Kb 75-84 | RVDKRTLLGY | 78 | 777 |
| 13 | Db 75-84 | RVSLRNLLGY | 78 | 762 |
| 14 | 0775-84 | RESLRLLRGY | 74 | 758 |
| 15 | 270575-84 | REDLRTLLRY | 77 | 774 |
| 16 | 270276-83 | ENLRIALR | 85 | 851 |
| 17 | D2702(E>V,R>P) | rvnlpialry | 95 | 952 |
| 18 | E 75-84 | RVNLRTLRJRY | 80 | 798 |
| 19 | G 75-84 | RMNLQTLRGY | 77 | 771 |
| 20 | RDP1257 | RLLLRLLLGY | 131 | 1331 |
| 21 | RDP1259 | RVLLRLLLGY | 131 | 1182 |
| 22 | RDP1271 | RWLLRLLLGY | 113 | 1225 |
| 23 | RDP1277 | RYLLRLLLGY | 90 | 1334 |
| 24 | RDP1258 | RiiLiiLnLRnLnLnLGY | 127 | 1331 |

* nL = Norleucine

** MST: Mouse survival time (days)

Table 3: Structure and chemical groups of benzodiazepine derivatives. Experimental (Exp) and predicted (Pred) logIC50 values

Name          R7      R1        R2'  R6'  R3  R8  Exp logIC50  Pred logIC50
clonazepam    NO₂     H         Cl   H    H   H   0.255        0.020
delorazepam   Cl      H         Cl   H    H   H   0.255        0.073
diazepam      Cl      Me        H    H    H   H   0.908        0.822
flunitrazepam NO₂     Me        F    H    H   H   0.580        0.649
halazepam     Cl      CH₂CF₃    H    H    H   H   1.964        2.031
lorazepam     Cl      H         Cl   H    OH  H   0.544        0.532
meclonazepam  NO₂     H         Cl   H    Me  H   0.079        0.075
nitrazepam    NO₂     H         H    H    H   H   1.000        0.990
nordazepam    Cl      H         H    H    H   H   0.973        1.086
oxazepam      Cl      H         H    H    OH  H   1.255        0.665
Ro 05-2904    CF₃     H         H    H    H   H   1.114        1.062
Ro 05-2921    H       H         H    H    H   H   2.544        2.430
Ro 05-3061    F       H         H    H    H   H   1.602        1.723
Ro 05-3072    NH₂     H         H    H    H   H   2.587        2.648
Ro 05-3367    Cl      H         F    H    H   H   0.301        0.102
Ro 05-3418    NH₂     Me        H    H    H   H   2.663        2.500
Ro 05-3590    NO₂     H         CF₃  H    H   H   0.544        0.525
Ro 05-4082    NO₂     Me        Cl   H    H   H   0.342        0.569
Ro 05-4336    H       H         F    H    H   H   1.322        0.800
Ro 05-4336    H       H         F    H    H   H   0.342        0.575
Ro 05-4435    NO₂     H         F    H    H   H   1.322        0.800
Ro 05-4520    H       Me        F    H    H   H   1.146        1.193
Ro 05-4528    CN      Me        H    H    H   H   2.580        2.000
Ro 05-4608    H       Me        Cl   H    H   H   0.580        1.168
Ro 05-4619    NH₂     H         Cl   H    H   H   1.875        1.852
Ro 05-4865    F       Me        H    H    H   H   1.230        1.186
Ro 05-6820    F       H         F    H    H   H   0.869        0.765
Ro 05-6822    F       Me        F    H    H   H   0.708        0.734
Ro 07-2750    Cl      (CH₂)₂OH  F    H    H   H   1.389        1.402
Ro 07-3953    Cl      H         F    F    H   H   0.204        0.355
Ro 07-4065    Cl      Me        F    F    H   H   0.613        0.372
Ro 07-4419    H       H         F    F    H   H   1.279        1.080
Ro 07-5193    Cl      H         Cl   F    H   H   0.477        0.620
Ro 07-5220    Cl      Me        Cl   Cl   H   H   0.740        0.437
Ro 07-6198    H       H         F    F    H   Cl  1.447        1.690
Ro 07-9957    I       Me        F    H    H   H   0.462        0.509
Ro 07-4878    Cl      H         F    H    Me  H   0.544        0.473
Ro 07-6896    NO₂     Me        F    H    Me  H   0.845        0.978
Ro 13-3780    Br      Me        F    F    H   H   0.380        0.356
Ro 14-3074    N₃      H         F    H    H   H   0.724        1.200
Ro 20-1310    Cl      C(CH₃)₃   H    H    H   H   2.792        2.675
Ro 20-1815    NH₂     Me        F    H    H   H   1.813        1.870
Ro 20-2533    Et      H         H    H    H   H   1.566        1.566
Ro 20-2541    CN      Me        F    H    H   H   1.477        1.319
Ro 20-3053    COMe    H         F    H    H   H   1.255        1.261
Ro 20-5397    CHO     H         H    H    H   H   1.633        1.636
Ro 20-5747    CH=CH₂  H         H    H    H   H   1.380        1.397
Ro 20-7078    Cl      H         F    H    Cl  H   0.724        0.645
Ro 20-8065    Cl      H         F    H    H   Cl  0.556        0.646
Ro 20-8552    Me      H         F    H    H   Cl  1.146        1.138
Ro 20-8895    H       H         F    H    H   Me  1.279        0.973
Ro 22-3294    Cl      H         Cl   Cl   H   H   0.845        0.429
Ro 22-4683    NO₂     C(CH₃)₃   F    H    H   H   2.477        2.518
Ro 22-6762    Cl      Me        H    H    H   Cl  1.602        1.610
temazepam     Cl      Me        H    H    OH  H   1.204        1.761

Table 4: Nature and number of descriptors used for the benzodiazepine dataset before and after selection by GA-CSA

Descriptor                         Nature            Before GA-CSA   After GA-CSA
Verloop for substituents           Steric            36              5
Molecular Mass                     Mass              1               0
Molecular Surface Area             Surface           1               0
Molecular Volume                   Volume            1               0
Moments of inertia                 Volume            6               2
Ellipsoidal volume                 Volume            1               0
Dipole moments                     Electronic        4               3
LogP and lipole moments            Lipophilicity     5               3
Molecular refractivity             Refractivity      1               0
Kier ChiV indices                  Connectivity      20              2
Kappa and flexibility              Shape             7               1
Wiener, Balaban and Randic index   Topology          3               1
Sum of E-state indices             Electrotopology   1               0
Atom count                         Atomic            11              3
H-bond donor and acceptor          H-bond            2               1
Group count                        Chemical groups   5               0
Total                                                105             21

Table 5: R² and R²(CV) values of NN models built with descriptor subsets of different sizes, compared to the final NN model built by the S-21 descriptor subset selected by GA-CSA

NN model   Number of descriptors   R²      R²(CV)
S-21       21                      0.933   0.87
R21-1      21                      0.51    0.24
R21-2      21                      0.43    0.3
R21-3      21                      0.46    0.14
R21-4      21                      0.52    0.2
R21-5      21                      0.62    0.41
R8         8                       0.24    0.11
R15        15                      0.5     0.25
R30        30                      0.64    0.49
R50        50                      0.5     0.29
R100       100                     0.54    0.28
I105       105                     0.56    0.31
Rm84       84                      0.57    0.29
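
The table above reports a fitted R² and a cross-validated R²(CV) for each model. This part of the text does not restate how the two figures are computed, so the sketch below assumes the common definitions: R² = 1 - SS_res/SS_tot, and a leave-one-out R²(CV) in which each molecule is predicted by a model fitted without it. A one-descriptor least-squares line stands in for the neural network purely to keep the example self-contained; the names `r_squared`, `fit_line` and `loo_cv_r_squared` are illustrative.

```python
from typing import Callable, List, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x: Sequence[float], y: Sequence[float]) -> Callable[[float], float]:
    """Least-squares line y = a + b*x (a stand-in for the NN model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return lambda v: intercept + slope * v


def loo_cv_r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Leave-one-out R²: each point is predicted by a model fitted without it."""
    preds: List[float] = []
    for k in range(len(x)):
        x_train = [v for j, v in enumerate(x) if j != k]
        y_train = [v for j, v in enumerate(y) if j != k]
        preds.append(fit_line(x_train, y_train)(x[k]))
    return r_squared(y, preds)
```

A large gap between the fitted and the cross-validated value, as seen for the randomly chosen subsets in the table, is the usual sign that a model fits its training molecules without predicting held-out ones.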

Table 6: Fitness, R² and R²(CV) values of NN models built with descriptor subsets selected by GA-CSA generated from different random seeds of GA, and compared to the final NN model built by the S-21 descriptor subset

NN model   Number of descriptors   Fitness   R²      R²(CV)
S-21       21                      0.885     0.933   0.87
RS1        21                      0.901     0.931   0.86
RS2        22                      0.910     0.925   0.86
RS3        23                      1.086     0.928   0.83
RS4        24                      1.180     0.922   0.79
RS5        25                      1.208     0.911   0.78
RS6        25                      1.211     0.899   0.76

Claims

1. A method of identifying physico-chemical and/or topological parameters which are associated with biological activity, the method using data relating to a set of lead molecules including active molecules, of which said activity is known to be at least a predetermined level, and inactive molecules, of which said activity is known to be below said predetermined level, and a predetermined set of physico-chemical parameters of which the values are known or obtainable for each of the lead molecules, the method further using a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said molecules, the method comprising the steps of:

(i) selecting a plurality of first subsets of said parameters from among said set of parameters;

(ii) determining the value of said function for each first subset of parameters; and

2. A method according to claim 1 in which steps (ii) and (iii) are repeated at least once, each time using in step (ii) the newly derived second subsets of parameters as the first subsets of parameters, whereby new subsets of parameters are successively generated having successively closer association with said activity.

3. A method according to claim 1 or claim 2 in which each second subset of parameters is selected to resemble at least one of the subsets among said first subsets of parameters which have a value of said function associated with relatively high statistical significance.

4. A method according to claim 1, claim 2 or claim 3 in which each second subset of parameters is selected by combining parameters from among a plurality of said first subsets of parameters which have a value of said function associated with relatively high statistical significance.

5. A method according to any preceding claim in which said function (f) further depends upon a correlation (NC) between the values of the parameters of the subset of parameters.

6. A method according to any preceding claim in which said function (f) further depends upon the value of a variance (NMSD) of values of the active molecules for the subset of parameters.

7. A method of identifying a molecule having biological activity, the method including: identifying, using a method according to any preceding claim, a set of physico-chemical parameters associated with said activity, determining criteria for predicting activity of a molecule based on the values of the physico-chemical parameters for that molecule, generating a plurality of virtual molecules, determining whether any of the plurality of molecules conforms to the criteria.

8. A method according to claim 7 in which the criteria include ranges of the values of said identified parameters associated with said activity, and determining whether any of the plurality of molecules conforms to the criteria includes determining if the molecules have values for said identified parameters with said determined ranges.

9. A method according to claim 7 in which the criteria are a function performed by a neural network which has been trained on the basis of molecules of known activity.

10. A method of synthesising a compound having biological activity, the method comprising identifying a molecule having said activity using a method according to claim 7, and chemically synthesising the compound made up of that molecule.

11. A compound synthesized by a method according to claim 10.

12. A method of using a molecule synthesized according to claim 9 which exploits the activity.

13. Use of a molecule synthesized according to claim 9 in the preparation of a pharmaceutical composition having the activity.

14. An apparatus for identifying physico-chemical and/or topological parameters which are associated with biological activity, the apparatus comprising: means for storing representations of a set of virtual molecules including active molecules, of which said activity is known to be at least a predetermined level, and inactive molecules, of which said activity is known to be below said predetermined level, means for storing or deriving the values of a predetermined set of physico-chemical parameters for each of the virtual molecules, means for calculating the value of a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said virtual molecules; and processing means for:

(i) selecting a plurality of first subsets of said parameters from among said set of parameters;

(iii) selecting at least one second subset of said parameters from among said set of parameters based on the values of said function for the respective first subsets of parameters, whereby the or each second subset of parameters is more closely associated with said activity than the first subsets of parameters.
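
The claims describe an iterative search: a population of "first" parameter subsets is scored with the function (f), and "second" subsets are then derived that resemble or recombine the best-scoring ones, with the cycle repeated. The Python sketch below illustrates that loop as a simple genetic-style search. It is an illustration under stated assumptions and not the claimed procedure itself: the scoring callable `f` is supplied by the caller and a larger value is treated as better (reverse the sort if the opposite convention applies), and the population size, fixed subset size, recombination rule and `mutation_rate` are all hypothetical choices.

```python
import random
from typing import Callable, FrozenSet, List, Sequence

Subset = FrozenSet[str]


def evolve_subsets(parameters: Sequence[str],
                   f: Callable[[Subset], float],
                   subset_size: int = 3,
                   population: int = 20,
                   generations: int = 10,
                   mutation_rate: float = 0.2,
                   seed: int = 0) -> List[Subset]:
    """Return the final population of parameter subsets, best first."""
    rng = random.Random(seed)
    # Select a plurality of first subsets at random.
    current = [frozenset(rng.sample(list(parameters), subset_size))
               for _ in range(population)]
    for _ in range(generations):
        # Score every first subset with f and rank them.
        ranked = sorted(current, key=f, reverse=True)
        parents = ranked[:max(2, population // 4)]
        # Derive second subsets; the best subsets are carried over unchanged.
        children: List[Subset] = list(parents)
        while len(children) < population:
            a, b = rng.sample(parents, 2)
            pool = sorted(a | b)  # combine parameters of two good subsets
            child = set(rng.sample(pool, subset_size))
            if rng.random() < mutation_rate:
                outside = [p for p in parameters if p not in child]
                if outside:
                    # Swap one parameter for one not yet in the subset.
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(frozenset(child))
        # The second subsets become the first subsets of the next cycle.
        current = children
    return sorted(current, key=f, reverse=True)
```

Because the top-ranked subsets are copied into each new generation, the best score in the population never decreases from one cycle to the next, which mirrors the "successively closer association" wording of claim 2.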