WO1999026901A1

WO1999026901A1 - Method of designing chemical substances

Info

Publication number: WO1999026901A1
Application number: PCT/GB1998/003017
Authority: WO
Inventors: John Wood; Valerie Sarah Rose
Original assignee: Biofocus Plc
Priority date: 1997-11-24
Filing date: 1998-10-08
Publication date: 1999-06-03
Also published as: GB9724784D0; AU9358698A; EP1034153A1

Abstract

A method for designing chemical substances is disclosed, in which a plurality of sets of candidate elements are selected. From those sets of candidate elements, an array of all possible substances that can be made is generated. From this array, a smaller, sub-array is identified which includes all possible pairings of the elements within each set. The substances within that sub-array are then synthesised and their physical properties tested in each case. Based on the results of these tests, the physical properties of substances within the array of all possible substances but not within the sub-array may be predicted. The method allows a chemical substance, and particularly a pharmaceutical substance, having an optimised physical characteristic to be identified from a large array of possible substances more quickly and accurately.

Description

Method of Designing Chemical Substances

This invention relates to a method of designing chemical substances, and particularly, but not exclusively, to such a method for use in Combinatorial Chemistry applications.

Combinatorial Chemistry (CC) is the synthesis of large numbers of chemical compounds from smaller numbers of building blocks by assembling them in all combinations; the approach has been increasingly used in the last few years as an aid to chemists searching for (or designing) a pharmaceutical compound with desired biological properties. It is, however, of more general applicability.

The approach lends itself to automated synthesis using robots, such as those produced by ACT, Zymark and others, to enable large numbers of chemicals to be synthesized in a short time compared with conventional chemical synthesis. Together with the automation of biological screens enabling High Throughput Screening (HTS), these two techniques have resulted in a many-fold increase in the numbers of compounds capable of being screened for biological activity. A basic introduction to the topic is given in "Pharmaceutical Visions", ABC Business Press, Summer 1997. See also Hogan, J.C., Nature, 384, 17-19 (1996); Gallop, M.A. et al, J. Med. Chem. 37, 1233-1251 (1994); and Gordon, M.A. et al, J. Med. Chem. 37, 1385-1401 (1994).

Several methods have been proposed to allow even larger numbers of compounds to be synthesized and screened, including "split and mix" synthetic strategies and pooling compounds for testing. Such techniques have been described, for example, by N.K. Terrett et al., Tetrahedron, 51, pages 8135-8173 (1995).

Commonly, chemical arrays for pharmaceutical use are designed using a 3 monomer synthesis protocol which gives rise to a final molecule with 3 sites of possible variation. If a chemist wishes to make an array of m³ compounds, two common approaches are used to select which compounds to make. In the first, a set of m candidates (monomers) is selected for each substituent group and an m x m x m array is synthesized of all possible combinations of substituent groups. If m is 10 then the final array contains 1000 compounds. This will be referred to as the 'All-Combinations' approach. For this approach the number of monomers (m) at each site need not be identical. With m_S1 monomers at one site, m_S2 at the second site, and m_S3 at the third site, an m_S1 x m_S2 x m_S3 array is created. Appendix 1 provides a definition of the symbols used.
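By way of illustration only, the All-Combinations array can be enumerated with a few lines of Python. The monomer labels and the helper name `all_combinations` are hypothetical; only the sizes (10 monomers at each of 3 sites, giving 1000 compounds) are taken from the text.

```python
from itertools import product

def all_combinations(*monomer_sets):
    """Return every compound obtainable by picking one monomer per site."""
    return list(product(*monomer_sets))

# Hypothetical monomer labels: 10 candidates at each of the 3 sites.
site_1 = [f"a{i}" for i in range(1, 11)]
site_2 = [f"b{i}" for i in range(1, 11)]
site_3 = [f"c{i}" for i in range(1, 11)]

equal_sites = all_combinations(site_1, site_2, site_3)
print(len(equal_sites))    # size of the 10 x 10 x 10 array

# The number of monomers need not be the same at each site.
unequal_sites = all_combinations(site_1[:4], site_2[:6], site_3)
print(len(unequal_sites))  # size of the 4 x 6 x 10 array
```

The array size is simply the product of the set sizes, which is why it grows so quickly as candidates are added.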

This procedure is well established, and a number of companies now offer a library of compounds comprising all the possible combinations of substituent groups or monomers. There are several drawbacks with this technique, however. As the size of the array increases, so does the time to manufacture all of the compounds in that array. To generate and test 1000 compounds will take a day or more, and as the number of substituent group candidates increases the time taken to manufacture and test the array becomes impractical. Even with smaller array sizes, it is increasingly unusual for vendors of such libraries to test the full range of compounds for purity. It is therefore possible to obtain such a library of compounds which fails to contain some, possibly important, subsets of the array.

The alternative ('Cherry-Picking') approach involves creating a virtual array of n by n by n (where n > m) compounds and using commercially available molecular modelling software to select a diverse subset of m³ compounds to synthesise. This approach allows n to be greater than m for the same final number of compounds and so potentially increases the molecular diversity that is present in the final array.

A problem is that this approach is chemically complex, as the chemical reactivity of a larger number of monomers has to be investigated, and is expensive in raw materials: if commercially available monomers have to be procured for each of the possible sites α, β and γ, the increase in diversity and variety of synthetic protocols that this would often require restricts the degree of automation that can be applied.

Selecting a suitable set of compounds for synthesis which are diverse and representative of a larger set of compounds is therefore important, particularly when the virtual array n x n x n is large. This scientific discipline is commonly termed Molecular Diversity. A description of some approaches is given, for example, by R.P. Sheridan and S.K. Kearsley, in J. Chem. Inf. Comput. Sci., 35, 310-320 (1995), and by E.J. Martin et al, in J. Med. Chem. 38, 1431-1435 (1995).

It is an object of the present invention at least to alleviate the problems of the prior art.

It is a further object to provide an improved method of designing a chemical substance.

According to a first aspect of the present invention, there is provided a method of designing a chemical substance having a desired physical property, the substance including a plurality r of individual elements, the method comprising:

(a) selecting r (r ≥ 3) sets of candidate elements, C_1, ... C_r;

(b) generating an all-combinations array of possible substances, each element of the array being representative of a different substance having one element chosen from each of the sets C_1, ... C_r;

(c) defining a sub-array within the all-combinations array, the sub-array being smaller than the all-combinations array but including all possible pairings of candidate elements;

(d) synthesising the possible substances in the sub-array, and measuring the said physical property for each synthesised substance;

(e) using the measured physical properties to predict the characteristics of the possible substances in the all-combinations array which have not been synthesised;

(f) selecting and synthesising further possible substances on the basis of their predicted characteristics, and measuring the said physical property for each synthesised further possible substance;

(g) repeating steps (e) and (f) one or more times until a substance has been synthesised which displays a characteristic sufficiently close to the desired physical property.

The method of the present invention has a number of advantages. For example, it provides an optimum selection of substances within a larger virtual array, thus minimising the possibility of "missing" desirable substances within that larger virtual array.

Preferably, the sub-array is defined by a Latin Square or a set of orthogonal Latin Squares. The Latin Square is a sub-array optimised such that each element within a given set of candidate elements is paired exactly once with each element within each other set of candidates. Alternatively, the sub-array may be defined by two Latin Squares. Each of the sets C_1 ... C_r may have the same number, m, of candidates. The candidate elements in the sets may further, or alternatively, be mutually exclusive.

The chemical substance to be designed may be an admixture of the candidate elements, such as, for example, a paint. Alternatively, the chemical substance may be a molecule.

In the latter case, the desired physical property may be biological activity. In that case, the molecule will preferably be a pharmaceutical compound. Often, the contributions of the various candidate elements in a pharmaceutical compound are not synergistic or antagonistic. It is thus possible to generate a mathematical algorithm that permits prediction of the overall biological activity of those compounds within the complete array that have not yet been synthesized.

The molecule preferably comprises a fixed scaffold having r sites of variation, each of the sets C_1 ... C_r being representative of candidate elements for an individual site. The scaffold may, for example, be a benzene ring, or a plurality of aromatic or heterocyclic rings. The scaffold may arise from reaction of the sets C_1 ... C_r or may constitute a starting framework to which the sets C_1 ... C_r are attached. Compounds created by linear addition of monomers (C_1 + C_2 ... + C_r) do not contain a core scaffold as such but are still relevant to the design.

Preferably, the sub-array is defined by a Graeco-Latin Square consisting of two orthogonal Latin Squares. Such a sub-array permits r to be 4 whilst still providing an optimal sub-array such that each element within a given set of candidate elements is paired exactly once with each element within each other set of candidates. An additional orthogonal Latin Square is added for each increment of r.

The method may further comprise calculating, from the measured physical properties of the synthesised substances, property contribution values representative of the respective contribution to the physical property of each of the individual elements. Additionally, it may include predicting the characteristics of the possible substances which have not been synthesised in dependence upon the property contribution values. The property contribution value for a given candidate element is calculated by summing the measured physical properties of each synthesised substance containing that element, and subtracting the overall mean.

The predicted characteristics may be calculated by minimising an error function which characterises an error between the actual and predicted measurements of the property value for those substances that have been synthesised, or calculated by summing the property contribution values for the relevant substance.

Information is obtained from the screening process about active and inactive compounds in a structured way, enabling trends between structure and activity to be modelled and other compounds with unexpected activity or inactivity levels to be identified and further studied.

One preferred method of selecting the r sets of candidate elements is to use a cluster analysis. For example, the method might include selecting a candidate element having a high property contribution, identifying from the cluster analysis a so far unused element which would be expected to have similar properties, adding the said so far unused element to one of the r sets of candidate elements C_1, ... C_r, either in addition to the existing elements or in substitution for one of them, and repeating steps (b) to (g) of the method according to the first aspect of the present invention one or more times.

According to a second aspect of the present invention, there is provided a method of designing a chemical compound, including:

(a) selecting a scaffold having r sites of variation S_1, S_2, ... S_r thereon;

(b) selecting groups of candidates C_1, ... C_r for inclusion, respectively, on each of the r sites of variation, such that there are m_S1 candidates for the site S_1, m_S2 candidates for the site S_2, ... m_Sr candidates for the site S_r; and

(c) generating an array of compounds of size p such that all m_Sα candidates on a site of variation S_α (1 ≤ α ≤ r) are paired at least once with all m_Sβ candidates on any other site of variation S_β (1 ≤ β ≤ r; α ≠ β), and such that the array size p is less than q, where q = m_S1 x m_S2 x ... x m_Sr, the size of an array of all possible combinations of the candidates at their respective sites S_1, S_2, ... S_r on the scaffold.

According to a third aspect of the present invention, there is provided a method of designing a chemical compound, including:

(a) selecting a plurality z of scaffolds each having r sites of variation S_1, S_2, ... S_r thereon;

(b) selecting, for each scaffold, groups of candidates C_1 ... C_r for inclusion, respectively, on each of the r sites of variation, such that there are m_S1 candidates for each of the sites S_1; m_S2 candidates for each of the sites S_2; ... m_Sr candidates for each of the sites S_r; and

(c) generating an array of compounds of size p such that all m_Sα candidates on a site of variation S_α (1 ≤ α ≤ r) are paired at least once with all m_Sβ candidates on any other site of variation S_β (1 ≤ β ≤ r; α ≠ β), such that all candidates are paired at least once with each of the z scaffolds, and such that the array size p is less than q, where q = z x m_S1 x m_S2 x ... x m_Sr, the size of an array of all possible combinations of the candidates at their respective sites S_1, S_2, ... S_r on all of the z scaffolds.

The method according to these aspects of the present invention allows an increase in m (the number of possible variants on a particular site in a compound) without increasing the number of compounds that need actually to be generated. This is done by generating a structured sub-array in the full m^r array which maximises the chances of locating materials of interest from that full array. The method therefore provides an advantage over the all-combinations approach outlined above, where the time to produce the whole m^r array rapidly becomes prohibitive.

The combination of compound generation and testing provides an iterative technique to pinpoint compounds within the overall array (N) to be synthesized which may be of interest, but which were not part of the original sub-array generated.

The present invention can be put into practice in various ways which will now be described by way of example with reference to the accompanying drawings in which: -

Fig. 1 shows a schematic representation of a chemical compound comprising a core scaffold and three sites of variation;

Figs. 2a, 2b and 2c show three different chemical scaffolds with similar substituent group positions and orientations;

Fig. 3 shows an example of a hetero ring scaffold with 2 substituent sites where variation can occur, formed by the reaction of two monomers;

Figs. 4a, 4b and 4c show three chemical scaffolds with different numbers of possible substituent sites;

Fig. 5 shows an example of a linear addition of monomers giving rise to a molecule with substituent sites; and

Fig. 6 shows, schematically, the method according to a preferred embodiment of the present invention.

Fig. 1 shows, schematically, a chemical compound 10 having a core scaffold 20 and three sites of variation around that scaffold, labelled α, β and γ. Of course, chemicals having more than three sites of variation can be used in the method described herein. The actual number of sites to be varied will largely depend upon the application envisaged for the chemicals to be generated and on the synthetic route. Three examples of scaffolds are shown in Figs. 2a, 2b and 2c, and an example of a hetero scaffold with 2 substituent sites (S_1 and S_2) where variation can occur, formed by the reaction of two monomers, is shown in Fig. 3. The multiple arrows shown in Fig. 3 denote that the reaction is carried out over several synthetic steps.

Furthermore, scaffolds may have numerous substituent group sites. Three such examples are shown in Figs. 4a, 4b and 4c. In most cases, it is unlikely that more than four different sets of monomers would be varied in pharmaceutical applications, except in the case of oligopeptides which contain long lengths of amino acids; the principle, however, is not limited to any specific number of sites and is applicable to compounds synthesised by sequential addition of monomers and hence containing no core scaffold. Such an arrangement is shown in Figure 5, in which the multiple arrows denote that the reaction is carried out over several synthetic steps.

Generally, there will be a very large number of possible substituent groups that can be attached at each of the substituent group sites. As shown in Fig. 6, the first step 100 in a preferred embodiment of the present invention is to identify a group of n compounds or "elements" for each of the candidate substituent group sites on the scaffold, preferably using a cluster analysis.

Possible substituent group candidates are identified from those which are commercially available, previously prepared by the user of the process, or not readily available but nevertheless desirable for the proposed application. Candidates incompatible with the proposed synthesis are omitted from this set. Many different sets of properties have been used to model molecular diversity and selection of a suitable set depends on the problem in hand. Further description is given in the article by Martin et al, referred to above, and also Brown, R.D. and Martin, Y.C., J. Chem. Inf. Comput. Sci., 36, 572-584 (1996).

A diverse set of substituent groups or "elements", usually monomers, is selected for each substitution position to give variation in both physicochemical properties (e.g. size, shape, electronic properties etc.) and in structural fragments (i.e. in the physical make up, bonding structure etc of a variety of possible molecular fragments within the overall substituent group). The latter may be achieved using the 2D fingerprints approach, discussed by Brown et al. This approach defines strings of 0/1 values indicating the absence/presence of structural fragments within possible groups; such strings can be calculated using commercially available software such as that from Tripos or Daylight. For diverse substitution group sets, combinations of properties and structural fragments are preferred, while for sets showing a limited amount of diversity, properties only are preferred to model diversity.

Many methods are available to select a diverse set of substitution groups once they have quantified descriptors based on properties or fragments. A preferred approach here is to use a hierarchical clustering approach, such as Ward's (see the Brown reference, referred to above), coupled with manual or automated selection of substituent group candidates from the clusters identified. Other known clustering techniques or selection algorithms may be used.

Referring back now to Figure 6, the next stage 110 in the process is to create what will be called a Predictive Array Design (PAD) of virtual compounds. The design uses a modification of a statistical arrangement of variables known as a Latin Square; see Cochran, W.G. and Cox, G.M., Experimental Designs, Pub. John Wiley & Sons, New York (1957). A Latin Square is a square array of side n that contains symbols from a set of size n. The symbols are arranged so that every row of the array has each symbol of the set occurring exactly once, and also every column of the array has each symbol of the set occurring exactly once. Thus, the symbols are arranged in such a way that no orthogonal (row or column) contains duplicate symbols.
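The row and column conditions can be checked mechanically. The following is a minimal sketch only; the function name `is_latin_square` and the two example squares are illustrative and do not form part of the method as described.

```python
def is_latin_square(square):
    """True if every row and every column holds each symbol exactly once."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n or any(len(row) != n for row in square):
        return False
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(n))
    return rows_ok and cols_ok

# A 5 x 5 square in which each row is the previous row shifted one place left.
cyclic = [[(r + c) % 5 + 1 for c in range(5)] for r in range(5)]

# Repeating a row keeps the rows valid but puts duplicates in the columns.
broken = [cyclic[0]] + cyclic[:4]

print(is_latin_square(cyclic), is_latin_square(broken))
```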

A Graeco-Latin Square (used in some embodiments to be described below) is formed when two Latin Squares with sides of dimension j are superimposed on one another in such a way that each of the j² combinations of the symbols (taking the order of the superimposition into account) occurs exactly once in the j² cells of the array. The two individual Latin Squares are then said to be "orthogonal". See Montgomery, D.C., Design and Analysis of Experiments, Third Edition, Pub. John Wiley & Sons, New York (1991).

The application of this mathematical technique to the present method can be explained by way of some examples. Consider the schematic molecule shown in Fig. 1. In order to allow the principle to be demonstrated, assume that only 5 different candidates are selected for each substituent group site, α, β and γ. If substituent group site α consists of five candidates labelled A to E, substituent group site β consists of five different candidates labelled i to v, and substituent group site γ consists of five different candidates labelled 1 to 5, then the Latin Square of Table 1 can be designed following the above rules; this is the PAD:
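Orthogonality can likewise be tested by superimposing two squares and counting the distinct ordered pairs. This is a sketch under the assumption that both inputs are already Latin Squares of the same side; the function name `are_orthogonal` and the example squares are hypothetical.

```python
def are_orthogonal(first, second):
    """True if superimposing the squares yields each ordered pair exactly once."""
    j = len(first)
    pairs = {(first[r][c], second[r][c]) for r in range(j) for c in range(j)}
    # j * j cells holding j * j distinct pairs means no pair is repeated.
    return len(pairs) == j * j

j = 5
# Digits shifted one place left per row; letters shifted two places per row.
shift_one = [[(r + c) % j + 1 for c in range(j)] for r in range(j)]
shift_two = [["abcde"[(2 * r + c) % j] for c in range(j)] for r in range(j)]

print(are_orthogonal(shift_one, shift_two))
print(are_orthogonal(shift_one, shift_one))
```

A square is never orthogonal to itself, since superimposition then yields only the j pairs of identical symbols.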

       A    B    C    D    E
i      1    2    3    4    5
ii     2    3    4    5    1
iii    3    4    5    1    2
iv     4    5    1    2    3
v      5    1    2    3    4

Table 1

This design gives rise to the following 25 compounds on a given scaffold S:

S_i_A_1 S_i_B_2 S_i_C_3 S_i_D_4 S_i_E_5
S_ii_A_2 S_ii_B_3 S_ii_C_4 S_ii_D_5 S_ii_E_1
S_iii_A_3 S_iii_B_4 S_iii_C_5 S_iii_D_1 S_iii_E_2
S_iv_A_4 S_iv_B_5 S_iv_C_1 S_iv_D_2 S_iv_E_3
S_v_A_5 S_v_B_1 S_v_C_2 S_v_D_3 S_v_E_4

There are, in total, 125 different compounds that can be generated with 5 different candidates on each substituent group site. It will be observed, however, that the above set of 25 compounds contains every possible combination of candidates between any two of the three sites, but only once. The first two sets of substituent group candidates are arranged in a square array, with the third set constructed around the diagonal. That is, every candidate on the α site (A...E) is combined with every candidate on the β site (i...v), but only once; every candidate on the α site (A...E) is combined with every candidate on the γ site (1...5), but only once; and every candidate on the β site (i...v) is combined with every candidate on the γ site (1...5), but only once. Every possible dimer combination is therefore represented in the array.

For scaffolds having four sites of variation, the PAD may consist of a Graeco-Latin square, as explained above. In the example shown in Table 2, the scaffold has four sites α, β, γ and δ, where substituent groups can be added. The first three (α, β and γ) are labelled as in Table 1. The fourth site, δ, also has five candidates, labelled a to e.

       A      B      C      D      E
i      1,a    2,b    3,c    4,d    5,e
ii     2,c    3,d    4,e    5,a    1,b
iii    3,e    4,a    5,b    1,c    2,d
iv     4,b    5,c    1,d    2,e    3,a
v      5,d    1,e    2,a    3,b    4,c

Table 2

This gives rise to the following 25 compounds (each of which will be attached on a scaffold S):

i_A_1_a i_B_2_b i_C_3_c i_D_4_d i_E_5_e
ii_A_2_c ii_B_3_d ii_C_4_e ii_D_5_a ii_E_1_b
iii_A_3_e iii_B_4_a iii_C_5_b iii_D_1_c iii_E_2_d
iv_A_4_b iv_B_5_c iv_C_1_d iv_D_2_e iv_E_3_a
v_A_5_d v_B_1_e v_C_2_a v_D_3_b v_E_4_c

A Graeco-Latin square of a prime number dimension can always be generated, but solutions can only be derived for certain other dimensions. The number of candidates on each site may therefore need to be partially restricted, to ensure all combinations of pairs of substituent group candidates will be present.

Generalising mathematically, for m candidates per substituent group site, the Graeco-Latin Square will give rise to m² virtual compounds from a total all-combinations array size of m⁴ compounds. Once again, all possible dimer combinations are represented in the array.
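The pair coverage of the four-site design of Table 2 can be confirmed by counting, as in the following sketch. The index arithmetic used to regenerate the cells of Table 2 and the variable names are illustrative assumptions rather than part of the method.

```python
from collections import Counter
from itertools import combinations

alpha = "ABCDE"                       # columns of Table 2
beta = ["i", "ii", "iii", "iv", "v"]  # rows of Table 2
m = 5

# Cell (row r, column c) holds gamma = r + c and delta = 2r + c (mod 5),
# which reproduces the entries of Table 2.
compounds = [
    (beta[r], alpha[c], (r + c) % m + 1, "abcde"[(2 * r + c) % m])
    for r in range(m)
    for c in range(m)
]

# Count how often each pair of candidates from two different sites co-occurs.
pair_counts = Counter()
for compound in compounds:
    for (site_a, x), (site_b, y) in combinations(enumerate(compound), 2):
        pair_counts[(site_a, x, site_b, y)] += 1

print(len(compounds), "compounds out of", m ** 4)
print(len(pair_counts), "distinct pairs, occurrence counts:", set(pair_counts.values()))
```

With four sites there are six pairs of sites and 25 candidate pairings for each, so 150 distinct pairs are expected, each occurring once.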

Graeco-Latin Squares for 4 sets of substituents can be extended to more sets of substituents by adding an additional orthogonal Latin Square for each new set of substituents. These orthogonal squares are generated in the following general manner. The 2 sides of the square remain invariant in all cases, as in Table 1 for the first 2 substituents. The third substituent is added as in Table 1, i.e. the first row is in order (here, 1 to 5); in the second row, the numbers are shifted one position to the left; in the third row they are shifted again one position to the left, and so this continues as one goes down the rows. Table 2 shows the square when the fourth substituent is added. This is derived in a similar manner to the third substituent, in that the first row is in order (here, a to e) and each subsequent row is shifted to the left. However, in this case the positional shift of each row is by two places to the left. If a fifth substituent site is used then another orthogonal Latin Square is added, and the positional shift is by three places. For each new substituent set C_r from r = 3 to r = j+1, the shift to the left of each row is by r-2 places. The maximum number of sets of substituents is j+1, where j is the dimension of the square, and for this construction the square must be of prime number dimensions. If j is not prime then repeat rows will occur if r-2 is a factor of j.

Other constructions are also possible in certain circumstances but the above provides a generalised approach.
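The shift construction described above can be sketched as follows. In this illustration the candidates of each set are represented by the integers 0 to j-1, and the function names `shifted_design` and `covers_all_pairs_once` are hypothetical.

```python
def shifted_design(j, r):
    """Build j * j entries over r sets. Sets 1 and 2 are the row and column;
    set k (k >= 3) is the column shifted k - 2 places to the left per row."""
    return [
        (row, col) + tuple((col + (k - 2) * row) % j for k in range(3, r + 1))
        for row in range(j)
        for col in range(j)
    ]

def covers_all_pairs_once(design, j):
    """True if every pair of sets shows all j * j pairings exactly once."""
    r = len(design[0])
    return all(
        len({(d[a], d[b]) for d in design}) == j * j
        for a in range(r)
        for b in range(a + 1, r)
    )

# Prime side 5 supports the maximum of j + 1 = 6 sets.
print(covers_all_pairs_once(shifted_design(5, 6), 5))
# Side 6 with a fourth set: the shift of 2 is a factor of 6, so rows repeat.
print(covers_all_pairs_once(shifted_design(6, 4), 6))
```

The second call illustrates the restriction to prime dimensions: with a side of 6 and a shift of 2, the fourth set repeats every three rows and some pairings are never produced.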

A modification of the Graeco-Latin Square has been developed to accommodate different scaffolds with similar substitution sites. Three examples of such compounds are shown in Figs. 2a, 2b and 2c.

In the modified Graeco-Latin Square, the orthogonal diagonal has repeat copies of each scaffold. This modification is dictated by the limited number of scaffolds (compared with the number of substituent group candidates) which would normally be incorporated into a single chemical array due to synthetic complexity.

The modified Graeco-Latin Square in Table 3 below gives rise to single copies of all substituent group pairs but multiple copies of scaffold-substituent group pairs. For example, an array of 5 different substituent group candidates at 3 different sites with 3 different scaffolds (x, y and z) would generate the following design:

       A      B      C      D      E
i      1,y    2,x    3,z    4,y    5,x
ii     2,z    3,y    4,x    5,z    1,y
iii    3,x    4,z    5,y    1,x    2,z
iv     4,y    5,x    1,z    2,y    3,x
v      5,z    1,y    2,x    3,z    4,y

Table 3

Giving rise to the following 25 compounds:

i_A_1_y i_B_2_x i_C_3_z i_D_4_y i_E_5_x
ii_A_2_z ii_B_3_y ii_C_4_x ii_D_5_z ii_E_1_y
iii_A_3_x iii_B_4_z iii_C_5_y iii_D_1_x iii_E_2_z
iv_A_4_y iv_B_5_x iv_C_1_z iv_D_2_y iv_E_3_x
v_A_5_z v_B_1_y v_C_2_x v_D_3_z v_E_4_y

Generally, for m candidates per substituent group site at 3 sites, and 3 different scaffolds (x, y and z), m² compounds are generated from a total all-combinations array size of m³ x 3.

In some instances it may be desirable to have more than 1 copy of each pair of substituent group candidates, but no duplicate compounds. This situation may occur, for example, when the total number of candidates identified for each substituent group site is relatively small. Then, the whole array of compounds that can be manufactured is relatively small, and synthesising and testing a non-optimal sub-array of compounds is feasible. Of course, by increasing the number of compounds to be made and tested relative to the number in the all-combinations array, accuracy of prediction can be improved. It is also of value to test the predictive power of the PAD by using one array to derive the mathematical model to make the prediction and a second array to test the predictive power of the model. It may also, on occasion, be desirable to have repeated dimers within the PAD so that, when the compounds are subsequently synthesised and screened, a single error in the screening process will not necessarily invalidate the figures for that dimer pair which was being screened when the error occurred.

Such an enlarged array can be achieved by using more than 1 Latin Square.

There are many possible Latin squares for each array dimension: see Fisher, R.A. and Yates, F., Statistical Tables for Biological, Agricultural and Medical Research, Pub. Longman, London (1963). One example is given below of a design for 3 sites of variation on a scaffold, with 5 different substituent group candidates at each site and 2 copies of substituent group candidate pairs, but no duplicated compounds. In this example the lower square is created by permuting the symbols of the upper square such that each symbol is mapped to a different symbol (1→2, 2→3, 3→4, 4→5, 5→1).

       A    B    C    D    E
i      1    2    3    4    5
ii     2    3    4    5    1
iii    3    4    5    1    2
iv     4    5    1    2    3
v      5    1    2    3    4

i      2    3    4    5    1
ii     3    4    5    1    2
iii    4    5    1    2    3
iv     5    1    2    3    4
v      1    2    3    4    5

Table 4

This design gives rise to the following 50 compounds:

i_A_1 i_B_2 i_C_3 i_D_4 i_E_5
ii_A_2 ii_B_3 ii_C_4 ii_D_5 ii_E_1
iii_A_3 iii_B_4 iii_C_5 iii_D_1 iii_E_2
iv_A_4 iv_B_5 iv_C_1 iv_D_2 iv_E_3
v_A_5 v_B_1 v_C_2 v_D_3 v_E_4
i_A_2 i_B_3 i_C_4 i_D_5 i_E_1
ii_A_3 ii_B_4 ii_C_5 ii_D_1 ii_E_2
iii_A_4 iii_B_5 iii_C_1 iii_D_2 iii_E_3
iv_A_5 iv_B_1 iv_C_2 iv_D_3 iv_E_4
v_A_1 v_B_2 v_C_3 v_D_4 v_E_5

In this case, if there are m candidates per substituent group site, m² x 2 (i.e. 50) compounds are defined within the PAD out of a total of m³ (i.e. 125) possible compounds.
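The properties claimed for the two-square design of Table 4 (50 distinct compounds, with every pair of candidates present twice) can be checked by counting. The sketch below regenerates the two squares from index arithmetic; the variable names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

m = 5
rows = ["i", "ii", "iii", "iv", "v"]

# Upper square of Table 4: each row shifted one place to the left.
upper = [(rows[r], "ABCDE"[c], (r + c) % m + 1) for r in range(m) for c in range(m)]
# Lower square: each numeric symbol is mapped to the next one (1->2, ..., 5->1).
lower = [(row, col, num % m + 1) for (row, col, num) in upper]
design = upper + lower

pair_counts = Counter()
for compound in design:
    for (site_a, x), (site_b, y) in combinations(enumerate(compound), 2):
        pair_counts[(site_a, x, site_b, y)] += 1

print(len(design), "compounds,", len(set(design)), "distinct")
print(sorted(Counter(pair_counts.values()).items()))  # (copies, number of pairs)
```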

The flexibility of the PAD allows for different numbers of candidates to be used at each site. This is particularly useful when a more focused, smaller selection of candidates for some of the sites has been possible. For example, with 10 different candidates at site α (labelled i to x) and 5 different candidates at each of sites β and γ, the following PAD can be used:

        A    B    C    D    E
i       1    2    3    4    5
ii      2    3    4    5    1
iii     3    4    5    1    2
iv      4    5    1    2    3
v       5    1    2    3    4
vi      1    2    3    4    5
vii     2    3    4    5    1
viii    3    4    5    1    2
ix      4    5    1    2    3
x       5    1    2    3    4

Table 5

This design gives rise to the following 50 different compounds:

i_A_1 i_B_2 i_C_3 i_D_4 i_E_5
ii_A_2 ii_B_3 ii_C_4 ii_D_5 ii_E_1
iii_A_3 iii_B_4 iii_C_5 iii_D_1 iii_E_2
iv_A_4 iv_B_5 iv_C_1 iv_D_2 iv_E_3
v_A_5 v_B_1 v_C_2 v_D_3 v_E_4
vi_A_1 vi_B_2 vi_C_3 vi_D_4 vi_E_5
vii_A_2 vii_B_3 vii_C_4 vii_D_5 vii_E_1
viii_A_3 viii_B_4 viii_C_5 viii_D_1 viii_E_2
ix_A_4 ix_B_5 ix_C_1 ix_D_2 ix_E_3
x_A_5 x_B_1 x_C_2 x_D_3 x_E_4

which contain 1 copy of all α-β and α-γ substituent group candidate pairs, but 2 copies of all β-γ pairs.
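The pair counts stated for the design of Table 5 can be verified in the same way. In this sketch the ten row labels are taken as site α and the columns and numbers as sites β and γ, following the description of Table 5; the variable names are hypothetical.

```python
from collections import Counter

m = 5
alpha = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]

# Rows vi to x of Table 5 repeat the square used for rows i to v.
design = [(alpha[r], "ABCDE"[c], (r + c) % m + 1) for r in range(10) for c in range(m)]

alpha_beta = Counter((a, b) for a, b, g in design)
alpha_gamma = Counter((a, g) for a, b, g in design)
beta_gamma = Counter((b, g) for a, b, g in design)

print(len(design), "compounds")
print("alpha-beta copies:", set(alpha_beta.values()))
print("alpha-gamma copies:", set(alpha_gamma.values()))
print("beta-gamma copies:", set(beta_gamma.values()))
```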

All of the above examples are of PADs which include all possible pairings of candidates. However, an extension of the same approach could be used to create all possible triplets within a range of four possibilities. This is the Latin cube.

The next step, 120, in the process of Fig. 6 is to synthesise each of the compounds in the PAD. In the present example, the synthesized compounds are then tested for biological activity at step 130.

At step 140, the substituent group activity contributions are calculated on the basis of the actual biological activity for each tested sample.

The average contribution (a) of each substituent group at a particular site to biological activity is calculated in the following way:

a = (o - e) / b

where: b = number of compounds containing the substituent group candidate for that site, o = the observed number of biologically active compounds containing the substituent group candidate at that site (or the sum of the activities recorded for compounds containing the substituent group candidate in the range 0 to 1, where 1 denotes maximum activity and 0 denotes inactive), and e is the expected number of active compounds that would be associated with the substituent group candidate by chance and is calculated as:

e = (b x t) / M

where t = total number of biologically active compounds (or sum of activities for quantified activity expressed in the range 0 to 1, where 1 denotes maximum activity and 0 denotes inactive),

and M = the total number of compounds in the array.
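A minimal sketch of this calculation is given below. It assumes that the contribution is the observed activity minus the activity expected by chance, averaged over the b compounds containing the candidate, i.e. a = (o - e) / b with e = (b x t) / M; this reading of the definitions, the function name `contribution`, its argument names and the example data are all assumptions made for illustration.

```python
def contribution(activities, contains_candidate):
    """Average contribution a = (o - e) / b for one candidate at one site.

    activities: activity of every compound in the array, each in the range 0 to 1.
    contains_candidate: parallel list of booleans, True where the compound
    carries the candidate (both argument names are hypothetical).
    """
    M = len(activities)                 # total number of compounds
    t = sum(activities)                 # total activity over the array
    b = sum(contains_candidate)         # compounds containing the candidate
    o = sum(act for act, has in zip(activities, contains_candidate) if has)
    e = b * t / M                       # activity expected by chance (assumed form)
    return (o - e) / b

# Made-up screen of 6 compounds, 3 of which contain the candidate.
activities = [1, 1, 0, 0, 0, 1]
contains = [True, True, True, False, False, False]
print(contribution(activities, contains))
```

Under this reading a positive value indicates that compounds containing the candidate are more active than the array average, and a negative value that they are less active.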

This information is then used, at step 150 of the method of Fig. 6, to provide a prediction of activity for all of the entries in the entire combinatorial array of compounds. In many circumstances, this may be done simply by summing the calculated substituent group activity contributions for each compound; e.g. for 3 sites of variation, the predicted activity rating C_TOTAL may be calculated as:

C_TOTAL = C_α + C_β + C_γ + e

where

C_α is the activity contribution calculated for the substituent group at the α site, C_β is the activity contribution calculated for the substituent group at the β site, and C_γ is the activity contribution calculated for the substituent group at the γ site.
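The summation over the whole all-combinations array may be sketched as follows. The contribution values are made up for illustration, the function name `predict_all` is hypothetical, and the trailing term e is simply passed in as a number, since how it is obtained is not addressed by this sketch.

```python
from itertools import product

def predict_all(contrib_alpha, contrib_beta, contrib_gamma, e=0.0):
    """Predicted rating C_TOTAL = C_alpha + C_beta + C_gamma + e for every
    member of the all-combinations array."""
    return {
        (x, y, z): contrib_alpha[x] + contrib_beta[y] + contrib_gamma[z] + e
        for x, y, z in product(contrib_alpha, contrib_beta, contrib_gamma)
    }

# Made-up contributions for two candidates per site.
c_alpha = {"A": 0.25, "B": -0.25}
c_beta = {"i": 0.125, "ii": -0.125}
c_gamma = {"1": 0.0, "2": 0.5}

ratings = predict_all(c_alpha, c_beta, c_gamma, e=0.25)
best = max(ratings, key=ratings.get)
print(best, ratings[best])
```

Ranking the predicted ratings in this way gives one possible basis for choosing which unsynthesised compounds to make next.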

The skilled man will appreciate that other functions may be appropriate, in some situations, to calculate the predicted activity rating for a substance. A suitable function may be chosen, according to the application, on the basis either of theoretical considerations or practical considerations. If sufficient data is available, the function might have a number of adjustable parameters that may be automatically selected to provide the best fit to the data. For example, the function might be:

$$C_{TOTAL} = a C_\alpha + b C_\beta + c C_\gamma$$

and where a, b and c are parameters selected to provide the best fit, overall, between predicted activity ratings and actual measured activity values for the synthesised compounds; an error function may be created which may then be minimised to perfect the fit.

A skilled man will have no difficulty in deriving other appropriate parameter-based functions that could be fitted on the basis of the data available, for example using a multi-dimensional least squares approach. A neural network or a rule induction algorithm could also be used.
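As one possible illustration of a least squares fit of the weighting parameters, the following Python sketch minimises the sum of squared differences between the weighted sum of contributions and the measured activities by solving the normal equations. The function name `fit_weights`, the use of Gaussian elimination, and the example data are assumptions made for this sketch; any other fitting procedure could equally be used.

```python
def fit_weights(rows, measured):
    """Least squares fit of weights w so that sum(w[i] * row[i]) approximates measured.

    rows     -- one tuple of contributions (C_alpha, C_beta, C_gamma) per synthesised compound
    measured -- measured activity for each synthesised compound
    """
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    v = [sum(r[i] * y for r, y in zip(rows, measured)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("contributions are linearly dependent; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        v[col], v[pivot] = v[pivot], v[col]
        for row in range(col + 1, k):
            factor = A[row][col] / A[col][col]
            for c in range(col, k):
                A[row][c] -= factor * A[col][c]
            v[row] -= factor * v[col]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        w[i] = (v[i] - sum(A[i][j] * w[j] for j in range(i + 1, k))) / A[i][i]
    return w

# Synthetic check: activities generated from known weights (2.0, 0.5, -1.0).
example_rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (0.5, -0.5, 0.25)]
true_weights = (2.0, 0.5, -1.0)
example_measured = [sum(w * c for w, c in zip(true_weights, r)) for r in example_rows]
fitted = fit_weights(example_rows, example_measured)
```

Because the synthetic activities are generated exactly from the known weights, the fit recovers them to within rounding error; with real measurements the residual error would be non-zero.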

Using a variable function, fitted according to the available data, is a particularly useful approach when the individual substituent group candidates are expected to be either synergistic or antagonistic.

The data and functions may conveniently be embodied within a spreadsheet-based analysis program, and a specific example of such an approach will now be provided. The algorithms and discussion in this section are exemplified for the simple case of 3 sites of variation in a pharmaceutical molecule with m=31 different substituent group candidates at each site, where biological activity is the parameter of interest. The PAD in this case is of the type shown in Table 1, above, and will have m² (i.e. 961) entries. The all-combinations array will have m³ (i.e. 29,791) entries.

A spreadsheet is created with the following pages (or worksheets): -

Page 1 consists of the following columns containing information on the m² compounds synthesized: -

(a) A no. for each compound synthesized (1 to m²)

(b) A reference no. (which refers back to the entire combinatorial array of m³ compounds)

(c) An activity column initially set to 0s

(d) A predicted activity column initially set to 0s

(e) A cell with a function to find the total number of active compounds.

After biological testing, the operator enters a 1 in the activity column for each compound which is active. Alternatively, a quantified measure of activity may be used such as % activity in the range 0 to 100. In this case, the total activity is calculated as the sum of all the activity values and this measure is used as a weight rather than a 0/1 indicator. The algorithms remain essentially the same whichever method is used.

Page 2 consists of the following columns containing information on the entire virtual set of m³ compounds: -

(a) The reference no. for the entire array (1 to m³)

(b) A trivial number used to identify the compound (either 0, or a value between 1 and m² if the compound was synthesized as part of the initial design)

(c) A function (as described above) to calculate the expected activity of each entry based on the calculated activity contribution of each substituent group element for that compound.

Page 3 consists of the following columns containing information on the individual substituent group elements (e.g. monomers) at each substitution site: -

(a) The substituent group candidate label (e.g. α01 to α31, β01 to β31 or γ01 to γ31 for an array with 3 sites of variation and m=31)

(b) A function to calculate the total number of active compounds for each substituent group candidate (or the sum of the activity values entered for quantified activity)

(c) A function (as described above) to calculate the average activity contribution for each substituent group

(d) A function to calculate χ² (Chi²)

χ² for each substituent group is calculated as the measure of statistical significance of the activity contribution: -

$$\chi^2 = \frac{(o - e)^2}{e}$$

where o and e are as previously defined.

To identify large positive and negative differences between observed and expected values, χ² is multiplied by the sign of the difference between observed and expected. Thus, substituent group candidates associated with activity have a large positive χ², while those associated with inactivity have a large negative χ².

Thus the overall equation used is: -

$$\chi^2 = \frac{(o - e)^2}{e} \times \mathrm{sign}(o - e)$$

where:

sign(o-e) = +1 for substituent group candidates with a higher occurrence in active compounds than expected (i.e. higher than the mean);

sign(o-e) = -1 for substituent group candidates with a lower occurrence in active compounds than expected (i.e. lower than the mean);

sign(o-e) = 0 for substituent group candidates with the expected occurrence of actives (i.e. equal to the mean).
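The signed χ² measure might be computed, for instance, as in the short Python sketch below. The function name `signed_chi_squared` and the guard against a non-positive expected value are assumptions of this sketch.

```python
def signed_chi_squared(o, e):
    """Return (o - e)^2 / e multiplied by sign(o - e).

    o -- observed number (or summed activity) of actives containing the candidate
    e -- expected number of actives for the candidate; must be positive
    """
    if e <= 0:
        raise ValueError("expected value e must be positive")
    diff = o - e
    sign = (diff > 0) - (diff < 0)  # +1, -1 or 0
    return sign * diff * diff / e

# Illustrative values only.
enriched = signed_chi_squared(6, 3)   # candidate seen in more actives than expected
depleted = signed_chi_squared(1, 4)   # candidate seen in fewer actives than expected
neutral = signed_chi_squared(2, 2)    # candidate seen exactly as often as expected
```

A candidate with six observed actives against three expected therefore scores +3, while one with a single observed active against four expected scores -2.25.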

Graphs are generated to provide a visual display of substituent group activity contributions and measured versus predicted activity for the compounds synthesized.

Once the activity has been predicted for the all-combinations array (step 150 of Fig. 6), the columns in the spreadsheet can be sorted by descending predicted activity rating. Thus, compounds predicted to be active are clustered at the top of the worksheet and may be readily identified. Those not already synthesised can then be identified and decisions made on which to synthesise. Typically, all compounds predicted to be active, or at least the most likely looking candidates, are synthesised (step 160 of Fig. 6). In deciding which compounds are worthy of synthesis, the operator may take the χ² values into consideration.
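The sorting and selection step could equally be carried out outside a spreadsheet. The following Python sketch ranks the virtual compounds by descending predicted activity rating and returns those not yet synthesised; the function name `rank_unsynthesised`, the `top_n` cut-off and the example ratings are all hypothetical.

```python
def rank_unsynthesised(predictions, synthesised, top_n=10):
    """Return up to top_n compounds not yet synthesised, highest prediction first.

    predictions -- {compound: predicted activity rating}
    synthesised -- set of compounds already made and tested
    """
    candidates = [
        (rating, compound)
        for compound, rating in predictions.items()
        if compound not in synthesised
    ]
    # Sort by rating only, so that compound labels never need to be compared.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [compound for _, compound in candidates[:top_n]]

# Illustrative ratings only.
example_predictions = {
    ("A", "i", "1"): 0.9,
    ("A", "i", "2"): 0.7,
    ("B", "i", "1"): 0.1,
    ("A", "ii", "1"): 0.4,
}
already_made = {("A", "i", "1")}
shortlist = rank_unsynthesised(example_predictions, already_made, top_n=2)
```

Here the highest-rated compound is excluded because it has already been synthesised, and the next two are proposed for synthesis.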

As shown by the arrow 165 in Figure 6, these newly-synthesised candidates are then tested for biological activity at 130. The array is updated, and the calculations repeated using the new information to update the average substituent group contribution values. These new values are then used to recalculate the predicted activity values for those virtual compounds in the all-combinations array that have not yet been synthesised, and the process is repeated.

The iterative procedure can be expanded still further, if desired, as indicated at step 170. If, during the steps 120 to 140 above, certain substituent group candidates are found to have increased activity, the original cluster analysis outlined in connection with step 100 can be consulted once more. The cluster analysis may indicate a group of substances, not included in the previous all-combinations array, clustered around the substituent group candidates found in that previous array to have increased activity. The PAD can then be reconstructed with different substituent group candidates, using the results of the cluster analysis.

Activity data for the newly synthesized compounds can be entered into the model and the substituent group activity contributions and predicted activities can be up-dated.

The usefulness of the method can better be illustrated by comparing it to the "All-Combinations" approach. Typically, it is reasonable to synthesise and test 1000 compounds for biological activity, a task taking about a day with automated synthesis and HTS. In the "All-Combinations" approach, if 1000 compounds with a scaffold and 3 sites of variation are to be made and tested, then one is restricted to 10 substituent group candidates per site. With the PAD described above, it is possible to have 31 candidates per site. (31)² = 961 compounds are synthesised and tested, but the total predicted array size is (31)³ = 29,791 compounds. This assumes that none of the candidates for the three sites of variation are similar, i.e. there are in this example 93 different candidates in total. It is possible to use similar compounds on the different sites, but this will decrease the molecular diversity of the all-combinations array.

For example, if: substituent group site α consists of 31 candidates labelled A to Z, then AA to EE; substituent group site β consists of 31 candidates labelled i to xxxi; and substituent group site γ consists of 31 candidates labelled 1 to 31, then an array can be designed following the above rules:

        A    B    C    D    ...  DD   EE
i       1    2    3    4    ...  30   31
ii      2    3    4    5    ...  31   1
iii     3    4    5    6    ...  1    2
iv      4    5    6    7    ...  2    3
...
xxx     30   31   1    2    ...  28   29
xxxi    31   1    2    3    ...  29   30
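The cyclic pattern of the array above can be reproduced and checked programmatically. In the Python sketch below, sites α, β and γ are each numbered 1 to m, and the γ candidate is obtained by a cyclic shift; the function names `latin_square_pad` and `covers_all_pairs_once` and the numeric labelling are assumptions made for this sketch.

```python
from itertools import combinations

def latin_square_pad(m):
    """Return m*m triples (alpha, beta, gamma), each numbered 1..m.

    gamma follows the cyclic pattern of the array above:
    gamma = ((alpha - 1) + (beta - 1)) mod m + 1
    """
    return [
        (alpha, beta, (alpha - 1 + beta - 1) % m + 1)
        for beta in range(1, m + 1)
        for alpha in range(1, m + 1)
    ]

def covers_all_pairs_once(pad, m):
    """Check that every pair of candidates from two different sites occurs exactly once."""
    if len(pad) != m * m:
        return False
    for s1, s2 in combinations(range(3), 2):
        if len({(c[s1], c[s2]) for c in pad}) != m * m:
            return False
    return True

pad = latin_square_pad(31)
pad_size = len(pad)              # compounds to synthesise
all_combinations_size = 31 ** 3  # size of the all-combinations array
```

For m = 31 the design contains 961 compounds, each α/β, α/γ and β/γ pairing appears exactly once, and predictions can be made for the 29,791 compounds of the all-combinations array.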

If a scaffold with four sites of variation is employed then 961 compounds are synthesised and predictions are made for a total array size of (31)⁴ = 923,521 compounds, using a Graeco-Latin Square similar to that shown in Table 2.

Although the invention has been described in relation to pharmaceutical molecules, it is applicable to various other arrays of chemicals. For example, it would be equally applicable to the development of plastics (where one might wish to find substances having optimal melting temperature or rigidity, for example) or to superconductive materials, where it may be desirable to locate the substance within an array having the optimal critical temperature.

It will be appreciated that exactly the same principles can be applied when the chemical product to be selected is an admixture of chemical compounds or molecules, rather than (as described) a single molecule. In such a case, the candidates will not be substituent groups but rather individual elements of the admixture.
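Returning to the four-site case mentioned above, one well-known way of obtaining a Graeco-Latin Square when m is an odd prime (such as 31) is to superimpose two cyclic Latin Squares with different step sizes. The Python sketch below uses that standard construction purely as an illustration; it is not necessarily the construction underlying Table 2, and the function names and zero-based labelling are assumptions.

```python
from itertools import combinations

def graeco_latin_pad(m):
    """Return m*m quadruples (i, j, (i + j) mod m, (i + 2j) mod m), labels 0..m-1.

    Assumes m is an odd prime, for which the two cyclic squares are orthogonal.
    """
    return [
        (i, j, (i + j) % m, (i + 2 * j) % m)
        for i in range(m)
        for j in range(m)
    ]

def pairings_exactly_once(design, m):
    """Check that every pair of candidates from two different sites occurs exactly once."""
    if len(design) != m * m:
        return False
    sites = len(design[0])
    return all(
        len({(c[s1], c[s2]) for c in design}) == m * m
        for s1, s2 in combinations(range(sites), 2)
    )

design = graeco_latin_pad(31)
design_size = len(design)        # compounds to synthesise
all_combinations_size = 31 ** 4  # size of the four-site all-combinations array
```

With m = 31 the design again contains 961 compounds, all six site-to-site pairings are covered exactly once, and the all-combinations array comprises 923,521 compounds.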

Appendix 1

Symbols Used

r Number of sites at which variation occurs on a scaffold

z Number of scaffolds

n Total number of possible variations at substituent site r

m Number of variations at substituent site r, a designed set where m < n

N Total possible number of compounds that could be made

M Number of compounds selected for synthesis (M<N)

C A set of substituents

C₁, C₂, ... C_r The sets of substituents for the r sites

S A site of variation on a scaffold

S₁, S₂, ... S_r The r sites of variation on a scaffold

m_S1, m_S2, ... m_Sr The number of substituents at each site

m_Sα The number of substituents at the α site

m_Sβ The number of substituents at the β site

p The number of compounds in a designed array

q The number of compounds in an all-combinations array

a The average contribution of a substituent to biological activity

o The observed number of biologically active compounds containing the substituent of interest

e The expected number of compounds for any substituent

t The number of biologically active compounds

C_TOTAL The predicted activity of a compound

C_α The contribution to activity of the substituent at the α site

C_β The contribution to activity of the substituent at the β site

C_γ The contribution to activity of the substituent at the γ site

J The dimension of the sides of a Latin Square

b The number of compounds containing the substituent group candidate for a specific site

Claims

1. A method of designing a chemical substance having a desired physical property, the substance including a plurality r of individual elements, the method comprising:

(a) selecting r (r ≥ 3) sets of candidate elements C₁, C₂... C_r;

(b) generating an all-combinations array of possible substances, each element of the array being representative of a different substance having one element chosen from each of the sets C₁, ...C_r;

(c) defining a sub-array within the all-combinations array, the sub-array being smaller than the all-combinations array but including all possible pairings of candidate elements;

(d) synthesising the possible substances in the sub-array, and measuring the said physical property for each synthesised substance;

(e) using the measured physical properties to predict the characteristics of the possible substances in the all-combinations array which have not been synthesised;

(f) selecting and synthesising further possible substances on the basis of their predicted characteristics, and measuring the said physical property for each synthesised further possible substance;

(g) repeating steps (e) and (f) one or more times until a substance has been synthesised which displays a characteristic sufficiently close to the desired physical property.

2. A method as claimed in claim 1 in which the sub-array is defined by a Latin Square.

3. A method as claimed in claim 1 in which the sub-array is defined by two Latin Squares.

4. A method as claimed in claim 1 in which there are exactly m candidate elements in each of the sets C₁, ...C_r.

5. A method as claimed in claim 1 in which the candidate elements of the sets C₁, ...C_r are mutually exclusive.

6. A method as claimed in claim 1 in which r is 4 or more and in which the sub-array includes all possible triplets of candidate elements.

7. A method as claimed in any one of claims 1 to 6, in which the chemical substance to be designed is an admixture of the candidate elements.

8. A method as claimed in any one of claims 1 to 6 in which the chemical substance to be designed is a molecule.

9. A method as claimed in claim 8 in which the desired physical property is biological activity.

10. A method as claimed in claim 8 or claim 9 in which the molecule comprises a fixed scaffold having r sites of variation, each of the sets C₁, ...C_r being representative of candidate elements for an individual site.

11. A method as claimed in claim 8 or claim 9 in which the molecule includes a variable scaffold, each possible scaffold having a plurality of sites of variation thereon; and in which one of the sets of candidate elements C₁, ...C_r is representative of a set of possible scaffolds.

12. A method as claimed in claim 11 in which there are a plurality of possible scaffolds, each having a plurality of sites of variation for substituent element candidates, and in which the sub-array contains exactly one example of every possible element/element pairing and multiple examples of every possible element/scaffold pairing.

13. A method as claimed in claim 1 in which r is 4 or more and in which the sub-array is defined as a Graeco-Latin Square.

14. A method as claimed in any one of the preceding claims including calculating from the measured physical properties of the synthesised substances property contribution values representative of the respective contribution to the physical property of each of the individual elements.

15. A method as claimed in claim 14 including predicting the characteristics of the possible substances which have not been synthesised in dependence upon the property contribution values.

16. A method as claimed in claim 14 in which the property contribution value for a given candidate element is calculated by summing the measured physical properties of each synthesised substance and subtracting the overall mean.

17. A method as claimed in claim 15 in which the predicted characteristics are calculated by summing the property contribution values for the relevant substance.

18. A method as claimed in claim 15 in which the predicted characteristics are calculated by minimising an error function which characterises an error between the actual and predicted measurements of the property value for those substances that have been synthesised.

19. A method as claimed in any one of the preceding claims including selecting the r sets of candidate elements C₁, ...C_r on the basis of a cluster analysis.

20. A method as claimed in claim 19 when dependent upon claim 14 including selecting a candidate element having a high property contribution value, identifying from the cluster analysis a so far unused element which would be expected to have similar properties, adding the said so far unused element to one of the r sets of candidate elements C₁, ...C_r, either in addition to the existing elements or in substitution for one of them, and repeating steps (b) to (g) of the method one or more times.

21. A method of designing a chemical compound, including:

(a) selecting a scaffold having r sites of variation S₁, S₂,...S_r thereon;

(b) selecting groups of candidates C₁,...C_r for inclusion, respectively, on each of the r sites of variation, such that there are m_S1 candidates for the site S₁, m_S2 candidates for the site S₂, ... m_Sr candidates for the site S_r; and

(c) generating an array of compounds of size p such that all m_Sα candidates on a site of variation S_α (1≤α≤r) are paired at least once with all m_Sβ candidates on any other site of variation S_β (1≤β≤r; α≠β), and such that the array size p is less than q, where q = m_S1 × m_S2 × ... × m_Sr, the size of an array of all possible combinations of the candidates m_S1, m_S2,...m_Sr at their respective sites S₁, S₂,...S_r on the scaffold.

22. A method as claimed in claim 21, in which m_S1 = m_S2 = ... = m_Sr.

23. A method as claimed in claim 21 or claim 22, in which the scaffold selected has exactly 3 sites of variation, S₁, S₂ and S₃.

24. A method as claimed in claim 23, in which the three groups of candidates C₁, C₂, C₃ are mutually exclusive.

25. A method as claimed in claim 24, in which the array of dimension p is a Latin Square wherein all candidates C₁ are paired exactly once with all candidates C₂, all candidates C₁ are paired exactly once with all candidates C₃, and all candidates C₂ are paired once only with all candidates C₃.

26. A method as claimed in claim 25, in which the scaffold selected has exactly 4 sites of variation, S₁, S₂, S₃ and S₄.

27. A method as claimed in claim 26, in which the scaffold includes a plurality of aromatic or heterocyclic rings.

28. A method as claimed in claim 26 or claim 27, in which the four groups of candidates C₁, C₂, C₃ and C₄ are mutually exclusive.

29. A method as claimed in any one of claims 26 - 28, in which the array of dimension p defines a Graeco-Latin Square wherein all candidates C₁ are paired exactly once with all candidates C₂, all candidates C₁ are paired exactly once with all candidates C₃, all candidates C₁ are paired exactly once with all candidates C₄, all candidates C₂ are paired exactly once with all candidates C₃, all candidates C₂ are paired exactly once with all candidates C₄, and all candidates C₃ are paired exactly once with all candidates C₄.

30. A method as claimed in claim 21, in which at least one of the groups of candidates C₁, C₂ ... C_r is of a different size from the other groups of candidates.

31. A method as claimed in claim 30, in which (r-1) of the groups of candidates C₁, C₂ ... C_r are of the same size, the remaining group having a different size.

32. A method as claimed in claim 31, in which the scaffold has 3 sites of variation S₁, S₂ and S₃ and in which m_S1 = m_S2; m_S3 > m_S1, m_S2; and m_S1 and m_S2 are factors of m_S3, the array of dimension p being defined by a Latin Rectangle wherein all candidates C₁ are paired exactly once with all candidates C₂, all candidates C₁ are paired exactly once with all candidates C₃, but all candidates C₂ are paired more than once with all candidates C₃.

33. A method as claimed in claim 31, in which the scaffold has 3 sites of variation S₁, S₂ and S₃ and in which m_S1 = m_S2 and m_S3 < m_S1, m_S2, the array of dimension p being defined by a Latin Rectangle wherein all candidates C₁ are paired exactly once with all candidates C₂, but all candidates C₁ are paired more than once with all candidates C₃, and all candidates C₂ are paired more than once with all candidates C₃.

34. A method of designing a chemical compound, including:

(a) selecting a plurality z of scaffolds each having r sites of variation S₁, S₂,...S_r thereon;

(b) selecting, for each scaffold, groups of candidates C₁ ...C_r for inclusion, respectively, on each of the r sites of variation, such that there are m_S1 candidates for each of the sites S₁; m_S2 candidates for each of the sites S₂; ... m_Sr candidates for each of the sites S_r; and

(c) generating an array of compounds of size p such that all candidates m_Sα on a site of variation S_α (1≤α≤r) are paired at least once with all candidates m_Sβ on any other site of variation S_β (1≤β≤r; α≠β), such that all candidates are paired at least once with each of the z scaffolds, and such that the array size p is less than q, where: q = z × m_S1 × m_S2 × ... × m_Sr, the size of an array of all possible combinations of the candidates m_S1, m_S2,...m_Sr at their respective sites S₁, S₂,...S_r on all of the z scaffolds.

35. A method as claimed in claim 34, in which there are exactly three sites of variation S₁, S₂, S₃ on each of the z scaffolds.

36. A method as claimed in claim 35, in which three groups of candidates C₁, C₂, C₃ are selected for inclusion, respectively, on the 3 sites of variation S₁, S₂, S₃.

37. A method as claimed in any one of claims 21 to 36, further including synthesising and testing each of the compounds in the array of size p for a predetermined property.

38. A method as claimed in claim 37, in which the predetermined property is biological activity.

39. A method as claimed in claim 38, in which the candidates C_s within each group are monomers.

40. A method as claimed in claim 37 further including calculating the contribution to the measured values of the predetermined property by the individual candidates within each group C.

41. A method as claimed in any one of claims 37 to 39 further including predicting the predetermined property of those compounds within an array of size q but not included within the array of size p, based on the results of testing the synthesised compounds within the array of size p.

42. A method as claimed in claim 41 further including synthesising and testing a selection of compounds within an array of size q but not included within the array of size p, based on the predictions of the predetermined property.

43. A method as claimed in claim 41 or claim 42, including selecting further candidates within at least one of the groups based on the predictions of the predetermined property, and generating a further array of compounds.