WO2005087363A1

WO2005087363A1 - Method for carrying out high throughput experiments

Info

Publication number: WO2005087363A1
Application number: PCT/EP2005/001746
Authority: WO
Inventors: Arne Ohrenberg; Bernhard Knab
Original assignee: Bayer Technology Services Gmbh
Priority date: 2004-03-05
Filing date: 2005-02-19
Publication date: 2005-09-22
Also published as: DE102004010808A1; EP1725326A1

Abstract

The present invention concerns a method for carrying out high throughput experiments, characterised by the use of a rapid and targeted evaluation method (e.g. clustering method).

Description

Procedure for carrying out high throughput experiments

The present application relates to a method for carrying out high throughput experiments, characterized by the use of a fast and targeted evaluation method.

In recent years, large investments in High Throughput Experimentation (HTE) have been made worldwide to speed up and improve work processes. In the field of catalysis alone, € 13 billion was invested worldwide in 2001, of which around half is likely to be attributed to the high throughput area. Priority fields of application are: drug research, (heterogeneous and homogeneous) catalysis, material research and identification of optimal reaction conditions in chemical, biochemical or biotechnological systems.

So far, targeted experiment planning and data analysis in high-throughput experiments have been supported by methods of statistical experimental design, as described in

E. Scheffler: Statistical experiment planning and evaluation, 3rd edition. Deutscher Verlag für Grundstoffindustrie, Stuttgart 1997,

or, in the field of heterogeneous catalysis, by evolutionary techniques, as described in

D. Wolf et al.: An evolutionary approach in the combinatorial selection and optimization of catalytic materials. Applied Catalysis A: General, 200 (2000), 63-77.

Further explanations on the use of mathematical methods in high-throughput research are described in

Holzwarth et al.: Combinatorial approaches to heterogeneous catalysis: strategies and perspectives for academic research. Catalysis Today, 67 (2001) 309-318,

J. N. Cawse: Information Based Strategies for Combinatorial and High Throughput Materials Development. Technical Report, GE Research & Development Center. 99CRD166, Feb. 2000,

K. Huang et al.: Artificial neural network-aided design of a multi-component catalyst for methane oxidative coupling. Applied Catalysis A: General 219 (2001) 61-68,

S. Rose: Statistical design and application to combinatorial chemistry. Combinatorial chemistry, reviews, vol. 7 (2), 2002, 133-138.

An overview of suitable mathematical methods can be found in:

M. Berthold, D.J. Hand: Intelligent Data Analysis. Springer, Heidelberg 1999.

In addition, there are software systems such as the "Lead Discovery" extension of the Spotfire software, which provides mathematical methods.

In most high throughput experiments, however, increasingly large amounts of data are generated per run (more than 5 data sets per run). The larger the amount of data, the harder it becomes to evaluate it properly within a reasonable time (less than 0.5 day) and to integrate the results into a suitable experimental strategy. The aim is therefore to develop methods that evaluate the data quickly enough that the information needed for planning further experiments is promptly available to the experimenter.

The known mathematical methods used in high-throughput experiments can be applied to the problems described in the introduction, but they often require considerable familiarization on the part of the user, are time-consuming to apply, or do not sufficiently reduce the dimension of the experimental space (i.e. the number of influencing variables considered).

In addition, most optimization methods are limited to a single target variable (output variable), such as yield, selectivity, certain physical properties, or variables derived from them. Multi-target optimization is hardly possible.

Based on the state of the art, the object was to provide a method for carrying out high-throughput experiments which, by evaluating the data and optimizing the experiment planning on that basis, allows the experiments to be carried out more efficiently with little familiarization and application effort, in particular by reducing the experimental space as strongly as possible. This object is surprisingly achieved by the invention described here.

The invention described below is not limited to use under "real" high throughput conditions, but can generally be used wherever a combinatorial approach in research and development can be identified. "High throughput experiment" or "high throughput process" is therefore to be understood in the context of this application as a "combinatorial procedure when performing experiments".

The invention therefore relates to a method for carrying out high-throughput experiments, characterized by the use of a fast and targeted evaluation method with regard to certain variables in the results, which increases the efficiency of experimentation.

The experiments preferably, but not exclusively, concern, independently of one another:

• the screening / search / optimization of heterogeneous catalysts

• the screening / search / optimization of homogeneous catalysts

• the screening / search / optimization of active substances

• the screening / search / optimization of new materials / material properties

The focus is on high throughput experiments. An evaluation cycle of the method according to the invention should comprise at least 5, preferably at least 10, particularly preferably at least 100, very particularly preferably at least 500 and in particular at least 1000 experiments. Ideally, the evaluation cycle is based on at least 5000 experiments. It is therefore particularly sensible to use technical systems that allow, for example, at least 5, preferably at least 10, particularly preferably at least 100, very particularly preferably at least 500 and in particular at least 1000 experiments, ideally at least 5000 experiments, per day. Alternatively, at least 5, preferably at least 10, particularly preferably at least 100, very particularly preferably at least 500 and in particular at least 1000, ideally at least 5000 experiments carried out over several days can be combined and treated as one evaluation cycle.

The experiments are evaluated on the basis of a target variable (output variable) such as activity, selectivity, effectiveness or predetermined material properties. A prerequisite for applying the method is that frequency statistics can be created for the target variables and that the setting parameters (influencing variables, input variables) of the experiments can be formulated as binary variables.

No comparable evaluation methods are known to date.

In the present invention, on the other hand, the settings of the influencing variables for a successful experiment are determined on the basis of frequency statistics of the influencing variables in relation to the target variables. The method is quick and robust in its application and enables a quick gain of information from large amounts of data per day, on the basis of which, for example, the dimension of the experimental space, and thus the number of experiments, can be reduced. The method is particularly suitable for experiment series that are designed so that the influencing variables are not strongly correlated with one another.

In addition, the evaluation method is able to identify possible interaction effects of influencing variables and to detect possible anomalies in the purity of chemical substances.

This evaluation method enables large amounts of data, e.g. in the high-throughput area, to be evaluated quickly and makes available information that cannot easily be discovered using standard methods. In contrast to conventional methods, the original target quantity (e.g. yield, selectivity, costs, ...) is not itself the only target variable; instead, frequency statistics based on the original target variable are used additionally or exclusively.

The method according to the invention thus represents a possibility for identifying components and compositions or groups of molecules, partial molecules etc. for an optimal catalyst, active substance or optimal material (e.g. polymer, lacquer, plastic). The evaluation method supports the overall process as part of an optimization procedure.

The method according to the invention can be described as follows:

1. Experiments are carried out. The experiments can be carried out in parallel or sequentially. However, at least 5, preferably at least 10, particularly preferably at least 100, very particularly preferably at least 500 and in particular at least 1000, ideally at least 5000, experiments should be carried out per evaluation cycle. The experiments can come from the fields of catalysis, drug discovery, new materials or reaction optimization. The experiments are characterized by the fact that the influencing variables (as a rule, these are at the same time particular input variables) are primarily elements, mixture components, chemical compounds or sub-molecules (functional groups). These influencing variables must be manageable as discrete or binary variables.

Example:

a) A catalyst consists of a maximum of 5 components. The components A, B, C, D, E, F, G, H, I, J are suitable as components. Each component is then either "present" or "not present" in the catalyst.

b) A catalyst consists of a maximum of 5 components. The components A, B, C, D, E are suitable, and each can be present in the concentrations high, medium or low. The discrete influencing variables of the system are then: A_high, A_medium, A_low, B_high, B_medium, B_low, C_high, C_medium ..., and A_high, for example, can then be "present" or "not present".

For processing the data, one of the binary states is encoded, for example, as "0" and the other as "1".
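The binary encoding described in examples a) and b) can be sketched as follows. This is only a minimal illustration: the function names `encode_presence` and `encode_levels`, the underscore naming of the level variables and the sample catalysts are assumptions made here, not part of the method description.

```python
# Component names follow examples a) and b); everything else is illustrative.
COMPONENTS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
LEVELS = ["high", "medium", "low"]


def encode_presence(catalyst, components=COMPONENTS):
    """Example a): 1 if a component is present in the catalyst, else 0."""
    present = set(catalyst)
    unknown = present - set(components)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [1 if c in present else 0 for c in components]


def encode_levels(catalyst, components=("A", "B", "C", "D", "E"), levels=LEVELS):
    """Example b): one binary variable per (component, level) pair.

    `catalyst` maps each component that is present to its concentration level.
    Returns the variable names and the matching 0/1 vector.
    """
    names = [f"{c}_{lvl}" for c in components for lvl in levels]
    active = {f"{c}_{lvl}" for c, lvl in catalyst.items()}
    return names, [1 if n in active else 0 for n in names]


# Hypothetical catalysts used only to show the encoding.
presence_vector = encode_presence(["A", "C", "J"])
level_names, level_vector = encode_levels({"A": "high", "B": "low"})
```

With this encoding, every experiment becomes a row of zeros and ones plus its measured output value, which is the tabular form assumed in step 2.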

2. The experiment data and results (results = target values) are recorded, preferably in tabular form, and processed in accordance with a discrete handling of the starting values.

3. For one or more output variables, frequency statistics are created in the form of "probability profiles". If the discrete influencing variable is denoted EG and the continuous output variable AG, the EG-AG function is defined as follows:

$$\text{EG-AG function}(x) = \frac{(\text{number of experiments with } AG > x)_{EG}}{(\text{number of experiments})_{EG}} \qquad \text{(Eq. 1)}$$

This means: for each influencing variable, the experiments are counted in which this influencing variable is present (e.g. value greater than zero) and the output variable assumes a value > x. It makes sense to choose $x \in [0; x_{max}]$ with $x_{max} \geq$ (measured maximum value of the output variable).

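Eq. (1) represents an inverse empirical distribution function, i.e. one minus the empirical distribution function of the output variable over the experiments containing the influencing variable. Under suitable mathematical conditions, the frequency described by Eq. (1) is a good approximation of the probability.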
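Eq. (1) can be evaluated with a few lines of code. The sketch below assumes that each experiment is stored as a pair (set of influencing variables present, measured output value); this data layout, the function names and the sample values are illustrative assumptions.

```python
def eg_ag_function(experiments, variable, x):
    """Eq. (1): share of the experiments containing `variable` with output > x."""
    outputs = [ag for present, ag in experiments if variable in present]
    if not outputs:
        # Assumption: a variable that was never used gets frequency 0.
        return 0.0
    return sum(1 for ag in outputs if ag > x) / len(outputs)


def eg_ag_curve(experiments, variable, xs):
    """Evaluate the EG-AG function on a grid of x values."""
    return [eg_ag_function(experiments, variable, x) for x in xs]


# Hypothetical experiments: (components present, measured yield).
EXPERIMENTS = [
    ({"A", "B"}, 0.0),
    ({"A", "C"}, 12.0),
    ({"A", "B", "C"}, 30.0),
    ({"B", "C"}, 5.0),
    ({"C"}, 0.0),
]

curve_a = eg_ag_curve(EXPERIMENTS, "A", [0, 12, 30])
```

Because the comparison is strict (AG > x), the curve starts at the share of experiments with a non-zero output and drops to zero once x reaches the largest measured output.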

4. For each influencing variable, an EG-AG function can be plotted as a function of x. Based on the curves, a ranking of the influencing variables can be derived. This ranking enables statements about the importance of the influencing variables for the optimization of the output variable. Different approaches can be used for the ranking:

a) Consideration of the initial values: This enables statements about the probability with which the use or setting of a certain influencing variable delivers any result at all in the output variable. In the case of a catalyst, the initial value provides, for example, an indication of the probability with which the catalyst is active at all when the influencing variable under consideration is used. The ranking is based on the initial values: the higher the initial value, the more important the influencing variable is considered to be,

or

b) Consideration of the maximum values: Here, the influencing variables are evaluated according to their maximum x values; for a catalyst, for example, the ranking is based on the maximum yield values achieved when using the individual influencing variables. In this case, the maximum x value is the largest x value at which the associated EG-AG function is not zero. Influencing variables with a high maximum x value are then rated higher,

or

c) considering a combination of a) and b),

or

d) Consideration of the curve shape: Different influencing variables can have different curve profiles that differ in shape or in absolute values. A ranking could be done in such a way that influencing variables whose curves run largely above the curve of another influencing variable are ranked better than the variable with the lower curve. In other words: the more curves lie wholly or partially below the curve of an influencing variable, the better its ranking,

or

e) Consideration of a combination of a) and d), b) and d), or c) and d).

5. The ranking is taken into account when planning new experiments, in that new experiments are largely carried out using influencing variables that have a good ranking position. The following variants are conceivable:

a) Almost exclusive use of influencing variables that have a good ranking position, which results in a reduction of the dimension of the experimental space, and / or

b) When carrying out experiments, experiments with influencing variables that have a good ranking position are given special consideration, e.g. in their number or by targeted selection.
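The ranking approaches a), b) and d) from step 4 can be sketched as follows. The function `rank`, the tie-breaking by variable name, the reading of "above" as "greater than or equal at every support point" and the sample data are assumptions chosen for illustration; other readings of approach d) are possible.

```python
def eg_ag_curve(outputs, xs):
    """EG-AG curve (Eq. 1) from the outputs of the experiments with one variable."""
    n = len(outputs)
    return [sum(1 for ag in outputs if ag > x) / n for x in xs]


def rank(outputs_by_variable, xs, approach="a"):
    """Rank influencing variables; best first, ties broken by name."""
    curves = {v: eg_ag_curve(o, xs) for v, o in outputs_by_variable.items()}
    if approach == "a":
        # a) initial value: curve value at the first (smallest) x.
        score = {v: c[0] for v, c in curves.items()}
    elif approach == "b":
        # b) maximum x value: the curve is non-zero for every x below the
        # largest measured output, so that output bounds the non-zero range.
        score = {v: max(o) for v, o in outputs_by_variable.items()}
    elif approach == "d":
        # d) curve shape: count the other curves lying wholly on or below this one.
        score = {
            v: sum(
                1
                for w, cw in curves.items()
                if w != v and all(a >= b for a, b in zip(c, cw))
            )
            for v, c in curves.items()
        }
    else:
        raise ValueError(f"unknown approach: {approach}")
    return sorted(score, key=lambda v: (-score[v], v))


# Hypothetical outputs of the experiments that contain each variable.
OUTPUTS = {"A": [0, 12, 30], "B": [0, 20, 5], "C": [12, 30, 5, 0]}
XS = [0, 10, 20, 30]

ranking_a = rank(OUTPUTS, XS, "a")
ranking_b = rank(OUTPUTS, XS, "b")
ranking_d = rank(OUTPUTS, XS, "d")
```

Approach c) could be obtained by sorting on the pair of scores from a) and b); how the two are weighted against each other is left open in the description.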

6. New experiments are planned and carried out taking the ranking into account.

7. If the experimental goal or optimization goal has not been sufficiently achieved, the procedure is repeated from step 2. on the basis of the new experiments, in that

a) the evaluation method is applied only to the new experiments, or

b) the evaluation method is applied to all experiments carried out, or

c) the evaluation method is applied to all experiments (including previous ones) that involve influencing variables that were taken into account in the last runs.
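The reduction of the experimental space in step 5 a) can be made concrete by counting candidate experiments before and after restricting them to well-ranked influencing variables. In this sketch the ranking, the cut-off of six variables and the helper `candidate_experiments` are hypothetical; the limit of five components per catalyst follows example a) in step 1.

```python
from itertools import combinations


def candidate_experiments(variables, max_components=5):
    """All non-empty combinations of at most `max_components` variables."""
    return [
        combo
        for size in range(1, max_components + 1)
        for combo in combinations(variables, size)
    ]


ALL_VARIABLES = list("ABCDEFGHIJ")   # ten components, as in example a)
RANKING = list("CAFBJDEGHI")         # hypothetical ranking from step 4
TOP = RANKING[:6]                    # assumption: keep the six best-ranked

full_space = candidate_experiments(ALL_VARIABLES)
reduced_space = candidate_experiments(TOP)
```

For these numbers the candidate list shrinks from 637 to 62 compositions, which illustrates why even a coarse ranking can save a large share of the experiments in the next cycle.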

Particularly preferred embodiments of the high-throughput method according to the invention result from the following features:

(1) The ranking of the influencing factors resulting from the evaluation procedure will be used directly or in combination with other experiment planning procedures for planning new experiments.

(2) The ranking under 4. is created using classification algorithms, e.g. cluster methods. In the preferred use of cluster methods according to the invention, groups of EG-AG functions that behave similarly and have similar characteristics can particularly preferably be combined. The clustering can also take place, for example, with regard to the shape or the absolute values of the curve profiles, or a combination of these criteria. Suitable cluster methods are: k-means, kNN (nearest neighbor), fuzzy c-means or, in general, hierarchical methods. Other methods are described in

H.-J. Mucha, Cluster analysis with microcomputers, Akademie Verlag, Berlin 1992,

B. S. Everitt, S. Landau, M. Leese, Cluster Analysis, Edward Arnold, 4th ed., 2001.

The clustering can be carried out, for example, by breaking the curves down into support points. The dimension of the cluster space is then the number of support points; the clustering takes place, for example, in a 100-dimensional space if each curve is broken down into 100 support points at the same x values.
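Clustering the curves at common support points, as described in (2), can be sketched with a plain k-means. The implementation below is a simplified illustration: the initialisation with the first k curves, the fixed number of iterations and the four-point sample curves are assumptions, and a library implementation would normally be used instead.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two curves sampled at the same x values."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; centres start at the first k points (for determinism)."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each curve goes to its nearest centre.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Update step: each centre becomes the mean of its member curves.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical EG-AG curves, each sampled at the same four support points.
CURVES = {
    "A": [0.90, 0.70, 0.40, 0.10],
    "B": [0.30, 0.10, 0.00, 0.00],
    "C": [0.85, 0.65, 0.45, 0.10],
    "D": [0.25, 0.15, 0.00, 0.00],
}

labels = dict(zip(CURVES, kmeans(list(CURVES.values()), k=2)))
```

Here A and C end up in one cluster and B and D in the other, so each pair could be treated as one group of similarly behaving influencing variables.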

(3) The influencing variables are grouped on the basis of the EG-AG function according to (2), with a multi-dimensional matrix being taken into account for evaluating and refining the ranking. That is, the influencing variables are additionally evaluated with regard to their influence on the output variables, e.g. by determining the mean target value of all experiments that contain a certain influencing variable. In this way it is also possible to rank the influencing variables on the basis of the output variables. These different rankings of the influencing variables can be compared with each other. When planning new experiments, influencing variables that have been rated well in more than one ranking can then be given special consideration. This can enable an additional reduction of the experimental space.
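The comparison of two rankings described in (3) can be sketched as follows: one ranking uses the mean target value per influencing variable, the other the frequency of a non-zero output (the initial value of the EG-AG function). The function names, the cut-off `top` and the sample data are assumptions made for this illustration.

```python
def outputs_by_variable(experiments):
    """Collect the output values of all experiments containing each variable."""
    collected = {}
    for present, ag in experiments:
        for variable in present:
            collected.setdefault(variable, []).append(ag)
    return collected


def mean_ranking(experiments):
    """Rank variables by the mean output of the experiments containing them."""
    means = {v: sum(o) / len(o) for v, o in outputs_by_variable(experiments).items()}
    return sorted(means, key=lambda v: (-means[v], v))


def frequency_ranking(experiments, x=0.0):
    """Rank variables by the share of their experiments with output > x."""
    freq = {
        v: sum(1 for ag in o if ag > x) / len(o)
        for v, o in outputs_by_variable(experiments).items()
    }
    return sorted(freq, key=lambda v: (-freq[v], v))


def well_rated_in_both(experiments, top=2):
    """Variables that appear among the `top` positions of both rankings."""
    best_mean = set(mean_ranking(experiments)[:top])
    best_freq = set(frequency_ranking(experiments)[:top])
    return sorted(best_mean & best_freq)


# Hypothetical experiments: (components present, measured yield).
EXPERIMENTS = [
    ({"A", "B"}, 0.0),
    ({"A", "C"}, 12.0),
    ({"A", "B", "C"}, 30.0),
    ({"B", "C"}, 5.0),
    ({"C"}, 0.0),
    ({"D"}, 50.0),
    ({"B", "D"}, 0.0),
    ({"D"}, 0.0),
]

preferred = well_rated_in_both(EXPERIMENTS, top=2)
```

In this sample, D leads the mean ranking because of a single high yield but is last in the frequency ranking, so only A is rated well in both.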

(4) The EG-AG function from 3. is considered in the following modified form:

$$\text{EG-AG function}(x) = \frac{(\text{number of experiments with } AG > x)_{EG}}{\text{total number of experiments considered}} \qquad \text{(Eq. 2)}$$

(5) The EG-AG function from 3. is considered in the following modified form:

$$\text{EG-AG function}(x) = \frac{(\text{number of experiments with } AG > x)_{EG,N}}{(\text{number of experiments})_{EG,N}} \qquad \text{(Eq. 3)}$$

Here N specifies the number of influencing variables that are different from zero, i.e. the EG-AG function is set up in such a way that it counts only experiments that contain the influencing variable in question and for which exactly N influencing variables are present or documented. For a catalyst, for example, N corresponds to the number of components in the catalyst mixture. This modification enables a more detailed investigation of the importance of the influencing variables.

(6) The described method is combined with other data analysis methods before step 3. For example, before step 3. the input and output variables can be subjected to a correlation analysis, and only the uncorrelated variables are then taken into account in 3.
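The modified forms of the EG-AG function in embodiments (4) and (5) differ from Eq. (1) only in which experiments are counted. A minimal sketch, assuming the same (set of variables present, output value) layout for each experiment; the function names, the `None` return for an empty selection and the sample data are illustrative assumptions.

```python
def eg_ag_total(experiments, variable, x):
    """Embodiment (4): hits for `variable` divided by ALL experiments considered."""
    hits = sum(1 for present, ag in experiments if variable in present and ag > x)
    return hits / len(experiments)


def eg_ag_fixed_n(experiments, variable, x, n):
    """Embodiment (5): only experiments with `variable` and exactly n variables."""
    outputs = [
        ag
        for present, ag in experiments
        if variable in present and len(present) == n
    ]
    if not outputs:
        # Assumption: no matching experiment means the value is undefined.
        return None
    return sum(1 for ag in outputs if ag > x) / len(outputs)


# Hypothetical experiments: (components present, measured yield).
EXPERIMENTS = [
    ({"A", "B"}, 0.0),
    ({"A", "C"}, 12.0),
    ({"A", "B", "C"}, 30.0),
    ({"B", "C"}, 5.0),
    ({"C"}, 0.0),
]

share_of_all = eg_ag_total(EXPERIMENTS, "A", 0)          # 2 hits out of 5 experiments
two_components = eg_ag_fixed_n(EXPERIMENTS, "A", 0, 2)    # 1 hit out of 2 experiments
three_components = eg_ag_fixed_n(EXPERIMENTS, "A", 0, 3)  # 1 hit out of 1 experiment
```

The first form also reflects how often a variable was used at all, whereas the second separates the curves by the number N of components, which is what makes differences between mixtures of different size visible.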

By using the evaluation methods according to the invention, the method according to the invention enables rapid identification of important influencing variables when carrying out experiments and thus an increase in efficiency and possibly a reduction in experimental outlay and in "time to market". At the same time, through the ranking, this evaluation method enables a comparison of the influencing variables with one another; this comparison can be used as a basis for structure-property relationships, especially in catalysis or drug discovery. In the catalyst search, for example, components can be identified which may be "guarantors" of activity or which fundamentally allow no activity.

Among other things, the graphical representation of the evaluation functions (EG-AG functions) makes it possible to detect anomalies or incorrect experimental settings. With simple frequency statistics of catalyst components, for example, the EG-AG functions often have similar curve profiles in all the variants mentioned in the process and embodiment descriptions (1.-7. and (1)-(6)). In such cases, large deviations of individual components indicate impurities, for example.
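The detection of deviating curves can also be automated in a simple way. The sketch below compares each curve with the pointwise median of all curves and flags large deviations; the median as reference, the threshold of 0.2 and the sample curves are assumptions made here, since the description only speaks of "large deviations".

```python
from statistics import median


def flag_deviating_curves(curves, threshold=0.2):
    """Return the variables whose curve deviates strongly from the median curve.

    `curves` maps each variable to its EG-AG values at common support points.
    """
    reference = [median(column) for column in zip(*curves.values())]
    deviation = {
        name: max(abs(value - ref) for value, ref in zip(curve, reference))
        for name, curve in curves.items()
    }
    return sorted(name for name, d in deviation.items() if d > threshold)


# Hypothetical curves: A, B and C behave similarly, E does not.
CURVES = {
    "A": [0.80, 0.50, 0.20, 0.00],
    "B": [0.75, 0.55, 0.20, 0.00],
    "C": [0.80, 0.50, 0.25, 0.00],
    "E": [0.30, 0.10, 0.00, 0.00],
}

suspicious = flag_deviating_curves(CURVES)
```

A flagged component is only a hint; whether an impurity or an incorrect setting is the cause still has to be checked experimentally.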

In addition, the combination with the curve clustering enables quick identification of influencing variables that behave similarly, which in relation to catalyst components can make it easier to replace expensive components with less expensive ones.

The method according to embodiment (5) can also be used to reveal interaction effects of the influencing variables or to discover a limit on the number of influencing variables if, for example, the curve profiles are better in terms of absolute values for N < Nmax than for N = Nmax (Nmax: maximum number of influencing variables different from zero).

In addition to the process and embodiment descriptions (1.-7. and (1)-(6)) set out above, the following embodiments of the evaluation method and thus of the method according to the invention are particularly suitable:

I. The data collection is generally computer-based. Manual entry is also possible. The EG-AG functions are generally calculated with the aid of a computer. A manual calculation is also possible.

II. The calculation of the EG-AG functions is generally computer-based. A manual calculation is also possible.

III. The data input for calculating the EG-AG functions and the data preparation can be computer-based. Manual processing is also possible.

IV. The ranking of influencing factors can be computer-based. It is also possible to carry out the ranking manually.

V. Steps 2., 3. and 4. of the method description can run entirely or partially with the aid of a computer.

VI. Steps 2., 3. and 4. of the method description can be wholly or partially implemented in a computer program and run automatically one after the other.

VII. The method described in the method description can be completely implemented in a system for test planning, so that only one computer unit is used in the method.

VIII. The process described in the process description can also be used as a test planning process as part of a high-throughput process.

IX. The method described in the method description can be combined with cluster methods as shown in embodiment (2). The clustering can be computer-aided or manual, or, analogously to embodiments V. and VI., it can take place completely or partially with computer support in combination with the data preparation, the calculation of the EG-AG functions and the ranking.

Particularly noteworthy are the advantages offered by applying simple counting statistics to experimental factors.

Figure 1: possible embodiment according to the process and embodiment description (steps 1.-7.)

Figure 2: possible embodiment of the process and embodiment description (steps 1.-7.), taking into account a graphic representation of the EG-AG functions

Figure 3: possible embodiment of the process and embodiment description (steps 1.-7.), taking into account embodiment (2)

Figure 4: possible embodiment of the process and embodiment description (steps 1.-7.), in which the experiment planning results directly from the ranking, possibly taking into account (2) or (3)

Figure 5: possible embodiment of the process and embodiment description (steps 1.-7.), taking into account (2), the method being combined directly with an experiment planning tool according to (1). The acquired experimental data can go directly into the experiment planning system. The experiment planning tool can also take into account the information that results from applying the method according to steps 1.-7., including (2).

Claims

1. A method for carrying out high throughput experiments, characterized in that a special result evaluation method is used to optimize the experiment planning and execution, in which evaluation method the influencing variables are subjected to frequency statistics as a function of the output variable, resulting in a ranking of the influencing variables from which optimized experiment planning can be derived.

2. The method according to claim 1, wherein clustering methods are used in the evaluation.

3. The method according to claim 1, wherein the influencing variables are ranked both via the frequency distribution and via the target variables.

4. The method according to claim 2, wherein the influencing variables are ranked both via the clustering of the frequency distribution and via the target variables.

5. The method according to any one of claims 1 to 4, wherein the evaluation method is carried out with the aid of a computer.

6. The method according to any one of claims 1 to 5, wherein as many method steps as possible are carried out with the aid of a computer.