WO2009089890A1

WO2009089890A1 - Device and method for determining a pharmaceutical activity of a molecule

Info

Publication number: WO2009089890A1
Application number: PCT/EP2008/010779
Authority: WO
Inventors: Tamas Horvath; Thomas Gärtner; Stefan Wrobel
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2008-01-18
Filing date: 2008-12-17
Publication date: 2009-07-23
Also published as: DE102008005062B4; DE102008005062A1; EP2232395A1

Abstract

A device for determining a pharmaceutical activity of a molecule (M) has a unit (110) for determining atomic structures occurring in the molecule, a unit (120) for assigning a feature index (MI), a unit (130) for ascertaining a feature vector (MV), and a unit (140) for determining an association. The unit (120) assigns the feature index (MI) to one of the occurring atomic structures in the molecule (M) as a function of the particular atomic structure and a vicinity of the particular atomic structure in the molecule (M). The unit (130) ascertains the feature vector (MV) for the molecule (M) as a function of the assigned feature index (MI), wherein said feature vector (MV) points to a point in a feature space (MR), and wherein said feature space (MR) has a first domain (A), which corresponds to pharmaceutically active molecules, and a second domain (B), which corresponds to pharmaceutically inactive molecules. The unit (140) determines the association of the point with the first domain (A) or the second domain (B).

Description

Apparatus and method for determining a pharmaceutical activity of a molecule

The present invention relates to an apparatus and method for determining a pharmaceutical activity of a molecule, and more particularly to an atomic cycle tree molecular fingerprint (ACT).

Studies of graph-structured objects, as used for example in biology, the World Wide Web (WWW), and a number of other fields, have attracted considerable interest in the recent past. This includes, for example, data collections in graph-based databases, in which certain events can be represented by special graphs and which also make it possible to predict the occurrence of such events. An example of an event would be a desired pharmaceutical activity of a molecule. Some methods that perform very well in terms of classification reliability are based on or use the so-called support vector machine. To limit the computational complexity of these methods, which are described, for example, in V. Vapnik: "Statistical Learning Theory," John Wiley, 1998, kernel functions based on frequently occurring patterns may be used. Such kernel functions are disadvantageous, however, in that their predictive power is often insufficient.

While most applications focus on finding data in a data network in which certain events are represented by vertices of a single massive network graph, in other applications each event may itself be represented by a graph. Examples are applications involving molecules, since each molecule itself consists of a number of atoms (= vertices in a graph), which in turn are linked to other atoms. In such applications, each vertex and each bond (edge) is usually assigned a label specifying, for example, the atom type and the bond type.

The pharmaceutical field may be mentioned as an example of such chemical applications. Given the diversity of chemicals and compounds available today, it is extremely important to be able to estimate the activity of a particular molecule in advance of specific biological studies. Otherwise, identifying new chemical constituents that could be developed into new drugs would require an extremely large number of experimental studies on a very large number of compounds. This is particularly true because pharmaceutical activity depends not only on the presence of certain molecules but also on the combination of certain molecules. It is not uncommon for databases of pharmaceutical compounds and sample libraries to contain several million molecules.

For this reason, chemoinformatic methods have found increasing use in order to accelerate the identification of promising candidates while at the same time reducing the scope of biological screening studies. For example, a large number of samples can be preselected by computer, so that promising candidate compounds are identified in advance. The design of efficient algorithms for screening virtual (chemical) compounds and for other chemoinformatic applications has become an integral part of computer-aided drug development. An overview of the state of the art in this field can be found, for example, in B. A. Bunin, B. Siesel, G. Morales, and J. Bajorath: "Chemoinformatics: Theory, Practice, & Products", Springer, 2007.

A disadvantage of the known art is that the methods shown there have only very limited predictive power regarding the activity of molecules, which, given the considerable number of available molecules, means an enormous overhead in biological tests. To provide a method that is as efficient as possible, it is therefore of great importance to identify a set of features characterizing those molecules, among the candidates for the drugs to be designed, for which pharmaceutical activity can be expected.

Based on this prior art, the present invention has the object of providing an apparatus and a method for determining a pharmaceutical activity of a molecule whose predictive ability is significantly increased, so that the cost of performing biological tests is significantly reduced.

This object is achieved by a device according to claim 1 and a method according to claim 15.

The present invention is based on the finding that the pharmaceutical activity of a molecule can be determined from the atomic structures that form the molecule, taking into account not only the atoms themselves but in particular also their neighboring atoms. The atomic structures can thus comprise individual atoms as well as groups of atoms, and the molecule can be represented by the totality of the atomic structures occurring in it. To determine the pharmaceutical activity, the atomic structures of a molecule, or their characteristics, are combined in a feature vector. The feature vector can then be examined, for example using a support vector machine, with regard to the expected pharmaceutical activity.

The examination is preferably carried out by means of graphs, a graph having vertices and edges (connecting lines between vertices). For the present application, a graph is assigned to each molecule, the vertices representing the atoms and the edges representing the (chemical) bonds. The atoms or atom types can be marked by labels on the vertices.

In addition to the individual atoms and their neighboring atoms in the molecule, the graph of the molecule is optionally examined with regard to how many and what kinds of cycles (closed loops in a graph) it contains and by which bridges the cycles are interconnected. The cycles thus describe closed paths along edges of a graph that do not intersect themselves. The cycles and bridges that occur can in turn be associated with corresponding labels, which are recorded as further components in the feature vector. In further embodiments, the type of bond between the atoms or between the cycles may also be included as a feature in the feature vector.

This finding can be implemented in embodiments of the present invention as follows. A device for determining a pharmaceutical activity of a molecule first has means for determining atomic structures occurring in the molecule. Furthermore, the device comprises means for assigning a feature index, wherein the feature index is assigned to one of the atomic structures occurring in the molecule depending on the respective atomic structure and on the neighborhood of the respective atomic structure in the molecule. The device also includes means for determining a feature vector for the molecule, wherein the feature vector depends on the assigned feature index and points to a point in a feature space, the feature space having a first domain, which corresponds to pharmaceutically active molecules, and a second domain, which corresponds to pharmaceutically inactive molecules. Finally, the device has means for determining whether the point belongs to the first domain or the second domain.

Further embodiments describe the feature vector as a binary vector whose components signal either the presence or the absence of a feature. For example, one value of a component of the feature vector may correspond to the presence of a particular feature (e.g., a particular atom type such as hydrogen) and a different value to the absence of that feature. It is also possible to introduce a multiplicity into the feature vector, which denotes, for example, how many times a particular feature (e.g., an atomic structure such as a cycle) occurs in the graph at hand.
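The two encodings mentioned here (presence/absence versus multiplicity) can be illustrated with a minimal Python sketch. The feature names, their order, and the example molecule are assumptions made purely for illustration and are not taken from the application.

```python
from collections import Counter

# Hypothetical, fixed ordering of features (an assumption for this sketch).
FEATURES = ["atom:H", "atom:C", "atom:O", "cycle:C6"]

def binary_vector(found):
    """1 if the feature occurs at least once in the molecule, else 0."""
    present = set(found)
    return [1 if f in present else 0 for f in FEATURES]

def count_vector(found):
    """Number of occurrences of each feature (the multiplicity)."""
    counts = Counter(found)
    return [counts[f] for f in FEATURES]

# Features detected in an imaginary molecule (illustrative data only).
found = ["atom:C", "atom:C", "atom:H", "cycle:C6", "cycle:C6"]
binary = binary_vector(found)
counts = count_vector(found)
```

The binary form only records that a six-membered cycle is present, whereas the count form also records that it occurs twice.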

To better understand the approach, it is important to analyze and describe the two-dimensional graph structure precisely. The explanations here assume that the graphs are planar, i.e. that the molecules can be represented by graphs in a two-dimensional plane (without overlaps). This assumption is made for simplicity, but it generally need not hold.

The algorithm is based on the two-dimensional graph structure given by the atoms and bonds of the chemical compound that constitutes the molecule. The atom cycle tree (ACT) molecular fingerprinting method described below yields a ranking of the chemical compounds (molecules) in terms of their expected pharmaceutical activity. The pharmaceutical activity may relate to a change in a clinical picture with regard to a disease to be treated, or to the cosmetic sector (influencing or altering biological tissue).

The algorithm takes as input two separate sets of chemical compounds:

1. a set A, which contains a small number of molecules known to be active against a disease, and

2. a set U, which contains molecules whose activity is unknown.

As a result, the algorithm provides a real-valued function that predicts the activity of the molecules belonging to the set U. The value of the function indicates how likely it is that a molecule is active: the higher the function value, the higher the likelihood that the compound will be active against the disease.

A method according to the invention comprises the following steps:

(i) For each chemical compound M from the union of A and U (A ∪ U), an undirected graph G_M representing the atomic bond structure of M is computed as follows. For each atom a of M, the corresponding vertex v_a of G_M is labeled with a pair (L1, L2), where L1 is the atom type of a and L2 is the collection (multiset) of the atom types of the atoms adjacent to a. The union of all these labels, computed over every molecule in the union of A and U, is denoted F_ATOM. Further, each edge E in the graph G_M may be given a further label specifying the type of connection of E (for example, the atomic bond present).
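Step (i) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: atoms are given as a dictionary of identifiers to element symbols, bonds as pairs of identifiers, the neighborhood label is stored as a sorted tuple (one possible canonical form of a multiset), and bond types are ignored. The function name `atom_labels` and the example molecule are hypothetical.

```python
def atom_labels(atoms, bonds):
    """Return {atom_id: (L1, L2)}: L1 is the atom type, L2 the sorted tuple
    of neighbor atom types (assumed canonical form of the multiset)."""
    neighbors = {a: [] for a in atoms}
    for u, v in bonds:            # undirected graph: record both directions
        neighbors[u].append(v)
        neighbors[v].append(u)
    return {
        a: (atoms[a], tuple(sorted(atoms[n] for n in neighbors[a])))
        for a in atoms
    }

# Hypothetical molecule: one carbon bonded to one oxygen and two hydrogens
# (bond orders are ignored in this sketch).
atoms = {0: "C", 1: "O", 2: "H", 3: "H"}
bonds = [(0, 1), (0, 2), (0, 3)]
labels = atom_labels(atoms, bonds)
f_atom = set(labels.values())     # this molecule's contribution to F_ATOM
```

Both hydrogen atoms receive the same label, so this molecule contributes three distinct labels to F_ATOM.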

(ii) For each labeled graph computed under (i), a set of two-connected components and a set of bridges can then be formed. Two-connected components are subgraphs formed by edges that belong to cycles. Bridges, by contrast, are subgraphs formed by edges that do not belong to any cycle. The cycles of the two-connected components are enumerated, and each cycle is assigned a string that is unique up to isomorphism. The string corresponds, for example, to a sequence of labels identifying the cycle (number and type of atoms, bond type, etc.). The set of these strings thus represents the set of cycles of the molecules in the union of A and U and is denoted by F_CYCLE.
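One simple way to obtain a string that is the same for every representation of the same cycle is to take the smallest string over all rotations and both traversal directions. The following Python sketch illustrates that idea only; it is an assumption of this example, not necessarily the encoding used in the cited method, and it considers vertex labels only (bond labels are left out).

```python
def canonical_cycle_string(labels):
    """Smallest string over all rotations of the cycle in both directions."""
    n = len(labels)
    candidates = []
    for seq in (labels, labels[::-1]):        # forward and reversed traversal
        for i in range(n):                    # every possible starting atom
            candidates.append("-".join(seq[i:] + seq[:i]))
    return min(candidates)

# The same hypothetical five-membered ring, read from different starting
# atoms and in opposite directions.
a = canonical_cycle_string(["C", "N", "C", "C", "O"])
b = canonical_cycle_string(["C", "C", "N", "C", "O"])
# A different ring for comparison.
c = canonical_cycle_string(["C", "C", "C", "N", "O"])
```

Here `a` and `b` are equal because both lists describe the same ring, while `c` differs.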

The set of bridges is also referred to as a forest (i.e. a disjoint union of trees). As with the cycles, each tree in the forest is assigned a string that is unique up to isomorphism, and the set of strings assigned to the trees of the molecules in the union of A and U is denoted F_TREE. How F_CYCLE and F_TREE can be calculated for general graphs is described, for example, in T. Horvath, T. Gärtner and S. Wrobel: "Cyclic pattern kernels for predictive graph mining", in Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 158-167, 2004.

Using a non-empty subset of {F_ATOM, F_CYCLE, F_TREE} and forming the union F of the sets belonging to this subset, the following set can be calculated for each molecule M in the union of A and U:

F_M = {f ∈ F : f represents a subgraph of M}.

Since F_M is a set, it can be understood as a Boolean vector in a high-dimensional space, which is also referred to as the feature space corresponding to F.
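The correspondence between the set of features of a molecule and a Boolean vector can be made concrete with a short Python sketch. The feature strings and the fixed (sorted) ordering of F are assumptions chosen for illustration.

```python
def boolean_vector(feature_list, molecule_features):
    """Component i is 1 exactly when feature_list[i] occurs in the molecule."""
    return [1 if f in molecule_features else 0 for f in feature_list]

# F: union of the chosen feature sets, in a fixed (here: sorted) order.
# All feature strings are hypothetical.
F = sorted({"atom:(C,(C,C,O))", "atom:(O,(C,))", "cycle:C-C-C-C-C-C", "tree:C-O"})
F_M = {"atom:(O,(C,))", "tree:C-O"}       # features found in one molecule M
vec = boolean_vector(F, F_M)
```

The dimension of the vector equals the number of elements of F, independent of how many features the individual molecule has.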

(iii) Using standard techniques such as a support vector machine, which belongs to the kernel methods, a hypersurface can then be calculated in the feature space described above that separates the active from the inactive compounds of A.

(iv) Finally, the method yields a function f: U → ℝ (ℝ being the set of all real numbers) which gives the distance of F_M (for each molecule M ∈ U) from the hypersurface described above, the sign of f indicating the side of the hypersurface. One sign corresponds to a feature vector (of a molecule) pointing into the half-space of the feature space that contains the active training examples, while the other sign corresponds to a feature vector pointing into the region of the inactive training examples.

For the problem to be solved, i.e. selecting the N most promising candidates from the molecules of the set U for possible in vitro tests (biological tests), the prediction function f can be used as follows. The value of f gives a prediction of activity for each molecule of the set U. By comparing the function values of different molecules, the N molecules with the highest prediction values are determined. Thus, f identifies those molecules for which pharmaceutical activity is most likely to be expected. Accordingly, a ranking of molecules can be established, and possible in vitro assays can be performed in rank order, so that biological assays are initially performed only on promising molecules, while molecules for which the simulation provides no evidence of pharmaceutical activity are neglected.
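The selection of the N most promising candidates can be sketched as a simple sort by prediction value. In the following Python fragment the prediction values are invented for illustration, and the helper name `top_n` is hypothetical; in practice f would come from the trained classifier of step (iii).

```python
def top_n(candidates, f, n):
    """Return the n candidates with the largest prediction value f(M)."""
    return sorted(candidates, key=f, reverse=True)[:n]

# Hypothetical prediction values for four molecules of the set U.
scores = {"M1": -0.7, "M2": 1.3, "M3": 0.2, "M4": -0.1}
ranked = top_n(list(scores), scores.get, 2)
```

With N = 2, only M2 and M3 would be passed on to in vitro testing, in that order.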

Accordingly, embodiments of the present invention provide an apparatus and a method for screening a large number of molecules and sorting them into promising and less promising candidates, so that resources are not wasted on tests that are unlikely to succeed. Given that the number of potential molecules or constituents that can be used in medicines can clearly exceed one million, it is particularly important to perform in vitro tests only on molecules or chemical compounds for which significant pharmaceutical activity can be expected, and to disregard all molecules that give no indication of pharmaceutical activity.

Embodiments of the present invention will be explained below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic representation of an embodiment of the present invention;

Fig. 2 is a flowchart for determining the feature vector;

Fig. 3 shows an example of a feature vector constructed from binary components;

Fig. 4 shows a part of a feature vector identifying an atom and its neighbors;

Figs. 5A, 5B show a representation of the feature space with different molecules separated by a domain boundary;

Figs. 6A, 6B show examples of undirected graphs; and

Fig. 7 shows an example of a representation of a chemical compound in the form of a graph.

With regard to the following description, it should be noted that in the different embodiments, the same or functionally equivalent functional elements have the same reference numerals and thus the descriptions of these functional elements in the various embodiments are interchangeable.

FIG. 1 shows a device for determining a pharmaceutical activity of a molecule. The device comprises means 110 for determining atomic structures of a molecule and means 120 for assigning a feature index MI, the feature index MI depending on the atomic structure and its neighborhood. Furthermore, the device has means 130 for determining a point, the point being part of a feature space MR for the molecule and depending on the assigned feature index MI. Finally, the device has means 140 for determining whether the point belongs to a domain of pharmaceutically active molecules.

The point in the feature space MR thus characterizes a set of features of the corresponding molecule, wherein a feature vector MV, whose components signal the presence or absence of a certain feature, points to this point in the feature space MR. The feature space MR is often a high-dimensional space whose dimension depends on the number of features used to characterize the molecules (e.g. chemical elements of the individual atoms, chemical bonds between the atoms, type and number of cycles and bridges, etc.).

In further embodiments, the means 130 for determining first uses molecules whose pharmaceutical activity (e.g. with respect to a disease) is known in order to determine a first domain in the feature space MR, the first domain corresponding to the feature vectors of molecules whose pharmaceutical activity has been demonstrated. Furthermore, molecules known to have no pharmaceutical activity can be used to identify a second domain in the feature space MR, such that feature vectors MV pointing into the second domain correspond to molecules that have no pharmaceutical activity. Following this learning process, a feature vector (with the same number of components) can be formed for an unknown molecule, and it can then be determined whether or not pharmaceutical activity is to be expected for the unknown molecule, depending on whether the feature vector MV points into the first or the second domain.

FIG. 2 shows an exemplary embodiment for determining a feature vector MV for a molecule. In the illustrated flowchart, the atomic structures of the molecule are detected first. The atomic structures are, on the one hand, the individual atoms (chemical elements) that make up the molecule. Other atomic structures comprise several atoms that are combined into a structure by chemical bonding. Examples are cycles, bridges (which can, for example, connect cycles), or other groups of atoms that occur frequently in molecules.

The detected atomic structures are processed one after another in a subsequent step, i.e. for one of the atomic structures it is first examined whether this atomic structure is already known, i.e. whether it has already been assigned a feature index MI. A feature index MI refers to a specific component of the feature vector MV to which a specific feature (here: an atomic structure) has been assigned. For example, the presence of a carbon atom may be indicated by a particular value in a particular component of the feature vector MV. If the atomic structure is already known, the next step is therefore to set the corresponding component in the feature vector MV, which can be done, for example, by setting a flag or assigning a predetermined value.

If the atomic structure is not yet known (for example, a cycle has occurred that has not yet been coded), a new feature index MI is assigned and added to the feature vector MV. In this way the feature vector MV successively acquires more components, the additional components corresponding to additional structures in the molecule. If, for example, a certain cycle occurs, which may comprise six carbon atoms, and no such cycle has been indexed so far, the feature vector MV is extended by a further component that signals the presence of such a cycle.

When this is done, the next atomic structure is processed, i.e. it is checked whether further atomic structures are present. If so, it is again first determined whether the further atomic structure is already known; if it is, the corresponding feature index MI is set, and if not, a new feature index MI is assigned. This procedure is carried out successively until all atomic structures present in the molecule have been indexed, so that the components of the feature vector MV that correspond to these atomic structures have, for example, a predetermined value. If no further atomic structures are present, the algorithm terminates.
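The loop of FIG. 2 can be sketched in Python with a dictionary that maps each known atomic structure to its feature index and grows whenever an unknown structure appears. This is a minimal sketch: the structure names, the helper names `index_structures` and `to_vector`, and the choice to build the vectors only after all molecules have been seen are assumptions of this example.

```python
def index_structures(structures, feature_index):
    """Return the set of component indices to set for one molecule.
    feature_index (structure -> component number) grows as needed."""
    components = set()
    for s in structures:
        if s not in feature_index:                  # structure not yet known
            feature_index[s] = len(feature_index)   # assign next free index
        components.add(feature_index[s])            # set the component
    return components

def to_vector(components, dimension):
    """Binary vector of the given length with the listed components set."""
    return [1 if i in components else 0 for i in range(dimension)]

feature_index = {}                                  # shared across molecules
m1 = index_structures(["C", "H", "cycle:C6"], feature_index)
m2 = index_structures(["C", "O"], feature_index)    # hypothetical molecules
D = len(feature_index)       # dimension after all molecules have been seen
v1, v2 = to_vector(m1, D), to_vector(m2, D)
```

Building the vectors only after every molecule has been indexed gives all vectors the same length D, so their components refer to the same features.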

In addition to the indexing of the atomic structures present in the molecule, and thus the setting of components of the feature vector MV, the atomic structures adjacent to each atomic structure are indexed as well. The scheme is analogous to the indexing of the atomic structures themselves: for each atomic structure, the neighboring atomic structures are examined to see whether they are already known; if so, the corresponding index is set in the feature vector MV, and if not, a new index is added. This ensures that the feature vector MV indicates not only the atomic structures themselves but also their neighborhoods. To allow a meaningful comparison of different molecules, the feature vectors should have the same length and their components should refer to the same features. It may therefore be useful first to examine all molecules with regard to their features and to set up a feature vector with a sufficient number of components, whose values are then determined for each molecule.

FIG. 3 shows an example of a feature vector MV with D components, D also being the dimension of the feature space MR. The feature vector MV here describes, for example, an atomic structure A, an atomic structure B and an atomic structure C, where the atomic structures A and B do not occur in the molecule described by the feature vector MV of FIG. 3, whereas the atomic structure C is present. If, as shown in FIG. 2, the algorithm finds an atomic structure for which no component has yet been assigned in the feature vector MV, then, as described above, a new feature index MI is assigned and added to the feature vector MV. As an example, an atomic structure Z is shown which has not been indexed in the feature vector MV so far and is consequently added to it. As a result, the dimension of the feature vector MV increases by one. It is thus a successive process in which the dimension D of the feature vector MV grows until all atomic structures (including their neighborhoods) occurring in the molecules can be indexed by setting the appropriate components of the feature vector MV.

In the embodiment of FIG. 3, the feature vector MV is shown, for illustrative purposes only, as a binary vector consisting of 0 components and 1 components. In other embodiments, the presence of a particular feature (= a particular atomic structure) may be signaled by the corresponding component (e.g. 3) having a predetermined value, whereas the absence of the feature can be signaled by any other value or, as shown in FIG. 3, by assigning the value zero to the components concerned.

FIG. 4 shows a feature vector MV in which, in addition to the atomic structure itself (here an atom), the neighborhood of the atomic structure is also indicated. The atom corresponds to a vertex in the graph, and the vertex is assigned two labels: a first label L1, which describes the atomic structure itself, and a second label L2, which describes the neighborhood of the atomic structure. In the example shown in FIG. 4, for the sake of simplicity, only three chemical elements are used for the molecules, e.g. hydrogen H, carbon C and oxygen O, so that the presence of a particular chemical element (= a specific atomic structure) can be indexed by a sequence of three components: for example, (1, 0, 0) if the vertex is a hydrogen atom, (0, 1, 0) for a carbon atom, or (0, 0, 1) for an oxygen atom.

The pair (L1, L2) can also be coded differently, e.g. with a single number (e.g. a positive integer). For example, the atom C with the three neighbors C, C, O may be labeled with the pair (C, {C, C, O}) and encoded with a single number (e.g., 142). In the feature vector MV of a molecule, MV[142] = 1 (i.e. the 142nd component is 1) if the molecule has an atom labeled (C, {C, C, O}) (i.e. an atom C with three neighbors C, C and O).

Accordingly, the part of the feature vector MV describing the given atomic structure may look as follows. First comes the index L1, which describes the atomic structure itself and is coded by the sequence (0, 1, 0), i.e. it is a carbon atom. This is followed by the index L2, which in this example has the sequence of numbers 1, 0, 0, 1, 0, 0, ... It is therefore an atomic structure formed by a carbon atom chemically bonded to two adjacent hydrogen atoms. Thus, by continuing the feature vector MV and adding further components, a complex molecule can be described by a binary vector (a string having, for example, "0" and "1" components).

FIG. 5A shows an example of a feature space MR which, for simplicity, is drawn with only two dimensions. As previously described, the dimension D of the feature space MR is generally very large (often greater than 1000 or greater than 100,000) and is essentially determined by the complexity of the molecules used. The feature space MR has a first domain A and a second domain B, which are separated by a domain boundary H (= hypersurface in the feature space MR). For example, the first domain A contains points in the feature space MR that correspond to pharmaceutically active molecules, and the second domain B contains points that correspond to pharmaceutically inactive molecules. In FIG. 5A, for example, the first domain A has five points corresponding to five pharmaceutically active molecules (a1, a2, a3, a4, a5), and the second domain B has four points corresponding to pharmaceutically inactive molecules (b1, b2, b3, b4).

The domain boundary H can be chosen by first considering a set of molecules whose pharmaceutical activity is known, i.e. which are either pharmaceutically active or have been proven to be pharmaceutically inactive. For these known molecules, feature vectors MV are set up as described above; they correspond to points in the feature space MR, which are shown in FIG. 5A as circles for pharmaceutically active molecules and as crosses for pharmaceutically inactive molecules.

The first domain A and the second domain B are separated by the domain boundary H, which is preferably chosen such that its distance to the points in the feature space MR whose pharmaceutical activity is known is as large as possible (maximum distance), i.e. the distance to the domain boundary H indicates the degree of pharmaceutical activity. For example, the molecule a1 shows a lower pharmaceutical activity than the molecule a2, which is farther from the domain boundary H than the molecule a1. The degree of activity can be determined, for example, by in vitro tests, i.e. by evaluating, over series of measurements, how often the result regarding activity was positive or negative.

The distance to the domain boundary H is the minimum distance and can be taken, for example, as the length of the vector that is parallel to a surface normal of the domain boundary H and passes through the point in the feature space MR (e.g. a1). It should also be noted that the domain boundary H is generally a hypersurface in a high-dimensional feature space MR and may also be thought of as a domain wall separating the pharmaceutically active domain from the pharmaceutically inactive domain. At the domain boundary H itself, the pharmaceutical activity is therefore unclear or indeterminate.

After the domain boundary H has been formed on the basis of training examples (molecules whose pharmaceutical activity is known), the pharmaceutical activity of candidates whose expected pharmaceutical activity is of interest can be examined in a subsequent process. The distance to the domain boundary H (i.e. the minimum distance) also makes it possible to establish a ranking or order with regard to pharmaceutical activity. As shown in FIG. 5B, the molecules can be plotted on a directed axis; in the embodiment shown, the positive part of the axis corresponds to pharmaceutical activity and the negative part to pharmaceutical inactivity. The zero point thus represents the domain boundary H. This directed axis can also be described by the function f, which, as already described above, can be determined by a support vector machine. In the embodiment shown, the molecule corresponding to the point b2 is plotted on the negative side and the molecules a1 and a2 are plotted on the positive side, with a2 having a larger value than a1. This representation thus provides a ranking with regard to the expected pharmaceutical activity of the molecules, so that a higher activity is to be expected for a2 than for a1.
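For the special case in which the domain boundary H is a flat hyperplane w·x + b = 0, the signed distance along the surface normal has a closed form, and the projection onto the directed axis of FIG. 5B follows directly. The Python sketch below assumes such a linear boundary; the weight vector, the offset and the coordinates of the points are invented for illustration and are not taken from the figures.

```python
import math

def signed_distance(x, w, b):
    """Signed distance of point x from the hyperplane w.x + b = 0
    (positive on the side the normal vector w points to)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical 2-D boundary x0 + x1 = 3, i.e. w = (1, 1), b = -3.
w, b = (1.0, 1.0), -3.0
# Assumed coordinates for three of the labeled points.
points = {"a1": (2.0, 2.0), "a2": (4.0, 3.0), "b2": (1.0, 0.0)}
f = {name: signed_distance(x, w, b) for name, x in points.items()}
order = sorted(f, key=f.get, reverse=True)   # ranking along the axis
```

With these assumed numbers, b2 lands on the negative side, a1 and a2 on the positive side, and a2 receives the larger value, which reproduces the qualitative ordering described for FIG. 5B.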

FIG. 6A illustrates an undirected graph G formed of six vertices V1, V2, ..., V6, which are connected to each other via edges E. In an undirected graph, connecting a first vertex to a second vertex is equivalent to connecting the second vertex to the first, whereas in a directed graph the direction of the connection is meaningful and is represented by a corresponding arrow on the edge E (e.g. when the connection is formed by a directed field). In the example shown here, the vertices V2, V3, V4, V5 form a cycle C. As already stated, a cycle C is a two-connected subgraph, i.e. for every vertex of the cycle C there exists a closed path leading back to that vertex without traversing the same edge twice. In other words, a two-connected graph is characterized by the fact that, when one edge E is cut, it decays into a simply connected graph, i.e. it still forms a connected graph. A simply connected graph, in turn, may be characterized by the fact that, when one of its edges E is cut, it decays into two components that are not connected to each other. More generally, for an n-connected graph there is always a cut such that the n-connected graph breaks up into an (n-1)-connected graph, a 0-connected graph being a disconnected graph (separate components). Simply connected graphs connecting two cycles are also called bridges.
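The distinction between edges that lie on a cycle and edges whose removal disconnects the graph can be computed with a depth-first search (the classical low-link technique for finding bridges). The Python sketch below is illustrative only: the edge list is an assumption, since the text states only that V2, V3, V4, V5 form the cycle, and the function name `find_bridges` is hypothetical.

```python
def find_bridges(vertices, edges):
    """Return the bridges (edges on no cycle) of an undirected simple graph."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    order, low, bridges = {}, {}, set()

    def dfs(v, parent):
        order[v] = low[v] = len(order)        # discovery number of v
        for n in adj[v]:
            if n == parent:
                continue
            if n in order:                    # back edge closes a cycle
                low[v] = min(low[v], order[n])
            else:
                dfs(n, v)
                low[v] = min(low[v], low[n])
                if low[n] > order[v]:         # subtree of n cannot reach v
                    bridges.add(frozenset((v, n)))

    for v in vertices:
        if v not in order:
            dfs(v, None)
    return bridges

# Assumed edges: V2-V3-V4-V5 form the cycle, V1 and V6 are attached to it.
vertices = ["V1", "V2", "V3", "V4", "V5", "V6"]
edges = [("V1", "V2"), ("V2", "V3"), ("V3", "V4"),
         ("V4", "V5"), ("V5", "V2"), ("V5", "V6")]
bridges = find_bridges(vertices, edges)
cycle_edges = {frozenset(e) for e in edges} - bridges
```

Cutting any of the four cycle edges leaves the graph connected, whereas cutting either of the two remaining edges splits it into two components.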

FIG. 6B shows another example of an undirected graph, which is also formed of six vertices. In the example shown here, the graph has three cycles: a first cycle C1 is formed by the vertices V2, V3, V4, a second cycle C2 is formed by the vertices V3, V4 and V5, and a third cycle is formed by the vertices V2, V3, V5, V4. The vertices V3 and V4 are threefold in this example.
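For a small graph such as this one, the cycles can be listed by a straightforward backtracking search. The Python sketch below uses only the four vertices named in the three cycles, numbered 2 to 5; the adjacency list is inferred from those cycles and is therefore an assumption, and the two remaining vertices are omitted because their edges are not stated in the text.

```python
def simple_cycles(adj):
    """Enumerate each simple cycle of a small undirected graph exactly once."""
    cycles = []

    def extend(path):
        start, last = path[0], path[-1]
        for n in adj[last]:
            if n == start and len(path) >= 3:
                if path[1] < path[-1]:          # keep one traversal direction
                    cycles.append(list(path))
            elif n > start and n not in path:   # start is the smallest vertex
                extend(path + [n])

    for v in sorted(adj):
        extend([v])
    return cycles

# Assumed adjacency for V2..V5, inferred from the three cycles in the text.
adj = {2: [3, 4], 3: [2, 4, 5], 4: [2, 3, 5], 5: [3, 4]}
cycles = simple_cycles(adj)
```

The search is exponential in general and is meant only for graphs of this size; it finds exactly the three cycles named above.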

FIG. 7 shows an example of a graph G_M for a molecule M. The graph G_M has a first cycle C1 and a second cycle C2, which are connected to each other by a bridge B1; in addition, the cycle C1 is connected to an atomic structure A. The atomic structure A has, for example, an atom a1 with three neighboring atoms n1, n2 and n3. In the notation described above, which again assumes that the only atoms are hydrogen H = (1, 0, 0), carbon C = (0, 1, 0) and oxygen O = (0, 0, 1), the atom a1 may have, for example, the following labels: L1 = (0, 1, 0) and L2 = (1, 0, 0, 1, 0, 0, 0, 1, 0). The first three entries of the label L2 identify the neighboring atom n1, the following three entries identify the neighboring atom n2, and the last three entries identify the third neighboring atom n3. Thus, the atom a1 is a carbon atom, the neighbors n1 and n2 are hydrogen atoms, and the neighbor n3 is also a carbon atom. As already described above, the labels L1 and L2 determine the F_ATOM component of the feature vector MV.
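The labels given for the atom a1 can be reproduced with a short one-hot encoding in Python. The element order H, C, O follows the text; the function names `one_hot` and `encode_atom` are hypothetical and chosen for this sketch only.

```python
ELEMENTS = ["H", "C", "O"]   # the three elements assumed in the example

def one_hot(element):
    """Three-component code: H = (1,0,0), C = (0,1,0), O = (0,0,1)."""
    return [1 if element == e else 0 for e in ELEMENTS]

def encode_atom(atom, neighbors):
    """Return (L1, L2): the code of the atom itself and the concatenated
    codes of its neighbors, in the order given."""
    l1 = one_hot(atom)
    l2 = [bit for n in neighbors for bit in one_hot(n)]
    return l1, l2

# Atom a1 (carbon) with neighbors n1 = H, n2 = H, n3 = C.
l1, l2 = encode_atom("C", ["H", "H", "C"])
```

Note that this concatenated form depends on the order in which the neighbors are listed; a fixed ordering convention would be needed for the label to be unique.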

Furthermore, the feature vector MV also has the components F_CYCLE and F_TREE. In the example shown here, F_CYCLE = (1, 1, 0, ...), where the first entry signals the presence of the first cycle C1, the second entry the presence of the second cycle C2, and the third entry the absence of a third cycle C3. Since the first cycle C1 and the second cycle C2 are different from each other, they receive different entries in the feature vector MV. Further, in the example shown, F_TREE = (1, 0, 0, ...), where the first entry refers to the bridge B1 and indicates its presence, while the following entries refer to bridges that do not occur in the molecule shown. Each bridge and each cycle that differs from the others in its atomic structure thus obtains its own entry in the feature vector MV. When the feature vector MV is built, it is first examined whether the respective atomic structure (cycle, bridge, ...) already has a component in the feature vector MV. If so, that component is set (e.g., by setting a "1"); otherwise, the feature vector MV is extended by this component. In other embodiments, it is also possible to introduce a multiplicity, so that the components are not merely binary but also indicate the number of occurrences of the respective structure. For a cycle, for example, the number in the feature vector MV can indicate how often this cycle occurs in the molecule. The same applies to bridges and other structures occurring in the molecule.
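The build-up of the feature vector MV described above (set an existing component, otherwise extend the vector) may be sketched as follows. This is a simplified illustration under stated assumptions: structures are represented by hypothetical string keys such as "cycle:C1", and a single index is used instead of separate F_CYCLE and F_TREE blocks. The `count` flag switches from binary entries to the multiplicity variant.

```python
class FeatureIndex:
    """Maps each distinct structure key to a fixed position in the feature vector."""

    def __init__(self):
        self.positions = {}

    def position(self, key):
        # Extend the index by a new component if the structure is not yet known.
        if key not in self.positions:
            self.positions[key] = len(self.positions)
        return self.positions[key]

def feature_vector(index, structures, count=False):
    """Build the vector for one molecule from the structures found in it."""
    for key in structures:
        index.position(key)  # make sure every structure has a component
    vector = [0] * len(index.positions)
    for key in structures:
        pos = index.position(key)
        vector[pos] = vector[pos] + 1 if count else 1
    return vector

index = FeatureIndex()
binary = feature_vector(index, ["cycle:C1", "cycle:C2", "bridge:B1"])
counted = feature_vector(
    index, ["cycle:C1", "cycle:C1", "bridge:B1", "cycle:C3"], count=True
)
```

Because the index is shared between molecules, vectors built earlier are shorter than later ones; in practice they would be padded with zeros to the final length before being compared.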

In the representation in the feature space MR, it may happen that the feature vectors MV of different molecules point to points that differ from each other only in a subset of the dimensions (i.e., only along certain directions) of the feature space MR and coincide in many of the components. In this case, the dimensionality of the feature space MR, and thus the calculation of the distance to the domain boundary H, can be reduced by considering only the subspace in which the feature vectors MV differ significantly from one another. In the example shown in Fig. 5, the points shown could, for example, differ hardly or not at all with respect to the third dimension (height). For example, the difference in the height value could be less than 50, less than 10 or less than 1 percent of the distance of the points to the domain boundary. In such a case, the height value may be neglected when determining the distance to the domain boundary, so that the dimensionality of the feature space is effectively reduced and the computational effort is significantly lowered.
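A minimal sketch of this reduction is given below. It assumes that the spread of a dimension is measured as the difference between its largest and smallest value, and that it is compared with the smallest distance of any point to the domain boundary; the text does not fix either choice. The function names and the sample data are hypothetical.

```python
def informative_dimensions(points, distances_to_boundary, threshold=0.10):
    """Return the dimensions whose spread is at least threshold * smallest distance."""
    scale = min(distances_to_boundary)
    keep = []
    for dim in range(len(points[0])):
        values = [p[dim] for p in points]
        if max(values) - min(values) >= threshold * scale:
            keep.append(dim)
    return keep

def project(point, dims):
    """Project a point onto the retained dimensions."""
    return [point[d] for d in dims]

# Three points whose third coordinate (the "height") barely varies.
points = [[1.0, 4.0, 0.50], [3.0, 1.0, 0.51], [2.0, 6.0, 0.49]]
distances = [2.0, 1.5, 3.0]

dims = informative_dimensions(points, distances)
reduced = [project(p, dims) for p in points]
```

With the 10 percent threshold, the third dimension is dropped and the points are evaluated in the remaining two-dimensional subspace.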

Thus, embodiments of the present invention can significantly improve the prediction of pharmaceutical activity. This has become possible in particular because not only the atomic structure itself but also the neighbors of the atomic structure are taken into account. It has been shown that an interaction between the atomic structure and its neighbors has a significant influence on the pharmaceutical activity of the respective molecule. Thus, not only are the atomic structures, the cycles and the connecting bridges detected, but the neighbors belonging to these structures, cycles and bridges are also included in the parameterization of the feature space MR.

Further, the present invention is advantageous in that it provides not only a prediction of pharmaceutical activity or inactivity but also a ranking of the molecules (e.g., by the function value of the function f). Since the number of molecules to be assayed may be more than one million, of which, for example, only 20 are known for their activity, such a ranking is of paramount importance. Only in this way is it possible to select, from among the more than one million existing molecules, those that are most likely to be pharmaceutically active. In fact, the totality of all molecules classified as pharmaceutically active could still be far too extensive to perform in vitro assays on all of them. Only a ranking solves this problem.

If not all known molecules are used to establish the domain boundary H, the remaining known molecules can be used to verify the reliability of the method and, if necessary, to make readjustments (shifts of the domain boundary H), so that the quality of the prediction can be increased further. Thus, the present invention also offers a possibility of error estimation. Furthermore, embodiments can save time by neglecting redundancy. An example of redundancy is the above-mentioned independence of the pharmaceutical activity from particular features or combinations of features (particular atomic structures, particular cycles, etc.) which, if taken into account in the feature space MR, produce only a small variation in the points. The dimension of the feature vector MV can be up to 100,000, but only a small subset of these dimensions are directions (features or feature combinations) in which the points in the feature space MR differ significantly from each other. This subset may comprise, for example, only 20 to 50 directions, and a projection onto this 20- to 50-dimensional subspace is often useful, so that the remaining components can be neglected in the evaluation, resulting in a tremendous time saving.
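The verification with held-back known molecules can be sketched as a simple error-rate estimate. The sketch assumes a linear domain boundary H described by a weight vector and a bias; the names `classify` and `holdout_error` and all numerical values are hypothetical.

```python
def classify(weights, bias, point):
    """Assign a point to domain A (active) or B (inactive) by the sign of the score."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "A" if score > 0 else "B"

def holdout_error(weights, bias, held_out):
    """Fraction of held-back known molecules that fall into the wrong domain."""
    wrong = sum(
        1 for point, domain in held_out
        if classify(weights, bias, point) != domain
    )
    return wrong / len(held_out)

# Assumed boundary and four known molecules that were not used to establish it.
WEIGHTS, BIAS = [1.0, -1.0], 0.0
HELD_OUT = [
    ([2.0, 1.0], "A"),
    ([0.0, 3.0], "B"),
    ([1.0, 2.0], "A"),  # lies on the wrong side of this boundary
    ([4.0, 0.0], "A"),
]

error = holdout_error(WEIGHTS, BIAS, HELD_OUT)
```

An error rate that is judged too high would be the trigger for the readjustment of the domain boundary H mentioned above.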

Similarly, in other embodiments it is possible to neglect all those components (= feature combinations) which correspond to a shift of the point in the feature space MR parallel to the hypersurface H. What is of interest in the evaluation is to find that direction in the feature space MR (that combination of features) which is perpendicular to the domain boundary H (i.e., parallel to its normal vector), since the distance in this direction provides a ranking for the pharmaceutical activity of the molecule.
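A ranking by distance along the normal vector can be sketched as follows, again assuming a linear domain boundary H given by a hypothetical weight vector (the normal) and bias. The signed distance is positive on the side assumed to correspond to domain A, and a shift parallel to H leaves it unchanged.

```python
import math

def signed_distance(weights, bias, point):
    """Signed distance of a point to the hyperplane weights . x + bias = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm

def rank_molecules(weights, bias, molecules):
    """Order molecule names from largest to smallest signed distance."""
    return sorted(
        molecules,
        key=lambda name: signed_distance(weights, bias, molecules[name]),
        reverse=True,
    )

# Assumed boundary and feature-space points of four molecules.
WEIGHTS, BIAS = [3.0, 4.0], -5.0
MOLECULES = {
    "m1": [1.0, 1.0],
    "m2": [3.0, 4.0],
    "m3": [0.0, 0.0],
    "m4": [5.0, 0.5],
}

ranking = rank_molecules(WEIGHTS, BIAS, MOLECULES)

# Shifting m1 along (4, -3), a direction parallel to H, does not change its distance.
shifted_m1 = [1.0 + 4.0, 1.0 - 3.0]
```

Only the top-ranked molecules would then be passed on to in vitro assays.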

In particular, it should be noted that, depending on the circumstances, the inventive scheme can also be implemented in software. The implementation may be on a digital storage medium, in particular a floppy disk or a CD with electronically readable control signals, which may interact with a programmable computer system such that the corresponding method is performed. In general, the invention thus also consists in a computer program product with program code stored on a machine-readable carrier for carrying out the method according to the invention when the computer program product runs on a computer. In other words, the invention can thus be realized as a computer program with a program code for carrying out the method when the computer program runs on a computer.

Claims

1. Device for determining a pharmaceutical activity of a molecule (M), having the following features:

means (110) for determining atomic structures occurring in the molecule;

means (120) for assigning a feature index (MI) to one of the occurring atomic structures in the molecule (M) depending on the respective atomic structure and a neighborhood of the respective atomic structure in the molecule (M);

means (130) for determining a feature vector (MV) for the molecule (M) as a function of the assigned feature index (MI), wherein the feature vector (MV) points to a point in a feature space (MR), the feature space (MR) having a first domain (A) corresponding to pharmaceutically active molecules and a second domain (B) corresponding to pharmaceutically inactive molecules; and

means (140) for determining an affiliation of the point to the first domain (A) or the second domain (B).

2. Device according to claim 1, wherein the atomic structure comprises an atom, a cycle or a bridge, wherein the cycle or the bridge is formed by chemically linked atoms, and wherein the means (120) for assigning is designed to assign a label (L₁) as the feature index (MI), the label (L₁) identifying the atom, the cycle or the bridge.

3. Device according to claim 1 or claim 2, wherein the means (120) for assigning is designed to assign a predetermined value as the feature index (MI) if a predetermined atomic structure exists in the molecule (M).

4. Device according to claim 3, wherein the feature index (MI) is binary such that the predetermined value corresponds to a logical "1".

5. Device according to one of the preceding claims, wherein the neighborhood comprises a set of atoms chemically bonded to the atomic structure, and wherein the means (110) for determining is further designed to analyze the set of atoms and to assign a further label (L₂), the further label (L₂) identifying the set of atoms.

6. Device according to claim 5, wherein the neighborhood comprises a cycle or a bridge and the further label (L₂) identifies the cycle or the bridge.

7. Device according to one of claims 2 to 6, wherein the cycle represents a doubly connected subgraph and the bridge represents a simply connected subgraph, where the subgraph is formed by vertices and edges, where the vertices represent atoms and the edges represent chemical bonds.

8. Device according to one of claims 2 to 7, wherein the molecule (M) has different cycles and the means (120) for assigning is designed to assign different feature indices (MI) to different cycles.

9. Device according to one of the preceding claims, wherein the means (130) for determining is designed to determine the first domain (A) and the second domain (B) on the basis of test examples, the test examples having a known pharmaceutical activity.

10. Device according to one of the preceding claims, wherein the means (130) for determining is designed to determine, by means of test examples, a domain boundary (H) which separates the first domain (A) from the second domain (B), wherein the test examples have a known pharmaceutical activity and represent points in the feature space (MR), and wherein the domain boundary (H) has a maximum distance from the test examples.

11. Device according to one of the preceding claims, wherein the means (140) for determining is designed to determine the affiliation of the point for further molecules whose pharmaceutical activity is known, and to use the determined affiliation to check the reliability of the determination of the pharmaceutical activity.

12. Device according to one of the preceding claims, wherein the means (140) is designed to determine a distance to the domain boundary (H) and to compare the distances to the domain boundary (H) for different molecules, in order thereby to obtain an ordering of the different molecules with respect to their expected pharmaceutical activity.

13. Device according to one of the preceding claims, wherein the means (120) for assigning is designed to assign a new feature index (MI) for a further atomic structure, and wherein the means (130) for determining is designed to extend the feature vector (MV) by the new feature index (MI).

14. Device according to one of the preceding claims, wherein the means (120) for assigning is designed to assign a predetermined feature index (MI) for a predetermined atomic structure, and wherein the means (130) for determining is designed to set a component of the feature vector (MV) to a predetermined value.

15. A method for determining a pharmaceutical activity of a molecule comprising the steps of:

Determining atomic structures occurring in the molecule (M);

Assigning a feature index (MI) to one of the occurring atomic structures in the molecule (M) depending on the respective atomic structure and a neighborhood of the respective atomic structure in the molecule (M);

Determining a feature vector (MV) for the molecule (M) as a function of the assigned feature index (MI), wherein the feature vector (MV) points to a point in a feature space (MR), the feature space (MR) defining a first domain (A) which corresponds to pharmaceutically active molecules and a second domain (B) corresponding to pharmaceutically inactive molecules; and

Determining an affiliation of the point to the first domain (A) or to the second domain (B).

16. Computer program with a program code for carrying out the method according to claim 15, when the computer program runs on a computer.