EP3642398A1

EP3642398A1 - Method and device for selecting a subassembly of molecules for use in predicting at least one property of a molecular structure

Info

Publication number: EP3642398A1
Application number: EP18749450.5A
Authority: EP
Inventors: Raphaël TERREUX; Charlotte ALLIOD; Roland Denis; Guy Jacob
Original assignee: Centre National de la Recherche Scientifique CNRS; Universite Claude Bernard Lyon 1 UCBL; ArianeGroup SAS
Current assignee: Centre National de la Recherche Scientifique CNRS; Universite Claude Bernard Lyon 1 UCBL; ArianeGroup SAS
Priority date: 2017-06-22
Filing date: 2018-06-22
Publication date: 2020-04-29
Also published as: FR3068047B1; WO2018234718A1; US20230154571A1; FR3068047A1

Abstract

The selection method according to the invention is iterative and comprises an initialisation step (E10) associating with a molecule, referred to as the current molecule, a value of a predetermined molecule descriptor associated with the target molecular structure, and, during each iteration (E20) of the selection method: a step (E30) of evaluating, for each molecule of a database comprising a plurality of molecules each associated with a value of said descriptor, a degree of similarity, referred to as overall similarity, between the value of the descriptor associated with said molecule and the value of the descriptor associated with the current molecule; a step (E40) of selecting the molecules of the database having a degree of overall similarity greater than a predetermined threshold, the selected molecules being added (E50) to the reference subset; and a step (E60) of updating the value of the descriptor associated with the current molecule based on the values of the descriptor associated with at least part of the molecules belonging to the reference subset.

Description

Method and device for selecting a subset of molecules for use in predicting at least one property of a molecular structure

Background of the invention

The invention relates to the general field of chemical molecules.

It relates more particularly to the prediction of properties of a molecule having a molecular structure.

The invention thus has a preferred but non-limiting application in predicting the toxicity of compounds, whether inert or energetic materials, or even highly energetic materials which, in a known manner, are capable of releasing energy in a very short time. Because of the energy released, such energetic materials are of interest in both the military and civilian domains. They are nowadays commonly used in the manufacture of military devices, are constituents of the gases (e.g. propellants) needed to propel missiles and space launchers, and are also used in the automotive industry to manufacture airbags, etc.

The entry into force in 2007 of the European REACH Regulation (Registration, Evaluation, Authorisation and Restriction of Chemicals) requires manufacturers in the European Economic Area (EEA) who manufacture, import or use chemicals in their business in quantities of more than 1 tonne per year to register these substances at European level. The aim is to identify, evaluate and control all chemical substances manufactured, imported or placed on the European market. This regulation is intended to provide the European Union with legal and technical means to guarantee a high level of protection against the risks associated with chemical substances. It concerns all chemical substances, whether energetic materials or inert products (e.g. additives, stabilizers, plasticisers, glues, etc.).

Manufacturers therefore need, in particular in order to comply with this regulation, techniques for identifying the toxic effects that a chemical may produce on humans or on the environment and, more generally, for identifying its properties, that is, its biological activity. We are interested here in chemical substances having mono-molecular structures, so that the expressions (mono-)molecular chemical substances, (mono-)molecular structures and molecules are used interchangeably hereafter to designate these substances.

In vitro or in vivo techniques exist, but they are generally long, complex to implement and very expensive in terms of resources, reagents and detection methods.

There are also other so-called in silico techniques, which predict the properties of a chemical substance using computer tools (e.g. computer models, computerized calculation means). The most common in silico techniques use Quantitative Structure-Activity Relationships (QSARs), which are algorithms (or equivalent programs) establishing a quantitative prediction of the biological activity of a mono-molecular chemical substance from its chemical structure. The biological activity of the molecular substance predicted by means of a QSAR is based on experimental results and is specific to a test, typically one related to the requirements of the REACH Regulation and/or of the OECD (Organisation for Economic Co-operation and Development).

To determine the biological activity of a molecular substance by means of a QSAR, in silico techniques use databases (for example public databases), specific to the test under consideration, and comprising a plurality of diversified molecules, harmonized in accordance with the REACH and / or OECD regulations (eg database of high energy molecules). Various strategies can then be considered.

According to a first known strategy, a QSAR is applied directly to the entire database. One of the drawbacks of this first strategy is that the database to which the QSAR is applied may contain molecules that are too different from the molecular substance whose biological activity is to be predicted, so that the resulting prediction may prove wrong.

Other strategies are based on a search for structural similarity between the molecular substance whose biological activity is to be predicted and the molecules listed in the database. This similarity search is based on the assumption that all molecules in the database analogous to the molecular substance under consideration have similar properties, including a similar biological activity.

To facilitate the search for structural similarity in the database, it is common to represent the molecules by structural keys, also called fingerprints. These keys are descriptors consisting of a plurality of structural characteristic values which make it possible to characterize the molecular structures. One of the best-known structural keys for characterizing a molecule is the MACCS 166 structural key (for Molecular ACCess System), published by MDL Information Systems. This structural key characterizes each molecule by relying on a table of 166 molecular fragments chosen to be complex enough to discriminate between different molecules.

Each MACCS 166 structural key is more precisely a vector comprising 166 components or characteristics, having positive or zero values and reflecting the presence or absence of one of the 166 molecular fragments in the molecule in question: thus, a zero value reflects the absence of the corresponding fragment in the structure of the molecule, while a positive value indicates the number of times that the corresponding fragment is present within the molecule, or simply its presence within the molecule.

In order to compare two molecular structures with each other, a numerical measure of similarity between the two structures can then be calculated using a predetermined metric. A metric conventionally used in combination with the MACCS 166 structural keys is the Tanimoto metric defined by:

$$T(X, Y) = \frac{\sum_{i=1}^{166} X_i \wedge Y_i}{\sum_{i=1}^{166} X_i \vee Y_i}$$

where X and Y designate the two structural keys associated respectively with the two compared molecular structures and where:

- $X_i \wedge Y_i$ is equal to 1 if the components $X_i$ and $Y_i$ are both positive, and to 0 otherwise; and

- $X_i \vee Y_i$ is equal to 1 if at least one of the components $X_i$ and $Y_i$ is non-zero, and to 0 otherwise.

It is noted that this metric is applied by simplifying the MACCS 166 structural key of each molecule so as to obtain a binary vector, a zero component value reflecting the absence of the corresponding molecular fragment, while a component value equal to 1 reflects the presence of this fragment. The Tanimoto metric thus calculated provides the ratio of the number of components of the keys X and Y common to the two molecular structures to the total number of components of the keys X and Y expressed (i.e. to which a non-zero value has been assigned in the keys) for these two molecular structures.
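The Tanimoto calculation on binarized keys described above can be sketched as follows. This is only a minimal illustration: the function names, the toy 6-component keys and the convention of returning 0 when neither key expresses any fragment are assumptions made for the example, not elements of the source.

```python
from typing import List, Sequence


def binarize(key: Sequence[int]) -> List[int]:
    """Reduce a count-valued structural key to presence (1) / absence (0)."""
    return [1 if value > 0 else 0 for value in key]


def tanimoto(x: Sequence[int], y: Sequence[int]) -> float:
    """Fragments present in both keys divided by fragments present in either key."""
    if len(x) != len(y):
        raise ValueError("keys must have the same number of components")
    bx, by = binarize(x), binarize(y)
    both = sum(a & b for a, b in zip(bx, by))
    either = sum(a | b for a, b in zip(bx, by))
    # Assumed convention: two keys expressing no fragment at all score 0.
    return both / either if either else 0.0


# Toy 6-component keys (a real MACCS 166 key has 166 components).
key_a = [2, 0, 1, 0, 0, 3]
key_b = [1, 1, 0, 0, 0, 1]
print(tanimoto(key_a, key_b))  # 2 shared fragments out of 4 expressed -> 0.5
```

Note that binarization discards the fragment counts (2 and 3 in `key_a`), which is the loss of information that the local similarity measures introduced later aim to avoid.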

The strategies proposed today in the state of the art use this search for structural similarity in two different ways.

According to one strategy, a structural similarity search is performed on the database, leading to the identification of a subset of molecules in the database having a minimum similarity to the molecular substance whose properties are to be predicted. A QSAR is then applied to the subset of molecules thus identified. Depending on the similarity threshold set to select the subset of molecules, the subset obtained may either contain too few molecules for the QSAR to be applied in a relevant way or, on the contrary, contain molecules that are too different from the molecular substance whose properties are to be predicted. This can result in a false prediction.

One known way of improving the performance of the aforementioned strategy is to identify a subset of molecules in the database from another known subset of molecules (e.g. a subset of high energy molecules used by an industrial manufacturer), and to select the molecules of the database which have a minimum similarity with each of the molecules of the known subset. A QSAR is then applied to the subset of the database thus identified from the known subset of molecules. Although this strategy performs better, prediction errors may remain.

Object and summary of the invention

The invention proposes a strategy for predicting the properties of a molecular substance that is an alternative to the strategies proposed in the state of the art and makes it possible to obtain a better quality prediction.

More precisely, according to a first aspect, the invention proposes an iterative method of selecting a subset of molecules, referred to as reference molecules, intended to be used for predicting at least one property of a so-called target molecular structure, the iterative selection method comprising an initialization step associating with a so-called current molecule a value of a predetermined molecule descriptor, associated with the target molecular structure, and during each iteration of the selection method:

An evaluation step, for each molecule of a base comprising a plurality of molecules each associated with a value of the descriptor, of a so-called global similarity measure between the value of the descriptor associated with said molecule and the value of the descriptor associated with the current molecule;

A step of selecting molecules of the base having a global similarity measurement greater than a predetermined threshold, the selected molecules being added to the reference subset; and

A step of updating the value of the descriptor associated with the current molecule from the values of the descriptors associated with at least a part of the molecules belonging to the reference subset.
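The three steps above (evaluation, selection, update) can be sketched as a simple loop. This is a hedged illustration under stated assumptions: `select_reference_subset`, `toy_similarity` and `mean_update` are hypothetical names, the descriptors are one-component toy values, the similarity function is an arbitrary stand-in, the update uses every molecule of the reference subset, and the loop stops when no new molecule is selected or after a fixed number of iterations.

```python
from typing import Callable, Dict, List, Sequence, Set

Descriptor = Sequence[float]


def select_reference_subset(
    target: Descriptor,
    database: Dict[str, Descriptor],
    similarity: Callable[[Descriptor, Descriptor], float],
    update: Callable[[List[Descriptor]], Descriptor],
    threshold: float,
    max_iterations: int = 10,
) -> Set[str]:
    current = target  # initialization: the current molecule starts as the target
    reference: Set[str] = set()
    for _ in range(max_iterations):
        # Evaluation and selection: keep molecules similar enough to the current one.
        selected = {
            name for name, desc in database.items()
            if similarity(desc, current) > threshold
        }
        newly_selected = selected - reference
        reference |= selected
        if not newly_selected:
            break  # the subset is no longer enriched
        # Update: build a "virtual" current molecule from the reference subset.
        current = update([database[name] for name in sorted(reference)])
    return reference


def toy_similarity(a: Descriptor, b: Descriptor) -> float:
    """Stand-in similarity in (0, 1] for one-component descriptors (assumption)."""
    return 1.0 / (1.0 + abs(a[0] - b[0]))


def mean_update(descriptors: List[Descriptor]) -> Descriptor:
    """Component-wise arithmetic mean of the given descriptors."""
    return [sum(col) / len(col) for col in zip(*descriptors)]


toy_db = {"a": [0.5], "b": [1.2], "c": [2.0], "d": [5.0]}
subset = select_reference_subset([0.0], toy_db, toy_similarity, mean_update, 0.5)
print(sorted(subset))  # ['a', 'b']
```

In this toy run, molecule "b" is not similar enough to the target itself but is picked up at the second iteration, once the current molecule has moved to the descriptor of "a".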

Correlatively, the invention is directed to a device for selecting a subset of molecules, referred to as reference molecules, intended to be used for predicting at least one property of a so-called target molecular structure, the selection device comprising an initialization module configured for associating with a so-called current molecule a value of a predetermined molecule descriptor associated with the target molecular structure, said selection device being further configured to activate, during a plurality of successive iterations:

An evaluation module configured to evaluate, for each molecule of a base comprising a plurality of molecules each associated with a value of the descriptor, a so-called global similarity measure between the value of the descriptor associated with said molecule and the value of the descriptor associated with the current molecule;

A selection module configured to select molecules of the base having a global similarity measurement greater than a predetermined threshold, the selected molecules being added by said selection module to the reference subset; and

An update module configured to update the value of the descriptor associated with the current molecule from the values of the descriptors associated with at least a part of the molecules belonging to the reference subset.

The invention also provides, according to a second aspect, a method for predicting at least one property of a so-called target molecular substance, comprising:

A selection step, by means of an iterative selection method according to the invention, of a subset of molecules, referred to as reference molecules, in a database comprising a plurality of molecules each associated with a value of a predetermined molecule descriptor;

A step of predicting at least one property of said target molecular substance from the subset of reference molecules selected.

Correlatively, the invention also relates to a prediction device configured to predict at least one property of a target molecular substance comprising:

A selection device according to the invention, configured to select a subset of molecules, referred to as reference molecules, in a database comprising a plurality of molecules each associated with a value of a predetermined molecule descriptor;

A prediction module, configured to predict at least one property of said target molecular substance from the subset of reference molecules selected.

It is noted that no limitation is attached to the molecule descriptor considered in the invention to describe each molecule of the base as well as the target molecular substance. This descriptor may be a descriptor comprising a plurality N of characteristics or components, N denoting an integer greater than or equal to 1, in which case the value of the descriptor is defined by the value of each of its N characteristics. These N characteristics can be, for example, structural characteristics making it possible to characterize each molecule and if possible to discriminate between them. For example, the values of the N characteristics of the molecule descriptor may reflect the presence or absence of N molecular fragments considered in the definition of a structural key MACCS 166.

Alternatively, other descriptors may be envisaged, such as other known two-dimensional descriptors (or fingerprints), for example MolPrint2D or BCI fingerprints, or those defined by the companies Tripos and Scitegic. These fingerprints are in the form of bit vectors, each bit encoding the presence (bit equal to 1) or the absence (bit equal to 0) of certain predefined structural fragments in the molecule, or other characteristics. The invention also applies to types of descriptors other than 2D fingerprints. For example, a descriptor having the form of a simple variable (that is, comprising a single component or characteristic), whose value can be a quantitative or qualitative numerical value, can be considered. The invention also applies to descriptors having more complex forms, such as vector, matrix or even graphic forms. Such a descriptor is for example a connectivity matrix between a plurality of predetermined atoms indicating for each pair of atoms the presence or absence of a bond in the molecule in question (the descriptor then comprises a plurality of characteristics given by the components of the matrix).

No limitation is attached either to the technique used to predict the properties of the target molecular substance from the molecules of the reference subset. It may be a quantitative structure-activity relationship (QSAR) as previously described, a neural network, a Principal Component Analysis (PCA) method, a Partial Least Squares (PLS) method, etc.

The invention therefore proposes a new way of selecting the molecules of the initial database used to predict the properties of a molecular substance, which makes it possible to select a larger subset of molecules similar to the molecular substance and relevant for the prediction of its properties. This new way of selecting molecules is based on an iterative similarity search, initialized with the target molecular substance whose properties are to be predicted. Then, over the iterations, "virtual" molecules are constructed from the descriptors of the molecules selected in the initial database, and a new similarity search is performed from these virtual molecules. Thanks to this recursive selection and to the similarities with the molecules of the database being taken into account, the invention thus leads to a more complete and more careful selection of the molecules of the base intended to be used for predicting the biological properties of the target molecular substance.

It should be noted that the prediction produced by the invention is advantageously adaptive. It can easily use public databases, regularly updated, and listing the properties of different molecules with regard to different tests performed on these molecules.

The number of iterations considered for selecting the subset of reference molecules can be fixed by means of a parameterizable stopping criterion. In this embodiment, the evaluation, selection and updating steps are then repeated until a predetermined stopping criterion is verified. Different stopping criteria can be envisaged, for example:

A predetermined number of iterations carried out;

A predetermined number of molecules reached in the reference subset;

The absence of newly selected molecules during the selection step, that is to say of molecules not already belonging to the reference subset before the selection step. In other words, the reference set is no longer enriched over the iterations, so it is useless to continue to iterate.

The number of iterations and / or molecules of the reference subset can be calibrated empirically.
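The stopping criteria listed above can be combined in a small predicate, sketched below. The function name, its parameters and the choice to combine the three criteria with a logical OR are assumptions made for illustration; the text leaves the choice of criterion open.

```python
from typing import Optional


def should_stop(
    iterations_done: int,
    subset_size: int,
    newly_selected: int,
    max_iterations: Optional[int] = None,
    max_molecules: Optional[int] = None,
) -> bool:
    """Return True as soon as one of the enabled stopping criteria is met."""
    # Criterion 1: a predetermined number of iterations has been carried out.
    if max_iterations is not None and iterations_done >= max_iterations:
        return True
    # Criterion 2: the reference subset has reached a predetermined size.
    if max_molecules is not None and subset_size >= max_molecules:
        return True
    # Criterion 3: the last selection step added no new molecule.
    return newly_selected == 0


print(should_stop(iterations_done=3, subset_size=40, newly_selected=5, max_iterations=3))
print(should_stop(iterations_done=1, subset_size=40, newly_selected=5, max_molecules=100))
```

The limits `max_iterations` and `max_molecules` correspond to the values that the text says can be calibrated empirically.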

The choice of one or the other of the aforementioned criteria (or of another criterion) may depend on several parameters, such as, for example, the type of target molecular substance considered, a compromise between the number of molecules selected and the quality of the prediction, the method that will be used to predict the properties of the target molecular substance from the properties of the selected molecules, etc.

In a particular embodiment in which the molecule descriptor comprises N characteristics where N denotes an integer greater than 1, the evaluation step comprises, for each molecule of the base, a step of calculating, for each of the N characteristics of the descriptor, a so-called local similarity measure between the value of this characteristic of the descriptor associated with said molecule and the value of this characteristic of the descriptor associated with the current molecule, the global similarity measure evaluated for said molecule being obtained from the local similarity measurements calculated for this molecule.

For example, the calculation step includes for each descriptor feature:

Calculating a distance between the value of the descriptor characteristic associated with said molecule and the value of the descriptor characteristic associated with the current molecule; and

A conversion of the calculated distance into a real number between 0 and 1 by means of a predetermined conversion function, said number being used as a measure of local similarity for said descriptor characteristic and said molecule.

Such a calculation step advantageously makes it possible to obtain a measurement of similarity that is more precise than in the state of the art. It can be easily applied to numerical values (eg integers) of descriptor characteristics that are positive or null, and not just binary. This gives an assessment of the similarity between two molecular substances more precise and more generic than in the state of the art.

Different (algebraic) distances and conversion functions may be envisaged to implement the invention.

An example of an algebraic distance that can be considered is d(x, y) = x - y, where x denotes the value of the considered characteristic of the descriptor associated with said molecule and y the value of the same characteristic of the descriptor associated with the current molecule.

However, such a distance, although very simple to compute, does not distinguish between a pair of descriptor characteristic values equal to 0 and 1 and a pair of values equal to 10 and 11, which differ by the same amount. In other words, it does not take into account the fact that, in these two cases, the compared molecules have descriptor characteristic values of different levels.

To take into account such subtleties and offer a more precise evaluation of the similarity between two molecular substances, in a particular embodiment of the invention, the calculated distance, denoted d, can verify:

where x denotes the value of the descriptor characteristic associated with said molecule and y the value of the descriptor characteristic associated with the current molecule.

Of course, these examples are given for illustrative purposes only. Moreover, a measure of similarity is defined as a real number between 0 and 1, conventionally taking the value 0 when the two molecules are considered totally different (i.e. not similar), and the value 1 when they are considered totally identical (i.e. similar). Intermediate values can be considered, representing degrees of similarity between these two extremes. To comply with this definition, different conversion functions may be considered.

Thus, in a particular embodiment, the conversion function, noted f, can verify:

where d denotes the distance to be converted and σ a predetermined real number.
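The two-stage calculation of a local similarity (a distance, then a conversion into a number between 0 and 1) can be illustrated as follows. Only `plain_distance` reproduces an expression given in the text, d(x, y) = x - y. The level-aware `scaled_distance` and the Gaussian-shaped `gaussian_conversion` with parameter `sigma` are hypothetical stand-ins chosen for the example; they should not be read as the expressions used by the invention.

```python
import math
from typing import Callable


def plain_distance(x: float, y: float) -> float:
    """Algebraic distance d(x, y) = x - y mentioned in the text."""
    return x - y


def scaled_distance(x: float, y: float) -> float:
    """Hypothetical level-aware distance for non-negative values (assumption)."""
    return (x - y) / (1.0 + min(x, y))


def gaussian_conversion(d: float, sigma: float = 1.0) -> float:
    """Hypothetical conversion: 1 when d == 0, decreasing towards 0 as |d| grows."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def local_similarity(
    x: float,
    y: float,
    distance: Callable[[float, float], float] = plain_distance,
    sigma: float = 1.0,
) -> float:
    """Local similarity of one descriptor characteristic for two molecules."""
    return gaussian_conversion(distance(x, y), sigma)


# The plain difference scores the pairs (0, 1) and (10, 11) identically ...
print(local_similarity(0, 1), local_similarity(10, 11))
# ... whereas the level-aware stand-in treats 10 vs 11 as the closer pair.
print(local_similarity(0, 1, scaled_distance), local_similarity(10, 11, scaled_distance))
```

Any other pair of functions can be substituted, provided the conversion returns 1 for identical values and stays within the interval from 0 to 1.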

In a particular embodiment, during the evaluation step, the overall similarity measure evaluated for said molecule is the ratio between:

The weighted sum of the N local similarity metrics calculated for the N characteristics of the descriptor for this molecule, and

Twice the sum of the weights applied to the local similarity measures in said weighted sum, minus said weighted sum.

This definition of the global similarity measure makes it possible to take into account several levels of expression of the same descriptor characteristic in the compared molecules: it is not limited to discerning only two binary levels of expression (absence or presence of the descriptor characteristic), unlike in particular the Tanimoto metric described above and considered in the state of the art. In addition, this global similarity measure advantageously considers that the common non-expression of the same descriptor characteristic (i.e. a null value of this characteristic for the two compared molecules) is a mark of similarity between the two compared molecules.
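Written out, the ratio described above is G = S / (2W - S), where S is the weighted sum of the local similarities and W the sum of the weights. A minimal sketch follows; the function name, the default of unit weights and the input checks are assumptions made for the example.

```python
from typing import Optional, Sequence


def global_similarity(
    local_sims: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Ratio S / (2W - S) built from per-characteristic local similarities."""
    if weights is None:
        weights = [1.0] * len(local_sims)  # assumed default: unit weights
    if len(weights) != len(local_sims):
        raise ValueError("one weight is required per local similarity")
    if any(s < 0.0 or s > 1.0 for s in local_sims):
        raise ValueError("local similarities must lie between 0 and 1")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("the sum of the weights must be positive")
    weighted_sum = sum(w * s for w, s in zip(weights, local_sims))
    # Since each local similarity is at most 1, 2W - S >= W > 0.
    return weighted_sum / (2.0 * total_weight - weighted_sum)


print(global_similarity([1.0, 1.0, 1.0]))          # identical molecules -> 1.0
print(global_similarity([0.0, 0.0]))               # nothing in common -> 0.0
print(global_similarity([1.0, 0.0], [2.0, 1.0]))   # S = 2, W = 3 -> 0.5
```

A characteristic that is null in both molecules has a zero distance and therefore a local similarity of 1, so it raises the global measure, which is the behaviour described in the paragraph above.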

To update the current molecule during each iteration of the selection process, different strategies can be envisaged. This current molecule is in a way the representative of the molecules of the reference subset used at the next iteration to complete the reference subset.

Thus, in a first variant, during the updating step implemented during an iteration of the selection method, said at least part of the molecules belonging to the reference subset used for the update comprises the molecules selected during the selection step of this iteration that did not already belong to the reference set before this selection step.

In other words, according to this first variant, only the molecules newly selected during the current iteration are taken into account. This first variant may, however, lead to the selection, in the reference subset, of molecules that are somewhat too distant, in terms of similarity, from the target molecular structure.

In a second variant, during the updating step implemented during an iteration of the selection process, the said at least part of the molecules belonging to the reference subset used for the update comprises the molecules selected during the step of selecting this iteration.

According to a third variant, during the updating step implemented during an iteration of the selection method, the said at least part of the molecules belonging to the reference subset used for the update comprises all the molecules belonging to the reference subset at the end of the step of selecting this iteration.

The inventors have found that the second and third variants above have a fairly similar behavior and lead to comparable results in terms of prediction. They also give better results than the first variant.
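The three variants differ only in which molecules feed the update of the current molecule. They can be sketched with set operations, assuming (for the example only) that molecules are identified by name and that `selected` and `reference_before` hold, respectively, the molecules selected at this iteration and the reference subset before that selection.

```python
from typing import Set


def update_pool(variant: int, selected: Set[str], reference_before: Set[str]) -> Set[str]:
    """Molecules used to update the current molecule, for variants 1 to 3."""
    if variant == 1:
        # Only molecules selected now that were not already in the reference subset.
        return selected - reference_before
    if variant == 2:
        # Every molecule selected during this iteration.
        return set(selected)
    if variant == 3:
        # The whole reference subset at the end of the selection step.
        return reference_before | selected
    raise ValueError("variant must be 1, 2 or 3")


before = {"m1", "m2"}
now = {"m2", "m3"}
for v in (1, 2, 3):
    print(v, sorted(update_pool(v, now, before)))
```

With these toy sets, variant 1 uses only "m3", variant 2 uses "m2" and "m3", and variant 3 uses "m1", "m2" and "m3".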

In addition to different strategies for selecting the molecules taken into account for updating the current molecule, different strategies can be envisaged for determining the values of the characteristics of the descriptor associated with the updated current molecule.

According to a first variant, during the updating step, the value associated with the current molecule of each descriptor characteristic is updated with an arithmetic or weighted average of the values of this characteristic of the descriptor associated with the molecules of said at least part of the molecules belonging to the reference subset.

This first variant leads to values of the characteristics of the descriptor which are in some way "artificial", and do not correspond to characteristic values present in said at least part of the molecules of the subset used for the update.

To remedy this aspect, according to a second variant, during the updating step, the value associated with the current molecule of each feature of the descriptor is updated with the most frequent value of this characteristic of the descriptor among the values of this feature of the descriptor associated with the molecules of said at least a portion of the molecules belonging to the reference subset, or if a plurality of distinct values satisfy this condition, with the highest value among this plurality of distinct values.
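Both update rules can be sketched on a small pool of descriptors. The function names and the 3-component toy descriptors are assumptions made for the example; the arithmetic (unweighted) average is shown for the first variant, and ties in the second variant are resolved in favour of the highest value, as stated above.

```python
from collections import Counter
from typing import List, Sequence


def update_by_mean(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """First variant: component-wise arithmetic mean (may yield unobserved values)."""
    return [sum(column) / len(column) for column in zip(*descriptors)]


def update_by_mode(descriptors: Sequence[Sequence[int]]) -> List[int]:
    """Second variant: most frequent value per component, highest value on ties."""
    updated = []
    for column in zip(*descriptors):
        counts = Counter(column)
        top_count = max(counts.values())
        updated.append(max(v for v, c in counts.items() if c == top_count))
    return updated


pool = [
    [0, 2, 1],
    [0, 1, 1],
    [1, 2, 0],
    [1, 1, 3],
]
print(update_by_mean(pool))  # [0.5, 1.5, 1.25]
print(update_by_mode(pool))  # [1, 2, 1]
```

The mean produces 0.5 and 1.5, values that no molecule of the pool actually carries, whereas the mode only returns values observed in the pool.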

In a particular embodiment, the various steps of the selection method and / or the prediction method are determined by computer program instructions.

Accordingly, the invention also relates to a computer program on an information carrier, this program being capable of being implemented in a selection device, respectively in a prediction device, or more generally in a computer, this program comprising instructions adapted to the implementation of the steps of a selection method, respectively of a prediction method, as described above.

This program can use any programming language, and be in the form of source code, object code, or intermediate code between source code and object code, such as in a partially compiled form, or in any other desirable form.

The invention also relates to a computer readable information or recording medium, and comprising instructions of a computer program as mentioned above.

The information or recording medium may be any entity or device capable of storing the program. For example, the medium may comprise storage means, such as a ROM, for example a CD ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a hard disk.

On the other hand, the information or recording medium may be a transmissive medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means. The program according to the invention can be downloaded in particular on an Internet type network.

Alternatively, the information or recording medium may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.

It can also be envisaged, in other embodiments, that the selection method, the prediction method, the selection device and the prediction device according to the invention present in combination all or part of the aforementioned characteristics.

Brief description of drawings and annexes

Other features and advantages of the present invention will emerge from the description given below, with reference to the drawings which illustrate an embodiment having no limiting character, and to Annexes 1 to 6.

In the figures:

FIG. 1 schematically represents a prediction device according to the invention, in a particular embodiment;

FIG. 2 represents the hardware architecture of the prediction device of FIG. 1, in a particular embodiment;

FIG. 3 illustrates the different steps of a selection method according to the invention; and

FIG. 4 illustrates the different steps of a prediction method according to the invention.

Annexes 1 to 6 show the performance achieved by the prediction method according to the invention.

Detailed description of the invention

FIG. 1 represents, in its environment, a prediction device 1 according to the invention, in a particular embodiment.

In the example envisaged in FIG. 1, the prediction device 1 is configured to predict at least one property of an unknown substance, referred to as the target substance TARGm. It is assumed that this target substance has a mono-molecular structure from which it is possible to extract the value of a descriptor comprising a predetermined number N of (structural) characteristics for characterizing the target substance. In the embodiment described here, the descriptor is a vector comprising N = 166 characteristics (or components) reflecting the presence or absence in the molecular structure considered of the 166 molecular fragments considered in the definition of the MACCS 166 structural key. In other words, the value of a descriptor characteristic of a molecular substance indicates the presence or absence of the corresponding molecular fragment in the molecular substance.

Alternatively, other descriptors may be envisaged for the implementation of the invention, as mentioned previously (e.g. MolPrint2D or BCI 2D fingerprints, or those defined by the companies Tripos and Scitegic, a simple variable whose value can be a quantitative or qualitative numerical value, a connectivity matrix between a plurality of predetermined atoms indicating for each pair of atoms the presence or absence of a bond in the molecule in question, etc.).

No limitation is attached to the nature of the mono-molecular substance under consideration. It is here, for example, a high energy molecule (HEM); however, this example is given for illustrative purposes only and the invention applies to all types of molecules.

By "prediction of at least one property of the target substance TARGm" is meant here the prediction of its biological activity. Thus a property to be predicted may be, for example, a toxicological property of the target substance TARGm, in particular to meet the requirements of the European REACH Regulation. However, the invention also applies to the prediction of other types of properties of a molecule, such as, for example, physico-chemical properties (logP or molecular weight), structural properties, absorption, distribution, metabolism or elimination properties (ADMET), therapeutic properties, etc.

To predict these properties, the prediction device 1 comprises:

A selection device 2 according to the invention; and

A prediction module 3.

In the embodiment described here, the prediction device 1 has the hardware architecture of a computer as represented in FIG. 2, and the selection device 2 and the prediction module 3 are software modules installed in a memory of the prediction device 1. More particularly, the prediction device 1 comprises in particular a processor 4, a random access memory 5, a read-only memory 6, a non-volatile flash memory 7, input / output interfaces 8 (such as a screen, a keyboard, etc. .), as well as means of communication 9.

These communication means 9 allow the prediction device 1 to access or download, for example, one or more databases 10 each listing a plurality of molecules. In the embodiment described here, each database 10 considered comprises, for each molecule it contains, its name, its molecular structure, the values of the N characteristics of the MACCS 166 structural key (in other words, the values associated with the N = 166 molecular fragments considered in the MACCS 166 structural key), and the experimental result obtained for this molecule in a given biological test.
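As an illustration of the database layout described above, one record could be modelled as follows. This is only a minimal sketch in Python: the class name `MoleculeRecord`, its field names, the use of a SMILES string for the structure and the sample values are all hypothetical and are not taken from the invention or from any actual database.

```python
from dataclasses import dataclass
from typing import Tuple

N_FEATURES = 166  # number of characteristics of the MACCS 166 structural key


@dataclass(frozen=True)
class MoleculeRecord:
    """One entry of a database 10 (hypothetical layout)."""

    name: str               # name of the molecule
    structure: str          # molecular structure, here as a SMILES string (assumption)
    maccs: Tuple[int, ...]  # values of the N = 166 characteristics
    result: str             # experimental result of the biological test, e.g. "+" or "-"

    def __post_init__(self) -> None:
        # Reject records whose descriptor does not have exactly N values.
        if len(self.maccs) != N_FEATURES:
            raise ValueError(
                f"expected {N_FEATURES} characteristics, got {len(self.maccs)}"
            )


# Purely illustrative record: the descriptor values below are made up.
example = MoleculeRecord(
    name="example molecule",
    structure="CCO",
    maccs=tuple([0] * 164 + [1, 1]),
    result="-",
)
```

Validating the descriptor length when the record is created keeps later similarity computations from silently comparing descriptors of different sizes.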

Such databases are known per se and are not described in detail here. Each database corresponds to a biological test performed on the molecules it contains. Examples of these databases are described in particular in the document by D. J. Kirkland et al. entitled "Testing strategies in mutagenicity and genetic toxicology: an appraisal of the guidelines of the European Scientific Committee for Cosmetics and Non-Food Products for the evaluation of hair dyes", Mutat. Res. Toxicol. Environ. Mutagen., vol. 588, pp. 88-105, 2005, or in the document by V. Thybaud et al. entitled "Strategy for genotoxicity testing: hazard identification and risk assessment in relation to in vitro testing", Mutat. Res. Toxicol. Environ. Mutagen., vol. 627, pp. 41-58, 2007.

The databases 10 may be hosted on remote servers or stored in a memory of the prediction device 1 (for example in its non-volatile memory 7). The communication means 9 of the prediction device 1 allow it to access or download them via a telecommunications network, or to obtain these databases via a recording medium such as a USB (Universal Serial Bus) key or a CD-ROM. They can include for this purpose a USB port, a network card, a WiFi (Wireless Fidelity) interface, etc.

The read-only memory 6 of the prediction device 1 constitutes a recording medium in accordance with the invention, readable by the processor 4 and on which is recorded here a computer program PROG according to the invention.

The computer program PROG defines functional modules (here software modules) configured to implement the steps of the selection method and of the prediction method according to the invention. In a variant, the two aforementioned methods can be defined by instructions from two different programs.

The functional modules defined by the program PROG rely on and/or control the hardware elements 4-9 of the prediction device 1 mentioned above. They include in particular here, as illustrated in the drawings:

An initialization module 2A configured to associate with a so-called current molecule CURm, updated during the selection method according to the invention, the value of the MACCS 166 descriptor associated with the target molecule TARGm (the value of the descriptor comprising here N characteristics);

An evaluation module 2B configured to evaluate so-called "global" similarity measurements between the values of the descriptors associated with a predetermined set of molecules (typically the molecules of a database 10) and the value of the descriptor associated with the current molecule CURm;

A selection module 2C configured to select molecules of the predetermined set considered having a global similarity measurement greater than a predetermined threshold, and to add the molecules thus selected to a so-called reference subset designated CREF; and

An update module 2D configured to update the value of the descriptor associated with the current molecule CURm from the values of the descriptors associated with at least a part of the molecules belonging to the CREF reference subset.

The evaluation module 2B, the selection module 2C and the update module 2D are modules of the selection device 2 and are configured to implement a selection method according to the invention. They are activated by the selection device 2 repeatedly over a plurality of iterations and, more specifically in the embodiment described here, as long as a predetermined (configurable) criterion is not satisfied.

The program PROG here also defines the prediction module 3 of the prediction device 1. The prediction module 3 is configured to predict at least one property of the target molecular substance TARGm from the molecules of the reference subset CREF selected by the selection device 2. No limitation is attached to the prediction technique implemented by the prediction module 3. It may be for example a QSAR type relationship, a neural network, a prediction technique by principal component analysis, etc. This prediction technique uses the experimental results obtained for the molecules of the reference subset CREF, as listed in the database 10 from which the subset CREF was extracted.

The various functions of the modules 2A, 2B, 2C, 2D and 3 above are described now with reference to the steps of the selection method and the prediction method according to the invention.

As mentioned previously, the prediction device 1 predicts at least one property of the molecular substance TARGm from the properties listed in the databases 10 for a plurality of molecules. For the sake of simplicity, a single database 10 is considered here, comprising a plurality of molecules and the experimental results obtained for these molecules in a given biological test. According to the invention, the prediction made by the prediction device 1 is based on a prior selection, by the selection device 2, of a reference subset CREF comprising a plurality of molecules extracted from the database 10.

FIG. 3 illustrates the main steps of the selection method according to the invention, implemented by the selection device 2 in order to select the reference subset CREF.

As mentioned above, the selection method is an iterative method, comprising an initialization step (step E10) and implementing a plurality of iterations. In the embodiment described here, the iterations continue as long as a predetermined stopping criterion CRU is not satisfied. The different stopping criteria envisaged are described in more detail later.

During the initialization step E10 (corresponding to the iteration iter = 0), the initialization module 2A of the selection device 2 initializes the reference subset CREF to an empty set.

In addition, it initializes the current molecule CURm to the target molecule TARGm whose properties are to be predicted. This initialization consists more particularly in associating with the current molecule CURm the value of the MACCS 166 structural key associated with the target molecule TARGm. Since this key comprises N = 166 characteristics, the initialization consists in other words in associating with the current molecule the values of the N = 166 characteristics of the MACCS structural key associated with the target molecule TARGm (i.e. the value of the descriptor consists of the values of its N = 166 characteristics). The values of the N MACCS characteristics associated with the current molecule CURm are subsequently denoted MACCS(CURm, 1), ..., MACCS(CURm, N).

The selection device 2 then starts the iterations of the selection process (step E20 of incrementing the index iter).

More particularly, the selection device 2 evaluates, via its evaluation module 2B, for each molecule MOLk of the database 10 considered, k = 1, ..., K, where K is an integer designating the number of molecules listed in the database 10, a so-called global similarity metric denoted S(CURm, MOLk) between the value of the MACCS 166 descriptor associated in the database 10 with this molecule MOLk and the value of the MACCS 166 descriptor associated with the current molecule CURm (step E30). More precisely, this global similarity metric is calculated here between the N values of the N characteristics of the MACCS 166 descriptor associated in the database 10 with the molecule MOLk and the N values of the N characteristics of the MACCS 166 descriptor associated with the current molecule CURm.

In the embodiment described here, the global similarity metric S(CURm, MOLk) between each molecule MOLk of the database 10 and the current molecule CURm is evaluated from so-called local similarity measures ls(CURm, MOLk, n), n = 1, ..., N, calculated for each of the N characteristics of the MACCS 166 descriptor of the molecules considered.

These local similarity measures are defined here from a local similarity function ls which associates with any pair of integer characteristic values (x, y) a real number ls(x, y) (denoted here ls(CURm, MOLk, n) for the nth characteristic), between 0 and 1 and satisfying the following properties:

ls (x, x) = 1 for any natural integer x;

ls (x, y) = ls (y, x) for x and y any natural integers.

In the embodiment described here, the function ls results from the composition of a function d comparable to a distance between the values x and y, and a function f converting the distance between x and y into a local similarity measure, i.e.:

ls (x, y) = f (d (x, y))

Different choices are possible for the algebraic distance d (x, y). In the embodiment described here, the evaluation module 2B uses the distance d thus defined:

Furthermore, the evaluation module 2B uses as conversion function f a normalized Gaussian function parameterized by a predetermined real number σ.

Of course, other distances and other conversion functions can be used by the evaluation module 2B to determine the local similarity measures between the N characteristic values of the descriptor considered for the current molecule CURm and the N characteristic values of that descriptor for the molecule MOLk. However, the conversion function is preferably chosen so as to associate with any number of the real line a real value between 0 and 1 such that:

(i) f (+/- ∞) = 0 (i.e. at an infinite distance between two values of a characteristic we associate a zero similarity value); and

(ii) f (0) = 1 (i.e. at a zero distance between two values of a characteristic, we associate a unit similarity value).
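The composition ls(x, y) = f(d(x, y)) and the two conditions (i) and (ii) can be sketched as follows. This is an illustration only: the distance d(x, y) = |x - y| and the Gaussian form exp(-d²/(2σ²)) used below are assumptions chosen because they satisfy the stated properties; they are not presented as the exact expressions used by the evaluation module 2B. The function names are hypothetical.

```python
import math


def distance(x: int, y: int) -> float:
    """Assumed distance between two characteristic values: |x - y|."""
    return float(abs(x - y))


def conversion(d: float, sigma: float = 1.0) -> float:
    """Assumed Gaussian conversion, normalized so that conversion(0) == 1."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def local_similarity(x: int, y: int, sigma: float = 1.0) -> float:
    """ls(x, y) = f(d(x, y)): a value between 0 and 1."""
    return conversion(distance(x, y), sigma)
```

With these choices, ls(x, x) = 1 and ls(x, y) = ls(y, x) hold by construction, and the similarity decays towards 0 as the two values move apart, at a rate set by σ.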

Thus, during the evaluation step E30, for each molecule MOLk of the database 10, the evaluation module 2B calculates, for each characteristic of the MACCS 166 descriptor indexed by the integer n, n = 1, ..., N, the following local similarity measure:

ls(CURm, MOLk, n) = f(d(MACCS(CURm, n), MACCS(MOLk, n)))

where MACCS(CURm, n) and MACCS(MOLk, n) respectively denote the value of the nth characteristic of the MACCS descriptor of the current molecule CURm and the value of the nth characteristic of the MACCS descriptor of the molecule MOLk.

Then the evaluation module 2B evaluates the global similarity metric S(CURm, MOLk) between the molecule MOLk and the current molecule CURm according to the following equation:

$$S(\text{MOL-A}, \text{MOL-B}) = \frac{\sum_{n=1}^{N} w_n \, ls(\text{MOL-A}, \text{MOL-B}, n)}{2\sum_{n=1}^{N} w_n - \sum_{n=1}^{N} w_n \, ls(\text{MOL-A}, \text{MOL-B}, n)}$$

with MOL-A = CURm and MOL-B = MOLk, and where w_n, n = 1, ..., N, denote real weights.
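The global similarity, i.e. the weighted sum of the local similarities divided by twice the sum of the weights minus that weighted sum, can be sketched as follows. The function names are hypothetical, and the exact-match local similarity used in the usage example is only a stand-in that satisfies ls(x, x) = 1 and symmetry; it is not the Gaussian-based function of the embodiment.

```python
from typing import Callable, Optional, Sequence


def global_similarity(
    a: Sequence[int],
    b: Sequence[int],
    ls: Callable[[int, int], float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """S(A, B) = sum(w_n * ls_n) / (2 * sum(w_n) - sum(w_n * ls_n))."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same number of characteristics")
    if weights is None:
        weights = [1.0] * len(a)  # unit weights, as in the embodiment described
    total = sum(weights)
    if total <= 0:
        raise ValueError("the sum of the weights must be positive")
    weighted = sum(w * ls(x, y) for w, x, y in zip(weights, a, b))
    return weighted / (2.0 * total - weighted)


def exact_match(x: int, y: int) -> float:
    """Stand-in local similarity (assumption): 1 if the values are equal, else 0."""
    return 1.0 if x == y else 0.0


# Two of the four characteristics match: S = 2 / (2 * 4 - 2) = 1/3.
s = global_similarity([1, 0, 2, 0], [1, 0, 1, 1], exact_match)
```

Since every local similarity lies between 0 and 1, the weighted sum never exceeds the sum of the weights, so with positive weights the denominator stays positive and S itself lies between 0 and 1, reaching 1 only when all local similarities equal 1.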

It should be noted that this expression of the overall similarity results from a search by the inventors of a similarity measure which, unlike the Tanimoto metric commonly used in the techniques of the prior art, makes it possible to take into account different levels of expression of the same characteristic of the descriptor (ie different values of the same characteristic) between two compared molecules, and which also considers the common non-expression of the same descriptor characteristic (ie null value of this characteristic) as a mark of similarity between the two compared molecules.
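The two differences with the Tanimoto metric noted above can be made concrete on toy descriptors. The sketch below is illustrative only: it uses unit weights and an exact-match local similarity as a stand-in for ls (an assumption), and the Tanimoto coefficient is computed in its usual presence/absence form, in which any non-zero value counts as "present". The function names are hypothetical.

```python
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient on presence/absence bits (non-zero means present)."""
    on_a = sum(1 for x in a if x)
    on_b = sum(1 for y in b if y)
    common = sum(1 for x, y in zip(a, b) if x and y)
    union = on_a + on_b - common
    # Convention chosen here (assumption): two all-zero descriptors give 0.
    return common / union if union else 0.0


def global_similarity_exact(a: Sequence[int], b: Sequence[int]) -> float:
    """S with unit weights and an exact-match local similarity (stand-in)."""
    n = len(a)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / (2 * n - matches)


# Common absence counts as similarity for S but is ignored by Tanimoto.
t1 = tanimoto([1, 0, 0, 0], [1, 0, 0, 1])                 # 1 / (1 + 2 - 1) = 0.5
s1 = global_similarity_exact([1, 0, 0, 0], [1, 0, 0, 1])  # 3 / (8 - 3) = 0.6

# Different levels of the same characteristic are invisible to Tanimoto.
t2 = tanimoto([2, 0], [1, 0])                 # 1 / (1 + 1 - 1) = 1.0
s2 = global_similarity_exact([2, 0], [1, 0])  # 1 / (4 - 1) = 1/3
```

In the first pair, the two positions where both molecules have the value 0 raise S but do not enter the Tanimoto coefficient; in the second pair, Tanimoto reports identical molecules although the first characteristic is expressed at different levels.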

To obtain this expression, the inventors had the judicious idea of using the Jaccard index J(A, B) of two sets A and B, defined by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

where the symbols ∩ and ∪ respectively denote the intersection and the union of the sets A and B, and |X| denotes the cardinal of a set X. They then applied this Jaccard index to two sets A and B made up of the pairs formed of each characteristic index n, n = 1, ..., N, and of the value of the corresponding characteristic, associated with two distinct molecules denoted MOL-A and MOL-B (for example here MOL-A = CURm and MOL-B = MOLk). The cardinal of the intersection of the sets A and B can then be written in the form:

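$$|A \cap B| = \sum_{n=1}^{N} w_n \, \bigl|\{(n, \mathrm{MACCS}(\text{MOL-A}, n))\} \cap \{(n, \mathrm{MACCS}(\text{MOL-B}, n))\}\bigr|$$

considering that the pairs of the molecules MOL-A and MOL-B corresponding to different MACCS descriptor characteristics have empty intersections, and where w_n, n = 1, ..., N, denote real weights. Then, setting:

$$\bigl|\{(n, \mathrm{MACCS}(\text{MOL-A}, n))\} \cap \{(n, \mathrm{MACCS}(\text{MOL-B}, n))\}\bigr| = ls(\text{MOL-A}, \text{MOL-B}, n)$$

we obtain:

$$|A \cap B| = \sum_{n=1}^{N} w_n \, ls(\text{MOL-A}, \text{MOL-B}, n)$$

Noting that |A| = |B| = N (each set contains one pair per characteristic; when each pair is counted with its weight w_n, the weighted cardinal is $\sum_{n=1}^{N} w_n$, which equals N for unit weights), we obtain from the formula of the Jaccard index:

$$J(A, B) = \frac{\sum_{n=1}^{N} w_n \, ls(\text{MOL-A}, \text{MOL-B}, n)}{2\sum_{n=1}^{N} w_n - \sum_{n=1}^{N} w_n \, ls(\text{MOL-A}, \text{MOL-B}, n)}$$

By applying the Jaccard index to the molecules CURm and MOLk, the inventors obtained the global similarity measure used by the evaluation module 2B during step E30.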

We note that a different definition of the sets A and B to which the Jaccard index defined above is applied, with weights w_n = 1 for n = 1, ..., N, makes it possible to obtain the Tanimoto metric. In the embodiment described here, the evaluation module 2B uses weights w_n, n = 1, ..., N, all equal to 1.

Alternatively, real weights different from 1 can be applied by the evaluation module 2B. Different strategies can be considered to determine the weights w_n, n = 1, ..., N. For example, these weights can be set by an expert, based on domain knowledge of the relevance of each characteristic of the descriptor given the type of target molecule TARGm whose property is to be predicted. These weights can also be determined using statistical methods, in particular classification methods such as Linear Discriminant Analysis (LDA), which makes it possible to determine weights leading to a better discrimination between the experimentally positive molecules (i.e. those considered to have responded positively to the toxicity test considered) and the experimentally negative molecules (i.e. those considered to have responded negatively to the toxicity test considered).

Once the global similarity metrics S(CURm, MOLk) have been evaluated for each molecule MOLk of the database 10, the selection device 2, via its selection module 2C, determines which molecules of the database 10 have a global similarity measure greater than a predetermined threshold THRmin (or, equivalently, greater than or equal to a predetermined threshold THRmin') and selects them (step E40).

The molecules thus selected form a set C(iter) of molecules considered to be similar to the current molecule CURm. Here, the threshold THRmin is a parameter between 0 and 1 that remains constant over the iterations of the selection method. It may depend in particular on the type of target molecule TARGm whose properties are to be determined (e.g. high-energy molecule, solvent, plasticizer, liquid, etc.). This threshold can be determined experimentally beforehand.

By way of example, the inventors have determined by experimentation that a threshold THRmin = 0.85 (or greater than or equal to 0.85) leads to good predictions for different categories of molecules (fillers, plasticizers, liquids, etc.).
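The selection step E40 amounts to a simple filter on the global similarity values. The sketch below assumes that the similarities have already been computed and stored in a dictionary keyed by molecule name; the function name, the dictionary layout and the sample values are hypothetical. The strict comparison follows the "greater than THRmin" wording.

```python
from typing import Dict, List

THR_MIN = 0.85  # example threshold value reported by the inventors


def select_similar(similarities: Dict[str, float], thr_min: float = THR_MIN) -> List[str]:
    """Step E40: keep the molecules whose global similarity exceeds thr_min."""
    return [name for name, s in similarities.items() if s > thr_min]


# Made-up similarity values for three molecules of a database.
c_iter = select_similar({"MOL1": 0.91, "MOL2": 0.85, "MOL3": 0.40})
```

Here only MOL1 is selected: MOL2 sits exactly on the threshold and is excluded by the strict comparison.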

As a variant, the THRmin threshold may change over the iterations.

The set of molecules C(iter) selected during the current iteration iter is then added by the selection module 2C to the reference subset CREF (step E50). It should be noted that certain molecules contained in the set C(iter) may already be present in the reference subset CREF, in which case only the new molecules, not already present in CREF, are added.
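The addition step E50, restricted to molecules not already present, can be sketched as follows. The function name and the representation of CREF as a list of molecule names are assumptions made for illustration.

```python
from typing import List


def add_to_reference(cref: List[str], selected: List[str]) -> List[str]:
    """Step E50: append to cref the selected molecules it does not yet contain.

    Returns the list of newly added molecules, which may be empty.
    """
    present = set(cref)
    newly_added = []
    for name in selected:
        if name not in present:
            cref.append(name)
            present.add(name)
            newly_added.append(name)
    return newly_added


cref = ["MOL1", "MOL2"]
new = add_to_reference(cref, ["MOL2", "MOL3"])
```

Returning the newly added molecules is convenient because both the update step and one of the stopping criteria may depend on whether any new molecule was selected during the iteration.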

Then, in the embodiment described here, the selection device 2, via its update module 2D, updates the value of the MACCS descriptor associated with the current molecule (step E60). This results in an update of the N values MACCS(CURm, 1), ..., MACCS(CURm, N) of the characteristics of the descriptor associated with the current molecule CURm. The aim is to define a new "virtual" molecule that becomes the current molecule for the next iteration, from which a new similarity search will be performed in the database 10.

According to the invention, this update is carried out from the descriptor values of at least a part of the molecules present in the CREF reference subset at the end of step E50.

Different ways of updating the N characteristic values MACCS(CURm, n), n = 1, ..., N, of the MACCS descriptor can be implemented by the update module 2D. These ways can be distinguished, on the one hand, by the molecules of the CREF reference subset that are used and, on the other hand, by the way in which the values of the characteristics of the descriptor of these molecules are combined to obtain the updated values for the current molecule CURm.

In the embodiment described here, the update of the MACCS descriptor characteristic values of the current molecule CURm is based on the values of the characteristics of the MACCS descriptor of the molecules selected during the current iteration iter, i.e. on the molecules contained in the set C(iter).

In another embodiment, the update of the MACCS descriptor characteristic values of the current molecule CURm is based on the MACCS descriptor characteristic values of all the molecules belonging to the CREF reference set at the end of step E50.

In yet another embodiment, the updating of the MACCS descriptor characteristic values of the current molecule CURm is based solely on the values of the characteristics of the MACCS descriptor of the molecules newly selected during the selection step E40 implemented during the current iteration iter, in other words on the values of the characteristics of the MACCS descriptor of the molecules belonging to the set C(iter) but which do not already belong to the reference set CREF before step E50.

Furthermore, in the embodiment described here, to update each value MACCS(CURm, n) of the characteristics of the MACCS descriptor of the current molecule CURm, n = 1, ..., N, the update module 2D uses the most frequent value of this characteristic among the values of this characteristic associated with the molecules considered for the update. In case of ambiguity, that is, if several distinct values are equally the most frequent, the update module 2D uses the highest of these values.
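The "most frequent value, highest value in case of a tie" rule can be sketched as follows, applied characteristic by characteristic to the descriptors of the molecules considered for the update. The function names and the toy three-characteristic descriptors are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def most_frequent_value(values: Sequence[int]) -> int:
    """Most frequent value; if several values tie, the highest one is kept."""
    counts = Counter(values)
    best = max(counts.values())
    return max(v for v, c in counts.items() if c == best)


def update_by_mode(descriptors: Sequence[Sequence[int]]) -> List[int]:
    """Step E60 (mode variant): one updated value per characteristic."""
    if not descriptors:
        raise ValueError("at least one molecule is needed for the update")
    # zip(*descriptors) groups the values characteristic by characteristic.
    return [most_frequent_value(column) for column in zip(*descriptors)]


# Toy descriptors of three molecules with three characteristics each.
updated = update_by_mode([[1, 0, 2], [1, 1, 0], [0, 1, 2]])  # [1, 1, 2]
tie = update_by_mode([[0, 3], [1, 3]])                       # [1, 3]
```

In the second call, the values 0 and 1 each occur once for the first characteristic, so the tie rule keeps the higher value 1.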

Alternatively, to update each value MACCS(CURm, n) of the characteristics of the MACCS descriptor of the current molecule CURm, n = 1, ..., N, the update module 2D may use an average of the values of this characteristic associated with the molecules considered for the update (or the integer closest to this average, in order to obtain integer characteristics), this average possibly being an arithmetic or weighted average.

At the end of this step E60, a new current molecule CURm is thus obtained, on which a new similarity search in the database 10 can be performed during the next iteration.
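The averaging variant of the update can be sketched in the same way. The function name and the per-molecule weights are hypothetical, and rounding halves upwards is an assumption: the text only asks for the integer closest to the average and does not say how an exact half is handled.

```python
import math
from typing import List, Optional, Sequence


def update_by_mean(
    descriptors: Sequence[Sequence[int]],
    weights: Optional[Sequence[float]] = None,
    as_integer: bool = True,
) -> List[float]:
    """Step E60 (mean variant): arithmetic or weighted mean per characteristic."""
    if not descriptors:
        raise ValueError("at least one molecule is needed for the update")
    if weights is None:
        weights = [1.0] * len(descriptors)  # arithmetic mean
    total = sum(weights)
    updated = []
    for column in zip(*descriptors):
        mean = sum(w * v for w, v in zip(weights, column)) / total
        # Round half up (assumption) when integer characteristics are wanted.
        updated.append(math.floor(mean + 0.5) if as_integer else mean)
    return updated


molecules = [[1, 0, 2], [1, 1, 0], [0, 1, 2]]
arithmetic = update_by_mean(molecules)                         # means 2/3, 2/3, 4/3
weighted = update_by_mean(molecules, weights=[2.0, 1.0, 1.0])  # means 0.75, 0.5, 1.5
```

The per-molecule weights could, for instance, reflect the similarity of each molecule to the current molecule; that choice is not specified in the text and is left open here.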

In the embodiment described here, the selection device 2 checks, at the end of step E60, whether the stopping criterion CRU is satisfied (test step E70). Different stopping criteria can be envisaged, for example:

A predetermined number ITERMAX of iterations carried out;

A number KMAX of molecules reached in the CREF reference set;

The absence of newly selected molecules in the set C (iter) during the selection step E40.

This stopping criterion can be parameterizable. The numbers ITERMAX and KMAX are also parameterizable, and depend in particular on the type of molecules considered.
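The three example criteria listed above can be sketched as a single test. Combining them with a logical "or" is an assumption made for illustration, since the text presents them as alternative examples; the function and parameter names are hypothetical, and the default values 5 and 600 are those used for the tests reported in the annexes.

```python
def stop_criterion_met(
    iteration: int,
    cref_size: int,
    n_new: int,
    iter_max: int = 5,
    k_max: int = 600,
) -> bool:
    """Test step E70 (sketch): True when the iterations should stop.

    iteration: number of iterations already carried out
    cref_size: number of molecules currently in the reference set CREF
    n_new:     number of molecules newly selected during the last step E40
    """
    return iteration >= iter_max or cref_size >= k_max or n_new == 0
```

A caller would evaluate this test at the end of each iteration, after the update step, and start a new iteration only when it returns False.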

If the stopping criterion is not satisfied (answer no at test step E70), a new iteration of the selection method is carried out (incrementation step E20), this iteration comprising the repetition of steps E30 to E70 for the new current molecule CURm obtained during step E60.

If the stopping criterion is satisfied (answer yes at test step E70), the iterations of the selection method are stopped and the reference set CREF is supplied to the prediction module 3 for the prediction of the properties of the target molecular substance TARGm.

It should be noted that if the stopping criterion CRU considered is a number KMAX of molecules reached in the CREF reference set, the reference set CREF considered is preferably the one obtained at the end of the last iteration for which the number KMAX is not exceeded.
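Steps E10 to E70 can be put together in a compact sketch of the whole selection loop. Several choices below are assumptions made for illustration: the distance |x - y| and the Gaussian conversion with σ = 1, unit weights, the mode-based update on C(iter), the combination of the three stopping criteria, and the toy four-characteristic database. The "no new molecule" test is placed before the update, which leaves the resulting CREF unchanged. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def local_similarity(x: int, y: int, sigma: float = 1.0) -> float:
    # Assumed form: Gaussian of the distance |x - y|, equal to 1 when x == y.
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def global_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    # Global similarity with unit weights: s / (2N - s).
    s = sum(local_similarity(x, y) for x, y in zip(a, b))
    return s / (2.0 * len(a) - s)


def mode_update(descriptors: Sequence[Sequence[int]]) -> List[int]:
    # Most frequent value per characteristic, highest value in case of a tie.
    updated = []
    for column in zip(*descriptors):
        counts = Counter(column)
        best = max(counts.values())
        updated.append(max(v for v, c in counts.items() if c == best))
    return updated


def select_reference_subset(
    target: Sequence[int],
    database: Dict[str, Sequence[int]],
    thr_min: float = 0.85,
    iter_max: int = 5,
    k_max: int = 600,
) -> List[str]:
    cref: List[str] = []        # E10: CREF starts as an empty set
    current = list(target)      # E10: CURm takes the descriptor of TARGm
    for _ in range(iter_max):   # E20: at most iter_max iterations
        # E30 and E40: evaluate every molecule and keep the similar ones.
        selected = [name for name, desc in database.items()
                    if global_similarity(current, desc) > thr_min]
        new = [name for name in selected if name not in cref]
        if not new:             # E70: no newly selected molecule
            break
        if len(cref) + len(new) > k_max:
            break               # keep the last CREF that does not exceed KMAX
        cref.extend(new)        # E50: add only the new molecules
        current = mode_update([database[name] for name in selected])  # E60
    return cref


toy_db = {"A": [1, 1, 0, 0], "B": [1, 1, 0, 1], "C": [1, 0, 0, 1], "D": [0, 0, 1, 1]}
cref = select_reference_subset([1, 1, 0, 0], toy_db, thr_min=0.8)
```

On this toy database, molecule C is not similar enough to the target itself, but it becomes similar to the "virtual" current molecule obtained after the first update, which is how the iterations enlarge the reference subset.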

FIG. 4 illustrates the different steps of the prediction method implemented by the prediction device 1.

In this figure, step F10 comprises the steps of the selection method of the reference subset CREF previously described with reference to FIG. 3 and implemented by the selection device 2 of the prediction device 1.

As mentioned above, the reference set CREF obtained by the selection device 2 is then supplied to the prediction module 3. The latter is configured to predict at least one property of the target molecular substance TARGm from the molecules of the reference set CREF selected by the selection device 2 (step F20).

No limitation is attached to the prediction technique implemented by the prediction module 3 for this purpose. It can in particular use a QSAR type relationship as described above and commonly used in the state of the art, or a neural network, a prediction technique by principal component analysis, etc. This prediction technique uses the experimental results obtained for the molecules of the CREF reference set, as listed in the database from which the CREF set has been extracted. The use of such prediction techniques is known per se and is not described in more detail here.

The prediction device 1 then obtains, at the end of step F20, a prediction of at least one biological property of the target molecular substance TARGm. Other predictions can be made by the prediction device 1 from other databases corresponding to other biological tests.

The invention, via the proposed new selection method, makes it possible to obtain a reliable prediction of the properties of a molecular substance from the properties of molecules of the same type listed in public databases in particular. The inventors have observed an improvement in the predictions obtained with respect to the state of the art prediction techniques for different categories of molecules (fillers, plasticizers, oxidizers, liquids, stabilizers, pyrotechnic components, etc.) and for various regulatory tests known to those skilled in the art (eg AMES mutagenicity test, chromosome aberration test, UDS unscheduled DNA synthesis test, carcinogenicity test, etc.). Some results are provided in Annexes 1 to 6 to illustrate the performance of the selection and prediction methods according to the invention.

Annex 1 illustrates prediction results obtained for the AMES test using five different prediction methods. The AMES test is, in a known manner, a mutagenicity test carried out on different bacterial cultures and aimed at determining whether a molecule has a mutagenic property (indicated in the table in Annex 1 by a "+" symbol, a "-" symbol indicating that the molecule does not exhibit a mutagenic property).

The table presents in its first column data which were obtained experimentally for the molecules tested. These data were validated at European level and were used as a reference to determine the relevance of the predictions made using the different prediction methods tested. For each of these methods, when a result obtained is between 0 and 0.4, it is considered negative, that is to say as reflecting the absence of the mutagenic property in the molecule tested; when this result is between 0.4 and 0.6, it is considered doubtful; and when this result is greater than 0.6, it is considered positive, that is to say as reflecting the presence of the mutagenic property in the molecule tested.
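The reading of a prediction result as negative (below 0.4), doubtful (between 0.4 and 0.6) or positive (above 0.6, i.e. presence of the mutagenic property) can be sketched as follows. The function name is hypothetical, and assigning the exact boundary values 0.4 and 0.6 to the doubtful class is an assumption, since the text does not say to which class they belong.

```python
def classify_score(score: float) -> str:
    """Map a prediction result between 0 and 1 to "-", "doubtful" or "+"."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("the prediction result is expected between 0 and 1")
    if score < 0.4:
        return "-"         # absence of the property
    if score <= 0.6:
        return "doubtful"  # boundaries 0.4 and 0.6 included here (assumption)
    return "+"             # presence of the property
```

A prediction is then counted as correct when the returned symbol matches the experimental reference given in the first column of the table.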

The table given in Annex 1 provides the prediction results obtained via the five methods tested for different charge-type molecules: the five prediction methods were each applied to a starting database comprising 7723 reference molecules. More precisely:

The column of the table bearing the reference (1) corresponds to the application of a QSAR relation on the starting database;

The column of the table bearing the reference (2) corresponds to the application of a QSAR relation on a database obtained by selecting in the starting database the molecules presenting a similarity metric (Tanimoto metric) of 0.8;

The column of the table bearing the reference (3) corresponds to the application of a QSAR relation on a database obtained by selecting in the starting database the molecules presenting a similarity metric (Tanimoto metric) of 0.8 ;

The column of the table bearing the reference (4) corresponds to the application of a QSAR relation on a database obtained by the iterative selection method according to the invention, applied to the starting database (MACCS 166 structural descriptors). The stopping criteria considered for the iterative process are a maximum of 5 iterations or 600 molecules selected from the starting database. The local and global metrics described in the previously detailed embodiment have been used; and

The column of the table bearing the reference (5) corresponds to the application of an automatic learning algorithm, also commonly referred to as a "machine learning" algorithm, on a database obtained by means of the iterative selection method according to the invention, applied to the starting database (MACCS 166 structural descriptors). The stopping criteria considered for the iterative process are a maximum of 5 iterations or 600 molecules selected from the starting database. The local and global metrics described in the previously detailed embodiment have been used.

It appears from the results obtained for various molecules that the prediction method according to the invention, whether based on a QSAR relation or on a machine learning algorithm, gives very good prediction results (respectively 16 and 17 correct predictions out of the 17 performed), and better performance than the other state-of-the-art methods tested (corresponding to columns (2) and (3)).

Annex 2 reports other prediction results obtained for the AMES test, for different categories of molecules (fillers, plasticizers, oxidizers, liquids, stabilizers and pyrotechnic molecules), with the selection and prediction methods according to the invention (column "prediction" of the different tables in Annex 2). The same assumptions as those used in Annex 1 were considered (maximum number of iterations equal to 5, 600 molecules selected at most, local and global metrics detailed previously, MACCS 166 structural descriptors); the actual prediction step was carried out on the molecules selected by the selection method according to the invention, by applying a machine learning algorithm.

Experimental data obtained for the tested molecules are given as an indication (column "Exp. Data"). The percentages indicated correspond to the reliability of the prediction produced by the invention. When this reliability is between 40 and 60%, the result of the prediction is considered doubtful. Beyond 60%, the prediction is considered correct. Below 40%, the prediction is considered wrong.

Thus, the different tables produced in Appendix 2 show that:

The prediction method led to a correct prediction for all the tested molecules of the charge type (i.e. all the percentages reported are greater than 60%), for all the tested molecules of the liquid type, and for all the tested molecules of the stabilizer type;

For the sets of tested molecules of the pyrotechnic and oxidizer types, only one molecule in each set led to a doubtful prediction (corresponding to a reliability of 58% and 57% respectively).

Annexes 3 to 5 report prediction results obtained via the prediction method according to the invention for other known regulatory tests (chromosome aberration test in Annex 3, UDS test in Annex 4, carcinogenicity test in Annex 5). The same assumptions as those used in Annex 2 were considered for the implementation of the methods according to the invention and the interpretation of the results presented.

Annex 6 compares the results obtained via the prediction method according to the invention and via another prior art prediction method known as ACD (Advanced Chemistry Development) Percepta (described in more detail on the web page https://www.acdlabs.com/products/percepta/).

The results concerning the prediction method according to the invention were obtained from two different starting databases (referred to as "first test base" and "second test base"). The first test base is the one already used to generate the results reported in Annexes 2 to 5. The first column of results in the table presented in Annex 6 gives the rate of correct predictions obtained via the prediction method according to the invention for the different molecules tested, for the different tests considered. This first column summarizes the results shown in Annexes 2 to 6 for all categories of molecules considered together, and supplements these results for other known regulatory tests (Mouse Lymphoma Test (MLA), DLT, and Reprotoxicity Test).

Other results, obtained on a second starting database, are also reported in the table in Annex 6. These results make it possible to compare the performance obtained on the second base with the prediction method according to the invention (under the same assumptions as previously described) with the performance obtained on this same base with the ACD method. It can be seen that the rate of correct predictions obtained with the prediction method according to the invention is approximately 90%, as against 55% for the ACD method.

Annex 1

AMES mutagenicity test

Charge type molecules

Annex 2

AMES mutagenicity test

Different categories of molecules

Annex 3

Chromosome aberration test

Different categories of molecules

Annex 4

UDS test

Different categories of molecules

D - doubtful

Annex 5

Carcinogenicity test

Different categories of molecules

Annex 6

Comparison of the results obtained with the prediction method according to the invention and the ACD method

| Test | Method of the invention applied on a first test base | Method of the invention applied on a second test base | ACD method applied on the second test base |
|---|---|---|---|
| Ames test | 59/61 | 39/45 (of which 1 impossible) | 39/45 (2 of which impossible) |
| Chromosomal aberration test | 26/30 | 22/22 | 16/22 (of which 1 impossible) |
| MLA test | 16/18 | 15/15 | 14/15 |
| UDS test | 15/16 | 17/25 | 11/25 (2 of which impossible) |
| DLT test | 13/16 | 11/12 (of which 1 impossible) | Unrealizable (undeveloped by ACD) |
| Carcinogenicity test | 30/32 | 34/38 | 19/38 (of which 5 impossible) |
| Reprotoxicity test | 18/21 | 23/27 | 1/27 |
| Number of correct answers | 177/194 | 161/184 | 100/184 |
| Percentage (%) of correct answers | 91.2 | 87.5 | 54.4 |

Claims

1. An iterative method for selecting a so-called reference subset of molecules (CREF) for use in predicting at least one property of a so-called target molecular structure, the selection method comprising an initialization step (E10) associating with a so-called current molecule a value of a predetermined molecule descriptor associated with the target molecular structure, and during each iteration (E20) of the selection method:

An evaluation step (E30), for each molecule of a base (10) comprising a plurality of molecules each associated with a value of said descriptor, of a so-called overall similarity measure between the value of the descriptor associated with said molecule and the value of the descriptor associated with the current molecule;

A step of selecting (E40) molecules of the base having an overall similarity measurement greater than a predetermined threshold, the selected molecules being added (E50) to the reference subset; and

A step of updating (E60) the value of the descriptor associated with the current molecule from the values of the descriptors associated with at least a part of the molecules belonging to the reference subset.

2. A selection method according to claim 1 wherein the molecule descriptor comprises N characteristics, where N denotes an integer greater than 1, and wherein the evaluation step (E30) comprises, for each molecule of the base, a step of calculating, for each of the N characteristics of the descriptor, a so-called local similarity measure between the value of this characteristic of the descriptor associated with said molecule and the value of this characteristic of the descriptor associated with the current molecule, the overall similarity measure evaluated for said molecule being obtained from the local similarity measures calculated for this molecule.
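The relation between local and overall similarity can be sketched as below. This is only an illustration: the absolute-difference distance, the exp(-d) conversion and the plain average used to aggregate the N local measures are placeholder assumptions, not the expressions recited in the dependent claims.

```python
import math
from typing import Callable, Sequence


def local_similarity(x: float, y: float) -> float:
    """Hypothetical local measure: absolute difference converted with exp(-d)."""
    return math.exp(-abs(x - y))


def overall_similarity(
    a: Sequence[float],
    b: Sequence[float],
    local: Callable[[float, float], float] = local_similarity,
) -> float:
    """Assumed aggregation: average of the N local similarity measures."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same number of characteristics")
    return sum(local(x, y) for x, y in zip(a, b)) / len(a)


identical = overall_similarity([1, 0, 1], [1, 0, 1])
one_mismatch = overall_similarity([1, 0], [0, 0])
```

Passing the local measure as a parameter keeps the aggregation independent of the particular distance and conversion function chosen.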

3. Selection method according to claim 2 wherein the calculation step comprises for each descriptor feature:

A calculation of a distance between the value of this characteristic of the descriptor associated with said molecule and the value of this characteristic of the descriptor associated with the current molecule; and

4. Selection method according to claim 3 wherein the calculated distance, denoted d, satisfies:

5. The selection method according to claim 3 or 4 wherein the conversion function, denoted f, satisfies:

f = exp (^)

6. A selection method according to any one of claims 2 to 5 wherein, during the evaluation step (E30), the overall similarity measure evaluated for said molecule is the ratio between:

7. A selection method according to any one of claims 2 to 6 wherein the values of the N characteristics of the descriptor reflect the presence or absence of N molecular fragments considered in the definition of a MACCS 166 structural key.

8. A selection method according to any one of claims 1 to 7 wherein, during the updating step (E60), the value associated with the current molecule of each characteristic of the descriptor is updated with an arithmetic or weighted mean of the values of this characteristic of the descriptor associated with the molecules of said at least a part of the molecules belonging to the reference subset.

9. A selection method according to any one of claims 1 to 8 wherein the molecule descriptor comprises N characteristics, where N denotes an integer greater than or equal to 1, and wherein, in the updating step (E60), the value associated with the current molecule of each characteristic of the descriptor is updated with the most frequent value of this characteristic among the values of this characteristic associated with the molecules of said at least a part of the molecules belonging to the reference subset, or, if a plurality of distinct values satisfy this condition, with the highest value among this plurality of distinct values.
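The two update rules described for step E60, a mean (arithmetic or weighted) and a most-frequent value with ties resolved towards the highest value, can be sketched as follows. Function names and the way weights are passed are assumptions of this sketch.

```python
from collections import Counter
from typing import List, Optional, Sequence


def update_by_mean(
    descriptors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Characteristic-wise mean; arithmetic when no weights are given."""
    if weights is None:
        weights = [1.0] * len(descriptors)
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, column)) / total
        for column in zip(*descriptors)
    ]


def update_by_mode(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """Characteristic-wise most frequent value; ties go to the highest value."""
    updated = []
    for column in zip(*descriptors):
        counts = Counter(column)
        best_count = max(counts.values())
        updated.append(max(v for v, c in counts.items() if c == best_count))
    return updated


pool = [[1, 0], [1, 1]]              # two molecules, two characteristics
mean_update = update_by_mean(pool)
mode_update = update_by_mode(pool)
```

For binary descriptors the mode rule keeps the current molecule binary, whereas the mean rule produces fractional values that express how often each fragment occurs in the pool.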

10. Selection method according to any one of claims 1 to 9 wherein, during the updating step (E60) implemented during an iteration of the selection method, said at least a part of the molecules belonging to the reference subset comprises the molecules selected during the selection step of this iteration that did not already belong to the reference subset before this selection step.

11. A selection method according to any one of claims 1 to 9 wherein, during the updating step (E60) implemented during an iteration of the selection method, said at least a part of the molecules belonging to the reference subset comprises the molecules selected during the selection step of this iteration.

12. A selection method according to any one of claims 1 to 9 wherein, during the updating step (E60) implemented during an iteration of the selection method, said at least a part of the molecules belonging to the reference subset comprises all the molecules belonging to the reference subset at the end of the selection step of this iteration.
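The three alternatives of claims 10 to 12 differ only in which molecules feed the update of step E60. A minimal sketch, in which the policy names are hypothetical labels chosen here:

```python
from typing import Set


def update_pool(policy: str, selected: Set[str], reference_before: Set[str]) -> Set[str]:
    """Return the molecules used to update the current descriptor.

    `selected` is the result of the selection step of this iteration and
    `reference_before` is the reference subset before that step.
    """
    if policy == "new_only":       # claim 10: selected molecules not yet in the subset
        return selected - reference_before
    if policy == "selected":       # claim 11: every molecule selected in this iteration
        return set(selected)
    if policy == "whole_subset":   # claim 12: the whole subset after the selection step
        return reference_before | selected
    raise ValueError(f"unknown policy: {policy}")


selected = {"B", "C"}
reference_before = {"A", "B"}
pools = {p: update_pool(p, selected, reference_before)
         for p in ("new_only", "selected", "whole_subset")}
```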

13. Selection method according to any one of claims 1 to 12 wherein the evaluation, selection and updating steps are repeated until a predetermined stopping criterion (CRU) is satisfied (E70), said stopping criterion being chosen from:

A predetermined number of iterations carried out;

A predetermined number of molecules reached in the reference subset;

An absence, among the molecules selected during the selection step, of molecules that do not already belong to the reference subset.
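The three stopping criteria can be checked as in the sketch below. The claim lets one criterion be chosen among them; combining them in a single function, with the first two as optional limits and the third always applied, is a choice made for this illustration, and the parameter names are hypothetical.

```python
from typing import Optional


def should_stop(
    iterations_done: int,
    reference_size: int,
    newly_added: int,
    max_iterations: Optional[int] = None,
    max_molecules: Optional[int] = None,
) -> bool:
    """Return True when one of the stopping criteria is met (step E70)."""
    # Criterion 1: a predetermined number of iterations has been carried out.
    if max_iterations is not None and iterations_done >= max_iterations:
        return True
    # Criterion 2: the reference subset has reached a predetermined size.
    if max_molecules is not None and reference_size >= max_molecules:
        return True
    # Criterion 3: the last selection step brought no molecule that was
    # not already in the reference subset.
    return newly_added == 0
```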

14. A method of predicting at least one property of a so-called target molecular substance comprising:

A selection step (F10), by means of an iterative selection method according to any one of claims 1 to 13, of a subset of molecules, referred to as reference molecules, in a database comprising a plurality of molecules each associated with a value of a predetermined molecule descriptor; and

A step of predicting (F20) at least one property of said target molecular substance from said selected subset of reference molecules.

15. Computer program (PROG) comprising instructions for executing the steps of the selection method according to any one of claims 1 to 13 or for executing the steps of the prediction method according to claim 14 when said program is executed by a computer.

16. Recording medium (6) readable by a computer on which is recorded a computer program comprising instructions for executing the steps of the selection method according to any one of claims 1 to 13 or for executing the steps of the prediction method according to claim 14.

17. A selection device (2) for a reference subset of molecules (CREF) intended to be used for predicting at least one property of a so-called target molecular structure, the selection device comprising an initialization module (2A) configured to associate with a so-called current molecule a value of a predetermined molecule descriptor associated with the target molecular structure, said selection device being further configured to activate, during a plurality of successive iterations:

An evaluation module (2B) configured to evaluate, for each molecule of a base comprising a plurality of molecules each associated with a value of the descriptor, a so-called overall similarity measure between the value of the descriptor associated with said molecule and the value of the descriptor associated with the current molecule;

A selection module (2C) configured to select molecules of the base having a global similarity measurement greater than a predetermined threshold, the selected molecules being added by said selection module to the reference subset; and

An update module (2D) configured to update the value of the descriptor associated with the current molecule from the values of the descriptors associated with at least part of the molecules belonging to the reference subset.

18. A prediction device (1) configured to predict at least one property of a so-called target molecular substance, comprising:

A selection device (2) according to claim 17, configured to select a subset of molecules, referred to as reference molecules, in a database (10) comprising a plurality of molecules each associated with a value of a predetermined molecule descriptor; and

A prediction module (3), configured to predict at least one property of said target molecular substance from the subset of reference molecules selected.