WO2024121356A1 - Method and system to predict at least one physico-chemical and/or odor property value for a chemical structure or composition - Google Patents

Method and system to predict at least one physico-chemical and/or odor property value for a chemical structure or composition

Info

Publication number
WO2024121356A1
WO2024121356A1 (application PCT/EP2023/084821)
Authority
WO
WIPO (PCT)
Prior art keywords
value
neural network
bond
atom
chemical
Prior art date
Application number
PCT/EP2023/084821
Other languages
English (en)
Inventor
Guillaume Godin
Ruud VAN DEURSEN
Julien Herzen
Florian Ravasi
Original Assignee
Firmenich Sa
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Firmenich Sa filed Critical Firmenich Sa
Publication of WO2024121356A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • TITLE: METHOD AND SYSTEM TO PREDICT AT LEAST ONE PHYSICO-CHEMICAL AND/OR ODOR PROPERTY VALUE FOR A CHEMICAL STRUCTURE OR COMPOSITION
  • the present invention aims at a method to predict at least one physico-chemical and/or odor property value for a chemical structure or composition, a system to predict at least one physicochemical and/or odor property value for a chemical structure or composition and a method to efficiently assemble chemical structures or compositions.
  • Measurements 305 stored in databases vary due to the environment of the experiment in which they were made.
  • Even in near-perfect environmental conditions, instruments and sample preparation by technicians may vary slightly from one to another. Merging experimental data from different sources can be difficult because experimental variation changes with the measurement conditions used.
  • Statistical methods to homogenize experimental data have been developed to reduce those variations with moderate to good success. In the field of machine learning, such variations may exist between training and test sets as well as between the known and future data.
  • Data augmentation has notably been applied to NLP transformer models for direct and single-step retrosynthesis (State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis, Nat Commun 11, 5575 (2020)). Simultaneously, augmentation can be used to identify whether a network is critically parametrized, i.e., the point where augmentation has little or no effect on the model's performance. Not all models are open to data augmentation.
  • Graph neural networks, for instance, are invariant to representation shuffling. GNNs are thus incompatible with existing data augmentation methods used for natural language processing or images.
  • a third issue is that the training procedure 315 of a model defines an important aspect in modelling. Frequently, there are too many variables that may influence a model’s decision. This may explain why hyper-parametrization optimization strategies may be required to improve models for performance or efficiency. Apart from the selected model, the question of the data split between train and test sets also plays a significant role. Several methods can be used from fully leave-one out, random-split to K-fold cross-validation to simulate and estimate the model quality on unseen data. In the end, a model’s prediction is just an educated guess depending on the used training conditions, model size, optimization parameters, and data split.
  • the first branch is learning models from graph neural networks (GNNs).
  • the second branch is NLP methods based upon line notation strings (such as the SMILES format), where the chemistry is exclusively learned from this syntax. This method has the benefit of data augmentation because the same molecule can be written as a new sentence in a different rule-based order (sentence grammar).
  • the third branch is image convolution neural networks learning and predicting from molecular images.
  • Ensembling is a technique that consists in training several models (usually called base models or weak learners) and at inference time aggregating their outputs with some voting mechanism.
  • This technique is widely used by practitioners (notably to obtain winning solutions in many machine learning competitions), and it is often a key step to improve final performance.
  • ensembling techniques aim to produce a diverse (or complementary) collection of base models and to combine them using some voting technique, usually meant to reduce the bias and/or the variance of the resulting system.
  • a host of different techniques can be used to train diverse models. For example, bagging (with bootstrap resampling) introduces diversity via sampling of the training dataset and boosting introduces diversity by training models in sequence in a way such that each model has the incentive to compensate for the errors made by the preceding ones.
  • Voting techniques can consist of simple averaging, majority voting (for classification), or stacking, whereby the final prediction is produced by a meta-model that is trained to combine the base models on some held-out dataset.
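  • As an illustration of these voting schemes (a minimal sketch with hypothetical names, not taken from the patent), simple averaging and a stacking-style linear combination over K base-model outputs can be written as follows:

```python
import numpy as np

K, N = 5, 100                          # 5 base models, 100 samples
rng = np.random.default_rng(0)
base_preds = rng.normal(size=(K, N))   # stand-in for K regressors' outputs

# Simple averaging: the ensemble prediction is the mean over base models.
avg_pred = base_preds.mean(axis=0)

# Stacking: a meta-model (here reduced to a linear combination whose
# weights would be fitted on a held-out dataset) combines the outputs.
meta_weights = np.full(K, 1.0 / K)     # placeholder for learned weights
stacked_pred = meta_weights @ base_preds
```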
  • the present invention aims at addressing all or part of these drawbacks.
  • the present invention aims at a method to predict at least one physico-chemical and/or odor property value for a chemical structure or composition, comprising the steps of:
  • an end-to-end trained ensemble neural network or multi-branch neural network model to predict at least one physico-chemical and/or odor property value for a chemical structure or composition
  • an end-to-end ensemble neural network or multi-branch neural network device comprising:
  • each sub-device being configured to provide an independent prediction based upon the exemplar data
  • - a layer configured to output at least one value based on, or representative of, the distribution of said independent predictions
  • said layer comprising a sampling device configured to output at least one random value as a function of a probability distribution representative of the distribution of independent predictions, said output random values being computed in a differentiable way and used for backpropagation within the end-to-end ensemble neural network or multi-branch neural network device,
  • Such provisions allow for the accurate prediction of physico-chemical and/or odor property values for defined chemical structures or compositions.
  • Such provisions allow, as well, for much greater prediction stability and reliability, improved training speed and overall performance, and the provision of a metric of variance representative of the model uncertainty. Such embodiments thus allow resource savings, in terms of computation time or power, as well as in terms of model complexity. Typically, current approaches require the use of numerous models and iterations to obtain a reliable prediction model.
  • Such provisions also offer a simple means to regularize the ensemble by introducing noise. Finally, such provisions allow for more stable training dynamics and better individual base models. This approach does not require any extra tuning, and it does not introduce new learnable parameters.
  • At least one set of inputs of the exemplar data corresponds to hash vectors of at least one atomic property in a chemical structure or composition
  • the method further comprising, upstream of the step of executing, a step of converting the defined digitized chemical structure or composition into a set of hash vectors of at least one atomic property representative of the digitized chemical structure or composition, said set of hash vectors being used as input during the step of executing.
  • At least one hash vector of a bond property is representative of one of the following: bond order, bond type, stereochemistry of the bond: bond direction for tetrahedral stereochemistry, bond direction for double bond stereochemistry or bond direction for spatial orientation, atomic number(s) for the “from” and/or “to” atoms, atomic symbols for the “from” and/or “to” atoms, dipole moment in the bond, quantum-chemical properties: electron density in the bond, electron configuration of the bond, bond orbitals, bond energies, attractive forces, repulsive forces, bond distance, aromatic bond, aliphatic bond, ring properties of the bond: number of rings on the bond, ring size(s) of the bond, smallest ring size of the bond, largest ring size of the bond, rotatable bond, spatially constrained bond, hydrogen bonding properties, ionic bonding properties, bond order for reactions, including the “null” bond to identify a broken/formed bond in a reaction: bond order in reagents, bond
  • At least one output value representative of the distribution is representative of a dispersion of the distribution.
  • the end-to-end ensemble neural network or multi-branch neural network device is trained to minimize at least one value representative of the dispersion of the distribution.
  • At least one odor property is representative of:
  • At least one physical property is representative of:
  • At least one neural network device is: - a recursive neural network device
  • the method object of the present invention comprises, upstream of the step of providing, a step of atom or bond relationship vector augmentation.
  • one molecular structure, represented by one or an augmented series of hashes, can be augmented up to a maximum number of times corresponding to the number of hashes of the series.
  • one molecular structure can become several inputs in the natural language processing application.
  • the step of atom or bond relationship vector augmentation comprises a step of horizontal augmentation, configured to provide several vectors representing a single digitized representation of a molecular structure or composition, each vector representing a particular variant of the canonical representation of the molecular structure or composition, each vector being treated as a single input during the step of providing.
  • the step of atom or bond relationship vector augmentation comprises a step of vertical augmentation, to create several groups of several horizontal augmentations, representing a unique molecular structure or composition, each group being treated as a single input during the step of providing.
  • the present invention aims at a method to efficiently assemble chemical structures or compositions, comprising:
  • the present invention aims at a system to predict at least one physico-chemical and/or odor property value for a chemical structure or composition, comprising the means of:
  • an end-to-end trained ensemble neural network or multi-branch neural network model to predict at least one physico-chemical and/or odor property value for a chemical structure or composition
  • an end-to-end ensemble neural network or multi-branch neural network device comprising:
  • each sub-device being configured to provide an independent prediction based upon the exemplar data
  • - a layer configured to output at least one value based on, or representative of, the distribution of said independent predictions
  • said layer comprising a sampling device configured to output at least one random value as a function of a probability distribution representative of the distribution of independent predictions, said output random values being computed in a differentiable way and used for backpropagation within the end-to-end ensemble neural network or multi-branch neural network device,
  • the trained end-to-end ensemble neural network or multi-branch neural network model configured to predict physico-chemical and/or odor properties for input digitized representations of chemical structures or compositions.
  • FIG. 1 shows, schematically, a first particular succession of steps of the method object of the present invention
  • FIG. 2 shows, schematically, a particular embodiment of the system object of the present invention
  • FIG. 3 shows, schematically, a general overview of machine learning systems
  • FIG. 4 shows, schematically, a second particular succession of steps of the method object of the present invention
  • FIG. 5 shows, schematically, a detailed view of a particular embodiment of new neural network layers used during the training of an end-to-end ensemble neural network or multi-branch neural network device
  • FIG. 6 shows, schematically, a second particular succession of steps of the method object of the present invention
  • FIG. 7 shows, schematically, a particular succession of steps to obtain a hash vector used by the system or method object of the present invention
  • FIG. 8 shows, schematically, an example of implementation of the training method object of the present invention
  • FIG. 12 to 14 show, schematically, training architectures that may be used to select, classify or predict odor properties and/or physico-chemical properties of chemical structures and
  • FIG. 15 shows, schematically, a particular training architecture that may be used to classify chemical structures.
  • inventive concepts may be embodied as one or more methods, of which an example can be provided.
  • the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
  • a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
  • the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
  • This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
  • “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
  • the term “ingredient” designates any ingredient, preferably presenting a flavoring or fragrance capacity.
  • the terms “compound” or “ingredient” designate the same items as “volatile ingredient.”
  • An ingredient may be formed of one or more chemical molecules.
  • composition designates a liquid, solid or gaseous assembly of at least two fragrance or flavor ingredients or one fragrance or flavor ingredient and a neutral solvent for dilution.
  • a "flavor” refers to the olfactory perception resulting from the sum of odorant receptor(s) activation, enhancement, and inhibition (when present) by at least one volatile ingredient via orthonasal and retronasal olfaction as well as activation of the taste buds which contain taste receptor cells.
  • a "flavor" results from the olfactory and taste bud perception arising from the sum of a first volatile ingredient that activates an odorant receptor or taste bud associated with a coconut tonality, a second volatile ingredient that activates an odorant receptor or taste bud associated with a celery tonality, and a third volatile ingredient that inhibits an odorant receptor or taste bud associated with a hay tonality.
  • a "fragrance” refers to the olfactory perception resulting from the aggregation of odorant receptor(s) activation, enhancement, and inhibition (when present) by at least one volatile ingredient. Accordingly, by way of illustration and by no means intending to limit the scope of the present disclosure, a "fragrance” results from the olfactory perception arising from the aggregation of a first volatile ingredient that activates an odorant receptor associated with a coconut tonality, a second volatile ingredient that activates an odorant receptor associated with a celery tonality, and a third volatile ingredient that inhibits an odorant receptor associated with a hay tonality.
  • an “odor property” or “olfactive property” refers to any psychophysical property of an ingredient or composition. Namely, such properties refer to how the human body reacts to the physical presence of an olfactory ingredient or composition, considering that such psychophysical properties are directly linked to the ability of the ingredient or composition to easily penetrate and be in close contact with the olfactory receptors present in the human body.
  • the term “means of inputting” designates, for example, a keyboard, mouse and/or touchscreen adapted to interact with a computing system in such a way as to collect user input.
  • the means of inputting are logical in nature, such as a network port of a computing system configured to receive an input command transmitted electronically.
  • Such an input means may be associated to a GUI (Graphic User Interface) shown to a user or an API (Application programming interface).
  • the means of inputting may be a sensor configured to measure a specified physical parameter relevant for the intended use case.
  • computing system or “computer system” designate any electronic calculation device, whether unitary or distributed, capable of receiving numerical inputs and providing numerical outputs by and to any sort of interface, digital and/or analog.
  • a computing system designates either a computer executing a software having access to data storage or a client-server architecture wherein the data and/or calculation is performed at the server side while the client side acts as an interface.
  • the term “materialized” is intended as existing outside of the digital environment of the present invention. “Materialized” may mean, for example, readily found in nature or synthesized in a laboratory or chemical plant. In any event, a materialized composition presents a tangible reality.
  • the terms “to be compounded” or “compounding” refer to the act of materialization of a composition, whether via extraction and assembly of ingredients or via synthesis and assembly of ingredients.
  • atomic properties refer to the properties of atoms and/or bonds attached to any atoms regardless of their molecular use context. As such, atomic properties refer to an absolute description of features of atoms, as opposed to the relative description of atoms within a molecule in the broader context of the molecule such atoms are a part of.
  • activation function defines, in a neural network, how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. These activation functions may be defined by layers in the network or by arithmetic solutions in the loss functions.
  • an “end-to-end ensemble neural network or multi-branch neural network device” refers to a group of independent neural network devices collaborating to provide outputs, as well as to a single neural network device comprising independent branches collaborating to provide outputs.
  • Figure 3 shows a general view of machine learning key components.
  • Figure 5 shows, schematically, a particular embodiment of two layers of an end-to-end ensemble neural network or multi-branch neural network training device object of the present invention. Figure 5 also helps in understanding the technical contribution of the present invention. The underlying theory of the model shown in figure 5 is presented below.
  • the layer outputs, for example, the mean of D(g(o_1, …, o_K)) instead of random samples.
  • sampling can be done in a differentiable way, so it is compatible with neural network training based on gradient descent. Therefore, contrary to traditional ensembling methods such as Bagging or Stacking that separate the training of each model in the ensemble, the present layer ensures that gradients are provided to all base models for all training samples, which results in a form of end-to-end training.
  • Here, D denotes a Gaussian distribution parameterized by a diagonal covariance matrix, and g is the aggregation function applied to the base-model outputs o_1, …, o_K.
  • This way of specifying the distribution D(g(o_1, …, o_K)) is reminiscent of a VAE, with the difference lying in the fact that this layer computes μ and σ from several underlying base models.
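  • A minimal sketch of such a layer, assuming PyTorch and the diagonal Gaussian case described above, reduces the K base-model outputs to (μ, σ) and draws a sample via the reparameterization trick so that gradients flow back to every base model; the class name, tensor shapes and statistics used here are illustrative, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class DifferentiableEnsembleSampling(nn.Module):
    """Aggregates K base-model outputs and samples differentiably."""
    def forward(self, outputs):           # outputs: (K, batch, dim)
        mu = outputs.mean(dim=0)          # parameters of D(g(o_1, ..., o_K))
        sigma = outputs.std(dim=0)        # diagonal covariance case
        if self.training:
            eps = torch.randn_like(mu)    # noise, detached from the graph
            return mu + sigma * eps       # reparameterized, differentiable sample
        return mu                         # at inference, output the mean
```

Because the sample is written as mu + sigma * eps, gradients of the loss with respect to mu and sigma (and hence to every base model) are well defined, which is what makes the end-to-end training described above possible.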
  • the performance of this architecture can be evaluated on the CIFAR-10 image classification task.
  • Each competing model is trained using 5 random seeds and 120 epochs.
  • the test loss is computed on the whole test set with the usual split for CIFAR-10, with train and test sets consisting of respectively 50 000 and 10 000 images.
  • the training method object of the present invention can be compared against different ensembling methods, in order to evaluate sampling as a new technique to train ensembles end-to-end.
  • Different variants described above are evaluated, including the parameterized isotropic variant with a multilayer perceptron used for the function g(·). It is observed that when Diagonal Sampling is used, training can be unstable at the beginning if a uniform-based weight initialization is used.
  • NCL denotes Negative Correlation Learning.
  • the present results can be compared with that obtained by a standalone CNN of similar capacity as the ensemble, both without (“Simple”) and with Dropout (“Simple + Dropout”).
  • This CNN has a similar structure to the CNNs used for base models, but it has 506 290 parameters, which can be obtained by increasing the depth and the number of channels.
  • the validation accuracies of the different models can be used as a measure of performance.
  • the coefficient of variation provides a measure of the diversity among the ensemble members throughout training. It is computed as the average of the elementwise standard deviation rescaled by the mean of o_1, …, o_K.
  • the average test accuracy of the base models can be used as a metric of performance. This measures the distillation during training, i.e., how well each independent base model performs on the test set.
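  • A hedged restatement of the diversity metric above in code (function and argument names are illustrative):

```python
import numpy as np

def coefficient_of_variation(outputs):
    """Elementwise std of o_1..o_K rescaled by their mean; outputs: (K, batch, dim)."""
    std = outputs.std(axis=0)
    mean = np.abs(outputs.mean(axis=0)) + 1e-8   # guard against division by zero
    return (std / mean).mean()
```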
  • Full Covariance Sampling ends up having a better validation accuracy than the simpler Diagonal alternative.
  • the coefficient of variation is bigger in the case of the Full Covariance Sampling compared to the Diagonal, which seems to indicate that Full Covariance benefits from a better diversity.
  • Full Covariance Sampling has a better aggregated test accuracy from approximately the 30th epoch of training.
  • Full Covariance only gets better than Diagonal in terms of test accuracy after around the 60th epoch. In other words, even when it has worse test accuracy, Full Covariance has better averaged individual test accuracy than Diagonal. Therefore, Full Covariance offers better distillation properties.
  • the present training device first works as a regularization mechanism, which allows end-to-end training of ensembles that would otherwise be prone to overfitting. Indeed, even if Mean already provides a net increase over Single, Mean underperforms Bagging in terms of test accuracy and distillation. Nonetheless, as soon as sampling is used, better test accuracy and distillation are obtained in favor of the sampling methods. Additionally, injecting noise in this way means that the sampling procedure adapts to the magnitude of o_1, …, o_K during training. In comparison, relying on a noise schedule would add complexity and would require careful tuning.
  • NCL and Bagging perform similarly in terms of test accuracy, but NCL has a slightly better distillation and more diversity which illustrates the benefits of end-to-end ensembling. Nevertheless, on top of these advantages, the present method only requires choosing the right distribution as opposed to tuning a new hyper-parameter (as in the case of NCL).
  • the present approach is thus particularly useful for combining multiple branches of a neural network, which can be seen as a way to perform end-to-end training of an ensemble of neural networks. It consists of a new neural network layer, which takes as inputs several individual predictions coming from distinct base models (or branches) and uses differentiable sampling to produce a single output while offering regularization and distributing the gradient to all base models. This approach has multiple benefits.
  • Figure 1 shows a particular succession of steps of the method 100 object of the present invention.
  • This method 100 to predict at least one physico-chemical and/or odor property value for a chemical structure or composition comprises the steps of:
  • an end-to-end trained ensemble neural network or multi-branch neural network model to predict at least one physico-chemical and/or odor property value for a chemical structure or composition
  • the method 100 object of the present invention further comprising the steps of:
  • each sub-device being configured to provide an independent prediction based upon the exemplar data
  • - a layer configured to output at least one value based on, or representative of, the distribution of said independent predictions
  • said layer comprising a sampling device configured to output at least one random value as a function of a probability distribution representative of the distribution of independent predictions, said output random values being computed in a differentiable way and used for backpropagation within the end-to-end ensemble neural network or multi-branch neural network device,
  • the trained end-to-end ensemble neural network or multi-branch neural network model configured to predict physico-chemical and/or odor properties for input digitized representations of chemical structures or compositions.
  • the layer configured to output at least one value based on, or representative of, the distribution of said independent predictions can either be understood as a layer providing a value representative of a distribution to be used by the sampling device or as a layer providing a value obtained from the sampling device.
  • By a “differentiable way” it is meant a way to draw the samples from the distribution that makes it possible to compute the gradients of the layer output(s) with respect to the distribution's parameters. It also implies that these parameters are computed using differentiable functions of the outputs of the neural network sub-devices. This makes it possible to obtain a "proper" neural network layer for which one can compute the gradient of the output(s) with respect to its input(s), so that it can be embedded in any larger neural network trained using backpropagation.
  • the step of defining 105 is performed, for example, by using an input device 240 coupled to I/O subsystem 220 such as disclosed in regard to figure 2.
  • a chemical structure or a composition is defined.
  • a chemical structure is defined as molecular geometry and, optionally, the electronic structure of a target molecule.
  • Molecular geometry refers to the spatial arrangement of atoms in a molecule and the chemical bonds that hold the atoms together; it can be represented using structural formulae and molecular models. Complete electronic structure descriptions include specifying the occupation of a molecule's molecular orbitals. Structure determination can be applied to a range of targets, from very simple molecules (e.g., diatomic oxygen or nitrogen) to very complex ones (e.g., proteins or DNA).
  • a composition is defined as a sum of molecules or compounds, typically called flavor or fragrance ingredients.
  • a user may connect to a GUI and select existing chemical structures or design chemical structures by specifying the composing atoms and associated geometry.
  • a user may alternatively connect to a GUI and select existing fragrance or flavor ingredients, each ingredient being associated with at least one chemical structure.
  • Such selection or definition of chemical structures or compositions is performed with digital representations of the material equivalent of said chemical structures or compositions. Said representations may be shown as text and related to entries in computer databases storing, for each representation, a number of parameters.
  • the step of executing 110 is performed, for example, by one or more hardware processors 210, such as shown in figure 2, configured to execute a set of instructions representative of the trained end-to-end ensemble neural network or multi-branch neural network model. Particular embodiments for implementation of the step of executing 110 are disclosed above, in relation to figure 5 notably.
  • the input of the step of executing 110 is dependent on the parameters upon which the end-to-end ensemble neural network or multi-branch neural network device is operated to obtain an end-to-end ensemble neural network or multi-branch neural network model.
  • such parameters may correspond to:
  • the end-to-end ensemble neural network or multi-branch neural network model is configured to provide an output for a standardized input format.
  • This standardized input format may correspond to digital representations of said atoms, atomic properties, molecules, ingredients, compositions and/or chemical structures.
  • Such digital representations may correspond to character strings. Such strings may be concatenated to form unitary inputs representative of larger scale material items, such as several atoms forming a molecule, for example.
  • the step of providing 115 is performed, for example, by using an output device 235 coupled to I/O subsystem 220 such as disclosed in regard to figure 2.
  • this step of providing 115 shows, upon a GUI, the result of the prediction of the model based upon the defined chemical structure or composition fed to the model.
  • the step 120 of providing may be performed, via a computer interface, such as an API or any other digital input means. This step 120 of providing may be initiated manually or automatically.
  • the set of exemplar data may be assembled manually, upon a computer interface, or automatically, by a computing system, from a larger set of exemplar data.
  • the exemplar data may comprise, for example:
  • Such an odor property may be, for example: a tonality of the chemical structure; an odor detection threshold value for the chemical structure; an odor strength for the chemical structure (such as a classification of olfactive power into four intensity classes of an ingredient or composition: odorless, weak, medium and strong); and/or a top-heart-base value for the chemical structure (a classification into three ranges of long-lastingness during evaporation of the ingredient or composition, in which "top" represents ingredients or compositions that can be smelled or determined by gas chromatography analysis up to 15 min of evaporation, "heart" between 15 min and 2 hours, and "base" more than 2 hours).
  • This list is not limitative, and any odor property known to the fields of fragrance and flavor design and associated industry may be associated with the hash vector.
  • An odor property may correspond to:
  • a physico-chemical property may correspond to:
  • the step 125 of operating may be performed, for example, by a computer program executed upon a computing system.
  • the end-to-end ensemble neural network or multi-branch neural network device is configured to train based upon the input data.
  • each neural network sub-device of the end-to-end ensemble neural network or multibranch neural network device configures coefficients of the layers of artificial neurons to provide an output, these outputs forming a distribution of outputs.
  • the values of statistical parameters representative of the distribution may be obtained and used in activation functions to be minimized.
  • Each neural network sub-device within the ensemble may be of the same type or different types.
  • At least one neural network sub-device is:
  • At least two of the activation functions are representative of:
  • At least one output value representative of the distribution is representative of a dispersion of the distribution.
  • Such a value may correspond to, for example, the standard deviation of the outputs of the neural network sub-devices.
  • the end-to-end ensemble neural network or multi-branch neural network device is trained to minimize at least one value representative of the dispersion of the distribution.
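  • As a non-authoritative sketch of what such a training objective could look like (not the patent's exact loss), a dispersion term, here the standard deviation across the K sub-device outputs weighted by a hypothetical coefficient, can be added to the task loss:

```python
import torch

def total_loss(task_loss, outputs, lambda_disp=0.1):
    """task_loss: scalar tensor; outputs: (K, batch, dim) sub-device outputs."""
    dispersion = outputs.std(dim=0).mean()   # std across the K sub-devices
    return task_loss + lambda_disp * dispersion
```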
  • the step 130 of obtaining may be performed, via a computer interface, such as an API or any other digital output system.
  • the obtained trained model may be stored in a data storage, such as a hard-drive or database for example.
  • the neural network device obtained during the step 130 of obtaining is configured to provide, additionally, at least one value representative of the statistical dispersion of the output.
  • At least one set of inputs of the exemplar data corresponds to hash vectors of at least one atomic property in a chemical structure or composition
  • the method further comprising, upstream of the step 110 of executing, a step 135 of converting the defined digitized chemical structure or composition into a set of hash vectors of at least one atomic property representative of the digitized chemical structure or composition, said set of hash vectors being used as input during the step of executing.
  • a hash corresponds to the result of a hash function, which corresponds to any function that can be used to map data of arbitrary size to fixed-size values.
  • Many such functions are known by persons skilled in the art, such as SHA-3, Skein or Snefru.
  • Such hash values can be organized into vectors that may be used by the end-to-end ensemble neural network or multi-branch neural network device.
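  • For illustration only, a minimal example (assuming Python's standard hashlib, which provides SHA-3) maps an arbitrary-size property tuple to a fixed-size hash key:

```python
import hashlib

# Hypothetical property tuple, e.g. [atomic number, degree, number of hydrogens]
properties = (6, 4, 0)
key = hashlib.sha3_256(repr(properties).encode()).hexdigest()
print(key)   # 64-character, fixed-size fingerprint of the property tuple
```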
  • a method comprising the following steps may be implemented: a step of receiving, by a computing system, a digitized representation of a chemical structure comprising at least one atom digital identifier and at least one atomic property digital identifier for said at least one atom digital identifier; a step of determining, by a computing system, at least one value corresponding to at least one atomic or bond property of at least one atom or bond of the digitized representation of a chemical structure; a step of hashing, by a computing system, at least one determined value to form a unique character string fingerprinting at least one atom digital identifier and at least one associated atomic property; and a step of providing, by a computing system, at least one hash to an ensemble of neural network devices to be trained.
  • the step of receiving is performed, for example, by any input device 240 fitting the particular use case.
  • at least one digitized representation of a chemical, atom or bond structure is input into a computer interface.
  • Such an input may be entirely logical, such as by using an API (Application Programming Interface) or by interfacing said computing system to another computing system via a computer network.
  • Such an input may also rely on a human-machine interface, such as a keyboard, mouse or touchscreen for example.
  • the mechanism used for the step of receiving is unimportant with regard to the scope of the present invention.
  • the digitized representation of a chemical structure comprises essentially two types of data: atom identifiers, corresponding to atoms that are part of the molecular structure, typically represented by at least one letter (such as “C” for carbon or “H” for hydrogen), and at least one relationship for at least one atom identifier, defining if and how at least one atom is linked to other atoms of the molecule.
  • This digitized representation can take many forms, depending on the system.
  • the SMILES for “Simplified Molecular Input Line Entry System” format is a line notation of a molecular structure providing said two types of data.
  • Another example is a molecular graph representation of the molecule.
  • Another representation is the SDF (for “Structure Data File”) format defining the atoms with properties and the bond tables.
  • Another representation is a full molecular matrix composed of the atomic numbers and the adjacency matrix defining the bonds.
  • the main digitized representation used in chemical reaction modeling and feature prediction is the SMILES format.
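  • As a sketch of the representations named above, assuming the RDKit toolkit is available, a SMILES string can be parsed into a molecular graph from which the atomic numbers and the adjacency matrix are read:

```python
from rdkit import Chem
from rdkit.Chem import rdmolops

mol = Chem.MolFromSmiles("CCO")                    # ethanol as an example
atomic_numbers = [a.GetAtomicNum() for a in mol.GetAtoms()]   # [6, 6, 8]
adjacency = rdmolops.GetAdjacencyMatrix(mol)       # bonds as a 0/1 matrix
```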
  • the step of determining is performed, for example, by one or more hardware processors 210, such as shown in figure 2, configured to execute a set of instructions representative of a computer software.
  • At least one atomic property can be read from the digitized representation of the chemical structure, determined via the execution of a dedicated algorithm, or obtained from a dedicated third-party software.
  • the step of hashing is performed, for example, by one or more hardware processors 210, such as shown in figure 2, configured to execute a set of instructions representative of a computer software.
  • the output of the step of hashing is a given number of hashes, each hash being representative of one atom identifier as well as at least one associated atomic property of said identified atom.
  • a chemical structure comprising several atoms is thus represented by a sentence of several hashes.
  • Each hash acts as a unique fingerprint which is particularly useful for neural network applications. This means that, within a dataset, each atom can be represented by the corresponding hash key (the unique fingerprint).
  • a hash can be composed of the repeated values for the properties to define the property value in reagents, intermediates, transition states and products.
  • At least one of the atom properties hashed is representative of one of the following: atomic number of the corresponding atom, atomic symbol of the corresponding atom, mass of the atom, explicit map number, row index in the periodic system, column index in the periodic system, total number of hydrogens on the atom, implicit number of hydrogens on the atom, explicit number of hydrogens on the atom, degree of the atom, total degree of the atom, the valence state of the atom, the implicit valence of the atom, the explicit valence of the atom, formal charge on the atom, partial charge on the atom, electronegativity on the atom, number of bonds by bond type, number of neighbors by atomic number, wild card, number of neighbors by bond type plus atomic number, wild card, number of neighbors by wild card, value to indicate aromaticity, value to indicate aliphatic atom, value to indicate a conjugated atom, value to indicate cyclic atom, value to indicate a macrocyclic atom, value to indicate a geometric
  • An alternative approach to hashing comprises a step of assigning characters to each atomic property and a step of concatenation of said characters into a “word”. Such characters may correspond to characters within a SMILES string in which all characters not identified as chemical atomic characters are removed.
  • At least one of the bond properties hashed is representative of one of the following: bond order, bond type, stereochemistry of the bond: bond direction for tetrahedral stereochemistry, bond direction for double bond stereochemistry or bond direction for spatial orientation, atomic number(s) for the “from” and/or “to” atoms, atomic symbols for the “from” and/or “to” atoms, dipole moment in the bond, quantum-chemical properties: electron density in the bond, electron configuration of the bond, bond orbitals, bond energies, attractive forces, repulsive forces, bond distance, aromatic bond, aliphatic bond, ring properties of the bond: number of rings on the bond, ring size(s) of the bond, smallest ring size of the bond, largest ring size of the bond, rotatable bond, spatially constrained bond, hydrogen bonding properties, ionic bonding properties, bond order for reactions, including the “null” bond to identify a broken/formed bond in a reaction: bond order in reagents, bond order
  • the step of obtaining is performed, for example, by using any output device 235 associated with an I/O subsystem 220, such as shown in figure 2.
  • the method to obtain a hash vector may further comprise: a step of constructing, by a computing system, a chemical structure string fingerprint by association, in a single string, of at least two hashes corresponding to at least two atomic properties; and at least one step of augmentation, by a computing system, of at least one chemical structure string fingerprint, said augmented chemical structure string fingerprint being used during the step of providing.
  • the step of constructing is performed, for example, by one or more hardware processors 210, such as shown in figure 2, configured to execute a set of instructions representative of a computer software.
  • During the step of constructing, at least two hashes corresponding to at least two atom identifiers and associated features are associated, typically by concatenation of the respective hashes.
  • the order of concatenation may follow concatenation rules that prevent neural network misinterpretation.
  • Figure 7 represents, schematically, a particular embodiment of this method to obtain a hash vector representative of a chemical structure.
  • figure 7 represents: a step 505 of receiving an input chemical structure; a step 510 of determining atom and/or bond properties in the received chemical structure, thus annotating a vector of input properties (the properties used here are [atomic number, degree, number of hydrogens]); a step 515 of hashing the property vector to a single hashed key (alternately, a hash key identifying the atom type can be formed by simply concatenating the values);
  • a step 520 of constructing a vector of the hashed keys for the atoms as specified by the atom order and a step 525 of data augmentation on the vector can be applied by changing the atom order; the order may include the vector for the canonical order of atoms.
  • the method 100 object of the present invention comprises, upstream of the step 120 of providing input data to an end-to-end ensemble neural network or multi-branch neural network device, a step 140 of atom or bond relationship vector augmentation.
  • At least one step of augmentation 140 is performed, for example, by computer software run on a computing device, such as a microprocessor for example.
  • the order of the constitutive hashes for a given molecular structure is shifted by one or more positions in the ordering of said constitutive hashes. That is to say, for example, that the last hash becomes the penultimate, the penultimate becomes the antepenultimate and the first becomes the last, or the other way around depending on the intended order of augmentation.
  • Such augmentations allow for the increase in sample size from the same chemical structure, which greatly improves the quality of the output of a neural network device.
  • the step 140 of atom or bond relationship vector augmentation comprises a step 145 of horizontal augmentation, configured to provide several vectors representing a single digitized representation of a molecular structure or chemical reaction, each vector representing a particular variant of the canonical representation of the molecular structure or chemical reaction, each vector being treated as a single input during the step of providing.
  • the step 140 of atom or bond relationship vector augmentation comprises a step 150 of vertical augmentation, configured to create several groups of several horizontal augmentations, representing a unique molecular structure or chemical reaction, each group being treated as a single input during the step of providing.
  • Such a step 150 of vertical augmentation may be performed, for example, by a computer software executed by a computing system.
  • This step 150 of vertical augmentation may be performed by grouping horizontal augmentations in single inputs, typically by concatenation of the hash keys representative of the atom and/or bond properties of a chemical structure.
  • Such single inputs may be identical or different, by changing the order of concatenation for example.
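  • A hedged sketch of these two augmentation modes, implemented here as rotations of the per-atom hash sequence (all names are hypothetical, and other orderings are possible):

```python
def horizontal_augment(hashes):
    """One vector per rotation of the hash order; each is a separate input."""
    return [hashes[i:] + hashes[:i] for i in range(len(hashes))]

def vertical_augment(hashes, group_size=3):
    """Groups of several horizontal augmentations; each group is one input."""
    rotations = horizontal_augment(hashes)
    return [rotations[i:i + group_size]
            for i in range(0, len(rotations), group_size)]

mol_hashes = ["h_C1", "h_C2", "h_O"]     # hypothetical per-atom hash keys
inputs = horizontal_augment(mol_hashes)  # 3 augmented inputs for 3 hashes
```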
  • Figure 6 shows another representation of a particular embodiment of the method 600 object of the present invention.
  • Figure 6 in particular, shows the training of the model, comprising the steps of providing 120, operating 125 and obtaining 130.
  • digitized representations 605 of chemical structures and known odor property values or physico-chemical property values 610 are used as input.
  • an end-to-end ensemble neural network or multi-branch neural network device 615 is trained to output two values, 620 and 625, representative of the distribution of the individual outputs of neural network sub-devices constitutive of the end-to-end ensemble neural network or multi-branch neural network device 615, such as the mean and the standard deviation.
  • Figure 2 represents, schematically, a particular embodiment of the system 200 object of the present invention.
  • This system 200 to predict at least one physico-chemical and/or odor property value for a chemical structure or composition comprises the means 205 of:
  • an end-to-end trained ensemble neural network or multi-branch neural network model to predict at least one physico-chemical and/or odor property value for a chemical structure or composition
  • an end-to-end ensemble neural network or multi-branch neural network device comprising:
  • each sub-device being configured to provide an independent prediction based upon the exemplar data
  • - a layer configured to output at least one value based on, or representative of, the distribution of said independent predictions
  • said layer comprising a sampling device configured to output at least one random value as a function of a probability distribution representative of the distribution of independent predictions, said output random values being computed in a differentiable way and used for backpropagation within the end-to-end ensemble neural network or multi-branch neural network device,
  • the trained ensemble neural network or multi-branch neural network device configured to predict physico-chemical and/or odor properties for input digitized representations of chemical structures or compositions.
  • Figure 2 represents a block diagram that illustrates an example computer system with which an embodiment may be implemented.
  • a computer system 205 and instructions for implementing the disclosed technologies in hardware, software, or a combination of hardware and software are represented schematically, for example as boxes and circles, at the same level of detail that is commonly used by persons of ordinary skill in the art to which this disclosure pertains for communicating about computer architecture and computer systems implementations.
  • the computer system 205 includes an input/output (I/O) subsystem 220 which may include a bus and/or other communication mechanism(s) for communicating information and/or instructions between the components of the computer system 205 over electronic signal paths.
  • the I/O subsystem 220 may include an I/O controller, a memory controller and at least one I/O port.
  • the electronic signal paths are represented schematically in the drawings, for example as lines, unidirectional arrows, or bidirectional arrows.
  • At least one hardware processor 210 is coupled to the I/O subsystem 220 for processing information and instructions.
  • Hardware processor 210 may include, for example, a general-purpose microprocessor or microcontroller and/or a special-purpose microprocessor such as an embedded system or a graphics processing unit (GPU) or a digital signal processor or ARM processor.
  • Processor 210 may comprise an integrated arithmetic logic unit (ALU) or may be coupled to a separate ALU.
  • Computer system 205 includes one or more units of memory 225, such as a main memory, which is coupled to I/O subsystem 220 for electronically digitally storing data and instructions to be executed by processor 210.
  • Memory 225 may include volatile memory such as various forms of random-access memory (RAM) or other dynamic storage device.
  • Memory 225 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 210.
  • Such instructions when stored in non-transitory computer-readable storage media accessible to processor 210, can render computer system 205 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • Computer system 205 further includes non-volatile memory such as read only memory (ROM) 230 or other static storage device coupled to the I/O subsystem 220 for storing information and instructions for processor 210.
  • the ROM 230 may include various forms of programmable ROM (PROM) such as erasable PROM (EPROM) or electrically erasable PROM (EEPROM).
  • a unit of persistent storage 215 may include various forms of non-volatile RAM (NVRAM), such as FLASH memory, or solid-state storage, magnetic disk, or optical disk such as CD-ROM or DVD-ROM and may be coupled to I/O subsystem 220 for storing information and instructions.
  • Storage 215 is an example of a non-transitory computer-readable medium that may be used to store instructions and data which when executed by the processor 210 cause performing computer-implemented methods to execute the techniques herein.
  • the instructions in memory 225, ROM 230 or storage 215 may comprise one or more sets of instructions that are organized as modules, methods, objects, functions, routines, or calls.
  • the instructions may be organized as one or more computer programs, operating system services, or application programs including mobile apps.
  • the instructions may comprise an operating system and/or system software; one or more libraries to support multimedia, programming or other functions; data protocol instructions or stacks to implement TCP/IP, HTTP or other communication protocols; file format processing instructions to parse or render files coded using HTML, XML, JPEG, MPEG or PNG; user interface instructions to render or interpret commands for a graphical user interface (GUI), command-line interface or text user interface; application software such as an office suite, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games or miscellaneous applications.
  • the instructions may implement a web server, web application server or web client.
  • the instructions may be organized as a presentation layer, application layer and data storage layer such as a relational database system using structured query language (SQL) or no SQL, an object store, a graph database, a flat file system or other data storage.
  • Computer system 205 may be coupled via I/O subsystem 220 to at least one output device 235.
  • output device 235 is a digital computer display. Examples of a display that may be used in various embodiments include a touch screen display or a light-emitting diode (LED) display or a liquid crystal display (LCD) or an e-paper display.
  • Computer system 205 may include other type(s) of output devices 235, alternatively or in addition to a display device. Examples of other output devices 235 include printers, ticket printers, plotters, projectors, sound cards or video cards, speakers, buzzers or piezoelectric devices or other audible devices, lamps or LED or LCD indicators, haptic devices, actuators, or servos.
  • At least one input device 240 is coupled to I/O subsystem 220 for communicating signals, data, command selections or gestures to processor 210.
  • Examples of input devices 240 include touch screens, microphones, still and video digital cameras, alphanumeric and other keys, keypads, keyboards, graphics tablets, image scanners, joysticks, clocks, switches, buttons, dials, slides.
  • Control device 245 may perform cursor control or other automated control functions such as navigation in a graphical interface on a display screen, alternatively or in addition to input functions.
  • Control device 245 may be a touchpad, a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 210 and for controlling cursor movement on display 235.
  • the input device may have at least two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.
  • An input device is a wired, wireless, or optical control device such as a joystick, wand, console, steering wheel, pedal, gearshift mechanism or other type of control device.
  • An input device 240 may include a combination of multiple different input devices, such as a video camera and a depth sensor.
  • computer system 205 may comprise an Internet of Things (IoT) device in which one or more of the output device 235, input device 240, and control device 245 are omitted.
  • the input device 240 may comprise one or more cameras, motion detectors, thermometers, microphones, seismic detectors, other sensors or detectors, measurement devices or encoders and the output device 235 may comprise a special-purpose display such as a single-line LED or LCD display, one or more indicators, a display panel, a meter, a valve, a solenoid, an actuator or a servo.
  • Computer system 205 may implement the techniques described herein using customized hardwired logic, at least one ASIC or FPGA, firmware and/or program instructions or logic which when loaded and used or executed in combination with the computer system causes or programs the computer system to operate as a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 205 in response to processor 210 executing at least one sequence of at least one instruction contained in main memory 225. Such instructions may be read into main memory 225 from another storage medium, such as storage 215. Execution of the sequences of instructions contained in main memory 225 causes processor 210 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage 215.
  • Volatile media includes dynamic memory, such as memory 225.
  • Common forms of storage media include, for example, a hard disk, solid state drive, flash drive, magnetic data storage medium, any optical or physical data storage medium, memory chip, or the like.
  • Storage media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between storage media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise a bus of I/O subsystem 220.
  • Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • Various forms of media may be involved in carrying at least one sequence of at least one instruction to processor 210 for execution.
  • the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a communication link such as a fiber optic or coaxial cable or telephone line using a modem.
  • a modem or router local to computer system 205 can receive the data on the communication link and convert the data to a format that can be read by computer system 205.
  • a receiver such as a radio frequency antenna or an infrared detector can receive the data carried in a wireless or optical signal and appropriate circuitry can provide the data to I/O subsystem 220 such as place the data on a bus.
  • I/O subsystem 220 carries the data to memory 225, from which processor 210 retrieves and executes the instructions.
  • the instructions received by memory 225 may optionally be stored on storage 215 either before or after execution by processor 210.
  • Computer system 205 also includes a communication interface 260 coupled to bus 220.
  • Communication interface 260 provides a two-way data communication coupling to network link(s) 265 that are directly or indirectly connected to at least one communication network, such as a network 270 or a public or private cloud on the Internet.
  • communication interface 260 may be an Ethernet networking interface, integrated-services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of communications line, for example an Ethernet cable or a metal cable of any kind or a fiber-optic line or a telephone line.
  • Network 270 broadly represents a local area network (LAN), wide-area network (WAN), campus network, internetwork, or any combination thereof.
  • Communication interface 260 may comprise a LAN card to provide a data communication connection to a compatible LAN, or a cellular radiotelephone interface that is wired to send or receive cellular data according to cellular radiotelephone wireless networking standards, or a satellite radio interface that is wired to send or receive digital data according to satellite wireless networking standards.
  • communication interface 260 sends and receives electrical, electromagnetic, or optical signals over signal paths that carry digital data streams representing various types of information.
  • Network link 265 typically provides electrical, electromagnetic, or optical data communication directly or through at least one network to other data devices, using, for example, satellite, cellular, Wi-Fi, or BLUETOOTH technology.
  • network link 265 may provide a connection through a network 270 to a host computer 250.
  • network link 265 may provide a connection through network 270 or to other computing devices via internetworking devices and/or computers that are operated by an Internet Service Provider (ISP) 275.
  • ISP 275 provides data communication services through a world-wide packet data communication network represented as internet 280.
  • a server computer 255 may be coupled to internet 280.
  • Server 255 broadly represents any computer, data center, virtual machine, or virtual computing instance with or without a hypervisor, or computer executing a containerized program system such as DOCKER or KUBERNETES.
  • Server 255 may represent an electronic digital service that is implemented using more than one computer or instance and that is accessed and used by transmitting web services requests, uniform resource locator (URL) strings with parameters in HTTP payloads, API calls, app services calls, or other service calls.
  • Computer system 205 and server 255 may form elements of a distributed computing system that includes other computers, a processing cluster, server farm or other organization of computers that cooperate to perform tasks or execute applications or services.
  • Server 255 may comprise one or more sets of instructions that are organized as modules, methods, objects, functions, routines, or calls. The instructions may be organized as one or more computer programs, operating system services, or application programs including mobile apps.
  • the instructions may comprise an operating system and/or system software; one or more libraries to support multimedia, programming or other functions; data protocol instructions or stacks to implement TCP/IP, HTTP or other communication protocols; file format processing instructions to parse or render files coded using HTML, XML, JPEG, MPEG or PNG; user interface instructions to render or interpret commands for a graphical user interface (GUI), command-line interface or text user interface; application software such as an office suite, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games or miscellaneous applications.
  • Server 255 may comprise a web application server that hosts a presentation layer, application layer and data storage layer such as a relational database system using structured query language (SQL) or NoSQL, an object store, a graph database, a flat file system or other data storage.
  • Computer system 205 can send messages and receive data and instructions, including program code, through the network(s), network link 265 and communication interface 260.
  • a server 255 might transmit a requested code for an application program through Internet 280, ISP 275, local network 270 and communication interface 260.
  • the received code may be executed by processor 210 as it is received, and/or stored in storage 215, or other non-volatile storage for later execution.
  • the execution of instructions as described in this section may implement a process in the form of an instance of a computer program that is being executed and consisting of program code and its current activity.
  • a process may be made up of multiple threads of execution that execute instructions concurrently.
  • a computer program is a passive collection of instructions, while a process may be the actual execution of those instructions.
  • Several processes may be associated with the same program; for example, opening up several instances of the same program often means more than one process is being executed. Multitasking may be implemented to allow multiple processes to share processor 210.
  • computer system 205 may be programmed to implement multitasking to allow each processor to switch between tasks that are being executed without having to wait for each task to finish.
  • switches may be performed when tasks perform input/output operations, when a task indicates that it can be switched, or on hardware interrupts.
  • Time-sharing may be implemented to allow fast response for interactive user applications by rapidly performing context switches to provide the appearance of concurrent execution of multiple processes simultaneously.
  • an operating system may prevent direct communication between independent processes, providing strictly mediated and controlled inter-process communication functionality.
  • FIG. 4 shows, schematically, a succession of steps of the method 400 object of the present invention.
  • This method 400 to efficiently assemble chemical structures or compositions comprises:
  • a step 410 of assembling a chemical structure or composition associated with an output obtained during the step 115 of obtaining.
  • This step 410 of assembling is configured to materialize the composition. Such a step 410 of assembling may be performed in a variety of ways, such as in a laboratory or a chemical plant for example.
  • FIG. 8 represents, schematically, a particular implementation example of the method 800 object of the present invention.
  • This method 800 for training an ensemble neural network or multi-branch neural network device is similar to the training performed by the end-to-end ensemble neural network or multi-branch neural network device used in the method 100 object of the present invention.
  • This method 800 comprises:
  • a step 830 of operating a multilayer perceptron (“MLP”) layer upon the output of the flattening layer and a step 835 of outputting a value for the target odor property and/or physico-chemical property.
  • the number N represents the number of points (e.g., input batch size).
  • a chemical structure is displayed as an augmented 2D-chip, which is subsequently converted using an embedding layer and recursive neural network layer.
  • the attention layer runs a feature selection.
  • the MLP part of the network is a fully connected neural network with activation.
  • FIGS. 9 to 11 show the performance of the architecture shown in figure 8 relative to three distinct targets:
  • - figure 9 shows the performance for odor detection threshold (“ODT”)
  • - figure 10 shows the performance for volatility
  • the present invention also aims at a computer implemented ensemble neural network or multi-branch neural network device, in which the ensemble neural network or multi-branch neural network device is obtained by any variation of the computer-implemented method 300 object of the present invention.
  • the present invention also aims at a computer program product, comprising instructions to execute the steps of a method 300 object of the present invention when executed upon a computer.
  • the present invention also aims at a computer-readable medium, storing instructions to execute the steps of a method 300 object of the present invention when executed upon a computer.
  • Figure 12 shows, schematically, a training architecture 1200 to select, from a set of chemical structures, chemical structures that provide a particular feature, such as an insect repellent capacity value above a determined threshold.
  • Such an architecture 1200 comprises:
  • an ensemble neural network or multi-branch neural network 1210 comprising a set of recursive neural networks to generate an embedding 1215
  • a multivariate statistics algorithm 1220 can be used, complemented with a numerical domain eccentricity evaluation algorithm 1225,
  • a time distribution 1230 of the embedding is performed, to obtain an alternative input 1235, complemented with the use of an ensemble 1240 neural network using Tanimoto neural networks, such as disclosed in the present invention.
  • Figure 13 shows, schematically, a training architecture 1300 to classify, from a set of chemical structures, chemical structures that provide a particular feature, such as a biodegradability value.
  • Such an architecture 1300 comprises:
  • an ensemble neural network or multi-branch neural network 1310 comprising a set of recursive neural networks to generate an embedding 1315
  • a multivariate statistics algorithm 1320 can be used, complemented with a numerical domain eccentricity evaluation algorithm 1325,
  • FIG. 14 shows, schematically, a training architecture 1400 to predict, for a set of chemical structures, values for a particular feature, such as an odor detection threshold value.
  • Such an architecture 1400 comprises:
  • an ensemble neural network or multi-branch neural network 1410 comprising a set of recursive neural networks to generate an embedding 1415
  • a multivariate statistics algorithm 1420 can be used, complemented with a numerical domain eccentricity evaluation algorithm 1425,
  • an ensemble 1430 neural network using single or multitask regression neural networks such as disclosed in the present invention.
  • the present invention may be used as a filtration technique, using any predicted physico-chemical property and/or odor property to label molecule or ingredient digital identifiers in a database, said molecules or ingredients being selected as worthwhile points of exploration by flavorists and perfumers.
  • the present invention may be used with pairs of molecules as inputs, to predict the proximity of the molecules in the pair, or to use the difference observed in the pair for regression or classification.
  • the present invention may be used as a classifier in relation to physico-chemical and/or odor property values for chemical structures or compositions.
  • Figure 15 shows a particular architecture which highlights the performance of such a classifier.
  • chi-squared testing is often used to evaluate the performance of classification models. For example, suppose one has a binary classification problem where one wants to predict whether a patient has a disease or not. One can use a chi-squared test to determine whether the model performs better than chance by comparing the predicted class distribution to the expected class distribution.
  • In ensemble learning, where multiple models are combined to improve overall performance, chi-squared testing can be used to evaluate the performance of the ensemble.
  • Ensemble learning is a popular technique in machine learning where multiple models are trained and combined to improve overall performance. By using multiple models, one can reduce the risk of overfitting and improve the robustness of the model.
  • each model makes an independent prediction on the input data, and the final prediction is made by combining the predictions of all models.
  • Chi-squared testing can be used to evaluate the performance of the ensemble by comparing the predicted class distribution of the ensemble to the expected class distribution. If the ensemble is performing better than any individual model, one can conclude that the ensemble is effective.
  • chi-squared testing is a powerful tool for evaluating the performance of machine learning models and ensembles. By using chi-squared testing, one can make informed decisions about which models to use and how to improve them.
  • Forced-choice modeling is an example of a contrastive classification task, where the goal is to identify the correct example from a set of alternatives.
  • This type of task is commonly encountered in many real-world scenarios, such as identifying the correct answer in a multiple-choice exam or recognizing a specific object from a set of similar objects.
  • results are frequently evaluated in a relative setting by comparing two or more candidates with each other; one therefore hypothesizes that contrastive neural networks, trained to select the more promising entry from a set of alternatives, may provide valuable models.
  • After making the prediction, one can use chi-squared testing to measure the statistical significance of the decision. In this case, one can compare the predicted class distribution to the expected class distribution, which is a uniform distribution over the three examples. If the chi-squared test shows that the predicted class distribution is significantly different from the expected class distribution, one can conclude that the ensemble is performing well and is able to correctly identify the correct example from the input X.
  • tokenization also applies to SMILES strings that contain explicit-implicit hydrogen atoms. For instance, let us consider the molecule toluene.
  • the explicit SMILES for toluene, which is written as “[CH3][c]1[cH][cH][cH][cH][cH]1”, can be tokenized by grouping the atoms defined by characters enclosed in square brackets, from [ to ].
  • the forced-choice classification is run using a network layout where the same embedding, GRU, attention and latent layers are applied to all input entries, followed by a learnable contrastive layer creating the differences between all pairs.
  • Figure 15 shows an architecture layout of a contrastive classifier, asked to select the molecule with the lowest molecular weight.
  • the input is a tokenized vector with integer-based tokens, followed by a Keras Embedding layer, a Keras GRU layer and an attention layer.
  • the equal sign between these layers indicates that the same layers are applied to both entries.
  • a trainable contrastive layer creates the difference between the output of attention 1 and attention 2.
  • a multi-layer perceptron with dropout is used for the classification task.
  • the model is repeated N times to create an ensemble neural network.
  • the values (None,x) and (None,x,y) indicate the output shapes of the layers.
  • Such a contrastive classifier can be trained using a dataset obtained from NIST.
  • the data can be split into a training set of 8,518 molecules, a validation set of 819 molecules and a test set of 772 molecules.
  • the training and validation datasets can be used to follow the on-training performance at every epoch.
  • the classifier can be trained to detect the molecule with the highest molecular weight. Note that any numerical target can be used for training, including linear retention index, volatility, or odor-detection threshold.
  • the validation set can contain a number of pairs with a maximum difference of 14.02 g/mol between the molecules. Every epoch can be trained using 46 iterations with a batch-size of 1000 pairs per iteration.
  • the model is trained using mean binary-crossentropy computed over all models in the ensemble.
  • the performance can be tested on a test set, composed of a number of pairs with a maximum difference of 14.02 g/mol between the molecules.
  • the results on the performance are displayed in table 1, below. Given that the model performs a relative classification task and has been asked to identify the position of the lowest molecular weight, only the accuracy is reported.
  • the p-value is computed using chi-squared testing on the votes produced by the ensemble. The result is considered conclusive if the p-value for the vote proportions drops below 0.05. From table 1, one can clearly see that the conclusive results are better than the decisions on the non-conclusive entries.
  • Table 1: accuracy for conclusive and non-conclusive outcomes.
  • the forecast is formed by a combination of multiple votes on the class, along with an indication of the level of confidence in the prediction (table 2).
  • Table 2: classification tasks.
  • The example above shows a relative task: learning to select the molecule with the higher molecular weight.
  • a regression task is converted into a contrastive classification task.
  • in an absolute classifier, an ensemble is asked to predict the class defined in the data, as performed with the MNIST dataset to detect digits in an image.

Abstract

According to the invention, the method (100) for predicting a value of physico-chemical and/or odor properties for chemical structures or compositions comprises the steps of: - defining (105) a representation of a chemical structure or composition, - executing (110), on the defined representation, an end-to-end trained ensemble neural network or multi-branch neural network model to predict a physico-chemical and/or odor property value, - providing (115) the physico-chemical and/or odor property value, the method further comprising: - providing (120) example data to an end-to-end ensemble neural network or multi-branch neural network device comprising: - several neural network sub-devices configured for independent predictions, - a layer to output at least one value of the distribution of independent predictions, said layer comprising a sampling device configured to output random values, - operating (125) the end-to-end ensemble neural network or multi-branch neural network device, and - obtaining (130) the trained ensemble neural network or the trained multi-branch neural network model.
PCT/EP2023/084821 2022-12-08 2023-12-08 Procédé et système pour prédire au moins une valeur de propriété physico-chimique et/ou d'odeur pour une structure ou une composition chimique WO2024121356A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP22212196.4 2022-12-08
EP22212124.6 2022-12-08
EP22212124 2022-12-08
EP22212196 2022-12-08

Publications (1)

Publication Number Publication Date
WO2024121356A1 true WO2024121356A1 (fr) 2024-06-13

Family

ID=89164472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/084821 WO2024121356A1 (fr) 2022-12-08 2023-12-08 Procédé et système pour prédire au moins une valeur de propriété physico-chimique et/ou d'odeur pour une structure ou une composition chimique

Country Status (1)

Country Link
WO (1) WO2024121356A1 (fr)

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DAVID LAURIANNE ET AL: "Molecular representations in AI-driven drug discovery: a review and practical guide", vol. 12, no. 1, 17 September 2020 (2020-09-17), XP055784361, Retrieved from the Internet <URL:http://link.springer.com/article/10.1186/s13321-020-00460-5/fulltext.html> DOI: 10.1186/s13321-020-00460-5 *
DAVID ROGERS ET AL: "Extended-Connectivity Fingerprints", JOURNAL OF CHEMICAL INFORMATION AND MODELING, vol. 50, no. 5, 24 May 2010 (2010-05-24), US, pages 742 - 754, XP055315445, ISSN: 1549-9596, DOI: 10.1021/ci100050t *
TETKO, I.V., KARPOV, P., VAN DEURSEN, R. ET AL.: "State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis.", NAT COMMUN, vol. 11, 2020, pages 5575, XP055846238, DOI: 10.1038/s41467-020-19266-y
VAN BOEKHOLDT CAS: "Molecular smell prediction using deep neural network ensemble", 1 January 2021 (2021-01-01), Tilburg, The Netherlands, XP093045231, Retrieved from the Internet <URL:http://arno.uvt.nl/show.cgi?fid=156305> *
YANG Y.Y. ET AL: "Ensemble neural network model for steel properties prediction", 5TH IFAC SYMPOSIUM ON MODELLING AND CONTROL IN BIOMEDICAL SYSTEMS 2003, MELBOURNE, AUSTRALIA, 21-23 AUGUST 2003, vol. 33, no. 22, 1 August 2000 (2000-08-01), pages 401 - 406, XP093045240, ISSN: 1474-6670, DOI: 10.1016/S1474-6670(17)37028-3 *
