WO1999035599A1 - Apparatus and method for use in the manufacture of chemical compounds - Google Patents

Apparatus and method for use in the manufacture of chemical compounds

Info

Publication number
WO1999035599A1
Authority
WO
WIPO (PCT)
Prior art keywords
molecule
defining
molecules
processing
signals
Prior art date
Application number
PCT/GB1999/000046
Other languages
English (en)
Inventor
Brian Laurence Arthur Kett
Richard Alan Harris
Original Assignee
Everett, Richard, Stephen, Hans
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Everett, Richard, Stephen, Hans filed Critical Everett, Richard, Stephen, Hans
Priority to AU21719/99A priority Critical patent/AU2171999A/en
Publication of WO1999035599A1 publication Critical patent/WO1999035599A1/fr

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • the present invention relates to an apparatus and method for use as a tool in the manufacture of chemical compounds, for example medicaments. More particularly, the present invention relates to an encoding apparatus for use in a processing system which processes input signals to produce a signal predicting a property of a chemical molecule.
  • the encoding system processes signals defining the molecule to produce encoded signals suitable for input to the processing system.
  • Processing systems have been proposed which process input signals defining a molecule to predict a particular property of the molecule. These systems include neural networks and processing systems performing regression analysis . Such systems, however, require the input signals defining the molecule to be of a fixed length (predetermined number of input bits).
  • an input neuron is provided for each element of the input signal. Accordingly, if the length of the input signal is unrestricted, then the neural network has to have an infinite number of input neurons in order to accommodate all possible inputs.
  • the training of neural networks is less successful if the input signal is not distributed across all of the input neurons for all of the input molecules (for example if the input signals vary from a small length for some molecules which uses a small number of the input neurons , to a large size for other molecules which uses a large number of the input neurons).
  • a signal processing apparatus or method in which a molecule is encoded on the basis of groups of atoms within the molecule.
  • the present invention provides a signal processing apparatus or method for use in training a processor such as a neural network, in which signals defining input molecules are processed to define groups of atoms within each molecule, a property of each group is calculated, and each molecule is encoded on the basis of the groups within the molecule and the calculated properties of those groups .
  • the invention also provides a signal processing apparatus or method for use with a processor such as a trained neural network, in which signals defining an input molecule are processed to define groups of atoms within the molecule, and the molecule is encoded on the basis of the defined groups and a property of each group previously calculated during training.
  • the invention further comprises a process of manufacturing a chemical compound, such as a medicament, in which a property of a molecule is predicted on the basis of groups of atoms within the molecule, and the compound is made using a molecule predicted to have a suitable property for the compound (after testing of the molecule to confirm the predicted property if necessary) .
  • the invention further comprises a compound containing a molecule predicted to have a property on the basis of groups of atoms within the molecule using a signal processing apparatus or method.
  • FIG. 2 schematically shows the components of a processing apparatus used at step S2 in Figure 1;
  • Figure 3 shows a block diagram of the functional processing elements in the apparatus of Figure 2 used during analysis of known data to train the processor;
  • FIG 4 shows the processing steps performed by the data encoder in Figure 3;
  • Figure 5 shows the structure for the example molecule glutaminyl
  • Figure 6 shows the Molfile for glutaminyl
  • Figure 7 schematically illustrates the information used to encode a plurality of atoms as a "key" for a molecule
  • Figure 8 shows the keys for glutaminyl
  • Figure 9 illustrates the key information stored within the key "library” store after the keys for one imaginary molecule have been identified;
  • Figures 10a, 10b, 10c and 10d show respectively the structure, Molfile, active atoms and keys for the molecule thymine;
  • Figures 11a, 11b, 11c and 11d show respectively the structure, Molfile, active atoms and keys for the molecule adenine;
  • Figures 12a, 12b, 12c and 12d show respectively the structure, Molfile, active atoms and keys for the molecule guanine;
  • Figure 13 illustrates the key information stored in the key library store after the keys for four imaginary molecules have been identified
  • Figure 14 schematically illustrates the data stored in the key property store after a property value has been calculated for each individual key
  • Figures 15a, 15b, 15c and 15d show key histograms for the example molecules 1, 2, 3 and 4 in Figure 13 respectively;
  • Figure 16 shows a block diagram of the functional processing elements in the processing apparatus of Figure 2 used during the analysis of a molecule to predict one or more of its properties
  • Figure 17 shows the processing steps performed by the data encoder in Figure 16;
  • Figure 18 shows the key histogram for the example molecule 1 in Figure 13 produced in a second embodiment.
  • Figure 1 shows the steps taken to manufacture a compound in an embodiment of the invention .
  • signals defining molecules having known properties are processed together with signals defining molecules with untested properties to predict the untested properties, (and hence determine which of the untested molecules may have the required properties to produce a compound with the desired characteristics) .
  • FIG. 2 shows a block diagram of the general arrangement of a signal processing apparatus used at step S2 to predict the properties of molecules.
  • a computer 2 which comprises a central processing unit (CPU) 4 connected to a memory 6 operable to store a program defining the operations to be performed by the CPU 4 and to store the signals processed by CPU 4.
  • a disk drive 8 which is operable to accept removable data storage media, such as a disk 10, and to transfer data stored thereon to the memory 6.
  • Operating instructions for the central processing unit 4 may be input to the memory 6 from a removable data storage medium using the disk drive 8.
  • Data to be processed by the CPU 4 may also be input to the computer 2 from a removable data storage medium using disk drive 8.
  • data to be processed may be downloaded into memory 6 via a connection from a local or remote database which stores the data.
  • the connection could, for example, be the Internet.
  • a user-instruction input device 14 which may comprise, for example, a keyboard and/or a position-sensitive input device such as a mouse, a trackerball, etc.
  • a frame buffer 16 which comprises a memory unit arranged to store image data relating to at least one image generated by the central processing unit 4, for example by providing one (or several) memory location(s) for a pixel of the image.
  • the value stored in the frame buffer for each pixel defines the colour or intensity of that pixel.
  • a display unit 18 for displaying the image stored in the frame buffer 16 in a conventional manner.
  • a video tape recorder (VTR) or other image recording device such as a paper printer.
  • a mass storage device such as a hard disk drive, having a high data storage capacity, is coupled to the memory 6 (typically via the CPU 4), and also to the frame buffer 16.
  • the mass storage device 22 can receive data processed by the central processing unit 4 from the memory 6 or data from the frame buffer 16 to be displayed on display unit 18.
  • Data processed by CPU 4 and stored in memory 6 may also be recorded onto a removable data storage medium (such as a disk 10) using the disk drive 8, thereby enabling processed data to be exported from the machine.
  • Processed data may also be exported by transmitting a signal conveying the data, for example, over a communication link (not shown), which could comprise the Internet .
  • CPU 4, memory 6, frame buffer 16, display unit 18 and mass storage device 22 may form part of a commercially available complete system, for example a conventional personal computer (PC).
  • Operating instructions for causing the computer 2 to perform as an embodiment of the invention can be supplied commercially in the form of programs stored on disk 10 or another data storage medium, or can be transmitted as a signal to computer 2, for example over a data link (not shown ) so that the receiving computer 2 becomes reconfigured into an apparatus embodying the invention.
  • Processing is performed by computer 2 in two stages - a first stage to process signals defining known molecules and their known properties to train a neural network, and a second stage to process signals defining molecules with unknown properties to predict the properties using the trained neural network.
  • Figure 3 shows, as a block diagram, the functional elements within computer 2 used during the training stage to process signals defining known molecules and their known measured properties .
  • the functional elements comprise a data encoder 30, and a neural network 60.
  • the components within the data encoder 30 will be described below with respect to the processing operations performed.
  • the neural network 60 in this embodiment is a conventional backpropagation neural network with three layers, having 15 neurons in the input layer, 27 neurons in the hidden layer and 1 neuron in the output layer.
  • the instructions for causing computer 2 to be configured to have these functional elements may be input on a data storage device via disk drive 8, may be input over a communication link, for example from a remote source, or may be input directly using user-input device 14.
  • Signals defining a plurality of molecules, M1 to Mn, in terms of conventional "Molfiles", and signals defining a measured value, P1 to Pn, for a property of each respective molecule are input to data encoder 30.
  • the molecules M1 to Mn may be identified from the literature or from laboratory tests, etc.
  • Data encoder 30 processes the signals to produce a finite number of signals as inputs, I1 to I15, to neural network 60 for each input compound M. These signals define an encoding of the input molecule M.
  • data encoder 30 produces 15 signals (I1-I15) to encode each molecule, as will now be described.
  • FIG. 4 shows the processing steps performed in data encoder 30.
  • molecule analyser 32 reads the Molfile defining the first input molecule M1 from the Molfile store 34 in which it was stored after input to data encoder 30.
  • the Molfile specifies the atoms and their structural relationships within the molecule, and has the conventional format used in the ISIS system of MDL Information Systems Inc.
  • the Molfile for the molecule glutaminyl, whose structure is shown in Figure 5, is given in Figure 6.
  • the Molfile comprises the following conventional units as described in the ISIS documentation from MDL Information Systems Inc: a header block (containing background information such as users' initials, program name, date/time, dimensional codes, scaling factors, energy, and registry number), a counts line (which specifies the number of atoms, bonds and atom lists, the chiral flag setting and the connection table version), an atom block (which specifies the atomic symbol and any mass difference, charge, stereochemistry, and associated hydrogens for each atom), and a bond block (which specifies the two atoms connected by each bond, the bond type, and any bond stereochemistry and chain or ring properties).
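By way of illustration only (this code is not part of the original disclosure), the following Python sketch reads the parts of a V2000-format Molfile that the data encoder relies on: the counts line, the atomic symbols and the bond block. The fixed column positions are assumptions based on the publicly documented MDL file format, and optional fields (charges, stereochemistry, the property block) are ignored.

```python
def parse_molfile(text):
    """Minimal V2000 Molfile reader: returns atom symbols and bonds.

    Assumes the standard layout of three header lines, a counts line,
    an atom block and a bond block.  Only the fields needed later
    (atomic symbols, bond endpoints and bond types) are kept.
    """
    lines = text.splitlines()
    counts = lines[3]
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    # Atom block: the atomic symbol sits in a fixed-width field after x, y, z.
    atoms = [line[31:34].strip() for line in lines[4:4 + n_atoms]]

    # Bond block: first atom, second atom (1-indexed) and bond type
    # (1 = single, 2 = double, 3 = triple, 4 = aromatic).
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    return atoms, bonds
```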
  • molecule analyser 32 processes data defining the molecule read from the Molfile at step S20 to identify "active" atoms within the molecule, that is, atoms which are likely to react to contribute to the molecule having the measured property (for example when the molecule comes into contact with another molecule, or when the molecule is exposed to certain conditions ) .
  • molecule analyser 32 identifies active atoms as atoms which satisfy one or more of the following conditions: the atom is not hydrogen or carbon; the atom is charged; or the atom is a virtual atom at the centre of an aromatic ring.
  • the active atoms identified at step S22 are stored in the active atom store 36.
  • molecule analyser 32 also processes the information in the Molfile to determine a number of physical properties (five in this embodiment) for the input molecule.
  • the physical properties calculated at step S22 are stored in the physical property store 38.
  • key definition module 40 reads the active atoms identified at step S22 from the active atom store 36 and the Molfile from the Molfile store 34, identifies each unique group which comprises a predetermined number (in this embodiment 3) of the active atoms and encodes each group by defining the atoms in the group and their relative positions. This encoded information is referred to as a "key" .
  • Figure 7 schematically illustrates the information encoded in a key.
  • each of the three atoms in a group is encoded using its atomic number, and the relative positions of the atoms are encoded using the distance between each pair of atoms (in this embodiment this is defined as the smallest number of bonds which must be traversed within the molecule from one atom to the other) and the degree of freedom of the bonds between each pair of atoms (in this embodiment this is determined by adding the degree of freedom of each individual bond considered in the distance measurement for the atoms, with the degree of freedom of a bond being defined as 1 if it is a single bond (representing the ability of the bond to move) and 0 if the bond is a double bond, a triple bond, or if it is within a ring structure).
  • the key encoding performed in this embodiment enables compositional and topological data describing a molecule to be encoded in a fixed length signal. Also, the encoded information is independent of the conformal states of the molecule.
  • Figure 8 shows the keys for the molecule glutaminyl .
  • the first key listed comprises the numbers 778552541, which are derived as follows.
  • the atoms in the first key comprise nitrogen atoms 74 and 76 and oxygen atom 70. These atoms have atomic numbers of 7, 7 and 8 respectively, which form the first three numbers in the key.
  • the minimum number of bonds in the molecule between nitrogen atom 74 and nitrogen atom 76 is 5, and since each of these bonds is a single bond, each has a degree of freedom of 1 so that the total degree of freedom of the bonds between nitrogen atoms 74 and 76 is 5.
  • the bond distance 5 forms the fourth number in the key and the degree of freedom number 5 forms the seventh number in the key.
  • the minimum number of bonds between oxygen atom 70 and nitrogen atom 76 is 5 (the fifth number in the key) but, since one of these bonds is a double bond, which is assigned a degree of freedom of 0, the total degree of freedom for the bonds is 4 (the eighth number in the key) .
  • the number of bonds between nitrogen atom 74 and oxygen atom 70 is 2, (the sixth number in the key) having a total degree of freedom of 1 (the ninth number in the key) .
  • Although the atomic numbers, network distances, and degrees of freedom are shown in a specific order, any order can be used, so that a given key can be written in a number of different ways.
  • the first key, 778552541, could equally be defined as 877255145, for example.
  • the total number of keys for the glutaminyl molecule is four, as set out in Figure 8, since it is possible to define four unique groups which each contain three active atoms .
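The key construction just illustrated for glutaminyl can be sketched in Python as follows. This is an illustrative reconstruction rather than the patented implementation: the molecule is assumed to be available as an adjacency list of bonds, the bond-count distance between two atoms is found by breadth-first search, and a per-bond degree of freedom (1 for a single bond outside a ring, 0 otherwise) is assumed to have been computed beforehand, since ring perception is outside the scope of the sketch.

```python
from collections import deque
from itertools import combinations

def shortest_bond_path(bonds_by_atom, start, goal):
    """Breadth-first search over a connected molecule, returning the list of
    bond identifiers on a shortest path (fewest bonds) between two atoms."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        atom, path = queue.popleft()
        if atom == goal:
            return path
        for neighbour, bond in bonds_by_atom[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [bond]))
    raise ValueError("atoms are not connected")

def make_keys(atomic_number, active_atoms, bonds_by_atom, bond_dof):
    """One key per unique group of three active atoms: three atomic numbers,
    three pairwise bond distances, and the summed degree of freedom of the
    bonds on each of those shortest paths."""
    keys = []
    for group in combinations(sorted(active_atoms), 3):
        distances, freedoms = [], []
        for a, b in combinations(group, 2):
            path = shortest_bond_path(bonds_by_atom, a, b)
            distances.append(len(path))
            freedoms.append(sum(bond_dof[bond] for bond in path))
        numbers = [atomic_number[atom] for atom in group]
        keys.append(tuple(numbers + distances + freedoms))
    return keys
```

Here bonds_by_atom maps each atom to its (neighbour, bond id) pairs and bond_dof maps each bond id to 0 or 1. Assuming the atom numbering of Figure 5, the group of nitrogen atoms 74 and 76 and oxygen atom 70 would yield (8, 7, 7, 2, 5, 5, 1, 4, 5), i.e. the reordered form 877255145 of the first key noted above.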
  • key definition module 40 stores the keys defined at step S24 in a key "library" within key library store 42.
  • Figure 9 schematically illustrates the storage of this information for an imaginary molecule (molecule 1) having four imaginary keys in the storage library.
  • information is stored identifying each key and the molecule in which that key may be found.
  • At step S28, CPU 4 determines whether there is another molecule in the input training set. Steps S20 to S28 are repeated until all molecules in the training set have been processed in the manner described above to define and store their keys.
  • Figures 10a, 10b, 10c and 10d show respectively, by way of further example, the structure of the molecule thymine, its Molfile, its active atoms, and its keys.
  • the degree of freedom of the bonds between each pair of atoms in each key is 0 because all of the active atoms are either in an aromatic ring or connected to the aromatic ring with a double bond.
  • Figures 11a, 11b, 11c and 11d show respectively the structure, Molfile, active atoms and keys for the molecule adenine.
  • the virtual atom (marked *) at the centre of the aromatic ring is defined as an active atom.
  • the atomic number of the virtual atom is defined as 0 in each key containing this atom, and that the network distance between any of the "real" active atoms and the virtual active atom is the minimum number of bonds which connects the real active atom to the aromatic ring which has the virtual atom at its centre.
  • Figure 11d also shows that the key 770321000 (marked *) appears twice in the molecule adenine.
  • Figures 12a, 12b, 12c and 12d show respectively the structure, Molfile, active atoms and keys for the molecule guanine .
  • the keys 777325011 and 777523110 are identical (as noted above, the numbers in a given key can be defined in different orders ) , and therefore this key occurs twice in the molecule guanine.
  • Figure 13 shows the information stored in key library store 42 after step S26 ( Figure 4) has been repeated for all molecules in the training set.
  • Figure 13 illustrates the information stored for four imaginary molecules (not the real molecules described by way of example above) having ten imaginary keys .
  • the information stored in key library store 42 defines a superset of the keys in the molecules (that is, each key in every molecule), and the molecules in which each key is found.
  • key analyser 44 reads the molecular properties P1 to Pn from a measured property store 46 in which they were stored after input to data encoder 30.
  • key analyser 44 uses the key information stored in the key library store 42 at step S26 and the molecular properties read at step S30 to define a value representing the contribution each key makes to the property in a molecule.
  • Key analyser 44 then stores for each key the calculated contribution in key property store 48.
  • the property is, for example, the activity of each molecule against a predetermined cancer assay, having a value between 0.0 and 1.0, in which molecule 1 shown in Figure 13 has an activity (P1) of 0.5, molecule 2 has an activity (P2) of 0.4, molecule 3 has an activity (P3) of 0.8 and molecule 4 has an activity (P4) of 0.2.
  • Figure 14 shows the key information, including the calculated properties for each key in this illustrative example, stored in key property store 48 at step S32.
  • key 1 appears in two molecules, namely molecule 1 and molecule 3. Therefore, the contribution that key 1 makes to an activity level of 0.5 (this being the activity level of molecule 1) is ½, since key 1 appears once in molecule 1 and twice across all of the molecules. Similarly, the contribution that key 1 makes to an activity level of 0.8 (this being the activity level of molecule 3) is also ½. Key 1 does not contribute anything to an activity level of 0.4 (the activity level of molecule 2) or 0.2 (the activity level of molecule 4) since it does not occur in these molecules.
  • Key 3 appears once in each of molecules 1, 2 and 3. Accordingly, the calculated property for key 3 is 1/3 at activity level 0.5, 1/3 at activity level 0.4, and 1/3 at activity level 0.8.
  • Key 6 appears once in molecule 2, twice in molecule 3 and once in molecule 4.
  • the calculated property for key 6 is, therefore, ¼ at activity level 0.4, ½ at activity level 0.8 (since key 6 appears twice in molecule 3) and ¼ at activity level 0.2.
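Reading the worked figures above, the contribution a key makes to a given activity level appears to be the number of occurrences of the key in the molecule(s) at that level divided by the total number of occurrences of the key across the whole training set (equation (1) itself is not reproduced in this text). A minimal sketch on that assumption, with the key library held as a mapping from key to per-molecule occurrence counts:

```python
def key_contributions(key_counts, activity):
    """key_counts: {key: {molecule_id: occurrences of the key in that molecule}}
       activity:   {molecule_id: measured activity level}
       Returns {key: {activity_level: contribution}}, each contribution being
       the occurrences at that level divided by total occurrences of the key."""
    contributions = {}
    for key, per_molecule in key_counts.items():
        total = sum(per_molecule.values())
        levels = {}
        for molecule, count in per_molecule.items():
            level = activity[molecule]
            levels[level] = levels.get(level, 0.0) + count / total
        contributions[key] = levels
    return contributions

# Reproducing the illustrative figures: key 1 occurs once each in molecules 1
# and 3, giving 1/2 at levels 0.5 and 0.8; key 6 occurs once in molecule 2,
# twice in molecule 3 and once in molecule 4, giving 1/4, 1/2 and 1/4.
example = key_contributions(
    {"key1": {"mol1": 1, "mol3": 1},
     "key6": {"mol2": 1, "mol3": 2, "mol4": 1}},
    {"mol1": 0.5, "mol2": 0.4, "mol3": 0.8, "mol4": 0.2},
)
```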
  • fixed length encoding module 50 uses the key properties defined at step S32 and the physical properties calculated at step S22 to encode each molecule in the training set in a fixed length format.
  • the encoding at step S34 is performed for each molecule by, firstly, reading the keys from the key library store 42 which are in the molecule (previously stored at step S26) and the property calculated for each of those keys from the key property store 48 (stored at step S32), calculating values comprising the sum of each of the individual key properties, and, in effect, storing the calculated values in a histogram having a predetermined number of bins in key histogram store 52, and secondly, using the histogram numbers together with the numbers from the physical property store 38 calculated at step S22 for the physical properties as the encoding for the molecule.
  • This processing therefore encodes arbitrary molecules of unknown size (unknown number of constituent atoms) with a predetermined number of numbers (which are defined using a predetermined number of bits in a fixed length signal).
  • fixed length encoding module 50 adds the contribution that the keys in molecule 1 (that is, keys 1, 2, 3 and 4) make to activity level 0.5 and stores the total (1.83) in a histogram for the bin 0.5.
  • fixed length encoding module 50 adds the respective contributions that keys 1 to 4 make to activity level 0.4 and stores the total (0.33) in the bin for 0.4 in the histogram, adds the respective contributions that keys 1 to 4 make to the activity level of 0.8 and stores the total (1.33) in the bin for 0.8 in the histogram, and adds the respective contributions that keys 1 to 4 make to the activity level of 0.2 and stores the total (0.5) in the bin for 0.2 in the histogram.
  • Figure 15a shows the histogram formed as described above for molecule 1.
  • CPU 4 provides 10 bins in the histogram, and therefore defines and stores ten histogram numbers for each molecule, the numbers for molecule 1 being 0, 0.5, 0, 0.33, 1.83, 0, 0, 1.33, 0, 0.
  • Fixed length encoding module 50 then uses the physical properties of the molecule previously calculated at step S22 to produce signals defining a fixed length encoding of the molecule. More particularly, CPU 4 uses the 10 values in the histogram for the molecule together with the values of the 5 physical properties calculated at step S22 to produce a signal having 15 values comprising an encoded format of the molecule.
  • Figures 15b, 15c and 15d show respectively the histograms formed at step S34 by fixed length encoding module 50 for molecules 2, 3 and 4 in the illustrative example of Figure 14.
  • fixed length encoding module 50 produces signals defining 15 values for each molecule comprising a respective value in each of the ten buckets of the histogram together with the respective value of each of the five physical properties previously calculated at step S22, thereby encoding each molecule with a fixed length format (15 numbers which can be defined using a predetermined number of bits).
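A sketch of the fixed-length encoding step under the same assumed data structures: the contributions of the keys present in a molecule are summed into a fixed number of activity-level bins, and the bin totals are concatenated with the molecule's physical property values to give the 15 inputs. The ten bin centres 0.1 to 1.0 are an assumption consistent with the molecule 1 histogram values quoted above.

```python
def encode_molecule(molecule_keys, contributions, physical_properties,
                    bins=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return the fixed-length encoding of one molecule: one summed key
    contribution per histogram bin, followed by the physical properties.
    With 10 bins and 5 physical properties this gives 15 numbers (I1-I15)."""
    histogram = [0.0] * len(bins)
    for key in molecule_keys:
        # Keys with no stored contribution (never seen in training) are skipped.
        for level, value in contributions.get(key, {}).items():
            nearest = min(range(len(bins)), key=lambda i: abs(bins[i] - level))
            histogram[nearest] += value
    return histogram + list(physical_properties)
```

For molecule 1 of the example (keys 1 to 4 and their contributions as above), this reproduces the values 0, 0.5, 0, 0.33, 1.83, 0, 0, 1.33, 0, 0 followed by the five physical property values.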
  • CPU 4 inputs each respective one of the 15 numbers I1-I15 encoding a given input molecule M to a respective one of the input neurons of neural network 60.
  • CPU 4 applies a signal defining the measured property P of the molecule to the single output neuron of the neural network 60. This is done for each of the molecules M1 to Mn and their measured properties P1 to Pn in the training set to train the neural network in a conventional manner. This produces a trained neural network, which can then be used to predict the properties of molecules which are not in the training set.
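One plausible realisation of the training stage, offered only as an illustration (the patent does not tie itself to any particular library), uses scikit-learn's MLPRegressor with the stated shape of 15 inputs, 27 hidden neurons and 1 output; the placeholder arrays stand in for the encoded training molecules and their measured properties.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholders: X would hold one row of 15 encoded values (I1-I15) per
# training molecule, y the measured property P of each molecule.
rng = np.random.default_rng(0)
X = rng.random((200, 15))
y = rng.random(200)

net = MLPRegressor(hidden_layer_sizes=(27,),   # single hidden layer of 27 neurons
                   activation="logistic",      # sigmoid units, as in classic backprop
                   solver="sgd",               # gradient-descent backpropagation
                   max_iter=5000,
                   random_state=0)
net.fit(X, y)
```

After fitting, calling net.predict on the 15-value encoding of an unseen molecule plays the role of the prediction stage described below.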
  • Figure 16 shows the functional processing elements in computer 2 used in the second stage of processing to predict the property Pn+1 of a molecule Mn+1 which was not in the training set.
  • the elements are the same as those shown in, and described above with respect to, Figure 3, with the exception that key analyser 44 and measured property store 46 are not used during prediction (and hence are not shown in Figure 16) and key library store 42 is replaced by key store 54.
  • FIG 17 shows the processing operations performed by CPU 4 within the data encoder 30 during the second stage of processing.
  • molecule analyser 32 reads the input Molfile from Molfile store 34, and at step S102 identifies and stores the active atoms and physical properties of the input molecule in the same way that these were identified and stored at step S22 described above.
  • key definition module 40 defines the keys within the input molecule in the same way that the keys were defined at step S24. Key definition module 40 stores the defined keys in key store 54.
  • fixed length encoding module 50 reads the key properties previously defined on the basis of the training data at step S32 from the key property store 48.
  • fixed length encoding module 50 encodes the input molecule using the keys in the input molecule Mn+1 stored in key store 54 at step S104, the key properties previously stored in key property store 48 at step S32 during training, and the physical properties of the input molecule Mn+1 stored in physical property store 38 at step S102. This encoding is performed in the same way that each molecule was encoded at step S34.
  • if the input molecule contains a key which is not one for which key properties are stored in key property store 48 (that is, the input molecule Mn+1 contains a key which was not present in any of the molecules M1 to Mn in the training set) then, in this embodiment, the key is ignored by fixed length encoding module 50.
  • the 15 numbers I1-I15 encoding the input molecule are input to the trained neural network 60, which outputs in response thereto the predicted value Pn+1 for the property of the input molecule.
  • neural network 60 outputs a predicted value between 0 and 1 for the activity level of the input molecule against the predetermined cancer assay.
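Tying the earlier sketches together (and reusing encode_molecule and the trained net defined above), prediction for a molecule outside the training set is then very short; keys absent from the stored key properties are silently dropped inside encode_molecule, as described.

```python
def predict_property(molecule_keys, physical_properties, contributions, net):
    """Encode an unseen molecule with the key properties fixed during training
    and return the trained network's predicted property value."""
    encoding = encode_molecule(molecule_keys, contributions, physical_properties)
    return float(net.predict([encoding])[0])
```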
  • At step S4, molecules predicted at step S2 to have a favourable property value (for example, an activity level against a particular cancer assay which is greater than a predetermined value) are synthesised and tested in a conventional manner in the laboratory, in clinical trials, etc.
  • a compound is manufactured containing one or more of the molecules tested at step S4 which were found to have the required properties.
  • the compound may include other molecules, for example as carriers etc.
  • data encoder 30 and neural network 60 are provided in the same computer 2. Similarly, after training, prediction of the molecules' properties is carried out using the data encoder and trained neural network in the same computer. Alternatively, data defining the data encoder 30 with key property data stored in key property store 48 may be transferred to train a neural network in a different computer. Similarly, data defining the data encoder 30 with the key property data stored in key property store 48 and/or data defining the trained neural network 60 may be transferred to a different computer in order to predict the properties of molecules not in the training set.
  • a data encoder without key data stored in key property store 48 may be used to identify active atoms and define keys (steps S100, S102 and S104 in Figure 17), and key properties may be read (step S106) from a key property store held elsewhere, such as a remote database, to enable encoding to be performed (step S108).
  • the neural network can become biased by the training. This is because most techniques will tend to err on the side of classifying an active compound as inactive, because the cumulative error from misclassification of active compounds (false negatives) contributes little error when compared with the much larger number of accurately classified inactive compounds (true negatives).
  • This problem can be addressed by modifying the training algorithm or by using unbiased subsets of the training data. For example, in one approach a training algorithm may be used which seeks to minimise the maximum error, such as the minimax algorithms discussed in "Neural Networks for Pattern Recognition" by C.M. Bishop, Oxford University Press, 1995, ISBN 0198538642.
  • molecule analyser 32 identifies active atoms by determining each atom which (i) is not hydrogen or carbon, (ii) is charged, or (iii) is a virtual atom at the centre of an aromatic ring.
  • Different conditions may be used instead of, or in addition to, these conditions to identify active atoms. For example, the following conditions could be used:
  • the atom is any atom within an aromatic ring
  • the atom is carbon with at least one double bond to another atom
  • the atom is not hydrogen and is bonded to at least one atom deemed active by any other condition.
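A compact sketch of the active-atom test, covering the baseline conditions of the embodiment (not hydrogen or carbon; charged; a virtual atom at the centre of an aromatic ring) together with the optional extra conditions listed above. The atom record fields used here are assumptions made for illustration only.

```python
def is_active(atom, bonds_of_atom, extra_rules=False):
    """atom: dict with 'symbol', 'charge', 'virtual_ring_centre' and
       'in_aromatic_ring' fields; bonds_of_atom: list of (neighbour_atom,
       bond_type) pairs, with bond_type 1 = single, 2 = double, 3 = triple."""
    # Baseline conditions used in the embodiment.
    if atom["symbol"] not in ("H", "C"):
        return True
    if atom.get("charge", 0) != 0:
        return True
    if atom.get("virtual_ring_centre", False):
        return True
    if extra_rules:
        # Optional extra conditions described as possible alternatives.
        if atom.get("in_aromatic_ring", False):
            return True
        if atom["symbol"] == "C" and any(t == 2 for _, t in bonds_of_atom):
            return True
        if atom["symbol"] != "H" and any(
                is_active(neighbour, []) for neighbour, _ in bonds_of_atom):
            # Simplification: neighbours are only tested against the
            # baseline conditions here.
            return True
    return False
```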
  • the physical properties identified by molecule analyser 32 at steps S22 and S102 may be different to those described in the embodiment above.
  • physical properties which may be used in addition to, or instead of some or all, of the physical properties described in the embodiment above include boiling point, melting point, freezing point, proportion of hydrogen, and whether the molecule is chiral.
  • key definition module 40 defines the atoms in a key on the basis of atomic number.
  • mass number, valency and/or charge may be used.
  • absolute or relative distances between atoms in the key and/or the angles between atoms in the key may be used.
  • key analyser 44 may calculate an average property value for each key.
  • the average property value for key 1 would be ½(0.5 + 0.8), since key 1 is in molecule 1, which has an activity level of 0.5, and molecule 3, which has an activity level of 0.8.
  • the average activities for keys 2, 3 and 4 would be respectively ½(0.5 + 0.2), ⅓(0.5 + 0.4 + 0.8) and ½(0.5 + 0.8).
  • the average activity for key 6 would be ¼(0.4 + 0.8 + 0.8 + 0.2), the activity level 0.8 being considered twice since key 6 appears twice in molecule 3.
  • fixed length encoding module 50 would then calculate a histogram of the average activities for each key in the molecule being encoded.
  • Figure 18 shows the key histogram produced for molecule 1 in the example of Figures 13 and 14 using this modified form of processing.
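A sketch of this alternative key analyser over the same assumed key-count mapping: each key is assigned the mean activity of the molecules in which it occurs, with each occurrence counted separately (so a key appearing twice in one molecule counts that molecule's activity twice).

```python
def average_key_activity(key_counts, activity):
    """key_counts: {key: {molecule_id: occurrences}}; activity: {molecule_id: level}.
       Returns the occurrence-weighted mean activity for each key."""
    averages = {}
    for key, per_molecule in key_counts.items():
        total = sum(per_molecule.values())
        weighted = sum(activity[m] * n for m, n in per_molecule.items())
        averages[key] = weighted / total
    return averages

# Key 6 of the example: once in molecule 2 (0.4), twice in molecule 3 (0.8)
# and once in molecule 4 (0.2)  ->  (0.4 + 0.8 + 0.8 + 0.2) / 4.
```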
  • key analyser 44 may calculate the contribution of each key to a property value Pi of molecule Mi using equation ( 1 ) given in the embodiment above but without dividing by the number of molecules in which the key is found.
  • An additional neural network may be used to create a model which maps each key to the properties of the molecules in which the key is found.
  • the creation of such a model would seek to minimise the residual error of the predicted activities .
  • the model could be used as a measure of key activity for the construction of a histogram.
  • a vector could be constructed with an entry for each key. Each element of the vector could be set to the number of times its associated key appears within the molecule in question.
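This alternative encoding is simpler still; a sketch, assuming an ordered list of every key in the key library:

```python
def key_count_vector(molecule_key_counts, library_keys):
    """library_keys: ordered list of all keys in the key library;
       molecule_key_counts: {key: occurrences in the molecule being encoded}.
       Returns one count per library key (0 where the key is absent)."""
    return [molecule_key_counts.get(key, 0) for key in library_keys]
```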
  • a back propagation neural network is used.
  • different types of neural networks may be used. For example, a radial basis function network or a multi-layer perceptron as described in "Neural Networks for Pattern Recognition" by C.M. Bishop, Oxford University Press, 1995, ISBN 0198538642 may be used.
  • a Kohonen network as described in "Self-Organisation and Associative Memory, 2nd Edition" by Kohonen, Springer Verlag, 1988, ISBN 038718140 may be used.
  • Neural network 60 may be replaced by a functional element performing conventional linear regression.
  • Neural network 60 may also be replaced by a functional element comprising a generalised additive model, for example as described in "Generalised Additive Models" by T. J. Hastie and R.J. Tibshirani in Monographs on Statistical and Applied Probability, No. 43, Chapman & Hall.
  • Neural network 60 may also be replaced by a functional element performing projection pursuit regression, for example as described in the Journal of the American Statistical Association, vol 76, No. 376, pages 817-823 by J. Friedman & W. Stutzle, 1981.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention concerns a signal processing system for use in the production of chemical compounds, to predict which molecules have a property required for a compound. In the signal processing system, signals defining molecules in a training set and a measured property of each molecule are processed to produce fixed-length signals encoding each molecule. The system comprises several steps: (i) identifying, for each molecule, the active atoms which may cause the molecule to react, and its physical properties; (ii) defining each unique group containing a given number of active atoms as a "key" characterising the atoms and their relative positions; (iii) determining the contribution each key makes to the measured property of the molecules containing that key; (iv) forming a histogram of the contributions of the keys in a molecule; and (v) encoding the molecule using the histogram values and the physical properties. A further molecule is encoded as above, but using the individual key contributions defined during training. The trained neural network is used to predict the properties of the further molecule.
PCT/GB1999/000046 1998-01-09 1999-01-07 Apparatus and method for use in the manufacture of chemical compounds WO1999035599A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU21719/99A AU2171999A (en) 1998-01-09 1999-01-07 Apparatus and method for use in the manufacture of chemical compounds

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9800462.5 1998-01-09
GBGB9800462.5A GB9800462D0 (en) 1998-01-09 1998-01-09 Apparatus and method for use in the manufacture of chemical compounds

Publications (1)

Publication Number Publication Date
WO1999035599A1 true WO1999035599A1 (fr) 1999-07-15

Family

ID=10825072

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1999/000046 WO1999035599A1 (fr) 1998-01-09 1999-01-07 Dispositif et procede utilises dans la production de composes chimiques

Country Status (3)

Country Link
AU (1) AU2171999A (fr)
GB (1) GB9800462D0 (fr)
WO (1) WO1999035599A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0496902A1 (fr) * 1991-01-26 1992-08-05 International Business Machines Corporation Système et procédé à base de connaissance pour la recherche de molécules
WO1994028504A1 (fr) * 1993-05-21 1994-12-08 Arris Pharmaceutical Modelisation d'activite biologique d'une conformation moleculaire et modelisation d'autres caracteristiques par une technique d'apprentissage machine
WO1997014106A1 (fr) * 1995-10-13 1997-04-17 Terrapin Technologies, Inc. Identification d'activite chimique commune par comparaison de fragments substructuraux
WO1997027559A1 (fr) * 1996-01-26 1997-07-31 Patterson David E Procede pour creer une bibliotheque moleculaire virtuelle et procede pour y faire des recherches, en utilisant des descripteurs valides de structure moleculaire

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DALBY A ET AL: "DESCRIPTION OF SEVERAL CHEMICAL STRUCTURE FILE FORMATS USED BY COMPUTER PROGRAMS DEVELOPED AT MOLECULAR DESIGN LIMITED", JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, vol. 32, May 1992 (1992-05-01), pages 244 - 255, XP000611886 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002017149A2 (fr) * 2000-08-22 2002-02-28 3-Dimensional Pharmaceuticals.Inc Procede, systeme et progiciel servant a determiner des proprietes de produits d'une bibliotheque combinatoire a partir d'elements de modules de bibliotheque
WO2002017149A3 (fr) * 2000-08-22 2003-07-24 Dimensional Pharm Inc Procede, systeme et progiciel servant a determiner des proprietes de produits d'une bibliotheque combinatoire a partir d'elements de modules de bibliotheque
US6834239B2 (en) 2000-08-22 2004-12-21 Victor S. Lobanov Method, system, and computer program product for determining properties of combinatorial library products from features of library building blocks
US7054757B2 (en) 2001-01-29 2006-05-30 Johnson & Johnson Pharmaceutical Research & Development, L.L.C. Method, system, and computer program product for analyzing combinatorial libraries
US10366324B2 (en) 2015-09-01 2019-07-30 Google Llc Neural network for processing graph data
CN107969156A (zh) * 2015-09-01 2018-04-27 谷歌有限责任公司 用于处理图形数据的神经网络
WO2017040001A1 (fr) * 2015-09-01 2017-03-09 Google Inc. Réseau neuronal pour traitement de données graphiques
US11205113B2 (en) 2015-09-01 2021-12-21 Google Llc Neural network for processing graph data
US11663447B2 (en) 2015-09-01 2023-05-30 Google Llc Neural network for processing graph data
US10915808B2 (en) * 2016-07-05 2021-02-09 International Business Machines Corporation Neural network for chemical compounds
US20210110240A1 (en) * 2016-07-05 2021-04-15 International Business Machines Corporation Neural network for chemical compounds
US11934938B2 (en) * 2016-07-05 2024-03-19 International Business Machines Corporation Neural network for chemical compounds
FR3078804A1 (fr) * 2018-03-06 2019-09-13 Arkema France Procede de selection de solvants adaptes a des polymeres fluores
WO2019170999A3 (fr) * 2018-03-06 2020-03-12 Arkema France Procede de selection de solvants adaptes a des polymeres fluores

Also Published As

Publication number Publication date
AU2171999A (en) 1999-07-26
GB9800462D0 (en) 1998-03-04

Similar Documents

Publication Publication Date Title
Bommert et al. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data
Idakwo et al. A review on machine learning methods for in silico toxicity prediction
Skarding et al. Foundations and modeling of dynamic networks using dynamic graph neural networks: A survey
Fernández-de Gortari et al. Database fingerprint (DFP): an approach to represent molecular databases
Shin et al. Empirical data modeling in software engineering using radial basis functions
Kuczera Efficient subspace probabilistic parameter optimization for catchment models
US6507669B1 (en) Method of selecting clusters of items using a fuzzy histogram analysis
Kamath et al. Effective automated feature construction and selection for classification of biological sequences
Rajapakse et al. Markov encoding for detecting signals in genomic sequences
US7433857B2 (en) Techniques for reconstructing supply chain networks using pair-wise correlation analysis
CN111612039A (zh) 异常用户识别的方法及装置、存储介质、电子设备
Al-Barakati et al. RF-GlutarySite: a random forest based predictor for glutarylation sites
Pappa et al. Attribute selection with a multi-objective genetic algorithm
Du et al. UniDL4BioPep: a universal deep learning architecture for binary classification in peptide bioactivity
Digles et al. Self‐organizing maps for in silico screening and data visualization
Buterez et al. CellVGAE: an unsupervised scRNA-seq analysis workflow with graph attention networks
De Lorenzo et al. An analysis of dimensionality reduction techniques for visualizing evolution
Chaudhari et al. DeepRMethylSite: a deep learning based approach for prediction of arginine methylation sites in proteins
Maggiora et al. From qualitative to quantitative analysis of activity and property landscapes
WO1999035599A1 (fr) Dispositif et procede utilises dans la production de composes chimiques
Ko et al. Mascot: A quantization framework for efficient matrix factorization in recommender systems
Duman et al. Gene coexpression network comparison via persistent homology
Suh et al. Metaheuristic-based time series clustering for anomaly detection in manufacturing industry
Li et al. ExamPle: explainable deep learning framework for the prediction of plant small secreted peptides
Abraham et al. Exploring the application of machine learning algorithms to water quality analysis

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: KR

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase