EP1010094A1 - Mixed screening system - Google Patents
- Publication number
- EP1010094A1 (application EP98941143A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- virtual
- compounds
- molecular
- receptor
- representation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B01—PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
- B01J—CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
- B01J19/00—Chemical, physical or physico-chemical processes in general; Their relevant apparatus
- B01J19/0046—Sequential or parallel reactions, e.g. for the synthesis of polypeptides or polynucleotides; Apparatus and devices for combinatorial chemistry or for making molecular arrays
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/60—In silico combinatorial chemistry
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/60—In silico combinatorial chemistry
- G16C20/64—Screening of libraries
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B01—PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
- B01J—CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
- B01J2219/00—Chemical, physical or physico-chemical processes in general; Their relevant apparatus
- B01J2219/00274—Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
- B01J2219/00583—Features relative to the processes being carried out
- B01J2219/00601—High-pressure processes
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B01—PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
- B01J—CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
- B01J2219/00—Chemical, physical or physico-chemical processes in general; Their relevant apparatus
- B01J2219/00274—Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
- B01J2219/0068—Means for controlling the apparatus of the process
- B01J2219/007—Simulation or virtual synthesis
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B01—PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
- B01J—CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
- B01J2219/00—Chemical, physical or physico-chemical processes in general; Their relevant apparatus
- B01J2219/00274—Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
- B01J2219/0068—Means for controlling the apparatus of the process
- B01J2219/00702—Processes involving means for analysing and characterising the products
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B01—PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
- B01J—CHEMICAL OR PHYSICAL PROCESSES, e.g. CATALYSIS OR COLLOID CHEMISTRY; THEIR RELEVANT APPARATUS
- B01J2219/00—Chemical, physical or physico-chemical processes in general; Their relevant apparatus
- B01J2219/00274—Sequential or parallel reactions; Apparatus and devices for combinatorial chemistry or for making arrays; Chemical library technology
- B01J2219/00718—Type of compounds synthesised
- B01J2219/0072—Organic compounds
Definitions
- A first aspect of the invention relates to the virtual screening of molecular representations. In particular, the invention is directed to the ability to evaluate the theoretical activity of molecules in various fields, such as, but not limited to, chemistry, agriculture (e.g. crop protection chemicals, growth modifiers), pharmacology (e.g. human and veterinary pharmaceuticals, toxicological profiles, diagnostic reagents) and the physical, physicochemical, and in particular biological activity of chemical compounds in general.
- A further aspect of the invention relates to refining the screening process in order to accentuate evaluation of likely active structures.
- Still a further aspect relates to a method of mutating structures for evaluation by the screening system.
- Still a further aspect relates to a fitness function which is used to assist in the evaluation of likely active structures.
- Other aspects are also disclosed.
- BACKGROUND OF THE INVENTION
- The determination of the biological activity of chemical compounds is a continuing endeavour of research institutions and chemical companies, particularly due to its implications in the development of new drugs and other therapeutic remedies to treat or cure specific diseases.
- Biological activity of a compound is generally accepted as being the consequence of the fit of chemical compound into a receptor site involved in the particular biological process in a manner that the process is altered in some desirable way, e.g. either accentuated or inhibited.
- A lead compound is a substance which exhibits a useful biological activity.
- Lead compounds are often obtained from natural sources or by synthesis of new chemical structures.
- SAR: Structure-Activity Relationship
- Quantitative Structure-Activity Relationships (QSAR)
- One way of determining the theoretically most highly active compounds is to use one of the various regression techniques to map molecular structure to activity, where the physicochemical properties are used to represent structure.
- This QSAR mapping allows the values of the optimum physicochemical properties of the data set, and thus the structure of the most active compounds, to be determined.
- an analytical technique is used, such as multiple linear regression (MLR).
- MLR: multiple linear regression
- PCT/CA96/00166, PCT/IB94/00257 and US 5,699,268 disclose inventions related to drug-receptor interactions.
- the embodiment of these simulated receptors is in a three dimensional, molecular level form. Therefore certain properties of the molecule as a whole are difficult, if at all possible, to ascertain.
- PCT/IB94/00257 discloses a method of calculating the free energy of binding of molecules to receptors whose three dimensional structures have been determined by other means.
- US 5,699,268 discloses methods of generating computer simulated receptors using genetic evolution.
- US 5,434,796 also discloses a computer simulated system for genetically evolving a population of molecules towards higher biological activity.
- the disclosure mainly relates to the way in which the generation of molecules for screening evolves.
- the disclosure revolves around the use of SMILES (Simplified Molecular Input Line Entry System) strings, which are described in "SMILES, a chemical language and information system. I. Introduction to methodology and encoding rules", D. Weininger, J. Chem. Inf. Comput. Sci., 28, 31 (1988).
- SMILES strings are lexical forms of molecular objects which are randomly mutated. However, the mutation rules are somewhat limited in that many types of chemically-important molecular modification are not readily accessible.
- the genetically evolved lead generation system draws on work done by several groups which was aimed at generation of the large novel chemical databases referred to above (virtual combinatorial libraries).
- Nilakantan, R., Bauman, N., Venkataraghavan, R., J. Chem. Inf. Comput. Sci. (1991) 31, 527-30 developed a method of random structure generation based on the random fusion of 2D chemical fragments. More recently, Clark, D.E., Firth, M.A., Murray, C.W., J. Chem. Inf. Comput. Sci. (1996) 36, 137-145 used graph theoretical techniques for vertex degree set generation and constructive enumeration of molecular graphs to generate 3D databases for drug design.
- the present application relates to a number of aspects, including: 1. finding relationships between molecular structure and useful properties of molecules, more particularly using a virtual or mathematical analogue or model of a biological receptor or active site (a "virtual receptor") or other biological activity, such as toxicity;
- BRANN: Bayesian regularised artificial neural network
- the database may be real or virtual, may apply to existing or hypothetical molecules or compounds;
- This aspect provides a method of creating a virtual receptor capable of being used to scan a range of compounds and providing a measure indicative of whether the compounds are likely to exhibit a particular characteristic, including the steps of: compiling a data set of compounds which exhibit the known characteristic; forming a conceptual structure/activity model with a given architecture; converting the data set into a representation readable by the conceptual model; and training the conceptual model on at least a portion of the converted data set in order to improve the architecture of the conceptual model.
- the data input to the virtual receptor is a molecular representation of the compounds which include the entire molecule and embody relevant properties such as steric, electronic and lipophilic properties.
- a preferred output of a virtual receptor that may be determined is the binding affinity of the compounds or other biological activity.
- a further aspect is based on the use of a mathematical concept called an artificial neural net to derive a virtual receptor.
- Artificial neural networks are mathematical models, and thus it has been found that they can be used in respect of scanning compounds and training virtual receptors.
- an evolutionary neural network may be used.
- the virtual receptor may be rendered in a number of forms.
- the rendering is in a mathematical form.
- One form may be by the atomistic approach, which classifies each atom according to its element and the number of connections.
- the compounds may be represented in terms of simple molecular structural parameters, such as constituent atoms or functional groups.
- An advantage that stems from the inventive method using an atomistic representation is that it allows compounds to be screened with no more knowledge than is provided by counting molecular fragments. Many other molecular representations however are possible, such as depicting the molecules based on their optimal physicochemical properties (see example 2 below).
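The atomistic representation described above (each atom classified by its element and its number of connections) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the heavy-atom-only input format, and the fixed vocabulary are hypothetical and are not taken from the patent.

```python
from collections import Counter

def atomistic_descriptor(atoms, bonds):
    """Count atoms by (element, number of connections).

    atoms: list of element symbols, indexed by atom number (assumed format).
    bonds: list of (i, j) index pairs, one per bond between listed atoms.
    """
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return Counter(zip(atoms, degree))

def to_vector(counts, vocabulary):
    """Turn the counts into a fixed-length input vector for a model."""
    return [counts.get(key, 0) for key in vocabulary]

# Ethanol heavy-atom skeleton C-C-O (hydrogens omitted for brevity).
ethanol = atomistic_descriptor(["C", "C", "O"], [(0, 1), (1, 2)])

# Hypothetical vocabulary of atom classes; a real one would be built
# from every class seen in the training data.
vocabulary = [("C", 1), ("C", 2), ("N", 1), ("O", 1)]
ethanol_vector = to_vector(ethanol, vocabulary)
```

Every compound screened later would have to be encoded with the same vocabulary that was used when the model was trained.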
- topological indices, Burden's chemically intuitive molecular index (CIMI), and/or molecular hologram representation of Tripos Assoc. may be used as compound descriptors. Additional novel representations which form additional aspects of the invention are exemplified in the sections following.
- one inventive concept involves the creation of a virtual receptor by training the receptor using compounds with known properties. Once a virtual receptor has been created based on a particular molecular or mathematical representation of the compounds, all future compounds that are used as input to that receptor must also be represented in the particular molecular or mathematical representation used in the training of the receptor.
- This aspect provides a method of generating a virtual receptor by use of models which exhibit stability or compensate for noise.
- One such model is a Bayesian regularised artificial neural network (BRANN).
- Another model is Maximum Entropy Method (MEM).
- Bayesian regularisation (MacKay, 1992)
- the present aspect may also be used to screen databases or chemical libraries of real, synthesised compounds derived using the concept of combinatorial chemistry.
- The Screening Process is predicated on the discovery that by creating a "virtual receptor" first, and then using this virtual receptor to screen compound libraries, it is possible to test, in a "virtual" environment, the compatibility of each compound being screened to the virtual receptor.
- this aspect provides a method of screening a range of compounds, including:
- a preferred measure that may be determined is the binding affinity of the compounds or other biological activity.
- If a given compound contains certain structural features (i.e. conforms to a pharmacophore) there is a high likelihood of the compound having a particular biological activity. Due to the screening being done in a "virtual" environment, the need to synthesise a large number of compounds is avoided. The number of compounds synthesised is reduced to those predicted as being suitable in the "virtual" environment, and which also have a higher likelihood of being verified in the real world.
- the virtual receptor is continually modified, in order to improve its prediction abilities, based on compounds located in database scans that have proved to in fact exhibit the characteristics sought.
- this "virtual environment" is a neural network in a computer environment.
- Hardware implementations of neural nets are also possible (and may be preferable once a virtual receptor of a given type is defined and large databases are to be screened).
- a mutation operator determines that, with some low probability, a portion of the new individuals will have some of their bits flipped.
- In a crossover operation, two individuals are chosen from the population using a selection operator.
- This aspect provides for the use of mutation and cross-over strategies as applied to SMILES strings, in order to modify the behaviour of the SMILES string as applied to a compound screening system.
- a virtual receptor is dependent on the quality of the molecular representation used to develop it.
- the quality of the virtual receptor is also dependent on the quality of the training data and possibly on the architecture of the neural net.
- the numerical representation of the compound being analysed adequately represents the steric, electronic and lipophilic properties of the whole molecule.
- MMM: molecular multipole moment
- the further aspect of the invention is an additional type of molecular representation. It involves the generation of useful molecular descriptors from eigenvalues of adjacency, or modified adjacency, matrices in which the diagonal elements are values relating to steric, electrostatic or lipophilic properties of the constituent atoms of the compounds. In a preferred embodiment it is envisaged that eigenvalues of three matrices (one each of steric, electrostatic, and lipophilic-related properties) would be generated.
- the steric diagonal elements of the adjacency, or modified adjacency, matrices could be the van der Waals radii of the atoms; the electrostatic diagonal matrix elements could be the atom charges derived from empirical or molecular orbital calculations; and the lipophilic diagonal matrix elements could be the atomistic lipophilicities referred to in the section above on molecular multipole moments.
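A minimal sketch of the eigenvalue descriptors described above, assuming NumPy is available. The function name and input format are hypothetical, and the radii used in the example (1.70 Å for carbon, 1.52 Å for oxygen) are commonly tabulated van der Waals values chosen for illustration, not values taken from the patent.

```python
import numpy as np

def eigenvalue_descriptor(bonds, diagonal):
    """Eigenvalues of an adjacency matrix with property values on the diagonal.

    bonds:    list of (i, j) atom index pairs (assumed input format).
    diagonal: one property value per atom, e.g. a van der Waals radius,
              an atom charge, or an atomistic lipophilicity.
    """
    n = len(diagonal)
    matrix = np.zeros((n, n))
    for i, j in bonds:
        matrix[i, j] = matrix[j, i] = 1.0   # off-diagonal: connectivity
    np.fill_diagonal(matrix, diagonal)      # diagonal: atomic property
    return np.linalg.eigvalsh(matrix)       # symmetric matrix, ascending order

# Heavy-atom skeleton C-C-O with illustrative van der Waals radii.
bonds = [(0, 1), (1, 2)]
steric = eigenvalue_descriptor(bonds, [1.70, 1.70, 1.52])
```

Calling the same function three times, with steric, electrostatic and lipophilic values on the diagonal, would give the three sets of eigenvalues envisaged in the preferred embodiment.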
- Figure 1 shows a set of data used in an example
- Figure 2 illustrates an example size of training, validation sets and number of networks generated
- Figure 3A illustrates an example measure of the predictive ability of a network
- Figure 3B illustrates the B5 representation
- Figure 3C illustrates a summary of the A1 representation
- Figure 4A illustrates a sample output from a 23:2:1 neural network using the B2 representation as input
- Figure 4B illustrates a sample output from an 11:4:1 neural network using the B3 representation as input
- Figure 4C illustrates a sample output from an 11:4:1 neural network using the A1 representation as input
- Figure 5 shows an optimal architecture
- Figure 6 shows results for example 3
- Figure 7 shows a sample output from a 21:8:5:3:1 network
- Figure 8 shows a comparison of neural network and MLR
- Figure 10 shows an example flowchart of a genetically-evolved lead generation system as disclosed in accordance with the further disclosed 'fitness function' invention
- Figure 11 illustrates a summary of the genetic algorithm
- Figure 12 illustrates an example mutation operator
- Figure 13 illustrates an example cross-over operator
- Figure 14 illustrates an overall concept flowchart for virtual receptor generation.
- Figure 15 illustrates a virtual screening flowchart showing use of a virtual receptor to predict properties of library members; the library can be real or virtual.
- Figure 16 illustrates a genetically evolved chemical library overview flowchart.
- Figure 17 illustrates a genetically-evolved chemical library detailed flowchart showing the role of fitness functions and specific examples of SMILES mutation.
- Figure 18 illustrates a flowchart of improved multipole moment molecular representation generation.
- Figure 19 illustrates a flowchart for generation of improved eigenvalue indices as molecular representations.
- Figure 20 illustrates Muscarinic virtual receptor training, observed versus calculated scaled log (activity) for training set (examples).
- Virtual analogue of a receptor ("virtual receptor")
- a method of screening a range of compounds which includes (a) creating a virtual or mathematical analogue of a biological receptor or active site (a "virtual receptor") and
- a preferred measure that may be determined is the binding affinity of the compounds or other biological activity.
- This method of creating a virtual receptor may also be used to scan a range of compounds and provide a measure indicative of whether the compounds are likely to exhibit a particular characteristic, in which case it includes the steps of: compiling a data set of compounds which exhibit the known characteristic; forming a conceptual structure/activity model with a given architecture; converting the data set into a representation readable by the conceptual model; training the conceptual model on at least a portion of the converted data set in order to improve the architecture of the conceptual model.
- the data input to the virtual receptor is a molecular representation of the compounds which consider the entire molecule and embody relevant properties such as steric, electronic and lipophilic properties.
- a preferred output of a virtual receptor that may be determined is the binding affinity of the compounds or other biological activity.
- Artificial Neural Networks
- Virtual receptors can be generated by a number of different methods, many of which rely essentially on regression in one form or another. A particularly useful way of deriving a virtual receptor is to use a mathematical concept called an artificial neural net. Artificial neural networks (ANNs) provide an improved platform from which to predict the behaviour of molecules. Several advantages in using neural networks are that they are fast, they do not rely on subjective judgements as to the form of the functional relationships between structure and activity to be provided, and they process numerous parameters simultaneously. In addition, they are robust and capable of producing reasonable results even when the data is noisy. The prime advantage of using neural networks over other known methods, however, lies in their ability to internally process complex non-linear relationships.
- ANNs are mathematical models, based loosely on the way biological neural networks process information.
- ANNs consist of layers of artificial neurones (or neurodes). Each neurode has numerous inputs (x1, x2, …), each of which is modified by a weight (w1, w2, …). These inputs are summed on entry to the neurode. This net input is then modified by an internal transfer function.
- the output of the internal transfer function forms the output of the neurode, which is either passed on as the input for other neurodes or as an output carrying a result.
- ANNs can take many forms, such as single layer, multi-layer, feed forward and lateral connectivity.
- the layers of neurodes may be fully or partially connected.
- a full connection is where the output of a neurode is passed onto each neurode in the next layer, whereas in a partial connection the output is transferred only to selected neurodes.
- An example of a three layered 4:3:1 ANN architecture is shown below:
- the output of an ANN depends upon numerous factors, namely the nature of the neurodes' transfer functions, the architecture of the network and the weights connecting the neurodes. Of these factors, the weights connecting the neurodes are most easily altered.
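The neurode computation and the fully connected 4:3:1 architecture described above can be sketched as below. The sigmoid transfer function, the absence of bias terms, and all weight values are assumptions made for illustration; the text does not fix any of them.

```python
import math

def sigmoid(x):
    """Assumed transfer function; others are equally possible."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, transfer=sigmoid):
    """weights[k] holds the weights on the inputs feeding neurode k.

    Each neurode sums its weighted inputs and applies the transfer function.
    """
    return [transfer(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs, hidden_weights, output_weights):
    """Fully connected pass: input layer -> hidden layer -> output layer."""
    hidden = layer_output(inputs, hidden_weights)
    return layer_output(hidden, output_weights)

# 4:3:1 network: 4 inputs, 3 hidden neurodes, 1 output neurode.
hidden_w = [[0.1, -0.2, 0.3, 0.0],
            [0.0, 0.5, -0.5, 0.2],
            [-0.3, 0.1, 0.1, 0.4]]
output_w = [[0.7, -0.6, 0.2]]
result = forward([1.0, 0.5, -1.0, 2.0], hidden_w, output_w)
```

A partially connected layer could be expressed in the same form by fixing selected weights at zero.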
- the ANN as a whole is trained so that it is capable of recognising the important characteristics in molecules that may mean that they exhibit a
- the representations of molecules, with known properties are repeatedly input to the ANN.
- the ANN is then modified by adjusting the weights connecting the neurodes until the error between its outputs and the correct outputs is minimised.
- the method used to adjust the weights in the process of training the ANN is called "the learning rule" and may be supervised or unsupervised. Back propagation is an example of a supervised learning rule.
- Back propagation is a gradient descent algorithm.
- the network error may be considered a function of the network weights.
- Back propagation minimises the average squared error between the network output and the "correct answer" by moving down the gradient of this error function.
- the network weights are altered according to the Delta Rule (also known as the Least Mean Squared Rule).
- the output is compared with the desired result, and a proportion of this error determined is then propagated back through the network, with the network weights modified accordingly.
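As a hedged illustration of the weight update described above, the sketch below applies the Delta Rule to a single linear neurode. It is a deliberate simplification, not full multi-layer back propagation: there is no hidden layer and no transfer function, and the learning rate, epoch count and training data are hypothetical.

```python
def train_delta_rule(samples, n_inputs, rate=0.05, epochs=500):
    """Least-mean-squares training of one linear neurode.

    samples: list of (inputs, target) pairs.
    After each sample, every weight moves by rate * error * input,
    i.e. a step down the gradient of the squared error.
    """
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, target in samples:
            output = sum(w * x for w, x in zip(weights, inputs))
            error = target - output   # desired result minus actual output
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights

# Toy data generated from target = 2*x1 - 1*x2 (hypothetical).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights = train_delta_rule(samples, n_inputs=2)
```

In a multi-layer network the same idea is applied layer by layer, with a proportion of the output error propagated back to the hidden neurodes.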
- the number of neurodes in the input layer and the output layer will be determined by the number of input parameters and the number of outputs respectively. However, ascertaining the optimal number of hidden layers (the layers between the input and the output layers) and the
- DOE freely rotatable bonds in the molecule
- the compounds may be represented in terms of simple molecular structural parameters, such as constituent atoms or functional groups.
- An advantage that stems from the inventive method using an atomistic representation is that it allows compounds to be screened with no more knowledge than is provided by counting molecular fragments.
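- topological indices, Burden's chemically intuitive molecular index (CIMI), and/or molecular hologram representation of Tripos Assoc. may be used as compound descriptors. Additional novel representations which form further aspects of the invention are exemplified in the sections following.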
- the output generated by the virtual receptor upon screening a range of compounds would indicate which compounds have the highest likelihood of forming the basis of new lead compounds.
- the most novel of these could also be used to synthesise biased combinatorial libraries of organic compounds for screening in pharmacological receptor assays.
- the use of a neural network to map structure to activity results in superior models to the use of linear methods such as MLR or PLS. This reflects the presence of non-linear relationships between structural parameters and activity, and interactions between the descriptors.
- the ability of neural networks to account for these relationships is an advantage in virtual receptor generation.
- the inventive concept involves the creation of a virtual receptor by training the receptor using compounds with known properties. Once a virtual receptor has been created based on a particular molecular or mathematical representation of the compounds, all future compounds that are used as input to that receptor must also be represented in the particular molecular or mathematical representation used in the training of the receptor.
- Regression is an "ill-posed" problem in statistics, which sometimes results in structure-activity models exhibiting instability when trained with noisy data.
- Regression methods, including back propagation neural nets, also face additional problems. Principal amongst these are overtraining, overfitting, and selection of the best QSAR model from a number obtained in the validation process. Overtraining results from running the neural network training for too long and results in a loss of ability of the trained net to generalise. Overtraining can be avoided by use of a validation set.
- Cross-validation, which provides a good test for the predictive capabilities of a network, also provides assistance in determining the optimal neural net architecture.
- Cross-validation involves running a data set through a network numerous times until all data points have been in both the training and the validation sets.
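One common way to arrange the data so that every point appears in both the training and the validation sets is k-fold splitting, sketched below. The choice of k-fold, the function name and the interleaved fold assignment are assumptions; the text does not say which cross-validation scheme is used.

```python
def k_fold_splits(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds.

    Every sample is held out for validation exactly once and is used
    for training in the remaining k - 1 runs.
    """
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]   # interleaved assignment
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out
                    for i in fold]
        yield training, validation

splits = list(k_fold_splits(10, 5))
```

A network would be trained on each training list and scored on the matching validation list, and the scores compared across candidate architectures.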
- MML: Minimum Message Length
- MEM: Maximum Entropy Method
- A Bayesian regularised artificial neural network may be better suited to virtual receptor calculations than other regression methods. Neural network training can be regularised, a mathematical process which converts the regression into a well-behaved "well-posed" problem and overcomes model instability. Bayes' theorem provides the correct language for describing the inference of a message communicated over a noisy channel. In structure-activity models the 'noise' corresponds to experimental error, poor choice of molecular representations etc. The SAR 'message' corresponds to a useful, valid structure-activity model (or virtual receptor). Where orthodox statistics provide several models with several different criteria for deciding which model is best, Bayesian statistics offers only one answer to a well-posed problem.
- Another aspect of the invention, which may be referred to as a Virtual Screening Process, and one embodiment of which is illustrated in Figure 15, is predicated on the discovery that by creating a "virtual receptor" first, and then using this virtual receptor to screen compound libraries, it is possible to test, in a "virtual" environment, the compatibility of each compound being screened to the virtual receptor. If a given compound contains certain structural features (i.e. conforms to a pharmacophore) there is a high likelihood of the compound having a particular biological activity. Due to the screening being done in a "virtual" environment, the need to synthesise a large number of compounds is avoided. The number of compounds synthesised is reduced to those predicted as being suitable in the "virtual" environment, and which also have a higher likelihood of being verified in the real world.
- the virtual receptor is continually modified, in order to improve its prediction abilities, based on compounds located in database scans that have proved to in fact exhibit the characteristics sought.
- a preferred form of this "virtual environment" is a neural network in a computer environment.
- Hardware implementations of neural nets are also possible (and may be preferable once a virtual receptor of a given type is defined and large databases are to be screened).
- 4. Genetic evolution of structures using virtual receptors as fitness functions
- Additional aspects of the invention include the use of virtual receptors as fitness functions, and the discovery of efficient methods of mutating chemical structures to span as much of combinatorial space as possible.
- the aspect of the invention involving mutation strategies is discussed in the next section.
- each structure is mutated by means of single point mutations, insertions, deletions and crossovers, to generate another population of structures for testing against the fitness function represented by a virtual receptor and possibly others such as ease of synthesis, toxicity etc.
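The four operations named above (single point mutation, insertion, deletion and crossover) can be sketched at the character level on SMILES-like strings. This is a simplified illustration only: it assumes linear strings made of single-letter atom symbols, the symbol alphabet is hypothetical, and the resulting strings are not guaranteed to be chemically valid, so they would still need to be checked by a valence fitness function.

```python
import random

ALPHABET = ["C", "N", "O", "S", "F"]   # illustrative atom symbols only

def point_mutation(smiles, rng):
    """Replace one character with a randomly chosen atom symbol."""
    i = rng.randrange(len(smiles))
    return smiles[:i] + rng.choice(ALPHABET) + smiles[i + 1:]

def insertion(smiles, rng):
    """Insert a randomly chosen atom symbol at a random position."""
    i = rng.randrange(len(smiles) + 1)
    return smiles[:i] + rng.choice(ALPHABET) + smiles[i:]

def deletion(smiles, rng):
    """Remove one character, never shrinking below a single atom."""
    if len(smiles) < 2:
        return smiles
    i = rng.randrange(len(smiles))
    return smiles[:i] + smiles[i + 1:]

def crossover(a, b, rng):
    """Swap tails at independently chosen cut points (needs len >= 2)."""
    i = rng.randrange(1, len(a))
    j = rng.randrange(1, len(b))
    return a[:i] + b[j:], b[:j] + a[i:]

rng = random.Random(0)
mutated = point_mutation("CCO", rng)
longer = insertion("CCO", rng)
shorter = deletion("CCO", rng)
child_a, child_b = crossover("CCCC", "NNNN", rng)
```

Operators that respect branches, ring closures and bond symbols would have to work on parsed SMILES tokens rather than raw characters.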
- Examples of library evolution are shown in Figures 10, 16 and 17.
- the aspect considered unique to the approach is that the mutated structures, together with a suitably defined fitness function and evolutionary process, such as a genetic algorithm or other types of genetic programs, can be used to explore very large areas of combinatorial space and generate lead structures likely to be active at the specified receptor.
- the algorithm starts with an initial population of these individuals.
- the fitness of each is evaluated to determine how well it solves the problem.
- the characteristics of each individual in the initial population are generated randomly.
- two individuals are selected from the population. This is done so that the individuals that are more fit are more likely to be selected.
- the two selected individuals can be considered to be "parents".
- two new individuals ("children") are created that are recombinations of the genes from the parents.
- the process of creating the children is called "crossover".
- Some combination of the parents and children are then passed to the "next generation".
- the selection and crossover steps are repeated until the number of individuals in the next generation is the same as that in the current generation. That is where mutation comes in.
- a selection operator is usually used to select which members of an evolving population will be involved in crossover or other mutations. In human terms this may be analogous to selection processes which favour the most powerful male mating with the most desirable female. In this application to lead discovery, selection operators choose which two or more molecules will be involved in crossover or other mutations. These operators may be: selecting the best and second best molecules for crossover; or
- a selection operator is used to give preference to better individuals, allowing them to pass on their genes to the next generation.
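The generation loop described above (fitness-weighted selection of two parents, crossover into two children, low-probability bit flips) can be sketched on bit-string individuals. All parameter values are hypothetical, fitness-proportional selection is only one possible selection operator, and the example fitness (the number of set bits) is a stand-in for a real fitness function such as a virtual receptor.

```python
import random

def select(population, fitness, rng):
    """Fitness-proportional selection: fitter individuals are more likely."""
    weights = [fitness(ind) + 1e-9 for ind in population]  # avoid all-zero
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(a, b, rng):
    """Single-point crossover producing two children."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(individual, rate, rng):
    """Flip each bit with a low probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in individual]

def next_generation(population, fitness, rng, mutation_rate=0.01):
    """Repeat selection and crossover until the new population is full."""
    children = []
    while len(children) < len(population):
        p1 = select(population, fitness, rng)
        p2 = select(population, fitness, rng)
        for child in crossover(p1, p2, rng):
            children.append(mutate(child, mutation_rate, rng))
    return children[:len(population)]

def evolve(fitness, n_bits=16, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population = next_generation(population, fitness, rng)
    return max(population, key=fitness)

best = evolve(fitness=sum)   # stand-in fitness: count of 1 bits
```

This variant replaces the whole population each generation; passing some parents through unchanged, as the text allows, would be a small change to `next_generation`.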
- the goodness of each individual depends on its fitness, which may be determined by an objective function or by a subjective judgement.
- a 'global' fitness function may involve either a weighted average of some or all of component functions, or some of the fitness criteria may be applied sequentially.
- An example of the sequential application is that all members of the evolving population(s) may have their fitness evaluated against the chemical valence fitness function (to eliminate nonsense compounds) and then be evaluated for biological activity fitness via the virtual receptor.
- the most active molecules as determined by the virtual receptor fitness function may then be 'filtered' for toxicity or some other property.
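The two ways of combining fitness criteria described above, a weighted average and sequential application, can be sketched as follows. The component functions passed in the example are toy stand-ins invented for illustration; in practice they would be a valence check, a virtual receptor and a toxicity model.

```python
def weighted_fitness(candidate, functions, weights):
    """Global fitness as a weighted average of component functions."""
    total = sum(weights)
    return sum(w * f(candidate) for f, w in zip(functions, weights)) / total

def sequential_filter(candidates, valence_ok, activity, toxicity_ok, keep=2):
    """Apply criteria in sequence: valence, then activity, then toxicity."""
    valid = [c for c in candidates if valence_ok(c)]          # drop nonsense
    most_active = sorted(valid, key=activity, reverse=True)[:keep]
    return [c for c in most_active if toxicity_ok(c)]         # final filter

# Toy stand-ins (hypothetical): strings starting with "X" are "invalid",
# longer strings are "more active", strings containing "N" are "toxic".
candidates = ["CCO", "XCC", "CCCCN", "CCCO", "C"]
selected = sequential_filter(
    candidates,
    valence_ok=lambda c: not c.startswith("X"),
    activity=len,
    toxicity_ok=lambda c: "N" not in c,
)
score = weighted_fitness("CCO", [len, lambda c: c.count("O")], [1.0, 3.0])
```

Sequential application keeps the cheap checks first, so the more expensive activity prediction is only run on structures that pass them.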
- fitness functions may be exemplified by some of the following types (not an exhaustive list):
- a valence function which determines whether the structure represented by the chromosome obeys the laws of chemical bonding and valence.
- a stability function which eliminates chemically unstable or extremely difficult to synthesise structures such as peroxides, or large numbers of chiral centres. This could be derived from a lookup table of undesirable functional groups.
- a safety function which rates the structures represented by the chromosomes in terms of likely toxicity. For example, nitrogen mustards, alkylating agents etc would be eliminated. This could be derived from structure-activity models in a similar way to the Topkat commercial software.
- a biological activity function This would be implemented via the virtual receptor concept as disclosed above. It is most likely implemented as a neural network model.
- a molecular diversity function The evolutionary algorithms used in this invention have a stochastic element which ensures a degree of molecular diversity. However, another fitness function would be used which ensures that, for example, no individual in the population has a greater than 85% similarity to the others. This function may also screen out molecular redundancies.
- the fitness function may determine whether combinatorial methods may be adapted to be used in the synthesizing compounds for screening.
- “pharmacokinetic efficiency” fitness This is a measure of how well the molecule is transported from its site of entry to the site of action. A simple example of this may be whether a CNS active drug can penetrate the blood-brain barrier.
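As an illustration of the molecular diversity function in the list above, the sketch below drops any individual that is more than 85% similar to one already kept. The text does not say how similarity is measured; the Tanimoto coefficient on sets of structural features is assumed here, and the feature names are hypothetical.

```python
from typing import FrozenSet, List


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto similarity of two feature sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def diversity_filter(population: List[FrozenSet[str]],
                     max_similarity: float = 0.85) -> List[FrozenSet[str]]:
    """Greedily keep individuals that are not too similar to any kept so far."""
    kept: List[FrozenSet[str]] = []
    for features in population:
        if all(tanimoto(features, other) <= max_similarity for other in kept):
            kept.append(features)
    return kept


# Hypothetical feature sets: mol_b shares 7 of 8 features with mol_a (0.875).
mol_a = frozenset("abcdefg")
mol_b = frozenset("abcdefgh")
mol_c = frozenset("xy")
diverse = diversity_filter([mol_a, mol_b, mol_a, mol_c])
```

The greedy pass is order dependent: whichever of two near-duplicates appears first is the one retained, so in practice the population might be sorted by fitness before filtering.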
- A further aspect of the invention is based on the concept of using evolutionary modification of compound structures whereby the calculated activity from the Virtual Screening Process is used as a measure of the 'fitness' of a chemical structure for performing a particular function.
- The better compounds, or a predetermined group of compounds, can be selected based on the 'fitness' or a range of 'fitness' as base structures for subsequent genetic modification.
- 'Fitness' may be considered as an assessment of a compound exhibiting survival of the fittest in a genetic algorithm.
- Optimisation provides a 'fitness function'.
- The fitness function is used to evaluate the "fitness", or superiority, of one member of a population over another by some definable criteria.
- The fitness function is the mathematical embodiment of the criteria used to define the "fitness" of one chemical compound over another.
- The criteria can be set according to the particular result required or outcome hoped for. Variations and additions of the inventions disclosed are possible within the general inventive concept as will be apparent to those skilled in the art.
5. Mutating structures by modifying a SMILES string
Mutation Strategies
- The mutation operator determines that, with some low probability, a portion of the new individuals will have some of their bits flipped.
- An example is shown in Figure 12.
- Bit string: there is a relationship between the bit string and a molecular structure, which is usually 1:1 (except in some cases where optical or geometric isomers are not accounted for). It may be noted that molecular structures may not literally be represented by bit strings but the same operations and logic which apply to bit strings in the general discussion of genetic algorithms will also apply to other representations of molecules. It should be possible, for example, to use the SMILES string to represent a molecule, then alter this by symbol substitution, addition, fragment insertion or deletion etc. to produce evolved structures via the genetic algorithm and the fitness function. Mutation alone induces a random walk through the search space. Mutation and selection (without crossover) create a parallel, noise-tolerant, hill-climbing algorithm.
- The crossover operation happens in an environment where the selection of who gets to mate is a function of the fitness of the individual, i.e. how good the individual is at competing in its environment.
- Some genetic algorithms use a simple function of the fitness measure to select individuals (probabilistically) to undergo genetic operations such as crossover or asexual reproduction (the propagation of genetic material unaltered). This is fitness-proportionate selection.
- Other implementations may use a model in which certain randomly selected individuals in a subgroup compete and the fittest is selected. This is called tournament selection and is the form of selection we see in nature when stags rut to vie for the privilege of mating with a herd of hinds.
- The two processes that are considered to contribute most to evolution are crossover and fitness-based selection/reproduction. As it turns out, there are mathematical proofs that indicate that the process of fitness-proportionate reproduction is, in fact, near optimal in some senses.
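The two selection schemes named above, fitness-proportionate selection and tournament selection, might be sketched as below. The function names and the tournament size are illustrative assumptions; fitness values are assumed to be non-negative with a positive sum.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def fitness_proportionate_select(population: Sequence[T],
                                 fitnesses: Sequence[float],
                                 rng: random.Random) -> T:
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point rounding at the upper end


def tournament_select(population: Sequence[T],
                      fitnesses: Sequence[float],
                      rng: random.Random,
                      size: int = 3) -> T:
    """Draw `size` random contenders and return the fittest of them."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


rng = random.Random(1)
molecules = ["CCO", "CCN", "CCS"]
parent_a = fitness_proportionate_select(molecules, [0.2, 0.5, 0.3], rng)
parent_b = tournament_select(molecules, [0.2, 0.5, 0.3], rng, size=2)
```

Fitness-proportionate selection is sensitive to the scale of the fitness values, while tournament selection depends only on their rank order within each subgroup.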
- The choice of which mutation operator is carried out on a given member of the chemical population can be decided randomly, e.g. by use of a number wheel algorithm.
- Insertion mutations involve randomly selecting a character position in the string and inserting one or more chemically parsable text strings at that position.
- The string to insert could, for example, be chosen randomly from a large lookup table of SMILES strings. Some of the strings in the lookup table, or produced by another selection process which derives the string to be substituted, could be enclosed in brackets. In this case the insertion results in a branching of the new string from the old. Strings inserted without these enclosing brackets would be incorporated into the original molecule without branching.
- Original string: CCCCCC; mutated string: CCCSCCC (chain insertion).
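The insertion mutation just described can be sketched as a plain string operation. The fragment table below is a hypothetical stand-in for the large lookup table mentioned in the text, and the sketch assumes a simple chain SMILES; for general SMILES a random character position can fall inside a multi-character token such as `Cl`, so the result would still need to pass a validity (valence) check.

```python
import random
from typing import Sequence

# Hypothetical lookup table. Bracketed fragments create a branch,
# unbracketed fragments are incorporated into the chain.
FRAGMENTS = ["S", "O", "N", "(C)", "(=O)", "(CC)"]


def insert_fragment(smiles: str, position: int, fragment: str) -> str:
    """Insert `fragment` into `smiles` before the character at `position`."""
    return smiles[:position] + fragment + smiles[position:]


def insertion_mutation(smiles: str,
                       rng: random.Random,
                       fragments: Sequence[str] = FRAGMENTS) -> str:
    """Insert a randomly chosen fragment at a random interior position."""
    position = rng.randrange(1, len(smiles))  # interior only, so a branch follows an atom
    return insert_fragment(smiles, position, rng.choice(fragments))


chain_insertion = insert_fragment("CCCCCC", 3, "S")     # the example from the text
branch_insertion = insert_fragment("CCCCCC", 3, "(C)")  # bracketed fragment gives a branch
random_mutant = insertion_mutation("CCCCCC", random.Random(0))
```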
- Each structure is mutated by means of single point mutations, insertions, deletions and crossovers, to generate another population of structures for testing against the fitness function represented by a virtual receptor and possibly others such as ease of synthesis, toxicity etc. as outlined above.
- The novelty of the approach is that the mutated structures, together with a suitably defined fitness function and a genetic algorithm, can be used to explore very large areas of combinatorial space and generate lead structures likely to be active at the specified receptor.
- The quality of a virtual receptor is dependent on the quality of the molecular representation used to develop it.
- The quality of the virtual receptor is also dependent on the quality of the training data and possibly on the architecture of the neural net.
- The numerical representation of the compound being analysed adequately represents the steric, electronic and lipophilic properties of the whole molecule.
- MMM molecular multipole moment
- MMM descriptors relating solely to molecular shape are the three principal moments of inertia, Ix, Iy, Iz.
- The two descriptors that relate solely to charge are the magnitude of the dipole moment, p, and the magnitude of the principal quadrupole moment, Q. Descriptors that relate to shape and charge can be developed in a number of different ways.
- One example is by calculating the magnitudes of the dipolar components and the magnitudes of the components of displacement between the centre-of-mass and centre-of-dipole with respect to the principal inertia axes, to provide the descriptors px, py, pz and dx, dy, dz.
- Quadrupolar components are calculated with respect to a translated inertial reference frame whose origin coincides with the centre-of-dipole, providing two additional descriptors Qxx and Qyy. This set of thirteen numbers is independent of the orientation and position of the molecules in three-dimensional space (see B. D. Silverman and Daniel E. Platt, "Comparative Molecular Moment Analysis (CoMMA): 3D-QSAR without Molecular Superposition", J. Med. Chem., 39(11), 2129-2140, 1996).
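To make the orientation and position independence concrete, the sketch below computes four of the thirteen descriptors (the principal moments of inertia Ix, Iy, Iz and the dipole magnitude p) for a point-mass, point-charge model and checks that they are unchanged after the molecule is rotated and translated. The water-like geometry and the partial charges are illustrative assumptions, and the closed-form 3x3 eigenvalue routine is simply one convenient way to diagonalise the inertia tensor.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def symmetric_3x3_eigenvalues(a: List[List[float]]) -> List[float]:
    """Eigenvalues of a real symmetric 3x3 matrix (trigonometric closed form)."""
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    if p2 < 1e-30:
        return [q, q, q]
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2.0))
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return sorted([smallest, 3.0 * q - largest - smallest, largest])


def principal_moments(masses: Sequence[float], coords: Sequence[Vec]) -> List[float]:
    """Principal moments of inertia about the centre of mass, ascending."""
    total = sum(masses)
    com = [sum(m * r[k] for m, r in zip(masses, coords)) / total for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for m, r in zip(masses, coords):
        d = [r[k] - com[k] for k in range(3)]
        r2 = sum(x * x for x in d)
        for i in range(3):
            for j in range(3):
                tensor[i][j] += m * ((r2 if i == j else 0.0) - d[i] * d[j])
    return symmetric_3x3_eigenvalues(tensor)


def dipole_magnitude(charges: Sequence[float], coords: Sequence[Vec]) -> float:
    """|sum of q_i * r_i|; origin independent when the net charge is zero."""
    vec = [sum(q * r[k] for q, r in zip(charges, coords)) for k in range(3)]
    return math.sqrt(sum(x * x for x in vec))


# Illustrative water-like model (coordinates in angstrom, assumed partial charges).
masses = [15.999, 1.008, 1.008]
charges = [-0.8, 0.4, 0.4]
water = [(0.0, 0.0, 0.0), (0.757, 0.586, 0.0), (-0.757, 0.586, 0.0)]
c, s = math.cos(math.radians(30.0)), math.sin(math.radians(30.0))
moved = [(x * c - y * s + 1.0, x * s + y * c - 2.0, z + 0.5) for x, y, z in water]

moments_original = principal_moments(masses, water)
moments_moved = principal_moments(masses, moved)
```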
- A lipophilic analogue of the steric and electrostatic multipole expansions may be derived by ascribing atomistic lipophilic values to each type of atom found in molecules. We did this by carrying out multiple regression analysis on a series of structures with known lipophilicities.
- A further aspect of the invention is an additional type of molecular representation. It is possible to describe the topographical relationships between atoms contained in a given molecular structure by means of connectivity or adjacency matrices. In general the diagonal elements of these matrices are zero and the off-diagonal elements are unity only if the two atoms represented by the location of the matrix element are connected. Useful molecular representations may be derived from the eigenvalues of a modification of these matrices as first described by Burden (J. Chem. Inf. Comput. Sci., 29, 225 (1989)).
- Eigenvalues of three matrices are generated.
- The steric diagonal elements of the adjacency, or modified adjacency, matrices could be the Van der Waals radii of the atoms;
- the electrostatic diagonal matrix elements could be the atom charges derived from empirical or molecular orbital calculations; and
- the lipophilic diagonal matrix elements could be the atomistic lipophilicities referred to in the section above on molecular multipole moments.
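A minimal sketch of the eigenvalue representation described above: build an adjacency matrix with unit off-diagonal elements for bonded atoms, place an atomic property on the diagonal, and take the eigenvalues. It follows the wording of this text rather than the exact weighting scheme of Burden's paper. The Jacobi routine is a generic symmetric eigenvalue solver chosen to keep the example dependency free, and the three-atom C-C-O skeleton with Bondi van der Waals radii is an illustrative input.

```python
import math
from typing import List, Sequence, Tuple


def modified_adjacency(n_atoms: int,
                       bonds: Sequence[Tuple[int, int]],
                       diagonal: Sequence[float]) -> List[List[float]]:
    """Adjacency matrix with an atomic property on the diagonal."""
    m = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i, value in enumerate(diagonal):
        m[i][i] = value
    for i, j in bonds:
        m[i][j] = m[j][i] = 1.0
    return m


def jacobi_eigenvalues(matrix: List[List[float]],
                       max_sweeps: int = 100,
                       tol: float = 1e-18) -> List[float]:
    """Eigenvalues of a real symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


VDW_RADII = {"C": 1.70, "O": 1.52}  # Bondi radii in angstrom
atoms = ["C", "C", "O"]             # heavy-atom skeleton, C-C-O
bonds = [(0, 1), (1, 2)]
steric_matrix = modified_adjacency(len(atoms), bonds, [VDW_RADII[s] for s in atoms])
steric_eigenvalues = jacobi_eigenvalues(steric_matrix)
```

The electrostatic and lipophilic matrices would be built the same way, swapping the diagonal for atomic charges or atomistic lipophilicities.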
- Benzodiazepine receptor BZR
- GABAA γ-aminobutyric acid receptor
- The ANN used for this experiment had full connectivity, with the input layer of neurodes having linear transfer functions and all other layers of neurodes having sigmoidal transfer functions.
- The following neural network parameters were used:
- Training patterns input noise: Gaussian (mean: 0; standard deviation: 0.02)
- The network calculations were performed using a commercial software package, Propagator; however, any neural network package could be used.
- The input data was scaled between 0 and 1, as it is between these values that the sigmoidal transfer functions range.
- Output data was also scaled appropriately.
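The pre-processing and network form described above can be sketched as follows: inputs are min-max scaled to the range 0 to 1, Gaussian noise (mean 0, standard deviation 0.02) is added to training inputs, and a fully connected network with a linear input layer and sigmoidal hidden and output neurodes maps the inputs to a single output. The weights, layer sizes and input values are arbitrary placeholders, not trained parameters from the experiment.

```python
import math
import random
from typing import List, Sequence


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Scale one input variable to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(x - low) / (high - low) for x in column]


def add_input_noise(inputs: Sequence[float], rng: random.Random, sd: float = 0.02) -> List[float]:
    """Add zero-mean Gaussian noise to a training pattern."""
    return [x + rng.gauss(0.0, sd) for x in inputs]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs: Sequence[float],
            hidden_weights: Sequence[Sequence[float]],
            hidden_biases: Sequence[float],
            output_weights: Sequence[float],
            output_bias: float) -> float:
    """Linear input layer, sigmoidal hidden layer, one sigmoidal output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Placeholder example: two input variables, two hidden neurodes.
scaled_column = min_max_scale([2.0, 4.0, 6.0])
noisy_pattern = add_input_noise([0.3, 0.7], random.Random(0))
prediction = forward(noisy_pattern,
                     hidden_weights=[[0.5, -0.4], [0.1, 0.9]],
                     hidden_biases=[0.0, -0.2],
                     output_weights=[1.2, -0.7],
                     output_bias=0.1)
```

Because the output neurode is sigmoidal, the target activities also have to be scaled into the same range, which is why the output data are scaled as well.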
- The data set used was a set of 57 1,4-benzodiazepin-2-ones. This data set was chosen because the activity of its compounds in relation to the receptor is known. The molecular representations of this data set that were employed are shown in Figure 1, while the size of training and validation sets and the number of networks generated during cross-validation are shown in Figure 2.
- Initial representations B1 and B2, which were based heavily on an atomistic approach, provided positional information; for example, separate input parameters were provided for C4 atoms at positions 7, 1 and 3.
- The representation comprised 25 input variables.
- In representation B2 the number of input parameters was slightly reduced by treating the halogens as being of the same element, "Hal".
- No positional information was provided: the neural network would not be told whether a C4 atom was attached to position 7, 1 or 3.
- The representation B4 differs from B3 in that it does not distinguish between the halogens.
- SEP Standard Error of Prediction
- The SEPval (which provides a measure of the predictive ability of the network) obtained from the two architectures used is shown in Figure 3A.
- A sample output from a 23:2:1 neural network using the B2 representation as input is shown in Figure 4A, and the sample output from an 11:4:1 neural network using the B3 representation is shown in Figure 4B.
- MLR Multiple Linear Regression
- MLR identified four linearly significant variables - C4, N3,
- Figure 1 As A1. Due to this representation not being positionally dependent, the number of input parameters is much lower than the positionally dependent representations B1 and B2. Consequently, greater freedom is afforded in the architectures that can be devised.
- The results for the A1 representation are summarised in Figure 3C, whilst Figure 4C shows a typical output using the representation.
- A data set consisting of 321 compounds was compiled from the literature. These were broken up into two sets: 21 compounds formed a test set, and the remaining 300 compounds formed the basis of training and validation sets. Training sets consisted of 270 compounds, validation sets consisted of 30 compounds. Thus, cross validation involved the generation of 10 training and validation set pairs. The neural network produced in each case was tested using the test set.
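The data partitioning described above (321 compounds, 21 held out for testing, and ten training/validation pairs of 270 and 30 compounds) might be produced as in the sketch below. How the compounds were actually assigned to each set is not stated here; a seeded random shuffle is assumed, and the compound identifiers are placeholders.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(compound_ids: Sequence[int],
                  n_test: int = 21,
                  n_folds: int = 10,
                  seed: int = 0) -> Tuple[List[int], List[Tuple[List[int], List[int]]]]:
    """Hold out a test set, then build cross-validation training/validation pairs."""
    rng = random.Random(seed)
    ids = list(compound_ids)
    rng.shuffle(ids)
    test, rest = ids[:n_test], ids[n_test:]
    fold_size = len(rest) // n_folds
    pairs = []
    for k in range(n_folds):
        validation = rest[k * fold_size:(k + 1) * fold_size]
        training = rest[:k * fold_size] + rest[(k + 1) * fold_size:]
        pairs.append((training, validation))
    return test, pairs


test_set, cv_pairs = split_dataset(range(321))
```

Each compound outside the test set appears in exactly one validation set, so every network is validated on compounds it was not trained on, and all ten networks are tested against the same 21 compounds.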
- The representation used was based on the atomistic approach described previously. However, input parameters relating to the number and type of rings were added, thus affording the neural network some insight into the molecule's topology. Twenty-one input variables were used to represent each molecule: C(aromatic), C4, C3, C2, N(aromatic), N3, N2, N1, O2, O1, S, P, Cl, F, Br, I, 7-membered rings, 6-membered rings, 5-membered rings, 4-membered rings, 3-membered rings. An example of the representations is shown below:
- IC50 being the binding affinity, which often corresponds to biological activity
- pIC50 value: this work modelled log(1/IC50)
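The modelled quantity, log(1/IC50), can be computed as in the short sketch below. A base-10 logarithm and IC50 values expressed in mol/L are assumed, since neither is specified in this passage.

```python
import math


def pic50(ic50_molar: float) -> float:
    """Return log10(1/IC50), computed as -log10(IC50); IC50 assumed in mol/L."""
    if ic50_molar <= 0.0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


nanomolar_binder = pic50(1e-9)
micromolar_binder = pic50(1e-6)
```

On this scale a more potent compound (lower IC50) has a larger value, so the network output increases with activity.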
- MLR was performed on the data set twice: the first (MLR1) used only first-order terms, whilst the second (MLR2) used first and second order terms (but no cross-terms). MLR was employed on a "training set" of 270 compounds, then the resulting equation was tested on a validation set of 30 and a test set of 21. The results are compared with the neural network results on exactly the same data sets in Figure 8.
- A portion of a chemical structure database was screened and the biological activities of the members predicted.
- The database chosen was the first 7800 compounds in the Maybridge chemical database. While this database contains known, commercially available molecules, not hypothetical structures generated by techniques such as DBMaker, it serves to illustrate the screening procedure equally well.
- The 7800 structures were converted into an atomistic representation similar to that outlined above.
- The representations were presented as input to a trained neural network representing a benzodiazepine receptor. Training was disabled so that the weights were fixed and the virtual receptor model generated 7800 outputs representing predicted log biological responses for
- Example 5 The Maybridge column refers to the compound ID in the Maybridge database. The results of screening the benzodiazepine data set in the virtual receptor are also included.
- Example 1-3 We carried out a study analogous to that in Example 1-3 to derive a Muscarinic Virtual Receptor from the analysis of a data set of 161 compounds which act upon the muscarinic receptor.
- Compounds capable of binding to this receptor are currently the subject of intense research, due to the belief that memory-related problems in Alzheimer's disease could be treated using agonists at this receptor.
- The IC50 values used in the analysis are the concentrations required to displace [3H]oxotremorine-M (OXO-M), an agonist at the M1 muscarinic receptor.
- The training sets contained 151 compounds, whilst the test set contained 10 compounds.
- An example of observed versus calculated scaled log(activity) for training set examples is shown in Figure 20.
- NCI National Cancer Institute
- The ANNs used were three-layer, fully connected, feed-forward networks which were trained using a Levenberg-Marquardt [Marquardt, D.W. J. Soc. Ind. Appl. Math. 11, 431-441 (1963)] optimised back-propagation algorithm which incorporated Bayesian regularisation [MacKay, D.J.C. A Practical Bayesian Framework for Backprop Networks, Neural Computation, 4, 415-447 (1992)].
- Bayesian regularisation removes the need to supply a validation set since it minimises a linear combination of squared errors and weights. It also modifies the linear combination so that at the end of training the resulting network has good generalisation qualities.
- The network architecture made use of 3 hidden nodes, which proved to be more than sufficient in all cases, with the Bayesian regularisation method estimating the number of effective parameters.
- The concerns about overfitting and overtraining are also removed by this method, so that a definitive and reproducible model is produced.
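For reference, the "linear combination of squared errors and weights" and the "number of effective parameters" can be written out in the notation of MacKay's evidence framework. The symbols below (N training patterns with targets t_n, k network weights w_i, hyperparameters α and β) are introduced here for illustration and do not appear in the text above.

```latex
\begin{aligned}
E_D &= \tfrac{1}{2}\sum_{n=1}^{N}\bigl(t_n - y(\mathbf{x}_n;\mathbf{w})\bigr)^2, &
E_W &= \tfrac{1}{2}\sum_{i=1}^{k} w_i^{2}, \\
M(\mathbf{w}) &= \beta E_D + \alpha E_W, &
\gamma &= k - \alpha\,\operatorname{Tr}\bigl(\mathbf{A}^{-1}\bigr),
\qquad \mathbf{A} = \nabla\nabla M(\mathbf{w}), \\
\alpha &\leftarrow \frac{\gamma}{2E_W}, &
\beta &\leftarrow \frac{N-\gamma}{2E_D}.
\end{aligned}
```

Training minimises M, and the re-estimation of α and β in the last line is what "modifies the linear combination" during training. The quantity γ lies between 0 and k, which is consistent with the number of effective parameters being smaller than the number of weights in the architecture.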
- The standard error of predictions (SEPs) and correlation coefficients, using the various representations, are shown in Table 14.
- SEPs standard error of predictions
- Table 14 A number of fully-connected ANN architectures, containing different numbers of hidden layers and nodes, were tested and a single hidden layer with 3 nodes was found to be optimal in each case.
- The number of effective parameters was always considerably less than the number of weights implied by the network architecture.
- The data set compounds were scrambled to remove any inadvertent ordering effects such as by the magnitude of the biological activity.
- A K-means hierarchical clustering was carried out on the input variables and one compound from each cluster, at the 11 cluster level, was extracted for a test set. This test set, of 11 compounds, was not the same for
- NI Number of independent variables.
- NPC Number of Principal Components used.
- NPar Number of effective parameters.
- peff Number of input variables/NPar (c) Randic [5] indices.
Landscapes
- Chemical & Material Sciences (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computing Systems (AREA)
- Medicinal Chemistry (AREA)
- Library & Information Science (AREA)
- Crystallography & Structural Chemistry (AREA)
- Physics & Mathematics (AREA)
- Organic Chemistry (AREA)
- Biochemistry (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AUPO892197 | 1997-09-03 | ||
AUPO8921A AUPO892197A0 (en) | 1997-09-03 | 1997-09-03 | Compound screening system |
AUPP1192A AUPP119297A0 (en) | 1997-12-31 | 1997-12-31 | Compound screening system |
AUPP119297 | 1997-12-31 | ||
PCT/AU1998/000715 WO1999012118A1 (en) | 1997-09-03 | 1998-09-03 | Compound screening system |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1010094A1 (de) | 2000-06-21 |
EP1010094A4 (de) | 2001-03-07 |
Family
ID=25645594
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP98941143A Withdrawn EP1010094A4 (de) | 1997-09-03 | 1998-09-03 | Compound screening system (Gemischtes sichtungssystem) |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP1010094A4 (de) |
WO (1) | WO1999012118A1 (de) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023123021A1 (zh) * | 2021-12-29 | 2023-07-06 | 深圳晶泰科技有限公司 | Method, apparatus and storage medium for obtaining molecular feature descriptions |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU4565600A (en) * | 1999-06-18 | 2001-01-09 | Synt:Em (S.A.) | Identifying active molecules using physico-chemical parameters |
EP1402454A2 (de) * | 2001-04-06 | 2004-03-31 | Axxima Pharmaceuticals Aktiengesellschaft | Method for generating a quantitative structure-property-activity relationship |
US7415358B2 (en) | 2001-05-22 | 2008-08-19 | Ocimum Biosolutions, Inc. | Molecular toxicology modeling |
US7447594B2 (en) | 2001-07-10 | 2008-11-04 | Ocimum Biosolutions, Inc. | Molecular cardiotoxicology modeling |
US7469185B2 (en) | 2002-02-04 | 2008-12-23 | Ocimum Biosolutions, Inc. | Primary rat hepatocyte toxicity modeling |
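CN109359833B (zh) * | 2018-09-27 | 2022-05-27 | 中国石油大学(华东) | Method for analysing fire and explosion risk of offshore platforms based on an ABC-BRANN model |
CN111916143B (zh) * | 2020-07-27 | 2023-07-28 | 西安电子科技大学 | Molecular activity prediction method based on fusion of diverse substructure features |
CN114334018B (zh) * | 2021-12-29 | 2024-09-06 | 深圳晶泰科技有限公司 | Method, apparatus and storage medium for obtaining molecular feature descriptions |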
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5434796A (en) * | 1993-06-30 | 1995-07-18 | Daylight Chemical Information Systems, Inc. | Method and apparatus for designing molecules with desired properties by evolving successive populations |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ZA9010460B (en) * | 1989-12-29 | 1992-11-25 | Univ Technologies Int | Methods for modelling tertiary structures of biologically active ligands including agonists and antagonists thereto and novel synthetic antagonists based on angiotensin |
JP2739804B2 (ja) * | 1993-05-14 | 1998-04-15 | 日本電気株式会社 | Dipole estimation device |
US6081766A (en) * | 1993-05-21 | 2000-06-27 | Axys Pharmaceuticals, Inc. | Machine-learning approach to modeling biological activity for molecular design and to modeling other characteristics |
US5699268A (en) * | 1995-03-24 | 1997-12-16 | University Of Guelph | Computational method for designing chemical structures having common functional characteristics |
1998
- 1998-09-03 EP EP98941143A patent/EP1010094A4/de not_active Withdrawn
- 1998-09-03 WO PCT/AU1998/000715 patent/WO1999012118A1/en not_active Application Discontinuation
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5434796A (en) * | 1993-06-30 | 1995-07-18 | Daylight Chemical Information Systems, Inc. | Method and apparatus for designing molecules with desired properties by evolving successive populations |
Non-Patent Citations (4)
Title |
---|
D.E. WALTERS ET AL.: "GENETICALLY EVOLVED RECEPTOR MODELS: A COMPUTATIONAL APPROACH TO CONSTRUCTION OF RECEPTOR MODELS" JOURNAL OF MEDICINAL CHEMISTRY, vol. 37, no. 16, 5 August 1994 (1994-08-05), pages 2527-2536, XP000608151 Washington, DC, US ISSN: 0022-2623 * |
F.R. BURDEN ET AL.: "Predicting maximum bioactivity by effective inversion of neural networks using genetic algorithms" CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, vol. 38, no. 2, 1 October 1997 (1997-10-01), pages 127-137, XP004097524 Amsterdam, NL ISSN: 0169-7439 * |
J. B. MOON ET AL.: "Computer Design of Bioactive Molecules: A Method for Receptor-Based de Novo Ligand Design" PROTEINS: STRUCTURE, FUNCTION AND GENETICS, vol. 11, 1991, pages 314-328, XP000560842 * |
See also references of WO9912118A1 * |
Also Published As
Publication number | Publication date |
---|---|
EP1010094A4 (de) | 2001-03-07 |
WO1999012118A1 (en) | 1999-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mai et al. | Molecular photochemistry: recent developments in theory | |
Singh et al. | Comparison of multi-modal optimization algorithms based on evolutionary algorithms | |
Judson | Genetic algorithms and their use in chemistry | |
Pedersen et al. | Genetic algorithms for protein structure prediction | |
Davidor | Genetic Algorithms and Robotics: A heuristic strategy for optimization | |
Suchan et al. | Pragmatic approach to photodynamics: Mixed Landau–Zener surface hopping with intersystem crossing | |
JPH08512159A (ja) | Method and apparatus for designing molecules with desired properties by evolving successive populations of molecules | |
Hasegawa et al. | GA strategy for variable selection in QSAR studies: enhancement of comparative molecular binding energy analysis by GA‐based PLS method | |
Lameijer et al. | Evolutionary algorithms in drug design | |
US6219622B1 (en) | Computational method for designing chemical structures having common functional characteristics | |
US5699268A (en) | Computational method for designing chemical structures having common functional characteristics | |
Fatemi et al. | Prediction of bioconcentration factor using genetic algorithm and artificial neural network | |
EP1010094A1 (de) | Gemischtes sichtungssystem | |
CA2478556A1 (en) | Methods and systems for discovery of chemical compounds and their syntheses | |
Danel et al. | Docking-based generative approaches in the search for new drug candidates | |
Hageman et al. | Design and assembly of virtual homogeneous catalyst libraries–towards in silico catalyst optimisation | |
Langdon et al. | Genetic programming in data mining for drug discovery | |
WO2005083616A1 (ja) | Ligand search device, ligand search method, program, and recording medium | |
McLeod et al. | Development of a genetic algorithm for molecular scale catalyst design | |
Ajjarapu et al. | Ligand-based drug designing | |
US20020133297A1 (en) | Ligand docking method using evolutionary algorithm | |
Lin et al. | An efficient hybrid Taguchi-genetic algorithm for protein folding simulation | |
Olariu et al. | Biology-derived algorithms in engineering optimization | |
Zaman et al. | Using subpopulation EAs to map molecular structure landscapes | |
WO2000079263A2 (en) | Identifying active molecules using physico-chemical parameters |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| 17P | Request for examination filed | Effective date: 20000331 |
| AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): DE FR GB IT |
| A4 | Supplementary search report drawn up and despatched | Effective date: 20010124 |
| AK | Designated contracting states | Kind code of ref document: A4 Designated state(s): DE FR GB IT |
| RIC1 | Information provided on ipc code assigned before grant | Free format text: 7G 06F 17/50 A |
| 17Q | First examination report despatched | Effective date: 20030429 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| 18D | Application deemed to be withdrawn | Effective date: 20040110 |