WO2003094086A2 - Analysis of gene expression data for multi-class prediction - Google Patents

Analysis of gene expression data for multi-class prediction

Info

Publication number
WO2003094086A2
Authority
WO
WIPO (PCT)
Prior art keywords
class
individuals
samples
individual
fitness
Application number
PCT/GB2003/001258
Other languages
French (fr)
Other versions
WO2003094086A3 (en)
Inventor
Patrick Boon Ooi Tan
Chia Huey Ooi
Original Assignee
Biotech Research Ventures Pte Limited
Denison, Christopher
Application filed by Biotech Research Ventures Pte Limited and Denison, Christopher
Priority to AU2003217019A1 (en)
Publication of WO2003094086A2 (en)
Publication of WO2003094086A3 (en)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/12: Computing arrangements based on biological models using genetic models
    • G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2115: Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • the invention relates to the development of algorithms for assigning a sample to one of several classes.
  • the invention relates to the analysis of gene expression data to assign a tissue sample to one of several phenotypic classes, such as tissue origin.
  • rank-based gene selection schemes suggested by Dudoit et al. (2000).
  • the weakness of rank-based selection methods is that while they are good at identifying genes that are strongly correlated to the target phenotype class distinction, they tend to ignore correlations between genes. Class distinctions may be due to not only the expression of individual genes, but also gene interactions.
  • previous methodologies also suffer from the constraint that the size of the predictor set has to be specified a priori. Thus, it is unclear whether the number of genes used in the final predictor set is actually the optimal number to generate an accurate class prediction.
  • the inventors have sought to address these and other problems arising in the area of multi-class prediction.
  • the inventors have recognised that, in order to produce an accurate predictor set for multi-class prediction, it is desirable to select the predictor set using a method that is capable of selecting genes that interact (i.e. have inter-dependent expression levels) within any particular class and/or that is capable of selecting genes whose expression levels are uncorrelated with each other across the classes.
  • the latter criterion may also be expressed as excluding genes whose expression levels are highly correlated across the classes. Inclusion of genes showing such correlation across the classes does not add any information to the predictor set.
  • the inventors have in particular applied genetic algorithms and discriminant-based classification methods to the analysis and exploration of complex multi-class datasets.
  • Genetic algorithms (GAs), as introduced by Goldberg (1989), are randomized search and optimization techniques that derive their working principles by analogy to evolution and natural genetics. Because they are aided by large amounts of implicit parallelism (Greffenstette and Baker 1989), GAs are capable of searching for optimal or near-optimal solutions on complex and large landscapes, or spaces of possible solutions. Furthermore, GAs allow searching of these spaces of solutions by considering multiple interacting attributes (parameters) simultaneously, rather than by considering one attribute at a time. Because of these advantages, GAs may represent another useful tool in the classification of biological phenotypes based on gene expression data. Previous reports have described the use of GAs for binary class prediction problems (Li et al., 2001). However, despite their suitability for addressing problems involving large solution spaces, the potential for using GAs in multiple-class prediction settings has to date remained unexplored.
  • the inventors have used the parallelised searching capability of GAs to design a gene-selection scheme that determines the optimal set of R genes in a multi-class dataset which classifies the samples within the dataset with minimal error. Using this approach, it is possible not only to determine the specific genes that should belong to a predictor set, but also the optimal size of the predictor set, from within a pre-specified range.
  • the invention provides: in the identification, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which may be classified into multiple phenotypic classes, of a subset of genes whose expression levels are likely to be predictive of the phenotypic class of a sample of unknown phenotypic class, the use of a gene selection method which: (1) permits the selection of genes that interact within any particular class; and/or (2) selects genes whose expression levels are substantially uncorrelated with each other across the classes; and/or (3) excludes from the subset genes whose expression levels are highly correlated across the classes with another gene of the subset.
  • Said selection and exclusion need not be absolute, but may reflect a tendency or preference.
  • the gene selection method employs a genetic algorithm and/or a discriminant-based classifier.
  • the discriminant-based classifier preferably employs matrix inversion (generally inversion of a class or common covariance matrix) to avoid selecting genes whose expression levels are correlated across classes.
  • other ways may be employed of avoiding the selection of genes whose expression levels are correlated across classes. For example, one could simply compare the expression profile of each gene in a candidate subset to all others in that subset, and reject genes that display a correlation coefficient higher than some arbitrary cutoff value. Having found a plurality of genes with a too-high correlation coefficient in a candidate subset, all but one could be replaced; a minimal sketch of such a screen follows.
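  • By way of illustration only, such a correlation screen might be sketched in Matlab® as follows; the function name, the cutoff value and the policy of keeping the first gene of each correlated pair are assumptions of the sketch, not part of the exemplified method:

      % Correlation screen over a candidate subset (illustrative sketch).
      % X is an R-by-M matrix of expression levels (rows = genes, columns = samples);
      % cutoff is an arbitrary correlation threshold, e.g. 0.9.
      function keep = screenByCorrelation(X, cutoff)
          R = size(X, 1);
          C = corrcoef(X');                % R-by-R gene-gene correlation coefficients
          keep = true(R, 1);
          for i = 1:R
              for j = i+1:R
                  % of any too-highly-correlated pair, keep only the first gene
                  if keep(i) && keep(j) && abs(C(i, j)) > cutoff
                      keep(j) = false;
                  end
              end
          end
      end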
  • the invention also provides the use of a genetic algorithm and/or a discriminant-based classifier in the identification, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which may be classified into multiple phenotypic classes, of a subset of genes whose expression levels are likely to be predictive of the phenotypic class of a sample of unknown phenotypic class.
  • the invention provides a method for identifying, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which are classified into multiple phenotypic classes, a subset of genes whose expression levels in a sample of unknown phenotypic class are likely to be predictive of the phenotypic class of that sample, the method comprising:
  • (a) generating a first generation population of individuals, each individual representing a subset of genes from the pool; (b) evaluating, for each individual, its fitness to predict the phenotypic class of a sample of unknown phenotypic class, by determining its ability to predict the phenotypic class of a plurality of test samples of known phenotypic class; (c) generating a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b); (d) repeating steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached; and (e) selecting the fittest individual, as evaluated in step (b) or step (d), from the final generation or from among the fittest individuals of a plurality of generations.
  • the phrase "generating a second generation population of individuals from selected individuals in the first generation population" in no way implies that each individual in the second generation population must be identical to one of the selected individuals in the first generation population. Indeed, the generating step will generally include crossing over and/or mutation, such that at least some individuals in the second generation population may (and usually will) not be identical to any selected individual in the first generation population.
  • the invention provides a computer program, which when run on a computer (usually a general purpose computer) or computer network will carry out the above method, and also includes a computer program, when recorded on a data carrier, for carrying out the above method.
  • This may be stated as a computer memory product having stored thereon at least one digital file, said memory product comprising computer readable memory and said stored digital file or files constituting a program to carry out the above method.
  • Such a program for identifying, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which are classified into multiple phenotypic classes, a subset of genes whose expression levels in a sample of unknown phenotypic class are likely to be predictive of the phenotypic class of that sample, can be expected to include code which is effective, when the program is run, to:
  • (a) generate a first generation population of individuals, each individual representing a subset of genes from the pool; (b) evaluate, for each individual, its fitness to predict the phenotypic class of a sample of unknown phenotypic class by determining its ability to predict the phenotypic class of a plurality of test samples of known phenotypic class; (c) generate a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b);
  • the program will generally include code effective to receive input of stored data representing expression levels of the genes represented by an individual, for evaluating its fitness in step (b) or step (d) .
  • Possibilities for display or output include graphical or numeric display, output to memory and output to a printer. Furthermore the invention provides apparatus for carrying out the above method, the apparatus including provision for input of stored data representing expression levels of a pool of genes in a plurality of biological samples of known phenotypic class, a processor and a stored program as defined above.
  • the invention provides a system for identifying, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which are classified into multiple phenotypic classes, a subset of genes whose expression levels in a sample of unknown phenotypic class are likely to be predictive of the phenotypic class of that sample, the system comprising:
  • a processing component capable, for each individual, of evaluating its fitness to predict the phenotypic class of a sample of unknown phenotypic class by determining its ability to predict the phenotypic class of a plurality of test samples of known phenotypic class;
  • a processing component capable of generating a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b) ;
  • a processing component capable of directing the repetition of steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached;
  • a processing component capable of selecting the fittest individual, as evaluated in step (b) or step (d) , from the final generation or from among the fittest individuals of a plurality of generations;
  • references to a processing component may refer to the same processing component (e.g. a processor programmed to perform the processing steps identified in (a) to (e)) or to different processing components.
  • the system preferably comprises a memory in which is stored an expression data set representing expression levels for the pool of genes in a plurality of biological samples of known phenotypic class.
  • the system preferably includes a processing component capable of receiving input of stored data representing expression levels for the genes represented by an individual, for the evaluation of its fitness in step (b) or step (d) .
  • Step (c) is the kind of procedure which genetic algorithms are designed to perform, and preferably the method uses such an algorithm.
  • Preferably said selection includes an intermediate step of generating a pool of individuals from the first generation population, wherein the number of copies, or absence, of a first generation individual in the pool depends on its fitness as determined in step (b).
  • the pool then provides the selected first generation individuals from which the individuals of the second generation are generated (in the following description, a reference to a first generation individual, and especially a selected first generation individual, may therefore be interpreted as a reference to a member of the pool, unless the context requires otherwise).
  • Selection according to fitness may involve any suitable technique, for example assigning to the pool a number of copies of each individual which is proportional to the fitness of the individual.
  • the technique is preferably probabilistic, for example stochastic universal sampling or roulette wheel selection.
  • in step (c), the generation of second generation individuals from the selected individuals of the first generation preferably includes crossing over between and/or mutation of the selected individuals of the first generation.
  • Crossing over is the exchange of information between typically two individuals ("parent" individuals) to generate new individuals which represent different combinations of genes from the parent individuals.
  • Mutation is the change of information in an individual, such that it represents a new combination of genes from the unmutated individual.
  • crossing over preferably precedes mutation.
  • the following description refers to both crossing over and mutation occurring on selected first generation individuals, but applies equally to the situation where e.g. mutation follows crossing over.
  • references to selected first generation individuals may therefore also be interpreted to include the individuals resulting from earlier crossing over between selected first generation individuals, when such crossing over has occurred before the mutation step. If crossing over has not occurred, e.g. because a crossing over step is absent or because a probability level for crossing over between two individuals has not been met, then mutation will occur on first generation individuals (or copies thereof) .
  • all selected first generation individuals are subject in pairs to crossing over, preferably at a defined probability level, p_c, which may be 1 but may be less than one. Typically 0.7 ≤ p_c ≤ 1.
  • the matching together of pairs of individuals for crossing over is preferably random.
  • all selected first generation individuals are subject to mutation, preferably at a defined probability level, p_m.
  • the probability function may be applied to individuals (in which case, when an individual satisfies the probability function, it will contain a single mutation, preferably at random) .
  • the probability function is applied separately to each representation of a gene in each individual (using the terminology introduced below, to each gene indicator) .
  • p_m is preferably much less than one, typically 0.0005 ≤ p_m ≤ 0.01.
  • Mutation is preferably uniform, i.e. the gene replacing a mutated gene is preferably selected with equal probability from the pool of genes. However, it may be non-uniform (or dynamic), i.e. the group of genes from which the gene replacing a mutated gene is selected varies in size according to the generation of the individual being mutated. In such a case, the group preferably decreases in size with increasing generations.
  • different individuals represent different numbers of genes, preferably within defined limits.
  • the optimal number of genes for phenotypic classification of samples will depend on numerous factors, including the number of classes, the number of genes for which expression data is known for samples of known phenotypic class, and the variability in and correlation between expression levels. Accordingly, the optimal number of genes cannot easily be determined a priori .
  • the method is repeated, using different defined limits in different runs.
  • Each individual may comprise a sequence (or "string") of gene indicators, usually numbers, wherein different gene indicators represent different genes. Where different individuals in a population represent different numbers of genes, each individual in the population may have the same number of gene indicators and further comprise a number indicator which specifies (usually within defined limits) the number of genes the individual represents. Where the number of genes an individual represents is less than the maximum number of genes represented by the population, a number of gene indicators will be redundant in the individual. This method of representing individuals, however, allows convenient crossing over between individuals representing different numbers of genes and allows a mutation to affect the number of genes represented by an individual (see below) , as well as which genes are represented. It also allows the mutation of and/or crossing over between genes which are redundant in the individual, but which may cease to be redundant in a later generation if the number indicator is changed by subsequent crossing over or mutation.
  • crossing over may be uniform or one-point.
  • one-point crossing over a number of consecutive indicators are exchanged between parent individuals to generate child individuals.
  • uniform crossing over each point in the sequence is independent of the others, so the effect is the same as a random number of one-point crossings over between the same two sequences.
  • mutation may occur only within the gene indicators, or may also affect the number indicator. Mutation of an indicator may occur by replacing the mutated indicator with a randomly generated indicator. This will occasionally result in an indicator being replaced with an identical indicator. This is considered to fall within the definition of mutation. Alternatively, the possibility of replacement with an identical indicator may be excluded.
  • Mutation and crossing over may result in individuals having duplicate copies of the same gene indicator. This possibility may be excluded (e.g. by screening for duplicate gene indicators within individuals and replacing duplicates with randomly generated indicators). For simplicity, however, duplicates are preferably ignored. It should be noted, however, that the preferred use of matrix inversion (explained in detail below) causes individuals that represent duplicate gene indicators to have zero fitness value. Such individuals will therefore not be selected for generating later generations.
  • the first generation population is preferably randomly generated. Alternatively, it may comprise selected individuals, e.g. the fittest individuals from earlier runs of the method (or other similar methods) .
  • the termination condition is preferably satisfied when a defined number of generations have been generated and evaluated.
  • Other termination conditions may, however, depend on measures of maximum or average fitness, e.g. as described in the background.
  • the termination condition may be satisfied when average and/or maximum fitness of all individuals of a generation is no longer increasing (or no longer increasing at a defined rate) between successive generations.
  • the prediction preferably employs discriminant-based, rather than rank-based, classification; more preferably classification that selects (i.e. performs with higher accuracy on) individuals which represent genes whose expression levels are correlated to class distinction but uncorrelated to each other. This may be stated as classification that selects individuals which represent genes whose expression levels are correlated to particular class distinctions and are uncorrelated to each other across the classes. More preferably the prediction employs a maximum likelihood (MLHD) classifier, more preferably a linear discriminant function.
  • classification preferably involves the production, for each individual, of a discriminant function which assigns a biological sample to a phenotypic class, based on the expression levels in that biological sample of the genes represented by that individual.
  • the discriminant function is produced on the basis of the expression levels, in the training samples, of the genes represented by the individual.
  • the fitness of each individual is then evaluated by determining the error rate of the discriminant function for that individual in classifying the test samples (whose phenotypic classes are known, but which are not used in the production of the discriminant function) .
  • the MLHD classifier operates via the following steps: (1) for each class, defining a typical expression profile of a member of that class, based on the expression levels, for the genes represented by the individual under consideration, in the training samples of that class (preferably this involves averaging the expression levels, preferably taking their mean, to produce a class mean vector); (2) for each class, determining the spread of expression levels about that typical profile, e.g. as a class covariance matrix; (3) preferably, averaging that spread across all classes, e.g. as a common covariance matrix; and (4) for each test sample, comparing its expression profile with the typical expression profile for each class (e.g. the class mean vector), taking into account the spread as determined for that class (as in step (2), e.g. the class covariance matrix) or (preferably) as averaged across all classes (as in step (3), e.g. the common covariance matrix), and assigning the test sample to the class to which its expression profile matches most closely.
  • the comparison in step (4) involves inversion of the class or common covariance matrix. This is effected in the Examples by the term Σ⁻¹ in the calculation of f_q(e) in equation (11). High correlation, even between just one pair of genes, causes the covariance matrix to be badly conditioned (approaching singularity). In turn, this causes the inverted matrix to be unreliable (its entries approaching infinity). Both of these conditions can be easily identified (see e.g. Example 2). The exemplified classification method will not select individuals with highly correlated genes, because such individuals are given low fitness values owing to the unreliability of the inverted covariance matrix.
  • the discriminant function described in detail herein will give a high number if the match between a test sample and a candidate class is good and a low number if the match is poor; the test sample will be assigned to the class which gives the highest number.
  • Fitness may be evaluated by cross-validation and/or independent test, preferably a combination of both.
  • the discriminant-based classifier (i.e., in preferred embodiments, the MLHD classifier) is produced on the basis of a number of training samples and then tested against a number of test samples, which do not overlap with the training samples.
  • each MLHD (or other discriminant-based) classifier is produced using a subset containing all but one of the samples, and tested against the remaining sample; more preferably an MLHD (or other discriminant-based) classifier is produced for all possible such subsets, so that cross-validation against every sample occurs.
  • Fitness as determined by independent test may be accorded higher significance than fitness as determined by cross validation, especially if a small dataset is used.
  • the fittest individual may be used in the same way to classify samples of unknown phenotypic class.
  • the subset is preferably used without further change in the classification of samples of unknown phenotypic class. This is in contrast to previous uses of genetic algorithms in this area (see e.g. Li et al., 2001), in which the genes identified most often by genetic algorithm-based selection over several runs were then grouped, and this group (not a group selected directly using a genetic algorithm) was used for the classification of samples.
  • the number of phenotypic classes into which the biological samples may be classified will be at least 4 or 5, preferably at least 6, 7 or 8, more preferably 9 or more.
  • phenotypic class is intended to be interpreted in a broad sense, to encompass any desired classification of biological samples, usually based on observable or testable characteristics or responses of the samples.
  • the phenotypic classes may represent different sources of the biological samples, e.g. tissue sources, especially tissue origin of tumour samples (e.g. secondary tumour samples) , or may represent different pathologies, or different responses to external factors, such as chemotherapy or radiotherapy, or exposure to environmental factors.
  • biological sample is also intended to be interpreted in a broad sense, to include any biological sample from which a measurement of gene expression levels can be obtained, whether directly (e.g. by determination of mRNA or protein levels) or indirectly.
  • biological samples include tissue samples such as biopsies, cells, blood and other bodily fluids
  • gene and "expression level” are used in a broad sense, such that the expression level of a gene may be determined in a variety of ways. Typically, determination of the expression level of a gene will involve nucleic acid hybridisation (typically hybridisation of mRNA from a sample to an array of cDNA molecules) . However, other means of determining expression levels are also contemplated and include, for example, determination of protein levels in a sample, e.g. by binding of protein from the sample to an array of specific binding partners, such as antibodies or antibody fragments.
  • the invention is not limited to the analysis of genetic data, nor even to the analysis of biological data (though this is preferred) and may relate to the classification of any complex data.
  • An example is the analysis of protein and/or peptide patterns from a mass spectrum.
  • the term "gene” as used above in the definitions of aspects of the invention may be replaced by "parameter”; similarly, “expression level” may be replaced by "value”, “biological sample” by “sample”, and "phenotypic class” by “class”. Samples may be physical or abstract.
  • the methods of the invention may include a preprocessing step in which the size of the pool of parameters (e.g. genes) to be used in the method is reduced from an initial larger pool.
  • Such reduction may exclude parameters for which values (e.g. expression levels) are unavailable for some of the training and/or test samples.
  • the reduction may exclude parameters for which the variation in values across the training and/or test samples is lower than for other parameters.
  • Parameters may be excluded if the variation is less than a threshold level.
  • a predefined number of parameters are included, being those parameters having the highest variation. Variation may be measured by standard deviation or any other conventional measure of variation. Wherever possible, the values are first normalised, preferably by reference to control values (where available).
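  • Purely as a concrete sketch of this preprocessing step in Matlab® (the figure of 1000 genes follows the Examples below; variable names are illustrative):

      % Keep the 1000 genes with the highest standard deviation across samples.
      % X is a genes-by-samples matrix of already-normalised expression values.
      sds = std(X, 0, 2);                         % per-gene standard deviation
      [sortedSds, order] = sort(sds, 'descend');
      topIdx = order(1:1000);                     % indices of the most variable genes
      Xtrunc = X(topIdx, :);                      % truncated 1000-gene dataset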
  • Figure 1 is a schematic illustration of genetic algorithms (GAs) , stochastic universal sampling, and examples of crossovers:
  • Figure 2 shows the effects of varying GA conditions on performance accuracy. Fitness is depicted as 'height' for all plots.
  • Figure 3 shows a comparison of expression profiles of predictor sets obtained through different methodologies. Columns represent different class distinctions, and only training set samples are depicted. Arrows depict genes which have highly correlated expression patterns across the sample classes. Classes are labeled as follows: BR (breast) , CN (central nervous system) , CL (colon) , LE (leukemia) , ME (melanoma) , NS (non-small-cell lung carcinoma) , OV (ovarian) , RE (renal) and PR (reproductive system) .
  • each potential solution to a problem is represented in the form of a string.
  • Each string contains encoded parameters of the solution it represents.
  • a string is dubbed a chromosome or an individual, while a parameter is also called an attribute.
  • a pool of such strings forms a population.
  • a population is initialized by creating (usually randomly) a series of strings such that each string represents a point in solution space.
  • a fitness function is then defined to measure the degree of goodness of a string (i.e. how good the string is as a solution to the problem) .
  • the fitness value associated with a string indicates the optimality of the solution that the string represents.
  • a selection process is then applied to the strings, based on the principle of 'survival of the fittest'.
  • a few strings are chosen from the total population.
  • Each chosen string is then assigned a number of copies to go into the mating pool based on its fitness value .
  • a new population of strings is formed for the next generation.
  • the processes of selection, crossover and mutation are repeated in each subsequent generation until a termination condition is satisfied.
  • Each generation is evaluated by two parameters: the average and the maximum fitness values of the population at that generation.
  • Commonly used termination conditions include reaching a defined maximum number of generations; alternatively, the algorithm may be compelled to terminate when there is no significant improvement in the average or maximum fitness value of the population.
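  • The generation cycle just described can be condensed into the following minimal, self-contained Matlab® loop. The fitness function here is a stand-in (in the patent the fitness is the MLHD-based classification accuracy described later), and all parameter values and variable names are illustrative:

      popnsize = 20; numgens = 50; pc = 0.9; pm = 0.005;   % GA parameters
      Rmin = 5; Rmax = 10; nGenes = 1000;
      % each string has the form [R, g1 ... gRmax]
      popn = [randi([Rmin Rmax], popnsize, 1), randi([1 nGenes], popnsize, Rmax)];
      bestFit = -Inf;
      for gen = 1:numgens
          fitness = sum(popn(:, 2:end), 2);                % stand-in fitness function
          [mx, mi] = max(fitness);
          if mx > bestFit, bestFit = mx; best = popn(mi, :); end
          p = cumsum(fitness) / sum(fitness);              % roulette wheel selection
          pick = arrayfun(@(r) find(p >= r, 1), rand(popnsize, 1));
          pool = popn(pick, :);
          pool = pool(randperm(popnsize), :);              % random mating pairs
          for k = 1:2:popnsize-1
              if rand < pc                                 % uniform crossover
                  t = rand(1, Rmax + 1) > 0.5;
                  a = pool(k, :); b = pool(k+1, :);
                  pool(k, t) = b(t); pool(k+1, t) = a(t);
              end
          end
          m = rand(size(pool)) < pm;                       % element-wise mutation
          m1 = m(:, 1); pool(m1, 1) = randi([Rmin Rmax], nnz(m1), 1);
          mg = m; mg(:, 1) = false;
          pool(mg) = randi([1 nGenes], nnz(mg), 1);
          popn = pool;
      end
      disp(best); disp(bestFit);                           % fittest string found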
  • the overall classification method used in the GA-based gene selection scheme essentially consists of two main components: (1) a GA-based gene selector which implements the gene selection scheme and (2) a Maximum Likelihood (MLHD) classifier.
  • the first component generates and evolves populations of sets (herein termed "individuals" or "strings") of R genes that are used to classify the samples, where the value of R lies in the pre-specified range [R_min, R_max].
  • the actual classification process (i.e. assigning a sample to one of the classes) is carried out by the second component, the MLHD classifier.
  • Each individual in the population represents a specific gene predictor set, and a fitness function is used to determine the performance accuracy of a predictor set in classifying a dataset.
  • a typical run of the algorithm covers G generations, where each generation consists of selection, crossover and mutation operations.
  • the time complexity is estimated as O(NMR³) when there are M samples, the population size is set as N, and the predictor set size is chosen as R. That is, the time complexity is proportional to N, to M and to R³. Since R varies between R_min and R_max in a run, a more realistic estimate of the time complexity is obtained by using the average of the two boundary values; the time complexity is thus taken to be approximately O(NM((R_min + R_max)/2)³).
  • Dataset: NCI60 (9 classes).
  • N (the population size) and G (the maximum number of generations).
  • Range of predictor set size (number of genes to be selected) [R_min, R_max]: [5, 10], [11, 15], [16, 20], [21, 25] or [26, 30].
  • for each predictor set size range there are thus 96 different runs (corresponding to combinations of conditions (a)-(d)).
  • the best strings from each individual generation in the run are then collectively compared to find the string with the best overall fitness.
  • each run yields G = 100 optimal predictor sets, i.e. the individual with the highest fitness for its particular generation, and these individuals are then compared with each other.
  • the G optimal individuals were sorted according to the following criteria:
  • the NCI60 gene expression dataset was first described in Ross et al. (2000), and contains the gene expression profiles of 64 cancer cell lines. This work focuses on the NCI60 dataset because previous work on tumor classification did not seem to achieve satisfactory classification accuracy with this dataset, compared with binary datasets (see Introduction).
  • the dataset was downloaded from http://genome-www.stanford.edu/sutech/download/nci60/dross_arrays_nci60.tgz, and specifically consists of the gene expression profiles of 64 cell lines as measured by cDNA microarrays containing 9703 spotted cDNA sequences. For the purposes of this report, the single unknown cell line and the two prostate cell lines (due to their small number) are excluded from analysis.
  • the second dataset ('GCM') (Ramaswamy et al., 2001) was downloaded from http://www-genome.wi.mit.edu/mpr/publications/projects/Global_Cancer_Map/ and contains the expression profiles of 198 primary tumor samples, 20 poorly differentiated adenocarcinomas and 90 normal tissue samples as measured by Affymetrix GeneChips containing 16063 genes and ESTs. Only the 198 primary tumor samples are considered in this work, to make the results comparable to those reported by Ramaswamy et al. (2001).
  • the 198 samples originate from 14 sites of origin: prostate (14), bladder (11), melanoma (10), uterine (10), leukemia (30), breast (12), colorectal (12), renal (11), ovarian (12), pancreatic (11), lung (12), lymphoma (22), central nervous system (20) and pleural mesothelioma (11).
  • the data was preprocessed as above to generate a truncated 1000-gene dataset, with standard deviation ranging from 0.299 to 3.089.
  • each string is a sequence of integers where each integer represents the index of a gene from the truncated dataset. Therefore the range of each element/integer in the string is [1,1000].
  • the length of a chromosome (string) is R_max + 1.
  • the first element in the string, R, denotes the size of the predictor set represented by the string, while the subsequent elements g_1, g_2, ..., g_Rmax represent the indices of a subset of genes picked from the truncated 1000-gene dataset.
  • the string connotes a set of R predictive genes indexed g_1, g_2, ..., g_R. Only the first R genes out of the R_max genes included in the string are used for classification, where R ranges from R_min to R_max.
  • This string represents a potential solution set containing genes numbered 258, 46, 293, 144, 336, 212 and 36.
  • An initial population of potential solutions is formed by creating N random strings, where the population size N is prespecified.
  • a random string is produced by generating one random integer (representing R) in the range [R_min, R_max], plus R_max random integers (the gene indices) in the range [1, 1000] to be used as its attributes.
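  • A minimal Matlab® sketch of this initialisation (variable names are illustrative):

      % Create N random strings of the form [R, g1 ... gRmax].
      popn = zeros(N, Rmax + 1);
      for n = 1:N
          popn(n, 1) = randi([Rmin, Rmax]);             % the number indicator R
          popn(n, 2:end) = randi([1, 1000], 1, Rmax);   % Rmax random gene indices
      end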
  • the fitness value of each string in the population is then computed.
  • the two error rates are obtained by classifying the samples with an MLHD classifier, using the genes contained in string S_i as variables.
  • the specifics of the MLHD classifier are described in subsequent sections.
  • the position of the first pointer is given by a randomly generated number in the range [0, 1/N].
  • the number of copies of a string S_i entering the mating pool is determined by the number of pointers that fall within S_i's segment along the line.
  • strings with high fitness get to have more copies of themselves in the mating pool than strings with lower fitness.
  • the roulette wheel selection method is a simpler method in which a single pointer is placed at a random position along the line, N times. In each placement, the string S_i on whose segment the pointer falls is selected to enter the mating pool.
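  • A sketch of stochastic universal sampling in Matlab® (fitness values are assumed non-negative; the roulette wheel variant would replace the N equally spaced pointers with N independently random ones):

      % popn holds N strings, one per row; fitness is the N-by-1 fitness vector.
      function pool = susSelect(popn, fitness)
          N = size(popn, 1);
          seg = cumsum(fitness) / sum(fitness);   % segment end-points along the line
          start = rand / N;                       % first pointer, in [0, 1/N]
          pointers = start + (0:N-1) / N;         % N pointers spaced 1/N apart
          pool = zeros(size(popn));
          for k = 1:N
              idx = find(seg >= pointers(k), 1, 'first');
              pool(k, :) = popn(idx, :);          % copy the string into the mating pool
          end
      end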
  • Crossover is illustrated in Figure 1c and is performed by first randomly choosing a pair of strings from the mating pool and then applying a crossover operation to the selected string pair with a fixed probability p_c.
  • the outcome of this operation is two offspring strings which are produced through the exchange of genetic information between the two parents. This probabilistic process is repeated until all parent strings in the mating pool have been considered.
  • in the GA-based gene selector we assess two types of crossover operations: one-point and uniform crossover (described in Haupt and Haupt, 1998).
  • for one-point crossover, an integer position k in the range [1, R_max] is chosen at random along the string. Two child strings are produced by swapping all genes between positions k+1 and R_max+1.
  • for uniform crossover, a crossover template is randomly generated.
  • the template is a binary string of the same length as the strings themselves, R_max+1.
  • each element of the template determines which child receives the gene from which parent. For example, if the element of the template at position k is 1, child 1 receives the gene at position k from parent 1 while child 2 receives the gene at position k from parent 2. However, if the element at position k is 0, then child 1 receives the gene at position k from parent 2 while child 2 receives the gene at position k from parent 1.
  • An example of uniform crossover is illustrated in Figure 1d.
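  • Both crossover operations reduce to a few lines of Matlab®; here parent1 and parent2 are assumed to be row vectors of length Rmax+1:

      % One-point crossover: swap everything after a random position k.
      k = randi([1, Rmax]);
      child1 = [parent1(1:k), parent2(k+1:end)];
      child2 = [parent2(1:k), parent1(k+1:end)];

      % Uniform crossover: a random binary template decides, element by
      % element, which child inherits from which parent.
      tmpl = rand(1, Rmax + 1) > 0.5;
      child1 = parent1; child1(tmpl) = parent2(tmpl);
      child2 = parent2; child2(tmpl) = parent1(tmpl);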
  • Mutation operations can also be applied, at various probabilities, to each of the offspring strings produced from crossover (Spears and De Jong, 1991).
  • a gene at position k of an offspring string (x_k) is mutated with probability p_m.
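  • A sketch of uniform mutation applied element-wise to one offspring string, with gene indices drawn from the 1000-gene pool of the Examples:

      % Each element of the string mutates independently with probability pm.
      mask = rand(1, Rmax + 1) < pm;
      if mask(1)
          child(1) = randi([Rmin, Rmax]);            % mutate the number indicator R
      end
      genePos = find(mask(2:end)) + 1;               % positions of mutated gene indices
      child(genePos) = randi([1, 1000], 1, numel(genePos));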
  • M_t tumor samples from the truncated dataset are used as training samples.
  • the remaining M_e tumor samples are used as test samples.
  • the training dataset can be written as the matrix X = [x_ij], with R rows and M_t columns (equation (4)), where element x_ij corresponds to the expression level of gene g_i (see above under "String Representation") in training sample j.
  • T denotes matrix or vector transposition, whereby column vectors become row vectors and vice versa.
  • the basis of the discriminant function is Bayes' rule of maximum likelihood: assign the sample to the class with the highest conditional probability. In a domain of Q possible classes, assign the unknown sample to class q where P(q | e) is largest.
  • P(q | e) is the probability that the unknown sample belongs to class q given that its gene expression data vector equals e.
  • the computation of the discriminant function for class q is based upon two parameters: the class mean vector and the common covariance matrix Σ.
  • the mean expression of gene i in class q is the average of x_ij over all training samples j belonging to class q; these means form the class mean vector.
  • the class covariance matrix Σ_q is thus the covariance matrix of the truncated expression data for all training samples belonging to class q. In essence, its elements describe the "association" between genes among samples belonging to class q.
  • σ_ij(q) denotes the covariance between gene i and gene j in class q.
  • the maximum likelihood (MLHD) classification rule is: assign the unknown sample to class q if f_q(e) > f_r(e) (equation (12)) for all q ≠ r, with q, r ∈ {1, 2, ..., Q}.
  • the covariance matrix for class q, Σ_q, is used to calculate f_q.
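  • The linear (common covariance) variant can be sketched in Matlab® as follows. The discriminant is written in the standard linear form under equal class priors; the exact form of f_q(e) in equation (11) may differ in detail:

      % Xtrain: R-by-Mt expression matrix; ytrain: 1-by-Mt labels in 1..Q;
      % Xtest: R-by-Me matrix of samples to classify.
      function yhat = mlhdClassify(Xtrain, ytrain, Xtest)
          Q = max(ytrain); R = size(Xtrain, 1); Mt = size(Xtrain, 2);
          mu = zeros(R, Q); S = zeros(R, R);
          for q = 1:Q
              Xq = Xtrain(:, ytrain == q);
              mu(:, q) = mean(Xq, 2);                 % class mean vector
              S = S + (size(Xq, 2) - 1) * cov(Xq');   % pooled within-class scatter
          end
          Sigma = S / (Mt - Q);                       % common covariance matrix
          Si = inv(Sigma);     % unreliable if any pair of genes is highly correlated
          yhat = zeros(1, size(Xtest, 2));
          for j = 1:size(Xtest, 2)
              e = Xtest(:, j);
              f = zeros(Q, 1);
              for q = 1:Q
                  f(q) = mu(:, q)' * Si * e - 0.5 * mu(:, q)' * Si * mu(:, q);
              end
              [fbest, yhat(j)] = max(f);              % rule (12): largest f_q wins
          end
      end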
  • 4) and 5) are simplified versions of 2) and 3) respectively. Both assume that the covariance matrix (the common covariance matrix, in the case of 5)) is a diagonal matrix (i.e. only the diagonal elements are non-zero). Hence, 4) and 5) by necessity ignore the correlation (relationship) between any two genes, which we believe to be important for achieving high levels of predictive accuracy.
  • Two methods are used to evaluate the performance of a classifier: leave-one-out cross validation and independent test.
  • one sample from the training set is excluded, and a new MLHD classifier is built using the remaining M_t − 1 training samples.
  • the new classifier is totally blind to that excluded sample.
  • the classifier is used to classify the left-out sample, and the procedure is then iteratively repeated for all of the M t training samples.
  • for the independent test, the original MLHD classifier is built using all M_t training samples, and is used to classify the M_e independent test samples.
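  • Both evaluation procedures can be sketched around the mlhdClassify function above (carrying over the assumptions of that sketch):

      % Leave-one-out cross validation.
      Mt = size(Xtrain, 2); errors = 0;
      for j = 1:Mt
          keep = [1:j-1, j+1:Mt];                     % all training samples except j
          yhat = mlhdClassify(Xtrain(:, keep), ytrain(keep), Xtrain(:, j));
          errors = errors + (yhat ~= ytrain(j));
      end
      cvErrorRate = errors / Mt;

      % Independent test: train once on all Mt samples, classify the Me test samples.
      yhat = mlhdClassify(Xtrain, ytrain, Xtest);
      testErrorRate = mean(yhat ~= ytest);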
  • the test error rate is the proportion of the M_e independent test samples that are misclassified.
  • the first is the gene selection method employed by Dudoit et al. (2000), in which genes are ranked on the basis of the ratio of between-groups to within-groups sum of squares (BSS/WSS).
  • the ratio for gene i is BSS_i / WSS_i = [ Σ_q M_q (x̄_iq − x̄_i)² ] / [ Σ_q Σ_{j in class q} (x_ij − x̄_iq)² ], where the sums over q run over the Q classes, M_q is the number of training samples in class q, x̄_iq is the mean expression of gene i within class q, and x̄_i is its mean over all M_t training samples.
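  • A direct Matlab® transcription of this ratio for a single gene (x holds that gene's expression across the training samples, y the class labels in 1..Q):

      function ratio = bssWss(x, y)
          xbar = mean(x);
          bss = 0; wss = 0;
          for q = 1:max(y)
              xq = x(y == q);                               % samples in class q
              bss = bss + numel(xq) * (mean(xq) - xbar)^2;  % between-groups term
              wss = wss + sum((xq - mean(xq)).^2);          % within-groups term
          end
          ratio = bss / wss;
      end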
  • the second and third methods adapt the binary-class signal-to-noise (S2N) ratio (Golub et al., 1999) to the multi-class setting; the second does so in a one-versus-all (OVA) fashion.
  • the third method applies the all-pairs (AP) approach, in which Q(Q − 1) sets of predictor genes are formed. For each pair-wise class combination, one set of positively correlated genes and another set of negatively correlated genes are formed. As in the previous method, one top positively correlated gene and one top negatively correlated gene were selected for each pair-wise class combination. This brought the number of chosen genes to 72.
  • Figures 2a-e are surface plots showing the effects of varying the crossover probability p_c and mutation probability p_m on the maximum fitness values achieved in a GA run using uniform crossover and stochastic universal sampling. Although the small number of available data points prevents a definitive conclusion, the trends of the plots make it possible to propose that a relatively high crossover probability (p_c > 0.8) and a mutation rate of about 2 × 10⁻³ or above would be likely to produce a good predictor set.
  • Table la-e also shows the best predictor sets obtained for each range of predictor set size using different crossover and selection methods. Based on fitness and both cross validation and test error rates, the results show that in general for the NCI60 dataset the ranges [11, 15] and [16, 20] give the highest classification accuracies.
  • the best predictor set obtained using the GA-based selection scheme exhibited a cross validation error rate of 14.63% and an independent test error rate of 5% (Table 1b, line 1; see Table 2 for specific misclassifications). This is a significant improvement in accuracy compared with the other methodologies assessed by Dudoit et al. (2000), where the lowest independent test error rate was reported as 19%.
  • the GA-based approach avoids reliance upon a rank-based gene selection scheme.
  • from Table 3a it is clear that the rank-based methods do not identify the majority of the genes selected by the GA-based gene selector, especially when the predictor set size is similar to the number of genes in the best GA predictor set (i.e. 13).
  • the BSS/WSS rank-based method comes closest by managing to select 6 of the 13 GA-based predictor genes, but it is only able to achieve this when the top 100 genes are selected, and not the top 20.
  • the genes corresponding to the optimal GA-based predictor set are listed in Table 3b.
  • the GA/MLHD classifier is able to achieve these levels of predictive accuracy without relying on massive numbers of genes (16063 genes for OVA/SVM versus 32 genes for GA/MLHD, or an approximately 500-fold reduction) .
  • These data suggest that the application of the GA/MLHD classifier may be especially useful in the arena of molecular diagnostics, in which a major aim is the selection of minimal identifier gene sets which nevertheless offer high predictive accuracies.
  • a measure is typically calculated for each gene based on a function that determines the "predictiveness" of the gene. Then, after sorting the genes based on their measures or scores, the top R genes are selected to form a subset of predictive R genes.
  • This strategy, coupled with a weighted voting system, was first introduced by Golub et al. (1999), with the S2N ratio used as the score, and appears to work quite well for the prediction of binary class datasets. However, expanding the weighted voting method into multi-class scenarios, as implemented by Yeang et al. (2001), did not produce the same satisfactory results. Similarly, the LIK score as described in Keller et al. (2000) did not appear suited to datasets with larger numbers of classes.
  • predictor genes obtained through rank-based selection methods typically fulfill the first criterion but not the second. This is particularly true for the weighted voting method.
  • features selected for inclusion in a predictive set must by necessity not correlate with each other. This is because if there exists even a single correlated pair of genes in the predictor set, the classification results will be unreliable, since the covariance matrix in this case would be singular or close to singular, and hence not invertible. (Inversion of the covariance matrix is essential to calculate the discriminant function.) Therefore the predictor sets obtained through our method contain genes with no or very low correlation to each other.
  • the GA/KNN strategy may not be optimal for multi-class scenarios for several reasons. Firstly, most of the distance metrics (Euclidean, Pearson, etc.) used to determine the k neighbors of a sample invariably become less sensitive as data dimensionality increases (Keller et al. 2000), and samples might also be unclassifiable if no satisfactory majority vote is obtained from the k neighbors. Secondly, the KNN rule, while giving comparatively lower error rates (Dudoit et al. 2000), imposes relatively large memory requirements in order to simultaneously maintain all training data in memory. Thirdly, the powerful tool of crossover was not drawn on by Li et al. (2001) in their GA/KNN method.
  • the resulting covariance matrix, cov(a'), contains two identical rows/columns (the first two rows and the first two columns).
  • in a covariance matrix, which is a symmetrical matrix, rows always mirror the corresponding columns. This is an extreme example because the two rows (genes) are perfectly correlated; hence Matlab® issues the warning "Matrix is singular to working precision." when it tries to calculate the inverse of the covariance matrix, inv(cov(a')).
  • a is transposed as a' because Matlab® treats columns as variables.
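  • The behaviour described above is easy to reproduce:

      a = [1 2 3 4; 1 2 3 4; 5 1 7 2];   % rows are genes; rows 1 and 2 are identical
      C = cov(a')                         % transposed because Matlab treats columns as variables
      inv(C)                              % "Warning: Matrix is singular to working precision."

    The inverse returned in such a case contains infinite entries, which is why individuals containing highly correlated gene pairs receive low or zero fitness values.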
  • Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., et al. 2000.
  • V = eye(size(picked, 1));
  • % disp(i); disp(size(means)); disp(size(mean(temp, 2))); disp(size(temp));
    means(:, i) = mean(temp, 2);
  • testOne = V' * testOne;
    if 0
  • % disp(sprintf('%d', j)); disp(sprintf('picked size')); disp(size(picked)); disp(sprintf('leaveOne size')); disp(size(leaveOne)); disp(sprintf('covss size')); disp(size(covss)); disp(sprintf('pcaED size')); disp(size(pcaED)); disp(sprintf('V size')); disp(size(V)); disp(sprintf('D size')); disp(size(D)); end
  • V = eye(size(picked, 1));
  • % disp(i); disp(size(means)); disp(size(mean(temp, 2))); disp(size(temp));
    means(:, i) = mean(temp, 2);
  • testOne = V' * testOne;
    if 0
  • [xpopn, fitness, meanf, maxf, xopt] = selectGA(fun, popnsize, pc, pm, numgens, ... vlb, vub, options, datas, realClass, i_datas, i_realClass, setNum, direct); disp(sprintf('***************************\n'));
  • % N : combine offspring with best N parents to form next generation
    % options(4) = 0 : roulette wheel selection
  • % Reproduce
    matingpairs = Reproduce(popn, fitness, options);
  • % Mutate
    newpopn = Mutation(offspring, pm, options, gen, numgens, vlb, vub);
  • popn = Initialise(popnsize, vlb, vub);
  • % history = [gen meanfhistory(gen+1, 1) maxfhistory(gen+1, 1)];
  • % Reproduce produces the mating pairs for crossover.
  • % select contains the numbers of each string in popn which has been added to the mating pool
    if options(6) > 0
  • newpopn(mutate) = (choice(mutate)) .* (offspring(mutate) ...
  • % Decode converts a population (popn) of strings from binary to real.
  • % xint(:, i) = newpopn(:, index1:index2) * tenpowers';
  • % factor = (vub - vlb) ./ (10.^lbits - 1);
  • % Scalefitness performs linear scaling on the fitness values and returns the

Abstract

The invention provides a method for identifying, from a pool of parameters whose values are known for a plurality of samples ('training samples') which are classified into multiple classes, a subset of parameters whose values in a sample of unknown class are likely to be predictive of the class of that sample. The invention also provides related uses, computer programs, apparatuses and systems. The invention is particularly applicable to the classification of biological samples on the basis of gene expression data. The classifier is preferably a maximum likelihood classifier.

Description

ANALYSIS OF GENE EXPRESSION DATA FOR MULTI-CLASS
PREDICTION
FIELD OF THE INVENTION
The invention relates to the development of algorithms for assigning a sample to one of several classes. In particular, but not exclusively, the invention relates to the analysis of gene expression data to assign a tissue sample to one of several phenotypic classes, such as tissue origin.
BACKGROUND OF THE INVENTION
The increasing use of DNA microarrays or 'genechips' to generate large-scale gene expression datasets has led to several important statistical and analytical challenges. One area in which this technology is showing exciting promise is the field of molecular diagnosis, where the phenotypic classification of a biological sample is largely based on gene expression data. Examples of such phenotypic classifications include tumor subtypes (lung, colon, etc.) and the prediction of chemotherapy response
(Staunton et al., 2001). A significant advantage of this new approach is that classification schemes based upon molecular data can often detect biological subtypes that have traditionally eluded more conventional approaches (Alizadeh et al. 2000; Bittner et al. 2000).
Several mathematical approaches have been developed to identify and select key predictive genes in an expression dataset for use in biological classification. Currently, the majority of these reports have focused on situations where the expression dataset being analyzed contains only 2 (binary) to 3 major classes (e.g. cancer vs. normal tissue, response to treatment vs. no response) (Alon et al. 1999; Staunton et al. 2001; Li et al. 2001; Zhang et al. 2001). A much more challenging situation occurs, however, when the expression dataset in question contains multiple classes, or when there are classes with few or heterogeneous samples. Under such scenarios (especially when the number of classes exceeds 5), the methodologies that suffice for binary or 3-class (Keller et al. 2000) datasets may not necessarily produce comparable accuracy for larger, more complex, datasets. For example, Dudoit et al. (2000) compared various methods of classification on three microarray datasets: a 3-class lymphoma dataset (Alizadeh et al. 2000), a binary leukemia dataset that also doubles as a 3-class dataset (Golub et al. 1999) and a dataset of cell lines corresponding to 9 tumor types ("NCI60") (Ross et al. 2000). In that report, the best classifiers, while returning test error rates near 0% for the binary and 3-class datasets (i.e. very high success of classification), nevertheless returned a minimum test error rate of 19% for the NCI60 dataset. Examples such as these show that there is a clear need for the development of better algorithms that can effectively analyze multiple-class expression data.
One potential reason, identified by the inventors, for the reduced performance accuracy in the multi-class scenarios is that many currently used approaches rely upon rank-based gene selection schemes (as suggested by Dudoit et al. 2000). The weakness of rank-based selection methods is that while they are good at identifying genes that are strongly correlated to the target phenotype class distinction, they tend to ignore correlations between genes. Class distinctions may be due not only to the expression of individual genes, but also to gene interactions. In addition, previous methodologies also suffer from the constraint that the size of the predictor set has to be specified a priori. Thus, it is unclear whether the number of genes used in the final predictor set is actually the optimal number to generate an accurate class prediction.
The inventors have sought to address these and other problems arising in the area of multi-class prediction.
SUMMARY OF THE INVENTION
The inventors have recognised that, in order to produce an accurate predictor set for multi-class prediction, it is desirable to select the predictor set using a method that is capable of selecting genes that interact (i.e. have inter-dependent expression levels) within any particular class and/or that is capable of selecting genes whose expression levels are uncorrelated with each other across the classes. The latter criterion may also be expressed as excluding genes whose expression levels are highly correlated across the classes. Inclusion of genes showing such correlation across the classes does not add any information to the predictor set.
The inventors have recognised that rank-based classification methods perform poorly according to these two criteria.
The inventors have in particular applied genetic algorithms and discriminant-based classification methods to the analysis and exploration of complex multi-class datasets.
Genetic algorithms (GAs), as introduced by Goldberg (1989), are randomized search and optimization techniques that derive their working principles by analogy to evolution and natural genetics. Because they are aided by large amounts of implicit parallelism (Grefenstette and Baker 1989), GAs are capable of searching for optimal or near-optimal solutions on complex and large landscapes, or spaces of possible solutions. Furthermore, GAs allow searching of these spaces of solutions by considering multiple interacting attributes (parameters) simultaneously, rather than by considering one attribute at a time. Because of these advantages, GAs may represent another useful tool in the classification of biological phenotypes based on gene expression data. Previous reports have described the use of GAs for binary class prediction problems (Li et al., 2001). However, despite their suitability for addressing problems involving large solution spaces, the potential for using GAs in multiple-class prediction settings has to date remained unexplored.
The inventors have used the parallelised searching capability of GAs to design a gene-selection scheme that determines the optimal set of R genes in a multi-class dataset which classifies the samples within the dataset with minimal error. Using this approach, it is possible not only to determine the specific genes that should belong to a predictor set, but also the optimal size of the predictor set, from within a pre-specified range.
This approach is radically different from another GA-based method utilized by Li et al. (2001). Using a common test multi-class dataset, the inventors have found that the GA-based approach delivers a significantly higher level of predictive accuracy than other previously reported methods. In addition, using another multi-class dataset, the inventors have shown that the GA-based approach is also capable of delivering predictive accuracies comparable to other methods while using substantially fewer genes than previously required. The inventors therefore propose that GA-based algorithms may represent a powerful new alternative in the analysis and exploration of complex multi-class datasets.
Put broadly, therefore, in a first general aspect the invention provides: in the identification, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which may be classified into multiple phenotypic classes, of a subset of genes whose expression levels are likely to be predictive of the phenotypic class of a sample of unknown phenotypic class, the use of a gene selection method which: (1) permits the selection of genes that interact within any particular class; and/or (2) selects genes whose expression levels are substantially uncorrelated with each other across the classes; and/or (3) excludes from the subset genes whose expression levels are highly correlated across the classes with another gene of the subset.
Said selection and exclusion need not be absolute, but may reflect a tendency or preference.
Preferably the gene selection method employs a genetic algorithm and/or a discriminant-based classifier. In particular, the discriminant-based classifier preferably employs matrix inversion (generally inversion of a class or common covariance matrix) to avoid selecting genes whose expression levels are correlated across classes. However, other ways of avoiding the selection of genes whose expression levels are correlated across classes are also contemplated. For example, one could simply compare the expression profile of each gene in a candidate subset to all others in that subset, and reject genes that display a correlation coefficient higher than some arbitrary cutoff value. Having found a plurality of genes in a candidate subset with too high a correlation coefficient, all but one could be replaced.
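By way of illustration only, the following Python sketch shows one way such a correlation filter might be applied to a candidate subset; the function name, the cutoff value and the genes-by-samples matrix layout are all assumptions, not part of the exemplified method:

import numpy as np

def filter_correlated(expr, subset, cutoff=0.9):
    # expr: genes x samples expression matrix; subset: candidate gene indices.
    # A gene is kept only if it is not too highly correlated with any gene
    # already kept, so that all but one of a correlated group are rejected.
    kept = []
    for g in subset:
        if all(abs(np.corrcoef(expr[g], expr[k])[0, 1]) <= cutoff for k in kept):
            kept.append(g)
    return kept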
Accordingly, the invention also provides the use of a genetic algorithm and/or a discriminant-based classifier in the identification, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which may be classified into multiple phenotypic classes, of a subset of genes whose expression levels are likely to be predictive of the phenotypic class of a sample of unknown phenotypic class.
In a second general aspect, the invention provides a method for identifying, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which are classified into multiple phenotypic classes, a subset of genes whose expression levels in a sample of unknown phenotypic class are likely to be predictive of the phenotypic class of that sample, the method comprising:
(a) providing a first generation population of individuals, each individual representing a candidate subset of genes;
(b) for each individual, evaluating its fitness to predict the phenotypic class of a sample of unknown phenotypic class by determining its ability to predict the phenotypic class of a plurality of test samples of known phenotypic class;
(c) generating a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b);

(d) repeating steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached; and

(e) selecting the fittest individual, as evaluated in step (b) or step (d), from the final generation or from among the fittest individuals of a plurality of generations.
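A minimal Python sketch of steps (a) to (e) follows, under stated assumptions: the exemplified implementation described later is in Matlab, individuals are encoded as integer strings (see "String Representation" in Example 1), and the fitness and breeding routines (selection, crossing over, mutation) are supplied by the caller. All names are illustrative:

import random

def run_ga(pop_size, n_genes, r_min, r_max, n_generations, fitness, breed):
    # (a) first generation: each individual is [R, g1, ..., g_Rmax].
    population = [[random.randint(r_min, r_max)] +
                  [random.randint(1, n_genes) for _ in range(r_max)]
                  for _ in range(pop_size)]
    best_score, best_ind = None, None
    for _ in range(n_generations):            # (d) repeat until termination
        # (b) evaluate each individual against test samples of known class.
        scored = [(fitness(ind), ind) for ind in population]
        top_score, top_ind = max(scored, key=lambda s: s[0])
        # (e) retain the fittest individual seen in any generation.
        if best_score is None or top_score > best_score:
            best_score, best_ind = top_score, top_ind
        # (c) breed the next generation from fitness-selected individuals.
        population = breed(scored)
    return best_ind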
For the avoidance of doubt, it is hereby stated that the phrase "generating a second generation population of individuals from selected individuals in the first generation population" in no way implies that each individual in the second generation population must be identical to one of the selected individuals in the first generation population. Indeed, the generating step will generally include crossing over and/or mutation, such that at least some individuals in the second generation population may (and usually will) not be identical to any selected individual in the first generation population.
In a third general aspect, the invention provides a computer program, which when run on a computer (usually a general purpose computer) or computer network will carry out the above method, and also includes a computer program, when recorded on a data carrier, for carrying out the above method. This may be stated as a computer memory product having stored thereon at least one digital file, said memory product comprising computer readable memory and said stored digital file or files constituting a program to carry out the above method.
Such a program, for identifying, from a pool of genes whose expression levels are known for a plurality of biological samples ("training samples") which are classified into multiple phenotypic classes, a subset of genes whose expression levels in a sample of unknown phenotypic class are likely to be predictive of the phenotypic class of that sample, can be expected to include code which is effective, when the program is run, to:
(a) provide a first generation population of individuals, each individual representing a candidate subset of genes;
(b) for each individual, evaluate its fitness to predict the phenotypic class of a sample of unknown phenotypic class by determining its ability to predict the phenotypic class of a plurality of test samples of known phenotypic class;
(c) generate a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b) ;
(d) repeat steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached;
(e) select the fittest individual from the final generation or from among the fittest individuals of a plurality of generations; and
(f) display and/or output said fittest individual.
The program will generally include code effective to receive input of stored data representing expression levels of the genes represented by an individual, for evaluating its fitness in step (b) or step (d) .
Possibilities for display or output include graphical or numeric display, output to memory and output to a printer. Furthermore the invention provides apparatus for carrying out the above method, the apparatus including provision for input of stored data representing expression levels of a pool of genes in a plurality of biological samples of known phenotypic class, a processor and a stored program as defined above.
In a further aspect, the invention provides a system for identifying, from a pool of genes whose expression levels are known for a plurality of biological samples
("training samples") which are classified into multiple phenotypic classes, a subset of genes whose expression levels in a sample of unknown phenotypic class are likely to be predictive of the phenotypic class of that sample, the system comprising:
(a) a processing component capable of providing a first generation population of individuals, each individual representing a candidate subset of genes;
(b) a processing component capable, for each individual, of evaluating its fitness to predict the phenotypic class of a sample of unknown phenotypic class by determining its ability to predict the phenotypic class of a plurality of test samples of known phenotypic class;

(c) a processing component capable of generating a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b);

(d) a processing component capable of directing the repetition of steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached;

(e) a processing component capable of selecting the fittest individual, as evaluated in step (b) or step (d), from the final generation or from among the fittest individuals of a plurality of generations; and
(f) a component capable of displaying and/or outputting said fittest individual.
For the avoidance of doubt, the different instances of "processing component" may refer to the same processing component (e.g. a processor programmed to perform the processing steps identified in (a) to (e)) or to different processing components.
The system preferably comprises a memory in which is stored an expression data set representing expression levels for the pool of genes in a plurality of biological samples of known phenotypic class.
The system preferably includes a processing component capable of receiving input of stored data representing expression levels for the genes represented by an individual, for the evaluation of its fitness in step (b) or step (d) .
These and other aspects of the invention are further defined and explained as follows.
DETAILED DESCRIPTION
Step (c) is the kind of procedure which genetic algorithms are designed to perform, and preferably the method uses such an algorithm.
Preferably said selection includes an intermediate step of generating a pool of individuals from the first generation population, wherein the number of copies, or absence, of a first generation individual in the pool depends on its fitness as determined in step (b) . The pool then provides the selected first generation individuals from which the individuals of the second generation are generated (in the following description, a reference to a first generation individual, and especially a selected first generation individual, may therefore be interpreted as a reference to a member of the pool, unless the context requires otherwise) .
Selection according to fitness may involve any suitable technique, for example assigning to the pool a number of copies of each individual which is proportional to the fitness of the individual. However, the technique is preferably probabilistic, for example stochastic universal sampling or roulette wheel selection.
In step (c) , the generation of second generation individuals from the selected individuals of the first generation preferably includes crossing over between and/or mutation of the selected individuals of the first generation.
Crossing over is the exchange of information between typically two individuals ("parent" individuals) to generate new individuals which represent different combinations of genes from the parent individuals.
Mutation is the change of information in an individual, such that it represents a new combination of genes from the unmutated individual.
Where both crossing over and mutation occur, crossing over preferably precedes mutation. The following description refers to both crossing over and mutation occurring on selected first generation individuals, but applies equally to the situation where e.g. mutation follows crossing over. In relation to mutation, therefore, references to selected first generation individuals may therefore also be interpreted to include the individuals resulting from earlier crossing over between selected first generation individuals, when such crossing over has occurred before the mutation step. If crossing over has not occurred, e.g. because a crossing over step is absent or because a probability level for crossing over between two individuals has not been met, then mutation will occur on first generation individuals (or copies thereof) .
Preferably, all selected first generation individuals are subject in pairs to crossing over, preferably at a defined probability level, pc, which may be 1 but may also be less than one. Typically 0.7 ≤ pc ≤ 1. The matching together of pairs of individuals for crossing over is preferably random.
Preferably, all selected first generation individuals are subject to mutation, preferably at a defined probability level, pm. (As explained above, where mutation follows crossing over, the term "selected first generation individuals" includes individuals resulting from crossing over between selected first generation individuals, as well as selected first generation individuals that are unchanged because pc was not met, so no crossing over occurred for such individuals.)
The probability function may be applied to individuals (in which case, when an individual satisfies the probability function, it will contain a single mutation, preferably at random) . Preferably, however, the probability function is applied separately to each representation of a gene in each individual (using the terminology introduced below, to each gene indicator) . Especially in the latter case, pm is preferably much less than one, typically 0.0005 ≤ pm ≤ 0.01. Mutation is preferably uniform, i.e. the gene replacing a mutated gene is preferably selected with equal probability from the pool of genes. However, it may be non-uniform (or dynamic), i.e. the group of genes from which the gene replacing a mutated gene is selected varies in size according to the generation of the individual being mutated. In such a case, the group preferably decreases in size with increasing generations.
Preferably different individuals represent different numbers of genes, preferably within defined limits. The optimal number of genes for phenotypic classification of samples will depend on numerous factors, including the number of classes, the number of genes for which expression data is known for samples of known phenotypic class, and the variability in and correlation between expression levels. Accordingly, the optimal number of genes cannot easily be determined a priori. Preferably the method is repeated, using different defined limits in different runs.
The use of a genetic algorithm on a population including individuals representing different numbers of genes allows optimisation of the number of genes used for classification.
Each individual may comprise a sequence (or "string") of gene indicators, usually numbers, wherein different gene indicators represent different genes. Where different individuals in a population represent different numbers of genes, each individual in the population may have the same number of gene indicators and further comprise a number indicator which specifies (usually within defined limits) the number of genes the individual represents. Where the number of genes an individual represents is less than the maximum number of genes represented by the population, a number of gene indicators will be redundant in the individual. This method of representing individuals, however, allows convenient crossing over between individuals representing different numbers of genes and allows a mutation to affect the number of genes represented by an individual (see below) , as well as which genes are represented. It also allows the mutation of and/or crossing over between genes which are redundant in the individual, but which may cease to be redundant in a later generation if the number indicator is changed by subsequent crossing over or mutation.
In embodiments in which individuals comprise a sequence of indicators, crossing over may be uniform or one-point. In one-point crossing over, a number of consecutive indicators are exchanged between parent individuals to generate child individuals. In uniform crossing over, each point in the sequence is independent of the others, so the effect is the same as a random number of one-point crossings over between the same two sequences.
In relevant embodiments, mutation may occur only within the gene indicators, or may also affect the number indicator. Mutation of an indicator may occur by replacing the mutated indicator with a randomly generated indicator. This will occasionally result in an indicator being replaced with an identical indicator. This is considered to fall within the definition of mutation. Alternatively, the possibility of replacement with an identical indicator may be excluded.
Mutation and crossing over may result in individuals having duplicate copies of the same gene indicator. This possibility may be excluded (e.g. by screening for duplicate gene indicators within individuals and replacing duplicates with randomly generated indicators) . For simplicity, however, duplicates are preferably ignored. It should be noted, however, that the preferred use of matrix inversion (explained in detail below) causes individuals that represent duplicate gene indicators to have zero fitness value. Such individuals will therefore not be selected for generating later generations .
The first generation population is preferably randomly generated. Alternatively, it may comprise selected individuals, e.g. the fittest individuals from earlier runs of the method (or other similar methods) .
The termination condition is preferably satisfied when a defined number of generations have been generated and evaluated. Other termination conditions may, however, depend on measures of maximum or average fitness, e.g. as described in the background. For example, the termination condition may be satisfied when average and/or maximum fitness of all individuals of a generation is no longer increasing (or no longer increasing at a defined rate) between successive generations.
In step (b), the prediction preferably employs discriminant-based, rather than rank-based, classification; more preferably classification that selects (i.e. performs with higher accuracy on) individuals which represent genes whose expression levels are correlated to class distinction but uncorrelated to each other. This may be stated as classification that selects individuals which represent genes whose expression levels are correlated to particular class distinctions and are uncorrelated to each other across the classes. More preferably the prediction employs a maximum likelihood classifier (MLHD), more preferably a linear discriminant function.
More particularly, classification preferably involves the production, for each individual, of a discriminant function which assigns a biological sample to a phenotypic class, based on the expression levels in that biological sample of the genes represented by that individual. Generally, the discriminant function is produced on the basis of the expression levels, in the training samples, of the genes represented by the individual. Preferably the fitness of each individual is then evaluated by determining the error rate of the discriminant function for that individual in classifying the test samples (whose phenotypic classes are known, but which are not used in the production of the discriminant function) .
Preferably the MLHD classifier operates via the following steps:
(1) for each class, defining a typical expression profile of a member of that class. This is based on the expression levels, for the genes represented by the individual under consideration, in the training samples of that class. Preferably this involves averaging the expression levels, preferably taking their mean, to produce a class mean vector;
(2) for each class, producing a measure of the spread or variation in expression levels between members of that class. Again, this is based on the expression levels, for the genes represented by the individual under consideration, in the training samples of that class. Any suitable statistical measure of spread may be used, but covariance is preferred, preferably to produce a class covariance matrix (that is, a matrix of the covariance in expression levels between all the genes represented by the individual under consideration, among all the training samples of that class) ;
(3) optionally averaging the class measures of spread to produce a common measure of spread (again for each individual) , preferably a common covariance matrix. This serves to reduce the effect of anomalous data points and is based on the assumption that the manner of covariation in the expression levels of the genes in a particular individual is essentially the same across all classes. This will generally be similar to the individual class measures of spread. Classification based on the quadratic discriminant rule (see below) , however, does not require this step; and
(4) for each test sample, comparing its expression profile with the typical expression profile for each class (e.g. the class mean vector), taking into account the spread as determined for that class (as in step (2) , e.g. the class covariance matrix), or (preferably) as averaged across all classes (as in step (3), e.g. the common covariance matrix) , and assigning the test sample to the class to which its expression profile matches most closely.
Preferably the comparison in step (4) involves inversion of the class / common covariance matrix. This is effected in the Examples by the term Σ^-1 in the calculation of fq(e) in equation (11). High correlation, even between just a pair of genes, causes the covariance matrix to be badly conditioned (approaching singularity). In turn, this causes the inverted matrix to be unreliable (its entries approach infinity). Both of these conditions can be easily identified (see e.g. Example 2). The exemplified classification method will not select individuals with highly correlated genes, because such individuals are given low fitness values owing to the unreliability of the inverted covariance matrix.
The discriminant function described in detail herein will give a high number if the match between a test sample and a candidate class is good and a low number if the match is poor; the test sample will be assigned to the class which gives the highest number.
Fitness may be evaluated by cross-validation and/or independent test, preferably a combination of both. When fitness is evaluated by independent test, the discriminant-based classifier (i.e. in preferred embodiments the MLHD) is produced on the basis of a number of training samples and then tested against a number of test samples, which do not overlap with the training samples. There are preferably more training samples than test samples, more preferably at a ratio of about 2:1 (e.g. in the range 1.5:1 to 3:1, more preferably 1.75:1 to 2.5:1, more preferably 1.9:1 to 2.1:1) .
When fitness is evaluated by cross-validation, the same set of samples is used as both training samples and test samples, by producing the discriminant-based classifier (i.e. in preferred embodiments the MLHD) from one subset of the samples (the training subset), testing it against the remaining sample(s), and then repeating the procedure with different training subsets. Preferably each MLHD (or other discriminant-based) classifier is produced using a subset containing all but one of the samples, and tested against the remaining sample; more preferably an MLHD (or other discriminant-based) classifier is produced for all possible such subsets, so that cross-validation against every sample occurs.
Fitness as determined by independent test may be accorded higher significance than fitness as determined by cross validation, especially if a small dataset is used.
Of course, once the fittest individual has been selected using the method of the invention, it may be used in the same way to classify samples of unknown phenotypic class.
Following the identification of a subset of genes according to the invention, the subset is preferably used without further change in the classification of samples of unknown phenotypic class. This is in contrast to previous uses of genetic algorithms in this area (see e.g. Li et al., 2001), in which the genes identified most often by genetic algorithm-based selection over several runs were then grouped, and this group (not a group selected directly using a genetic algorithm) was used for the classification of samples.
Usually the number of phenotypic classes into which the biological samples may be classified will be at least 4 or 5, preferably at least 6, 7 or 8, more preferably 9 or more.
The term "phenotypic class" is intended to be interpreted in a broad sense, to encompass any desired classification of biological samples, usually based on observable or testable characteristics or responses of the samples. In particular, the phenotypic classes may represent different sources of the biological samples, e.g. tissue sources, especially tissue origin of tumour samples (e.g. secondary tumour samples) , or may represent different pathologies, or different responses to external factors, such as chemotherapy or radiotherapy, or exposure to environmental factors.
The term "biological sample" is also intended to be interpreted in a broad sense, to include any biological sample from which a measurement of gene expression levels can be obtained, whether directly (e.g. by determination of mRNA or protein levels) or indirectly. Examples of biological samples include tissue samples such as biopsies, cells, blood and other bodily fluids
(especially bodily fluids containing cells) , and extracts of any of the aforementioned samples.
The terms "gene" and "expression level" are used in a broad sense, such that the expression level of a gene may be determined in a variety of ways. Typically, determination of the expression level of a gene will involve nucleic acid hybridisation (typically hybridisation of mRNA from a sample to an array of cDNA molecules) . However, other means of determining expression levels are also contemplated and include, for example, determination of protein levels in a sample, e.g. by binding of protein from the sample to an array of specific binding partners, such as antibodies or antibody fragments.
Indeed, in broader aspects that otherwise correspond to those aspects defined above, the invention is not limited to the analysis of genetic data, nor even to the analysis of biological data (though this is preferred) and may relate to the classification of any complex data. An example is the analysis of protein and/or peptide patterns from a mass spectrum. In such broader aspects, therefore, the term "gene" as used above in the definitions of aspects of the invention may be replaced by "parameter"; similarly, "expression level" may be replaced by "value", "biological sample" by "sample", and "phenotypic class" by "class". Samples may be physical or abstract.
The methods of the invention may include a preprocessing step in which the size of the pool of parameters (e.g. genes) to be used in the method is reduced from an initial larger pool. Such reduction may exclude parameters for which values (e.g. expression levels) are unavailable for some of the training and/or test samples. In particular, the reduction may exclude parameters for which the variation in values across the training and/or test samples is lower than for other parameters. Parameters may be excluded if the variation is less than a threshold level. Alternatively, a predefined number of parameters are included, being those parameters having the highest variation. Variation may be measured by standard deviation or any other conventional measure of variation. Wherever possible, the values are first normalised, preferably by reference to control values (where available) .
It has been found that such preprocessing leads to greater fitness in the individuals (i.e. predictive parameter sets) obtained using the methods of the invention, presumably resulting from the reduction in "noise" in the values. It also leads to a reduction in the processing power or time required to carry out the methods of the invention.
Embodiments of the invention will now be described in detail, by way of example only, with reference to the accompanying figures. Further aspects and embodiments will be apparent to those skilled in the art. All documents mentioned in this text are incorporated herein by reference in their entirety and for all purposes.

BRIEF DESCRIPTION OF THE FIGURES
Figure 1 is a schematic illustration of genetic algorithms (GAs) , stochastic universal sampling, and examples of crossovers:
(a) Basic steps in GAs showing population initialization, fitness selection, crossover, mutation, and termination;
(b) A simple example of stochastic universal sampling where N is 10. Arrows indicate equally spaced pointers that subdivide the line segment into N = 10 individual units. In this instance, S1 and S2, which are of high fitness, each have two copies, while no copy of S9 and S10 (low-fitness individuals) is chosen to enter the mating pool;
(c) An example of one-point crossover;
(d) An example of uniform crossover.
Figure 2 shows the effects of varying GA conditions on performance accuracy. Fitness is depicted as 'height' for all plots.
(a) Effects of pm and pc on fitness for predictor set size [5,10], uniform crossover, stochastic universal sampling;
(b) Effects of pm and pc on fitness for predictor set size [11,15], uniform crossover, stochastic universal sampling;

(c) Effects of pm and pc on fitness for predictor set size [16,20], uniform crossover, stochastic universal sampling;
(d) Effects of pm and pc on fitness for predictor set size [21,25], uniform crossover, stochastic universal sampling;
(e) Effects of pm and pc on fitness for predictor set size [26,30], uniform crossover, stochastic universal sampling.
Figure 3 shows a comparison of expression profiles of predictor sets obtained through different methodologies. Columns represent different class distinctions, and only training set samples are depicted. Arrows depict genes which have highly correlated expression patterns across the sample classes. Classes are labeled as follows: BR (breast) , CN (central nervous system) , CL (colon) , LE (leukemia) , ME (melanoma) , NS (non-small-cell lung carcinoma) , OV (ovarian) , RE (renal) and PR (reproductive system) .
(a) Expression profile of genes selected through the GA/MLHD method (only genes for the best predictor set are shown) ;
(b) Expression profile of 20 genes selected through the BSS/WSS ratio ranking method;
(c) Expression profile of 18 genes selected through the OVA/S2N ratio ranking method;
(d) Expression profile of 72 genes selected through the All-Pairs/S2N ratio ranking method.

EXAMPLE 1 - EXAMPLE OF MULTI-CLASS PREDICTION USING A GENETIC ALGORITHM- AND DISCRIMINANT FUNCTION-BASED CLASSIFICATION METHOD
Introduction
Overview of GAs
The basic concept of a GA is first briefly introduced in Figure 1a. (Readers are referred to Holland (1992) for more comprehensive descriptions.) Generally, in GAs, each potential solution to a problem is represented in the form of a string. Each string contains encoded parameters of the solution it represents. A string is dubbed a chromosome or an individual, while a parameter is also called an attribute. A pool of such strings forms a population. Initially, a population is initialized by creating (usually randomly) a series of strings such that each string represents a point in solution space. A fitness function is then defined to measure the degree of goodness of a string (i.e. how good the string is as a solution to the problem). Thus, the fitness value associated with a string indicates the optimality of the solution that the string represents.
A selection process is then applied to the strings, based on the principle of 'survival of the fittest'. In this process, a few strings are chosen from the total population. Each chosen string is then assigned a number of copies to go into the mating pool based on its fitness value.
Next, by applying genetic operators (crossover and mutation) to the strings in the mating pool, a new population of strings is formed for the next generation. The processes of selection, crossover and mutation are repeated in each subsequent generation until a termination condition is satisfied. Each generation is evaluated by two parameters: the average and the maximum fitness values of the population at that generation. Commonly used termination conditions include defining the maximum number of generations, or the algorithm may be compelled to terminate when there is no significant improvement in the average or maximum fitness value of the population.
Overview of System Methodology and Implementation
The overall classification method used in the GA-based gene selection scheme essentially consists of two main components: (1) a GA-based gene selector which implements the gene selection scheme and (2) a Maximum Likelihood (MLHD) classifier. The first component generates and evolves populations of sets (herein termed "individuals" or "strings") of R genes that are used to classify the samples, where the value of R lies in the pre-specified range [Rmin, Rmax]. The actual classification process (i.e. assigning a sample to one of the classes) is performed by the second component, by applying the MLHD classifier to the group of genes specified by a particular individual (or string). Each individual in the population represents a specific gene predictor set, and a fitness function is used to determine the performance accuracy of a predictor set in classifying a dataset.
A typical run of the algorithm covers G generations, where each generation consists of selection, crossover and mutation operations. The time complexity is estimated as O(NMR^3) when there are M samples, the population size is set as N, and the predictor set size is chosen as R. That is, the time complexity is proportional to N, to M and to R^3. Since R varies between Rmin and Rmax in a run, a more realistic estimate of the time complexity is obtained by using the average of the two boundary values. Thus, the time complexity is taken to be approximately O(NMRaver^3), where Raver is the average of Rmin and Rmax.
Summary
The inventors tested their GA-based gene selection methodology on a gene expression dataset containing 9 classes ("NCI60"), which previous methodologies have had difficulty classifying (see Background). Multiple runs were conducted, where in each run both the population size, N, and the maximum number of generations, G, were set at 100. To observe how different gene selection conditions might affect the performance characteristics of the GA-based methodology, the following factors were varied in different runs:
a. Crossover method: one-point or uniform;
b. Probability of crossover, pc: 0.7, 0.8, 0.9 or 1.0;
c. Probability of mutation, pm: 0.0005, 0.001, 0.002, 0.0025, 0.005 or 0.01;
d. Selection method: stochastic universal sampling or roulette wheel;
e. Range of predictor set size (number of genes to be selected) [Rmin, Rmax]: [5, 10], [11, 15], [16, 20], [21, 25] or [26, 30].
For each predictor set size range, there are thus 96 different runs (corresponding to combinations of conditions (a)-(d)). Upon completion of a run, the best strings from each individual generation in the run are then collectively compared to find the string with the best overall fitness. For example, in this case, at the end of each run there are G = 100 optimal predictor sets, i.e. individuals with the highest fitness in their particular generation, and these individuals are then compared with each other. For the NCI60 dataset, there appeared to be a consistent trade-off between cross validation and test error rate, so a sorting technique was introduced to obtain the most optimal predictor set that represents the best compromise. The G optimal individuals were sorted according to the following criteria:

ascending by the total number of misclassified samples in both cross validation and independent tests, χc + χI, then

ascending by the number of misclassified samples in the independent test, χI, then

ascending by the number of misclassified samples in the cross validation test, χc.
Thus, in these criteria, to avoid over-fitting of the classifier to the training data, we assigned a slightly lower importance to cross validation accuracy than to independent test accuracy.
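In Python, assuming each of the G per-generation optima is stored together with its misclassification counts (the names below are hypothetical), this three-key sort can be expressed directly, since tuples sort lexicographically:

# optima: list of (individual, chi_c, chi_i), where chi_c and chi_i are the
# numbers of misclassified samples in cross validation and independent test.
optima.sort(key=lambda e: (e[1] + e[2], e[2], e[1]))
best_overall = optima[0]   # the best compromise between the two error rates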
In addition to the above runs, a pm value of 0.02 was tested for one-point crossover and roulette wheel selection method runs only.
The classification method was implemented using Matlab® 5.3. Source code is included in the Annexes.

Methods
Dataset and Data Preprocessing
The NCI60 gene expression dataset was first described in Ross et al. (2000), and contains the gene expression profiles of 64 cancer cell lines. This work focuses on the NCI60 dataset because previous work on tumor classification did not achieve satisfactory classification accuracy with this dataset compared to other, binary datasets (see Introduction). The dataset was downloaded from http://genome-www.stanford.edu/sutech/download/nci60/dross_arrays_nci60.tgz, and specifically consists of the gene expression profiles of 64 cell lines as measured by cDNA microarrays containing 9703 spotted cDNA sequences. For the purposes of this report, the single unknown cell line and the two prostate cell lines, due to their small number, are excluded from analysis. This leaves a total of 61 cell lines with 9 sites of origin: breast (7), central nervous system (6), colon (7), leukemia (6), melanoma (8), non-small-cell lung carcinoma or NSCLC (9), ovarian (6), renal (8) and reproductive (4). For this dataset, we aimed to develop a methodology capable of classifying tumor cell lines (hereafter also referred to as samples) based on their site of origin.
During data preprocessing, certain spots were excluded due to the presence of missing data fields. Control and empty spots were also excluded. After preprocessing, 6167 genes remained in the dataset. For each array, the expression data of each spot was normalized by subtracting the mean of the Cy5/Cy3 ratio of the control spots, μcontrol, from the Cy5/Cy3 ratio of each spot, and dividing the result by the standard deviation of the Cy5/Cy3 ratio of the control spots, σcontrol. In our analysis, we used a truncated dataset containing the 1000 genes with the highest standard deviation in expression level (ranging from 0.8112 to 2.421), selected from the genes remaining after preprocessing. Genes in the truncated dataset are numbered from 1 to 1000, and for the remainder of this work these genes will be referred to by their index numbers ranging from 1 to 1000.
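A minimal Python sketch of this preprocessing, assuming numpy, a spots-by-arrays matrix of Cy5/Cy3 ratios and a boolean mask marking the control spots (all names illustrative):

import numpy as np

def preprocess(ratios, is_control, n_keep=1000):
    # Normalize each array by the mean and standard deviation of its control spots.
    mu = ratios[is_control].mean(axis=0)
    sigma = ratios[is_control].std(axis=0)
    normed = (ratios[~is_control] - mu) / sigma
    # Keep the n_keep genes with the highest standard deviation of expression.
    order = np.argsort(normed.std(axis=1))[::-1]
    return normed[order[:n_keep]]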
The second dataset ('GCM') (Ramaswamy et al., 2001) was downloaded from http://www-genome.wi.mit.edu/mpr/publications/projects/Global Cancer Map/ and contains the expression profiles of 198 primary tumor samples, 20 poorly differentiated adenocarcinomas and 90 normal tissue samples as measured by Affymetrix GeneChips containing 16063 genes and ESTs. Only the 198 primary tumor samples are considered in this work, to make the results comparable to those reported by Ramaswamy et al. (2001). The 198 samples originate from 14 sites of origin: prostate (14), bladder (11), melanoma (10), uterine (10), leukemia (30), breast (12), colorectal (12), renal (11), ovarian (12), pancreatic (11), lung (12), lymphoma (22), central nervous system (20) and pleural mesothelioma (11). The data was preprocessed as above to generate a truncated 1000-gene dataset, with standard deviations ranging from 0.299 to 3.089.
String Representation
In our methodology, each string is a sequence of integers where each integer represents the index of a gene from the truncated dataset. Therefore the range of each element/integer in the string is [1, 1000]. The length of a chromosome (string) is Rmax+1. For the task of finding the best set of predictive genes consisting of Rmin to Rmax genes (where Rmin and Rmax are prespecified), a typical string would be

[R g1 g2 ... gRmax]

The first element in the string, R, denotes the size of the predictive set represented by the string, while the subsequent elements g1, g2, ..., gRmax represent the indices of a subset of genes picked from the truncated 1000-gene dataset. Thus, the string connotes a set of R predictive genes indexed g1, g2, ..., gR. Only the first R of the Rmax genes included in the string are used for classification, where R ranges from Rmin to Rmax. For example, for Rmin = 5 and Rmax = 10, the string [7 258 46 293 144 336 212 36 63 874 8] represents a potential solution set containing 7 predictive genes, namely the genes numbered 258, 46, 293, 144, 336, 212 and 36.
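To make the encoding concrete, a short Python sketch (with a hypothetical helper name) of how the active genes are read from a string:

def genes_from_string(s):
    # s = [R, g1, g2, ..., g_Rmax]; only the first R gene indices are active.
    return s[1:1 + s[0]]

# genes_from_string([7, 258, 46, 293, 144, 336, 212, 36, 63, 874, 8])
# returns [258, 46, 293, 144, 336, 212, 36]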
Initialization
An initial population of potential solutions is formed by creating N random strings, where the population size N is prespecified. A random string is produced by generating one random integer (representing R) in the range [Rmin, Rmax] plus Rmax random integers (the gene indices) in the range [1, 1000] to be used as its attributes.
Fitness Function
The fitness value of each string in the population is then computed. For this work, the fitness function is defined as:

f(Si) = 200 − (Ec + EI) (1)

where Ec = cross validation error rate, and EI = independent test error rate.

The two error rates are obtained by classifying the samples using the genes contained in string Si as variables with an MLHD classifier. The specifics of the MLHD classifier are described in subsequent sections.
Selection
Two general methods can be used to select the strings entering the mating pool: i) stochastic universal sampling and ii) roulette wheel selection (described in detail by Goldberg and Deb (1991)). Both methods were assessed. For illustration, the stochastic universal sampling method is depicted in Figure 1b. According to Baker (1987), this method provides zero bias and minimum spread. The strings are first mapped to contiguous segments of a line with unit length, such that each string's segment is proportional in length to its fitness value. Equally spaced pointers are placed over the line, with the number of pointers equaling the number of strings to be selected. In standard replacement schemes, the number of strings chosen is equal to the population size N, and thus the distance between the pointers is 1/N. The position of the first pointer is given by a randomly generated number in the range [0, 1/N]. The number of copies of a string Si entering the mating pool is determined by the number of pointers that fall within Si's segment along the line. As a result, strings with high fitness get more copies of themselves in the mating pool than strings with lower fitness. The roulette wheel selection method is a simpler method in which a single pointer is placed at a random position along the line N times; in each placement, the string Si on whose segment the pointer falls is selected to enter the mating pool.
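A Python sketch of stochastic universal sampling as described above (function and variable names are illustrative; fitness values are assumed non-negative):

import random

def sus_select(fitnesses, n):
    # Map strings to contiguous segments proportional to fitness, then
    # sweep n equally spaced pointers from a random start in [0, total/n].
    total = sum(fitnesses)
    step = total / n
    pointer = random.uniform(0, step)
    chosen, cum, i = [], 0.0, -1
    for _ in range(n):
        while cum < pointer:
            i += 1
            cum += fitnesses[i]
        chosen.append(i)        # a copy of string i enters the mating pool
        pointer += step
    return chosen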
Crossover

Crossover is illustrated in Figure 1c and is performed by first randomly choosing a pair of strings from the mating pool and then applying a crossover operation on the selected string pair with a fixed probability pc. The outcome of this operation is two offspring strings, produced through the exchange of genetic information between the two parents. This probabilistic process is repeated until all parent strings in the mating pool have been considered. In the GA-based gene selector, we assess two types of crossover operations: one-point and uniform crossover (described in Haupt and Haupt, 1998).
In one-point crossover, an integer position k in the range [1, Rmax] is chosen at random along the string. Two child strings are produced by swapping all genes between positions k+1 and Rmax+1. An example where Rmax = 10 and k = 8 is shown in Figure 1c. In uniform crossover, for each pair of parent strings a crossover template is randomly generated. The template is a binary string of the same length as the strings, Rmax+1. Each element of the template determines which child receives the gene from which parent: if the element of the template at position k is 1, child 1 receives the gene at position k from parent 1 while child 2 receives the gene at position k from parent 2. However, if the element at position k is 0, then child 1 receives the gene at position k from parent 2 while child 2 receives the gene at position k from parent 1. An example of uniform crossover is illustrated in Figure 1d.
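Python sketches of the two crossover operations on this string encoding (names illustrative; strings are plain lists of integers):

import random

def one_point_crossover(p1, p2):
    # Choose k in [1, Rmax] and swap all positions after k.
    k = random.randint(1, len(p1) - 1)
    return p1[:k] + p2[k:], p2[:k] + p1[k:]

def uniform_crossover(p1, p2):
    # A random binary template decides which parent supplies each position.
    t = [random.randint(0, 1) for _ in p1]
    c1 = [a if bit == 1 else b for bit, a, b in zip(t, p1, p2)]
    c2 = [b if bit == 1 else a for bit, a, b in zip(t, p1, p2)]
    return c1, c2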
Mutation

Mutation operations can also be applied at various probabilities to each of the offspring strings produced from crossover (Spears and De Jong, 1991). In uniform mutation, a gene at position k in the offspring string (xk) is mutated with probability pm. The mutated gene, xk', is chosen with equal probability from either of two choices:

xk' = xk + r(bk − xk) (2)

or

xk' = xk − r(xk − ak) (3)

with ak and bk representing the lower and upper bounds (1 and 1000) of the gene indices respectively, and r being a random number generated uniformly in the range [0, 1].

The effect of equations (2) and (3) is that the mutated gene xk' is randomly selected from among the genes of the truncated dataset (and so can in fact result in no mutation when xk' = xk).
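A Python sketch of uniform mutation according to equations (2) and (3), applied here to the gene indicator positions only (the handling of the number indicator, if mutated, would be analogous; names illustrative):

import random

def mutate_uniform(s, pm, a=1, b=1000):
    out = list(s)
    for k in range(1, len(out)):        # positions holding gene indices
        if random.random() < pm:
            x = out[k]
            r = random.random()         # r uniform in [0, 1]
            if random.random() < 0.5:
                out[k] = round(x + r * (b - x))   # equation (2)
            else:
                out[k] = round(x - r * (x - a))   # equation (3)
    return out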
An alternative, but less preferred, mutation strategy uses non-uniform (or dynamic) mutation, in which the equivalents to formulae (2) and (3) above are:

xk' = xk + r(bk − xk)(1 − g/G)^b

and

xk' = xk − r(xk − ak)(1 − g/G)^b

where g is the number of the current generation and G is the maximum number of generations. Parameter b determines the degree of non-uniformity. If b is zero, these formulae become the formulae (2) and (3) given above for uniform mutation, since the terms in the last bracket will then be eliminated.

Termination
The processes of evaluation, selection, crossover and mutation (which constitute a generation) are repeated for a number, G, of generations. After the run is complete, the string with the best fitness of all generations is output as the solution. Due to the inherent random nature of crossover and mutation, it is important to note that the string with the best overall fitness in a particular run may not necessarily correspond to the 'best' string in the final or last generation. Thus, it is preferred to compare all the 'best' strings from each individual generation to one another, to determine the ultimate string with the best overall fitness.
MLHD Classifier
A brief description of the maximum likelihood classifier used in the gene selection scheme is given in this section. For further details on discriminant functions, we refer the reader to James (1985) .
In order to build an MLHD classifier, a total of Mt tumor samples from the truncated dataset are used as training samples. The remaining Me tumor samples are used as test samples. For the NCI60 dataset, the ratio between Mt and Me is 2:1, following the one used by Dudoit et al. (2000), while for the GCM dataset, Mt = 144 and Me = 54 (a ratio of 2.67:1) (Ramaswamy et al., 2001).
For a given string, the training dataset can be written as the R × Mt matrix

T = [xij], i = 1, ..., R; j = 1, ..., Mt (4)

where element xij corresponds to the expression level of gene gi (see above under "String Representation") in training sample j.

A sample that is to be classified by the classifier is represented as a column vector

e = (e1 e2 ... eR)^T (5)

where element ei is the expression level of gene i in this sample. (^T denotes matrix or vector transposition, whereby column vectors become row vectors and vice versa.)
Discriminant Function
The basis of the discriminant function is Bayes' rule of maximum likelihood: assign the sample to the class with the highest conditional probability. In a domain of Q possible classes, assign the unknown sample to class q where

P(Gq | e) > P(Gr | e) (6)

for all q ≠ r and q, r ∈ {1, 2, ..., Q}. The conditional probability P(Gq | e) is the probability that the unknown sample belongs to class q given that its gene expression data vector equals e.
The computation of the discriminant function for class q is based upon two parameters: the class mean vector μq and the common covariance matrix Σ. For the mtq training samples that belong to class q, the mean expression of gene i is

μqi = (1/mtq) Σ I(ck = q) xik, summing over k = 1, ..., Mt (7)

where I(·) is the indicator function, which equals 1 if the argument inside the parentheses is true and 0 otherwise, and ck denotes the class to which sample k belongs. Putting the μqi for all i together forms the class mean vector:

μq = (μq1 μq2 ... μqR)^T (8)
The class covariance matrix Σq is the covariance matrix of the truncated expression data for all training samples belonging to class q. In essence, its elements describe the "association" between genes among samples belonging to class q:

Σq = [σij], an R × R matrix (9)

where σij is the covariance between gene i and gene j in class q.
By averaging the class covariance matrices Σ1, Σ2, ..., ΣQ, James (1985) gives the "pooled estimate" of all covariance matrices as the common covariance matrix,

Σ = (Σ1 + Σ2 + ... + ΣQ) / Q (10)
Using the common covariance matrix, the assumption of normality, and the equal class probability assumption P(Gq) = 1/Q for all q ∈ {1, 2, ..., Q}, the linear discriminant function in the classifier can then be defined as:

fq(e) = μq^T Σ^-1 e − (1/2) μq^T Σ^-1 μq (11)
Finally, the maximum likelihood (MLHD) classification rule is: assign the unknown sample to class q if

fq(e) > fr(e) (12)

for all q ≠ r and q, r ∈ {1, 2, ..., Q}.
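A compact Python sketch of the classifier defined by equations (7) to (12), assuming numpy and the string/matrix conventions above (names illustrative; the pooled covariance follows the simple averaging of equation (10)):

import numpy as np

def train_mlhd(T, labels):
    # T: R x Mt training matrix; labels: class of each training sample.
    classes = sorted(set(labels))
    cols = {q: [j for j, c in enumerate(labels) if c == q] for q in classes}
    means = {q: T[:, cols[q]].mean(axis=1) for q in classes}             # (7)-(8)
    sigma = sum(np.cov(T[:, cols[q]]) for q in classes) / len(classes)   # (9)-(10)
    return classes, means, np.linalg.inv(sigma)                          # Σ^-1

def classify_mlhd(e, classes, means, sigma_inv):
    # Equation (11) scored per class; rule (12) picks the maximum.
    def f(q):
        mu = means[q]
        return mu @ sigma_inv @ e - 0.5 * mu @ sigma_inv @ mu
    return max(classes, key=f)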
Alternative Discriminant-Based Classification Methods

At least 5 variants of discriminant-based classification methods are known:
1) Fisher linear discriminant analysis
2) Quadratic discriminant rule
3) Linear discriminant rule
4) Diagonal quadratic discriminant analysis

5) Diagonal linear discriminant analysis
Any may be used in the practice of the invention, although 3) is the preferred method, as described above.
1) is similar to the MLHD method we are using (and as described above), except that the data set is first transformed to a data space that best discriminates between samples of different classes.
2) - 5) are all maximum likelihood discriminant rules.
2) is the more general form of the MLHD method. The formula for 2) is:

fq(e) = μq^T Σq^-1 e − (1/2) μq^T Σq^-1 μq

where, instead of the common covariance matrix Σ (as in the preferred MLHD method described above), the covariance matrix for class q, Σq, is used to calculate fq.
3) is the MLHD method used in our work.
In essence, 4) and 5) are simplified versions of 2) and 3) respectively. Both assume that the covariance matrix (the common covariance matrix, for 5)) is a diagonal matrix (i.e. only the diagonal elements are non-zero). Hence, 4) and 5) by necessity ignore the correlation (relationship) between any 2 genes, which we believe to be important in achieving high levels of predictive accuracy.
Evaluating a Predictor Set
Two methods are used to evaluate the performance of a classifier: leave-one-out cross validation and independent test. In the former, one sample from the training set is excluded, and a new MLHD classifier is built using the remaining Mt − 1 training samples. Thus, the new classifier is totally blind to the excluded sample. The classifier is used to classify the left-out sample, and the procedure is then iteratively repeated for all of the Mt training samples. In independent test, the original MLHD classifier is built using all Mt training samples, and is used to classify the Me independent test samples.
If χc samples are misclassified in the cross validation test, then the cross validation error rate is

Ec = (χc / Mt) × 100% (13)

If χI samples are misclassified in the independent test, then the test error rate is

EI = (χI / Me) × 100% (14)
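A Python sketch of the two evaluation procedures and of error rates (13) and (14), reusing the hypothetical train_mlhd/classify_mlhd helpers from the previous sketch:

def error_rates(T, labels, T_test, labels_test):
    # Leave-one-out cross validation over the Mt training samples (eq. 13).
    Mt = T.shape[1]
    wrong_cv = 0
    for j in range(Mt):
        keep = [k for k in range(Mt) if k != j]
        model = train_mlhd(T[:, keep], [labels[k] for k in keep])
        if classify_mlhd(T[:, j], *model) != labels[j]:
            wrong_cv += 1
    Ec = 100.0 * wrong_cv / Mt
    # Independent test on the Me held-out samples (eq. 14).
    model = train_mlhd(T, labels)
    Me = T_test.shape[1]
    wrong_test = sum(classify_mlhd(T_test[:, j], *model) != labels_test[j]
                     for j in range(Me))
    Ei = 100.0 * wrong_test / Me
    return Ec, Ei   # fitness, per equation (1), is 200 - (Ec + Ei)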
This evaluation method differs from that used by Dudoit et al. (2000), who used 150 different learning/test sets. We use only one such set; hence evaluation through cross validation is needed in order to have an unbiased estimate of classifier performance.
Obtaining Predictor Genes Through Rank-Based Methods

For the purpose of comparing the predictor set of genes discovered by the GA gene-selector to sets obtained by various rank-based strategies, we applied several of these rank-based methods to the truncated dataset. The first is the gene selection method employed by Dudoit et al. (2000), in which genes are ranked on the basis of the ratio of between-groups to within-groups sum of squares (BSS/WSS). The ratio for gene i is

BSS(i)/WSS(i) = [Σj Σq I(cj = q)(μqi − μi)²] / [Σj Σq I(cj = q)(xij − μqi)²] (15)

where the sums run over training samples j = 1, ..., Mt and classes q = 1, ..., Q, μi denotes the mean expression of gene i across all training samples, and μqi its mean across the training samples of class q (consistent with equation (7)). Genes with the largest BSS/WSS ratios belong to the top of the rank. We picked the top 20 genes this way.
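A numpy sketch of the BSS/WSS ranking of equation (15) (names illustrative):

import numpy as np

def bss_wss(X, labels):
    # X: genes x training samples; returns the BSS/WSS ratio per gene.
    grand = X.mean(axis=1, keepdims=True)
    bss = np.zeros(X.shape[0])
    wss = np.zeros(X.shape[0])
    for q in set(labels):
        cols = [j for j, c in enumerate(labels) if c == q]
        cmean = X[:, cols].mean(axis=1, keepdims=True)
        bss += len(cols) * ((cmean - grand) ** 2).ravel()
        wss += ((X[:, cols] - cmean) ** 2).sum(axis=1)
    return bss / wss

# e.g. indices of the top 20 genes: np.argsort(bss_wss(X, y))[::-1][:20]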
The second and third methods adapt the binary-class signal-to-noise (S2N) ratio

P(g, c) = (μ1 − μ2) / (σ1 + σ2) (16)

introduced by Golub et al. (1999) and Slonim et al. (2000) for multi-class scenarios, where μ1, σ1 and μ2, σ2 denote the mean and standard deviation of the expression of gene g in each of the two classes of the distinction c.
Using the one-vs.-all others (OVA) approach in a Q-class scenario, the second method forms 2Q sets of top-ranked genes. For each class, one set of positively correlated genes (largest positive S2N ratio) and another set of negatively correlated genes (smallest negative S2N ratio) are formed. One top positively correlated gene and one top negatively correlated gene were selected for each class, bringing the number of chosen genes to 18, since Q = 9 for our dataset.
The third method applies the all-pairs (AP) approach, in which Q(Q − 1) sets of predictor genes are formed. For each pair-wise class combination, one set of positively correlated genes and another set of negatively correlated genes are formed. As in the previous method, one top positively correlated gene and one top negatively correlated gene were selected for each pair-wise class combination. This brought the number of chosen genes to 72.
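A numpy sketch of the S2N ratio of equation (16) in its one-vs.-all form; the all-pairs variant simply applies the same ratio to each pair of classes in turn (names illustrative):

import numpy as np

def s2n_ova(X, labels, q):
    # (mu1 - mu2) / (sigma1 + sigma2) for class q vs. all other classes.
    in_q = np.array([c == q for c in labels])
    mu1, mu2 = X[:, in_q].mean(axis=1), X[:, ~in_q].mean(axis=1)
    s1, s2 = X[:, in_q].std(axis=1), X[:, ~in_q].std(axis=1)
    return (mu1 - mu2) / (s1 + s2)

# Per class: top positively correlated gene np.argmax(s2n_ova(X, y, q)),
# top negatively correlated gene np.argmin(s2n_ova(X, y, q)),
# giving 2Q chosen genes in total under the OVA approach.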
Results
Obtaining the Best Predictor Set for a Particular Set Size Range
The results from the various runs are presented in Table 1 and Figures 2a-e. We found that, in general, stochastic universal sampling is superior to the roulette wheel selection method, regardless of crossover strategy (Table 1). When we compared the crossover methods, we found that uniform crossover produced the best predictor sets in the mid-size ranges [11, 15] and [16, 20], while one-point crossover surpassed it in generating the best predictor sets in the extreme ranges [5, 10], [21, 25] and [26, 30] (Table 1).
Figures 2a-e are surface plots showing the effects of varying the crossover probability pc and mutation probability pm on the maximum fitness values achieved in GA runs using uniform crossover and stochastic universal sampling. Although the small number of available data points prevents a definitive conclusion, the trends of the plots suggest that a relatively high crossover probability (pc > 0.8) and a mutation rate of about 2 × 10^-3 or above would be likely to produce a good predictor set.
Tables 1a-e also show the best predictor sets obtained for each range of predictor set size using different crossover and selection methods. Based on fitness and on both cross validation and test error rates, the results show that, in general for the NCI60 dataset, the ranges [11, 15] and [16, 20] give the highest classification accuracies.
Higher predictive accuracies were achieved using a truncated dataset rather than an untruncated one under otherwise identical GA parameters (data not shown). This suggests that data preprocessing using a simple standard deviation filter may effectively reduce 'noise' in the expression dataset. The untruncated dataset also presents a much larger numerical solution space than the truncated one (roughly 10^29 possible 13-element predictor sets for the truncated dataset versus 10^39 for the untruncated one, i.e. about 10^10 times larger).
To assess the reproducibility of the GA/MLHD algorithm, we determined the frequency at which specific genes were selected into an optimal predictor set across 100 independent runs with different initial starting populations, as compared with a series of 100 randomly selected genesets (data not shown). This analysis revealed that a number of genes were consistently and preferentially chosen by the GA/MLHD algorithm, suggesting that the gene selection operation executed by the algorithm is highly reproducible despite its initial 'seeding' with randomly generated individuals.
Comparing GA-based Predictor Sets to Predictor Sets Obtained from Other Methodologies
The best predictor set obtained using the GA-based selection scheme exhibited a cross-validation error rate of 14.63% and an independent test error rate of 5% (Table 1b, line 1; see Table 2 for specific misclassifications). This is a significant improvement in accuracy over the other methodologies assessed by Dudoit et al. (2000), where the lowest reported independent test error rate was 19%.
One distinguishing point of the GA-based approach is that it avoids reliance upon a rank-based gene selection scheme. To assess the significance of this feature for the actual process of gene selection, we compared the genes selected by the GA-based approach to genes selected by other commonly used predictive approaches. As shown in Table 3a, the rank-based methods fail to identify the majority of the genes selected by the GA-based gene selector, especially when the predictor set size is similar to the number of genes in the best GA predictor set (i.e. 13). The BSS/WSS rank-based method comes closest by selecting 6 of the 13 GA-based predictor genes, but it only achieves this when the top 100 genes are selected, not the top 20. The genes corresponding to the optimal GA-based predictor set are listed in Table 3b.
The differences in the predictor sets chosen by the various gene selection schemes can be understood by comparing the expression levels of the genes in the predictor sets. In the rank-based predictor sets, it is quite apparent that several genes share similar expression patterns across the sample classes (Figures 3b-d, especially Figure 3b). In other words, the expression patterns of these genes are highly correlated with one another. In the GA-based predictor sets, by contrast, the presence of correlated genes is much less apparent (Figure 3a); the sketch below shows one simple way such redundancy can be quantified. The ability to select uncorrelated genes may be one important factor in the improved accuracy of the GA-based approach (see Discussion).
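One simple way to quantify this redundancy, sketched here for illustration only (it is not part of the appended listings), is the largest absolute pairwise correlation between the genes of a predictor set:

% Largest absolute pairwise correlation within a predictor set.
% predictorData: nGenes x nSamples matrix for the selected genes only.
function r = maxPairwiseCorr(predictorData)
C = corrcoef(predictorData');             % genes as variables
C = C - diag(diag(C));                    % zero out self-correlations
r = max(max(abs(C)));

Under this measure, a rank-based set dominated by co-expressed genes scores close to 1, whereas a set of mutually uncorrelated genes scores near 0.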
GA/MLHD Method Permits Significant Feature Reduction While Preserving Accuracy

Recently, Ramaswamy et al. (2001) reported a multi-class cancer study utilizing 218 tumor samples spanning 14 classes of different sites of tumor origin, in which each sample was characterized by the expression levels of 16,063 genes ('GCM'). Using a variety of rank-based classifiers such as weighted voting and k-nearest neighbors (KNN), the authors reported that the best classification approach, yielding an overall 22% error rate (cross-validation and independent test), consisted of using 14 separate OVA/SVM (support vector machine)-based binary classifiers. Significantly, however, the authors found that these optimal predictive accuracies were achieved only when the SVM classifiers were trained on all 16,063 genes, and that reducing the number of features in the classifier genesets compromised predictive accuracy.
Since the requirement for such massive numbers of genes for accurate classification is beyond the capabilities of traditional molecular diagnostics, we wondered if the GA/MLHD classifier, by virtue of its ability to select uncorrelated features as predictor elements, might be capable of selecting smaller optimal predictive genesets without sacrificing predictive accuracy.
We applied the GA/MLHD classifier to the GCM dataset, using our experience with the NCI60 dataset as a guide to select appropriate algorithm parameters (high pc (0.8), low pm (0.001), population size = 30, number of generations = 120, uniform crossover, SUS, predictor set size range [1, 60]). Strikingly, we found that the GA/MLHD classifier selected an optimal classifier geneset of 32 elements producing Ec = 20.67% and EI = 14% (overall error rate = 18%). Thus, in addition to being slightly more accurate than the OVA/SVM classifier (18% error for GA/MLHD versus 22% for OVA/SVM), the GA/MLHD classifier achieves these levels of predictive accuracy without relying on massive numbers of genes (16,063 genes for OVA/SVM versus 32 genes for GA/MLHD, an approximately 500-fold reduction). These data suggest that the GA/MLHD classifier may be especially useful in the arena of molecular diagnostics, in which a major aim is the selection of minimal identifier genesets that nevertheless offer high predictive accuracies.
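These settings correspond directly to the options vector documented in the selectGA listing appended below. The following is a sketch of the equivalent invocation (data loading and file names are omitted; the variables datas, realClass, i_datas, i_realClass, nGenes and direct are assumed to come from the driver script in the appendix):

% Hypothetical selectGA call mirroring the GCM-run parameters (sketch only).
popnsize = 30;  pc = 0.8;  pm = 0.001;  numgens = 120;
lowNP = 1;  np = 60;                        % predictor set size range [1, 60]
options = [1 0 0 1 1 1];                    % summary printout, uniform mutation,
                                            % standard replacement, SUS,
                                            % uniform crossover, fitness scaling
vlb = [lowNP ones(1, np)];                  % element 1 encodes the set size
vub = [np nGenes*ones(1, np)];              % remaining elements index the gene pool
[xpopn, fitness, meanf, maxf, xopt] = selectGA('evaluator', popnsize, pc, pm, ...
    numgens, vlb, vub, options, datas, realClass, i_datas, i_realClass, 1, direct);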
Discussion
We have described a GA-based methodology for multi-class prediction using gene expression data. We defined various parameter ranges which are likely to yield good performance accuracy, and found that optimal predictor sets produced by the GA-based approach can outperform predictor sets produced by a number of previously described methodologies. Two chief advantages of the GA-based approach are the automatic determination of the optimal number of genes in the predictor set and the ability to select uncorrelated elements as predictor set members.
In rank-based gene selection methods, a measure is typically calculated for each gene by a function that estimates the "predictiveness" of the gene. After sorting the genes by their scores, the top R genes are selected to form a predictor subset of R genes. This strategy, coupled with a weighted voting system, was first introduced by Golub et al. (1999), with the S2N ratio used as the score, and appears to work quite well for the prediction of binary-class datasets. However, expanding the weighted voting method to multi-class scenarios, as implemented by Yeang et al. (2001), did not produce the same satisfactory results. Similarly, the LIK score described in Keller et al. (2000) is also used for predictor gene selection, except that the LIK method is more readily extensible to multiple-class datasets. In another multi-class scenario, Dudoit et al. (2000) used the ratio of between-groups to within-groups sum of squares to select predictive genes.
It is worthwhile to compare these methods with the criterion established by Hall and Smith (1998), which states that good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. In other words, a good predictor set should ideally contain genes that are strongly correlated with the class distinction, while each of these genes is as uncorrelated with the others as possible.
In general, predictor genes obtained through rank-based selection methods typically fulfill the first criterion but not the second. This is particularly true for the weighted voting method. In a discriminant-function-based classifier, however, features selected for inclusion in a predictor set must of necessity not correlate with each other: if even a single correlated pair of genes exists in the predictor set, the classification results become unreliable, since the covariance matrix in that case is singular or close to singular, and hence not invertible. (Inversion of the covariance matrix is essential to calculating the discriminant function.) Therefore the predictor sets obtained through our method contain genes with no or very low correlation to each other.
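In the code appended below, this requirement is enforced by a condition-number guard applied before any inversion is attempted; the following minimal sketch isolates that check (thresholds copied from the evaluator listing):

% Reject predictor sets whose pooled covariance is near-singular
% (thresholds as used in the evaluator listing appended below).
function ok = covIsInvertible(commonCov)
ccCondNum = cond(commonCov);              % grows large when genes are correlated
ok = (abs(ccCondNum) < 1e3) & (abs(ccCondNum) > 1e-3);

In the listings, candidate sets failing this check are simply assigned the worst possible error counts, so the GA automatically selects against correlated genes.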
There are a few previous reports describing the use of GAs in the analysis of gene expression data. The approach proposed by Li et al. (2001) combines a GA with k-Nearest Neighbor (KNN) rules in order to select a subset of predictive genes for phenotypic classification; a minimal sketch of the KNN rule follows below. The KNN rule, first formulated by Fix and Hodges (1951), attempts to reduce the sensitivity of a classifier to noise in a dataset by classifying a sample based on the class assignment of the majority of its k nearest neighbors. However, in that report the datasets used, such as the lymphoma dataset (Alizadeh et al. 2000) and the colon tumor dataset (Alon et al. 1999), have binary target class distinctions, and the GA/KNN strategy was not applied to multi-class prediction problems as in this report.
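For reference, a minimal sketch of the KNN rule as just described (Euclidean metric; illustrative names; the no-majority case discussed in the next paragraph is deliberately ignored here):

% k-nearest-neighbour classification of one test column (Euclidean distance).
% train: nGenes x M matrix; labels: 1 x M; test: nGenes x 1 column.
function cls = knnClassify(train, labels, test, k)
M = size(train, 2);
d = sqrt(sum((train - repmat(test, 1, M)).^2, 1));   % distance to each training sample
[junk, order] = sort(d);
cls = mode(labels(order(1:k)));                      % majority vote among k nearest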
We believe, however, that the GA/KNN strategy may not be optimal for multi-class scenarios, for several reasons. First, most of the distance metrics (Euclidean, Pearson, etc.) used to determine the k neighbors of a sample become less sensitive as data dimensionality increases (Keller et al. 2000), and samples may be unclassifiable if no satisfactory majority vote is obtained from the k neighbors. Second, the KNN rule, while giving comparatively lower error rates (Dudoit et al. 2000), imposes relatively large computational memory requirements, since all training data must be maintained in memory simultaneously. Third, Li et al. (2001) did not draw on the powerful tool of crossover in their GA/KNN method.
Moreover, that method used only cross-validation accuracy as the fitness criterion in searching for optimal predictor sets. Finally, in that report the final predictor set used for classification (of test samples) was not one of the optimal or near-optimal sets obtained directly from the GA/KNN method. Instead, individual genes were ranked based on their frequency of selection into 10,000 near-optimal sets, and the n top-ranked genes were then picked as predictor genes. This secondary operation, which essentially constitutes a rank-based gene-selection approach, breaks up the good predictor sets directly selected by the GA/KNN approach into different sets. Thus, the possibility remains that genes picked into a final predictor set in this manner might not have been picked into the same predictor set by the direct action of the GA/KNN method itself. This appears to pose no problem for binary-class datasets, as demonstrated by the accuracy achieved for the lymphoma data, but in multi-class datasets, where gene interactions are believed to be more complex, the same accuracy may not be obtained.
Finally, as reported above, a study on multi-class cancer prediction was recently conducted by Ramaswamy et al. (2001), in which the dataset analyzed consisted of 218 tumor samples spanning 14 classes of different sites of origin. The best approach in the study used 14 separate SVM-based (support vector machine) binary classifiers, each classifier differentiating between one class and all other classes (i.e. a one-vs.-all (OVA) approach). The classifiers were trained on all 16,063 genes and gave an overall 22% cross-validation error rate; a similar error rate was also achieved for the test set. One interesting aspect of this study was that the inclusion of all the genes appeared to be required for optimal accuracy, as the results indicated that feature selection or reduction produced poorer classification results. We believe this is because an SVM classifier, while highly efficient in separating two classes amid noisy data, does not specifically select marker genes that distinguish between different classes while remaining uncorrelated with each other. Curiously, another study using a similar OVA approach has reported significantly better results (Su et al. 2001). At present it is unclear why the same approach yields such drastically different results on conceptually similar experiments.
By contrast, this report suggests that GA/discriminant-based approaches, because they do not rely on rank-based gene selection methodologies, may represent an effective alternative tool for the analysis and exploration of multi-class gene expression datasets.
In conclusion, this report shows that highly accurate classification results can be obtained using a combination of GA-based gene selection and discriminant-based classification methods. The accuracy achieved (95%) is considerably better than that of other published methods employing the same dataset. Another advantage of the GA-based approach is that it automatically determines the optimal predictor set size. We propose that the methodology outlined here, as well as related methods, may represent useful alternatives in the analysis and exploration of complex multi-class gene expression data.

EXAMPLE 2 - INVERSION OF THE COVARIANCE MATRIX
The following is an extract from a Matlab® run using a correlated dataset, a:
>> a = [1 2 3; 4 5 6; 9 0 6]

a =

     1     2     3
     4     5     6
     9     0     6

>> cov(a')

ans =

    1.0000    1.0000   -1.5000
    1.0000    1.0000   -1.5000
   -1.5000   -1.5000   21.0000

>> inv(cov(a'))
Warning: Matrix is singular to working precision.

ans =

   Inf   Inf   Inf
   Inf   Inf   Inf
   Inf   Inf   Inf

>>
Row 1 is 100% (not merely highly) correlated with Row 2 through the relationship (Row 2) = (Row 1) + 3.
The resulting covariance matrix, cov(a'), contains two identical rows/columns (the first two rows and the first two columns). A covariance matrix is symmetric, so its rows always mirror the corresponding columns. This is an extreme example because the two rows (genes) are perfectly correlated; hence Matlab® issues the warning "Matrix is singular to working precision." when it tries to calculate the inverse of the covariance matrix, inv(cov(a')). a is transposed as a' because Matlab® treats columns as variables.
For less extreme cases (high but imperfect correlations), the warning actually sounds more ominous: "Warning: Matrix is close to singular or badly scaled. Results may be inaccurate."
Note: correlation does not arise only through the relationship in this example; any linear equation describes a correlation, for example y = x/3 + 5 or y = 5x - 2, and so on.
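For instance, a row obeying y = 5x - 2 produces the same singularity (a hedged re-run, not part of the original extract):

>> b = [1 2 3; 5*[1 2 3] - 2; 9 0 6];   % row 2 = 5*(row 1) - 2
>> inv(cov(b'))
Warning: Matrix is singular to working precision.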
References

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511.

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96: 6745-6750.

Baker, J.E. 1987. Reducing bias and inefficiency in the selection algorithm. In Proceedings of the Second International Conference on Genetic Algorithms and their Application (ed. Grefenstette, J.J.), pp. 14-21. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., et al. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540.

Dudoit, S., Fridlyand, J. and Speed, T. 2000. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association (in press). (Technical report 576, Department of Statistics, University of California, Berkeley.)

Fix, E. and Hodges, J. 1951. Discriminatory analysis, nonparametric discrimination: consistency properties. Technical report, USAF School of Aviation Medicine, Randolph Field, Texas.

Goldberg, D.E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, New York.

Goldberg, D.E. and Deb, K. 1991. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms (ed. Rawlins, G.), pp. 69-93. Morgan Kaufmann, Berlin.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531-537.

Grefenstette, J.J. and Baker, J.E. 1989. How genetic algorithms work: A critical look at implicit parallelism. In Proceedings of the Third International Conference on Genetic Algorithms (ed. Schaffer, J.D.), pp. 20-27. Morgan Kaufmann, San Mateo, California.

Hall, M.A. and Smith, L.A. 1998. Practical feature subset selection for machine learning. In Proceedings of the Australasian Computer Science Conference (ed. McDonald, C.), pp. 181-191. Springer-Verlag, Singapore.

Haupt, R.L. and Haupt, S.E. 1998. Practical Genetic Algorithms. Wiley, New York.

Holland, J. 1992. Adaptation in Natural and Artificial Systems, 2nd edition. MIT Press, Cambridge, Massachusetts.

James, M. 1985. Classification Algorithms. John Wiley & Sons, New York.

Keller, A.D., Schummer, M., Hood, L. and Ruzzo, W.L. 2000. Bayesian classification of DNA array expression data. Technical Report UW-CSE-2000-08-01, Department of Computer Science & Engineering, University of Washington, Seattle.

Li, L., Weinberg, C.R., Darden, T.A. and Pedersen, L.G. 2001. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17: 1131-1142.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., et al. 2001. Multi-class cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. 98: 15149-15154.

Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., et al. 2000. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24: 227-235.

Slonim, D., Tamayo, P., Mesirov, J., Golub, T. and Lander, E. 2000. Class prediction and discovery using gene expression data. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB-00) (ed. Shamir, R., Miyano, S., Istrail, S., Pevzner, P. and Waterman, M.), pp. 263-272. Universal Academy Press, New York.

Spears, W.M. and De Jong, K.A. 1991. On the virtues of uniform crossover. In Proceedings of the Fourth International Conference on Genetic Algorithms, pp. 230-236. Morgan Kaufmann, La Jolla, California.

Staunton, J.E., Slonim, D.K., Coller, H.A., Tamayo, P., Angelo, M.J., Park, J., Scherf, U., Lee, J.K., Reinhold, W.O., Weinstein, J.N., et al. 2001. Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. 98: 10787-10792.

Su, A.I., Welsh, J.B., Sapinoso, L.M., Kern, S.G., Dimitrov, P., Lapp, H., Schultz, P.G., Powell, S.M., Moskaluk, C.A., Frierson, H.F., Jr., et al. 2001. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research 61: 7388-7393.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E.S., Mesirov, J.P. and Golub, T.R. 2001. Molecular classification of multiple tumor types. Bioinformatics 17: S316-S322.

Zhang, H., Yu, C.Y., Singer, B. and Xiong, M. 2001. Recursive partitioning for tumor classification with gene expression microarray data. Proc. Natl. Acad. Sci. 98: 6730-6735.
Table 1a) Best predictor sets for the range Rmin = 5, Rmax = 10
Table 1d) Best predictor sets for the range Rmin = 21, Rmax = 25
Table 1e) Best predictor sets for the range Rmin = 26, Rmax = 30
Table 2a) Cross-validation results of the best predictor set obtained (Ec = 14.63%). Each value in a gray box is the number of training samples predicted correctly in cross-validation.
Table 2b) Independent test results of the best predictor set obtained (EI = 5%). Each value in a gray box is the number of test samples predicted correctly in the independent test.
Table 3a) Comparison of genes in the best predictor set to the genes found by other methods (+: found; -: not found)
Table 3b) Genes in the best predictor set found through the GA/MLHD method
Source Code (evaluator)
function [fitness, pairErr] = evaluator(oxpopn, datas, realClass, i_datas, i_realClass)
% evaluates the objective function (fitness) values of a decoded
% population
%xpopn = round(oxpopn);  % x(i,:) contains gene positions
xpopn = oxpopn;
for j = 1:size(xpopn,1)
    %disp([sprintf('inn, %g', (xpopn(j,:)))]);
    %disp([xpopn(j,:)]); %size(xpopn)
    %pause;
    % element 1 of each string is the predictor set size; the rest index genes
    picked = [];
    i_picked = [];
    n = 1;
    for m = 2:xpopn(j,1)+1
        picked(n,:) = datas(xpopn(j,m),:);      % training data for selected gene
        i_picked(n,:) = i_datas(xpopn(j,m),:);  % test data for selected gene
        n = n + 1;
    end
    %disp(size(picked)); %disp(xpopn(j,1));
    %no pca!
    % covss = cov(leaveOne'); [V,D] = eig(covss);
    V = eye(size(picked,1));
    %nrow = size(V,2); V = V(:,nrow-5:nrow);
    pcaED = V*picked;
    %no canonical
    commonCov = 0;
    means = [];
    covs = [];
    %form class mean vectors and cov matrices
    noClass = max(realClass);
    for i = 1:noClass
        temp = [];
        for k = 1:size(realClass,2)
            if realClass(k) == i
                temp = [temp pcaED(:,k)];
            end
        end
        %classMat(:,:,i) = temp;
        means(:,i) = mean(temp,2);              % class mean vector
        covs(:,:,i) = cov(temp');               % class covariance matrix
        commonCov = commonCov + covs(:,:,i);
    end
    nSample = size(realClass,2);
    commonCov = commonCov / (nSample - noClass); % pooled (common) covariance
    %check condition of commonCov
    ccCondNum = cond(commonCov);
    %if ok
    if (abs(ccCondNum) < 1e3 & abs(ccCondNum) > 1e-3)
        cverr = cvalTest(picked, realClass);
        iterr = indeTest(means, commonCov, i_picked, i_realClass, V);
    else
        % near-singular covariance: assign worst-case error counts
        cverr = size(realClass,2);
        iterr = size(i_realClass,2);
    end
    pairErr(j,:) = [cverr iterr];
    cvRight = (1 - (cverr/size(realClass,2))) * 100;
    itRight = (1 - (iterr/size(i_realClass,2))) * 100;
    fitness(j) = cvRight + itRight;             % fitness combines CV and test accuracy
end
fitness = fitness';
% end of q3fun
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function cvPair = cvalTest(picked, realClass)
%evaluates one individual by leave-one-out cross validation
nSample = size(realClass,2);
nGene = size(picked,1);
nErr = 0;
nUnc = 0;
for j = 1:nSample
    % leave sample j out of the training set
    if (j ~= 1 & j ~= nSample)
        leaveOne = [picked(:,1:j-1) picked(:,j+1:nSample)];
        ll_realClass = [realClass(1:j-1) realClass(j+1:nSample)];
    end
    if (j == 1)
        leaveOne = [picked(:,2:nSample)];
        ll_realClass = [realClass(2:nSample)];
    end
    if (j == nSample)
        leaveOne = [picked(:,1:nSample-1)];
        ll_realClass = [realClass(1:nSample-1)];
    end
    o_testOne = picked(:,j);
    %no pca!
    % covss = cov(leaveOne'); [V,D] = eig(covss);
    V = eye(size(picked,1));
    %nrow = size(V,2); V = V(:,nrow-5:nrow);
    pcaED = V'*leaveOne;
    testOne = V'*o_testOne;
    %no canonical
    commonCov = 0;
    %form class mean vectors and cov matrices
    noClass = max(realClass);
    for i = 1:noClass
        temp = [];
        for k = 1:size(ll_realClass,2)
            if ll_realClass(k) == i
                temp = [temp pcaED(:,k)];
            end
        end
        means(:,i) = mean(temp,2);
        covs(:,:,i) = cov(temp');
        commonCov = commonCov + covs(:,:,i);
    end
    commonCov = commonCov / (nSample - noClass);
    %check condition of commonCov
    ccCondNum = cond(commonCov);
    %if ok
    if (abs(ccCondNum) < 1e3 & abs(ccCondNum) > 1e-3)
        %invCC = inv(commonCov); %outside class loop?
        for i = 1:noClass
            deviate = testOne - means(:,i);
            si = deviate' / commonCov * deviate;   % discriminant score for class i
            score(i) = si;
        end
        [smallest, smalnd] = min(score);           % assign to closest class
        if (smalnd ~= realClass(1,j))
            nErr = nErr + 1;
        end
    else
        % near-singular covariance: worst-case error count
        nErr = size(realClass,2);
        break;
    end
end
cvPair = nErr;
% end of cvPair
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function itPair = indeTest(means, commonCov, i_picked, i_realClass, VVV)
%evaluates one individual by independent test
nSample = size(i_realClass,2);
nGene = size(i_picked,1);
nErr = 0;
nUnc = 0;
%invCC = inv(commonCov); %outside sample loop?
for j = 1:nSample
    o_testOne = i_picked(:,j);
    %pca
    testOne = VVV'*o_testOne;
    %no canonical
    % class mean vectors and common covariance come from the training data
    noClass = max(i_realClass);
    for i = 1:noClass
        deviate = testOne - means(:,i);
        si = deviate' / commonCov * deviate;
        score(i) = si;
    end
    [smallest, smalnd] = min(score);
    if (smalnd ~= i_realClass(1,j))
        nErr = nErr + 1;
    end
end
itPair = nErr;
% end of indeTest
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2 - Source Code (evaluator_modf)

function [fitness, pairErr] = evaluator_modf(oxpopn, datas, realClass, i_datas, i_realClass)
% evaluates the objective function (fitness) values of a decoded
% population
%xpopn = round(oxpopn);  % x(i,:) contains gene positions
xpopn = oxpopn;
for j = 1:size(xpopn,1)
    %disp([sprintf('inn, %g', (xpopn(j,:)))]);
    %disp([xpopn(j,:)]); %size(xpopn)
    %pause;
    picked = [];
    i_picked = [];
    n = 1;
    for m = 2:xpopn(j,1)+1
        picked(n,:) = datas(xpopn(j,m),:);
        i_picked(n,:) = i_datas(xpopn(j,m),:);
        n = n + 1;
    end
    %disp(size(picked)); %disp(xpopn(j,1));
    %no pca!
    % covss = cov(leaveOne'); [V,D] = eig(covss);
    V = eye(size(picked,1));
    %nrow = size(V,2); V = V(:,nrow-5:nrow);
    pcaED = V*picked;
    %no canonical
    commonCov = 0;
    means = [];
    covs = [];
    %form class mean vectors and cov matrices
    noClass = max(realClass);
    for i = 1:noClass
        temp = [];
        for k = 1:size(realClass,2)
            if realClass(k) == i
                temp = [temp pcaED(:,k)];
            end
        end
        %classMat(:,:,i) = temp;
        means(:,i) = mean(temp,2);
        covs(:,:,i) = cov(temp');
        commonCov = commonCov + covs(:,:,i);
    end
    nSample = size(realClass,2);
    commonCov = commonCov / (nSample - noClass);
    %check condition of commonCov
    ccCondNum = cond(commonCov);
    %if ok
    if (abs(ccCondNum) < 1e3 & abs(ccCondNum) > 1e-3)
        cverr = cvalTest(picked, realClass);
        iterr = indeTest(means, commonCov, i_picked, i_realClass, V);
    else
        cverr = size(realClass,2);
        iterr = size(i_realClass,2);
    end
    pairErr(j,:) = [cverr iterr];
    cvRight = (1 - (cverr/size(realClass,2))) * 100;
    itRight = (1 - (iterr/size(i_realClass,2))) * 100;
    fitness(j) = cvRight; % + itRight;  (modf variant: fitness from cross validation only)
end
fitness = fitness';
% end of q3fun
function cvPair = cvalTest(picked, realClass)
%evaluates one individual by leave-one-out cross validation
nSample = size(realClass,2);
nGene = size(picked,1);
nErr = 0;
nUnc = 0;
for j = 1:nSample
    if (j ~= 1 & j ~= nSample)
        leaveOne = [picked(:,1:j-1) picked(:,j+1:nSample)];
        ll_realClass = [realClass(1:j-1) realClass(j+1:nSample)];
    end
    if (j == 1)
        leaveOne = [picked(:,2:nSample)];
        ll_realClass = [realClass(2:nSample)];
    end
    if (j == nSample)
        leaveOne = [picked(:,1:nSample-1)];
        ll_realClass = [realClass(1:nSample-1)];
    end
    o_testOne = picked(:,j);
    %no pca!
    % covss = cov(leaveOne'); [V,D] = eig(covss);
    V = eye(size(picked,1));
    %nrow = size(V,2); V = V(:,nrow-5:nrow);
    pcaED = V'*leaveOne;
    testOne = V'*o_testOne;
    %no canonical
    commonCov = 0;
    %form class mean vectors and cov matrices
    noClass = max(realClass);
    for i = 1:noClass
        temp = [];
        for k = 1:size(ll_realClass,2)
            if ll_realClass(k) == i
                temp = [temp pcaED(:,k)];
            end
        end
        means(:,i) = mean(temp,2);
        covs(:,:,i) = cov(temp');
        commonCov = commonCov + covs(:,:,i);
    end
    commonCov = commonCov / (nSample - noClass);
    %check condition of commonCov
    ccCondNum = cond(commonCov);
    %if ok
    if (abs(ccCondNum) < 1e3 & abs(ccCondNum) > 1e-3)
        %invCC = inv(commonCov); %outside class loop?
        for i = 1:noClass
            deviate = testOne - means(:,i);
            si = deviate' / commonCov * deviate;
            score(i) = si;
        end
        [smallest, smalnd] = min(score);
        if (smalnd ~= realClass(1,j))
            nErr = nErr + 1;
        end
    else
        nErr = size(realClass,2);
        break;
    end
end
cvPair = nErr;
% end of cvPair
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function itPair = indeTest(means, commonCov, i_picked, i_realClass, VVV)
%evaluates one individual by independent test
nSample = size(i_realClass,2);
nGene = size(i_picked,1);
nErr = 0;
nUnc = 0;
%invCC = inv(commonCov); %outside sample loop?
for j = 1:nSample
    o_testOne = i_picked(:,j);
    %pca
    testOne = VVV'*o_testOne;
    %no canonical
    noClass = max(i_realClass);
    for i = 1:noClass
        deviate = testOne - means(:,i);
        si = deviate' / commonCov * deviate;
        score(i) = si;
    end
    [smallest, smalnd] = min(score);
    if (smalnd ~= i_realClass(1,j))
        nErr = nErr + 1;
    end
end
itPair = nErr;
% end of cvPair
%read file to obtain genes
datas = dlmread(filename, '\t');
realClass = datas(1,:);              %first row indicates class
nGene = size(datas,1);
datas = datas(2:nGene,:);
i_datas = dlmread(i_filename, '\t');
i_realClass = i_datas(1,:);
i_nGene = size(i_datas,1);
i_datas = i_datas(2:i_nGene,:);
disp(size(datas));
if (useIT == 1)
    fun = 'evaluator';               %fitness from CV and independent test
else
    fun = 'evaluator_modf';          %fitness from CV only
end
direct = strcat(host, 'results_', num2str(lowNP), '_', num2str(np), '_');
disp(sprintf('low = %g, high = %g\n', lowNP, np));
warning off;
options(3) = 0;  %standard replacement
options(2) = 0;  %uniform mutation
options(6) = 1;  %linear fitness scaling
options(1) = 1;  %print summary statistics
vlb = []; vub = [];
for i = 1:1
    vlb = [vlb lowNP];
    vub = [vub np];
end
for i = 2:np+1
    vlb = [vlb 1];
    vub = [vub size(datas,1)];
end
rand('state', seedNum);  %reset
setNum = 1;  %just an additional results-file label
[xpopn, fitness, meanf, maxf, xopt] = selectGA(fun, popnsize, pc, pm, numgens, ...
    vlb, vub, options, datas, realClass, i_datas, i_realClass, setNum, direct);
disp(sprintf('************************ End program ************************\n'));
% end of chose
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Source Code (selectGA)

function [xpopn, fitness, meanf, maxf, xopt, iterropt] = selectGA(fun, popnsize, pc, pm, numgens, ...
    vlb, vub, options, datas, realClass, i_datas, i_realClass, setNum, direct)
% selectGA maximises a function of several variables subject to bounds:
%   vlb <= x <= vub
% The objective function is defined in fun, which is supplied by the user:
%   function fitness = fun(popn), where popn is a population of x values
% options contains the following:
%   options(1) = 0   no printout
%              = 1   prints out summary statistics only
%              = 2   prints out summary statistics and initial and final populations
%   options(2) = 0   uniform mutation
%              = b   dynamic mutation, b a parameter determining degree of non-uniformity
%   options(3) = 0   uses standard replacement for the next generation
%              = N   combine offspring with best N parents to form next generation
%   options(4) = 0   roulette wheel selection
%              = 1   stochastic universal sampling
%   options(5) = 0   one-point crossover
%              = 1   uniform crossover
%   options(6) = 0   no fitness scaling
%              = 1   linear fitness scaling (scaling parameter c = 2)
meanfhistory = [];
maxfhistory = [];
xopthistory = [];
pairErrHist = [];
%form class mean vectors and cov matrices
if 0
    noClass = max(realClass);
    for i = 1:noClass
        temp = [];
        for j = 1:size(datas,2)
            if realClass(j) == i
                temp = [temp datas(:,j)];
            end
        end
        %size(temp)
        classMat(:,:,i) = 1; %temp;
        means(:,i) = mean(temp,2);
        covs(:,:,i) = cov(temp');
    end
end
% Generate the initial population
gen = 0;
popn = Initialise(popnsize, vlb, vub);
% Obtain fitness statistics for the initial population and save the history
% parameters for use in Report
xpopn = Decode(popn, vlb, vub, options);
[fitness, pairErr] = feval(fun, xpopn, datas, realClass, i_datas, i_realClass);
meanf = sum(fitness)/size(popn,1);
[maxf, imax] = max(fitness);
xopt = xpopn(imax,:);
pairErrBest = pairErr(imax,:);
xopthistory = [xopthistory; xopt];
maxfhistory = [maxfhistory; maxf];
meanfhistory = [meanfhistory; meanf];
pairErrHist = [pairErrHist; pairErrBest];
iterropt = pairErrHist;
% Call Report for the printout of the initial population and the initial
% population statistics
Report(gen, numgens, popn, xpopn, fitness, meanfhistory, maxfhistory, ...
    xopthistory, pairErrHist, options, setNum, pc, pm, direct);
% Main generation loop
for gen = 1:numgens
    % Reproduce
    matingpairs = Reproduce(popn, fitness, options);
    % Crossover
    offspring = Crossover(matingpairs, pc, options);
    % Mutate
    newpopn = Mutation(offspring, pm, options, gen, numgens, vlb, vub);
    % Obtain fitness of the current population
    xpopn = round(newpopn); %Decode(newpopn, vlb, vub, options);
    [newfitness, pairErr] = feval(fun, xpopn, datas, realClass, i_datas, i_realClass);
    %%%%%xpopn = popn;
    % Calculate population statistics to be saved in the history parameters
    meanf = sum(newfitness)/size(popn,1);
    [maxf, imax] = max(newfitness);
    xopt = xpopn(imax,:); %(imax(1),:);
    pairErrBest = pairErr(imax,:);
    xopthistory = [xopthistory; xopt];
    maxfhistory = [maxfhistory; maxf];
    meanfhistory = [meanfhistory; meanf];
    pairErrHist = [pairErrHist; pairErrBest];
    iterropt = pairErrHist;
    % Selection procedure for the next generation
    if options(3) == 0
        % Replace parents by offspring
        popn = newpopn;
        fitness = newfitness;
    else
        % Sort parent population and combine offspring with best options(3) parents
        [parsort, sorti] = sort(-fitness);
        combpopn = [popn(sorti(1:options(3)),:); newpopn];
        combfit = [fitness(sorti(1:options(3))); newfitness];
        % Sort combined population and choose best of parents and offspring
        [fitsort, fiti] = sort(-combfit);
        popn = combpopn(fiti(1:popnsize),:);
        fitness = combfit(fiti(1:popnsize));
        xpopn = Decode(popn, vlb, vub, options);
    end
    % Call Report for the printout of the final population and a summary of
    % the population statistics over all generations
    Report(gen, numgens, popn, xpopn, fitness, meanfhistory, maxfhistory, ...
        xopthistory, pairErrHist, options, setNum, pc, pm, direct);
end;
xopt = xopthistory;
maxf = maxfhistory;
meanf = meanfhistory;
% End of selectGA
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function popn = Initialise(popnsize, vlb, vub)
% Initialise generates the initial population, uniformly random within bounds
xint = rand(popnsize, size(vlb,2));
factor = (vub - vlb);
popn = round(ones(popnsize,1)*vlb + xint*diag(factor));
% End of Initialise
function Report(gen, numgens, popn, xpopn, fitness, meanfhistory, maxfhistory, ...
    xopthistory, pairErrHist, options, setNum, pc, pm, direct)
% Report writes the fitness history (and, at the final generation, the full
% population) to the results file. The opening lines of this function, which
% build the "history" vector and open the file handle "fid", are not
% recoverable from the source.
    %history = [gen meanfhistory(gen+1,1) maxfhistory(gen+1,1)];
    %disp([sprintf('%5.0f ', history')]);
    disp(sprintf('.'));
    fprintf(fid, '%g\t', history');
    fprintf(fid, '\n');
end
if ((options(1) > 1) & (gen == numgens))
    disp(' '); disp(' '); disp(' ');
    disp([sprintf('Generation %5.0f', gen)]);
    disp(' ');
    disp(['      fitness      x      ']);
    % Output the current population details
    for i = 1:size(popn,1)
        %disp([sprintf('%g', fitness(i), xpopn(i,:))]);
        contents = [fitness(i) xpopn(i,:)];
        fprintf(fid, '%g\t', contents');
        fprintf(fid, '\n');
    end
end
fclose(fid);
% End of Report
function writeFile(gen, numgens, popn, xpopn, fitness, ...
    meanfhistory, maxfhistory, xopthistory, options, pc, pm)
% writeFile outputs the population (strings and reals) and the fitness values
% for the current generation, together with a history of mean and maximum
% fitness values
if ((options(1) > 1) & (gen == 0))
    disp(' '); disp(' '); disp(' ');
    disp([sprintf('Generation %5.0f', gen)]);
    disp(' ');
    disp(['      fitness      x      ']);
    % Output the current population details
    for i = 1:size(popn,1)
        disp([sprintf('%12.4g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g %3.0g', ...
            fitness(i), xpopn(i,:))]);
    end
end
if options(1) > 0
    if gen == 0
        disp(' ');
        disp('Fitness History Statistics');
        disp('Generation   Mean Fitness   Max Fitness   Optimal x');
    end
    history = [gen meanfhistory(gen+1,1) maxfhistory(gen+1,1) round(xopthistory(gen+1,:))];
    %history = [gen meanfhistory(gen+1,1) maxfhistory(gen+1,1)];
    disp([sprintf('%g\t', history')]);
end
if ((options(1) > 1) & (gen == numgens))
    disp(' '); disp(' '); disp(' ');
    disp([sprintf('Generation %5.0f', gen)]);
    disp(' ');
    disp(['      fitness      x      ']);
    % Output the current population details
    for i = 1:size(popn,1)
        disp([sprintf('%12.6g %12.6g %10.6g', fitness(i), xpopn(i,:))]);
    end
end
% End of writeFile
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [matingpairs, select] = Reproduce(popn, fitness, options)
% Reproduce produces the mating pairs for crossover. select contains the
% number of copies of each string in popn added to the mating pool
if options(6) > 0
    % First call for linear fitness scaling
    fitness = Scalefitness(fitness, 2);
end
if options(4) == 0
    % Roulette wheel selection
    randnums = rand(size(fitness));
else
    % Stochastic universal sampling: evenly spaced pointers, one random offset
    rr = rand;
    spacing = 1/length(fitness);
    randnums = sort(mod(rr:spacing:1+rr-0.5*spacing, 1));
end
%disp(sprintf('fit1 = %f, sum = %f', fitness(1), sum(fitness)));
normfit = fitness/sum(fitness);
partsum = 0;
count(1) = 0;
matepool = [];
for i = 1:length(fitness)
    partsum = partsum + normfit(i);
    count(i+1) = length(find(randnums < partsum));
    select(i,1) = count(i+1) - count(i);
    matepool = [matepool; ones(select(i,1),1)*popn(i,:)];
end;
% Now re-order the strings for mating so that the string in row 1
% is to be mated with the string in row 2, etc.
[junk, mating] = sort(rand(size(matepool,1),1));
matingpairs = matepool(mating,:);
% End of Reproduce
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function offspring = Crossover(popn, pc, options)
% Crossover creates offspring from a population (ordered mating pool)
% using crossover with probability pc.
if options(5) == 0
    % One-point crossover
    lbits = size(popn,2);
    sites = ceil(rand(size(popn,1)/2,1)*(lbits-1));
    sites = sites.*(rand(size(sites)) < pc);   % site 0 leaves the pair intact
    for j = 1:length(sites);
        offspring(2*j-1,:) = [popn(2*j-1,1:sites(j)) popn(2*j,sites(j)+1:lbits)];
        offspring(2*j,:)   = [popn(2*j,1:sites(j)) popn(2*j-1,sites(j)+1:lbits)];
    end
elseif options(5) == 1
    % Uniform crossover: positions exchanged according to a random template
    for i = 1:size(popn,1)/2
        if rand < pc
            template = rand(1,size(popn,2)) < 0.5;
            offspring(2*i-1,:) = template.*popn(2*i-1,:) + (1-template).*popn(2*i,:);
            offspring(2*i,:)   = template.*popn(2*i,:) + (1-template).*popn(2*i-1,:);
        else
            offspring(2*i-1,:) = popn(2*i-1,:);
            offspring(2*i,:)   = popn(2*i,:);
        end
    end
else
    % Convex crossover
    ll = options(5);
    if (ll < 0 | ll > 1)
        ll = 0.5;  % return to default (average crossover) if erroneous values passed
    end
    for j = 1:size(popn,1)/2
        if rand < pc
            offspring(2*j-1,:) = ll.*popn(2*j-1,:) + (1-ll).*popn(2*j,:);
            offspring(2*j,:)   = ll.*popn(2*j,:) + (1-ll).*popn(2*j-1,:);
        else
            offspring(2*j-1,:) = popn(2*j-1,:);
            offspring(2*j,:)   = popn(2*j,:);
        end
    end
end
offspring = round(offspring);
% End of Crossover
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function newpopn = Mutation(offspring, pm, options, curGen, numgens, vlb, vub)
% Mutation changes a gene of the offspring with probability pm.
mutate = find(rand(size(offspring)) < pm);
popnRow = size(offspring,1);
% mutate contains the positions of the genes to be mutated as a column vector
% going down the columns of the matrix offspring
newpopn = offspring;
b = options(2);
% dynamic mutation: perturbation towards the upper or lower bound,
% shrinking as curGen approaches numgens
r = rand(size(offspring));
choice = (rand(size(offspring)) < 0.5);
newpopn(mutate) = (choice(mutate)).*(offspring(mutate) ...
    + r(mutate).*((vub(ceil(mutate/popnRow)))' - offspring(mutate)) ...
    .*(1 - curGen/numgens).^b) ...
    + (1 - choice(mutate)).*(offspring(mutate) ...
    - r(mutate).*(offspring(mutate) - (vlb(ceil(mutate/popnRow)))') ...
    .*(1 - curGen/numgens).^b);
newpopn = round(newpopn);
% End of Mutation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function xpopn = Decode(popn, vlb, vub, options)
% Decode converts a population (popn) of strings from binary to real.
% Each string in popn is of length sum(lbits) and consists of m = length(lbits)
% substrings which decode to the variables x_1, ..., x_m
% If options(2) = 1, first decodes Gray code into binary
% assign base 2 if binary or gray coding, base 10 if real coding
% First decode each substring to an unsigned decimal integer: xint
%index1 = 1;
%index2 = 0;
%newpopn = popn;
%for i = 1:length(lbits)
%    index2 = index2 + lbits(i);
%    tenpowers = 10.^(lbits(i)-1:-1:0);
%    xint(:,i) = newpopn(:,index1:index2)*tenpowers';
%    index1 = index1 + lbits(i);
%end
% Now calculate the x values
%factor = (vub-vlb)./(10.^lbits - 1);
%xpopn = ones(size(popn,1),1)*vlb + xint*diag(factor);
xpopn = round(popn);   % strings are already real-coded; just round to integers
% End of Decode
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function scfitness = Scalefitness(fitness, fmultiple)
% Scalefitness performs linear scaling on the fitness values and returns the
% results in scfitness
% Calculate the parameters a and b
favg = sum(fitness)/length(fitness);
[fmax, i] = max(fitness);
[fmin, j] = min(fitness);
den1 = fmax - favg;
den2 = favg - fmin;
error = 1e-6;
if abs(den1) > error
    a = favg*(fmultiple-1)/den1;
    b = favg*(1-a);
elseif abs(den2) > error
    a = favg/den2;
    b = favg*(1-a);
else
    a = 1;
    b = 0;
end
% Guard: keep the scaled minimum fitness non-negative
if a*fmin + b < 0
    a = favg/(favg - fmin);
    b = favg*(1-a);
end
% The scaled fitness
scfitness = a*fitness + b*ones(size(fitness));
% End of Scalefitness
%%%End of file


CLAIMS:
1. In the identification, from a pool of parameters whose values are known for a plurality of samples ("training samples") which may be classified into multiple classes, of a subset of parameters whose values are likely to be predictive of the class of a sample of unknown class, the use of a parameter selection method which: (1) permits the selection of parameters that interact within any particular class; and/or (2) selects parameters whose values are substantially uncorrelated with each other across the classes; and/or (3) excludes from the subset parameters whose values are highly correlated across the classes with another parameter of the subset.
2. The use of claim 1, wherein the parameter selection method employs a discriminant-based classifier.
3. The use of a discriminant-based classifier in the identification, from a pool of parameters whose values are known for a plurality of samples ("training samples") which may be classified into multiple classes, of a subset of parameters whose values are likely to be predictive of the class of a sample of unknown class.
4. The use of claim 2 or claim 3, wherein the discriminant- based classifier employs matrix inversion to avoid selecting parameters whose values are correlated across classes.
5. The use of claim 4, wherein the matrix is a class or common covariance matrix.
6. The use of any preceding claim, wherein the parameter selection method further employs a genetic algorithm.
7. A method for identifying, from a pool of parameters whose values are known for a plurality of samples ("training samples") which are classified into multiple classes, a subset of parameters whose values in a sample of unknown class are likely to be predictive of the class of that sample, the method comprising:
(a) providing a first generation population of individuals, each individual representing a candidate subset of parameters;
(b) for each individual, evaluating its fitness to predict the class of a sample of unknown class by determining its ability to predict the class of a plurality of test samples of known class;
(c) generating a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b) ;
(d) repeating steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached; and
(e) selecting the fittest individual, as evaluated in step (b) or step (d) , from the final generation or from among the fittest individuals of a plurality of generations, wherein the class prediction in step (b) uses a discriminant-based classifier.
8. The method of claim 7, wherein the classifier performs with higher accuracy on individuals which represent parameters whose values are correlated to particular class distinctions and are uncorrelated to each other across the classes.
9. The method of claim 7 or claim 8, wherein the classifier is a maximum likelihood classifier (MLHD) .
10. The method of claim 9, wherein the classifier employs a linear discriminant function.
11. The method of any one of claims 7 to 10, wherein the class prediction in step (b) involves the production, for each individual, of a discriminant function which assigns a sample to a class, based on the values in that sample of the parameters represented by that individual.
12. The method of claim 11, wherein the discriminant function is produced on the basis of the values, in the training samples, of the parameters represented by the individual.
13. The method of any one of claims 9 to 12, wherein the MLHD classifier operates via the following steps:
(1) for each class, defining a typical value profile of a member of that class, based on the values in the training samples of that class for the parameters represented by the individual under consideration;
(2) for each class, producing a measure of the spread or variation in values between members of that class, based on the values in the training samples of that class for the parameters represented by the individual under consideration;
(3) optionally averaging the class measures of spread to produce a common measure of spread for each individual; and (4) for each test sample, comparing its value profile with the typical value profile for each class, taking into account the spread as determined in step (2) for that class or as averaged in step (3) across all classes, and assigning the test sample to the class to which its value profile matches most closely.
14. The method of claim 13, wherein step (1) comprises taking, for each parameter represented by the individual under consideration, the mean of the values for that parameter in the training samples of the class under consideration, thereby to produce a class mean vector.
15. The method of claim 13 or claim 14, wherein the measure of spread is a measure of covariance.
16. The method of claim 15, wherein step (2) comprises generating a matrix of the covariance in values between all the parameters represented by the individual under consideration, among all the training samples of that class (a "class covariance matrix") .
17. The method of claim 16, wherein in step (3) the class covariance is averaged to produce a common covariance matrix.
18. The method of claim 15 or claim 16, wherein step (4) comprises inversion of the class or common covariance matrix.
19. The method of any one of claims 7 to 18, wherein the fitness of each individual is evaluated by determining the error rate of the discriminant function for that individual in classifying the test samples.
20. The method of claim 19, wherein the cross-validation error rate and/or the independent test error rate is determined.
21. The method of claim 20, wherein the independent test error rate is determined, and wherein the discriminant-based classifier is produced on the basis of the training samples and then tested against the test samples, which do not overlap with the training samples.
22. The method of claim 21, wherein the ratio of training samples to test samples is in the range 1.5:1 to 3:1.
23. The method of claim 20, wherein the cross-validation error rate is determined, and wherein the same set of samples is used as both training samples and test samples, by producing a discriminant-based classifier from one subset of the samples (the "training subset"), testing it against the remaining sample(s), and then repeating the procedure with different training subsets.
24. The method of claim 23, wherein a classifier is produced using a training subset containing all but one of the samples, and tested against the remaining sample, and this procedure is repeated for all possible such training subsets, so that cross-validation occurs against every sample in the training set.
25. The method of any one of claims 20 to 24, wherein the error rate is determined by a combination of cross-validation and independent testing.
26. The method of claim 25, wherein fitness as determined by independent test is accorded higher significance than fitness as determined by cross validation.
27. The method of any one of claims 7 to 26, wherein step (c) uses a genetic algorithm.
28. The method of any one of claims 7 to 27, wherein the selection in step (c) includes an intermediate step of generating a pool of individuals from the first generation population, such that the number of copies, or absence, of a first generation individual in the pool depends on its fitness as determined in step (b) , the pool providing the selected first generation individuals from which the individuals of the second generation are generated.
29. The method of any one of claims 7 to 28, wherein the selection is probabilistic.
30. The method of claim 29, wherein the selection is by stochastic universal sampling.
31. The method of any one of claims 7 to 30, wherein the generating of second generation individuals from the selected individuals of the first generation includes crossing over between and/or mutation of the selected individuals of the first generation.
32. The method of claim 31, wherein both crossing over and mutation occur, crossing over precedes mutation, and the mutation step is carried out on the population of individuals resulting from the crossing over step.
33. The method of claim 31 or claim 32, wherein all selected first generation individuals are subject in pairs to the possibility of a crossing over event.
34. The method of any one of claims 31 to 33, wherein the crossing over event occurs at a defined probability level, pc, wherein 0.7 ≤ pc < 1.
35. The method of claim 33 or claim 34, wherein the matching together of pairs of individuals for crossing over is random.
36. The method of any one of claims 31 to 35, wherein all selected first generation individuals, or all individuals resulting from the crossing over step, are subject to the possibility of a mutation event.
37. The method of claim 36, wherein a probability function is applied separately to each representation of a parameter in each individual to determine whether that parameter is mutated, mutation being replacement by another parameter.
38. The method of claim 37, wherein mutation occurs at a defined probability level, pm, wherein 0.0005 < pm < 0.01.
39. The method of any one of claims 31 to 38, wherein mutation is uniform, such that the parameter replacing a mutated parameter is selected with equal probability from the pool of parameters.
40. The method of any one of claims 7 to 39, wherein different individuals represent different numbers of parameters, within defined limits.
41. The method of claim 40, wherein the method is repeated, using different defined limits in different runs.
42. The method of any one of claims 7 to 41, wherein each individual comprises a sequence (or "string") of parameter indicators, usually numbers, wherein different parameter indicators represent different parameters.
43. The method of claim 42, wherein each individual comprises a number of parameter indicators equal to a maximum limit on the number of parameters which may be represented by an individual, and further comprises a number indicator which specifies how many parameters from the string of parameter indicators that individual represents, such that if the number indicator is less than the number of parameter indicators, one or more parameter indicators are redundant in that individual.
44. The method of any one of claims 7 to 43, further comprising the step of classifying a sample of unknown class using the fittest individual selected in step (e) .
45. The method or use of any preceding claim, wherein the number of classes into which the samples may be classified is at least 4.
46. The method or use of claim 45, wherein the number is at least 6.
47. The method or use of any preceding claim, wherein the samples are biological samples.
48. The method or use of claim 47, wherein the parameters are genes and the values are expression levels.
49. The method or use of claim 47 or 48, wherein the classes represent different sources of the biological samples.
50. The method or use of claim 49, wherein the classes represent different tissue origins of tumour samples.
51. A computer program, which when run on a computer or computer network will carry out the method of any one of claims 7 to 50.
52. The computer program of claim 51, which is recorded on a data carrier.
53. A computer program according to claim 51 or claim 52 for identifying, from a pool of parameters whose values are known for a plurality of samples ("training samples") which are classified into multiple classes, a subset of parameters whose values in a sample of unknown class are likely to be predictive of the class of that sample, the program including code which is effective, when the program is run, to:
(a) provide a first generation population of individuals, each individual representing a candidate subset of parameters;
(b) for each individual, evaluate its fitness to predict the class of a sample of unknown class by determining its ability to predict the class of a plurality of test samples of known class;
(c) generate a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b);
(d) repeat steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached;
(e) select the fittest individual from the final generation or from among the fittest individuals of a plurality of generations; and
(f) display and/or output said fittest individual.
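Steps (a) to (e) describe a standard generational loop. A skeletal Python sketch follows; make_individual, evaluate_fitness and breed stand for routines the program would supply (for example, the crossover and mutation operators of claims 31 to 39), and a fixed generation count stands in for the unspecified termination condition:

```python
def run_ga(make_individual, evaluate_fitness, breed,
           population_size=100, max_generations=100):
    """Skeleton of steps (a)-(e); the three helper callables are assumptions."""
    # (a) provide a first generation population of individuals
    population = [make_individual() for _ in range(population_size)]
    best_score, best_individual = float("-inf"), None
    for _ in range(max_generations):  # (d) fixed count as the termination condition
        # (b) evaluate the fitness of every individual
        scored = sorted(((evaluate_fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        # (e) keep the fittest individual seen in any generation
        if scored[0][0] > best_score:
            best_score, best_individual = scored[0]
        # (c) breed the next generation from fitness-selected parents
        parents = [ind for _, ind in scored[:population_size // 2]]
        population = breed(parents, population_size)
    return best_individual
```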
54. The computer program of claim 53, which includes code effective to receive input of stored data representing values for the parameters represented by an individual, for evaluating its fitness in step (b) or step (d).
55. A computer memory product having stored thereon at least one digital file, said memory product comprising computer readable memory and said stored digital file or files constituting a program as defined in any one of claims 51 to 54.
56. Apparatus for carrying out the method of any one of claims 7 to 50, the apparatus including an input component for receiving data from a stored data set containing values for a pool of parameters in a plurality of samples of known class, a processor, and a memory having stored thereon a program as defined in any one of claims 51 to 54.
57. A system for identifying, from a pool of parameters whose values are known for a plurality of samples ("training samples") which are classified into multiple classes, a subset of parameters whose values in a sample of unknown class are likely to be predictive of the class of that sample, the system comprising a processing component and an output component, wherein the processing component is programmed to:
(a) provide a first generation population of individuals, each individual representing a candidate subset of parameters;
(b) for each individual, evaluate its fitness to predict the class of a sample of unknown class by determining its ability to predict the class of a plurality of test samples of known class;
(c) generate a second generation population of individuals from selected individuals in the first generation population, wherein said selected individuals are selected according to their fitness as evaluated in step (b);
(d) direct the repetition of steps (b) and (c) to evaluate the fitness of the individuals of the second generation population and to generate and evaluate further generations until a termination condition is reached;
(e) select the fittest individual, as evaluated in step (b) or step (d), from the final generation or from among the fittest individuals of a plurality of generations; and
(f) direct the output component to display and/or output a representation of said fittest individual, and wherein the class prediction in step (b) is carried out by a discriminant-based classifier.
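Claim 57 requires only that step (b) use a discriminant-based classifier, without naming one. As a hedged example, assuming the expression data are NumPy arrays with samples as rows and genes as columns, linear discriminant analysis from scikit-learn could serve as the fitness measure:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discriminant_fitness(gene_indices, X_train, y_train, X_test, y_test):
    """Fitness of one candidate gene subset: the accuracy of a
    discriminant-based classifier, restricted to those genes, on test
    samples of known class (step (b)). LDA is a stand-in only; claim 57
    does not name a specific discriminant classifier."""
    clf = LinearDiscriminantAnalysis()
    # rows are samples, columns are genes; keep only the candidate genes
    clf.fit(X_train[:, gene_indices], y_train)
    return clf.score(X_test[:, gene_indices], y_test)
```

Once the fittest individual has been selected in step (e), the same classifier, fitted on that individual's gene subset, could be applied to a sample of unknown class, as contemplated by claim 44.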
58. The system of claim 57, further comprising a memory in which is stored a data set containing values for the pool of parameters in a plurality of samples of known class.
59. The system of claim 57 or claim 58, wherein the processing component is programmed to receive input of stored data containing values for the parameters represented by an individual, for the evaluation of its fitness in step (b) or step (d).
PCT/GB2003/001258 2002-05-02 2003-03-24 Analysis of gene expression data for multi-class prediction WO2003094086A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003217019A AU2003217019A1 (en) 2002-05-02 2003-03-24 Analysis of gene expression data for multi-class prediction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US37734702P 2002-05-02 2002-05-02
US60/377,347 2002-05-02

Publications (2)

Publication Number Publication Date
WO2003094086A2 true WO2003094086A2 (en) 2003-11-13
WO2003094086A3 WO2003094086A3 (en) 2005-02-24

Family

ID=29401484

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2003/001258 WO2003094086A2 (en) 2002-05-02 2003-03-24 Analysis of gene expression data for multi-class prediction

Country Status (2)

Country Link
AU (1) AU2003217019A1 (en)
WO (1) WO2003094086A2 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOJARCZUK C C ET AL: "GENETIC PROGRAMMING FOR KNOWLEDGE DISCOVERY IN CHEST-PAIN DIAGNOSIS - EXPLORING A PROMISING DATA MINING APPROACH" IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE, IEEE INC. NEW YORK, US, vol. 19, no. 4, July 2000 (2000-07), pages 38-44, XP000937061 ISSN: 0739-5175 *
LI L ET AL: "Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method" COMBINATORIAL CHEMISTRY AND HIGH THROUGHPUT SCREENING, vol. 4, no. 8, December 2001 (2001-12), pages 727-739, XP008039705 ISSN: 1386-2073 *
LI L ET AL: "Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method" BIOINFORMATICS (OXFORD), vol. 17, no. 12, December 2001 (2001-12), pages 1131-1142, XP002309171 ISSN: 1367-4803 cited in the application *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109727640A (en) * 2019-01-22 2019-05-07 袁隆平农业高科技股份有限公司 Full-length genome prediction technique and device based on automaton study technology
CN109727640B (en) * 2019-01-22 2021-03-02 隆平农业发展股份有限公司 Whole genome prediction method and device based on automatic machine learning technology
CN110993026A (en) * 2019-12-30 2020-04-10 苏州大学 Assessment method and system for myelodysplastic syndrome
CN110993026B (en) * 2019-12-30 2023-06-06 苏州大学 Evaluation method and evaluation system for myelodysplastic syndrome
CN113806409A (en) * 2020-05-28 2021-12-17 华为技术有限公司 Data pairing method and related equipment
CN111859624A (en) * 2020-06-24 2020-10-30 西安理工大学 Exercise prescription parameter management method based on NSGA-II algorithm
CN111859624B (en) * 2020-06-24 2024-04-26 西安理工大学 Exercise prescription parameter management method based on NSGA-II algorithm
CN112599190A (en) * 2020-12-17 2021-04-02 重庆大学 Method for identifying deafness related genes based on mixed classifier
CN112599190B (en) * 2020-12-17 2024-04-05 重庆大学 Method for identifying deafness-related genes based on mixed classifier

Also Published As

Publication number Publication date
AU2003217019A1 (en) 2003-11-17
WO2003094086A3 (en) 2005-02-24

Similar Documents

Publication Publication Date Title
Ooi et al. Genetic algorithms applied to multi-class prediction for the analysis of gene expression data
Statnikov et al. A gentle introduction to support vector machines in biomedicine, volume 2: case studies and benchmarks
Deutsch Evolutionary algorithms for finding optimal gene sets in microarray prediction
Cho et al. Cancer classification using ensemble of neural networks with multiple significant gene subsets
JP5064625B2 (en) Method and machine for identifying patterns
Hernandez Hernandez et al. A genetic embedded approach for gene selection and classification of microarray data
Yu et al. Feature selection and molecular classification of cancer using genetic programming
US20030225526A1 (en) Molecular cancer diagnosis using tumor gene expression signature
Rifkin et al. An analytical method for multiclass molecular cancer classification
WO2010030794A1 (en) Machine learning methods and systems for identifying patterns in data
Hanczar et al. Improving classification of microarray data using prototype-based feature selection
Simon Supervised analysis when the number of candidate features (p) greatly exceeds the number of cases (n)
WO2003076928A1 (en) Methods for identifying large subsets of differentially expressed genes based on multivariate microarray data analysis
Mohamad et al. A hybrid of genetic algorithm and support vector machine for features selection and classification of gene expression microarray
EP4334912A1 (en) Analysis of histopathology samples
Lin et al. Pattern classification in DNA microarray data of multiple tumor types
Cai et al. Selecting dissimilar genes for multi-class classification, an application in cancer subtyping
Paul et al. Extraction of informative genes from microarray data
WO2003094086A2 (en) Analysis of gene expression data for multi-class prediction
Kim et al. A genetic filter for cancer classification on gene expression data
Ando et al. Classification of gene expression profile using combinatory method of evolutionary computation and machine learning
Yang et al. A combination of shuffled Frog–Leaping algorithm and genetic algorithm for gene selection
Park et al. Evolutionary ensemble classifier for lymphoma and colon cancer classification
Bolón-Canedo et al. Feature selection in DNA microarray classification
Devi Arockia Vanitha et al. Multiclass cancer diagnosis in microarray gene expression profile using mutual information and support vector machine

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP