WO2010005317A1 - Predicting phenotype from gene expression data - Google Patents

Predicting phenotype from gene expression data

Publication number
WO2010005317A1
Authority
WO
WIPO (PCT)
Prior art keywords
stress
phenotype
gene expression
prediction
genes
Prior art date
Application number
PCT/NO2009/000257
Other languages
French (fr)
Other versions
WO2010005317A4 (en)
Inventor
Nicholas A. Robinson
Ben J. Hayes
Original Assignee
Nofima Akvaforsk-Fiskeriforskning As
Goddard, Michael, E.
Application filed by Nofima Akvaforsk-Fiskeriforskning As, Goddard, Michael, E.
Publication of WO2010005317A1
Publication of WO2010005317A4

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to a method for the prediction of phenotype from gene expression data. Further the invention provides a method for the prediction of disease resistance and disease susceptibility in a subject and for the detection, including early warning, of exposure to stresses affecting a subject or population. The invention also encompasses the use of the method for the selection of disease resistant animals for breeding. The invention further encompasses the use of the method for detection of levels of stress exposure in the farm or aquaculture environment. The invention further comprises a test kit and software and the use thereof for the evaluation of disease resistance and susceptibility.
  • Figures 1A and 1B: Occurrence of genes with the highest correlation to time to metastases derived from one thousand subsets of (A) 67 patients and (B) 20 patients which were randomly chosen without replacement. Identification numbers and names are shown for the 10 most frequently occurring genes.
  • Figure 3 Simulation used to evaluate the use of gene expression profiles with selective breeding for disease resistance.
  • Four different criteria for selection were simulated: CRIT1, selection based on family breeding value for the disease challenge test; CRIT2, selection based on prediction of challenge test phenotype using gene expression profiling; CRIT3, selection of families based on breeding values for challenge test combined with selection of individuals within families based on prediction of challenge test phenotype using gene expression profiling; CRIT4, direct selection of disease challenge test survivors.
  • Figure 4 Mean genetic response to challenge test (A-D) and corresponding benefit-cost ratios (E-H) over selective breeding programs (SBPs) simulated in Atlantic salmon using the four alternative selection criteria for disease resistance. CRIT1 (open triangles, point up), CRIT2 (closed diamonds), CRIT3 (open triangles, point down) and CRIT4 (open diamonds).
  • the correlation between the polygenetic and phenotypic effects between the challenge test and gene expression profile traits (r g and T y respectively) was varied by increasing amounts as follows. A and E, r g 0.3
  • high throughput gene expression profiling e.g. using DNA microarrays
  • gene expression profiles are used with selective breeding.
  • the profiles can be used to discover genes affecting particular traits or involved in particular processes and the variants of these genes which can be selected.
  • the gene expression profile is used to predict a phenotype or breeding value, such as resistance or susceptibility to a disease. Since tissue for gene expression profiling can be collected from live candidates, the profiles can be used as a selection tool for a selective breeding program in order to make genetic improvements to a strain or species.
  • a method of eliciting a gene expression response from breeding candidates is used as a selection tool, for example by challenging cells derived from these subjects to disease.
  • the gene expression profile is used to detect levels of stress (including stress due to handling and human interaction, stocking density, levels of toxic chemicals in the water or feed, feed availability and composition, disease, temperature, salinity, pH, oxygen and combinations or fluctuations in the levels of these factors) in the farm environment.
  • the cells of the tumour and surrounding cell types respond to the genetic changes occurring in the cells making up the tumour.
  • One particular point mutation may lead to cascades of gene expression changes. These gene expression changes may be used to help classify patient tumour types so that the most effective available treatment can be recommended.
  • certain cells e.g. macrophages or leukocytes
  • the resulting cascades of gene expression changes in these cells may differ between subjects that are able to resist versus those that are more susceptible to the disease.
  • Cancer research has focussed on the use of gene expression data for the classification of patient data and a number of discrimination methods have been compared for this purpose [Dudoit S., Fridlyand J., Speed T.
  • Van't Veer et al. [Van't Veer L. J., et al., Nature 415 (2002) 530-536]
  • Van't Veer L. J., et al. used a binary classification of time to metastases of greater than or less than 5 years after primary tumour biopsy.
  • Many patients used in the Van't Veer data set actually had recorded times to metastases close to 5 years (mean 6.02 ± 3.75, inventors' own calculations).
  • the resulting classes are likely to be heterogeneous. If enough patients with extreme variation for the trait (e.g.
  • the random regression method according to the present invention provides a quantitative estimate of the phenotype which can be used to rank animals or patients to pick those with the greatest chance of extreme phenotypic performance. Noise due to subjects in the calibration set with phenotypic values close to the cut-off between two phenotypic classes is eliminated in the analysis because the raw phenotypic measurements for the trait are utilised directly.
  • microarrays may be used as a "first pass screen" to identify likely indicator genes using a modest calibration data set consisting of samples from extreme performing animals for the phenotype of interest.
  • Gene expression values for the indicative genes could be measured for larger numbers of subjects (in the order of thousands, preferably) using MassARRAY competitive RT PCR (combining competitive reverse transcriptase PCR with matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry) [Ding C, Cantor C.
  • the present invention concerns a method for the prediction of phenotype from gene expression data.
  • the method may be used to predict phenotypes such as disease resistance, disease susceptibility and stress due to factors such as handling and human interaction, stocking density, levels of toxic chemicals in the water or feed, feed availability and composition, disease, temperature, salinity, pH or oxygen. Further the invention provides a method for the selection of disease resistant animals for breeding.
  • a method for determining the optimum number of genes (k), and which particular genes, to include for prediction of phenotype, wherein the correlation of predicted to actual phenotype is found to be highest when k genes are included in the prediction equation. The method utilises randomly generated subsets of data selected from a calibration set of data (with or without replacement) and assesses the average correlation of predicted to actual phenotype across the subsets as additional genes are included in a random regression model.
  • % is number of times gene i is detected as part of the optimum group of genes to include in the prediction equation with k iterations of the method described above
  • x_i is the gene expression for gene i measured in a particular tissue or cell type while that tissue or cell type exists under a particular state, or is the change in gene expression for gene i in that tissue or cell type in response to a change of state
  • b_i is the regression coefficient for each gene, and x_i is the gene expression for gene i measured in a particular tissue or cell type while that tissue or cell type exists under a particular state, or is the change in gene expression for gene i in that tissue or cell type in response to a change of state,
  • a defined set of genes whose pattern of gene expression can be used for the prediction of disease resistance or disease susceptibility or exposure to stress, wherein gene expression is evaluated and the prediction of disease resistance or susceptibility phenotype or level of exposure to stress is made using the methods described herein above.
  • test kit comprising test reagents for measuring the expression of genes and providing data that can be used for the prediction of phenotype.
  • the test reagents used in the test kit may be selected from microarrayed DNA samples or oligonucleotides for quantitative PCR (Polymerase Chain Reaction) or competitive reverse transcriptase PCR.
  • the invention encompasses in yet another embodiment an analysis software package which can perform the calculations necessary to predict disease resistance.
  • the software package would accept data collected using the methods described herein above, and would automate the calculations necessary for prediction of phenotype according to equations 1 and 2.
  • the software could be made available for use on a personal computer or could be operated or accessed by registered users on a mainframe server over the internet.
  • Gene expression responses that are found to be correlated with disease resistance or disease susceptibility or exposure to levels of stress may provide information about causative genes that could be targeted for vaccines, RNAi or other treatments and can be used to assist choosing the best performing animals for grow-out in particular production environments.
  • Example 1 To test the method, publicly available gene expression data from patients with cancer were used.
  • a cRNA pool was created from each of the sporadic carcinomas and used as a common reference sample in the Van't Veer et al. study. All background corrections, normalisation and log conversions had already been applied by Van't Veer et al. [Van't Veer L. J., et al., Nature 415 (2002) 530-536].
  • NUMGEN: One thousand randomly generated subsets of 67 patients were selected from the calibration set of 77 patients, without replacement. Corresponding subsets of remaining patient data (10 patients) were also generated and were kept aside for testing how the prediction equations performed with the addition of increasing numbers of genes to the model (see below). For each subset of 67 patients, we repeated the following steps for k iterations:
  • Z as the gene, eg s the level of expression for the/ h individual for the gene with the highest correlation to the phenotypes in iteration k.
  • Phenotypes were corrected for the level of expression for gene k. Gene k was removed from the data set and added to Z. Return to step 1.
  • EST1: NUMGEN yielded 1000 sets of models (each consisting of k genes, corresponding k regression coefficients and residual variance) that were predictive for a randomly selected subset of 67 patients. The frequency of occurrence of these predictive genes over all 1000 NUMGEN models was determined for the set of 77 calibration patients. For EST1 we repeated steps 1-5 above (NUMGEN) k times; however, this time the mean regression coefficients for the k most frequently occurring genes over all analyses of the 1000 subsets of 67 patients from NUMGEN were used in the prediction where,
  • the sensitivity of the method to sample size was checked by repeating NUMGEN, this time using 1000 randomly generated subsets of 20 patients. The list of genes selected using this smaller sample size was compared to those selected using subsets of 67 patients (described above).
  • Figures 1A and 1B show the frequency of occurrence of genes that were added to the models (those giving highest correlation with phenotype) over the 1000 subsets of randomly chosen patients (NUMGEN for models with 60 genes).
  • the sensitivity of the method to calibration set sample size is shown by comparing the occurrence of genes identified when subsets of 67 versus 20 patients are used.
  • the genes showing a high occurrence when subsets of 20 patients were used were mostly the same genes with high occurrence found when samples of 67 patients were chosen without replacement. However, when subsets of 20 patients were used, these genes had a generally lower frequency of occurrence (i.e.
  • gene 2279 (HBG1) was encountered in 110 of the 1000 subsets of 67 patients and in 31 of the 1000 subsets of 20 patients (Figures 1A and 1B). Background noise (i.e. the number of genes with multiple occurrences throughout the 1000 subsets) was higher when subsamples of 20 patients were used to formulate solutions compared to subsets of 67 patients.
  • Gene 9730 (unidentified EST) was the gene most commonly found to predict phenotype (high correlation with phenotype in 660 of the 1000 subsets of 67 patients).
  • Four of the five most commonly identified genes with the random regression and cross validation method were also identified by the ELDA method [20] (SEC10L1, NMU and unidentified expressed sequences 9730 and 3851).
  • Table I shows correlation of predicted to actual time to metastases and success of classification using random regression with cross validation. Success of classification was determined after binary conversion of the predicted and actual phenotypes. The correlation and classification success is for prediction equations derived from methods EST1 and EST2.
  • the biased test was a form of incomplete cross validation where gene selection was made using all 77 patients, solutions for these genes were derived from subsets of 67 patients drawn from the 77 patients and resulting prediction equations were applied to the corresponding subset of 10 "left-out" patients. n, number of predictions tested for each correlation; nsets, number of correlation coefficients.
  • the polygenic component of genetic value of animal i for traits 1 and 2 (a_1i and a_2i) was sampled from a bivariate normal distribution whose means were the mean additive genetic effects of trait 1 and trait 2 respectively and whose covariance was G, the additive genetic covariance matrix. Values were all set to 25 units in the base population.
  • the covariance was derived by setting the correlation between the polygenic effects for the two traits (r_g). The correlation was varied from 0.3 to 0.9. From Example 1 this value is likely to be dependent on the number of informative genes (differential expression values) included in the prediction equation.
  • the challenge test phenotype for animal was calculated as
  • the predicted challenge test phenotype for animal z using gene expression profiling was calculated as ,
  • Animals in each generation were assigned a sex, male or female, with probability 0.5.
  • the genetic value of progeny i for trait 2 was calculated from the genetic values of the sire and dam of individual i.
  • Mate pairs were chosen from the families which ranked in the top 10% with respect to estimated breeding value (EBV).
  • the maximum number of mate pairs that could be chosen from any of the top EBV ranked families was limited to 15 pairs.
  • the inbreeding coefficient would be expected to increase by around 0.8% every generation.
  • CRIT1 was used to select families while CRIT2 was used to select individuals within families. 300 mate pairs were randomly chosen from among these selected animals. All animals not challenged to disease (15,000 total) were tested using gene expression profiling. As per CRIT1, it was not possible to estimate breeding values for the first round of selection and mating. So for the initial round of selection and mating CRIT2 was used to rank all animals in the base population.
  • the terms of the cost calculation were: the total number of animals challenge tested; the cost of challenge testing (done by contract at another site including costs of transportation); the cost of labour per test (to identify, sort and sample animals for challenge or gene expression profiling); the additional costs (e.g. equipment and anaesthesia associated with the tests); all other costs to run a selective breeding program for Atlantic salmon; the cost of genetic testing; n_ge, an estimate of the number of tests required (where * 60); and the total number of fish in the selective breeding program.
  • the economic benefit from reduced antibiotic dependency in this case was assumed to be zero as antibiotics are no longer used to treat Atlantic salmon in Norway.
  • the possible economic benefit of reduced dependency on vaccines was assumed to also be zero as growers might continue to vaccinate fish in any case for added security.
  • Table II lists the values used for the parameters defined above. A number of assumptions were made in deriving these values. Apart from the parameters listed below, the same values as Thorarinsson and Powell (2006) were used. We assumed that disease challenge test survival had a high positive correlation with survival to the same disease in the wild. This assumption was based on research by Gjøen, H.M., Refstie, T., Ulla, O., Gjerde, B., 1997. Genetic correlations between survival of Atlantic salmon in challenge and field tests. Aquaculture 158, 277-288, who found a high positive correlation between these two traits for the pathogenic bacterium Aeromonas salmonicida (0.95).
  • Cost_ge would vary depending on laboratory, country, available technology and estimated throughput. Cost_ge was assumed to be of a similar value to that estimated for genotyping (e.g. Hayes, B., Baranski, M., Goddard, M.E., Robinson, N., 2007. Optimisation of marker assisted selection for abalone breeding programs. Aquaculture 265, 61-69). We assumed in the model that Cost_ge would remain constant from generation to generation. Cost_chal was a conservative estimate from figures supplied by breeding companies and researchers involved in organizing the challenge tests. Cost_sbp was calculated from the combined published financial results for year 2006 of the breeding companies Aqua Gen AS and SalmoBreed AS in Norway as follows: (total income - total profit) * proportion of sales made in Norway
  • the genetic response achieved under CRIT1 resulted in 100% improved survival to disease challenge after 10 generations (Fig. 4A-D).
  • the genetic responses from CRIT2 and CRIT3 increased as the genetic and phenotypic correlation between traits 1 and 2 increased, reflecting improved predictive ability and consequent improved selection accuracy with the use of gene expression profiles as a predictor of disease resistance.
  • CRIT4 resulted in 100% improved survival to disease challenge after 8 generations.
  • CRIT3 gave the highest rate of genetic gain, even when the genetic and phenotypic correlation between trait 1 and 2 was low (Fig. 4A, Using CRIT3, survival to disease challenge was 100% improved after 6-7 generations of selection, and varying the phenotypic and genetic correlation had a relatively small effect on the overall genetic response after 10 generations (226% when
  • CRIT2 gene expression profile information alone as a selection criterion
  • CRIT1 family breeding value based on challenge test data
  • the benefit to cost ratio was positive under all scenarios. The highest ratio was for CRIT1 when r_g was low (0.3, Fig. 4E) and for CRIT4 under all other circumstances (Fig. 4F-H).
  • CRIT4 assumes that direct selection and breeding of survivors from the challenge test is possible.
  • CRIT3 and CRIT4 yielded equivalent benefits-costs (Fig. 4E).
  • the benefit-cost ratio for CRIT3 was highest when r_g was around 0.5 (Table III & Fig. 4F, 17:1 using 10th generation selectively bred stock and assuming 30 genes tested for making the prediction at a cost of €180/individual and total cost of over 10 million euro).
  • Use of CRIT2 yielded smaller benefit-cost ratios in comparison (up to 10:1).
  • Relative percent survival of 78% was achieved under CRIT3, compared to 44% under CRIT2 and 60% under CRIT1 (Table 2), resulting in comparatively large industry wide opportunity costs from use of CRIT3 compared to other selection criteria (142 million euro, Table III).
  • CRIT3: The total added value per kg of fish was 0.29 Euro/kg of fish produced and the nominal economic effect on operating income was over 175 million Euros after 10 generations of selection under selection criteria CRIT3 (Table III).
  • CRIT3 was almost as profitable an option as CRIT1, provided that the cost of gene expression testing was less than around €280/individual and r_g was greater than 0.3; it was more profitable than CRIT2 under all scenarios and yielded the highest total added value and highest nominal economic effect on operating income of all the selection criteria.
  • the model also assumes a high selection intensity will be possible as it does not account for the relative importance and weighting put on other traits in the selective breeding programs and does not account for avoidance of inbreeding. In a real breeding program something like optimal contribution selection would be applied to maximise genetic gain at a set rate of inbreeding (Hinrichs, D., Wetten, M., Meuwissen, T.H.E., 2006. An algorithm to compute optimal genetic contributions in selection programs with large numbers of candidates. J. Anim. Sci.
  • Aquaculture 38, 155-170
  • overall fingerling survival Rye, M., Lillevik, K.M., Gjerde, B., 1990. Survival in early life of Atlantic salmon and rainbow trout: estimates of heritabilities and genetic correlations.
  • Aquaculture 89, 209-216 and resistance to furunculosis (fingerlings and challenge, Gjedrem, T., Salte, R., Gjøen, H.M., 1991. Genetic variation in susceptibility of Atlantic salmon to furunculosis.
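
As an illustration of the NUMGEN procedure described in the definitions above (random subsets of the calibration set, repeated selection of the gene most correlated with the phenotype, correction of the phenotype for that gene, and counting how often each gene is selected), the following is a minimal Python sketch. It is not the inventors' implementation. It assumes ordinary least-squares regression of the current residual phenotype on one gene at a time in place of the random regression model, ranks genes by absolute Pearson correlation, and uses hypothetical function names (pearson, forward_select, gene_frequency) and a hypothetical data layout (a dict mapping each gene identifier to a list of expression values, one per subject).

```python
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom else 0.0


def forward_select(expr, pheno, k):
    """Pick up to k genes one at a time; return [(gene, slope), ...].

    Assumption: each step regresses the residual phenotype on the single
    best-correlated gene by least squares, then removes that gene.
    """
    residual = list(pheno)
    remaining = sorted(expr)  # sorted for reproducible tie-breaking
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda g: abs(pearson(expr[g], residual)))
        x = expr[best]
        mx = sum(x) / len(x)
        mr = sum(residual) / len(residual)
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (r - mr) for a, r in zip(x, residual))
        slope = sxy / sxx if sxx else 0.0
        # Correct the phenotypes for the expression level of the chosen gene.
        residual = [r - slope * (a - mx) for a, r in zip(x, residual)]
        chosen.append((best, slope))
        remaining.remove(best)
    return chosen


def gene_frequency(expr, pheno, n_subsets, subset_size, k, seed=0):
    """Count how often each gene is selected over random subject subsets."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_subsets):
        idx = rng.sample(range(len(pheno)), subset_size)  # without replacement
        sub_expr = {g: [v[i] for i in idx] for g, v in expr.items()}
        sub_pheno = [pheno[i] for i in idx]
        for gene, _slope in forward_select(sub_expr, sub_pheno, k):
            counts[gene] += 1
    return counts
```

With the 77-patient calibration set, a call such as gene_frequency(expr, pheno, n_subsets=1000, subset_size=67, k=60) would correspond to the subsets of 67 patients described above. The patients left out of each subset could then be used to compare predicted with actual phenotype as genes are added.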
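
The building blocks of the breeding simulation described in the definitions above (polygenic values for two traits drawn from a bivariate normal distribution with genetic correlation r_g, sex assigned with probability 0.5, progeny values derived from sire and dam, and selection of the top 10% of families by EBV) can be sketched in Python as follows. This is a sketch under stated assumptions, not the simulation used in the application. The progeny equation did not survive extraction, so the sketch assumes the standard quantitative-genetics form: the mean of the parental values plus a Mendelian sampling term with variance equal to half the additive variance. All function and parameter names are hypothetical.

```python
import math
import random


def sample_polygenic(rng, mean1, mean2, var1, var2, r_g):
    """Draw (a1, a2) from a bivariate normal with genetic correlation r_g."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    a1 = mean1 + math.sqrt(var1) * z1
    # Mix z1 and z2 so that corr(a1, a2) equals r_g.
    a2 = mean2 + math.sqrt(var2) * (r_g * z1 + math.sqrt(1.0 - r_g ** 2) * z2)
    return a1, a2


def assign_sex(rng):
    """Assign male or female with probability 0.5 each."""
    return "male" if rng.random() < 0.5 else "female"


def progeny_value(rng, sire_value, dam_value, additive_var):
    """Mid-parent value plus Mendelian sampling (assumed form, see lead-in)."""
    mendelian = rng.gauss(0.0, math.sqrt(0.5 * additive_var))
    return 0.5 * (sire_value + dam_value) + mendelian


def top_families(family_ebv, fraction=0.10):
    """Return family ids ranked in the top `fraction` by EBV, best first."""
    n_keep = max(1, round(len(family_ebv) * fraction))
    return sorted(family_ebv, key=family_ebv.get, reverse=True)[:n_keep]
```

Varying r_g from 0.3 to 0.9 in sample_polygenic mirrors the range of genetic correlations explored in the simulation. The limit of 15 mate pairs per selected family is not modelled here.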

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Genetics & Genomics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention is concerned with a method for the prediction of phenotype from gene expression data or for the detection of stress in the farm environment from gene expression data.

Description

PREDICTING PHENOTYPE FROM GENE EXPRESSION DATA
Field of invention
The present invention relates to a method for the prediction of phenotype from gene expression data. Further the invention provides a method for the prediction of disease resistance and disease susceptibility in a subject and for the detection, including early warning, of exposure to stresses affecting a subject or population. The invention also encompasses the use of the method for the selection of disease resistant animals for breeding. The invention further encompasses the use of the method for detection of levels of stress exposure in the farm or aquaculture environment. The invention further comprises a test kit and software and the use thereof for the evaluation of disease resistance and susceptibility.
Background of invention
Traits such as disease resistance are receiving increasing attention by agricultural industries and export markets. Disease poses a high risk to the success of intensive agriculture and aquaculture through its effects both on production and animal welfare. However, disease resistance is particularly costly for breeding companies to evaluate and is slow to improve using current challenge testing methodology. As there are risks that survivors of a challenge test may carry the disease, these individuals, which have the greatest resistance to the disease within each family, are not normally utilised for further breeding i.e. unchallenged siblings belonging to the best performing families are used instead. Thus there is a need for appropriate technologies that can be used to measure disease resistance on live breeding candidates.
The prediction of a disease resistant phenotype from gene expression data is in many ways analogous to the classification of cancer types in human patients using microarray data [e.g. Antonov A. V., Tetko I. V., Mader M. T., Budczies J., Mewes H. W., Optimization models for cancer classification: extracting gene interaction information from microarray expression data, Bioinformatics 20 (2004) 644-652; Golub T. R., Slonim D. K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J. P., Coller H., Loh M. L., Downing J. R., Caligiuri M. A., Bloomfield C. D., Lander E. S., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531-537; Tibshirani R., Hastie T.,
Narasimhan B., Chu G., Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proceedings of the National Academy of Sciences of the U.S.A. 99 (2002) 6567-6572; Van't Veer L. J., Dai H., Van de Vijver M. J., He Y. D., Hart A. A. M., Mao M., Peterse H. J., Van der Kooy K., Marton M. J., Witteveen A. T., Schreiber G. J., Kerkhoven R. M., Roberts C., Linsley P. S., Bernards R., Friend S. H., Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (2002) 530-536; Wang Y., Klijn J. G. M., Zhang Y., Sieuwerts A. M., Look M. P., Yang F., Talantov D., Timmermans M., Meijer-van Gelder M. E., Yu J., Gene expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer, The Lancet 365 (2005) 671-679] or to prediction of tumour resistance to chemotherapy [Barbado M., Preisser L., Boisdron-Celle M., Verriele V., Lorimer G., Gamelin E., Morel A., Tumour quantification of several fluoropyrimidines resistance gene expression with a unique quantitative RT-PCR method. Implications for pretherapeutic determination of tumour resistance phenotype, Cancer Letters 242 (2006) 168-179].
Another issue receiving increasing attention is stress and welfare in the farm environment. Reducing the level of stress among farmed livestock is desired by the public for ethical and welfare reasons and by farmers for productivity reasons.
Many factors such as handling and human interaction, stocking density, levels of toxic chemicals in the water or feed, feed availability and composition, disease, temperature, salinity, pH, oxygen and combinations or fluctuations in the levels of these factors are all likely to affect stress levels in farmed Atlantic salmon. Stresses like these are known to affect the long-term performance of individual fish in terms of growth [Refstie, S. et al. Differing nutritional responses to dietary soybean meal in rainbow trout (Oncorhynchus mykiss) and Atlantic salmon (Salmo salar). Aquaculture 190, 49-63 (2000)., Boujard, T., Labbe, L. & Auperin, B. Feeding behaviour, energy expenditure and growth of rainbow trout in relation to stocking density and food accessibility. Aquaculture Research 33, 1233-1242 (2002).,
Refstie, S., Storebakken, T., Baeverfjord, G. & Roem, A. J. Long-term protein and lipid growth of Atlantic salmon (Salmo salar) fed diets with partial replacement of fish meal by soy protein products at medium or high lipid level. Aquaculture 193, 91-106 (2001)., Refstie, S., Sahlstrom, S., Brathen, E., Baeverfjord, G. & Krogedal, P. Lactic acid fermentation eliminates indigestible carbohydrates and antinutritional factors in soybean meal for Atlantic salmon (Salmo salar). Aquaculture 246, 331-345 (2005).], meat quality [Danley, M. L., Kenney, P. B., Mazik, P. M., Kiser, R. & Hankins, J. A. Effects of carbon dioxide exposure on intensively cultured rainbow trout Oncorhynchus mykiss: Physiological responses and fillet attributes. Journal of the World Aquaculture Society 36, 249-261 (2005).] and disease resistance [Krogdahl, A., Bakke-Mckellep, A. M., Roed, K. H. & Baeverfjord, G. Feeding Atlantic salmon Salmo salar L. soybean products: effects on disease resistance (furunculosis), and lysozyme and IgM levels in the intestinal mucosa. Aquaculture Nutrition 6, 77-84 (2000)., Gjoen, T. et al. Effect of dietary lipids on macrophage function, stress susceptibility and disease resistance in Atlantic salmon (Salmo salar). Fish Physiology and Biochemistry 30, 149-161 (2004).]. Therefore there is a need for reliable early warning tools for measuring stress. Farmers could then utilise the information derived from such tools to make adjustments to feeding, stocking density, cage placement etc. in order to avoid disease outbreaks and to raise performance.
There is a need for better tools to evaluate whether fish are being affected by sustained low-level stress or by frequent intermittent periods of stress in the farm or aquaculture environment. The final performance of the fish in the aquaculture environment, or of other livestock, depends on the accumulated effects of these environmental factors on the animal's physiology over time. The diverse phenotypic effects caused by long-term exposure to mild stressors only become apparent after a delay, by which time it is too late to make changes to reduce the stress. Tools that give an early warning signal so that management practices can be adjusted are therefore necessary. The tests need to be easy to perform, inexpensive, highly sensitive, accurate, reproducible, and able to distinguish between different conditions (e.g. high or low levels of stress, differential responses to different types of stress, and whether the response is compensatory, adaptive or transient). The most widely used method at present for measuring stress levels in farm animals is the plasma cortisol response. However, high cortisol levels are quickly induced and are affected by the transient stress of sampling the animal, and the relationship of this short-term response to production traits such as disease resistance is unclear [Fevolden, S. E., Nordmo, R., Refstie, T. & Roed, K. H. Disease Resistance in Atlantic Salmon (Salmo-Salar) Selected for High Or Low Responses to Stress. Aquaculture 109, 215-224 (1993)., Fevolden, S. E., Refstie, T. & Roed, K. H. Disease Resistance in Rainbow-Trout (Oncorhynchus-Mykiss) Selected for Stress Response. Aquaculture 104, 19-29 (1992)., Fevolden, S. E., Refstie, T. & Roed, K. H. Selection for High and Low Cortisol Stress Response in Atlantic Salmon (Salmo-Salar) and Rainbow-Trout (Oncorhynchus-Mykiss). Aquaculture 95, 53-65 (1991).].
Brief description of the figures
Figures 1A and 1B. Occurrence of genes with the highest correlation to time to metastases derived from one thousand subsets of (A) 67 patients and (B) 20 patients which were randomly chosen without replacement. Identification numbers and names are shown for the 10 most frequently occurring genes.
Figure 2. Correlation of predicted versus actual phenotype with increasing numbers of genes for 1000 sets of 10 left out patients (NUMGEN).
Figure 3. Simulation used to evaluate the use of gene expression profiles with selective breeding for disease resistance. Four different criteria for selection were simulated: CRIT1, selection based on family breeding value for the disease challenge test; CRIT2, selection based on prediction of challenge test phenotype using gene expression profiling; CRIT3, selection of families based on breeding values for challenge test combined with selection of individuals within families based on prediction of challenge test phenotype using gene expression profiling; CRIT4, direct selection of disease challenge test survivors.
Figure 4. Mean genetic response to challenge test (A-D) and corresponding benefit-cost ratios (E-H) over selective breeding programs (SBPs) simulated in Atlantic salmon using the four alternative selection criteria for disease resistance. CRIT1 (open triangles, point up), CRIT2 (closed diamonds), CRIT3 (open triangles, point down) and CRIT4 (open diamonds). The correlation between the polygenic and phenotypic effects between the challenge test and gene expression profile traits (rg and ry respectively) was varied by increasing amounts as follows. A and E, rg=0.3
Figure imgf000006_0001
Description of the invention
In one aspect of the invention, high-throughput gene expression profiling technology (e.g. using DNA microarrays) is utilized to choose livestock for production and selective breeding. In one embodiment of the invention gene expression profiles are used with selective breeding. The profiles can be used to discover genes affecting particular traits or involved in particular processes and the variants of these genes which can be selected. In another aspect of the invention the gene expression profile is used to predict a phenotype or breeding value, such as resistance or susceptibility to a disease. Since tissue for gene expression profiling can be collected from live candidates, the profiles can be used as a selection tool for a selective breeding program in order to make genetic improvements to a strain or species. In a further aspect of the invention a method of eliciting a gene expression response from breeding candidates is used as a selection tool, for example by challenging cells derived from these subjects to disease. In yet another aspect of the invention the gene expression profile is used to detect levels of stress (including stress due to handling and human interaction, stocking density, levels of toxic chemicals in the water or feed, feed availability and composition, disease, temperature, salinity, pH, oxygen and combinations or fluctuations in the levels of these factors) in the farm environment.
In the case of cancer, the cells of the tumour and surrounding cell types respond to the genetic changes occurring in the cells making up the tumour. One particular point mutation may lead to cascades of gene expression changes. These gene expression changes may be used to help classify patient tumour types so that the most effective available treatment can be recommended. In the case of certain other diseases, certain cells (e.g. macrophages or leukocytes) respond to the disease agent, and the resulting cascades of gene expression changes in these cells may differ between subjects that are able to resist the disease and those that are more susceptible to it. Cancer research has focussed on the use of gene expression data for the classification of patient data and a number of discrimination methods have been compared for this purpose [Dudoit S., Fridlyand J., Speed T. P., Comparison of discrimination methods for the classification of tumours using gene expression data, Journal of the American Statistical Association 97 (2002) 77-87]. Univariate approaches, where expression values for the genes most correlated with phenotype are subsequently used to predict disease outcome, give a poor classification success rate [Dabney A. R., Storey J. D. 2005. Optimal feature selection for nearest centroid classifiers, with applications to gene expression microarrays, pp. 1-26. In UW Biostatistics Working Paper Series]. This is because gene expression is part of a co-regulatory network, so the value of including multiple genes affected in the same network in the prediction equation is often not as great as that of including genes from different networks, even though the correlation with the phenotype may be higher for the genes in the same network. No single multivariate discrimination or classification method has been found to be clearly better than the others for all situations [Dudoit S., et al., Journal of the American Statistical Association 97 (2002) 77-87].
However, doubts have been raised over the accuracy of these methods and their validation. There is little overlap in the gene sets that have been identified by different studies to predict survival for the same type of cancer, the sets of genes arrived at have been found to be highly dependent on the subset of patients used for gene selection, and when one group's predictor is tested on another group's data the success of prediction decreases significantly [Ein-Dor L., Kela I., Getz G., Givol D., Domany E., Outcome signature genes in breast cancer: is there a unique set?, Bioinformatics 21 (2005) 171-178., Ein-Dor L., Zuk O., Domany E., Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer, Proceedings of the National Academy of Sciences of the U.S.A. 103 (2006) 5923-5928., Michiels S., Koscielny S., Hill C., Prediction of cancer outcome with microarrays: a multiple random validation strategy, The Lancet 365 (2005) 488-492]. There are two possible reasons for the discrepancies observed. First, biases in the validation of the prediction have been introduced into many of these studies. Ntzani and Ioannidis [Ntzani E., Ioannidis J. P. A., Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment, The Lancet 362 (2003) 1439-1444], in an assessment of 84 different studies, found that most were not properly validated. Complete validation of all steps in the prediction method (selection of genes, computation of solutions and creation of the prediction rule) is critical, otherwise substantial bias can be introduced into the test of the method [Michiels S., et al., The Lancet 365 (2005) 488-492]. Second, the patients used in the calibration may not accurately reflect those in the wider population [Simon R., Radmacher M. D., Dobbin K., McShane L. M., Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification, Journal of the National Cancer Institute 95 (2003) 14-18].
There may also be biological or more fundamental experimental design problems that contributed to the low predictive success of these methods. Information on the distribution of variation within classes was ignored in these studies because the patients were classified into broad categories. For example, Van't Veer et al. [Van't Veer L. J., et al., Nature 415 (2002) 530-536] used a binary classification of time to metastases of greater than or less than 5 years after primary tumour biopsy. Many patients in the Van't Veer data set actually had recorded times to metastases close to 5 years (mean 6.02 ± 3.75, inventors' own calculations). The resulting classes are likely to be heterogeneous. If enough patients with extreme variation for the trait (e.g. time to metastases less than 2 years versus greater than 10 years) could be used as the training subset, or if the complete phenotypic variation was used in the analysis, this could lead to a more reliable prediction equation and also might result in similar subsets of genes being identified between studies.
In both the animal disease resistance and the cancer case there is expression information from a large number of genes that may be utilised to predict the phenotype of the subject, i.e. the animal or patient. For both cases there is a need to determine which genes contribute most to the classification or prediction, and then to determine how each gene's expression data should be weighted in a prediction equation. There is also a need for an equation that is predictive for most subsets of subjects (patients or breeding candidates). For animal breeding and production purposes there is a need for a method for quantitative prediction which will allow choosing the best performing animals. According to an aspect of the invention, a method is provided involving the use of random regression with cross validation, which both accounts for the distribution of variation in the trait and utilises different subsets of patients to perform a complete validation of predictive ability.
Although the probability of classification success has been calculated in some instances, classification methods used to date in the literature are not designed to rank patients or animals. In contrast, the random regression method according to the present invention provides a quantitative estimate of the phenotype which can be used to rank animals or patients to pick those with the greatest chance of extreme phenotypic performance. Noise due to subjects in the calibration set with phenotypic values close to the cut-off between two phenotypic classes is eliminated in the analysis because the raw phenotypic measurements for the trait are utilised directly.
For traits such as disease resistance large numbers of animals will be included in formulating the prediction equations from microarray data. To minimise costs and maximise accuracy, microarrays may be used as a "first pass screen" to identify likely indicator genes using a modest calibration data set consisting of samples from extreme performing animals for the phenotype of interest. Gene expression values for the indicative genes could be measured for larger numbers of subjects (in the order of thousands, preferably) using MassARRAY competitive RT PCR (combining competitive reverse transcriptase PCR with matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry) [Ding C, Cantor C. R., A high-throughput gene expression analysis technique using competitive PCR and matrix-assisted laser desorption ionization time-of-flight MS, Proc. Nat. Acad. Sci. USA 100 (2003) 3089-3094], or other more sensitive, reproducible and cost effective means of measuring gene expression (i.e. those methods designed to process many samples and fewer genes). These measurements may then be used to determine the final set of genes and solutions for inclusion in the prediction equation. Continuous evaluation can result in refinement of the choice of genes and solutions as more subjects are tested. In one embodiment the present invention concerns a method for the prediction of phenotype from gene expression data. The method may be used to predict phenotypes like disease resistance and disease susceptibility and stress to factors like handling and human interaction, stocking density, levels of toxic chemicals in the water or feed, feed availability and composition, disease, temperature, salinity, pH or oxygen. Further the invention provides a method for the selection of disease resistant animals for breeding.
In a further embodiment of the invention a method is provided for determining the optimum number of genes (k), and which particular genes, to include for prediction of phenotype, wherein the correlation of predicted to actual phenotype is found to be highest when k genes are included in the prediction equation. The method utilises randomly generated subsets of data selected from a calibration set of data (with or without replacement). The method assesses the average correlation of predicted to actual phenotype across the subsets as additional genes are included in a random regression model.
In yet another embodiment of the invention a method for the prediction of a phenotype is provided, wherein the estimate is calculated by the equations
(Equation 1)
Figure imgf000010_0002
wherein
Figure imgf000010_0001
and wherein b_i is the regression coefficient determined for each gene,
n_i is the number of times gene i is detected as part of the optimum group of genes to include in the prediction equation with k iterations of the method described above, and x_i is the gene expression for gene i measured in a particular tissue or cell type while that tissue or cell type exists under a particular state, or is the change in gene expression for gene i in that tissue or cell type in response to a change of state,
or
Figure imgf000011_0001
wherein b_i is the regression coefficient for each gene and x_i is the gene expression for gene i measured in a particular tissue or cell type while that tissue or cell type exists under a particular state, or is the change in gene expression for gene i in that tissue or cell type in response to a change of state.
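The equation images are not reproduced in this text. One reading that is consistent with the symbol definitions given here, and with the description of the EST1 and EST2 predictions in Example 1, is sketched below. The exact typeset form is an assumption (for example, whether a mean term is added to the sum), and the index m over the models that contain gene i is introduced only for illustration.

```latex
% Equation 1 (assumed form): prediction from mean regression coefficients,
% where b_{i,m} is the coefficient of gene i in the m-th model that includes it
\hat{y} = \sum_{i=1}^{k} \bar{b}_i \, x_i ,
\qquad
\bar{b}_i = \frac{1}{n_i} \sum_{m=1}^{n_i} b_{i,m}

% Equation 2 (assumed form): prediction from a single set of coefficients
\hat{y} = \sum_{i=1}^{k} b_i \, x_i
```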
A further embodiment of the invention provides a defined set of genes whose pattern of gene expression can be used for the prediction of disease resistance, disease susceptibility or exposure to stress, wherein gene expression is evaluated and the prediction of the disease resistance or susceptibility phenotype or the level of exposure to stress is made using the methods described herein above.
In another embodiment of the invention a test kit comprising test reagents for measuring the expression of genes and providing data that can be used for the prediction of phenotype is provided. The test reagents used in the test kit may be selected from microarrayed DNA samples or oligonucleotides for quantitative PCR (Polymerase Chain Reaction) or competitive reverse transcriptase PCR.
The invention encompasses in yet another embodiment an analysis software package which can perform the calculations necessary to predict disease resistance. The software package would accept data collected using the methods described herein above, and would automate the calculations necessary for prediction of phenotype according to equations 1 and 2. The software could be made available for use on a personal computer or could be operated or accessed by registered users on a mainframe server over the internet. Gene expression responses that are found to be correlated with disease resistance, disease susceptibility or exposure to levels of stress may provide information about causative genes that could be targeted for vaccines, RNAi or other treatments, and can be used to assist in choosing the best performing animals for grow-out in particular production environments.
Traits such as disease resistance are costly to evaluate and slow to improve using current selection and breeding methods. Analysis of gene expression profiles (e.g. DNA microarrays) may predict such phenotypes and has been used in an analogous way to classify cancer types in human patients. The method according to the present invention provides a quantitative estimate of phenotype that can be used to rank subjects (animals or patients) and choose those with extreme phenotypic performance.
The invention will now be illustrated by a non-limiting example.
Examples
Example 1
To test the method, publicly available gene expression data from patients with cancer were used.
1.1 Data utilised
Primary breast tumour gene expression data supplied by Van't Veer et al. were utilized [Van't Veer L. J., et al., Nature 415 (2002) 530-536] (www.rii.com/publications/2002/vantveer.html). The data were from 97 patients who developed distal metastases (a poor prognosis). The actual time until the development of distal metastases (measured in years from when the breast tumour biopsy was taken) was used as the trait for calibration and testing. The data set included expression information for 24,481 genes. One patient was dropped from the set because there were many missing gene expression values for this particular individual. Van't Veer et al. [Van't Veer L. J., et al., Nature 415 (2002) 530-536] describe the methodology used for measurement of gene expression values. A cRNA pool was created from each of the sporadic carcinomas and used as a common reference sample in the Van't Veer et al. study. All background corrections, normalisation and log conversions had already been applied by Van't Veer et al. [Van't Veer L. J., et al., Nature 415 (2002) 530-536].
1.2 Class and phenotypic prediction by use of random regression to identify genes and maximum likelihood estimation of solutions with cross validation
In order to test the use of random regression with cross validation, the data for the 96 patients were first randomly divided into two parts, giving a calibration set of 77 patients and a validation set of 19 patients. The validation set was kept aside and was not used in the derivation of solutions. The selection of patients and collation of corresponding gene expression and phenotypic datasets was done using scripts written by the authors in the R language [Ihaka R., Gentleman R., R: A Language for Data Analysis and Graphics, Journal of Computational and Graphical Statistics 5 (1996) 299-314].
1.2.1 Determining the optimum number of genes to use in the models. The following method (NUMGEN) was used to determine the optimum number of genes to include in the models:
NUMGEN. One thousand randomly generated subsets of 67 patients were selected from the calibration set of 77 patients, without replacement. Corresponding subsets of remaining patient data (10 patients) were also generated and were kept aside for testing how the prediction equations performed with the addition of increasing numbers of genes to the model (see below). For each subset of 67 patients, we repeated the following steps for k iterations:
1. The correlation between gene expression for gene i (x_i) and the vector of phenotypes (time to metastases, y) was calculated for all 24,481 genes. The gene with the highest correlation to the phenotypes was added to the matrix Z as the kth gene, e.g. z_jk is the level of expression for the jth individual for the gene with the highest correlation to the phenotypes in iteration k.
2. The variance in phenotype due to the expression level of gene k was estimated using the Residual Maximum Likelihood (REML) method with the program ASReml [11] by fitting the model
Figure imgf000014_0003
where e_j was the random residual and b was a random regression coefficient.
3. For calibration patients j = 1 to 67, one patient's phenotype was removed at a time, and the remaining phenotypes were analysed with the random regression model (BLUP)
Figure imgf000014_0002
where b_i was the regression coefficient for each gene, e_j was the residual for patient j, and b_i was distributed as N(0, σ_bi²).
4. The residual variance was calculated as
Figure imgf000014_0001
where ŷ_j is the estimate of y_j when the jth phenotype is removed in step 3.
5. Phenotypes were corrected for the level of expression for gene k. Gene k was removed from the data set and added to Z. Return to step 1.
At each iteration the correlation of predicted to actual phenotype was calculated. The value of k with the highest correlation was used to estimate the optimum number of genes (k) to include in subsequent models.
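The overall shape of the NUMGEN procedure can be sketched in Python as follows. This is a simplified illustration and not the inventors' implementation: ridge regression with a fixed shrinkage parameter `lam` stands in for the REML/BLUP random regression fitted with ASReml, the inner leave-one-out step is omitted, genes are ranked by absolute correlation (the text does not say whether the sign is used), and all function and parameter names are hypothetical.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Fit y = mu + Z b + e with b shrunk towards zero (ridge regression).
    ASSUMPTION: a fixed lam replaces the REML variance estimates."""
    mu, z_mean = y.mean(), Z.mean(axis=0)
    Zc = Z - z_mean
    b = np.linalg.solve(Zc.T @ Zc + lam * np.eye(Z.shape[1]), Zc.T @ (y - mu))
    return mu, z_mean, b

def ridge_predict(model, Z):
    mu, z_mean, b = model
    return mu + (Z - z_mean) @ b

def select_genes(X, y, max_genes, lam=1.0):
    """Greedy selection: add the gene most correlated with the current
    corrected phenotype, then correct the phenotype for the chosen genes."""
    Xc = X - X.mean(axis=0)
    x_norm = np.linalg.norm(Xc, axis=0)
    available = np.ones(X.shape[1], dtype=bool)
    chosen, resid = [], y - y.mean()
    for _ in range(max_genes):
        rc = resid - resid.mean()
        denom = x_norm * np.linalg.norm(rc)
        r = np.abs(Xc.T @ rc) / np.where(denom > 0, denom, np.inf)
        r[~available] = -1.0              # genes already in Z are excluded
        g = int(np.argmax(r))
        available[g] = False
        chosen.append(g)
        model = ridge_fit(X[:, chosen], y, lam)
        resid = y - ridge_predict(model, X[:, chosen])
    return chosen

def numgen(X, y, n_subsets=1000, n_left_out=10, max_genes=60, lam=1.0, seed=0):
    """Mean left-out correlation for models with 1..max_genes genes,
    averaged over random subsets drawn without replacement."""
    rng = np.random.default_rng(seed)
    corr = np.zeros((n_subsets, max_genes))
    for s in range(n_subsets):
        perm = rng.permutation(len(y))
        test, train = perm[:n_left_out], perm[n_left_out:]
        genes = select_genes(X[train], y[train], max_genes, lam)
        for k in range(1, max_genes + 1):
            cols = genes[:k]
            model = ridge_fit(X[train][:, cols], y[train], lam)
            pred = ridge_predict(model, X[test][:, cols])
            corr[s, k - 1] = np.corrcoef(pred, y[test])[0, 1]
    mean_corr = np.nanmean(corr, axis=0)
    return int(np.argmax(mean_corr)) + 1, mean_corr
```

With an expression matrix `X` (subjects by genes) and a phenotype vector `y`, `numgen(X, y)` returns the number of genes with the highest mean left-out correlation together with the full correlation curve, which corresponds to the kind of curve plotted in Figure 2.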
1.2.2 Prediction equation
Random regression with cross validation was applied and tested in the following different ways.
EST1. NUMGEN above yielded 1000 sets of models (each consisting of k genes, the corresponding k regression coefficients and residual variance) that were predictive for a randomly selected subset of 67 patients. The frequency of occurrence of these predictive genes over all 1000 NUMGEN models was determined for the set of 77 calibration patients. For EST1 we repeated steps 1-5 above (NUMGEN) k times; however, this time the mean regression coefficients for the k most frequently occurring genes over all analyses of the 1000 subsets of 67 patients from NUMGEN were used in the prediction, where
prediction EST1
Figure imgf000015_0001
where
Figure imgf000015_0002
and where n_i = number of times gene i is included in the prediction equation.
EST2. For EST2, we repeated steps 1-5 above (NUMGEN) k times; however, this time all 77 calibration patients were used in the prediction, where prediction
Figure imgf000015_0003
1.2.3 Validation
The prediction equations from EST1 and EST2 were used to predict phenotype for the 19 patients in the validation set (i.e. those not used at all in the derivation of prediction equations) as:
Figure imgf000015_0004
The correlation between predicted and actual phenotype (r) was used to test and compare equations EST1 and EST2.
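The averaging used for EST1 and the validation correlation can be sketched as below. This is an illustrative outline under stated assumptions: each subset's model is represented as a dictionary from a gene identifier to its regression coefficient, the helper names are hypothetical, and no intercept is included (the correlation r is unaffected by a constant offset).

```python
import numpy as np
from collections import Counter

def est1_coefficients(models, k):
    """models: list of {gene_id: coefficient}, one dict per random subset.
    Returns the mean coefficient of each of the k most frequent genes,
    averaged over the n_i models in which gene i appears."""
    counts = Counter(g for m in models for g in m)
    top = [g for g, _ in counts.most_common(k)]
    return {g: sum(m[g] for m in models if g in m) / counts[g] for g in top}

def predict(coefs, expression):
    """expression: {gene_id: expression value} for one subject."""
    return sum(b * expression[g] for g, b in coefs.items())

def validation_correlation(predicted, actual):
    """Pearson correlation r between predicted and actual phenotype."""
    return float(np.corrcoef(predicted, actual)[0, 1])
```

For EST2 the same `predict` function would be applied to a single coefficient dictionary estimated from all calibration subjects rather than to averaged coefficients.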
1.2.4 Standard error of correlation coefficient and sensitivity to sample size
In order to calculate the standard error (se) of r, the whole EST1 and EST2 procedures described above, including the random division of the data described at the beginning of section 1.2, were repeated 10 times. All the steps described above were carried out independently on each set of calibration and corresponding test patients.
The sensitivity of the method to sample size was checked by repeating NUMGEN, this time using 1000 randomly generated subsets of 20 patients. The list of genes selected using this smaller sample size was compared to those selected using subsets of 67 patients (described above).
1.2.5 Comparison to other methods
Previous studies have used classification or discrimination procedures (e.g. eigengene-based linear discriminant models, ELDA, [ Shen R., Ghosh D., Chinnaiyan A., Meng Z., Eigengene-based linear discriminant model for tumour classification using gene expression microarray data, Bioinformatics 22 (2006) 2635-2642.]) to analyse the same data set (the data set used herein by the inventors to test the present method). To allow direct comparison of the present method to previous methods, the predicted and actual phenotypic values were converted for the left-out patients into the same binary classes utilised by these prior studies (greater than or less than 5 years to metastases).
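The binary conversion used for this comparison can be sketched as follows. It assumes predictions are expressed in years on the same scale as the recorded times, and that a value of exactly 5 years is placed in the lower class, a detail the text does not specify; the function name is hypothetical.

```python
import numpy as np

def classification_success(predicted, actual, cutoff=5.0):
    """Fraction of subjects whose predicted and actual time to metastases
    fall on the same side of the cutoff (in years)."""
    pred_class = np.asarray(predicted, dtype=float) > cutoff
    true_class = np.asarray(actual, dtype=float) > cutoff
    return float(np.mean(pred_class == true_class))
```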
1.3 Results
1.3.1 Number of genes
Choice of an optimum number of genes for use in the prediction equation was made using procedure NUMGEN. With the addition of more genes to the model under procedure NUMGEN, the correlation between predicted and actual phenotype for the left-out patients increased up until the point where more than 60 genes were added to the model (correlation of 0.22 with 60 genes, Figure 2). The program did not converge when more than 60 genes were included due to singularities in the information matrix, indicating that insufficient data were available to predict the effects of larger numbers of genes. Therefore, models were constructed using 60 genes in subsequent steps (EST1 and EST2). When 60 genes were included in the models and predictions from NUMGEN were converted into binary classes, 61% of patients were successfully classified into >5 or <5 year classes (10,000 tests).
1.3.2 Genes selected for inclusion in the models
Figures 1A and 1B show the frequency of occurrence of genes that were added to the models (those giving highest correlation with phenotype) over the 1000 subsets of randomly chosen patients (NUMGEN for models with 60 genes). The sensitivity of the method to calibration set sample size is shown by comparing the occurrence of genes identified when subsets of 67 versus 20 patients are used. The genes showing a high occurrence when subsets of 20 patients were used were mostly the same genes with high occurrence found when samples of 67 patients were chosen without replacement. However, when subsets of 20 patients were used, these genes had a generally lower frequency of occurrence (i.e. there is greater variation in the subsets of genes that show high correlation) and so the most frequently occurring genes were less differentiated from the bulk of genes (those occurring in 4-6 of the subsets). For instance, gene 2279 (HBG1) was encountered in 110 of the 1000 subsets of 67 patients and in 31 of the 1000 subsets of 20 patients (Figures 1A and 1B). Background noise (i.e. the number of genes with multiple occurrences throughout the 1000 subsets) was higher when subsamples of 20 patients were used to formulate solutions compared to subsets of 67 patients. Gene 9730 (unidentified EST) was the gene most commonly found to predict phenotype (high correlation with phenotype in 660 of the 1000 subsets of 67 patients). Four of the five most commonly identified genes with the random regression and cross validation method were also identified by the ELDA method [20] (SEC10L1, NMU and unidentified expressed sequences 9730 and 3851).
1.3.3 Validation
The correlation of predicted with actual phenotype, and success of classification, was highest when the mean solutions derived from the 1000 subsets were applied to predict the phenotype of the 19 remaining patients (Table I, EST1). The estimate of the correlation coefficient is unbiased because the derivation of the prediction equation and testing were performed using independent samples (correlation between predicted versus actual phenotype for 19 patients not included in the calibration set, 60 genes included in prediction equations). The average correlation was 0.32 ± 0.06 (P<0.001, ± standard error, Table I). Sixty-five percent of the 19 left-out patients were correctly classified when the time from tumour biopsy until development of metastases was classed as either more than 5 years or less than 5 years (Table I).
When all 77 calibration patients were used to estimate genes and solutions, and this equation was applied to predict the phenotype of 19 additional patients (EST2), the average correlation between predicted and actual phenotype for the 19 test patients was 0.25 ± 0.05 (P<0.001, ± standard error, 60 genes in the model, Table I). For EST2, sixty-two percent of the 19 left-out patients were correctly classified when the time from tumour biopsy until development of metastases was classed as either more than 5 years or less than 5 years (Table I).
Table I
Figure imgf000018_0001
***P<0.001
Table I shows the correlation of predicted to actual time to metastases and the success of classification using random regression with cross validation. Success of classification was determined after binary conversion of the predicted and actual phenotypes. The correlation and classification success are for prediction equations derived from methods EST1 and EST2. The biased test was a form of incomplete cross validation where gene selection was made using all 77 patients, solutions for these genes were derived from subsets of 67 patients drawn from the 77 patients, and the resulting prediction equations were applied to the corresponding subset of 10 "left-out" patients. n, number of predictions tested for each correlation; nsets, number of correlation coefficients.
Example 2
An Atlantic salmon selective breeding program was simulated (Fig. 3). Breeding values for two correlated quantitative traits were predicted: a prediction of family mean performance derived from the outcome of a challenge to disease (trait 1), and a prediction of the challenge test ranking derived from gene expression profile testing (trait 2) for individual breeding candidates.
2.1 Sampling animals from the base population
In the base or wild population of 18,000 fish, the polygenic component of genetic value of animal i for traits 1 and 2 (a_1i and a_2i) was sampled from a bivariate normal distribution
Figure imgf000019_0009
where and
Figure imgf000019_0008
were the mean additive genetic effects of trait 1 and trait 2 respectively and G was the additive genetic covariance matrix
Figure imgf000019_0001
Values
Figure imgf000019_0004
were all set to 25 units in the base population. The covariance was derived by setting the correlation between the polygenic effects for the two traits (rg). The correlation was varied from 0.3 to 0.9. From Example 1 this value is likely to be dependent on the number of informative genes (differential expression values) included in the prediction equation.
The effects due to environment for animal i (e_1i and e_2i for traits 1 and 2 respectively) were also sampled from a bivariate normal distribution
Figure imgf000019_0005
where
Figure imgf000019_0006
were the mean environmental effects for trait 1 and 2 and E was the environmental covariance matrix
Figure imgf000019_0002
Values ei and el were set to 25 units while values
Figure imgf000019_0007
were set to 225 units in the base population. The correlation between the environmental components for each trait (re) was calculated as
Figure imgf000019_0003
[Falconer, D. S., Mackay, T.F.C., 1996. An introduction to quantitative genetics. Addison Wesley Longman Limited, Edinburgh Gate, 464 pp] where h1² and h2² were the heritabilities of trait 1 and trait 2 and ry was the phenotypic correlation between the traits. h1² and h2² were assumed to be low (0.1) for the purposes of the simulation (e.g. on an underlying scale, heritability of infectious salmon anaemia resistance in Atlantic salmon has been estimated to be around 0.32; Ødegård, J., Olesen, I., Gjerde, B., Klemetsdal, G., 2007. Evaluation of statistical models for genetic analysis of challenge-test data on ISA resistance in Atlantic salmon (Salmo salar): Prediction of progeny survival. Aquaculture 266, 70-76).
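The equation image for the environmental correlation is not reproduced in this text. The standard decomposition of the phenotypic correlation given in Falconer and Mackay, which is presumably what the image shows, can be written as follows, where h_1 and h_2 are the square roots of the two heritabilities:

```latex
r_y = h_1 h_2 \, r_g + \sqrt{(1 - h_1^2)(1 - h_2^2)} \; r_e
\quad\Longrightarrow\quad
r_e = \frac{r_y - h_1 h_2 \, r_g}{\sqrt{(1 - h_1^2)(1 - h_2^2)}}
```

As an illustration only, with both heritabilities equal to 0.1, rg = 0.3 and an assumed phenotypic correlation ry = 0.3, this gives re = (0.3 - 0.03)/0.9 = 0.3.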
The challenge test phenotype for animal i (y_1i) was calculated as
Figure imgf000020_0003
The predicted challenge test phenotype for animal i using gene expression profiling (y_2i) was calculated as
Figure imgf000020_0005
Animals in each generation were assigned a sex, male or female, with probability 0.5.
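The base population sampling described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the phenotype is taken to be the sum of the genetic and environmental components (the equation images are not reproduced here), the value of the phenotypic correlation `r_y` is a placeholder, the environmental correlation follows the standard Falconer and Mackay decomposition, and all names are hypothetical.

```python
import numpy as np

def environmental_correlation(r_y, r_g, h2_1, h2_2):
    """r_e from r_y = h1*h2*r_g + sqrt((1-h1^2)(1-h2^2))*r_e."""
    return (r_y - np.sqrt(h2_1 * h2_2) * r_g) / np.sqrt((1 - h2_1) * (1 - h2_2))

def sample_base_population(n=18000, r_g=0.3, r_y=0.3, var_a=25.0, var_e=225.0,
                           mean_a=25.0, mean_e=25.0, seed=0):
    """Sample genetic values, environmental effects, phenotypes and sex."""
    rng = np.random.default_rng(seed)
    h2 = var_a / (var_a + var_e)          # 25 / 250 = 0.1 with the defaults
    r_e = environmental_correlation(r_y, r_g, h2, h2)
    G = var_a * np.array([[1.0, r_g], [r_g, 1.0]])   # additive genetic covariance
    E = var_e * np.array([[1.0, r_e], [r_e, 1.0]])   # environmental covariance
    a = rng.multivariate_normal([mean_a, mean_a], G, size=n)
    e = rng.multivariate_normal([mean_e, mean_e], E, size=n)
    y = a + e                             # ASSUMPTION: phenotype = genetic + environment
    is_female = rng.random(n) < 0.5       # sex assigned with probability 0.5
    return a, e, y, is_female
```

Column 0 of each returned array corresponds to trait 1 (challenge test) and column 1 to trait 2 (gene expression prediction). Parameter combinations that imply an environmental correlation outside the range -1 to 1 are not valid inputs for this sketch.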
2.2 Progeny generated each subsequent generation
Genetic and phenotypic values for both traits were generated every subsequent generation as follows. The genetic value of progeny i for trait 1, a1i, was calculated as
Figure imgf000020_0002
where Fa1i and Ma1i are the genetic values of the sire and dam of individual i respectively and ran1i and ran2i were random effects associated with the genetic effect on trait 1 and trait 2 respectively, sampled from a bivariate normal distribution N(0,0,ssmx) where ssmx was the covariance matrix
Figure imgf000020_0001
The phenotypic value of progeny i for trait 1, y1i, was calculated as
Figure imgf000020_0006
where e1i and e2i were generated from a bivariate normal distribution (described in section 2.1).
The genetic value of progeny i for trait 2, a2i, was calculated as
Figure imgf000021_0001
where Fa2i and Ma2i
Figure imgf000021_0002
are the genetic values of the sire and dam of individual i respectively.
The phenotypic value of progeny i for trait 2, y2i, was calculated as
Figure imgf000021_0003
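The progeny generation step in section 2.2 can be sketched as follows. The equations themselves are not recoverable from this copy, so the sketch rests on labelled assumptions: the genetic value is the mid-parent mean plus a random (Mendelian sampling) term, that term has half the base genetic variance, and the phenotype is the genetic value plus the environmental effect. The default variances (25 and 225) and the environmental mean (25) follow the base-population settings described in section 2.1. All function names are hypothetical.

```python
import math
import random

def sample_bivariate(rng, var1, var2, corr, mean1=0.0, mean2=0.0):
    """Draw one correlated pair from a bivariate normal distribution."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = mean1 + math.sqrt(var1) * z1
    # Second variable built from z1 and z2 so that corr(x1, x2) == corr.
    x2 = mean2 + math.sqrt(var2) * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2)
    return x1, x2

def make_progeny(rng, sire, dam, r_g, r_e, var_a=25.0, var_e=225.0, mean_e=25.0):
    """Return (a1, a2, y1, y2) for one offspring.

    sire and dam are (a1, a2) tuples of parental genetic values.
    ASSUMPTION: sampling variance is var_a / 2 and phenotype = genetic + environment.
    """
    ran1, ran2 = sample_bivariate(rng, var_a / 2.0, var_a / 2.0, r_g)
    a1 = 0.5 * (sire[0] + dam[0]) + ran1
    a2 = 0.5 * (sire[1] + dam[1]) + ran2
    e1, e2 = sample_bivariate(rng, var_e, var_e, r_e, mean_e, mean_e)
    return a1, a2, a1 + e1, a2 + e2

rng = random.Random(42)
child = make_progeny(rng, sire=(30.0, 28.0), dam=(26.0, 31.0), r_g=0.5, r_e=0.3)
```

With var_a = 25 and var_e = 225 the implied heritability is 25 / (25 + 225) = 0.1, which matches the value assumed for the simulation.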
2.3 Selection and mating
Three hundred males and females were selected and mated to create 300 full-sibling families (60 progeny per family) for every generation of selective breeding. Four different criteria for the selection of mate pairs were investigated:
2.3.1 Family breeding value for the disease challenge test (CRIT1).
Under the first selection criterion, it was not possible to estimate breeding values for the selection of mate pairs in the first generation of selection and mating. Instead, 300 mate pairs were randomly chosen for breeding and 60 full-sibling progeny were generated from each pairing. From each of the j families, a set of 10 progeny was randomly chosen for disease challenge testing and the challenge test phenotype of progeny k was added to the matrix CTjk. For subsequent generations the selection of mate pairs was based on the estimated breeding value for each family (EBVj) where
Figure imgf000021_0004
Where was the mean challenge test phenotype within family j and
Figure imgf000021_0005
was the mean challenge test phenotype across all families.
Mate pairs were chosen from the families j which ranked in the top 10% with respect to EBV. The maximum number of mate pairs that could be chosen from any of the top EBV ranked families was limited to 15 pairs. At the selection intensity set for CRIT1 the inbreeding coefficient would be expected to increase by around 0.8% every generation.
2.3.2 Prediction of challenge test phenotype using gene expression profiling (CRIT2).
Under this criterion, all 18,000 animals from the base population and subsequent generations were tested and ranked on the predicted phenotype from gene expression profiling (trait 2). 300 mate pairs were randomly selected from the top 5% ranking animals. The number of mate pairs that could be chosen from any particular family was limited as described in 2.3.1.
2.3.3 Combination of CRIT1 and CRIT2 above (CRIT3).
Under this criterion, CRIT1 was used to select families while CRIT2 was used to select individuals within families. 300 mate pairs were randomly chosen from among these selected animals. All animals not challenged to disease (15,000 total) were tested using gene expression profiling. As with CRIT1, it was not possible to estimate breeding values for the first round of selection and mating, so for the initial round CRIT2 was used to rank all animals in the base population.
2.3.4 Direct selection using disease challenge test survivors (CRIT4).
Under this criterion it was assumed that all traces of the disease could be removed from animals surviving the challenge test and that surviving individuals were fit for mating. Direct selection using disease challenge test survivors has not been used in practice because survivors are normally infected and the risks of introducing disease to the breeding nucleus are too high. All 18,000 base animals were challenged to the disease and 300 mate pairs were randomly selected from the top 5% ranking animals for the disease challenge test (trait 1). The number of mate pairs that could be chosen from any particular family was limited as described in 2.3.1.
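The truncation selection shared by CRIT2 and CRIT4 (rank all animals, keep the top 5%, draw mate pairs at random, cap the pairs per family) can be sketched as below. This is a minimal illustration under stated assumptions, not the simulation code used for the results: each animal is used in at most one pair, the family cap counts every selected parent from that family, and mating between siblings is not excluded. The record layout and the function name are hypothetical.

```python
import random

def select_mate_pairs(animals, score_key, n_pairs=300, top_fraction=0.05,
                      max_pairs_per_family=15, rng=None):
    """Pick random mate pairs from the top-ranked animals.

    animals: list of dicts with 'id', 'sex' ('M' or 'F'), 'family' and a
    numeric value under score_key (higher is better).
    """
    rng = rng or random.Random()
    ranked = sorted(animals, key=lambda a: a[score_key], reverse=True)
    top = ranked[:max(1, int(len(ranked) * top_fraction))]
    males = [a for a in top if a["sex"] == "M"]
    females = [a for a in top if a["sex"] == "F"]
    rng.shuffle(males)
    rng.shuffle(females)
    pairs_per_family = {}

    def draw(pool):
        # Pop random candidates until one comes from a family below its cap.
        while pool:
            candidate = pool.pop()
            family = candidate["family"]
            if pairs_per_family.get(family, 0) < max_pairs_per_family:
                pairs_per_family[family] = pairs_per_family.get(family, 0) + 1
                return candidate
        return None

    pairs = []
    while len(pairs) < n_pairs:
        sire, dam = draw(males), draw(females)
        if sire is None or dam is None:
            break  # not enough eligible candidates left
        pairs.append((sire["id"], dam["id"]))
    return pairs
```

For CRIT2 the score would be the predicted phenotype (trait 2), and for CRIT4 the challenge test phenotype (trait 1). The function may return fewer than n_pairs pairs when the family cap or the sex ratio among the top-ranked animals becomes limiting.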
2.4 Evaluation of mean genetic response
Twenty replicate selective breeding programs were simulated and the mean genetic response in the challenge test trait, a1, was plotted for 10 generations of selective breeding. rg and ry were varied (from 0.3 to 0.9 and from 0.1 to 0.7 respectively) to evaluate the effect of including a greater number of informative genes in the gene expression profile prediction.
2.5 Cost benefit analysis
An economic model described by Thorarinsson and Powell (2006) [Thorarinsson, R., Powell, D.B., 2006. Effects of disease risk, vaccine efficacy, and market price on the economics of fish vaccination. Aquaculture 256, 42-49] for studying the economics of fish vaccination was adapted so that the opportunity cost in our case was the added revenue realized from increased survival due to improved genetics with selective breeding (after feed costs were subtracted). The model was based on the principle of alternative cost (or opportunity cost, http://www.economist.com/research/economics). The applicability of the model was broadened so that it could be used to model the sector of the industry in Norway utilizing selectively bred fish for production purposes on a "per generation" basis as genetic improvement is made. All calculations were in euro. The following modifications were made to the model:
i. The relative percent survival (rps) was calculated as
Figure imgf000023_0001
where y1g and y1o were the mean phenotypic values for trait 1 at generation g and generation zero (i.e. non-selected) respectively.
ii. The opportunity cost (oc) (adapted from Lillehaug, A., 1989. A cost-effectiveness study of three different methods of vaccination against vibriosis in salmonids. Aquaculture 83, 227-236) was calculated as
Figure imgf000023_0002
where atw was the average target weight of the fish at slaughter,
Figure imgf000023_0003
was the number of fish set out in sea cages each year (the total slaughter weight tsw divided by atw), pkg was the average harvest price for the fish per kg, fcr was the total feed conversion ratio from setting the fish at sea to slaughter and
Figure imgf000023_0004
was the average price per kg of all feed fed from the time fish are set out to sea in sea cages to slaughter.
iii. The accumulated savings from reduction in compensation fish (cfeb) was calculated as
Figure imgf000024_0001
where was the market value per fish at the time the fish were set in sea cages and k was the number of starting fish where
Figure imgf000024_0003
iv. The economic benefit from savings in labour costs (slceb) resulting from reduced removal and disposal of dead fish was calculated as
Figure imgf000024_0004
where was the expected average weight of all diseased/dead non-selected fish removed (in kg) and cr was the labour cost associated with carcase removal per kg of fish.
v. The costs of applying each selective breeding criterion (costcrit1 to costcrit4) were calculated assuming that three year classes would be used for the breeding programs such that the cost per year was
Figure imgf000024_0002
where nchal was the total number of animals challenge tested
Figure imgf000025_0006
was the cost of challenge testing (done by contract at another site including costs of transportation),
Figure imgf000025_0009
was the cost of labour per test (to identify, sort and sample animals for challenge or gene expression profiling),
Figure imgf000025_0004
were the additional costs (eg. equipment and anaesthesia associated with the tests)
Figure imgf000025_0005
was all other costs to run a selective breeding program for Atlantic salmon
Figure imgf000025_0008
was the cost of genetic testing, nge was an estimate of the number of tests required (where
Figure imgf000025_0007
* 60) and
Figure imgf000025_0010
was the total number of fish in the selective breeding program.
The economic benefit from reduced antibiotic dependency in this case was assumed to be zero as antibiotics are no longer used to treat Atlantic salmon in Norway. The possible economic benefit of reduced dependency on vaccines was also assumed to be zero as growers might continue to vaccinate fish in any case for added security.
Using these estimates of benefits and costs, the relative merits of selective breeding programs using the four different selection criteria were evaluated using the following economic criteria (Thorarinsson and Powell, 2006):
i. total added value per kg of fish produced
Figure imgf000025_0001
where tsw was the total slaughter weight for the Atlantic salmon industry in Norway and psb was the proportion of these fish that were selectively bred,
ii. benefit-cost ratio
Figure imgf000025_0002
where cost was either
Figure imgf000025_0003
iii. nominal economic effect on operating income
(oc + cfeb + slceb) - cost
Table II lists the values used for the parameters defined above. A number of assumptions were made in deriving these values. Apart from the parameters listed below, the same values as Thorarinsson and Powell (2006) were used. We assumed that disease challenge test survival had a high positive correlation with survival to the same disease in the wild. This assumption was based on research by Gjøen, H.M., Refstie, T., Ulla, O., Gjerde, B., 1997. Genetic correlations between survival of Atlantic salmon in challenge and field tests. Aquaculture 158, 277-288, who found a high positive correlation between these two traits for the pathogenic bacterium Aeromonas salmonicida (0.95). Parameters for tsw, psb,
Figure imgf000026_0001
and Costadd were conservative estimates based on figures from industry and estimates by researchers. Costge would vary depending on laboratory, country, available technology and estimated throughput. Costge was assumed to be of a similar value to that estimated for genotyping (e.g. Hayes, B., Baranski, M., Goddard, M.E., Robinson, N., 2007. Optimisation of marker assisted selection for abalone breeding programs. Aquaculture 265, 61-69). We assumed in the model that Costge would remain constant from generation to generation. Costchal was a conservative estimate from figures supplied by breeding companies and researchers involved in organizing the challenge tests. Costsbp was calculated from the combined published financial results for year 2006 of the breeding companies Aqua Gen AS and SalmoBreed AS in Norway as follows.
Figure imgf000026_0002
((total income - total profit) * proportion of sales made in Norway * proportion of Atlantic salmon sales) - total cost of challenge testing
Figure imgf000027_0001
2.6 Results
The genetic response achieved under CRIT1 resulted in 100% improved survival to disease challenge after 10 generations (Fig. 4A-D). The genetic responses from CRIT2 and CRIT3 increased as the genetic and phenotypic correlation between traits 1 and 2 increased, reflecting improved predictive ability and consequently improved selection accuracy with the use of gene expression profiles as a predictor of disease resistance. CRIT4 resulted in 100% improved survival to disease challenge after 8 generations. Apart from CRIT4, CRIT3 gave the highest rate of genetic gain, even when the genetic and phenotypic correlation between trait 1 and 2 was low (Fig. 4A,
Figure imgf000028_0001
Using CRIT3, survival to disease challenge was 100% improved after 6-7 generations of selection, and varying the phenotypic and genetic correlation had a relatively small effect on the overall genetic response after 10 generations (226% when
Figure imgf000028_0002
Use of gene expression profile information alone as a selection criterion (CRIT2) led to equivalent rates of genetic gain to use of a family breeding value based on challenge test data (CRIT1) when the phenotypic correlation was greater than 0.7 (rg=0.9, Fig. 4D).
The benefit to cost ratio was positive under all scenarios. The highest ratio was for CRIT1 when rg was low (0.3, Fig. 4E) and for CRIT4 under all other circumstances (Fig. 4F-H). CRIT4 assumes that direct selection and breeding of survivors from the challenge test is possible. When the genetic and phenotypic correlation between the traits was low, CRIT3 and CRIT4 yielded equivalent benefits-costs (Fig. 4E). With each new generation of selected stock, the benefit to cost ratio under CRIT1 improves at a rate of 1.1:1 (Fig. 4E). With a moderate genetic and phenotypic correlation between the traits (rg=0.5), CRIT3 was almost as beneficial as CRIT1 (Fig. 4F). Other simulations we have performed (results not shown) where the heritability of disease resistance is assumed to be higher than 0.1 have shown that CRIT3 can be more beneficial than CRIT1 or CRIT2 under these circumstances. This is because the benefit from the higher genetic response achieved in this situation outweighs the additional costs needed (in terms of gene expression testing) in order to achieve a moderate rg. The benefit-cost ratio for CRIT3 was highest when rg was around 0.5 (Table III and Fig. 4F, 17:1 using 10th generation selectively bred stock and assuming 30 genes tested for making the prediction at a cost of €180/individual and total cost of over 10 million euro). Use of CRIT2 yielded smaller benefit-cost ratios in comparison (up to 10:1). Relative percent survival of 78% was achieved under CRIT3, compared to 44% under CRIT2 and 60% under CRIT1 (Table III), resulting in comparatively large industry-wide opportunity costs from use of CRIT3 compared to other selection criteria (142 million euro, Table III).
Profitability (PA) after 10 generations, by selection criterion

                                                                          CRIT1           CRIT2           CRIT3           CRIT4
Cost                                                                      €7,973,000      €11,196,000     €10,674,000     €8,064,000
Relative percent survival                                                 60%             44%             78%             66%
Economic benefit
  Opportunity cost                                                        €108,847,000    €79,651,000     €142,211,000    €120,366,000
  Savings from reduction in compensation fish                             €34,915,000     €27,897,000     €42,935,000     €37,684,000
  Savings in labour costs due to reduced removal and disposal of dead fish  €523,000        €383,000        €684,000        €579,000
Benefit to cost ratio                                                     18.1            9.6             17.4            19.7
Total added value per kg of fish produced                                 €0.23           €0.17           €0.29           €0.25
Nominal economic effect on operating income                               €136,312,000    €96,735,000     €175,156,000    €150,565,000

Table III. Economic evaluation of use of the four different selection criteria with 10 generations of selective breeding, rg=0.5 and ry=0.32. All values are for one year's production utilizing selected stock. A phenotypic correlation of 0.32 was detected in Example 1.
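The summary rows of Table III can be cross-checked from its cost and benefit rows. The sketch below assumes that the benefit-cost ratio is the sum of the three benefits (oc + cfeb + slceb) divided by cost, and that the nominal economic effect is that sum minus cost. The dictionary layout and function name are illustrative only.

```python
# Cost and benefit rows of Table III, in euro.
TABLE_III = {
    "CRIT1": {"cost": 7_973_000, "oc": 108_847_000, "cfeb": 34_915_000, "slceb": 523_000},
    "CRIT2": {"cost": 11_196_000, "oc": 79_651_000, "cfeb": 27_897_000, "slceb": 383_000},
    "CRIT3": {"cost": 10_674_000, "oc": 142_211_000, "cfeb": 42_935_000, "slceb": 684_000},
    "CRIT4": {"cost": 8_064_000, "oc": 120_366_000, "cfeb": 37_684_000, "slceb": 579_000},
}

def summarise(row):
    """Return (benefit-cost ratio, nominal economic effect) for one criterion."""
    benefit = row["oc"] + row["cfeb"] + row["slceb"]
    return benefit / row["cost"], benefit - row["cost"]

for name, row in TABLE_III.items():
    ratio, nominal = summarise(row)
    print(f"{name}: ratio {ratio:.1f}, nominal effect €{nominal:,}")
# CRIT1: ratio 18.1, nominal effect €136,312,000
# CRIT2: ratio 9.6, nominal effect €96,735,000
# CRIT3: ratio 17.4, nominal effect €175,156,000
# CRIT4: ratio 19.7, nominal effect €150,565,000
```

Both derived rows reproduce the tabulated values, which supports reading the benefit as the sum of opportunity cost, compensation fish savings and labour savings.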
With higher phenotypic and genetic correlation between trait 1 and 2, the benefit-cost ratio for CRIT3 was reduced. This is because the total cost of the gene expression testing needed to achieve this high degree of correlation is greater than the additional opportunity costs achieved through improvements in relative percent survival. In order to achieve rg=0.9, the model assumes that 54 gene expression tests are needed at €6/test = €324/individual. Like CRIT3, the benefit from CRIT2 also increased as the genetic and phenotypic correlation between the traits was increased. However, CRIT2 gave the lowest benefit-cost ratio under all situations because of the higher costs associated with gene expression testing.
The total added value per kg of fish was 0.29 euro/kg of fish produced and the nominal economic effect on operating income was over 175 million euro after 10 generations of selection under selection criterion CRIT3 (Table III). In summary, CRIT3 was almost as profitable an option as CRIT1, provided that the cost of gene expression testing was less than around €280/individual and rg was greater than 0.3. It was more profitable than CRIT2 under all scenarios and yielded the highest total added value and highest nominal economic effect on operating income of all the selection criteria.
The model used to estimate the economic outcomes makes a number of assumptions:
1. It was assumed that the basic costs of running a selective breeding program would need to be met in making improvements for this trait. In reality, selective breeding programs for Atlantic salmon are already running (have done so for the past 30 years or so) and so the addition of new tests comes at a very low relative cost.
2. Another major cost (associated with CRIT2 and CRIT3) will be that of testing the gene expression level in fish. As expression assay technology is rapidly developing, and there are few examples where the expression response of large numbers of animals is tested for relatively few (50-100) genes, we have assumed that this cost will in the future be similar to that of genotyping (which is, coincidentally, of the same order as challenge testing costs per fish). Compared to the overall cost of the basic selective breeding program, the cost of testing gene expression using multiplexed quantitative PCR is likely to be relatively low, and the relative industry-wide benefits are likely to be very high.
3. It was assumed that every animal not challenged to the disease (15,000 in total) would be tested to determine its gene expression response. The number of animals tested, and associated total test costs, could be reduced by only testing animals in the top ranking families from the disease challenge test. It would also be possible to reduce the number of animals tested within these top ranking families; however, this would reduce selection accuracy, genetic response and economic benefit.
4. The model also assumes a high selection intensity will be possible as it does not account for the relative importance and weighting put on other traits in the selective breeding programs and does not account for avoidance of inbreeding. In a real breeding program something like optimal contribution selection would be applied to maximise genetic gain at a set rate of inbreeding (Hinrichs, D., Wetten, M., Meuwissen, T.H.E., 2006. An algorithm to compute optimal genetic contributions in selection programs with large numbers of candidates. J. Anim. Sci. 84, 3212-3218; Holtsmark, M., Sonesson, A.K., Gjerde, B., Klemetsdal, G., 2006. Number of contributing subpopulations and mating design in the base population when establishing a selective breeding program for fish. Aquaculture 258, 241-249). Genetic improvement programs such as those run by Aqua Gen AS and SalmoBreed AS for Atlantic salmon in Norway also incorporate challenge test results for multiple diseases, growth rate, meat quality and other traits in a selection index. It will not be possible in reality to achieve the selection intensities, genetic response or resulting economic benefits assumed and predicted in our simulations when fish are selected on the basis of their phenotype for multiple traits (e.g. using a selection index).
5. The benefit achieved from focusing on this trait will also depend on how disease resistance is correlated with other traits affecting profitability in the Atlantic salmon industry. Gjedrem, T., Olesen, I., 2005. Basic statistical parameters. In: Gjedrem, T. (Ed.), Selection and breeding programs in aquaculture. Dordrecht: Springer, pp. 45-72, have reviewed correlations that have been found between growth rate and survival traits in aquatic species. In Atlantic salmon, adult resistance to cold water vibriosis (Robison, O.W., Luempert, L.G., 1984. Genetic variation in weight and survival of brook trout (Salvelinus fontinalis). Aquaculture 38, 155-170), overall fingerling survival (Rye, M., Lillevik, K.M., Gjerde, B., 1990. Survival in early life of Atlantic salmon and rainbow trout: estimates of heritabilities and genetic correlations. Aquaculture 89, 209-216) and resistance to furunculosis (fingerlings and challenge, Gjedrem, T., Salte, R., Gjøen, H.M., 1991. Genetic variation in susceptibility of Atlantic salmon to furunculosis. Aquaculture 97, 1-6) have all been found to be positively correlated with growth rate (r=0.18-0.37). Some additional economic benefits might be expected, at least in the short term, if resistance to a particular disease is favourably correlated with one or more economically important traits. Accounting for selection for multiple correlated traits of various heritabilities would require a more complex simulation model. The simplified model applied in this paper allows comparison of the relative benefits from applying different selection criteria on this single trait (for instance, if a tandem selection method were used where individual traits are improved in succession).
6. Finally, the economic model we have used estimates the benefit-cost ratio for the entire industry and does not reflect how the profitability of a particular breeding company will be affected with use of this method.

Claims

1. A method for the prediction of phenotype from gene expression data or for the detection of stress in the farm environment from gene expression data.
2. The method according to claim 1 wherein the phenotype is disease resistance.
3. The method according to claim 1 wherein the phenotype is disease susceptibility.
4. The method according to claims 1 to 3 for the selection of disease resistant animals for breeding.
5. A method for the determination of optimum number of genes (k), and which particular genes, to include for prediction of phenotype or exposure to stress, wherein the correlation of predicted to actual phenotype or exposure to stress is found to be highest when k genes are included in the prediction equation, utilising randomly generated subsets of data selected from a calibration set of data with or without replacement.
6. A method for the prediction of a phenotype or exposure to stress, wherein the estimate is calculated by the equations
Figure imgf000034_0001
(Equation 1)
wherein
Figure imgf000034_0002
and wherein σi is the regression coefficient determined for each gene,
^' is the number of times gene i is detected as part of the optimum group of genes to include in the prediction equation with k iterations of the method according to claim 5 and
Xi is the gene expression for gene i measured in a particular tissue or cell type while that tissue or cell type exists under a particular state, or is the change in gene expression for gene i in that tissue or cell type in response to a change of state, or
(Equation 2)
Figure imgf000035_0001
wherein σi is the regression coefficient for each gene and,
Xi is the gene expression for gene i measured in a particular tissue or cell type while that tissue or cell type exists under a particular state, or is the change in gene expression for gene i in that tissue or cell type in response to a change of state.
7. A defined set of genes whose pattern of gene expression can be used for the prediction of disease resistance or exposure to stress, wherein gene expression is evaluated and the prediction of disease resistance phenotype or level of stress is made using the methods according to claims 1-6.
8. A test kit comprising test reagents for measuring the expression of genes and providing data that can be used for the prediction of phenotype.
9. The test kit according to claim 8 wherein said test reagents are microarrayed DNA samples or oligonucleotides for quantitative PCR or competitive reverse transcriptase PCR.
10. An analysis software package which can perform the calculations necessary to predict disease resistance, wherein the software package would accept data collected and would automate the calculations necessary for prediction of phenotype according to claims 5-6.
11. The method according to claim 1 wherein the stress is levels of handling and/or human interaction.
12. The method according to claim 1 wherein the stress is stocking density.
13. The method according to claim 1 wherein the stress is levels of toxic chemicals in the water or feed.
14. The method according to claim 1 wherein the stress is feed availability and composition.
15. The method according to claim 1 wherein the stress is levels of disease.
16. The method according to claim 1 wherein the stress is levels of temperature.
17. The method according to claim 1 wherein the stress is levels of salinity.
18. The method according to claim 1 wherein the stress is levels of pH.
19. The method according to claim 1 wherein the stress is levels of oxygen.
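The prediction described in claim 6 combines a regression coefficient and an expression value for each gene. Because the equations themselves are not reproduced in this copy, the sketch below assumes the simplest reading of Equation 2: a weighted sum of expression values over the selected genes, with an optional intercept. The gene names, the intercept argument and the function name are hypothetical.

```python
def predict_phenotype(coefficients, expression, intercept=0.0):
    """Linear prediction: intercept + sum over genes of coefficient * expression.

    coefficients: dict mapping gene name -> regression coefficient.
    expression: dict mapping gene name -> measured expression (or change in
    expression); genes without a coefficient are ignored.
    ASSUMPTION: a plain weighted sum; the claimed equations may differ.
    """
    missing = sorted(set(coefficients) - set(expression))
    if missing:
        raise KeyError(f"no expression value for genes: {missing}")
    return intercept + sum(coef * expression[gene]
                           for gene, coef in coefficients.items())

# Hypothetical two-gene predictor applied to one animal.
coefficients = {"geneA": 2.0, "geneB": -1.0}
expression = {"geneA": 1.5, "geneB": 0.5, "geneC": 9.0}
print(predict_phenotype(coefficients, expression))  # 2.5
```

Equation 1 additionally uses the number of times each gene is selected across iterations of the method of claim 5; that weighting is not shown here because its exact form cannot be recovered from the text.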
PCT/NO2009/000257 2008-07-09 2009-07-09 Predicting phenotype from gene expression data WO2010005317A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US7929608P 2008-07-09 2008-07-09
US61/079,296 2008-07-09

Publications (2)

Publication Number Publication Date
WO2010005317A1 true WO2010005317A1 (en) 2010-01-14
WO2010005317A4 WO2010005317A4 (en) 2010-05-27

Family

ID=41100856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NO2009/000257 WO2010005317A1 (en) 2008-07-09 2009-07-09 Predicting phenotype from gene expression data

Country Status (1)

Country Link
WO (1) WO2010005317A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1310569A2 (en) * 2001-11-09 2003-05-14 President of Gifu University Method and test kit for the detection of genes
US20050136413A1 (en) * 2003-12-22 2005-06-23 Briggs Michael W. Reagent systems for biological assays
WO2008054432A2 (en) * 2005-12-30 2008-05-08 Honeywell International Inc. Oligonucleotide microarray for identification of pathogens

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CARTER GREGORY W ET AL: "Prediction of phenotype and gene expression for combinations of mutations", MOLECULAR SYSTEMS BIOLOGY, vol. 3, March 2007 (2007-03-01), pages Article No.: 96 URL - http://ww, XP002547953, ISSN: 1744-4292(print) 1744-4292(ele *
HUANG E ET AL: "GENE EXPRESSION PHENOTYPIC MODELS THAT PREDICT THE ACTIVITY OF ONCOGENIC PATHWAYS", NATURE GENETICS, NATURE PUBLISHING GROUP, NEW YORK, US, vol. 34, no. 2, 1 June 2003 (2003-06-01), pages 226 - 230,465, XP007900283, ISSN: 1061-4036 *
ROBINSON N ET AL: "Use of gene expression data for predicting continuous phenotypes for animal production and breeding", ANIMAL, vol. 2, no. 10, October 2008 (2008-10-01), pages 1413 - 1420, XP002558200, ISSN: 1751-7311(print) 1751-732X(ele *
SCHAEFFER L R: "Application of random regression models in animal breeding.", LIVESTOCK PRODUCTION SCIENCE, vol. 86, no. 1-3, March 2004 (2004-03-01), pages 35 - 45, XP002558199, ISSN: 0301-6226 *
SHEN Y J ET AL: "Improve Survival Prediction Using Principal Components of Gene Expression Data", GENOMICS PROTEOMICS AND BIOINFORMATICS, BEIJING GENOMICS INSTITUTE, BEIJING, CN, vol. 4, no. 2, 1 May 2006 (2006-05-01), pages 110 - 119, XP022856832, ISSN: 1672-0229, [retrieved on 20060501] *
ZHU W ET AL: "Detection of cancer-specific markers amid massive mass spectral data", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE, WASHINGTON, DC, US, vol. 100, no. 25, 1 January 2003 (2003-01-01), pages 14666 - 14671, XP003024459, ISSN: 0027-8424 *

Also Published As

Publication number Publication date
WO2010005317A4 (en) 2010-05-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09788367

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09788367

Country of ref document: EP

Kind code of ref document: A1