WO2005017652A2 - Computer systems and methods for inferring causality from cellular constituent abundance data - Google Patents

Computer systems and methods for inferring causality from cellular constituent abundance data

Info

Publication number
WO2005017652A2
WO2005017652A2 PCT/US2004/017754 US2004017754W WO2005017652A2 WO 2005017652 A2 WO2005017652 A2 WO 2005017652A2 US 2004017754 W US2004017754 W US 2004017754W WO 2005017652 A2 WO2005017652 A2 WO 2005017652A2
Authority
WO
WIPO (PCT)
Prior art keywords
organisms
trait
seq
locus
qtl
Prior art date
Application number
PCT/US2004/017754
Other languages
English (en)
Other versions
WO2005017652A3 (fr)
Inventor
Eric E. Schadt
Stephanie A. Monks
Original Assignee
Rosetta Inpharmatics, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rosetta Inpharmatics, Llc
Priority to US10/567,282 (US20070038386A1)
Publication of WO2005017652A2
Priority to US11/361,871 (US20060241869A1)
Publication of WO2005017652A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/5005 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving human or animal cells
    • G01N33/5008 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving human or animal cells for testing or evaluating the effect of chemical or biological compounds, e.g. drugs, cosmetics
    • G01N33/502 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving human or animal cells for testing or evaluating the effect of chemical or biological compounds, e.g. drugs, cosmetics for testing non-proliferative effects
    • G01N33/5023 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving human or animal cells for testing or evaluating the effect of chemical or biological compounds, e.g. drugs, cosmetics for testing non-proliferative effects on expression patterns
    • G01N33/5041 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving human or animal cells for testing or evaluating the effect of chemical or biological compounds, e.g. drugs, cosmetics for testing non-proliferative effects involving analysis of members of signalling pathways
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/40 Population genetics; Linkage disequilibrium
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/136 Screening for pharmacological compounds
    • C12Q2600/154 Methylation markers
    • C12Q2600/172 Haplotypes
    • G01N2800/00 Detection or diagnosis of diseases
    • G01N2800/04 Endocrine or metabolic disorders
    • G01N2800/042 Disorders of carbohydrate metabolism, e.g. diabetes, glucose metabolism
    • G01N2800/044 Hyperlipemia or hypolipemia, e.g. dyslipidaemia, obesity
    • G01N2800/10 Musculoskeletal or connective tissue disorders
    • G01N2800/105 Osteoarthritis, e.g. cartilage alteration, hypertrophy of bone
    • G01N2800/108 Osteoporosis
    • G01N2800/12 Pulmonary diseases
    • G01N2800/122 Chronic or obstructive airway disorders, e.g. asthma COPD
    • G01N2800/30 Psychoses; Psychiatry
    • G01N2800/301 Anxiety or phobic disorders
    • G01N2800/304 Mood disorders, e.g. bipolar, depression
    • G01N2800/32 Cardiovascular disorders
    • G01N2800/323 Arteriosclerosis, Stenosis

Definitions

  • Cellular constituent abundance data from microarrays and, more generally, functional genomics has become an important tool in life sciences as well as medical research.
  • Cellular constituents are individual genes, proteins, mRNA-expressing genes, and/or any other variable cellular component or protein activity, such as the degree of protein modification (e.g., phosphorylation), that is typically measured in biological experiments (e.g., by microarray) by those skilled in the art.
  • Significant discoveries relating to the complex networks of biochemical processes underlying living systems, common human diseases, and gene discovery and structure determination can now be attributed to the application of cellular constituent abundance data as part of the research process.
  • This validation typically involves gene knock outs/ins, transgenic construction, siRNA, drug treatments targeting candidate genes, time series experiments, and/or the development of specific assays intended to test hypotheses generated from gene expression experiments.
  • These validation methods do not easily lend themselves to high-throughput processes and can often take as long as eighteen months to complete. Developing methods that allow for the objective, data-driven identification of the key drivers of common human diseases would significantly enhance the utility of cellular constituent abundance measurement experiments in the target discovery process. More generally, such methods would also provide a framework for elucidating genetic networks. Cellular constituent abundance data has recently been combined with other experimental data to allow for the more immediate identification of key drivers for complex disease traits.
  • One such technique involves treating cellular constituent abundance data (e.g., gene expression data) as a quantitative trait in segregating populations.
  • chromosomal regions controlling the level of expression of a particular gene are mapped as abundance quantitative trait loci (eQTL).
  • Abundance QTL that contain the gene encoding the mRNA are distinguished from the other (trans-acting) eQTL, and those cis-acting eQTL that co-localize with chromosomal regions controlling a disease (clinical) trait (cQTL) are identified.
  • the identification of a common chromosomal location for both cis-acting eQTL and a cQTL is used to nominate susceptibility loci for the disease trait. See, for example, Karp et al., 2000, Nat. Immunol. 1, 221; Schadt et al., Nature 422, 297; and Eaves et al., 2002, Genome Res. 12, 232.
  • the present invention provides a process for identifying cellular constituents whose abundances are modulated by a disease trait QTL, and that, in turn, modulate the disease trait in a causal fashion. Additionally, the present invention provides a process for identifying disease traits that are causal for variations in cellular constituent levels. In the former case the cellular constituents are causal for the disease trait, whereas in the latter case the cellular constituents are reactive to the disease trait.
  • One aspect of the invention provides a method for determining whether cellular constituents are causal for a trait of interest T, exhibited by a plurality of organisms of a species.
  • a cellular constituent i that has at least one abundance quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus (cQTL) for the trait of interest T is identified.
  • a test is made to determine whether (i) the genetic variation of the cQTL across the plurality of organisms and (ii) the variation of the trait of interest T across the plurality of organisms are correlated conditional on an abundance pattern of the cellular constituent i across the plurality of organisms.
  • the cellular constituent i is said to be causal for the trait of interest T.
  • Another way of stating this causality test is to say that a cellular constituent i is considered to be causal for a trait of interest T when the variation of the trait of interest T can be explained by the variation in the cellular constituent i, with respect to the cQTL (provided that the trait of interest T and the cellular constituent i are both genetically linked to the locus where the cQTL is located).
  • This test can be conceptualized as having two parts.
  • the amount of variation in the trait of interest T that is explained by (caused by, correlated with) the variation in the cQTL is determined (i.e., the coefficient of determination between the variation in the trait of interest T and the variation in the cQTL across the population is quantified).
  • the coefficient of determination between the trait of interest T and the cQTL can be small. For example, a coefficient of 0.05 or less, meaning that the cQTL explains just five percent or less of the total variation in the trait of interest T across the population, is possible so long as the amount of variation is detectable.
  • the variation in the trait of interest T identified in the first part of the test is still explained by the variation in the cQTL after conditioning on the cellular constituent i. If the variation in the cQTL no longer explains (causes, is correlated with) the variation in the trait of interest T identified in the first part of the test when the variation of the cellular constituent i is considered (after conditioning on the cellular constituent i), the variation of the cQTL and the variation in the trait T are said to be uncorrelated conditional on the variation in the abundance pattern of the cellular constituent i. In such instances, the cellular constituent i is causal for the trait of interest T.
  • the second part of the test identifies the cellular constituent i as causal for the trait T when the coefficient of determination between the variation of the cQTL and the variation of the trait T cannot statistically be distinguished from zero after conditioning on the variation of the cellular constituent i.
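The two-part test described above can be sketched numerically. The following is a minimal illustration, assuming an additive numeric genotype coding at the cQTL and a linear model; the function names (`r_squared`, `residualize`, `two_part_test`) are hypothetical and are not taken from the patent. The second value is the squared partial correlation between trait and genotype after both have been adjusted for the abundance of the cellular constituent.

```python
import numpy as np

def r_squared(y, x):
    """Coefficient of determination of y regressed on x (intercept included)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def residualize(y, g):
    """Return the part of y that is not linearly explained by g."""
    y = np.asarray(y, dtype=float)
    G = np.column_stack([np.ones(len(y)), g])
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)
    return y - G @ beta

def two_part_test(trait, genotype, abundance):
    """Return (first, second) coefficients of determination.

    first:  variation in the trait explained by the cQTL genotype.
    second: the same quantity after conditioning on the abundance
            of the cellular constituent (squared partial correlation).
    """
    first = r_squared(trait, genotype)
    second = r_squared(residualize(trait, abundance),
                       residualize(genotype, abundance))
    return first, second
```

Under these assumptions, a constituent would be flagged as a causal candidate when the first value is detectably above zero and the second cannot be distinguished from zero; deciding what counts as "indistinguishable from zero" requires a significance test that this sketch leaves out.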
  • an eQTL and overlapping cQTL are coincident with each other when the physical location of the eQTL in the genome of the species is within 40 cM or 10 cM of the physical location of the respective cQTL in the genome of the species.
  • the method further comprises, prior to identifying cellular constituents that are causal for a given clinical trait, a step to determine the eQTL for each cellular constituent using a first quantitative trait locus (QTL) analysis, wherein the first QTL analysis uses a plurality of abundance statistics for the cellular constituent i as a quantitative trait, and wherein each abundance statistic in the plurality of abundance statistics represents an abundance value for the cellular constituent i in an organism in the plurality of organisms.
  • the method further comprises a step of determining the respective cQTL using a second QTL analysis, wherein the second QTL analysis uses a plurality of phenotypic values as a quantitative trait, each phenotypic value in the plurality of phenotypic values corresponding to an organism in the plurality of organisms.
  • an eQTL is coincident with the respective cQTL when the eQTL and the respective cQTL colocalize within 40 cM of a locus Q in the genome of the species, within 10 cM of a locus Q in the genome of the species, within 3 cM of a locus Q in the genome of the species, or within 1 cM of a locus Q in the genome of the species.
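The coincidence criterion can be expressed as a simple distance check. The sketch below assumes each QTL is summarized by a chromosome label and a map position in cM, and treats two QTL as coincident when they lie on the same chromosome within a chosen window of each other; the `Locus` type and both function names are hypothetical.

```python
from typing import List, Tuple

Locus = Tuple[str, float]  # (chromosome label, map position in cM)

def is_coincident(eqtl: Locus, cqtl: Locus, window_cm: float = 10.0) -> bool:
    """True when both QTL are on the same chromosome and within window_cm."""
    return eqtl[0] == cqtl[0] and abs(eqtl[1] - cqtl[1]) <= window_cm

def coincident_pairs(eqtls: List[Locus], cqtls: List[Locus],
                     window_cm: float = 10.0) -> List[Tuple[Locus, Locus]]:
    """All (eQTL, cQTL) pairs that satisfy the coincidence criterion."""
    return [(e, c) for e in eqtls for c in cqtls
            if is_coincident(e, c, window_cm)]
```

The window would be set to 40, 10, 3, or 1 cM depending on which of the thresholds named above is applied.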
  • the cellular constituent i is validated by a gene knock-out experiment, a transgenic construction experiment, or an siRNA experiment.
  • the first QTL analysis and the second QTL analysis each use a genetic map that represents the genome of the plurality of organisms.
  • a step of constructing the genetic map from a set of genetic markers associated with the plurality of organisms is performed.
  • the set of genetic markers comprises single nucleotide polymorphisms (SNPs), microsatellite markers, restriction fragment length polymorphisms, short tandem repeats, DNA methylation markers, sequence length polymorphisms, random amplified polymorphic DNA, amplified fragment length polymorphisms, or simple sequence repeats.
  • genotype data is used in the constructing step and wherein the genotype data comprises knowledge of which alleles, for each marker in the set of genetic markers, are present in each organism in the plurality of organisms.
  • the plurality of organisms represents a segregating population and pedigree data is used in the constructing step. Further, the pedigree data shows one or more relationships between organisms in the plurality of organisms.
  • the plurality of organisms comprises an F2 population, an F1 population, an F2:3 population, or a Design III population and the one or more relationships between organisms in the plurality of organisms indicate which organisms in the plurality of organisms are members of the F2 population, the F1 population, the F2:3 population, or the Design III population. More generally, the plurality of organisms comprises a human population consisting of any number of family structures with varying degrees of relatedness represented in the families.
  • each abundance value is a normalized abundance level measurement for the cellular constituent i in an organism in the plurality of organisms.
  • each abundance level measurement is determined by measuring an amount of the cellular constituent i in one or more cells from an organism in the plurality of organisms.
  • the amount of the cellular constituent can be, for example, an abundance of an RNA present in the one or more cells of the organism.
  • the abundance of the RNA is measured by contacting a gene transcript array with the RNA from the one or more cells of the organism, or with nucleic acid derived from the RNA.
  • the gene transcript array comprises a positionally addressable surface with attached nucleic acids or nucleic acid mimics.
  • the nucleic acids or nucleic acid mimics are capable of hybridizing with the RNA species or with nucleic acid derived from the RNA species.
  • the normalized abundance level measurement is obtained by a normalization technique selected from the group consisting of Z-score of intensity, median intensity, log median intensity, Z-score standard deviation log of intensity, Z-score mean absolute deviation of log intensity, calibration DNA gene set, user normalization gene set, ratio median intensity correction, and intensity background correction.
  • an abundance value comprises an amount of the cellular constituent i in tissues of the organism, a concentration of the cellular constituent i in tissues of the organism, a cellular constituent activity level for the cellular constituent i in one or more tissues of the organism, or the state of modification of the cellular constituent i in the organism.
  • the state of modification of the cellular constituent i is a degree of phosphorylation of the cellular constituent i.
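Of the normalization techniques listed in this section, a Z-score of log intensity is the easiest to illustrate. The sketch below is one plausible reading of that technique, assuming strictly positive raw intensities, a base-10 logarithm, and the sample standard deviation; the function name is hypothetical.

```python
import numpy as np

def zscore_log_intensity(intensities):
    """Z-score of log10 intensities across one set of measurements.

    Assumes every intensity is strictly positive (e.g., after background
    correction) and that at least two distinct values are present.
    """
    logs = np.log10(np.asarray(intensities, dtype=float))
    return (logs - logs.mean()) / logs.std(ddof=1)
```

For intensities 10, 100, and 1000 the log values are 1, 2, and 3, so the normalized values are -1, 0, and 1.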
  • the first QTL analysis comprises (i) testing for linkage between (a) the genotype of the plurality of organisms at a position in the genome of the species and (b) the plurality of abundance statistics for the cellular constituent i; (ii) advancing the position in the genome by an amount; and (iii) repeating steps (i) and (ii) until all or a portion of the genome of the species has been tested.
  • the amount is less than 100 centiMorgans or less than 5 centiMorgans.
  • the testing comprises performing linkage analysis or association analysis.
  • the linkage analysis or association analysis generates a statistical score for each position in the genome of the species that is tested.
  • the testing is linkage analysis and the statistical score is a logarithm of the odds (lod) score.
  • an eQTL is represented by a lod score that is greater than 2.0, or greater than 4.0.
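The scan described in steps (i) through (iii) can be sketched as a loop over genome positions that computes a lod score at each one. This illustration substitutes simple single-position regression for a full linkage analysis: it assumes genotypes are already available at each tested position with an additive numeric coding (e.g., 0, 1, 2) and normally distributed residuals, in which case the lod score is (n/2) log10(RSS0/RSS1). The function names and the default threshold of 4.0 (one of the values named above) are choices made for the example.

```python
import math
import numpy as np

def lod_at_position(genotype, phenotype):
    """lod score comparing a one-QTL regression model with a no-QTL model."""
    y = np.asarray(phenotype, dtype=float)
    x = np.asarray(genotype, dtype=float)
    n = len(y)
    rss0 = ((y - y.mean()) ** 2).sum()            # no-QTL model: mean only
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss1 = resid @ resid                          # QTL model: mean + genotype
    return (n / 2.0) * math.log10(rss0 / rss1)

def genome_scan(genotypes, phenotype, positions_cm, threshold=4.0):
    """Test every position in turn; return (position, lod) above threshold.

    genotypes: array of shape (n_organisms, n_positions).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    hits = []
    for j, pos in enumerate(positions_cm):
        lod = lod_at_position(genotypes[:, j], phenotype)
        if lod > threshold:
            hits.append((pos, lod))
    return hits
```

Passing abundance statistics for cellular constituent i as `phenotype` corresponds to the first QTL analysis; passing clinical phenotypic values corresponds to the second.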
  • the second QTL analysis comprises (i) testing for linkage between (a) the genotype of the plurality of organisms at a position in the genome of the species and (b) the plurality of phenotypic values; (ii) advancing the position in the genome by an amount; and (iii) repeating steps (i) and (ii) until all or a portion of the genome of the species has been tested.
  • the amount is less than 100 centiMorgans, or less than 5 centiMorgans.
  • the testing comprises performing linkage analysis or association analysis. Such analysis generates a statistical score for the position in the genome of the species.
  • the testing is linkage analysis and the statistical score is a logarithm of the odds (lod) score.
  • the cQTL is represented by a lod score that is greater than 2.0 or greater than 4.0.
  • the trait of interest T is a complex trait.
  • the trait is characterized by an allele that exhibits incomplete penetrance in the species.
  • the trait is a disease that is contracted by an organism in the population, and the organism inherits no predisposing allele to the disease.
  • the trait arises when any of a plurality of different genes in the genome of the species are mutated.
  • the trait requires the simultaneous presence of mutations in a plurality of genes in the genome of the species.
  • the trait requires the simultaneous presence of mutations in a plurality of genes in the genome of the species and a set of environmental conditions.
  • the trait is the result of the genotype of a plurality of genes as well as one or more environmental conditions (e.g., an obesity trait that requires a person eating a lot in addition to that person having gene combinations that lead to obesity).
  • the trait is associated with a high frequency of disease-causing alleles in the species.
  • the complex trait is a phenotype that does not exhibit Mendelian recessive or dominant inheritance attributable to a single gene locus.
  • the trait is asthma, ataxia telangiectasia, bipolar disorder, cancer, common late-onset Alzheimer's disease, diabetes, heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine, nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes mellitus, obesity, polycystic kidney disease, psoriasis, schizophrenia, or xeroderma pigmentosum.
  • the method further comprises testing whether the coincidence between an eQTL and a respective cQTL is a result of pleiotropy, or a result of two closely linked QTL, wherein when the coincidence between said eQTL and said respective cQTL is the result of two closely linked QTL, the cellular constituent i is not associated with said trait of interest.
  • this testing comprises comparing a model for the null hypothesis, indicating the result of pleiotropy, to a model for the alternative hypothesis, indicating two closely linked QTL.
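  • the model for the null hypothesis is:
    $$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} N + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$
    where $t_1$ and $t_2$ are the two quantitative traits being compared (the abundance of the cellular constituent i and the trait of interest T); N is a categorical random variable indicating the genotypes at the position of the eQTL and the cQTL in the plurality of organisms; and $(\varepsilon_1, \varepsilon_2)^{\mathsf T}$ is distributed as a bivariate normal random variable with mean $(0, 0)^{\mathsf T}$ and covariance matrix $\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$.
  • the model for the alternative hypothesis is:
    $$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 N_1 \\ \beta_2 N_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$
    where $N_1$ and $N_2$ are categorical random variables indicating the genotypes at the positions of the eQTL and the cQTL, respectively, in the plurality of organisms; and the residual term is distributed as under the null hypothesis.
  • the log-likelihoods for the null hypothesis and the alternative hypothesis are maximized with respect to the model parameters ($\mu_i$, $\beta_j$, and $\sigma_k$) using maximum likelihood analysis. After maximum likelihood estimates are obtained for each model, the likelihood ratio test statistic between the competing models is formed and the test statistic is used to determine whether the model for the alternative hypothesis provides for a statistically significant better fit to the data than the model for the null hypothesis.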
  • the test to determine whether (i) the genetic variation of the cQTL across the plurality of organisms and (ii) the variation of the trait of interest T across the plurality of organisms are correlated conditional on an abundance pattern of the cellular constituent i across the plurality of organisms comprises considering a null test for causality having the relationship:
    $$P(T, Q \mid G) = P(T \mid G)\, P(Q \mid G)$$
    wherein
  • each function P is a probability density function
  • T is a trait random variable for the trait of interest across the plurality of organisms
  • Q is a genotype random variable for a locus Q where an eQTL and a cQTL colocalize across the plurality of organisms
  • G is said abundance pattern of said cellular constituent i across said plurality of organisms.
  • such testing comprises comparing the null test for causality, indicating that G is causal for T, to an alternative hypothesis that T and Q are dependent given G.
  • such testing comprises optimizing the log likelihood ratio of the null hypothesis and the alternative hypothesis using maximum likelihood analysis.
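The comparison between the null hypothesis (T and Q independent given G) and the alternative (T and Q dependent given G) can be illustrated with a likelihood ratio test. The sketch below makes strong simplifying assumptions that are not stated in the source: T is modeled as linear in G with normal errors, the genotype Q is coded additively as a single numeric column, and the test statistic is compared against a chi-square distribution with one degree of freedom. The function names are hypothetical.

```python
import math
import numpy as np

def _rss(y, columns):
    """Residual sum of squares of y regressed on the given columns plus intercept."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def causality_lrt(trait, genotype, abundance):
    """Likelihood ratio test of T independent of Q given G.

    Null model:        T ~ G        (Q adds nothing once G is known)
    Alternative model: T ~ G + Q    (T and Q dependent given G)
    Returns (statistic, p_value); a large p-value means the null of
    causality is not rejected.
    """
    t = np.asarray(trait, dtype=float)
    q = np.asarray(genotype, dtype=float)
    g = np.asarray(abundance, dtype=float)
    n = len(t)
    stat = n * math.log(_rss(t, [g]) / _rss(t, [g, q]))
    stat = max(stat, 0.0)                      # guard against rounding below zero
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival, 1 d.o.f.
    return stat, p_value
```

With a categorical genotype coding the alternative model would gain more than one parameter, and the reference distribution would need the corresponding degrees of freedom.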
  • One embodiment of the present invention provides a method for determining whether a cellular constituent is causal for a trait of interest T.
  • the trait of interest T is exhibited by a plurality of organisms of a species.
  • the method comprises identifying a locus Q in the genome of the species that is a site of colocalization for (i) an abundance quantitative trait locus (eQTL) genetically linked to (correlated with) a variation in abundance levels of the cellular constituent across all or a portion of the plurality of organisms and (ii) a clinical quantitative trait locus (cQTL) that is genetically linked to (correlated with) a variation in the trait of interest T across all or a portion of the plurality of organisms.
  • a first coefficient of determination is quantified between (i) the variation in the clinical quantitative trait locus (cQTL) across all or a portion of the plurality of organisms and (ii) the variation in the trait of interest T across all or a portion of said plurality of organisms.
  • a second coefficient of determination is quantified between (i) the variation in the clinical quantitative trait locus (cQTL) across all or a portion of the plurality of organisms and (ii) the variation in the trait of interest T across all or a portion of the plurality of organisms, after conditioning on the variation of the abundance of the cellular constituent across all or a portion of the plurality of organisms.
  • the cellular constituent is causal for the trait of interest T when the first coefficient of determination is other than zero and the second coefficient of determination is zero. In some embodiments, the cellular constituent is causal for the trait of interest T when the first coefficient of determination is greater than a predetermined threshold amount such as 0.03 or 0.10.
  • the method further comprises identifying a candidate causative cellular constituent set.
  • Each cellular constituent in the candidate causative cellular constituent set has at least one eQTL that is coincident with a respective cQTL for the trait of interest T.
  • each cellular constituent in the candidate causative cellular constituent set that does not have a druggable domain is removed from the set.
  • a rank of a cellular constituent i in the candidate cellular constituent set is determined by the amount of genetic variation in the trait of interest T that is explained by the at least one eQTL of cellular constituent i.
  • the amount of genetic variation in the trait of interest T that is explained by the at least one eQTL of cellular constituent i is determined by a joint analysis of the trait of interest at each one of the eQTL in said at least one eQTL.
3.10. CELLULAR CONSTITUENTS WHOSE ABUNDANCE SIGNIFICANTLY ASSOCIATES WITH THE TRAIT OF INTEREST
  • In some embodiments, only those cellular constituents whose abundance in the plurality of organisms significantly associates with the trait of interest T are considered. Accordingly, in some embodiments, the variation in the abundance level of cellular constituent i associates with the variation in the trait of interest T across the plurality of organisms.
  • the association between (i) the variation in the abundance level of a cellular constituent i and (ii) the variation in the trait of interest T across the plurality of organisms is determined using a Pearson correlation, discriminant analysis or a regression model.
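A Pearson-correlation filter of this kind might look as follows. This is a sketch only: the p-value is approximated with the Fisher z-transform and a normal distribution rather than an exact t-test, and the function names and the default cutoff `alpha` are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def pearson_p_value(x, y):
    """Approximate two-sided p-value (Fisher z-transform, needs n > 3)."""
    n = len(x)
    r = pearson_r(x, y)
    # Clamp so atanh stays finite when the correlation is (nearly) perfect.
    r = max(min(r, 0.999999999999), -0.999999999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def significantly_associated(abundance, trait, alpha=0.0001):
    """True when abundance and trait correlate with p-value below alpha."""
    return pearson_p_value(abundance, trait) < alpha
```

For example, `significantly_associated(levels, trait_values, alpha=0.00001)` would apply the stricter of two commonly used cutoffs, where `levels` and `trait_values` are per-organism measurements in the same order.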
  • a Pearson correlation is used, and an association between (i) the variation in the abundance level of the cellular constituent i and (ii) the variation in the trait of interest T across the plurality of organisms is identified when the p-value of the Pearson correlation coefficient is less than 0.00001 or less than 0.0001.
3.11. REPRESENTATIVE COMPUTER PROGRAM PRODUCT
  • One aspect of the invention provides a computer program product for use in conjunction with a computer system.
  • the computer program product comprises a computer readable storage medium and a computer program mechanism embedded therein.
  • the computer program mechanism is for determining whether a cellular constituent is causal for a trait of interest, exhibited by a plurality of organisms of a species.
  • the computer program mechanism comprises a cQTL/eQTL overlap module.
  • the cQTL/eQTL overlap module comprises instructions for identifying a cellular constituent i that has at least one abundance quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus (cQTL) for the trait of interest.
  • the computer program mechanism further comprises a causality test module.
  • the causality test module comprises instructions for testing, for one or more respective eQTL in the at least one eQTL, whether (i) the genetic variation of the eQTL across the plurality of organisms and (ii) the variation of the trait of interest across the plurality of organisms are correlated conditional on an abundance pattern of the cellular constituent i across the plurality of organisms.
  • Another aspect of the present invention provides a computer program product for use in conjunction with a computer system.
  • the computer program product comprises a computer readable storage medium and a computer program mechanism embedded therein.
  • the computer program mechanism is for determining whether a cellular constituent is causal for a trait of interest, exhibited by a plurality of organisms of a species.
  • the computer program mechanism comprises a cQTL/eQTL overlap module.
  • the cQTL/eQTL overlap module comprises instructions for identifying a cellular constituent that has at least one abundance quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus (cQTL) for the trait of interest.
  • the computer program mechanism further comprises a causality test module.
  • the causality test module comprises instructions for testing, for one or more respective eQTL in the at least one eQTL, (i) a causative model, (ii) a reactive model and (iii) an independent model using a maximum likelihood approach, wherein when, for each compared eQTL, the causative model gives rise to the largest likelihood relative to the corresponding reactive model and the corresponding independent model, the cellular constituent i is causal for the trait of interest.
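The comparison of the three models can be sketched with Gaussian linear regressions. The factorizations below are assumptions made for illustration: causative is taken as Q → G → T, reactive as Q → T → G, and independent as Q → G together with Q → T, where Q is the genotype at the eQTL, G the abundance of the cellular constituent and T the trait. The term P(Q) is common to all three and is omitted. All names and the toy data are hypothetical.

```python
import math

def gaussian_loglik(y, x):
    """Maximized Gaussian log-likelihood of the regression y ~ a + b*x.
    Assumes x is not constant and the fit is not perfect (residual > 0)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    rss = sum(((yi - my) - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / n  # maximum likelihood estimate of the residual variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1.0)

def compare_models(q, g, t):
    """Log-likelihood of each candidate model for genotype q, abundance g
    and trait t."""
    return {
        "causative": gaussian_loglik(g, q) + gaussian_loglik(t, g),
        "reactive": gaussian_loglik(t, q) + gaussian_loglik(g, t),
        "independent": gaussian_loglik(g, q) + gaussian_loglik(t, q),
    }

def best_model(q, g, t):
    """Name of the model with the largest likelihood."""
    scores = compare_models(q, g, t)
    return max(scores, key=scores.get)

# Hypothetical toy data generated as Q -> G -> T.
q = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
g = [2 * qi + 0.5 * e for qi, e in zip(q, [-1, 1, -1, 1] * 3)]
t = [gi + 0.5 * d for gi, d in zip(g, [-1, -1, 1, 1] * 3)]
```

Because each model here uses the same number of regression parameters, the raw likelihoods can be compared directly; models with differing parameter counts would need a penalized criterion instead.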
  • the computer program mechanism further comprises a quantitative genetics analysis module that comprises instructions for determining the eQTL using a first quantitative trait locus (QTL) analysis.
  • the first QTL analysis uses a plurality of abundance statistics for the cellular constituent i as a quantitative trait, and each abundance statistic in the plurality of abundance statistics represents an abundance value for the cellular constituent i in an organism in the plurality of organisms.
  • the quantitative genetics analysis module further comprises instructions for determining the respective cQTL using a second QTL analysis.
  • the second QTL analysis uses a plurality of phenotypic values as a quantitative trait. Each phenotypic value in the plurality of phenotypic values corresponds to an organism in the plurality of organisms.
  • the computer program mechanism further comprises a pleiotropy module that comprises instructions for testing whether the coincidence between an eQTL and a respective cQTL is a result of pleiotropy, or a result of two closely linked QTL.
  • the testing comprises comparing a null hypothesis, indicating said result of pleiotropy, to an alternative hypothesis, indicating two closely linked QTL.
  • the computer system comprises a central processing unit and a memory, coupled to the central processing unit.
  • the memory stores a cQTL/eQTL overlap module and a causality test module.
  • the cQTL/eQTL overlap module comprises instructions for identifying a cellular constituent i that has at least one abundance quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus (cQTL) for the trait of interest.
  • the causality test module comprises instructions for testing, for one or more respective eQTL/cQTL pairs in the at least one eQTL/cQTL pair, whether (i) the genetic variation of the cQTL across the plurality of organisms and (ii) the variation of the trait of interest across the plurality of organisms are correlated conditional on an abundance pattern of the cellular constituent i across the plurality of organisms.
  • Another aspect of the present invention provides a computer system for determining whether a cellular constituent is causal for a trait of interest that is exhibited by a plurality of organisms of a species.
  • the computer system comprises a central processing unit and a memory coupled to the central processing unit.
  • the memory stores a cQTL/eQTL overlap module and a causality test module.
  • the cQTL/eQTL overlap module comprises instructions for identifying a cellular constituent that has at least one abundance quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus (cQTL) for the trait of interest.
  • the causality test module comprises instructions for testing, for one or more respective eQTL/cQTL pairs in the at least one eQTL/cQTL pair, (i) a causative model, (ii) a reactive model and (iii) an independent model using a maximum likelihood approach.
  • One embodiment of the present invention provides a method for determining whether a candidate molecule affects a body weight disorder associated with an organism.
  • a cell from the organism is contacted with the candidate molecule or the candidate molecule is recombinantly expressed within the cell from the organism.
  • RNA expression or protein expression in the cell of at least one open reading frame is changed in the first step of the method relative to the expression of the open reading frame in the absence of the candidate molecule, each referenced open reading frame being regulated by a promoter native to a nucleic acid sequence selected from the group consisting of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 23 and homologs of each of the foregoing.
  • a cell from the organism contacted with the candidate molecule exhibits a lower expression level of a protein sequence selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing, than a cell from the organism that is not contacted with the candidate molecule.
  • the body weight disorder is obesity, anorexia nervosa, bulimia nervosa or cachexia.
  • the second step comprises determining whether RNA expression is changed or whether protein expression is changed.
  • the second step comprises determining whether RNA or protein expression of at least two of the open reading frames is changed.
  • the first step comprises contacting the cell with the candidate molecule and the first step is carried out in a liquid high throughput-like assay.
  • the cell comprises a promoter region of at least one gene selected from the group consisting of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 23, and homologs of each of the foregoing, each promoter region being operably linked to a marker gene.
  • the second step comprises determining whether the RNA expression or protein expression of the marker gene(s) is changed in the first step relative to the expression of the marker gene in the absence of the candidate molecule.
  • the marker gene is selected from the group consisting of green fluorescent protein, red fluorescent protein, blue fluorescent protein, luciferase, LEU2, LYS2, ADE2, TRP1, CAN1, CYH2, GUS, CUP1 and chloramphenicol acetyl transferase.
  • Another embodiment of the present invention provides a method of treating or preventing a body weight disorder.
  • the method comprises administering to a subject in which treatment is desired a therapeutically effective amount of a compound that antagonizes in the subject a protein comprising a sequence selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24 and homologs of each of the foregoing.
  • the subject is human.
  • the compound (i) inhibits a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing
  • (ii) is selected from the group consisting of: an antibody that binds to one of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing or a fragment or derivative thereof containing the binding region thereof, or is selected from the group consisting of: a nucleic acid complementary to the RNA produced by transcription of a gene encoding one of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO:
  • the compound that inhibits a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing is a small interfering RNA (siRNA) or RNAi.
  • For siRNA and RNAi, see, for example, Xia et al., 2002, Nature Biotechnology 20, p. 1006; Hannon, 2002, Nature 418, p. 244; Carthew, 2001, Current Opinion in Cell Biology 13, p.
  • the compound that inhibits a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing is an oligonucleotide that: (a) consists of at least six nucleotides; (b) comprises a sequence complementary to at least a portion of an RNA transcript of a gene encoding one of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing; and (c) is hybridizable to the RNA transcript under moderately stringent conditions.
  • Another embodiment of the present invention provides a method of treating or preventing a body weight disorder comprising administering to a subject in which treatment is desired a therapeutically effective amount of a compound that enhances a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs of each of the foregoing.
  • the subject is human.
  • Still another embodiment of the present invention provides a method of diagnosing a disease or disorder or the predisposition to the disease or disorder.
  • the disease or disorder is characterized by an aberrant level of one of SEQ ID NO: 1 through SEQ ID NO: 24, or a homolog thereof, in a subject.
  • the method comprises measuring the level of any one of SEQ ID NO: 1 through SEQ ID NO: 24, or a homolog thereof, in a sample derived from the subject, in which an increase or decrease in the level of one of SEQ ID NO: 1 through SEQ ID NO: 24, or a homolog thereof, in the sample, relative to the level of a corresponding one of said SEQ ID NO: 1 through SEQ ID NO: 24, or a homolog thereof, found in an analogous sample not having the disease or disorder, indicates the presence of the disease or disorder in the subject.
  • the disease or disorder is a body weight disorder such as obesity, anorexia nervosa, bulimia nervosa, or cachexia.
  • Yet another embodiment of the present invention provides a method of diagnosing or screening for the presence of or predisposition for developing a disease or disorder involving a body weight disorder in a subject. The method comprises detecting one or more mutations in at least one of SEQ ID NO: 1 through SEQ ID NO: 24, or a homolog thereof, in a sample derived from the subject, in which the presence of the one or more mutations indicates the presence of the disease or disorder or a predisposition for developing the disease or disorder. 3.14.
  • the present invention provides embodiments that can be used to determine whether a first trait is causal for a second trait.
  • the first trait can represent variance in abundance of a first cellular constituent across a population and the second trait can represent variance in a second cellular constituent across a population.
  • the present invention provides a test to determine whether the first trait drives (is causal for) the second trait. In order to accept the results of the test, however, there must exist some QTL that is linked to (correlated with) both the first trait and the second trait.
  • one embodiment of the present invention provides a method for determining whether a first trait T1 is causal for a second trait T2 in a plurality of organisms of a species.
  • in the method, at least one locus in the genome of the species is identified.
  • Each locus Q in the at least one locus is a site of colocalization for (i) a respective quantitative trait locus (QTL1) linked to (correlated with) a variation in the first trait T1 across the plurality of organisms and (ii) a respective quantitative trait locus (QTL2) that is linked to (correlated with) a variation in the second trait T2 across the plurality of organisms.
  • Each respective locus Q in the at least one locus is tested to determine whether (i) the genetic variation at QTL2 across the plurality of organisms and (ii) the variation in the second trait T2 across the plurality of organisms are correlated conditional on the variation in the first trait T1 across the plurality of organisms.
  • the first trait T1 is causal for the second trait T2.
  • a respective QTL1 is identified using a first quantitative trait locus (QTL) analysis.
  • This first QTL analysis uses a plurality of quantitative measurements of the first trait. Each quantitative measurement in the plurality of quantitative measurements of the first trait is associated with an organism in the plurality of organisms.
  • a respective QTL2 is determined using a second QTL analysis. The second QTL analysis uses a plurality of quantitative measurements of the second trait. Each quantitative measurement in the plurality of quantitative measurements of the second trait is associated with an organism in the plurality of organisms.
  • the respective QTL1 and the respective QTL2 colocalize at a locus Q in the at least one locus when the respective QTL1 and said respective QTL2 are within 40 cM of a common locus Q, within 10 cM of a common locus Q, within 3 cM of a common locus Q or within 1 cM of the locus Q in the genome of the species.
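The distance-based colocalization criterion can be written as a small check. In this sketch a QTL and a locus are each represented as a hypothetical `(chromosome, position_in_cM)` tuple, and the window size is a parameter so that any of the distances mentioned (40, 10, 3 or 1 cM) can be used.

```python
def within(locus, qtl, max_cm):
    """True when qtl lies on the same chromosome as locus and no more than
    max_cm centiMorgans away from it."""
    return locus[0] == qtl[0] and abs(locus[1] - qtl[1]) <= max_cm

def colocalize(locus, qtl1, qtl2, max_cm=10.0):
    """True when both QTL fall within max_cm of the common locus Q."""
    return within(locus, qtl1, max_cm) and within(locus, qtl2, max_cm)

# Hypothetical positions: chromosome 1, in centiMorgans.
locus_q = ("chr1", 52.0)
first_qtl = ("chr1", 50.5)
second_qtl = ("chr1", 54.0)
```

With these positions, `colocalize(locus_q, first_qtl, second_qtl, max_cm=3.0)` is true, while tightening the window to 1 cM makes it false.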
  • the first trait is a variation in abundance levels of a first cellular constituent across the plurality of organisms and each quantitative measurement of the first trait is an abundance level of the first cellular constituent in an organism in the plurality of organisms.
  • the second trait is a variation in abundance levels of a second cellular constituent across the plurality of organisms and each quantitative measurement of the second trait is an abundance level of the second cellular constituent in an organism in the plurality of organisms.
  • each of the abundance levels of the first cellular constituent is normalized and each of the abundance levels of the second cellular constituent is normalized.
  • the abundance levels of the first cellular constituent are determined by measuring amounts of the first cellular constituent in one or more cells from organisms in the plurality of organisms.
  • the abundance levels of the second cellular constituent are determined by measuring amounts of the second cellular constituent in one or more cells from organisms in the plurality of organisms. Such amounts can be, for example, RNA levels.
  • RNA levels can be measured by, for example, contacting a gene transcript array with the RNA, or with nucleic acid derived from the RNA.
  • Such gene transcript arrays comprise a positionally addressable surface with attached nucleic acids or nucleic acid mimics.
  • the first QTL analysis comprises (i) testing for linkage between (a) the genotype of the plurality of organisms at a position in the genome of the species and (b) the plurality of quantitative measurements of the first trait; (ii) advancing the position in said genome by an amount; and (iii) repeating steps (i) and (ii) until all or a portion of the genome of the species has been tested.
  • the second QTL analysis comprises (i) testing for linkage between (a) the genotype of said plurality of organisms at a position in the genome of the species and (b) the plurality of quantitative measurements of the second trait; (ii) advancing the position in the genome by an amount; and (iii) repeating steps (i) and (ii) until all or a portion of the genome of the species has been tested.
  • the amount is less than 100 centiMorgans or less than 5 centiMorgans.
  • the testing comprises performing linkage analysis or association analysis. Such linkage analysis or association analysis can generate a statistical score, such as a logarithm of the odds (lod) score, for the position in the genome of the species.
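The scan described above (test a position, advance by a fixed amount, repeat) can be sketched as a loop that computes a lod score at each position. The sketch assumes a simple regression of the trait on a numeric genotype code, with the lod score computed as (n/2)·log10(RSS0/RSS1); `genotype_at` is a hypothetical callable that returns the genotype codes of all organisms at a given position, and the toy marker data are invented for illustration.

```python
import math

def lod_score(trait, genotype):
    """lod score comparing a regression of trait on genotype with a
    mean-only model."""
    n = len(trait)
    mt, mg = sum(trait) / n, sum(genotype) / n
    sgg = sum((g - mg) ** 2 for g in genotype)
    rss0 = sum((t - mt) ** 2 for t in trait)
    if sgg == 0:
        return 0.0  # uninformative position: every organism has one genotype
    slope = sum((g - mg) * (t - mt) for g, t in zip(genotype, trait)) / sgg
    rss1 = sum(((t - mt) - slope * (g - mg)) ** 2
               for g, t in zip(genotype, trait))
    if rss1 == 0:
        return math.inf
    return 0.5 * n * math.log10(rss0 / rss1)

def genome_scan(trait, genotype_at, length_cm, step_cm):
    """Test each position from 0 to length_cm, advancing by step_cm."""
    scores = []
    position = 0.0
    while position <= length_cm:
        scores.append((position, lod_score(trait, genotype_at(position))))
        position += step_cm
    return scores

# Hypothetical toy data: four markers spaced 10 cM apart, eight organisms.
markers = {
    0: [1, 0, 1, 0, 0, 1, 0, 1],
    10: [0, 1, 0, 1, 0, 1, 0, 1],
    20: [0, 0, 0, 0, 1, 1, 1, 1],
    30: [0, 0, 1, 0, 1, 1, 0, 1],
}
trait = [1.0, 1.2, 0.9, 1.1, 3.0, 3.1, 2.9, 3.2]
scores = genome_scan(trait, lambda pos: markers[int(pos)], 30.0, 10.0)
peak_position, peak_lod = max(scores, key=lambda s: s[1])
```

A real analysis would typically infer genotypes between markers rather than look them up directly; the lookup here only keeps the example short.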
  • each quantitative measurement in the plurality of quantitative measurements of the first trait is: an amount or a concentration of a first cellular constituent in one or more tissues of an organism in the plurality of organisms, a cellular constituent activity level of the first cellular constituent in one or more tissues of an organism in the plurality of organisms, or a state of cellular constituent modification of the first cellular constituent in one or more tissues of an organism in the plurality of organisms.
  • each quantitative measurement in the plurality of quantitative measurements of the second trait is an amount or a concentration of a second cellular constituent in one or more tissues of an organism in the plurality of organisms, a cellular constituent activity level of the second cellular constituent in one or more tissues of an organism in the plurality of organisms, or a state of cellular constituent modification of the second cellular constituent in one or more tissues of an organism in the plurality of organisms.
  • a respective QTL1 and a respective QTL2 colocalize at a locus Q in the at least one locus when the respective QTL1 and the respective QTL2 satisfy a pleiotropy test.
  • failure of the pleiotropy test indicates that the respective QTL1 and the respective QTL2 are two closely linked QTL, the causality test is not performed, and the first trait T1 is not determined to be causal for the second trait T2.
  • this pleiotropy test comprises comparing a model for a null hypothesis, indicating that the respective QTL1 and the respective QTL2 colocalize as a QTL, to a model for an alternative hypothesis, indicating that the QTL1 and the respective QTL2 are two closely linked QTL.
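  • the model for the null hypothesis is:
\[ \begin{pmatrix} T_1 \\ T_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} N + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \]
where N is a categorical random variable indicating the genotype at locus Q across the plurality of organisms; $(\varepsilon_1, \varepsilon_2)^{\top}$ is distributed as a bivariate normal random variable with mean $(0, 0)^{\top}$ and covariance matrix $\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$; and $\mu_i$ and $\beta_i$ are model parameters.
  • the model for the alternative hypothesis is:
\[ \begin{pmatrix} T_1 \\ T_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 N_1 \\ \beta_2 N_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \]
where $N_1$ and $N_2$ are categorical random variables indicating the genotype at locus Q across the plurality of organisms; $(\varepsilon_1, \varepsilon_2)^{\top}$ is distributed as a bivariate normal random variable with mean $(0, 0)^{\top}$ and covariance matrix $\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$; and $\mu_i$ and $\beta_i$ are model parameters.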
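The comparison between the single-QTL (pleiotropy) null hypothesis and the two-closely-linked-QTL alternative can be sketched as a likelihood ratio. The sketch makes several assumptions: genotypes are numeric codes, each trait is fitted by ordinary least squares on its own genotype variable, and the residuals are treated as bivariate normal. Fitting the two equations separately only approximates the full maximum-likelihood fit under the alternative. Function names and the toy data are hypothetical.

```python
import math

def fit_residuals(y, x):
    """Residuals of the least-squares line y ~ a + b*x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

def bivariate_loglik(r1, r2):
    """Gaussian log-likelihood of paired residuals, evaluated at the
    sample (maximum likelihood) covariance matrix."""
    n = len(r1)
    s11 = sum(a * a for a in r1) / n
    s22 = sum(b * b for b in r2) / n
    s12 = sum(a * b for a, b in zip(r1, r2)) / n
    det = s11 * s22 - s12 * s12
    return -0.5 * n * (2 * math.log(2 * math.pi) + math.log(det) + 2)

def pleiotropy_lr(t1, t2, genotypes):
    """Likelihood ratio statistic: best pair of positions (alternative)
    versus best single shared position (null). genotypes maps each
    candidate position to the genotype codes of all organisms."""
    positions = sorted(genotypes)
    res1 = {p: fit_residuals(t1, genotypes[p]) for p in positions}
    res2 = {p: fit_residuals(t2, genotypes[p]) for p in positions}
    null = max(bivariate_loglik(res1[p], res2[p]) for p in positions)
    alt = max(bivariate_loglik(res1[p], res2[q])
              for p in positions for q in positions)
    return 2.0 * (alt - null)

# Hypothetical toy data: two nearby positions differing in two organisms.
genotypes = {40.0: [0, 0, 0, 0, 1, 1, 1, 1], 42.0: [0, 0, 0, 1, 0, 1, 1, 1]}
noise1 = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2]
noise2 = [0.2, 0.1, -0.1, -0.2, -0.2, 0.1, 0.2, -0.1]
t1 = [2 * g + e for g, e in zip(genotypes[40.0], noise1)]
t2_linked = [2 * g + e for g, e in zip(genotypes[42.0], noise2)]
t2_pleiotropic = [2 * g + e for g, e in zip(genotypes[40.0], noise2)]
```

A statistic near zero is consistent with a single shared QTL, while a large value favors two closely linked QTL; the cutoff for significance is left open here.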
  • the testing comprises considering a null test for causality having the relationship: $P(T_2, Q^* \mid T_1) = P(T_2 \mid T_1)\,P(Q^* \mid T_1)$, where each function P is a probability density function; T2 is a trait random variable for the second trait across the plurality of organisms; Q* is a genotype random variable for locus Q in the at least one locus across the plurality of organisms; and T1 is a trait random variable for the first trait across the plurality of organisms.
  • Still another aspect of the invention provides a computer program product for use in conjunction with a computer system.
  • the computer program product comprises a computer readable storage medium and a computer program mechanism embedded therein.
  • the computer program mechanism is for determining whether a first trait T1 is causal for a second trait of interest T2 in a plurality of organisms of a species.
  • the computer program mechanism comprises a T1/T2 overlap module and a causality test module.
  • the T1/T2 overlap module comprises instructions for identifying at least one locus in the genome of the species.
  • Each locus Q in the at least one locus is a site of colocalization for (i) a respective quantitative trait locus (QTL1) linked to (correlated with) a variation in the first trait T1 across the plurality of organisms and (ii) a respective quantitative trait locus (QTL2) that is linked to (correlated with) a variation in the second trait T2 across the plurality of organisms.
  • the causality test module comprises instructions for testing, for one or more locus Q in the at least one locus, whether (i) a genetic variation Q* of the respective locus Q across the plurality of organisms and (ii) the variation in the second trait T2 across the plurality of organisms are correlated conditional on the variation in the first trait T1 across the plurality of organisms.
  • Yet another aspect of the invention provides a computer system for determining whether a first trait T1 is causal for a second trait of interest T2 in a plurality of organisms of a species.
  • the computer system comprises a central processing unit and a memory.
  • the memory is coupled to the central processing unit and stores a Q1/Q2 overlap module and a causality test module.
  • the T1/T2 overlap module comprises instructions for identifying at least one locus in the genome of the species.
  • Each locus Q in the at least one locus is a site of colocalization for (i) a respective quantitative trait locus (QTL1) linked to (correlated with) a variation in the first trait T1 across the plurality of organisms and (ii) a respective quantitative trait locus (QTL2) that is linked to (correlated with) a variation in the second trait T2 across the plurality of organisms.
  • the causality test module comprises instructions for testing, for one or more locus Q in the at least one locus, whether (i) a genetic variation Q* of the respective locus Q across the plurality of organisms and (ii) the variation in the second trait T2 across the plurality of organisms are correlated conditional on the variation in the first trait T1 across the plurality of organisms.
  • Another aspect of the invention provides a method for determining whether a first trait T1 is causal for a second trait T2 in a plurality of organisms of a species.
  • the method comprises identifying a locus Q in the genome of the species that is a site of colocalization for (i) a quantitative trait locus (QTL1) that is genetically linked to (correlated with) a variation in the first trait T1 across all or a portion of the plurality of organisms and (ii) a quantitative trait locus (QTL2) that is genetically linked to (correlated with) a variation in the second trait T2 across all or a portion of the plurality of organisms.
  • a first coefficient of determination is computed between (i) a genetic variation Q* of the locus Q across all or a portion of the plurality of organisms and (ii) the variation in the first trait T1 across the plurality of organisms.
  • a second coefficient of determination is quantified between (i) the genetic variation Q* of the locus Q across the plurality of organisms and (ii) the variation in the first trait T1 across all or a portion of the plurality of organisms, after conditioning on the variation in the second trait T2 across all or a portion of the plurality of organisms.
  • the first trait T1 is causal for the second trait T2 when the first coefficient of determination is other than zero and the second coefficient of determination is zero.
  • the cellular constituent is causal for the trait of interest T when the first coefficient of determination is greater than a predetermined threshold amount, such as 0.03 or 0.10.
  • Still another embodiment of the present invention provides a method for determining whether a cellular constituent is causal for a trait of interest T, the trait of interest T exhibited by at least one organism in a plurality of organisms of a species, the method comprising: (A) identifying a locus Q in the genome of the species that is a site of colocalization for (i) an abundance quantitative trait locus (eQTL) genetically linked to a variation in abundance levels of the cellular constituent across all or a portion of the plurality of organisms, and (ii) a clinical quantitative trait locus (cQTL) that is genetically linked to a variation in the trait of interest T across all or a portion of the plurality of organisms; (B) quantifying a first coefficient of determination between (i) the variation in the clinical quantitative trait locus (cQTL) across all or a portion of the plurality of organisms and (ii) the variation in the trait of interest T across all or a portion of the plurality of organisms; and (C) quantifying a second coefficient
  • Another embodiment of the present invention provides a method for identifying a quantitative trait locus for a trait that is exhibited by a plurality of organisms in a population.
  • the population is divided into a plurality of sub-populations using a classification scheme that classifies each organism in the population into at least one of the subpopulations.
  • the classification scheme is derived from a plurality of cellular constituent measurements for each of a plurality of respective cellular constituents that are obtained from each organism.
  • the classification scheme uses a classifier constructed using boosting.
  • Fig. 1 illustrates a computer system for associating a gene with a trait exhibited by one or more organisms in a plurality of organisms in accordance with one embodiment of the present invention.
  • Fig. 2 illustrates a topology for how causal genes affect pathways that affect a primary disease which, in turn, affects reactive genes.
  • Fig. 3A illustrates possible relationships between quantitative trait loci (QTL), genes and disease traits once the expression of the gene (G) and the disease trait (T) have been shown to be under the control of a common QTL (Q).
  • Fig. 3B illustrates obese and lean animals segregating with the genotypes given at the locus, with up arrows indicating up regulation of the gene, horizontal arrows indicating no differential regulation, and down arrows indicating down regulation.
  • Fig. 3C illustrates an analysis of the observed correlation structure between the locus, gene expression trait, and obesity trait of Fig. 3B under a causal model.
  • Fig. 3D illustrates an analysis of the observed correlation structure between the locus, gene expression trait, and obesity trait of Fig. 3B under a reactive model.
  • Fig. 3E illustrates an analysis of the observed correlation structure between the locus, gene expression trait, and obesity trait of Fig. 3B under an independent model.
  • Fig. 4 illustrates the genomic positions of the cQTL that are linked to (correlated with) the trait omental fat pad masses (OFPM) as well as the eQTL that are linked to (correlated with) expression of the gene HSD1 in a segregating mouse population.
  • Fig. 5 illustrates a potential relationship between a specific QTL (which controls for both the trait OFPM and HSD1 expression), HSD1, and OFPM.
  • Fig. 6 illustrates LOD score curves for HSD1 expression, the trait OFPM, the simultaneous consideration of HSD1 expression and the trait OFPM, as well as OFPM after conditioning on HSD1 expression.
  • Fig. 7 illustrates processing steps for identifying a gene that affects a trait in accordance with one embodiment of the present invention.
  • Fig. 8 illustrates the data structure for phenotypic statistic sets in accordance with one embodiment of the present invention.
  • Fig. 9 illustrates a data structure for storing cellular constituent abundance data in accordance with one embodiment of the present invention.
  • Fig. 10 illustrates the data structure for a cellular constituent expression statistic in accordance with one embodiment of the present invention.
  • Fig. 11 illustrates a data structure for storing cellular constituent abundance data from a plurality of different tissue types in accordance with one embodiment of the present invention.
  • Fig. 12 illustrates a QTL results database in accordance with the present invention.
  • Figs. 13A-13E illustrate several possible genetic relationships.
  • Fig. 14 gives a scatter plot of values for two traits in a hypothetical dataset.
  • Fig. 15 illustrates the results of hypothetical QTL analyses in accordance with the present invention.
  • Fig. 16 illustrates how polymorphism in a multi-cross environment can be used to localize a gene underlying a QTL.
  • Fig. 17 is the amino acid sequence of cytosolic Homo sapiens malic enzyme ME1 (SEQ ID NO: 1).
  • Fig. 18 is the amino acid sequence of the enzyme Mus musculus Mod1 (SEQ ID NO: 2).
  • Fig. 19A illustrates that quantitative trait loci that control the genetic variation in
  • Fig. 20, top panel, shows a scattergram of the OFPM values in grams (X axis) versus Mod1 (SEQ ID NO: 2) mRNA levels as mlratios (Y axis), and the lower panel shows a comparison of Mod1 to the log of the OFPM values (LogOmen).
  • Fig. 21 illustrates scattergrams comparing Mod1 (SEQ ID NO: 2) mlratios (Y axes) to OFPM (top left), subcutaneous fat pad mass (top right), leptin protein levels (bottom left) and insulin protein levels (bottom right), all on the X axis.
  • Fig. 22 illustrates the correlation coefficients of various measures of fat pad masses and adiposity and Mod1 (SEQ ID NO: 2) mRNA levels.
  • Fig. 23 is the amino acid sequence of homo sapiens ME3 (SEQ ID NO: 3).
  • Fig. 24 is the amino acid sequence of homo sapiens ME2 (SEQ ID NO: 4).
  • Fig. 25 illustrates the relative levels of expression of the cytosolic malic enzyme Mod1 (ME1) in various tissues of monkeys.
  • Fig. 26 provides the position of mus musculus Mod1 (SEQ ID NO: 2) in a schematic representation of intermediate metabolism. Above the line 2602 is cytosol, below is mitochondria.
  • Fig. 27 is the nucleic acid sequence of homo sapiens mitochondrial NADP(+)-dependent malic enzyme 3 (NCBI accession number AY424278; SEQ ID NO: 5).
  • Fig. 28 is the nucleic acid sequence of homo sapiens mitochondrial NAD-dependent malic enzyme 2 (NCBI accession number XM_209967; SEQ ID NO: 6).
  • Fig. 29 is the nucleic acid sequence of homo sapiens cytosolic malic enzyme 1 (SEQ ID NO: 7).
  • Fig. 30 is the mus musculus nucleic acid sequence AI506234 (SEQ ID NO: 8).
  • Fig. 31 is the mus musculus nucleic acid sequence NM_011764 (SEQ ID NO: 9).
  • Fig. 32 is the mus musculus amino acid sequence gi:28279474 (SEQ ID NO: 10).
  • Fig. 33 is the mus musculus nucleic acid sequence AY027436 (SEQ ID NO: 11).
  • Fig. 34 is the mus musculus nucleic acid sequence NM_008288 (SEQ ID NO: 12).
  • Fig. 35 is the mus musculus amino acid sequence hydroxysteroid 11-beta dehydrogenase (SEQ ID NO: 13).
  • Fig. 36 is the mus musculus nucleic acid sequence for AK004942 (SEQ ID NO: 14).
  • Fig. 37 is the mus musculus amino acid sequence for Gpx3 (SEQ ID NO: 15).
  • Fig. 38 is the mus musculus nucleic acid sequence for NM_030717 (SEQ ID NO: 16).
  • Fig. 39 is the mus musculus amino acid sequence for Lactb (SEQ ID NO: 17).
  • Fig. 40 is the mus musculus nucleic acid sequence for NM_026508 (SEQ ID NO: 18).
  • Fig. 41 is the mus musculus amino acid sequence for 2410002K23Rik (SEQ ID NO: 19).
  • Fig. 42 is the mus musculus nucleic acid sequence for AK004980 (SEQ ID NO:
  • Fig. 43 is the mus musculus nucleic acid sequence for NM_008194 (SEQ ID NO: 21).
  • Fig. 44 is the mus musculus amino acid sequence for glycerol kinase (Gyk) (SEQ ID NO: 22).
  • Fig. 45 is the mus musculus nucleic acid sequence for NM 308509 (SEQ ID NO:
  • Fig. 46 is the mus musculus amino acid sequence for Lipoprotein lipase (SEQ ID NO: 24).
  • Fig. 47 illustrates how a population can be stratified, with respect to a trait under study, into subpopulations (subtypes) and causal determinants can be identified for each of the subpopulations using the methods of the present invention.
  • Fig. 48 illustrates processing steps for subdividing a disease population P into n subgroups and then subjecting one or more of the n subgroups to quantitative genetic analysis in accordance with another embodiment of the present invention.
  • Fig. 49 illustrates hierarchically clustered genes and extreme fat pad mass mice.
  • Fig. 50 illustrates the results of a QTL analysis of a portion of mouse chromosome 2 in accordance with one embodiment of the present invention.
  • Fig. 51 illustrates the results of a QTL analysis of a portion of mouse chromosome 19 in accordance with one embodiment of the present invention.
  • Fig. 52 illustrates the LOD scores for various obesity related genes.
  • Fig. 53 illustrates processing steps for subdividing a disease population P into n subgroups and then subjecting one or more of the n subgroups to quantitative genetic analysis in accordance with a preferred embodiment of the present invention.
  • Fig. 54 illustrates a data structure that comprises the data used to identify cellular constituents that discriminate a trait under study.
  • Fig. 55 illustrates the classification of a trait of interest into subtraits in accordance with one embodiment of the present invention.
  • Fig. 56 illustrates processing steps for subdividing a population into subgroups in accordance with one embodiment of the present invention.
  • a key goal of biomedical research is to identify the basis of common human diseases.
  • systems and methods for the identification of key drivers of complex traits, including common human diseases, using cellular constituent abundance data in a population are described.
  • Central to such systems and methods is the integration of genetic and cellular constituent abundance (e.g., gene expression) information with clinical trait data to infer causal patterns of association between key drivers and disease phenotypes.
  • Such procedures allow for the objective identification of druggable targets for common human diseases.
  • the present invention provides apparatus and methods for associating genes with complex traits exhibited by one or more organisms in a plurality of organisms of a species. Exemplary organisms include, but are not limited to, plants and animals.
  • exemplary organisms include, but are not limited to plants such as corn, beans, rice, tobacco, potatoes, tomatoes, cucumbers, apple trees, orange trees, cabbage, lettuce, and wheat.
  • exemplary organisms include, but are not limited to animals such as mammals, primates, humans, mice, rats, dogs, cats, chickens, horses, cows, pigs, and monkeys.
  • organisms include, but are not limited to, Drosophila, yeast, viruses, and C. elegans.
  • the gene is associated with the trait by identifying a biological pathway in which the gene product participates.
  • the trait of interest is a complex trait such as a human disease.
  • Exemplary human diseases include, but are not limited to, diabetes, obesity, cancer, asthma, schizophrenia, arthritis, multiple sclerosis, and rheumatosis.
  • the trait of interest is a preclinical indicator of disease, such as, but not limited to, high blood pressure, abnormal triglyceride levels, abnormal cholesterol levels, or abnormal high-density lipoprotein / low-density lipoprotein levels.
  • the trait is low resistance to an infection by a particular insect or pathogen. Additional exemplary diseases are found in Section 5.12, below.
  • Fig. 2 illustrates a hypothetical disease-specific genetic network for disease traits and related co-morbidities.
  • the quantitative trait loci (L Su) and environmental effects (E ⁇ ) (panel 202) represent the most upstream drivers of the disease traits in a given population.
  • a quantitative disease trait in a segregating population can be described as being made up of genetic and environmental components, with or without interactions among the genetic components and/or between the genetic and environmental components.
  • the QTL and environmental effects (202) influence other "causative" mRNAs (C M ) (panel 204) singly or in pathways that can interact in complicated ways (most generally, as a genetic network), but that ultimately lead to the disease state (primary clinical traits).
  • a genetic network can be represented as an acyclic directed graph having nodes and edges, where the nodes represent genes and each respective edge represents confidence that the two nodes, connected by the respective edge, are related as determined by an analysis of genotypic and gene expression data using the methods of the present invention.
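  • As a minimal sketch of the acyclic directed graph representation described above, the following Python fragment stores each edge together with a confidence value and checks that the graph contains no cycle. The gene names, the confidence values, and the function name `is_acyclic` are hypothetical and are used for illustration only; they are not taken from the present disclosure.

```python
from collections import defaultdict


def is_acyclic(edges):
    """Return True if the directed graph {(src, dst): confidence} has no cycle."""
    children = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        children[src].append(dst)
        nodes.update((src, dst))

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in nodes}

    def visit(node):
        color[node] = GRAY
        for child in children[node]:
            if color[child] == GRAY:  # back edge: a cycle exists
                return False
            if color[child] == WHITE and not visit(child):
                return False
        color[node] = BLACK
        return True

    return all(visit(node) for node in nodes if color[node] == WHITE)


# Hypothetical network: nodes are genes, edge values are confidence scores.
network = {
    ("geneA", "geneB"): 0.92,
    ("geneB", "geneC"): 0.75,
    ("geneA", "geneC"): 0.60,
}
network_is_dag = is_acyclic(network)
```

  • In this sketch an edge is a key of the form (source gene, target gene), so the confidence assigned to any pair of connected nodes can be looked up directly.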
  • Variations in the causal mRNAs or in the primary clinical traits can in turn affect reactive mRNAs (R N ) (panel 206) in other pathways that in turn lead to co-morbidities of the disease trait, or they can provide positive/negative feedback control to the causal pathways.
  • the present invention broadens the search to any of the cellular constituents that operate in the causal portion of the genetic network associated with the disease trait (circles 204).
  • Cellular constituents in pathways that are under the control of the same QTL that control the disease trait, and that can be shown to act as transmitters of information from these multiple QTL to the disease trait itself (as opposed to acting as responders to the disease trait), potentially represent key intervention points that can be targeted to modulate the disease trait.
  • the biological/biochemical processes that take place that ultimately lead to the disease state, starting from the most upstream genetic components of the disease detected as QTL are completely hidden from view. Therefore, as depicted in Fig.
  • Gene expression traits and disease traits can be modulated by the same QTL. Therefore, performing genome-wide scans to map eQTL for the gene expression traits allows one to assess the amount of correlation between the gene expression and disease traits that is due to common genetic effects.
  • the QTL provide anchors in the complex network of interactions that lead to disease, and it is this causal information that provides for the opportunity to identify cellular constituents 204 that transmit "information" from single or multiple disease QTL, to the disease trait itself.
  • the QTL can modulate the disease trait through intermediates
  • identifying the intermediates using the combination of genetics and gene expression data has the potential to elucidate key control points in the complex network associated with the disease. Since one of the primary aims of the target discovery process is to identify targets for therapeutic intervention in complex human diseases, it is advantageous to partition cellular constituents (e.g., genes) making up the patterns of expression associated with the disease trait and that are modulated by QTL overlapping the disease trait QTL, into two groups: 1) cellular constituents under the control of the disease QTL that fall between the causal and reactive boundaries depicted in Fig. 2 (cellular constituents 204), and 2) cellular constituents that appear to be reactive to the disease state (cellular constituents 206).
  • the partitioning approach requires that a basic set of causal scenarios be tested to determine whether a cellular constituent under the control of disease QTL is causal for the disease or reactive to it. For each cellular constituent under consideration, first a determination is made as to whether changes in the abundance (e.g., expression) of the cellular constituent are associated with QTL that explain variations in the disease trait. Then a determination is made as to whether the QTL act on the disease trait through the gene. Fig.
  • Pathway 302 represents the simplest causal relationship of a single QTL, Q, for the quantitative trait T, where Q acts on T through cellular constituent G.
  • Pathway 304 represents the simplest reactive diagram for a single QTL, Q, for the quantitative trait T, where in this case the abundance of cellular constituent G is responding to T.
  • the QTL, Q is causative for the trait T and the abundance of cellular constituent G, but acts on these traits independently.
  • Pathway 306 may arise when the QTL, Q, is actually two closely linked, independent QTL rather than a single QTL.
  • Pathway 308 represents a more complicated causal diagram where QTL Q affects the abundance of cellular constituents, and these cellular constituents, in turn, act on the trait T.
  • Pathway 310 represents the ideal causal diagram for target identification, where a number of QTL explain a significant amount of the variation in the trait T, but all of these QTL act on T through a single cellular constituent G.
  • mice with the BB genotype are obese, while 87.5% of the mice with the AA genotype are lean and the other 12.5% are obese. Further, 87.5% of the BB mice have higher transcript levels of a specific gene, while the other 12.5% have unchanged levels, and similarly, 87.5% of the AA mice have lower transcript levels of the same gene, while the other 12.5% have unchanged levels. If the clinical and expression traits were uncorrelated with the genotype at locus L (e.g., not significantly linked to this locus), an equal percentage would be expected for each of the expression/clinical trait combinations for each genotype at locus L. Since this is clearly not true in Fig. 3B, the expression and clinical traits are significantly linked to (correlated with) locus L.
  • Fig. 3C highlights the Causative model, where the correlation between genotype and clinical trait predicted from the model is seen to be consistent with the observed correlation. In one embodiment described below, this scenario will translate into a situation where the correlation between the clinical trait and genotype, given the gene expression state, is seen to be zero. Because the clinical trait and genotype are uncorrelated once we condition on transcript abundances, we can tentatively conclude the mRNA is causal for the clinical trait.
  • Fig. 3D highlights the Reactive model, where the observed correlation between the gene expression trait and genotype is 0.88, but now the correlation between the gene expression trait and genotype given any of the clinical trait values is not equal to 0, i.e., the correlation between the expression trait and genotype predicted from the model does not equal the observed correlation. Because the expression trait and genotypes are still significantly correlated after conditioning on the clinical trait values, it is possible to confirm that the mRNA levels are not responding to the clinical trait. Finally, Fig. 3E highlights the Independent model, where again the correlation between the gene expression and clinical traits predicted from the model is not consistent with the observed correlation.
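  • The conditioning argument above can be illustrated with a first-order partial correlation. The following is a sketch under stated assumptions, not the method of the present invention: the data are simulated under a purely causal chain (genotype acts on expression, expression acts on the trait), genotypes are coded as allele counts 0, 1 or 2, effects are additive with Gaussian noise, and the function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate_causal(n=2000, seed=0):
    """Simulate genotype Q -> expression G -> trait T (assumed toy model)."""
    rng = random.Random(seed)
    # Allele counts with 1:2:1 proportions, as in an F2-like population.
    q = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)]
    g = [qi + rng.gauss(0, 1) for qi in q]   # expression driven by genotype
    t = [gi + rng.gauss(0, 1) for gi in g]   # trait driven by expression
    return q, g, t


q, g, t = simulate_causal()
r_tq = pearson(t, q)                  # trait vs genotype, unconditioned
r_tq_given_g = partial_corr(t, q, g)  # near zero under the causal chain
r_gq_given_t = partial_corr(g, q, t)  # stays clearly non-zero
```

  • Under this toy causal chain, the trait and genotype are correlated, the correlation vanishes once expression is conditioned on, and expression remains correlated with genotype after conditioning on the trait, which is the pattern that distinguishes the Causative model from the Reactive model.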
  • Fig. 1 illustrates a system 10 that is operated in accordance with one embodiment of the present invention.
  • System 10 comprises at least one computer 20 (Fig. 1).
  • Computer 20 comprises standard components including a central processing unit 22, and memory 24 (including high speed random access memory as well as non-volatile storage, such as disk storage) for storing program modules and data structures, user input/output device 26, a network interface 28 for coupling computer 20 to other computers via a communication network (not shown), and one or more busses 34 that interconnect these components.
  • User input/output device 26 comprises one or more user input/output components such as a mouse 36, display 38, and keyboard 8.
  • Memory 24 comprises a number of modules and data structures that are used in accordance with the present invention. It will be appreciated that, at any one time during operation of the system, a portion of the modules and/or data structures stored in memory 24 is stored in random access memory while another portion of the modules and/or data structures is stored in non-volatile storage.
  • memory 24 comprises an operating system 40. Operating system 40 comprises procedures for handling various basic system services and for performing hardware dependent tasks.
  • Memory 24 further comprises a file system 42 for file management. In some embodiments, file system 42 is a component of operating system 40.
  • Step 702. The present invention begins with the step of obtaining genotype data
  • Genotype data 68 comprises the actual alleles for each genetic marker typed in each individual in a plurality of individuals under study. In some embodiments, the plurality of individuals under study is human. Genotype data 68 includes marker data at intervals across the genome under study or in gene regions of interest. In some embodiments, such data is used to monitor segregation or detect associations in a population of interest. Marker data comprises those markers that will be used in the population under study to assess genotypes. In one embodiment, marker data comprises the names of the markers, the type of markers, and the physical and genetic location of the markers in the genomic sequence. Exemplary types of markers include, but are not limited to, restriction fragment length polymorphisms "RFLPs", random amplified polymorphic DNA
  • marker data comprises the different alleles associated with each marker.
  • a particular microsatellite marker consisting of 'CA' repeats can represent ten different alleles in the population under study, with each of the ten different alleles, in turn, consisting of some number of repeats. Representative marker data in accordance with one embodiment of the present invention is found in Section 5.2, below.
  • the genetic markers used comprise single nucleotide polymorphisms (SNPs), microsatellite markers, restriction fragment length polymorphisms, short tandem repeats, DNA methylation markers, sequence length polymorphisms, random amplified polymorphic DNA, amplified fragment length polymorphisms, or simple sequence repeats.
  • step 702 uses pedigree data 70.
  • Pedigree data 70 comprises the relationships between individuals in the population under study. The extent of the relationships between the individuals under study can be as simple as an inbred F2 population, an F1 population, an F2:3 population, a Design III population, or as complicated as extended human family pedigrees.
  • a genetic map is generated from genotype data 68 and pedigree data 70. Such a genetic map includes the genetic distance between each of the markers present in the genotype data 68. These genetic distances are computed using pedigree data 70.
  • the plurality of organisms under study represents a segregating population and pedigree data is used to construct the marker map.
  • genotype probability distributions for the individuals under study are computed. Genotype probability distributions take into account information such as marker information of parents, known genetic distances between markers, and estimated genetic distances between the markers. Computation of genotype probability distributions generally requires pedigree data 70. In some embodiments of the present invention, pedigree data 70 is not provided and genotype probability distributions are not computed. In some embodiments, a genetic map is not computed.
  • Populations derived from multiple founders
  • In some embodiments, the population that is used for the methods illustrated in Fig. 7 is a population that is derived from a select set of strains (e.g., a small, but diverse number of founding mice) or individuals (e.g., the Icelandic population, which was founded by a small to moderate number of individuals).
  • between 2 and 100, between 5 and 500, more than five, or less than 1000 strains of a species diverse with respect to complex phenotypes associated with common human disease are chosen.
  • the species is mice.
  • between 2 and 10 (e.g., 6) strains of mice that are diverse with respect to complex phenotypes associated with common human disease are selected.
  • Representative common human diseases include, but are not limited to, obesity, diabetes, atherosclerosis and associated morbidities, metabolic syndrome, depression / anxiety, osteoporosis, bone development, asthma, and chronic obstructive pulmonary disease.
  • the actual number of founding strains is not as important a factor as ensuring that these "founders" are diverse so as to introduce extensive heterogeneity into the population.
  • the species under study is mice and all or a portion of the following strains are used: B6_DBA GTMs (Jake Lusis, University of California, Los Angeles), B6_CAST GTMs (Jake Lusis, University of California, Los Angeles), B6_DBA Consomics (Joe Nadeau, Case Western Reserve University), AXB recombinant inbred (RI) lines (JAX, Bar Harbor, Maine), BXA RI lines (JAX), LXS RI lines (Rob Williams, University of Tennessee), AKXD RI lines (JAX), 8-way cross mice (Rob Hitzmann, Oregon Health and Science University), D129Sl/SvImJ (JAX), A/J (JAX), C57BL/6J (JAX), BALB/cJ (JAX), C3H/HeJ (JAX), CAST/Ei (JAX), DBA/2J (JAX), NOD/LU (JAX), NZB/B1NJ (JAX), S
  • the species that is selected for study using the methods illustrated in Fig. 7 can be crossed.
  • crosses e.g. F 2 intercrosses
  • six founding strains are used so a total of 15 crosses are performed.
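  • The count of 15 crosses follows from taking every unordered pair of six founders (6 choose 2 = 15). A minimal sketch, assuming that each cross pairs two distinct founding strains and that reciprocal crosses are not counted separately; the strain labels below are placeholders, not strains named in the present disclosure.

```python
from itertools import combinations

# Placeholder labels for six founding strains (hypothetical names).
founders = ["strain_1", "strain_2", "strain_3", "strain_4", "strain_5", "strain_6"]

# Every unordered pair of distinct founders defines one cross.
crosses = list(combinations(founders, 2))
number_of_crosses = len(crosses)
```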
  • other cross designs are used.
  • a backcross or F2 random mating scheme is employed.
  • "random" intercrossing at the F1 level is performed. Such embodiments begin with a predetermined number of parental strains that are crossed in various ways in order to obtain F1 mice. These F1 mice are allowed to breed with any other F1 mice irrespective of the identity of the parents from which such mice were derived.
  • mice from the crosses (for example the mice from the 15 crosses using the 6 founder strains) are collectively treated as a single large pedigree.
  • the final population size that is studied has a size of more than 1,000 organisms, between 100 and 100,000 organisms, less than 500,000 organisms, or, more preferably, between 5,000 and 25,000 organisms. This population is treated as a single large pedigree and genotype information is collected from this population using a standard set of, for example, more than 500 markers.
  • in step 704, the population under study is phenotyped with respect to a trait or traits of interest using quantitative trait loci (QTL) analysis in which a phenotypic statistic set 74, representing the trait of interest, is used as the quantitative trait in the QTL analysis, thereby identifying one or more clinical quantitative trait loci (cQTL) that link to the trait.
  • a cQTL that is linked to (correlated with) a trait of interest is identified using QTL analysis.
  • step 704 is performed by an embodiment of quantitative genetics analysis module 80.
  • a phenotypic statistic set 74 (plurality of phenotypic values) for the trait of interest serves as the clinical trait used in the QTL analysis.
  • each phenotypic statistic set 74 includes a phenotypic value 804 for a given phenotype for each organism in a plurality of organisms under study.
  • a phenotypic value is any form of measurement of a phenotypic trait associated with the trait of interest (e.g., complex disease). For example, if the trait of interest is obesity, a suitable phenotypic trait could include cholesterol level in the blood of the organism. In such an example, the phenotypic value can be milligrams of cholesterol per liter of blood.
  • processing step 704 comprises a classical form of QTL analysis in which a phenotypic trait is quantified to form a phenotypic statistic set.
  • processing step 704 employs a whole genome search of genetic markers using the genotypic data from step 702. For each genotypic position in the genome of the population that is analyzed by genetics analysis module 80, processing step 704 provides a statistical measure (e.g., statistical score), such as the maximum lod score between the genomic position and the phenotypic statistic set 74. Thus, processing step 704 yields all the positions in the genome of the organism of interest that are linked to (correlated with) the phenotypic statistic set 74 tested.
  • in processing step 704, association analysis, as described in Section 5.14, is used rather than linkage analysis.
  • the QTL analysis (Fig. 7, step 704) comprises: (i) testing for linkage between (a) the genotype of a plurality of organisms at a position in the genome of a single species and (b) the phenotypic statistic set 74 (e.g., plurality of phenotypic values), (ii) advancing the position in the genome by an amount, and (iii) repeating steps (i) and (ii) until all or a portion of the genome has been tested.
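  • The test, advance, repeat loop of steps (i) through (iii) can be sketched as follows. This is an illustration under stated assumptions rather than the linkage analysis of the present invention: a simple single-marker regression is swapped in as the test, with the score computed as (n/2)·log10(RSS_null/RSS_marker), genotypes are coded as allele counts, and the genotype lookup, the toy data, and all function names are hypothetical.

```python
import math


def lod_single_marker(genotypes, phenotypes):
    """LOD-style score from regressing phenotype on genotype at one position."""
    n = len(phenotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    rss_null = sum((p - mean_p) ** 2 for p in phenotypes)  # no-QTL model
    if sxx == 0 or rss_null == 0:
        return 0.0  # no genotype or phenotype variation: nothing to test
    rss_marker = rss_null - sxy ** 2 / sxx                 # QTL-at-marker model
    if rss_marker <= 0:
        return float("inf")
    return (n / 2) * math.log10(rss_null / rss_marker)


def genome_scan(genotype_at, phenotypes, start_cm, end_cm, step_cm):
    """(i) test a position, (ii) advance by step_cm, (iii) repeat to end_cm."""
    n_steps = int(round((end_cm - start_cm) / step_cm))
    scores = {}
    for i in range(n_steps + 1):
        position = start_cm + i * step_cm
        scores[position] = lod_single_marker(genotype_at(position), phenotypes)
    return scores


# Toy data: six organisms, genotypes available at three positions (in cM).
toy_genotypes = {
    0.0: [0, 0, 1, 1, 2, 2],
    5.0: [0, 1, 2, 0, 1, 2],
    10.0: [2, 1, 0, 1, 2, 0],
}
toy_phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

scores = genome_scan(toy_genotypes.__getitem__, toy_phenotypes, 0.0, 10.0, 5.0)
best_position = max(scores, key=scores.get)
```

  • In practice the genotype lookup would return observed or inferred genotypes at each tested position; here it is a plain dictionary lookup so that the loop structure stays visible.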
  • the amount advanced in each instance of (ii) is less than 100 centiMorgans, less than 10 centiMorgans, less than 5 centiMorgans, or less than 2.5 centiMorgans, or between 2.5 centiMorgans and 500 centiMorgans.
  • a Morgan is a unit that expresses the genetic distance between markers on a chromosome.
  • a Morgan is defined as the distance on a chromosome in which one recombinational event is expected to occur per gamete per generation.
  • the testing comprises performing linkage analysis (Section 5.13) or association analysis (Section 5.14) that generates a statistical score for the position in the genome of the single species.
  • the testing is linkage analysis and the statistical score is a logarithm of the odds (lod) score (Section 5.4).
  • a cQTL identified in processing step 704 is represented by a lod score that is greater than 2.0, greater than 3.0, greater than 4.0, or greater than 5.0.
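  • A minimal sketch of how a lod score threshold could be applied to a scanned lod score curve follows. It assumes the curve is a mapping from genomic position (in cM) to lod score and that adjacent positions above the threshold belong to the same candidate cQTL; the function name `call_cqtl` and the example values are hypothetical.

```python
def call_cqtl(lod_curve, threshold=3.0):
    """Return (start, end, peak_position, peak_lod) for each run of
    consecutive scanned positions whose lod score exceeds the threshold."""
    regions, current = [], []
    for position in sorted(lod_curve):
        if lod_curve[position] > threshold:
            current.append(position)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return [
        (run[0], run[-1], max(run, key=lod_curve.get), max(lod_curve[p] for p in run))
        for run in regions
    ]


# Hypothetical lod score curve sampled every 5 cM.
example_curve = {0: 0.5, 5: 2.1, 10: 3.4, 15: 4.8, 20: 3.1,
                 25: 1.0, 30: 0.4, 35: 5.2, 40: 0.9}

cqtl_at_3 = call_cqtl(example_curve, threshold=3.0)
cqtl_at_5 = call_cqtl(example_curve, threshold=5.0)
```

  • Raising the threshold from 3.0 to 5.0 in this example keeps only the sharper peak, which mirrors the choice among the thresholds listed above.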
  • a separate phenotypic statistic set 74 is created for the progeny of each cross. For example, consider the case where the phenotypic value under consideration is plasma cholesterol level.
  • phenotypic statistic sets 74 are constructed for plasma cholesterol level, one for the progeny of each of the fifteen crosses. Then, a separate QTL analysis is performed with the progeny of each of the fifteen crosses. For each of these crosses, the phenotypic statistic set 74 associated with the cross is used as the quantitative trait in the QTL analysis. It will be appreciated that a large number of clinical traits can be considered. For each such clinical trait, measurements of the organisms 46 are made. Then, phenotypic statistic sets are created for each clinical trait considered.
  • the phenotypic measurements from the progeny of each cross are used to form a respective phenotypic statistic set 74 that is associated with the cross.
  • the progeny of each cross are subjected to a perturbation prior to phenotyping.
  • this perturbation is a drug treatment, variable diet and/or fasting/refeeding.
  • a phenotypic statistic set 74 is created from the progeny of the crosses prior to quantitative trait loci (QTL) analysis.
  • each such analysis corresponding to the progeny of a different cross in a plurality of crosses, there remains the task of combining the results of each such QTL analysis.
  • the phenotype is plasma cholesterol level and there are fifteen crosses in the population
  • fifteen QTL analyses are performed using plasma cholesterol as the quantitative trait, resulting in fifteen lod score curves across the genome of the species under consideration.
  • the lod score curves for the QTL overlapping in each of the crosses are combined in an additive fashion to assess the overall significance of the QTL over the different crosses.
  • this type of method ignores the relationship between the crosses that exists if they share a common parent.
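  • The additive combination of lod score curves can be sketched as follows. The sketch assumes that every cross was scanned at a common set of positions and, as noted above, it ignores any relationship between crosses that share a parent; the cross labels, the lod values, and the function name are hypothetical.

```python
def combine_lod_curves(curves):
    """Sum lod scores across crosses at positions scanned in every cross."""
    shared_positions = set.intersection(*(set(curve) for curve in curves))
    return {
        position: sum(curve[position] for curve in curves)
        for position in sorted(shared_positions)
    }


# Hypothetical lod score curves (position in cM -> lod) from three crosses.
cross_a = {0: 0.4, 5: 1.8, 10: 0.6}
cross_b = {0: 0.2, 5: 2.1, 10: 0.9}
cross_c = {0: 0.5, 5: 1.6, 10: 0.3}

combined = combine_lod_curves([cross_a, cross_b, cross_c])
peak_position = max(combined, key=combined.get)
```

  • In this example no single cross reaches a lod score of 3.0 at position 5, but the summed curve does, which illustrates why combining crosses can raise the overall significance of an overlapping QTL.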
  • Such matrices assess the probability that any two animals from the different crosses have inherited a common allele at any given position in the genome. These IBD matrices are then used to appropriately weight the different distributions in the phenotype of interest that can arise when the phenotype is linked to (correlated with) a particular region in the genome. For example, regions in which a common allele is likely to have been inherited are downweighted relative to regions in which different alleles are likely to have been inherited.
  • Fig. 15 illustrates how mapping of QTL for clinical traits in a multi-cross environment in this way leads to significantly increased power to detect and localize quantitative trait loci.
  • Fig. 15A represents a QTL analysis when the progeny of a single cross are considered.
  • QTL 1502 in Fig. 15A is only a moderately significant linkage peak. Furthermore, QTL 1502 is broad and encompasses hundreds of genes, making identification of the genes that are causative of the clinical trait difficult.
  • Fig. 15B represents a QTL analysis when the progeny of a plurality of crosses are considered simultaneously.
  • QTL 1504 in Fig. 15B is a very significant linkage peak. Furthermore, QTL 1504 is much narrower than peak 1502, containing tens of genes rather than hundreds of genes.
  • Fig. 16 illustrates how mapping of cQTL for clinical traits independently in the progeny of each cross in a plurality of crosses significantly increases the ability to identify genes underlying a given QTL.
  • a different phenotypic statistic set 74 is constructed for the progeny of each of three crosses and these phenotypic statistic sets 74 are then separately subjected to QTL analysis using genotypic data from progeny of the respective crosses in order to identify cQTL in each of the three populations that link to the clinical trait represented by the three different phenotypic statistic sets 74.
  • the progeny of a first cross are phenotyped and genotyped and this information is compared using a first QTL analysis to find cQTL
  • the progeny of a second cross are phenotyped and genotyped and this information is compared using a second QTL analysis to find cQTL
  • the progeny of a third cross are phenotyped and genotyped and this information is compared using a third QTL analysis to find cQTL.
  • the results of the three separate QTL analyses are shown for a particular portion of the genome of the species under study.
  • Boxed regions 1602, 1604 and 1606 show the polymorphic regions (gene loci that exhibit more than one allele) of the genome in the region where QTL 1608 has been found by the respective QTL analyses.
  • The fact that QTL 1608 is consistently in a polymorphic region in each of the crosses makes it more likely that the QTL is linked to (correlated with) the trait under study.
  • differences in the boundaries of the polymorphic regions help localize where the genes underlying this QTL could be located (e.g., would be localized to a region that is polymorphic in all three strains).
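  • The localization argument above amounts to intersecting the polymorphic intervals observed in each cross. A minimal sketch, assuming each polymorphic region is a closed interval on a common map (in cM) and that exactly one interval per cross overlaps the QTL; the interval boundaries and the function name are hypothetical.

```python
def common_polymorphic_region(regions):
    """Intersect (start, end) intervals; return None if they do not overlap."""
    start = max(region_start for region_start, _ in regions)
    end = min(region_end for _, region_end in regions)
    return (start, end) if start <= end else None


# Hypothetical polymorphic regions around one QTL in three crosses (cM).
region_cross_1 = (10.0, 40.0)
region_cross_2 = (22.0, 55.0)
region_cross_3 = (15.0, 35.0)

candidate_region = common_polymorphic_region(
    [region_cross_1, region_cross_2, region_cross_3]
)
```

  • Here each cross on its own spans 20 cM or more, while the region polymorphic in all three crosses spans 13 cM, so fewer genes remain as candidates.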
  • the embodiments that follow in this paragraph apply to instances where the species under study are mice.
  • the disease of interest is diabetes and/or insulin resistance and the phenotypes that are measured in step 704 include plasma glucose, plasma insulin, insulin glucose, and a glucose tolerance test (GTT).
  • the disease of interest is atherosclerosis, and the phenotypes that are measured in step 704 include aortic lesion and fatty streak (/. levels, /.
  • the disease of interest is obesity and the phenotypes that are measured in step 704 include body weight, anal-nasal length, fat pad weights (e.g., perimetrial fat pad mass, mesenteric omental fat pad mass, subcutaneous fat pad mass, and retroperitoneal fat pad mass), NMR fat mass, NMR muscle mass, leptin levels, food intake, liver weight, glucagon, adiponectin, and IGF-1.
  • the disease of interest is hypertension and the phenotypes that are measured in step 704 include blood pressure and response to angiotensin II.
  • the disease of interest is asthma and chronic obstructive pulmonary disease (COPD) and the phenotypes that are measured in step 704 include airway hyper-responsiveness with and without antigen challenge and airway hyper-responsiveness in mice exposed to smoke for a significant length of time.
  • the trait of interest is plasma lipase activity and the phenotypes that are measured in step 704 include lipoprotein lipase (LPL), hepatic lipase (HL), and endothelial lipase activity.
  • the trait of interest is plasma lipids and the phenotypes that are measured in step 704 include total cholesterol (TC), high-density lipoprotein cholesterol (HDL), very low density lipoprotein / low density lipoprotein (VLDL/LDL), triglycerides, fatty acids, ketone bodies, lactate, LDL oxidation, and HDL protection.
  • the trait of interest is plasma cytokines and the phenotypes that are measured in step 704 include interleukin 6 levels, interleukin 1-beta levels, tumor necrosis factor alpha/gamma (TNF-alpha/gamma), and interleukin 4 levels.
  • the phenotypes that are measured include monocyte isolation from plasma and ELISA or LC-MS for leukotrienes.
  • the disease under study is inflammation and the phenotypes that are measured in step 704 include E06/MDA oxLDL ELISA, lipoprotein properties, macrophage/T cell interactions, and IFN-gamma levels.
  • cardiac-related traits are of interest and the phenotypes that are measured in step 704 include heart brain weight ratio, heart rate / femur length, cardiac fibrosis, and myocardial calcification.
  • bone traits are of interest and the phenotypes that are measured in step 704 include bone density (scans), femur CT BMD, total femur x-ray BMD, total femur x-ray BMC, femur CT-determined BMC, femur diaphyseal BMC, femur diaphyseal BMD, intertrochanteric BMC, intertrochanteric BMD, femur volume by CT, femur x-ray area, femur diaphyseal cortical thickness, femur width at the diaphysis, right and left femur length, right and left tibia length, right and left length of forepaw 1st, 2nd, 3rd, 4th, and 5th digits, right and left humerus length, right and left radius length, right and left ulna length, femur width at the intertrochanteric
  • cellular constituent abundance data 44 (e.g., from a gene expression study or a proteomics study) is obtained for a plurality of cellular constituents from one or more tissues in each member of the population under study.
  • cellular constituent abundance data 44 comprises the processed microarray images for each individual (organism) 46 in a population under study.
  • this data comprises, for each individual 46, cellular constituent abundance information 50 for each cellular constituent 48 represented on the array, optional background signal information 52, and optional associated annotation information 54 describing the probe used for the respective cellular constituent 48 (Fig. 1). See, for example, Section 5.8, below.
  • aspects of the biological state other than the transcriptional state can be measured and used as cellular constituent abundance data.
  • cellular constituent abundance data 44 is, in fact, protein levels for various proteins in the organisms 46 under study.
  • cellular constituent abundance data comprises amounts or concentrations of the cellular constituent in tissues of the organisms under study, cellular constituent activity levels in one or more tissues of the organisms under study, the state of cellular constituent modification (e.g., phosphorylation), or other measurements relevant to the trait under study.
  • the expression level of a gene in an organism in the population of interest is determined by measuring an amount of at least one cellular constituent that corresponds to the gene in one or more cells of the organism.
  • the amount of the at least one cellular constituent that is measured comprises abundances of at least one RNA species present in one or more cells. Such abundances can be measured by a method comprising contacting a gene transcript array with RNA from one or more cells of the organism, or with cDNA derived therefrom.
  • a gene transcript array comprises a surface with attached nucleic acids or nucleic acid mimics. The nucleic acids or nucleic acid mimics are capable of hybridizing with the RNA species or with cDNA derived from the RNA species.
  • the abundance of the RNA is measured by contacting a gene transcript array with the RNA from one or more cells of an organism in the plurality of organisms under study, or with nucleic acid derived from the RNA, such that the gene transcript array comprises a positionally addressable surface with attached nucleic acids or nucleic acid mimics, where the nucleic acids or nucleic acid mimics are capable of hybridizing with the RNA species, or with nucleic acid derived from the RNA species.
  • cellular constituent abundance data 44 is taken from tissues that have been associated with a trait under study. For example, in one nonlimiting embodiment where the complex trait under study is human obesity, cellular constituent abundance data 44 is taken from the liver, brain, or adipose tissues.
  • cellular constituent abundance data 44 is measured from multiple tissues of each organism 46 (Fig. 1) under study.
  • cellular constituent abundance data 44 is collected from one or more tissues selected from the group of liver, brain, heart, skeletal muscle, white adipose from one or more locations, and blood.
  • the data is stored in a data structure such as data structure 78 of Fig. 11. This data structure is described in more detail below.
  • each progeny mouse and a number of parental and F1 mice
  • tissue samples that can be collected for profiling include, but are not limited to, brain (possibly different brain parts), liver, white adipose tissue, skeletal muscle, heart, blood, kidney, lung, intestine, and stomach.
  • expression profiling of at least three of these tissues across some number of animals is performed. This rich set of clinical/biochemical phenotypes and gene expression traits over many tissues across multiple crosses allows for reconstruction of pathways involved in any of the clinical traits represented.
  • the data is transformed into abundance statistics that are used to treat each cellular constituent abundance in cellular constituent abundance data 44 as a quantitative trait.
  • cellular constituent abundance data 44 comprises gene expression data for a plurality of genes (or cellular constituents that correspond to the plurality of genes).
  • the plurality of genes comprises at least five genes.
  • the plurality of genes comprises at least one hundred genes, at least one thousand genes, at least twenty thousand genes, or more than thirty thousand genes.
  • the expression statistics commonly used as quantitative traits in the analyses in one embodiment of the present invention include, but are not limited to, the mean log ratio, log intensity, and background-corrected intensity. In other embodiments, other types of expression statistics are used as quantitative traits.
  • the transformation of cellular constituent abundance data 44 is performed using normalization module 72 (Fig. 1). In such embodiments, the expression levels of a plurality of genes in each organism under study are normalized.
  • Any normalization routine can be used by normalization module 72.
  • Representative normalization routines include, but are not limited to, Z-score of intensity, median intensity, log median intensity, Z-score standard deviation log of intensity, Z-score mean absolute deviation of log intensity, calibration DNA gene set, user normalization gene set, ratio median intensity correction, and intensity background correction.
  • combinations of normalization routines can be used. Exemplary normalization routines in accordance with the present invention are disclosed in more detail in Section 5.3, below.
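Two of the routines named above (a Z-score of log intensity and a median intensity scaling) can be sketched as follows. This is only an illustrative sketch, not the implementation of normalization module 72: the function names, the use of log base 10, and the use of the sample standard deviation are assumptions made for the example.

```python
import math
import statistics

def zscore_log_intensity(intensities):
    """Z-score of log10 intensity for one array (hypothetical helper)."""
    logs = [math.log10(x) for x in intensities]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation (assumed choice)
    return [(value - mean) / sd for value in logs]

def median_scale(intensities):
    """Divide every intensity by the array median (hypothetical helper)."""
    median = statistics.median(intensities)
    return [x / median for x in intensities]

# Made-up raw intensities for three probes on one array.
raw = [10.0, 100.0, 1000.0]
z_scores = zscore_log_intensity(raw)
scaled = median_scale([2.0, 4.0, 8.0])
```

Routines of this kind can be chained, which is one way to read the statement that combinations of normalization routines can be used.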
  • the expression statistics formed from the transformation are then stored in abundance / genotype warehouse 78, where they are ultimately matched with the corresponding genotype information.
  • Step 708 Given gene expression data for a specific tissue of interest in a population that has been genotyped and phenotyped with respect to a disease trait of interest, the next step is to identify all cellular constituents that are significantly associated with the disease trait.
  • a variety of methods can be used to establish associations between cellular constituent abundance and clinical traits, including simple Pearson correlations, basic discriminant analysis, t-tests, and ANOVA, in order to identify those cellular constituent abundance values that discriminate the extremes of the clinical trait, as well as more advanced regression models that specifically assess relationships between cellular constituent abundance values and clinical traits.
  • only the cellular constituents that are differentially expressed in at least ten percent, at least twenty percent, or at least thirty percent of the organisms profiled are considered.
  • the result of step 708 is a set of cellular constituents (association set D) whose abundance levels across the population under study significantly associate with the trait of interest.
  • in step 708, the question is asked, for each cellular constituent, whether its abundance values across the organisms under study (e.g., 100 abundance values for 100 organisms) significantly correlate with the corresponding trait measurement values. As indicated above, a statistical measure, such as the Pearson correlation coefficient between the abundance values and the trait measurements, can be used. If a certain threshold correlation value or other metric is achieved, the cellular constituent is considered significantly associated with the trait. In some embodiments, multiple crosses are considered simultaneously. For the purposes of step 708, the progeny of the multiple crosses can be treated as a single large population.
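The correlation-based screen of step 708 can be illustrated with a short sketch. The constituent names, the data, and the 0.8 cutoff on the absolute Pearson correlation are all hypothetical; the text only requires that some threshold or other metric be met.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def association_set(abundance, trait, threshold=0.8):
    """Constituents whose |r| with the trait meets the (assumed) threshold."""
    return {name for name, values in abundance.items()
            if abs(pearson_r(values, trait)) >= threshold}

# Made-up data: one trait value and one abundance value per organism.
trait = [1.0, 2.0, 3.0, 4.0, 5.0]
abundance = {
    "A": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises with the trait
    "B": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related
    "C": [10.0, 8.0, 6.0, 4.0, 2.0],   # falls with the trait
}
set_d = association_set(abundance, trait)
```

Using the absolute value of the correlation keeps both positively and negatively associated constituents, which is a design choice of this sketch rather than something the text specifies.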
  • the progeny of each cross can be considered independently.
  • an independent determination can be made of the cellular constituents whose abundance levels significantly associate with the trait of interest.
  • the test sets of cellular constituents that associate with the trait in the respective crosses can be combined. For instance, consider the case where cellular constituents A and B significantly associate with the trait in the progeny of a first cross and cellular constituents B and C significantly associate with the trait in the progeny of the second cross.
  • step 708 realizes an association set D comprising cellular constituents A, B, and C.
  • any number of rules can be devised to combine the results when crosses are considered separately in step 708. The case of simple addition (e.g., A, B, and C) has been presented above. Alternatively, only those cellular constituents that are significantly associated with the trait in all the crosses (or a majority of the crosses or some other percentage of the crosses) are placed in association set D.
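The combination rules described above can be written down compactly. The sketch below is illustrative only; the function name and the rule labels "any", "majority", and "all" are hypothetical names for the union, majority, and all-crosses rules.

```python
from collections import Counter

def combine_crosses(per_cross_sets, rule="any"):
    """Combine per-cross association sets into one association set D."""
    counts = Counter(name for cross in per_cross_sets for name in cross)
    n_crosses = len(per_cross_sets)
    if rule == "any":          # simple addition (union)
        needed = 1
    elif rule == "majority":   # more than half of the crosses
        needed = n_crosses // 2 + 1
    elif rule == "all":        # every cross
        needed = n_crosses
    else:
        raise ValueError(f"unknown rule: {rule}")
    return {name for name, k in counts.items() if k >= needed}

# The example from the text: A and B in the first cross, B and C in the second.
crosses = [{"A", "B"}, {"B", "C"}]
union_d = combine_crosses(crosses, "any")
strict_d = combine_crosses(crosses, "all")
```

A percentage-based rule would follow the same pattern, with `needed` computed from the chosen fraction of crosses.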
  • in step 710, a quantitative trait locus (QTL) analysis is performed using data corresponding to each cellular constituent i in association set D. For example, for 1,000 cellular constituents, this results in 1,000 separate QTL analyses.
  • step 710 is performed by quantitative genetics analysis module 80 (Fig. 1).
  • each QTL analysis is performed by quantitative genetics analysis module 80 (Fig. 1).
  • each QTL analysis steps through the genome of the organism of interest. Linkages to the gene under consideration are tested at each step or location along the length of the genome.
  • each step or location along the length of the chromosome is at regularly defined intervals.
  • these regularly defined intervals are defined in Morgans or, more typically, centiMorgans (cM).
  • each regularly defined interval is less than 10 cM, less than 5 cM, or less than 2.5 cM.
  • data corresponding to a cellular constituent selected from discriminating set D is used as a quantitative trait. More specifically, for any given cellular constituent i, the quantitative trait used in the QTL analysis is an abundance statistic set such as set 904 (Fig. 9).
  • Abundance statistic set 904 comprises the corresponding abundance statistic 908 for the corresponding cellular constituent 902 from each organism 906 in the population under study.
  • the exemplary abundance statistic set 904 of Fig. 10 includes the abundance level 908 of a gene G (or cellular constituent that corresponds to gene G) from each organism in a plurality of organisms. For example, consider the case where there are ten organisms in the plurality of organisms, and each of the ten organisms expresses gene G. In this case, abundance statistic set 904 includes ten entries, each entry corresponding to a different one of the ten organisms in the plurality of organisms. Further, each entry represents the abundance level (e.g., expression level) of gene G in the organism represented by the entry.
  • entry "1" (908-G-1) (Fig. 10) corresponds to the abundance level of gene G in organism 1
  • entry "2" (908-G-2) (Fig. 10) corresponds to the abundance level of gene G in organism 2, and so forth.
  • in some embodiments of the present invention, abundance data from multiple tissue samples of each organism 906 (Fig. 1, 46) under study are collected. When this is the case, the data can be stored in the exemplary data structure illustrated in Fig. 11. In Fig. 11, a plurality of cellular constituents 902 are represented. Further, there is an abundance statistic set 904 for each cellular constituent 902. Each abundance statistic set 904 represents an abundance of the corresponding cellular constituent in each of a plurality of organisms 906 (Fig. 1, 46).
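One possible in-memory layout for an abundance statistic set is sketched below. It is not the data structure of Fig. 10 or Fig. 11: the class name, the field names, and the choice to key each statistic by an (organism, tissue) pair are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AbundanceStatisticSet:
    """Abundance statistics (908) for one cellular constituent (902)."""
    constituent: str
    # Maps (organism id, tissue name) -> abundance statistic (assumed keying).
    by_sample: dict = field(default_factory=dict)

    def add(self, organism, tissue, statistic):
        self.by_sample[(organism, tissue)] = statistic

    def trait_vector(self, organisms, tissue):
        """Statistics for one tissue, in the order of the given organisms."""
        return [self.by_sample[(organism, tissue)] for organism in organisms]

# Made-up liver abundance statistics for gene G in three organisms.
gene_g = AbundanceStatisticSet("gene_G")
for organism, level in enumerate([0.8, 1.3, -0.2], start=1):
    gene_g.add(organism, "liver", level)
liver_trait = gene_g.trait_vector([1, 2, 3], "liver")
```

The `trait_vector` accessor returns the values in a fixed organism order, which is the form a QTL analysis needs when the set is treated as a quantitative trait.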
  • each QTL analysis comprises: (i) testing for linkage between a position in a genome and an abundance statistic set 904 (plurality of abundance statistics 908), (ii) advancing the position in the genome by an amount (e.g., less than 100 cM, less than 5 cM), and (iii) repeating steps (i) and (ii) until the entire genome is tested.
  • testing for linkage between a given position in the genome and the abundance statistic set 904 comprises correlating differences in the abundance found in the abundance level statistic with differences in the genotype at the given position using single marker tests (for example using t-tests, analysis of variance, or simple linear regression statistics).
  • abundance statistic set 904 is treated as the phenotype (in this case, a quantitative phenotype)
  • methods such as those disclosed in Doerge, 2002, Mapping and analysis of quantitative trait loci in experimental populations, Nature Reviews: Genetics 3:43-62, may be used.
  • the QTL data produced from each respective QTL analysis comprises a logarithm of the odds score (lod) computed at each position tested in the genome under study.
  • a lod score is a statistical estimate of whether two loci are likely to lie near each other on a chromosome and are therefore likely to be genetically linked.
  • a lod score is a statistical estimate of whether a given position in the genome under study is linked to (correlated with) the quantitative trait corresponding to a given gene. Lod scores are further defined in Section 5.4, below. In some embodiments, a lod score of 2.0 or more is generally taken to indicate that two loci are genetically linked.
  • a lod score of 3.0 or more is generally taken to indicate that two loci are genetically linked. In some embodiments, a lod score of 4.0 or more is generally taken to indicate that two loci are genetically linked.
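A single-marker genome scan of the kind described above can be sketched with simple linear regression. This is a simplified stand-in, not the lod computation of Section 5.4: it assumes genotypes coded 0, 1, or 2, uses the standard regression approximation LOD = (n/2) log10(RSS_null / RSS_marker), and the marker positions and data are made up.

```python
import math

def lod_single_marker(genotypes, trait):
    """Regression-based LOD for one marker (genotypes coded 0/1/2)."""
    n = len(trait)
    mean_g = sum(genotypes) / n
    mean_t = sum(trait) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_gt = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait))
    rss_null = sum((t - mean_t) ** 2 for t in trait)  # intercept-only model
    if s_gg == 0 or rss_null == 0:
        return 0.0  # uninformative marker or constant trait
    # Guard against a perfect fit so the logarithm stays finite.
    rss_alt = max(rss_null - s_gt ** 2 / s_gg, 1e-12 * rss_null)
    return (n / 2) * math.log10(rss_null / rss_alt)

def genome_scan(markers, trait, threshold=3.0):
    """Score every tested position and report those at or above the threshold."""
    scores = {pos: lod_single_marker(g, trait) for pos, g in markers.items()}
    linked = [pos for pos, score in scores.items() if score >= threshold]
    return scores, linked

# Abundance statistic set for one gene (the quantitative trait), ten organisms.
abundance_g = [1.0, 1.2, 0.8, 2.1, 1.9, 2.0, 3.1, 2.9, 3.0, 3.2]
markers = {  # hypothetical positions: (chromosome, cM) -> genotype per organism
    ("chr1", 10.0): [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    ("chr1", 12.5): [0, 1, 2, 0, 1, 2, 0, 1, 2, 1],
}
scores, linked = genome_scan(markers, abundance_g)
```

The threshold argument corresponds to the lod cutoffs of 2.0, 3.0, or 4.0 mentioned above; 3.0 is used here only as a default.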
  • the generation of lod scores requires pedigree data 70. Accordingly, in embodiments in which a lod score is generated, processing step 710 is essentially a linkage analysis, as described in Section 5.13, with the exception that the quantitative trait under study is derived from data, such as cellular constituent expression statistics, rather than classical phenotypes such as eye color. In situations where pedigree data is not available, genotype data 68 from each of the organisms 46 (Fig.
  • in such situations, the genotype data can be compared to each abundance statistic set 904 using allelic association analysis, as described in Section 5.14, below, in order to identify QTL that are linked to (correlated with) each abundance statistic set 904.
  • in association analysis, an affected population is compared to a control population.
  • haplotype or allelic frequencies in the affected population are compared to haplotype or allelic frequencies in a control population in order to determine whether particular haplotypes or alleles occur at significantly higher frequency amongst affected compared with control samples.
  • Statistical tests such as a chi-square test are used to determine whether there are differences in allele or genotype distributions.
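As an illustration of the chi-square comparison of allele distributions, the sketch below tests a 2x2 table of allele counts (affected versus control, allele A versus allele a). The counts are made up, and the closed-form p-value uses the fact that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square test (1 df) for the table [[a, b], [c, d]].

    Rows: affected, control. Columns: allele A, allele a.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square upper tail, 1 df
    return stat, p_value

# Made-up allele counts: 60/40 in the affected group, 40/60 in controls.
stat, p_value = chi_square_2x2(60, 40, 40, 60)
```

Tables with more than two alleles or with haplotype counts need the general Pearson statistic with the appropriate degrees of freedom, which this sketch does not cover.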
  • QTL results database 1200 can be stored in memory 24 of computer 24 (Fig. 1, not shown).
  • QTL results database 1200 comprises all tested positions 1204 in the genome of the organism that were tested for linkage to the quantitative trait (expression statistic 904).
  • genotype data 68 provides the genotype at position 86 for each organism in the plurality of organisms under study.
  • a statistical measure e.g., statistical score 1206
  • the maximum lod score between the position and the abundance statistic 904 is listed.
  • data structure 1200 comprises all the positions in the genome of the organism of interest that are genetically linked to (correlated with) each abundance statistic 904 tested.
  • Step 712 those cellular constituents in association set D that do not have at least one eQTL coincident with at least one cQTL from step 704 form a candidate reactive cellular constituent set (Fig. 2, 206).
  • step 712 is performed by cQTL/eQTL overlap module 82 (Fig. 1). All cellular constituents in association set D that have at least one eQTL coincident with at least one cQTL from step 704 form a candidate causal cellular constituent set (Fig. 2, 204).
  • an eQTL is coincident with a cQTL when the eQTL and the cQTL colocalize within 40 cM of each other, within 30 cM of each other, within 20 cM of each other, within 10 cM of each other, within 3 cM of each other, or within 1 cM of each other in the genome of the species under consideration.
  • to illustrate step 712, consider the case in which the phenotypic statistic set 74 is omental fat pad mass in a mouse population and a QTL analysis in accordance with step 704 yields 5 cQTL with LOD scores over 2.0 located on chromosomes 1 at 111 cM, 5 at 90 cM, 6 at 43 cM, 9 at 8 cM, and 19 at 28 cM. All cellular constituents in association set D that form eQTL at any of these chromosomal locations will be placed in the causal candidate cellular constituent set (Fig. 2, 204). All cellular constituents in association set D that do not form eQTL at any of these chromosomal locations will be placed in the reactive candidate cellular constituent set (Fig. 2, 206).
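The partition performed in step 712 can be sketched using the five cQTL from the example above. The eQTL positions and constituent names are hypothetical, and a 10 cM window is assumed as the coincidence criterion (one of the several windows listed earlier).

```python
# cQTL from the omental fat pad mass example: (chromosome, position in cM).
CQTLS = [(1, 111.0), (5, 90.0), (6, 43.0), (9, 8.0), (19, 28.0)]

def coincident(eqtl, cqtl, window_cm=10.0):
    """True when both QTL lie on the same chromosome within the window."""
    return eqtl[0] == cqtl[0] and abs(eqtl[1] - cqtl[1]) <= window_cm

def partition(eqtl_by_constituent, cqtls, window_cm=10.0):
    """Split association set D into candidate causal and reactive sets."""
    causal, reactive = set(), set()
    for name, eqtls in eqtl_by_constituent.items():
        if any(coincident(e, c, window_cm) for e in eqtls for c in cqtls):
            causal.add(name)
        else:
            reactive.add(name)
    return causal, reactive

# Hypothetical eQTL for three constituents in association set D.
eqtls = {
    "gene_X": [(6, 45.0)],               # 2 cM from the chromosome 6 cQTL
    "gene_Y": [(2, 30.0)],               # no cQTL on chromosome 2
    "gene_Z": [(19, 60.0), (1, 105.0)],  # second eQTL is 6 cM from chr 1 cQTL
}
causal_set, reactive_set = partition(eqtls, CQTLS)
```

A constituent with several eQTL is placed in the causal candidate set as soon as one of them is coincident with a cQTL, matching the "at least one" wording of the step.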
  • Each cellular constituent in the candidate causal cellular constituent set gives rise to at least one eQTL that overlaps with at least one cQTL from step 704 (an eQTL/cQTL overlap).
  • two or more traits (here, an eQTL and a cQTL)
  • gametic phase disequilibrium (also known as linkage disequilibrium)
  • a single gene affecting multiple traits (pleiotropy)
  • the QTL associated with the position of the eQTL and cQTL must truly be common to the clinical and expression trait (due to a pleiotropic effect of a common QTL) rather than simply representing two closely linked QTL (due to linkage disequilibrium between two distinct QTL).
  • a test for pleiotropy is performed. The pleiotropy test determines whether the cQTL linked to (correlated with) the trait under study and the eQTL linked to the cellular constituent under study are statistically indistinguishable.
  • this test is performed by pleiotropy module 84.
  • Jiang and Zeng, 1995, Genetics 140, 1111, devised statistical tests to assess whether the positions are equal. A generalization of this test is implemented in some embodiments of step 714.
  • the aim is to test this null against a more general alternative hypothesis that indicates $p_1 \neq p_2$.
  • the alternative hypotheses of interest can be captured by the following model: where the ⁇ , are distributed as for the pleiotropy model.
  • the null hypothesis can be compared against any of a series of alternative hypotheses.
  • the likelihoods for the two competing models are easily formed, and maximum likelihood methods are then employed to estimate the model parameters ( ⁇ ,, ⁇ j , and ⁇ k ). With the maximum likelihood estimates in hand, the likelihood ratio test statistic can be formed to directly test the null hypothesis against the alternative.
  • There are several alternative hypotheses that can be tested in this setting including:
  • each cellular constituent in the candidate causal cellular constituent set has at least one eQTL that is coincident with a respective cQTL for the trait of interest, where the at least one eQTL passes a test for pleiotropy with the respective cQTL.
  • the pleiotropy test is optional.
  • the cellular constituents in the candidate causative cellular constituent set are optionally rank ordered based upon the amount of genetic variation in the trait of interest that is explained by the eQTL of the cellular constituent that are coincident with cQTL from the trait of interest. More specifically, for each cellular constituent i in the candidate causative cellular constituent set, a determination is made as to the amount of genetic variation in the trait of interest that is explained by the eQTL of the respective cellular constituent i coincident with the cQTL from the trait of interest. Then, the cellular constituents in the candidate causative cellular constituent set are rank ordered based upon the amount of genetic variation in the trait of interest that is explained by each cellular constituent determined in this manner.
  • a cellular constituent i in the candidate causative cellular constituent set has five eQTL.
  • Four of the eQTL overlap with four of the cQTL for the trait of interest.
  • only three of the eQTL pass the test for pleiotropy.
  • only the three eQTL that are coincident with respective cQTL for the trait of interest and that pass the test for pleiotropy described in step 712, above, are used to determine how well they explain the genetic variation in the trait of interest.
  • the determination as to how much the qualifying eQTL of a given cellular constituent explain the genetic variation in the trait of interest is performed using a joint analysis of the trait of interest at each of the qualifying coincident eQTL. This joint analysis leads to a lod score as described by Jiang and Zeng, 1995, Genetics 140, p.
  • Step 718 Steps 702 through 712 define a candidate causative cellular constituent set. Each cellular constituent in this candidate causative cellular constituent set is linked to at least one eQTL that colocalizes with a respective cQTL where, in turn, the respective cQTL is linked to the trait or traits of interest.
  • the quantitative genetic analyses of steps 704 and 712 define at least one locus in the genome of a species for each cellular constituent in the candidate causative cellular constituent set.
  • Step 718 considers each of the loci Q in the at least one locus associated with each respective cellular constituent i in the candidate causative cellular constituent set using a novel causality test in order to determine whether the respective cellular constituent i is causal for the trait or traits of interest.
  • Step 718 tests the cellular constituents in the candidate causative cellular constituent set in a manner that is independent of the pleiotropy test of step 712.
  • the pleiotropy test is designed to determine whether a cQTL and an eQTL that colocalize to a locus Q in the genome of the species under study are truly coincident (a single QTL, in which case the pleiotropy test is satisfied) or whether they are two closely linked QTL (in which case the pleiotropy test fails).
  • the pleiotropy test of step 712 can serve as an important validation that a given locus Q is a requisite site of colocalization of an eQTL and a cQTL.
  • the pleiotropy test does not always give unambiguous results.
  • step 718 is performed by causality test module 88.
  • Step 718 applies a causality test that, in one embodiment, serves to determine whether the genetic variation in each eQTL of a given cellular constituent that is coincident with a cQTL of a trait of interest is correlated with the variation in the trait of interest conditional on an abundance pattern of the cellular constituent i in the plurality of organisms.
  • Scenario 310 represents the situation where a cellular constituent (e.g., gene) is under the control of multiple disease QTL and is still causative for the disease, thereby providing maximal causal information relating to the disease under study.
  • the aim of the causality test is to distinguish the relationships that indicate a cellular constituent is causal for the clinical trait (scenarios 302, 308, and 310 of Fig. 3A) from those in which the cellular constituent is reactive to, or independent of, the disease trait (scenarios 304 and 306, respectively, of Fig. 3A).
  • the test for causality involving QTL, cellular constituent abundance (e.g., gene expression) and disease trait data is based on the same conditional probabilities that underlie mutual information measures that form the basis of the more general Bayesian network reconstruction problems. See, for example, Pearl, 1983, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, Inc., San Francisco.
  • the causality test assesses whether the QTL (Q) and the disease trait (T) are correlated conditional on the cellular constituent abundance trait (G). Genetic linkages for disease and cellular constituent abundance traits give rise to information on causality, thereby restricting the number of relationships to consider since they establish sub-relationships with absolute certainty (e.g., it is known that Q causes variations in G and T). In accordance with the present invention, this restriction allows for a robust statistical test to determine whether scenarios 302, 308, and 310 of Fig. 3A hold over the relationships given by scenarios 304 and 306.
  • if, conditional on G, Q* and T are not independent (e.g., $P(T, Q^* \mid G) \neq P(T \mid G)\,P(Q^* \mid G)$), then one of the relationships given in scenarios 304 and 306 more likely holds (the relationships in these figures can be tested in a like manner).
  • Conditional independence is tested by first forming the likelihood functions based on the conditional probabilities discussed above, for the two competing hypotheses: 1) the null hypothesis that T and Q are independent given G (G is causal for T), and 2) the alternative hypothesis that T and Q are dependent given G (G is not causal for T).
  • the likelihood functions can then be maximized with respect to the parameters of the underlying genetic model, and the likelihood ratio test statistic formed, which in the present case, under the null hypothesis, would be chi-square distributed with two degrees of freedom.
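The final step of such a likelihood ratio test can be sketched as follows. The maximized log-likelihoods are made-up inputs, since the underlying genetic model is not reproduced here; the p-value uses the fact that the upper tail of a chi-square distribution with two degrees of freedom is exp(-x/2).

```python
import math

def likelihood_ratio_test_2df(loglik_null, loglik_alt):
    """Likelihood ratio statistic and p-value against chi-square with 2 df."""
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    p_value = math.exp(-stat / 2.0)  # chi-square upper tail for 2 df
    return stat, p_value

# Hypothetical maximized log-likelihoods for the null (T and Q independent
# given G) and the alternative (T and Q dependent given G).
lrt_stat, lrt_p = likelihood_ratio_test_2df(loglik_null=-105.0, loglik_alt=-100.0)
```

Under the hypotheses as stated above, a small p-value argues against the null, that is, against G being causal for T; a large p-value is consistent with G being causal.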
  • the correlation between T and Q * is considered in terms of a LOD score.
  • Significant correlation between T and Q* is consistent with a significant LOD score for T at position Q.
  • the causality test determines whether there is still a significant LOD score for T at Q. If the LOD score for the QTL drops to zero (e.g., is statistically indistinguishable from zero) after conditioning on G, this indicates G effectively blocks transmission of the information from the QTL to the trait, indicating that scenario 302 (Fig. 3A) is the more likely explanation of the relationship between T and G (or one of the variants given in scenarios 308 or 310 of Fig. 3A).
  • Q is the DNA locus controlling cellular constituent levels and/or clinical traits
  • Q * is a genotype random variable for a locus Q across a population of organisms under study
  • G is cellular constituent level
  • T is the clinical trait.
  • the likelihoods are then maximized with respect to the model parameters, given the genotypic data 68, cellular constituent abundance data 44, and phenotype data 72 (Fig. 1) for the trait (or traits) of interest.
  • These maximum likelihood values are then compared using standard techniques, where the model giving rise to the largest likelihood is declared the best model. To illustrate, consider the case of a particular trait, say X, in which 3.3 percent of the trait's variation is explained by a single QTL.
  • Table 1 below gives the Akaike Information Criterion (AIC) for three models in this case (the AIC value is defined as -2 times the loglikelihood added to two times the number of parameters in the model).
  • the AIC is used to select the "best" model from a list of theoretical functions. See, for example, Akaike Information Criterion Statistics Mathematics and Its Applications, Japanese Series, Sakamoto et al, D. Reidel Pub. Co., January 1987.
  • the model with the smallest AIC value represents the model that best fits the data and therefore has the highest likelihood given the data.
  • causality model 302 provides the best fit to the data, as would be expected given the hypothetical data.
  • the models in the hypothetical case are not nested, and so the standard likelihood ratio test theory does not strictly apply but can be used as an approximate test to determine whether the differences between the AIC values are statistically significant.
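The AIC comparison described above reduces to a few lines of arithmetic, using the definition given in the text (-2 times the log-likelihood plus two times the number of parameters). The log-likelihoods and parameter counts below are invented for illustration and are not the values of Table 1.

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * loglik + 2.0 * n_params

def best_model(fits):
    """Return the name of the model with the smallest AIC, plus all scores."""
    scores = {name: aic(loglik, k) for name, (loglik, k) in fits.items()}
    return min(scores, key=scores.get), scores

# Hypothetical (log-likelihood, number of parameters) per competing model.
fits = {
    "causal (302)": (-120.0, 5),
    "reactive (304)": (-131.0, 5),
    "independent (306)": (-129.0, 6),
}
chosen, aic_scores = best_model(fits)
```

Because the AIC penalizes each additional parameter by two units, a model with more parameters must improve the log-likelihood by more than one unit per parameter to be preferred.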
  • Permutation testing can also be used to assess the significance of the AIC differences. If the trait values are permuted in a way that maintains the correlation between them, but randomizes them with respect to the genotypes, an assessment can be made as to whether the differences obtained from the permuted data are as big as those observed from the actual data. In this present example, 1000 permutations were tested and in no case was the difference between the causal and reactive models as large as it is in Table 1. This example demonstrates the power of the new causality test. It is effectively able to identify a strong causal relationship between two traits that were only moderately associated and weakly linked to a common QTL. To further highlight the utility consideration of genotypic information 68 (Fig.
  • Different likelihood models (causative, reactive, and independent) that are designed to discriminate between causal, reactive and independent relationships between two or more traits have been presented. Further, it was noted in step 712 that an optional pleiotropy test is performed to determine whether two traits are linked to a single QTL or whether they are driven by two independent QTL. However, in some embodiments, the likelihood models of step 718 can be used to make such a determination. For instance, if two traits test as strongly causal or reactive with respect to one another, this indicates that the traits are driven by a single QTL. If the traits are in fact driven by two closely linked, independent QTL, then the causality test would indicate that the independent model is best because the traits would not test as strongly causal or reactive.
  • if the tests indicate causality or reactivity, then it can also be concluded that the two traits are driven by the same QTL. This would hold even if the pleiotropy test described herein could not distinguish whether there were two QTL or one (because the pleiotropy test is dependent on QTL position and the extent of recombination between the two QTL, whereas the causality test is based on correlation between the two traits). If the causality test of step 718 indicates that the independent model is preferred, then it cannot be determined whether one or two QTL are driving the two traits. In such instances, the optional pleiotropy test of step 712 could be used.
  • Maximum likelihood approaches to discriminating between causal, reactive, and independent relationships between two or more traits (e.g., T and G) have been presented in step 718. Further, in step 714 a pleiotropy test for determining whether two traits (e.g., T and G) that appear to be linked to (correlated with) a single QTL are driven by a single QTL, or whether they are driven by two independent QTL, is provided.
  • the causality test can be used directly to determine the relationship between two or more traits.
  • the causality test can be applied to any pair of traits that are linked to (correlated with) a common QTL.
  • G is causal for T.
  • the causality test is not limited to the traits G and T. In other words, there is no requirement that one of the traits considered by the causality test be for variance in cellular constituent abundance and the other trait be variance in a phenotypically observable trait (e.g. an obesity index).
  • the causality test can be more generally applied to any two traits so long as there is some common QTL that genetically links with both traits.
  • the causality test can also be used to determine whether T is causal for G:
  • a determination can be made as to whether T is causal for G and whether G is causal for T. If two traits test as strongly causal or reactive with respect to one another, this argues that the traits are driven by a single QTL (model 302 or 304 of Fig. 3A). If the traits were in fact driven by two closely linked, independent QTL, then the causality test would indicate that the independent model (model 306) was best. In other words, they would not test as strongly causal or reactive.
  • the following table details how the causality test, used in conjunction with the pleiotropy test presented in step 712, can determine whether the causative model (model 302), reactive model (model 304), or independent model (model 306) describes two traits (X and Y) with respect to a QTL Q to which the two traits are linked:
  • Model: X causal for Y (302). Causality test: X causal for Y; Y reactive for X. Indicates that Q is a single QTL that drives X and Y. Pleiotropy test: Test is either satisfied, indicating that Q is a single QTL that drives multiple traits (X and Y), or the test fails to determine whether Q is one QTL or two closely linked QTL (because the test is dependent on QTL position and the extent of recombination between the two QTL).
  • Model: X reactive to Y (304). Causality test: X reactive for Y; Y causal for X. Indicates that Q is a single QTL that drives X and Y. Pleiotropy test: Test is either satisfied, indicating that Q is a single QTL that drives multiple traits (X and Y), or the test fails to determine whether Q is one QTL or two closely linked QTL (because the test is dependent on QTL position and the extent of recombination between the two QTL).
  • Step 720. A determination is made as to whether the cellular constituents in the candidate causative cellular constituent set are druggable.
  • Hopkins and Groom, 2002, Nature Reviews 1, p. 727 provide one definition of a druggable target.
  • Hopkins and Groom identified the molecular targets to rule-of-five compliant compounds. As put forth by Lipinski et al, 1997, Adv. Drug Deliv. Rev.
  • a rule-of-five compliant synthetic compound (e.g., compounds other than those derived from natural products) has less than five hydrogen-bond donors, the molecular mass of the compound is less than 500 Daltons, the lipophilicity is less than 5, and the sum of the nitrogen and oxygen atoms is less than 10.
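  • The four criteria listed above can be expressed as a simple predicate. This is a sketch only: the `Compound` record and the function name are hypothetical, the descriptor values are assumed to have been computed elsewhere, and the strict "less than" comparisons follow the wording of this text (other statements of the rule use non-strict limits, so the thresholds are exposed as parameters).

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record of precomputed descriptors for one compound."""
    h_bond_donors: int
    molecular_mass_da: float
    lipophilicity: float   # for example, a calculated log P value
    n_plus_o_atoms: int    # sum of nitrogen and oxygen atoms


def is_rule_of_five_compliant(c, max_donors=5, max_mass=500.0,
                              max_lipophilicity=5.0, max_n_plus_o=10):
    """Return True when all four criteria, as worded in the text, are met."""
    return (
        c.h_bond_donors < max_donors
        and c.molecular_mass_da < max_mass
        and c.lipophilicity < max_lipophilicity
        and c.n_plus_o_atoms < max_n_plus_o
    )
```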
  • Hopkins and Groom identified 399 non-redundant molecular targets that have been shown to bind rule-of-five compliant compounds with binding affinities below 10 µM.
  • Hopkins and Groom took the drug-binding domains of the 399 non-redundant molecular targets and determined the families that they represent, as captured by their InterPro domain (Hopkins and Groom, 2002, Nature Reviews 1, p.
  • step 720 comprises determining whether each cellular constituent in the candidate causative cellular constituent set includes a druggable domain as defined by Hopkins and Groom.
  • any such definition can be used in optional step 720.
  • Drews, 1996, Nature Biotechnol. 14, 1516 and Drews and Ryser, 1997, Nature Biotechnol. 15, 1318 identified 483 molecular targets and concluded there could be 5,000-10,000 potential targets on the basis of an estimate of the number of disease related genes. See, Drews, 2000, Science 287, 1960.
  • the molecular targets identified by Drews are considered the class of cellular constituents that have a druggable domain.
  • the class of cellular constituents that have a druggable domain are any cellular constituents that are the molecular target of any drug product that has been approved under section 505 of the United States Federal Food, Drug, and Cosmetic Act.
  • Step 722. The cellular constituents in the candidate causative cellular constituent set are ranked and filtered based on the rank assigned in step 716 and/or the results of steps 718 and 720.
  • a purpose of optional step 722 is to reduce the number of cellular constituents under consideration as molecular targets of a therapeutic drug discovery program directed at alleviating the trait under study.
  • optional ranking step 722 serves to prioritize the cellular constituents and/or filter out cellular constituents from the candidate causative cellular constituent set.
  • the only cellular constituents that are allowed to remain in the candidate causal cellular constituent set are those cellular constituents that (i) are highly ranked in step 716, (ii) have the null hypothesis of causality accepted in step 718 for all their associated eQTL that overlap a trait cQTL, and, optionally, (iii) have a druggable domain as determined by step 720.
  • a high rank means within the top 300, top 200, top 20%, or top 10% of the cellular constituents in the candidate causal cellular constituent set.
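  • The filtering rule of step 722 can be sketched as below. The `Candidate` record, its field names, and `filter_candidates` are hypothetical; the outcomes of steps 716, 718, and 720 are assumed to have been computed already and are simply carried as fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    rank: int                          # 1 = best, as assigned in step 716
    causal_for_all_eqtl: bool          # step 718 outcome over overlapping eQTL
    druggable: Optional[bool] = None   # step 720 outcome, None if not assessed


def filter_candidates(candidates, top_n=None, top_fraction=None,
                      require_druggable=False):
    """Keep highly ranked, causal and (optionally) druggable candidates."""
    ordered = sorted(candidates, key=lambda c: c.rank)
    if top_n is not None:
        cutoff = top_n                               # e.g. top 300 or top 200
    elif top_fraction is not None:
        cutoff = max(1, int(len(ordered) * top_fraction))  # e.g. 0.20 or 0.10
    else:
        cutoff = len(ordered)
    kept = []
    for c in ordered[:cutoff]:
        if not c.causal_for_all_eqtl:
            continue
        if require_druggable and not c.druggable:
            continue
        kept.append(c)
    return kept
```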
  • Step 724. The preceding steps describe an analysis of a candidate causal cellular constituent set in order to identify cellular constituents that are causal for a trait of interest.
  • the causality test of step 718 can easily be rewritten to determine whether (i) each eQTL, linked to a trait of interest T, and (ii) a cellular constituent in the candidate causal cellular constituent set, are correlated conditional on the disease trait in the plurality of organisms.
  • the methods of the present invention can be used to determine whether a cellular constituent is reactive to a trait of interest T (first graphical relationship given in Figure 13E).
  • the causality test of step 718 can easily be rewritten to determine whether (i) the trait of interest T, and (ii) a cellular constituent in the candidate causal cellular constituent set are correlated conditional on the QTL common to both traits.
  • This last test determines whether a QTL common to the trait of interest T and cellular constituent trait drives each of the traits independently, so that the cellular constituent trait is neither causal nor reactive to the trait T of interest (second graphical relationship given in Figure 13E).
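  • One simple way to examine whether two traits remain correlated once a common QTL is conditioned on is a partial correlation, sketched below. This is an illustrative linear stand-in, not the likelihood-based test of step 718; it assumes numerically coded genotypes, and the function names are hypothetical. A partial correlation near zero is consistent with the QTL driving each trait independently, whereas a large value suggests that one trait mediates the effect on the other.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals of the least-squares regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def partial_correlation(g, t, q):
    """Correlation between traits g and t after removing the effect of q."""
    return pearson(residuals(g, q), residuals(t, q))
```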
  • Information on which genes are causal and which genes are reactive for a trait of interest can be used to reconstruct a genetic network using Bayesian analysis. Section 5.10, below, outlines methods that can be used to validate the hypothesis that certain cellular constituents are either causal or reactive to a trait of interest.
  • multivariate analysis can be used to determine whether such cellular constituents act in concert, in the form of a biological pathway, in order to affect the trait under study.
  • the degree to which each high ranking cellular constituent makes up a candidate pathway group that affects the trait of interest (or is affected by the trait of interest) is tested by fitting a multivariate statistical model to the eQTL of the high ranking cellular constituents.
  • Multivariate statistical models have the capability to consider multiple quantitative traits simultaneously, model epistatic interactions between the QTL and test other interesting variations that test whether a group of cellular constituents belong to the same or related biological pathway. Specific tests can be done to determine if the traits under consideration are actually controlled by the same QTL (pleiotropic effects) or if they are independent.
  • multivariate statistical analysis can be used to simultaneously consider multiple traits. This is of use to determine whether the traits are genetically linked to each other. Accordingly, in such embodiments, the eQTL of high ranking cellular constituents can be subjected to multivariate statistical analysis in order to determine whether the QTL are all genetically linked. Such an analysis can determine that some of the QTL in the cluster found in the QTL interaction map are, in fact, linked whereas other QTL in the cluster are not linked. Multivariate statistical analysis can also be used to study the same trait from multiple tissues. Multivariate statistical analysis of the same trait from multiple tissues can be used to determine whether genetic linkage varies on a tissue specific basis. Such techniques are of use, for example, in instances where a complex disease has a tissue specific etiology. Exemplary multivariate statistical models that can be used in accordance with the present invention are found in Section 5.6, below.
  • the population under study is subdivided before performing steps 708 through 724 using the methods disclosed in copending application PCT US03/15768, filed May 20, 2003, entitled "Computer Systems and Methods for Subdividing a Complex Disease Into Component Diseases," United States provisional Patent Application Serial Number 60/460,304, filed April 2, 2003, entitled "Computer Systems and Methods for Subdividing a Complex Disease Into Component Diseases," and United States provisional Patent Application Serial Number 60/382,036, filed May 20, 2002, entitled "Computer Systems and Methods for Subdividing a Complex Disease Into Component Diseases."
  • Such a process is illustrated in Fig.
  • Steps 4802 and 4804. The independent extremes of the population with respect to a particular quantifiable phenotype (e.g., complex trait) are identified.
  • an organism is within the group that represents an independent extreme with respect to a particular phenotype (e.g., complex trait) when the magnitude of the particular phenotype exhibited by the organism is greater than the magnitude of the particular phenotype exhibited by at least seventy percent, seventy-five percent, eighty percent, eighty-five percent, or ninety percent of the organisms in a population under study (e.g., plurality of organisms S).
  • Step 4806. Once the independent extremes have been identified, all cellular constituents (e.g., transcripts of genes) with abundances that are able to discriminate between extreme phenotypic groups (independent extremes) with reasonable accuracy are identified. In some embodiments, there are two independent extreme phenotypic groups. In other embodiments, there are more than two independent extreme phenotypic groups.
  • the set of cellular constituents that can discriminate between independent extreme phenotypic groups is referred to in this embodiment as the set of cellular constituents C. Many types of statistical analysis, such as a t-test, can be used to identify cellular constituents in the set G.
  • Step 4808. QTL for the primary trait of interest are identified using standard linkage analysis, such as that described in Section 5.13. That is, the pedigree data for population S, the phenotypic data for the trait of interest, and the genetic marker map for the species under study are used to identify clinical trait QTL (cQTL) that are linked to the trait under study. In embodiments where pedigree information is not available, an association analysis can be used to identify loci that are linked to the trait of interest. Association analysis is described in Section 5.14.
  • Step 4810. Quantitative genetic analysis is performed using each cellular constituent in the set of cellular constituents C.
  • the expression level of a cellular constituent selected from among the set of cellular constituents C serves as a phenotypic trait.
  • Each analysis is performed using quantitative genetic analysis described herein.
  • Each quantitative genetic analysis that uses the abundance data (e.g., expression data) for a given cellular constituent C in population S identifies the expression QTL (loci; eQTL) associated with the cellular constituent.
  • Step 4812. The data obtained in step 4810 is used to select which cellular constituents will remain in discriminating set G. In one embodiment, only those cellular constituents C that have an eQTL (loci) that is linked with a cQTL or that, in fact, overlaps with a cQTL are allowed to remain in set G. Cellular constituents that do not have an eQTL that is linked with a cQTL and do not have an eQTL that overlaps a cQTL are discarded. For clarity, the refined set of cellular constituents is termed "DG" in this and subsequent steps.
  • Step 4814. An optional step can be performed in order to increase the number of cellular constituents in set DG.
  • the abundance patterns of several cellular constituents in the organism under study, across the population under study, are compared to the abundance pattern of any cellular constituent in set DG.
  • Cellular constituents having abundance patterns that are highly correlated with the abundance pattern of a cellular constituent in set DG across population S are added to set DG. More information on how this type of correlation may be computed is found in PCT International Publication WO 00/39338 dated July 6, 2000.
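  • Steps 4812 and 4814 can be sketched together as follows. The sketch makes several assumptions that are not taken from the text: each QTL is represented as a hypothetical (chromosome, start, end) interval, only interval overlap is modelled (not the looser "linked with" criterion), "highly correlated" is read as an absolute Pearson correlation at or above a hypothetical cutoff of 0.9, and all names are illustrative. The correlation method of WO 00/39338 is not reproduced here.

```python
import math


def overlaps(a, b):
    """True when two (chromosome, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def refine_set(eqtl_by_constituent, cqtl):
    """Step 4812 sketch: keep constituents with an eQTL overlapping a cQTL."""
    return {
        name
        for name, eqtls in eqtl_by_constituent.items()
        if any(overlaps(e, c) for e in eqtls for c in cqtl)
    }


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def expand_set(dg, abundance, min_abs_r=0.9):
    """Step 4814 sketch: add constituents correlated with any member of DG."""
    expanded = set(dg)
    for name, profile in abundance.items():
        if name in dg:
            continue
        if any(abs(pearson(profile, abundance[m])) >= min_abs_r for m in dg):
            expanded.add(name)
    return expanded
```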
  • Step 4816. Population S is clustered based on the abundance pattern of cellular constituent set C. Therefore, those organisms in population S that have similar abundance patterns across cellular constituent set C will form clusters.
  • the type of clustering can be any of the various clustering methods described in Section 5.16.
  • the clustering results in a set of clusters (e.g. subgroups) of population S having similar abundance patterns across cellular constituent set C.
  • Step 4818. Next, linkage analysis (Section 5.13) or association analysis (Section 5.14) on the trait of interest is performed using the different identified subgroups. Those subgroups leading to significantly increased cQTL lod scores for the trait of interest are analyzed further. In particular, such subgroups are subjected to a series of quantitative genetic analyses. In each quantitative genetic analysis in the series, the expression level of a cellular constituent selected from among the cellular constituents in set DG is used as a quantitative trait. The end result of this analysis is the identification of eQTL that are linked with the abundance pattern of cellular constituents in set DG across a particular subgroup.
  • a trait is selected for study in a species.
  • the trait is a complex trait.
  • the species can be a plant, an animal, a human, or a bacterium.
  • the species is human, cat, dog, mouse, rat, monkey, pig, Drosophila, or corn.
  • a plurality of organisms representing the species are studied.
  • the number of organisms in the species can be any number.
  • the plurality of organisms studied is between 5 and 100, between 50 and 200, between 100 and 500, or more than 500.
  • a portion of the organisms under study are subjected to a perturbation that affects the trait.
  • the perturbation can be environmental or genetic.
  • Examples of environmental perturbations include, but are not limited to, exposure of an organism to a test compound, an allergen, pain, or hot or cold temperatures. Additional examples of environmental perturbations include diet (e.g., a high fat diet or low fat diet), sleep deprivation, isolation, and quantifying natural environmental influences (e.g., smoking, diet, exercise). Examples of genetic perturbations include, but are not limited to, the use of gene knockouts, introduction of an inhibitor of a predetermined gene or gene product, N-Ethyl-N-nitrosourea (ENU) mutagenesis, siRNA knockdown of a gene, or quantifying a trait exhibited by a plurality of organisms of a species.
  • the perturbation optionally used in step 5302 is selected because of some relationship between the perturbation and the trait.
  • the perturbation could be the siRNA knockdown of a gene that is thought to influence the trait under study.
  • Step 5304. The levels of cellular constituents are measured from the plurality of organisms 46 in order to derive gene expression / cellular constituent data. The identity of the tissue from which such measurements are made will depend on what is known about the trait under study. In some embodiments, cellular constituent measurements are made from several different tissues. Generally, the plurality of organisms 46 exhibit a genetic variance with respect to the trait. In some embodiments, the trait is quantifiable.
  • in instances where the trait is a disease, the trait can be quantified in a binary form (e.g., "1" if the organism has contracted the disease and "0" if the organism has not contracted the disease).
  • the trait can be quantified as a spectrum of values and the plurality of organisms 46 will represent several different values in such a spectrum.
  • the plurality of organisms 46 comprise an untreated (e.g., unexposed, wild type, etc.) population and a treated population (e.g., exposed, genetically altered, etc.).
  • the untreated population is not subjected to a perturbation whereas the treated population is subjected to a perturbation.
  • the tissue that is measured in step 5304 is blood, white adipose tissue, or some other tissue that is easily obtained from organisms 46.
  • the levels of between 5 cellular constituents and 100 cellular constituents, between 50 cellular constituents and 100 cellular constituents, between 300 and 1000 cellular constituents, between 800 and 5000 cellular constituents, between 4000 and 15,000 cellular constituents, between 10,000 and 40,000 cellular constituents, or more than 40,000 cellular constituents are measured.
  • gene expression / cellular constituent data comprises the processed microarray images for each individual (organism) 46 in a population under study. In some embodiments, such data comprises, for each individual 46, intensity information for each gene / cellular constituent represented on the microarray.
  • cellular constituent data is, in fact, protein expression levels for various proteins in a particular tissue in organisms 46 under study.
  • cellular constituent levels are determined in step 5304 by measuring an amount of the cellular constituent in a predetermined tissue of the organism.
  • the term "cellular constituent" comprises individual genes, proteins, mRNA, metabolites and/or any other cellular components that can affect the trait under study.
  • the level of a cellular constituent can be measured in a wide variety of methods.
  • Cellular constituent levels for example, can be amounts or concentrations in tissues of the organisms, their activities, their states of modification (e.g., phosphorylation), or other measurements relevant to the trait under study.
  • step 5304 comprises measuring the transcriptional state of cellular constituents in tissues of organisms.
  • the transcriptional state includes the identities and abundances of the constituent RNA species, especially mRNAs, in the tissue.
  • the cellular constituents are RNA, cRNA, cDNA, or the like.
  • the transcriptional state of the cellular constituents can be measured by techniques of hybridization to arrays of nucleic acid or nucleic acid mimic probes, or by other gene expression technologies.
  • step 5304 comprises measuring the translational state of cellular constituents.
  • the cellular constituents are proteins.
  • the translational state includes the identities and abundances of the proteins in the organisms.
  • whole genome monitoring of protein (i.e., the "proteome," Goffeau et al, 1996, Science 274, p. 546) can be carried out by constructing a microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species found in one or more tissues of the organisms under study. Preferably, antibodies are present for a substantial fraction of the encoded proteins. Methods for making monoclonal antibodies are well known. See, for example, Harlow and Lane, 1998, Antibodies: A Laboratory Manual, Cold Spring Harbor, N.Y. In one embodiment, monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequences.
  • antibody arrays for high-throughput screening of antibody-antigen interactions are used. See, for example, Wildt et al, Nature Biotechnology 18, p. 989.
  • large scale quantitative protein expression analysis can be performed using radioactive (e.g., Gygi et al, 1999, Mol. Cell. Biol 19, p. 1720) and/or stable isotope (15N) metabolic labeling (e.g., Oda et al. Proc. Natl. Acad. Sci. USA 96, p.
  • Two-dimensional gel electrophoresis is well-known in the art and typically involves focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al, 1996, Proc Nat'l Acad. Sci. USA 93, p. 1440; Sagliocco et al, 1996, Yeast 12, p. 1519; Lander 1996, Science 274, p.
  • Electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N-terminal micro-sequencing. See, for example, Gygi, et al, 1999, Nature Biotechnology 17, p. 994.
  • fluorescence two-dimensional difference gel electrophoresis (DIGE) is used. See, for example, Beaumont et al, Life Science News 7, 2001.
  • quantities of proteins in organisms 246 are determined using isotope-coded affinity tags (ICATs) followed by tandem mass spectrometry. See, for example, Gygi et al, 1999, Nature Biotech 17, p. 994. Using such techniques, it is possible to identify a substantial fraction of the proteins expressed in one or more predetermined tissues in organisms 46.
  • step 5304 comprises measuring the activity or post-translational modifications of the cellular constituents in the plurality of organisms 246. See for example, Zhu and Snyder, Curr. Opin. Chem. Biol 5, p. 40; Martzen et al, 1999, Science 286, p. 1153; Zhu et al, 2000, Nature Genet. 26, p.
  • measurement of the activity of the cellular constituents is facilitated using techniques such as protein microarrays. See, for example, MacBeath and Schreiber, 2000, Science 289, p. 1760; and Zhu et al, 2001, Science 293, p. 2101.
  • post-translation modifications or other aspects of the state of cellular constituents are analyzed using mass spectrometry. See, for example, Aebersold and Goodlett, 2001, Chem Rev 101, p. 269; Petricoin III, 2002, The Lancet 359, p. 572.
  • the proteome of organisms 46 under study is analyzed in step 5304.
  • the analysis of the proteome typically involves the use of high-throughput protein analysis methods such as microarray technology. See, for example, Templin et al, 2002, TRENDS in Biotechnology 20, p. 160; Albala and Humphrey-Smith, 1999, Curr. Opin. Mol. Ther. 1, p. 680; Cahill, 2000, Proteomics: A Trends Guide, p. 47-51; Emili and Cagney, 2000, Nat. Biotechnol., 18, p. 393; and Mitchell, Nature Biotechnology 20, p. 225.
  • "mixed" aspects of the amounts of cellular constituents are measured in step 5304.
  • the amounts or concentrations of one set of cellular constituents in the organisms 46 under study are combined with measurements of the activities of certain other cellular constituents in such organisms.
  • different allelic forms of a cellular constituent in a given organism are detected and measured in step 5304. For example, in a diploid organism, there are two copies of any given gene, one descending from the "father" and the other from the "mother." In some instances, it is possible that each copy of the given gene is expressed at different levels. This is of significant interest since this type of allelic differential expression could associate with the trait under study, particularly in instances where the trait under study is complex.
  • Step 5306.
  • cellular constituent data comprises transcriptional data, translational data, activity data, and/or metabolite abundances for a plurality of cellular constituents.
  • the plurality of cellular constituents comprises at least five cellular constituents.
  • the plurality of cellular constituents comprises at least one hundred cellular constituents, at least one thousand cellular constituents, at least twenty thousand cellular constituents, or more than thirty thousand cellular constituents.
  • the expression statistics commonly used as quantitative traits in the analyses in one embodiment of the present invention include, but are not limited to, the mean log ratio, log intensity, and background-corrected intensity derived from transcriptional data.
  • this transformation is performed using a normalization software known in the art.
  • the expression level of each of a plurality of genes in each organism under study is normalized.
  • Any normalization routine can be used by the normalization module.
  • Representative normalization routines include, but are not limited to, Z-score of intensity, median intensity, log median intensity, Z-score standard deviation log of intensity, Z-score mean absolute deviation of log intensity, calibration DNA gene set, user normalization gene set, ratio median intensity correction, and intensity background correction.
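  • Two of the simpler routines named above might look like the sketch below. The text does not define the routines, so the definitions used here are assumptions: "Z-score standard deviation log of intensity" is read as standardizing log10 intensities by their sample mean and standard deviation, and "median intensity" is read as dividing each intensity by the median. Function names are hypothetical, and intensities are assumed to be strictly positive.

```python
import math
import statistics


def z_score_log_intensity(intensities):
    """Standardise log10 intensities to zero mean and unit sample SD."""
    logs = [math.log10(v) for v in intensities]  # requires positive values
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)                  # needs at least two values
    return [(v - mean) / sd for v in logs]


def median_scaled_intensity(intensities):
    """Divide each intensity by the median intensity of the sample."""
    med = statistics.median(intensities)
    return [v / med for v in intensities]
```

  • Because both functions take and return plain lists, one routine can be applied to the output of another when a combination of routines is wanted, provided the intermediate values remain valid inputs (for example, positive before a logarithm is taken).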
  • combinations of normalization routines can be run.
  • Step 5350. In the preceding steps, a trait is identified, cellular constituent level data is measured, and the cellular constituent data is transformed into expression statistics.
  • In step 5350 (Fig. 53A), one or more phenotypes are measured for all or a portion of the organisms 46 in the population under study.
  • Fig. 54 summarizes the data that is measured as a result of steps 5302-5306 and 5350.
  • the first class of data collected is phenotypic information 1301.
  • Phenotypic information 1301 can be anything related to the trait under study.
  • phenotypic information 1301 can be a binary event, such as whether or not a particular organism exhibits the phenotype (+/-).
  • the phenotypic information can be some quantity, such as the results of an obesity measurement for the respective organism 46. As illustrated in Fig.
  • each organism 46 in the population under study is cellular constituent levels 250 (e.g., amounts, abundances) for a plurality of cellular constituents (steps 5304-5306, Fig. 53A).
  • each set of cellular constituents for a respective organism 46 could represent measurements taken from a different tissue in the organisms. For example, one set of cellular constituent measurements could be from a blood sample taken from the respective organism while another set of cellular constituent measurements could be from fat tissue from the respective organism.
  • Step 5352. The phenotypic data 1301 (Fig. 54) collected in step 5350 is used to divide the population (5500) into phenotypic groups 5510 (Fig. 55).
  • the method by which step 5352 is accomplished is dependent upon the type of phenotypic data measured in step 5350. For example, in the case where the only phenotypic data is whether or not the organism 46 exhibits a particular trait, step 5352 is straightforward. Those organisms 46 that exhibit the trait are placed in a first group and those organisms 46 that do not exhibit the trait are placed in a second group. A slightly more complex example is where amounts 1301 represent gradations of a quantified trait exhibited by each organism 46.
  • each amount 1301 can correspond to an obesity index (e.g., body mass index, etc.) for the respective organism 46.
  • organisms 46 can be binned into phenotypic groups 5510 as a function of the obesity index.
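  • Binning organisms by a quantified index can be sketched as follows. The cut points are hypothetical inputs chosen by the analyst, and the function name is illustrative; a value equal to a cut point falls into the higher group.

```python
import bisect


def bin_by_index(index_by_organism, cut_points):
    """Assign each organism to a group number based on sorted cut points.

    Group 0 holds values below the first cut point, group 1 holds values
    from the first cut point up to (but excluding) the second, and so on.
    """
    groups = {}
    for organism, value in index_by_organism.items():
        group = bisect.bisect_right(cut_points, value)
        groups.setdefault(group, []).append(organism)
    return groups
```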
  • a plurality of phenotypic measurements e.g., 2, 3, 4, 5, 8, 10, 20 or more, between 10 and 20, 20 or more, etc
  • each phenotypic measurement for a respective organism can be treated as an element of a phenotypic vector corresponding to the respective organism.
  • phenotypic vectors can then be clustered using, for example, any of the clustering techniques disclosed in Section 5.16 in order to derive phenotypic groups.
  • the organisms are human and phenotypic measurements are derived from a standard 12-lead electrocardiogram graph (ECG).
  • the standard 12-lead ECG is a representation of the heart's electrical activity recorded from electrodes on the body surface.
  • the ECG provides a wealth of phenotypic data including, but not limited to, heart rate, heart rhythm, conduction, wave form description, and ECG interpretation (typically a binary event, e.g., normal, abnormal).
  • Each of these different phenotypes can be quantified as elements in a phenotypic vector. Further, some elements of the phenotypic vector (e.g., ECG interpretation) can be given more weight during clustering. For instance, the ECG measurements can be augmented by additional phenotypes such as plasma cholesterol level, blood triglyceride level, sex, or age in order to derive a phenotypic vector for each respective organism 246. Once suitable phenotypic vectors are constructed, they can be clustered using any of the clustering algorithms in Section 5.16 in order to identify phenotypic groups.
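  • A weighted phenotypic vector and its clustering can be sketched as below. The clustering techniques of Section 5.16 are not reproduced here; a small k-means routine with a deterministic farthest-point start is used purely as a stand-in. The phenotype names, the weights, and both function names are hypothetical.

```python
import math
import statistics


def build_vectors(records, weights):
    """Standardise each phenotype across organisms, then apply its weight."""
    names = list(weights)
    columns = {}
    for name in names:
        values = [r[name] for r in records]
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values) or 1.0  # guard against constant columns
        columns[name] = [weights[name] * (v - mean) / sd for v in values]
    return [[columns[name][i] for name in names] for i in range(len(records))]


def k_means(vectors, k, n_iter=50):
    """Plain k-means; returns one cluster label per vector."""
    # Deterministic start: first vector, then repeatedly the farthest vector.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                new_centroids.append(
                    [statistics.fmean(col) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centre
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels
```

  • Raising the weight of one element (for example, a binary ECG interpretation) stretches that axis, so organisms that differ in that element are pushed further apart before clustering.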
  • the step of identifying phenotypic groups is an iterative process in which various phenotypic vectors are constructed and clustered until a form of phenotypic vector that produces clear, distinct groups is identified.
  • phenotypic vectors that are capable of producing phenotypic groups that are uniquely characterized by certain phenotypes (e.g., an abnormal ECG/ high cholesterol subgroup, a normal ECG/ low cholesterol subgroup).
  • phenotypic vectors that can be iteratively tested include a vector that has ECG data only, one that has blood measurements only, one that is a combination of the ECG data and blood measurements, one that has only select ECG data, one that has weighted ECG data, and so forth.
  • optimal phenotypic vectors can be identified using search techniques, such as stochastic search techniques (e.g., simulated annealing, genetic algorithm). See, for example, Duda et al, 2001, Pattern Recognition, second edition, John Wiley & Sons, New York.
  • Step 5354. Once phenotypic groups have been identified, the phenotypic extremes within the population are identified.
  • Such phenotypic extremes can be referred to as a set of extreme organisms.
  • the trait of interest is obesity.
  • very obese and very lean organisms can be selected as the phenotypic extremes.
  • a phenotypic extreme is defined as the top or lowest 40th, 30th, 20th, or 10th percentile of the population with respect to a given phenotype exhibited by the population.
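  • Selecting the two tails of a phenotype distribution can be sketched as follows. The function name and the `tail_fraction` parameter are hypothetical; the fraction is assumed to be at most 0.5 so that the two tails do not overlap, and ties are broken by sort order.

```python
def phenotypic_extremes(values_by_organism, tail_fraction=0.10):
    """Return (lowest, highest) organisms by phenotype value.

    tail_fraction=0.10 selects the lowest and top 10 percent of organisms.
    """
    ordered = sorted(values_by_organism, key=values_by_organism.get)
    n_tail = max(1, int(len(ordered) * tail_fraction))
    low = ordered[:n_tail]
    high = ordered[-n_tail:]
    return low, high
```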
  • Step 5356. A plurality of cellular constituents for the species represented by the set of extreme organisms is filtered. Only levels measured for phenotypically extreme organisms (the set of extreme organisms) are used in this filtering. To illustrate, consider the case in which a first organism and a second organism represent phenotypic extremes with respect to some phenotype whereas a third organism does not. Then, in this instance, levels measured for the first organism and the second organism will be considered in the filtering whereas levels measured for the third organism will not be considered in the filtering.
  • cellular constituent levels (measured in phenotypically extreme organisms) for a given cellular constituent are subjected to a t-test (or some other test such as a multivariate test) to determine whether the given cellular constituent can discriminate between the extreme phenotypic groups identified above.
  • a cellular constituent will discriminate between extreme phenotypic groups when the cellular constituent is found at characteristically different levels in each of the phenotypic groups.
  • a cellular constituent will discriminate between the two groups when levels of the cellular constituent (measured in phenotypically extreme organisms) are found at a first level in the first phenotypic group and are found at a second level in the second phenotypic group, where the first and second level are distinctly different.
  • each cellular constituent is subjected to a t-test and/or a corresponding non-parametric test such as the Wilcoxon signed-rank test without consideration of the other cellular constituents in the organism.
  • groups of cellular constituents are compared in a multivariate analysis in order to identify those cellular constituents that discriminate between phenotypic groups.
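The per-constituent filtering described above can be sketched as follows. The sketch makes several assumptions that are not in the source: it uses Welch's form of the t statistic, it filters on a fixed cut-off `t_cutoff` for |t| rather than on a p-value, and the gene names and levels are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples.
    Assumes each sample has at least two values and nonzero pooled variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def discriminating_constituents(levels, group1, group2, t_cutoff=3.0):
    """Return {constituent: t} for constituents whose |t| between the two
    extreme groups is at least t_cutoff (the cut-off is an assumption).

    levels: {constituent: {organism: level}}; group1, group2: organism IDs.
    """
    selected = {}
    for name, by_org in levels.items():
        t = welch_t([by_org[o] for o in group1], [by_org[o] for o in group2])
        if abs(t) >= t_cutoff:
            selected[name] = t
    return selected


# Hypothetical levels measured only in phenotypically extreme organisms.
levels = {
    "geneA": {"o1": 5.1, "o2": 4.9, "o3": 5.0, "o4": 1.0, "o5": 1.2, "o6": 0.9},
    "geneB": {"o1": 2.0, "o2": 3.1, "o3": 2.4, "o4": 2.2, "o5": 2.9, "o6": 2.5},
}
hits = discriminating_constituents(levels, ["o1", "o2", "o3"], ["o4", "o5", "o6"])
print(sorted(hits))
```

Each constituent is tested on its own here, which corresponds to the univariate case; a multivariate analysis would instead score groups of constituents jointly.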
  • Step 5358. Typically, there will be a large number of cellular constituents expressed in phenotypically extreme organisms that appear to differentiate between the phenotypic groups. In some instances, this number of cellular constituents can exceed the number of organisms available for study. For instance, in some embodiments, 25,000 genes or more are considered in previous steps. Thus, there may be hundreds if not thousands of genes that discriminate the phenotypically extreme groups. In some instances, these discriminating cellular constituents are analyzed in subsequent steps with statistical models that involve many statistical parameters and that cannot accommodate more cellular constituents than organisms, as this leads to an underdetermined system (more parameters than observations). In such instances, it is desirable to reduce the number of cellular constituents using a reducing algorithm.
  • reducing algorithms that are optionally used can involve use of the p-value or other form of metric computed for each cellular constituent as a basis for reducing the dimensionality of the previously identified cellular constituent set.
  • a few exemplary reducing algorithms will be discussed. However, those of skill in the art will appreciate that many reducing algorithms are known in the art and all such algorithms can be used.
  • One reducing algorithm is stepwise regression.
  • stepwise regression involves (1) identifying an initial model (e.g., an initial set of cellular constituents), (2) iteratively "stepping," that is, repeatedly altering the model at the previous step by adding or removing a predictor variable (cellular constituent) in accordance with the "stepping criteria," and (3) terminating the search when stepping is no longer possible given the stepping criteria, or when a specified maximum number of steps has been reached.
  • Forward stepwise regression starts with no model terms (e.g., no cellular constituents). At each step the regression adds the most statistically significant term until there are none left.
  • Backward stepwise regression starts with all the terms in the model and removes the least significant cellular constituents until all the remaining cellular constituents are statistically significant.
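Forward stepwise regression, as outlined above, can be sketched in a few lines. This is an illustration under stated assumptions: the stepping criterion used here is the relative reduction in residual sum of squares (`min_gain`), which stands in for a formal significance test, and the names `ols_rss`, `forward_stepwise`, `geneX`, and `geneY` are hypothetical.

```python
def ols_rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for col in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))


def forward_stepwise(predictors, y, min_gain=0.05, max_steps=10):
    """Start with no terms; at each step add the predictor that lowers the RSS
    most, stopping when the relative gain drops below min_gain (an assumed
    stand-in for a significance-based stepping criterion)."""
    chosen, remaining = [], list(predictors)
    total = ols_rss([], y)  # total sum of squares; assumes y is not constant
    rss = total
    for _ in range(max_steps):
        if not remaining:
            break
        trial = {name: ols_rss([predictors[c] for c in chosen + [name]], y)
                 for name in remaining}
        best = min(trial, key=trial.get)
        if (rss - trial[best]) / total < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        rss = trial[best]
    return chosen


# Hypothetical data: the trait depends on geneX only.
predictors = {"geneX": [1, 2, 3, 4, 5, 6], "geneY": [2, 1, 2, 1, 2, 1]}
trait = [3, 5, 7, 9, 11, 13]
chosen = forward_stepwise(predictors, trait)
print(chosen)
```

Backward stepwise regression would run the same loop in reverse, starting from all terms and dropping the one whose removal costs the least.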
  • all-possible-subset regression can be used in conjunction with stepwise regression.
  • the stepwise regression search approach presumes there is a single "best" subset of cellular constituents and seeks to identify it.
  • the range of subset sizes that could be considered to be useful is made. Only the "best" of all possible subsets within this range of subset sizes are then considered.
  • principal component analysis (PCA) seeks a projection that best represents the data in a least-squares sense.
  • multiple discriminant analysis (MDA) seeks a projection that best separates the data in a least-squares sense.
  • the ultimate goal is to identify a classifier derived from the previously identified set of cellular constituents or a subset of the cellular constituents identified in step 1256 that satisfactorily classifies organisms into the phenotypic groups.
  • stochastic search methods such as simulated annealing can be used to identify such a classifier or subset.
  • each cellular constituent under consideration can be assigned a weight in a function that assesses the aggregate ability of the set of cellular constituents identified to discriminate the organisms into the phenotypic classes. During the simulated annealing algorithm these weights can be adjusted. In fact, some cellular constituents can be assigned a zero weight and, therefore, be effectively eliminated during the anneal thereby effectively reducing the number of cellular constituents used in subsequent steps.
  • Other stochastic methods that can be used include, but are not limited to, genetic algorithms. See, for example, the stochastic methods in Chapter 7 of Duda et al., 2001, Pattern Classification, second edition, John Wiley & Sons, New York.
  • Step 5360. The cellular constituents identified in previous steps are clustered in order to further identify subgroups within each phenotypic subpopulation.
  • an expression vector is created for each cellular constituent under consideration.
  • the level measured for the respective cellular constituent in each of the phenotypically extreme organisms is used as an element in the vector. For example, consider the case in which an expression vector for a first cellular constituent 48-1 is to be constructed from organisms 46-1, 46-2, and 46-3. Levels 50-1-1, 50-2-1, and 50-3-1 would serve as the three elements of the expression vector that represents cellular constituent 48-1.
  • The expression vectors are then clustered using, for example, any of the clustering techniques described in Section 5.16.
  • k-means clustering (Section 5.16.2) is used.
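The construction and clustering of expression vectors can be sketched as follows. This is a minimal k-means (Lloyd's algorithm) written for illustration; seeding the centroids with the first k vectors is a simplifying assumption, and the level values are invented. The keys reuse the 48-n labelling from the example above, with one vector element per organism.

```python
def kmeans(vectors, k, max_iter=100):
    """Minimal k-means. Returns one cluster label per input vector.
    Centroids are seeded with the first k vectors (a simplifying assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# One expression vector per cellular constituent; elements are the levels
# measured in organisms 46-1, 46-2, and 46-3 (values are hypothetical).
expression_vectors = {
    "48-1": [1.0, 1.1, 0.9],
    "48-2": [5.0, 5.2, 4.9],
    "48-3": [1.2, 0.9, 1.0],
    "48-4": [4.8, 5.1, 5.0],
}
cluster_labels = kmeans(list(expression_vectors.values()), k=2)
clusters = dict(zip(expression_vectors, cluster_labels))
print(clusters)
```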
  • a benefit of clustering is that it refines the trait under study into groups that are not distinguishable using gross observable phenotypic data (other than cellular constituent levels).
  • the optional clustering provides a way to refine the definition of the clinical trait under study by focusing on those cellular constituents that actually give rise to the clinical trait or well reflect the varied biochemical response to that trait.
  • the refinement provided by clustering can be considered incomplete because it is based on only a select portion of the general population under study, those organisms that represent the phenotypic extremes. For this reason, pattern classification techniques are used in subsequent steps of the instant method to build a robust classifier that is capable of classifying the general population into subgroups in a manner that does not rely upon phenotypic levels.
  • Step 5364. Building a classifier.
  • the set of cellular constituents identified as discriminators between phenotypic extremes (or principal components derived from such cellular constituents) is used to build a classifier.
  • This set of cellular constituents actually refines the definition of the clinical phenotype under study.
  • a number of pattern classification techniques can be used to accomplish this task, including, but not limited to, Bayesian decision theory, maximum-likelihood estimation, linear discriminant functions, multilayer neural networks, supervised learning, unsupervised learning, boosting and adaptive boosting.
  • the set of cellular constituents that discriminate the phenotypically extreme organisms into phenotypic groups is used to train a neural network using, for example, a back-propagation algorithm.
  • the neural network serves as a classifier.
  • the neural network is trained with a probability distribution derived from the set of cellular constituents that discriminate the phenotypically extreme organisms into phenotypic groups.
  • the probability distribution comprises each cellular constituent t-value, p-value or other computed statistic.
  • the neural network is used to classify the general population into phenotypic groups.
  • the neural network that is trained is a multilayer neural network. In other embodiments, a projection pursuit regression, a generalized additive model, or a multivariate adaptive regression spline is used.
  • Bayesian decision theory can be used to build a classifier. Bayesian decision theory plays a role when there is some a priori information about the things to be classified.
  • a probability distribution derived from the set of cellular constituents that discriminate the phenotypically extreme organisms into phenotypic groups serves as the a priori information.
  • this probability distribution comprises each cellular constituent p-value or other computed statistic.
  • linear discriminant analysis functions
  • linear programming algorithms or support vector machines are used to create a classifier that is capable of classifying the general population of organisms into phenotypic groups. This classification is based on the cellular constituent data 50 for the cellular constituents 48 that refined the definition of the clinical phenotype (i.e., the cellular constituents selected in any of the preceding steps).
  • boosting methods are used to create a classifier based upon the set of cellular constituents identified as discriminators between phenotypic extremes or based upon principal components derived from such cellular constituents.
  • An exemplary boosting method that can be used in the present invention is described by Freund and Schapire, 1997, Journal of Computer and System Sciences 55, pp. 119-139. The technique is used as follows.
  • extreme phenotype 1 e.g., obese
  • extreme phenotype 2 e.g., lean
  • $G(X)$ produces a prediction taking one of the two values in the set {extreme phenotype 1, extreme phenotype 2}.
  • the error rate on the training sample is $\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq G(x_i))$, where $y_i$ is the known phenotypic class of organism $i$, $x_i$ is its vector of predictor values, and $I(\cdot)$ is the indicator function.
  • $N$ is the number of organisms in the training set (the sum total of the organisms that have either extreme phenotype 1 or extreme phenotype 2). For example, if there are 49 obese and 72 lean organisms under study, $N$ is 121.
  • $\alpha_1, \alpha_2, \ldots, \alpha_M$ are computed by the boosting algorithm and their purpose is to weigh the contribution of each respective $G_m(x)$. Their effect is to give higher influence to the more accurate classifiers in the sequence.
  • the data modifications at each boosting step consist of applying weights $w_1, w_2, \ldots, w_N$ to the training observations.
  • the current classifier $G_m(x)$ is induced on the weighted observations at line 2a.
  • the resulting weighted error rate is computed at line 2b.
  • Line 2c calculates the weight $\alpha_m$ given to $G_m(x)$ in producing the final classifier $G(x)$ (line 3).
  • the individual weights of each of the observations are updated for the next iteration at line 2d.
  • Observations misclassified by $G_m(x)$ have their weights scaled by a factor $\exp(\alpha_m)$, increasing their relative influence for inducing the next classifier $G_{m+1}(x)$ in the sequence.
  • boosting methods are used. See, for example, Hastie et al., The Elements of Statistical Learning, 2001, Springer, New York, Chapter 10. In some embodiments, boosting or adaptive boosting methods are used.
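The description above cites lines 1, 2a to 2d, and 3 of an algorithm listing that does not appear in this text. The sketch below is one reading consistent with that description, following the AdaBoost.M1 formulation in Chapter 10 of Hastie et al. The weak learner (a one-feature decision stump), the coding of the two extreme phenotypes as +1 and -1, the clamp on the error rate, and all data are assumptions made for illustration.

```python
from math import exp, log


def fit_stump(X, y, w):
    """Weighted decision stump: returns (weighted error, feature, threshold, sign)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] > thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign


def adaboost_m1(X, y, rounds=5):
    """y uses +1 for extreme phenotype 1 and -1 for extreme phenotype 2."""
    n = len(X)
    w = [1.0 / n] * n                                            # line 1
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)                               # line 2a
        miss = [stump_predict(stump, r) != yi for r, yi in zip(X, y)]
        err = sum(wi for wi, m in zip(w, miss) if m) / sum(w)    # line 2b
        err = min(max(err, 1e-10), 1 - 1e-10)  # assumed guard against log(0)
        alpha = log((1 - err) / err)                             # line 2c
        w = [wi * exp(alpha) if m else wi for wi, m in zip(w, miss)]  # line 2d
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, row):
    """Final classifier: sign of the alpha-weighted vote (line 3)."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical levels of two cellular constituents in six extreme organisms.
X = [[5.1, 2.0], [4.8, 3.1], [5.3, 2.4], [1.0, 2.2], [1.2, 2.9], [0.9, 2.5]]
y = [1, 1, 1, -1, -1, -1]
model = adaboost_m1(X, y)
print([predict(model, row) for row in X])
```

The toy data are separable by a single stump, so every round selects the same weak learner; on harder data the reweighting at line 2d shifts later stumps toward the organisms that earlier stumps misclassified.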
  • An embodiment of the present invention provides a method for identifying a quantitative trait locus for a trait that is exhibited by a plurality of organisms in a population. In the method, the population is divided into a plurality of sub-populations using a classification scheme that classifies each organism in the population into at least one of the subpopulations. The classification scheme is derived from a plurality of cellular constituent measurements for each of a plurality of respective cellular constituents that are obtained from each organism.
  • the classification scheme uses a classifier constructed using any of the boosting techniques described above.
  • the method further comprises performing quantitative genetic analysis on the sub-population in order to identify the quantitative trait locus for the trait.
  • modifications of Freund and Schapire, 1997, Journal of Computer and System Sciences 55, pp. 119-139 are used.
  • feature preselection is performed using a technique such as the nonparametric scoring methods of Park et al, 2002, Pac. Symp. Biocomput. 6, 52-63.
  • Feature preselection is a form of dimensionality reduction in which the genes that discriminate between classifications the best are selected for use in the classifier.
  • the LogitBoost procedure introduced by Friedman et al, 2000, Ann Stat 28, 337-407 is used rather than the boosting procedure of Freund and Schapire.
  • the boosting and other classification methods of Ben-Dor et al, 2000, Journal of Computational Biology 7, 559-583 are used in the present invention.
  • the boosting and other classification methods of Freund and Schapire, 1997, Journal of Computer and System Sciences 55, 119-139 are used.
  • the support vector machine classification methods of Furey et al., 2000, Bioinformatics 16, 906-914 are used.
  • Step 5366. Classifying the population.
  • the classifier derived above is used to classify all or a substantial portion (e.g., more than 30%, more than 50%, more than 75%) of the population under study. Essentially, the classifier bins the remaining population (the portions of the population that do not include the phenotypic extremes) without taking their phenotype into consideration.
  • the process of using the classifier to classify the general population produces phenotypic classifications (phenotypic subgroups). Phenotypic subgroups can be considered a refinement of the trait under study and subsequently used in analysis of the underlying biochemical processes that differentiate the trait under study into groups using the techniques disclosed below.
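Classifying the remaining population with a classifier trained only on the extremes can be sketched as follows. A nearest-centroid rule is used here purely as a simple stand-in for whichever classifier was built in step 5364; the function names, organism IDs, and levels are hypothetical.

```python
def train_centroids(training):
    """training: list of (level_vector, group_label) pairs taken from the
    phenotypically extreme organisms. Returns {group_label: centroid}."""
    grouped = {}
    for vector, label in training:
        grouped.setdefault(label, []).append(vector)
    return {label: [sum(col) / len(vs) for col in zip(*vs)]
            for label, vs in grouped.items()}


def classify(centroids, vector):
    """Assign the group whose centroid is nearest (squared Euclidean distance).
    Only cellular constituent levels are used, never the organism's phenotype."""
    return min(centroids,
               key=lambda g: sum((a - b) ** 2 for a, b in zip(vector, centroids[g])))


# Levels of two discriminating constituents in the extreme organisms.
extremes = [([5.0, 1.0], "obese"), ([5.4, 0.8], "obese"),
            ([1.0, 4.9], "lean"), ([0.8, 5.3], "lean")]
centroids = train_centroids(extremes)

# Organisms from the middle of the phenotype distribution (hypothetical).
population = {"org7": [4.1, 2.0], "org8": [2.2, 3.9], "org9": [3.4, 2.4]}
assigned = {org: classify(centroids, levels) for org, levels in population.items()}
print(assigned)
```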
  • Step 5368. Using the classifier.
  • cellular constituents that are differentially expressed in phenotypically extreme organisms are identified.
  • This set of cellular constituents is used to construct a classifier.
  • the classifier classifies the trait under study into subgroups without consideration of phenotypic data. It is expected that these subgroups define subgroups of the trait under study and that each of the subgroups defines a homogenous biochemical form of the trait under study. Regardless of its form, the classifier formed in the inventive methods serves to further refine the phenotypic subgroups. As such, the methods disclosed in this section can be used to refine a trait under study. At the outset, the trait under study is exhibited by some population of organisms 46.
  • Observation of gross (visible, measurable) phenotypes (other than cellular constituent levels) related to the trait are used to divide the general population into two or more phenotypic groups.
  • Optional clustering of select cellular constituents serves to refine a phenotypic group into subphenotypic groups.
  • a benefit of the clustering is that it refines the trait under study into subgroups that are not distinguishable using gross observable phenotypic data (other than cellular constituent levels).
  • the clustering provides a way to refine the definition of the clinical trait under study by focusing on those cellular constituents that actually give rise to the clinical trait or well reflect the varied biochemical response to that trait.
  • the refinement provided by the clustering is incomplete because it is based on only a select portion of the general population under study, those organisms that represent phenotypic extremes. Accordingly, a more robust classifier is built using the initial set of cellular constituents selected based upon phenotypically extreme organisms 46 as a starting point. This derived classifier classifies the trait under study into highly refined subgroups. Thus, although only gross categories were used to develop the classifier, the classifier will split the population into clusters that can fall within highly refined subgroups. Each of these highly refined subgroups serves to refine the trait under study. In other words, each of the highly refined subgroups is a more homogenous form of the overall trait under study.
  • each identified subgroup represents a more homogenous subpopulation with respect to the trait of interest.
  • These homogenous subpopulations can then be studied using approaches such as quantitative genetic approaches.
  • Sections 5.1.1.1 and 5.1.1.2 provide methods for identifying subgroups of a population. These subgroups are then tested to determine whether the relationship between cQTL for a trait of interest are stronger (have higher lod scores) in a subgroup than in the population as a whole. These methods make use of techniques such as clustering, building classifiers and the like. However, some embodiments of the present invention contemplate more formal mathematical methods for identifying subgroups involving specific mathematical modeling of the subgroup identification process and cQTL assessment process so that they are linked together.
  • subdividing algorithms are contemplated that couple the magnitude of cQTL lod scores for the trait of interest with the subgroup identification process in such a way that such cQTL lod scores can actually be used to refine the subgroups.
  • Bayesian approaches in which eQTL lod scores are used to refine subgroup populations are used.
  • SUBDIVIDING USING CLUSTERING
  • The following embodiment makes reference to Fig. 56.
  • a species is studied.
  • the species can be, for example, a plant, animal, human, or bacteria.
  • the species is human, cat, dog, mouse, rat, monkey, pigs,
  • a plurality of organisms representing the species is studied.
  • the number of organisms in the species can be any number.
  • the plurality of organisms studied is between 5 and 100, between 50 and 200, between 100 and 500, or more than 500 organisms.
  • the plurality of organisms are an $F_2$ intercross, an $F_t$ population (formed by randomly mating $F_1$s for $t-1$ generations), an $F_{2:3}$ design ($F_2$ individuals are genotyped and then selfed), or a Design III ($F_2$ from two inbred lines are backcrossed to both parental lines).
  • organisms 246 Fig.
  • the perturbation can be environmental or genetic.
  • environmental perturbations include, but are not limited to, exposure of an organism to a test compound, an allergen, pain, and hot or cold temperatures.
  • Additional examples of environmental perturbations include diet (e.g. a high fat diet or low fat diet), sleep deprivation, isolation, and quantifying natural environmental influences (e.g., smoking, diet, exercise).
  • Examples of genetic perturbations include, but are not limited to, the use of gene knockouts, introduction of an inhibitor of a predetermined gene or gene product, N-Ethyl-N-nitrosourea (ENU) mutagenesis, siRNA knockdown of a gene, or quantifying a trait exhibited by a plurality of organisms of a species.
  • Various siRNA knock-out techniques (also referred to as RNA interference or post-transcriptional gene silencing) are disclosed, for example, in Xia et al., 2002, Nature Biotechnology 20, p. 1006; Hannon, 2002, Nature 418, p. 244; Carthew, 2001, Current Opinion in Cell Biology 13, p.
  • step 5604 the levels of cellular constituents in tissue selected from the organism are measured from the plurality of organisms 46 in order to derive gene expression / cellular constituent data 44.
  • cellular constituent data from only one tissue type is collected.
  • cellular constituent data from multiple tissue types are collected.
  • the plurality of organisms 46 exhibit a genetic variance with respect to some trait of interest.
  • the trait is quantifiable.
  • in instances where the trait is a disease, the trait can be quantified in a binary form (e.g., "1" if the organism has contracted the disease and "0" if the organism has not contracted the disease).
  • the trait can be quantified as a spectrum of values and the plurality of organisms 46 will represent several different values in such a spectrum.
  • the plurality of organisms 46 comprise an untreated (e.g., unexposed, wild type, etc.) population and a treated population (e.g., exposed, genetically altered, etc.).
  • the untreated population is not subjected to a perturbation whereas the treated population is subjected to a perturbation.
  • the tissue that is measured in step 5604 is blood, white adipose tissue, or some other tissue that is easily obtained from organisms 46.
  • the levels of between 5 cellular constituents and 100 cellular constituents, between 50 cellular constituents and 100 cellular constituents, between 300 and 1000 cellular constituents, between 800 and 5000 cellular constituents, between 4000 and 15,000 cellular constituents, between 10,000 and 40,000 cellular constituents, or more than 40,000 cellular constituents are measured.
  • gene expression / cellular constituent data comprises the processed microarray images for each individual (organism) in a population under study.
  • such data comprises, for each individual, quantity (intensity) information for each gene / cellular constituent represented on the microarray, optional background signal information, and associated annotation information describing the gene probe.
  • cellular constituent data is, in fact, protein expression levels for various proteins in a particular tissue in organisms under study.
  • cellular constituent levels are determined in step 5604 by measuring an amount of the cellular constituent in a predetermined tissue of the organism.
  • the term "cellular constituent" comprises individual genes, proteins, mRNA, metabolites and/or any other cellular components that can affect the trait under study. The level of a cellular constituent other than a gene can be measured in a wide variety of methods.
  • step 5604 comprises measuring the transcriptional state of cellular constituents in one or more tissues of organisms.
  • the transcriptional state includes the identities and abundances of the constituent RNA species, especially mRNAs.
  • the cellular constituents are RNA, cRNA, cDNA, or the like.
  • the transcriptional state of the cellular constituents can be measured by techniques of hybridization to arrays of nucleic acid or nucleic acid mimic probes, or by other gene expression technologies.
  • step 5604 comprises measuring the translational state of cellular constituents in tissues.
  • the cellular constituents are proteins.
  • the translational state includes the identities and abundances of the proteins in the tissue.
  • whole genome monitoring of protein (e.g., the "proteome," Goffeau et al., 1996, Science 274, p. 546) can be carried out by constructing a microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species. Preferably, antibodies are present for a substantial fraction (e.g. 30%, 40%, 50%, 60%, or more) of the encoded proteins. Methods for making monoclonal antibodies are well known. See, for example, Harlow and Lane, 1998, Antibodies: A Laboratory Manual, Cold Spring Harbor, N.Y.
  • monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequences.
  • proteins from the organisms are contacted with the array and their binding is assayed with assays known in the art.
  • antibody arrays for high-throughput screening of antibody-antigen interactions are used. See, for example, Wildt et al, Nature Biotechnology 18, p. 989.
  • large scale quantitative protein expression analysis can be performed using radioactive (e.g., Gygi et al., 1999, Mol. Cell. Biol. 19, p. 1720) and/or stable isotope ($^{15}$N) metabolic labeling (e.g., Oda et al. Proc. Natl. Acad. Sci.
  • Two-dimensional gel electrophoresis is well-known in the art and typically involves focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al, 1996, Proc Nat'l Acad. Sci. USA 93, p. 1440; Sagliocco et al, 1996, Yeast 12, p. 1519; Lander 1996, Science 274, p.
  • Electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N-terminal micro-sequencing. See, for example, Gygi, et al, 1999, Nature Biotechnology 17, p. 994.
  • fluorescence two-dimensional difference gel electrophoresis (DIGE) is used. See, for example, Beaumont et al, Life Science News 7, 2001.
  • quantities of proteins in tissues of organisms 246 are determined using isotope-coded affinity tags (ICATs) followed by tandem mass spectrometry. See, for example, Gygi et al, 1999, Nature Biotech 17, p. 994. Using such techniques, it is possible to identify a substantial fraction of the proteins expressed in a predetermined tissue in organisms 246.
  • step 5604 comprises measuring the activity or post-translational modifications of the cellular constituents in predetermined tissues of the plurality of organisms 46. See for example, Zhu and Snyder, Curr. Opin. Chem. Biol 5, p. 40; Martzen et al, 1999, Science 286, p. 1153; Zhu et al, 2000, Nature Genet.
  • measurement of the activity of the cellular constituents is facilitated using techniques such as protein microarrays. See, for example, MacBeath and Schreiber, 2000, Science 289, p. 1760; and Zhu et al, 2001, Science 293, p. 2101.
  • post-translational modifications or other aspects of the state of cellular constituents are analyzed using mass spectrometry. See, for example, Aebersold and Goodlett, 2001, Chem Rev 101, p. 269; Petricoin III, 2002, The Lancet 359, p. 572.
  • the proteome of tissue from organisms 46 is analyzed in step 5604.
  • the analysis of the proteome of cells in the organisms typically involves the use of high-throughput protein analysis methods such as microarray technology. See, for example, Templin et al, 2002, TRENDS in Biotechnology 20, p. 160; Albala and Humphrey-Smith, 1999, Curr. Opin. Mol. Ther. 1, p. 680; Cahill, 2000, Proteomics: A Trends Guide, p. 47-51; Emili and Cagney, 2000, Nat. Biotechnol., 18, p. 393; and Mitchell, Nature Biotechnology 20, p.
  • "mixed" aspects of the amounts of cellular constituents are measured in step 5604.
  • the amounts or concentrations of one set of cellular constituents in tissues from organisms 46 are combined with measurements of the activities of certain other cellular constituents in such tissues in step 5604.
  • different allelic forms of a cellular constituent in a given organism are detected and measured in step 5604. For example, in a diploid organism, there are two copies of any given gene, one descending from the "father" and the other from the "mother." In some instances, it is possible that each copy of the given gene is expressed at different levels. This is of significant interest since this type of allelic differential expression could associate with the trait under study, particularly in instances where the trait under study is complex.
  • cellular constituent data 44 comprises transcriptional data, translational data, activity data, and/or metabolite abundances for a plurality of cellular constituents.
  • the plurality of cellular constituents comprises at least five cellular constituents.
  • the plurality of cellular constituents comprises at least one hundred cellular constituents, at least one thousand cellular constituents, at least twenty thousand cellular constituents, or more than thirty thousand cellular constituents.
  • the expression statistics commonly used as quantitative traits in the analyses in one embodiment of the present invention include, but are not limited to, the mean log ratio, log intensity, and background-corrected intensity derived from transcriptional data. In other embodiments, other types of expression statistics are used as quantitative traits.
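Two of the expression statistics named above, the background-corrected intensity and the mean log ratio, can be sketched as follows. The use of base-10 logarithms, the `floor` parameter that keeps corrected intensities positive, and the replicate values are assumptions for illustration; the source does not specify them.

```python
from math import log10
from statistics import mean


def background_corrected(intensity, background, floor=1.0):
    """Spot intensity minus local background, floored so that a later
    log transform stays defined (the floor is an assumption)."""
    return max(intensity - background, floor)


def mean_log_ratio(sample, reference):
    """Mean of log10(sample / reference) over paired replicate measurements."""
    return mean(log10(s / r) for s, r in zip(sample, reference))


# Hypothetical replicate intensities for one gene in one organism.
sample = [background_corrected(120.0, 20.0), background_corrected(1030.0, 30.0)]
reference = [background_corrected(15.0, 5.0), background_corrected(110.0, 10.0)]
mlr = mean_log_ratio(sample, reference)
print(sample, reference, mlr)
```

A mean log ratio of 1 on this scale corresponds to a tenfold higher intensity in the sample than in the reference.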
  • the expression level of each of a plurality of genes in each organism under study is normalized. Any normalization routine can be used to accomplish this normalization. Representative normalization routines include, but are not limited to, Z-score of intensity, median intensity, log median intensity, Z-score standard deviation log of intensity, Z-score mean absolute deviation of log intensity, calibration DNA gene set, user normalization gene set, ratio median intensity correction, and intensity background correction. Furthermore, combinations of normalization routines can be run. Step 5608.
  • step 5608 patterns of cellular constituent levels (e.g., gene expression levels, protein abundance levels, etc.) are identified that associate with a trait under study and/or the perturbation that is optionally applied to the population prior to cellular constituent measurement.
  • cellular constituent levels e.g., gene expression levels, protein abundance levels, etc.
  • One such method first identifies those cellular constituents that discriminate the trait.
  • a perturbation is applied to the population prior to cellular constituent measurement in step 5604.
  • the perturbation can be, for example, exposure of the organism to a compound. Exposure of the organism to a compound can be effected by a variety of means, including but not limited to, administration, injection, etc.
  • the population of organisms is divided into two classes. Those organisms that have been exposed to the compound and those organisms that have not been exposed to the compound.
  • those cellular constituents (e.g., genes, proteins, metabolites, etc.) whose levels (e.g., transcriptional state, translational state, activity state, post-translational modification state, etc.) discriminate between the treatment group (the group exposed to the compound) and the control group are identified using a statistical technique such as a paired t-test, an unpaired t-test, a Wilcoxon rank test, a signed rank test, or by computation of the correlation between the trait and gene expression values.
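As one illustration of a rank-based comparison between a treatment group and a control group, the sketch below computes the Mann-Whitney U statistic, which is equivalent to the Wilcoxon rank-sum statistic up to a constant. Whether this is the exact rank test intended by the source is an assumption; the function name and data are hypothetical, and no p-value is computed.

```python
def mann_whitney_u(treated, control):
    """U statistic for `treated`: the number of (treated, control) pairs in
    which the treated level is higher, counting ties as one half."""
    return sum(1.0 if t > c else 0.5 if t == c else 0.0
               for t in treated for c in control)


# Hypothetical levels of one cellular constituent.
treated_levels = [5.0, 6.0, 7.0]
control_levels = [1.0, 2.0, 3.0]

u = mann_whitney_u(treated_levels, control_levels)
u_max = len(treated_levels) * len(control_levels)
print(u, u_max)
```

A value of U near 0 or near its maximum (the product of the two group sizes) indicates that the constituent separates the groups; a value near half the maximum indicates no separation.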
  • the perturbation optionally applied to the population comprises multiple treatments.
  • a perturbation is not applied to the population under study.
  • the population under study is divided into those organisms that exhibit the trait and those organisms that do not exhibit the trait.
  • Those cellular constituents e.g. genes, proteins, metabolites, etc.
  • levels e.g., transcriptional state, translational state, activity state, post-translational modification state, etc.
  • the population under study is divided into groups based on a function of the phenotype for the trait under study.
  • Those cellular constituents whose levels in the organisms 46 discriminate between the various groups are identified using a statistical technique.
  • the population under study exhibits a broad spectrum of phenotypes for the trait. Those cellular constituents whose levels in the organisms 246 can differentiate at least some of these phenotypes are then identified using statistical techniques.
  • the population is divided into phenotypically distinct groups and cellular constituents that distinguish between these phenotypically distinct groups are identified using statistical tests such as t-tests (for two groups) or ANOVA (for greater than two groups).
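For more than two phenotypic groups, the one-way ANOVA F statistic for a single cellular constituent can be computed as in the sketch below. The function name and the data are hypothetical, and converting F to a p-value (which requires the F distribution) is left out.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic. `groups` is a list of lists, one list of
    levels per phenotypic group. Assumes nonzero within-group variation."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of individual levels around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical levels of one constituent in three phenotypic groups.
f_stat = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

With exactly two groups, F equals the square of the pooled-variance t statistic, so the two tests named above agree in that case.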
  • the set of cellular constituents identified in step 5608 comprises between 5 and 100 cellular constituents, between 50 and 500 cellular constituents, between 400 and 1000 cellular constituents, between 800 and 4000 cellular constituents, between 3000 and 8000 cellular constituents, between 8000 and 15000 cellular constituents, more than 15000 cellular constituents, or less than 30000 cellular constituents.
  • the phenotypic extremes within the population are identified. For example, in one case, the trait of interest is obesity. In such an example, very obese and very skinny organisms 246 are selected as the phenotypic extremes in this step.
  • a phenotypic extreme is defined as the top or bottom 40th, 30th, 20th, or 10th percentile of the population with respect to a given phenotype exhibited by the population.
  • cellular constituent levels 250 measured in phenotypically extreme organisms
  • a t-test or some other test such as a multivariate test to determine whether the given cellular constituent 246 can discriminate between phenotypic groups identified (e.g., treated versus untreated) for the population under study.
  • a cellular constituent 246 will discriminate between phenotypic groups when the cellular constituent is found at characteristically different levels in each of the phenotypic groups. For example, in the case where there are two phenotypic groups, a cellular constituent will discriminate between the two groups when levels 250 of the cellular constituent (measured in phenotypically extreme organisms) are found at a first level in the first phenotypic group and are found at a second level in the second phenotypic group, where the first and second level are distinctly different.
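The selection of phenotypic extremes and the two-group test described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the trait values and constituent levels are made-up numbers, the function names are hypothetical, and Welch's unequal-variance t statistic is used as one possible form of the t-test.

```python
import math
from statistics import mean, variance

def phenotypic_extremes(trait_values, percentile=10):
    """Return (low, high): indices of organisms in the bottom and top
    `percentile` percent of the population for the trait."""
    order = sorted(range(len(trait_values)), key=lambda i: trait_values[i])
    k = max(1, len(trait_values) * percentile // 100)
    return order[:k], order[-k:]

def welch_t(levels_a, levels_b):
    """Welch t statistic comparing constituent levels in two groups
    (assumes at least one group has nonzero variance)."""
    se = math.sqrt(variance(levels_a) / len(levels_a)
                   + variance(levels_b) / len(levels_b))
    return (mean(levels_a) - mean(levels_b)) / se

# Hypothetical data: one trait value and one constituent level per organism
trait = [18.0, 42.5, 21.0, 39.0, 30.0, 19.5, 41.0, 29.0, 31.0, 40.0]
levels = [1.0, 6.0, 2.0, 5.0, 3.5, 3.0, 4.0, 3.4, 3.6, 6.5]

low, high = phenotypic_extremes(trait, percentile=30)
t_stat = welch_t([levels[i] for i in low], [levels[i] for i in high])
```

A t statistic far from zero indicates that the constituent is found at distinctly different levels in the two extreme groups.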
  • Step 5610 Once the set of cellular constituents that discriminate the trait or, optionally, the perturbation, have been identified (e.g., using organisms in the population that represent phenotypic extremes), they can be clustered.
  • each cellular constituent in the set of cellular constituents that discriminates the trait (or the perturbation applied to the population prior to measurement in step 5604) between two or more classes (e.g., afflicted versus nonafflicted, perturbed versus nonperturbed) is treated as a cellular constituent vector.
  • the n-th cellular constituent in the set of cellular constituents that discriminates the perturbation (e.g., complex trait) between two or more classes is represented as: $C_n = (A_1, A_2, \ldots, A_m)$
  • each A is the level (e.g., transcriptional state, translational state, activity, etc.) of cellular constituent n in a tissue of an organism 246 in the plurality of organisms under study, and m is the number of organisms considered.
  • Cellular constituent vectors $C_n$ can be clustered based on similarities in the values of corresponding levels A in each cellular constituent vector.
  • Cellular constituent vectors $C_n$ will cluster into the same group (cellular constituent vector cluster) if the corresponding levels in such cellular constituent vectors are correlated.
  • Each cellular constituent vector will therefore have five values.
  • Each of the five values will be a level (e.g., activity, transcriptional state, translational state, etc.) of the corresponding cellular constituent n in a tissue of one of the five organisms:
  • Exemplary cellular constituent vector $C_1$ = {0, 5, 5.5, 0, 0}
  • Exemplary cellular constituent vector $C_2$ = {0, 4.9, 5.4, 0, 0}
  • Exemplary cellular constituent vector $C_3$ = {6, 0, 3, 3, 5}
  • Clustering of exemplary cellular constituent vectors $C_1$, $C_2$, and $C_3$ will result in two clusters (cellular constituent vector clusters).
  • the first cluster will include cellular constituent vectors $C_1$ and $C_2$ because there is a correlation in the levels within each vector (0 versus 0 in organism 246-1, 5 versus 4.9 in organism 246-2, 5.5 versus 5.4 in organism 246-3, 0 versus 0 in organism 246-4, and 0 versus 0 in organism 246-5).
  • the second cluster will include exemplary cellular constituent vector $C_3$ because the pattern of levels in vector $C_3$ is not similar to the pattern of levels in $C_1$ and $C_2$. This illustration serves to describe certain aspects of clustering using hypothetical cellular constituent level data.
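The illustration above can be reproduced with a short sketch that computes Pearson correlation between the exemplary vectors and groups vectors whose correlation meets a threshold. The threshold of 0.9, the function names, and the simple single-linkage grouping are assumptions made for illustration; entries printed as "o" in the exemplary vectors are read here as 0.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two level vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def correlation_clusters(vectors, threshold=0.9):
    """Single-linkage grouping: two vectors share a cluster when their
    Pearson correlation is at least `threshold` (assumed cutoff)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if pearson(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

c1 = [0, 5, 5.5, 0, 0]
c2 = [0, 4.9, 5.4, 0, 0]
c3 = [6, 0, 3, 3, 5]
clusters = correlation_clusters([c1, c2, c3])  # indices 0, 1, 2 stand for C1, C2, C3
```

With these inputs the first two vectors are almost perfectly correlated and fall into one cluster, while the third is negatively correlated with both and forms its own cluster.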
  • the cellular constituents used in this step are selected because they discriminate trait extremes.
  • the cellular constituent levels should reflect that they were selected over phenotypic extremes.
  • the clustering in this step will help to identify subgroups of cellular constituents within the group of cellular constituents that discriminate trait extremes.
  • agglomerative hierarchical clustering is applied to the cellular constituent vectors in step 1510. In such clustering, similarity is determined using Pearson correlation coefficients between the cellular constituent vector pairs.
  • the clustering of the cellular constituent vectors comprises application of a hierarchical clustering technique, application of a k-means technique, application of a fuzzy k-means technique, application of a Jarvis-Patrick clustering technique, application of a self-organizing map or application of a neural network.
  • the hierarchical clustering technique is an agglomerative clustering procedure.
  • the agglomerative clustering procedure is a nearest-neighbor algorithm, a farthest-neighbor algorithm, an average linkage algorithm, a centroid algorithm, or a sum-of-squares algorithm.
  • the hierarchical clustering technique is a divisive clustering procedure.
  • nonparametric clustering algorithms are applied to the cellular constituent vectors.
  • Spearman R, Kendall Tau, or Gamma coefficients are used to cluster the cellular constituent vectors.
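As a sketch of the nonparametric option, the Spearman coefficient can be computed by replacing each level with its rank and then taking the Pearson correlation of the ranks. The function names below are hypothetical, and ties are given average ranks, which is one common convention rather than something the text specifies.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def spearman(u, v):
    """Spearman rank correlation between two cellular constituent vectors."""
    return pearson(average_ranks(u), average_ranks(v))
```

The resulting coefficient can be used in place of the Pearson coefficient as the similarity measure when clustering the cellular constituent vectors.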
  • Step 5612 the population is reclassified into subtypes using the clustering information from step 5610.
  • the goal of step 5612 is to construct a classifier that comprises those cellular constituents that can distinguish between these subtypes.
  • a respective phenotypic vector is constructed for each organism in the population.
  • Each phenotypic vector comprises the cellular constituent levels for all or a portion of the set of cellular constituents that were used in step 5610.
  • the order of the elements in the phenotypic vectors is determined by the clustering patterns achieved in step 5610.
  • the phenotypic vectors are clustered using any known clustering technique.
  • the clustering in step 5612 produces a two-dimensional cluster.
  • cellular constituents are clustered based on similarities in their abundance across the population of organisms. For example, two cellular constituents would cluster together if they are expressed at similar levels throughout the population.
  • organisms are clustered based on similarity across the set of cellular constituents. For example, two organisms will cluster together if corresponding cellular constituents in each organism express at comparable levels.
  • the present invention provides many alternative pattern classification techniques that can be used instead of the clustering techniques that are described in steps 5610 and 5612.
  • steps 5610 and 5612 order the population into new subgroups (e.g., phenotypic clusters). Each subgroup (phenotypic cluster) is characterized by a distinctive cellular constituent expression (or level) pattern.
  • the elements in the phenotypic vectors are the measured cellular constituent levels for the respective organisms arranged in the order specified by the cellular constituent clustering results of step 5610. For illustration, suppose there are ten cellular constituents, (1, 2, 3, 4, 5, 6, 7, 8, 9, and 10), where constituents 8-10 fall into group A, constituents 4-7 fall into group B, and constituents 1-3 fall into group C.
  • a phenotypic vector $V_M$ for an organism M in the population could have the form:
  • $V_M$ = {8, 9, 10, 4, 5, 6, 7, 1, 2, 3}
  • each respective cellular constituent in the vector is represented by the level of the cellular constituent in the organism represented by the vector.
  • Each vector $V_M$ is clustered based on these levels.
  • $V_1$ = {+, -, +, +, +, +, -, -, -, -}
  • $V_3$ = {+, +, +, +, +, +, -, -, -, -}
  • $V_2$ = {-, -, -, -, +, +, +, +, +, +}
  • $V_4$ = {-, -, -, -, +, +, +, +, -, +}
  • each organism in group I has a similar cellular constituent expression (or level) pattern. Further, this similar pattern distinguishes group I from group II. Likewise, each organism in group II has a similar cellular constituent expression (or level) pattern and this pattern distinguishes group II from group I.
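The construction and clustering of phenotypic vectors described above can be sketched as follows. The ordering comes from the ten-constituent illustration (groups A, B, C); the "+"/"-" patterns are the four exemplary vectors as printed; the matching-fraction similarity, the 0.8 threshold, and all function names are assumptions chosen for illustration only.

```python
# Order of constituents dictated by the constituent clustering:
# group A (8-10), then group B (4-7), then group C (1-3)
cluster_order = [8, 9, 10, 4, 5, 6, 7, 1, 2, 3]

def phenotypic_vector(levels_by_constituent, order=cluster_order):
    """Arrange one organism's constituent levels in the clustered order."""
    return [levels_by_constituent[c] for c in order]

def match_fraction(u, v):
    """Fraction of positions at which two phenotypic vectors agree."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def group_organisms(vectors, threshold=0.8):
    """Greedy grouping: an organism joins the first group in which it
    agrees with every member on at least `threshold` of positions."""
    groups = []
    for name, vec in vectors.items():
        for group in groups:
            if all(match_fraction(vec, vectors[o]) >= threshold for o in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

vectors = {
    "V1": list("+-++++----"),
    "V3": list("++++++----"),
    "V2": list("----++++++"),
    "V4": list("----++++-+"),
}
groups = group_organisms(vectors)
```

Here the first group plays the role of group I and the second the role of group II.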
  • the ordered set of cellular constituents from step 5610 serves as a classifier that reclassifies the organisms into subtypes. In some embodiments the clustering of step 5610 is not performed and only phenotypic vectors are clustered in order to identify such phenotypic clusters.
  • the subtypes (subgroups) obtained in this step are not obtained using classical phenotypic observations. Rather, each of the subtypes is identified using an ordered set of cellular constituent levels that discriminate between phenotypically distinguishable groups. As such, each of the subtypes identified in step 5612 may well represent a distinct biochemical form of the trait under study.
  • each of the subtypes identified in this step could represent a different biochemical response associated with the trait.
  • the cellular constituents that can discriminate between the newly identified subgroups (subtypes) are determined. For example, consider the example above in which the following clusters were obtained:
  • $V_1$ = {+, -, +, +, +, -, -, -, -, -}
  • $V_3$ = {+, +, +, +, -, -, -, -}
  • cellular constituents 8, 10, 4, 5, 6, 7, 1, and 3 discriminate between groups I and II whereas cellular constituents 9 and 2 do not discriminate.
  • cellular constituent 9 has the values (- / +) in group I and (- / -) in group II and cellular constituent 2 has the values (- / -) in group I and (+ / -) in group II.
  • the set of cellular constituents that discriminate between subtypes (subgroups) identified in step 5612 serve as a classifier for the population under study. This classifier is capable of differentiating the general population into subtypes.
  • the cellular constituents identified in step 5612 are capable of classifying all the organisms in the general population into subgroups.
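One simple way to sketch the selection of discriminating constituents is to keep those whose value is uniform within each subgroup and differs between subgroups. The vectors below are hypothetical values chosen only so that the outcome agrees with the stated result (constituents 9 and 2 do not discriminate); they are not taken from the text, and the uniform-within-group rule is an assumed criterion.

```python
# Constituent order from the clustering step (groups A, B, C)
order = [8, 9, 10, 4, 5, 6, 7, 1, 2, 3]

# Hypothetical phenotypic vectors for two subgroups
group_I = ["+-++++----", "++++++----"]
group_II = ["------++++", "------++-+"]

def discriminating(order, group_a, group_b):
    """Return constituents whose value is constant within each group
    and different between the two groups."""
    keep = []
    for pos, constituent in enumerate(order):
        a = {vec[pos] for vec in group_a}
        b = {vec[pos] for vec in group_b}
        if len(a) == 1 and len(b) == 1 and a != b:
            keep.append(constituent)
    return keep

classifier = discriminating(order, group_I, group_II)
```

With real-valued levels, the equality check would be replaced by a statistical test such as the t-test used earlier.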
  • Step 1512 serves to break a population down into subtypes. After step 1512, quantitative genetic methods are used to study the subpopulations.
  • 5.2. SOURCES OF MARKER DATA
  • Several forms of genetic markers that are used to construct marker map 78 are known in the art.
  • a common type of genetic marker is the single nucleotide polymorphism (SNP). SNPs occur approximately once every 600 base pairs in the genome. See, for example, Kruglyak and Nickerson, 2001, Nature Genetics 27, 235.
  • the present invention contemplates the use of genotypic databases such as SNP databases as a source of genetic markers.
  • SNP haplotypes each of which reflects descent from a single ancient ancestral chromosome. See Fullerton et al, 2000, Am. J. Hum. Genet. 67, 881. Such haplotype structure is useful in selecting appropriate genetic variants for analysis. Patil et al found that a very dense set of SNPs is required to capture all the common haplotype information. Once common haplotype information is available, it can be used to identify much smaller subsets of SNPs useful for comprehensive whole-genome studies. See Patil et al, 2001, Science 294, 1719-1723.
  • Suitable sources of genetic markers include databases that have various types of gene expression data from platform types such as spotted microarray (microarray), high-density oligonucleotide array (HDA), hybridization filter (filter) and serial analysis of gene expression (SAGE) data.
  • a genetic database that can be used is a DNA methylation database.
  • a set of genetic markers is derived from any type of genetic database that tracks variations in the genome of an organism of interest.
  • Information that is typically represented in such databases is a collection of loci within the genome of the organism of interest. For each locus, strains for which genetic variation information is available are represented. For each represented strain, variation information is provided. Variation information is any type of genetic variation information. Representative genetic variation information includes, but is not limited to, single nucleotide polymorphisms, restriction fragment length polymorphisms, microsatellite markers, and short tandem repeats.
  • genotypic databases include, but are not limited to (genetic variation type: uniform resource location):
      SNP: http://bioinfo.pal.roche.com/usuka_bioinformatics/cgi-bin/msnp/msnp.pl
      SNP: http://snp.cshl.org/
      SNP: http://www.ibc.wustl.edu/SNP/
      SNP: http://www-genome.wi.mit.edu/SNP/mouse/
      SNP: http://www.ncbi.nlm.nih.gov/SNP/
      Microsatellite markers: http://www.informatics.jax.org/searches/polymorphism_form.shtml
      Restriction fragment length polymorphisms: http://www.informatics.jax.org/searches/polymorphism_form.shtml
      Short tandem repeats: http://www.cidr.jhmi.edu/mouse/mmset.html
      Sequence length http://mcbio.med
  • genotypic databases within the scope of the present invention include a wide array of expression profile databases such as the one found at the URL: http://www.ncbi.nlm.nih.gov/geo/.
  • Another form of genetic marker that may be used to construct marker map 78 is restriction fragment length polymorphisms (RFLPs).
  • RFLPs are the product of allelic differences between DNA restriction fragments caused by nucleotide sequence variability. As is well known to those of skill in the art, RFLPs are typically detected by extraction of genomic DNA and digestion with a restriction endonuclease.
  • RAPD random amplified polymorphic DNA
  • AFLP amplified fragment length polymorphisms
  • AFLP technology refers to a process that is designed to generate large numbers of randomly distributed molecular markers (see, for example, European Patent Application No. 0534858 A1).
  • Still another form of genetic marker that may be used to construct marker map 78 is "simple sequence repeats" or "SSRs". SSRs are di-, tri- or tetra-nucleotide tandem repeats within a genome. The repeat region may vary in length between genotypes while the DNA flanking the repeat is conserved such that the same primers will work in a plurality of genotypes. A polymorphism between two genotypes represents repeats of different lengths between the two flanking conserved DNA sequences (see, for example, Akagi et al, 1996, Theor. Appl.
  • SSRs are also known as satellites or microsatellites.
  • many genetic markers suitable for use with the present invention are publicly available. Those skilled in the art can also readily prepare suitable markers. For molecular marker methods, see generally, The DNA Revolution by Andrew H. Paterson 1996 (Chapter 2) in: Genome Mapping in Plants (ed. Andrew H. Paterson) by Academic Press/R. G. Landis Company, Austin, Tex., 7-21.
  • A number of different normalization protocols can be used by normalization module 72 to normalize cellular constituent abundance data 44. Some such normalization protocols are described in this section. Typically, the normalization comprises normalizing the expression level measurement of each gene in a plurality of genes that is expressed by an organism in a population of interest. Many of the normalization protocols described in this section are used to normalize microarray data. It will be appreciated that there are many other suitable normalization protocols that may be used in accordance with the present invention. All such protocols are within the scope of the present invention.
  • Z-score of intensity In this protocol, raw expression intensities are normalized by the (mean intensity)/(standard deviation) of raw intensities for all spots in a sample.
  • the Z-score of intensity method normalizes each hybridized sample by the mean and standard deviation of the raw intensities for all of the spots in that sample. The mean intensity $\mathrm{mnI}_i$ and the standard deviation $\mathrm{sdI}_i$ are computed for the raw intensity of control genes.
  • Z-score intensity $Z_{ij}$ for intensity $I_{ij}$ for probe i (hybridization probe, protein, or other binding entity) and spot j is computed as: $Z_{ij} = (I_{ij} - \mathrm{mnI}_i)/\mathrm{sdI}_i$.
  • Another normalization protocol is the median intensity normalization protocol in which the raw intensities for all spots in each sample are normalized by the median of the raw intensities.
  • the median intensity normalization method normalizes each hybridized sample by the median of the raw intensities of control genes ($\mathrm{medianI}_i$) for all of the spots in that sample.
  • the raw intensity $I_{ij}$ for probe i and spot j has the value $\mathrm{Im}_{ij}$ where $\mathrm{Im}_{ij} = I_{ij}/\mathrm{medianI}_i$.
  • log median intensity protocol Another normalization protocol is the log median intensity protocol.
  • raw expression intensities are normalized by the log of the median scaled raw intensities of representative spots for all spots in the sample.
  • For microarray data, the log median intensity method normalizes each hybridized sample by the log of median scaled raw intensities of control genes ($\mathrm{medianI}_i$) for all of the spots in that sample.
  • control genes are a set of genes that have reproducible accurately measured expression values.
  • the value 1.0 is added to the intensity value to avoid taking log(0.0) when the intensity has a zero value.
  • Upon normalization by the log median intensity method, the raw intensity $I_{ij}$ for probe i and spot j has the value $\mathrm{Im}_{ij}$ where
  • $\mathrm{Im}_{ij} = \log(1.0 + (I_{ij}/\mathrm{medianI}_i))$.
  • Z-score standard deviation log of intensity protocol In this protocol, raw expression intensities are normalized by the mean log intensity ($\mathrm{mnLI}_i$) and standard deviation log intensity ($\mathrm{sdLI}_i$). For microarray data, the mean log intensity and the standard deviation log intensity are computed for the log of raw intensity of control genes. Then, the Z-score intensity $\mathrm{ZlogS}_{ij}$ for probe i and spot j is:
  • $\mathrm{ZlogS}_{ij} = (\log(I_{ij}) - \mathrm{mnLI}_i)/\mathrm{sdLI}_i$.
  • Still another normalization protocol is the Z-score mean absolute deviation of log intensity protocol.
  • raw expression intensities are normalized by the Z- score of the log intensity using the equation (log(intensity)-mean logarithm) / standard deviation logarithm.
  • the Z-score mean absolute deviation of log intensity protocol normalizes each bound sample by the mean and mean absolute deviation of the logs of the raw intensities for all of the spots in the sample.
  • the mean log intensity $\mathrm{mnLI}_i$ and the mean absolute deviation log intensity $\mathrm{madLI}_i$ are computed for the log of raw intensity of control genes. Then, the Z-score intensity $\mathrm{ZlogA}_{ij}$ for probe i and spot j is:
  • $\mathrm{ZlogA}_{ij} = (\log(I_{ij}) - \mathrm{mnLI}_i)/\mathrm{madLI}_i$.
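The intensity-based normalization protocols above can be sketched for a single sample as follows. This is an illustrative sketch, not the normalization module itself: the function names are hypothetical, the control-gene intensities are passed in as a separate list, the sample standard deviation is assumed, and natural logarithms are assumed because the text does not state a base.

```python
import math
from statistics import mean, median, stdev

def z_score(intensities, controls):
    """Z-score of intensity: (I - mean) / standard deviation of controls."""
    m, s = mean(controls), stdev(controls)
    return [(x - m) / s for x in intensities]

def median_normalize(intensities, controls):
    """Median intensity normalization: I / median of controls."""
    med = median(controls)
    return [x / med for x in intensities]

def log_median_normalize(intensities, controls):
    """Log median intensity: log(1.0 + I / median of controls)."""
    med = median(controls)
    return [math.log(1.0 + x / med) for x in intensities]

def z_log_sd(intensities, controls):
    """Z-score of log intensity using the standard deviation of the logs."""
    logs = [math.log(c) for c in controls]
    m, s = mean(logs), stdev(logs)
    return [(math.log(x) - m) / s for x in intensities]

def z_log_mad(intensities, controls):
    """Z-score of log intensity using the mean absolute deviation of the logs."""
    logs = [math.log(c) for c in controls]
    m = mean(logs)
    mad = mean([abs(v - m) for v in logs])
    return [(math.log(x) - m) / mad for x in intensities]
```

The log-based variants require strictly positive intensities, except for the log median protocol, where adding 1.0 allows a zero intensity.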
  • Another normalization protocol is the user normalization gene set protocol. In this protocol, raw expression intensities are normalized by the sum of the genes in a user defined gene set in each sample. This method is useful if a subset of genes has been determined to have relatively constant expression across a set of samples.
  • Yet another normalization protocol is the calibration DNA gene set protocol in which each sample is normalized by the sum of calibration DNA genes.
  • calibration DNA genes are genes that produce reproducible expression values that are accurately measured. Such genes tend to have the same expression values on each of several different microarrays.
  • the algorithm is the same as user normalization gene set protocol described above, but the set is predefined as the genes flagged as calibration DNA.
  • ratio median intensity correction protocol is useful in embodiments in which a two-color fluorescence labeling and detection scheme is used. See, for example, section 5.8.1.5.
  • the two fluors in a two-color fluorescence labeling and detection scheme are Cy3 and Cy5
  • measurements are normalized by multiplying the ratio (Cy3/Cy5) by medianCy5/medianCy3 intensities.
  • If background correction is enabled, measurements are normalized by multiplying the ratio (Cy3/Cy5) by (medianCy5-medianBkgdCy5) / (medianCy3-medianBkgdCy3), where medianBkgd means median background levels.
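The ratio median intensity correction can be sketched as a single scaling of the per-spot Cy3/Cy5 ratios. The function name and argument layout below are assumptions for illustration; background lists are optional and, when given, their medians are subtracted from the channel medians as in the background-corrected form.

```python
from statistics import median

def ratio_median_correction(cy3, cy5, bkgd_cy3=None, bkgd_cy5=None):
    """Scale each Cy3/Cy5 ratio by medianCy5/medianCy3, optionally
    subtracting median background from each channel median first."""
    med3, med5 = median(cy3), median(cy5)
    if bkgd_cy3 is not None and bkgd_cy5 is not None:
        med3 -= median(bkgd_cy3)
        med5 -= median(bkgd_cy5)
    scale = med5 / med3
    return [(a / b) * scale for a, b in zip(cy3, cy5)]
```

After correction, a spot whose two channels differ only by the overall channel bias has a ratio close to 1.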
  • intensity background correction is used to normalize measurements.
  • the background intensity data from a spot quantification program may be used to correct spot intensity. Background may be specified as either a global value or on a per-spot basis. If the array images have low background, then intensity background correction may not be necessary.
  • 5.4. LOGARITHM OF THE ODDS SCORES
  • Denoting the joint probability of inheriting all genotypes P(g), and the joint probability of all observed data x (trait and marker species) conditional on genotypes P(x | g), the likelihood L for a set of data is $L = \sum_{g} P(g)\,P(x \mid g)$, where the summation is over all the possible joint genotypes g (trait and marker) for all pedigree members. What is unknown in this likelihood is the recombination fraction θ, on which P(g) depends.
  • the recombination fraction θ is the probability that two loci will recombine during meiosis.
  • the recombination fraction θ is correlated with the distance between two loci.
  • 0.5
  • the genetic distance is a monotonic function of θ. See, e.g., Ott, 1985, Analysis of Human Genetic Linkage, first edition, Baltimore, MD, Johns Hopkins University Press.
  • genetic linkage can be exploited to obtain an estimate of the chromosomal position of a second locus relative to the first locus.
  • In the linkage analysis described in Section 5.2, the unknown location of genes predisposing to various quantitative phenotypes is mapped relative to a large number of marker loci in a genetic map.
  • θ is estimated by the frequency of recombinant meioses in a large sample of meioses. If two loci are linked, then the number of nonrecombinant meioses N is expected to be larger than the number of recombinant meioses R.
  • the recombination fraction between the new locus and each marker can be estimated as: $\hat{\theta} = R/(N + R)$.
  • This likelihood function Z(θ) is a function of the recombination fraction θ between the trait (e.g., classical trait or quantitative trait) and the marker locus.
  • lod is an abbreviation for "logarithm of the odds."
  • a lod score permits visualization of linkage evidence.
  • lod scores provide a method to calculate linkage distances as well as to estimate the probability that two genes (and/or QTLs) are linked.
  • lod score computation is species dependent. For example, methods for computing the lod score in mouse differ from that described in this section. However, methods for computing lod scores are known in the art and the method described in this section is only by way of illustration and not by limitation.
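As a sketch of the simplest textbook case, the recombination fraction and a lod score can be computed directly from counts of recombinant and nonrecombinant meioses. This assumes phase-known, fully informative meioses with likelihood proportional to θ^R (1 − θ)^N, which is a simplification and not the pedigree likelihood described above; the function names are hypothetical.

```python
import math

def recombination_fraction(n_nonrecombinant, n_recombinant):
    """Estimate theta as R / (N + R)."""
    return n_recombinant / (n_nonrecombinant + n_recombinant)

def lod_score(theta, n_nonrecombinant, n_recombinant):
    """log10 of L(theta) / L(0.5), with L(theta) = theta^R * (1 - theta)^N
    (assumed simplified likelihood for phase-known meioses)."""
    n, r = n_nonrecombinant, n_recombinant
    log_l = 0.0
    if r:
        log_l += r * math.log10(theta)
    if n:
        log_l += n * math.log10(1.0 - theta)
    # Under no linkage theta = 0.5, so L(0.5) = 0.5^(N + R)
    return log_l - (n + r) * math.log10(0.5)

theta_hat = recombination_fraction(18, 2)
lod = lod_score(theta_hat, 18, 2)
```

A lod score of 3 or more is conventionally taken as evidence of linkage in human studies; as noted above, the appropriate computation and thresholds depend on the species and cross design.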
  • the embodiment of the invention outlined in Section 5.1, above, and shown in Fig. 7, has the significant advantage in that gene expression data and clinical traits are linked to (correlated with) quantitative trait loci (QTL).
  • QTL quantitative trait loci
  • the QTL information provides a powerful filter that allows for the rapid restriction of attention from all significantly correlated cellular constituents and trait values to those subsets of cellular constituents and traits that are under the control of a common set of QTL.
  • Fig. 13B then become QTL and traits and it is possible to initially direct an edge between the QTL and a single trait by definition of a QTL, and then test all other traits pairwise as discussed below to determine how the trait pairs are positioned relative to one another. For instance, going back to the case of a clinical trait T linked to a QTL Q, the relationship between Q and T can be immediately fixed as illustrated in Fig. 13C.
  • the relationship in Fig. 13C holds because Q is a QTL for T, and the QTL provides the direction of the relationship (T depends from Q) since Q is causal for T (e.g., variations in the DNA at the QTL location lead to variations in T).
  • This property is satisfied only if T and Q are conditionally dependent upon G.
  • This conditional dependence property is related to the mutual information measure that is typically used in network reconstruction problems: $I(T; G) = \sum_{t}\sum_{g} p(t, g)\,\log\frac{p(t, g)}{p(t)\,p(g)}$, where the summation symbol indicates that the continuous variables T and G have been discretized to allow for efficient computation over complicated graph structures, as is usually done in network reconstruction problems.
  • Mutual information is the reduction in uncertainty about one variable due to the knowledge of the other variable. See, for example, Duda et al., 2001, Pattern Classification, John Wiley & Sons, Inc., New York, p. 632.
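The mutual information of two discretized variables can be sketched using the standard definition, the sum over observed value pairs of p(t, g) log[p(t, g) / (p(t) p(g))]. The equal-width binning, the number of bins, the natural logarithm, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def discretize(values, n_bins=3):
    """Map continuous values to equal-width bin indices 0 .. n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete sequences,
    using empirical frequencies as probabilities."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

For example, `mutual_information(discretize(t_values), discretize(g_values))` would give the estimate for a trait and an expression profile measured across the same organisms; a value near zero indicates that the discretized variables are close to independent.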
  • the significance of the resulting LOD score can be used as the significance level for the test of independence.
  • the likelihood for G and T for a single animal in an F2 population is formed, where G and T are taken to be jointly normally distributed, allowing for dependency between G and T.
  • the likelihood for animal i is:
  • $\theta_0 = (\mu_T, \mu_G, \sigma_T, \sigma_G, \rho)$ is the parameter vector for the likelihood
  • $\rho$ is the correlation between G and T.
  • $\theta_A = (\mu_{TQ_1}, \mu_{TQ_2}, \mu_{TQ_3}, \mu_{GQ_1}, \mu_{GQ_2}, \mu_{GQ_3}, \sigma_T, \sigma_G)$, and $P(Q_j)$ is the probability of genotype $Q_j$ at locus Q.
  • conditional likelihood for T | G (the conditional likelihood under the null hypothesis) for a single animal is:
  • $\hat{\theta}_0$ and $\hat{\theta}_A$ are the maximum likelihood estimates obtained from $L_0$ and $L_A$ defined above.
  • candidate pathway groups are identified from the analysis of QTL interaction map data and gene expression cluster maps.
  • Each candidate pathway group includes a number of genes.
  • the methods of the present invention are advantageous because they filter the potentially thousands of genes in the genome of the population of interest into a few candidate pathway groups using clustering techniques.
  • a candidate pathway group represents a group of genes that tightly cluster in a gene expression cluster map.
  • the genes in a candidate pathway group may also cluster tightly in a QTL interaction map.
  • the QTL interaction map serves as a complementary approach to defining the genes in a candidate pathway group. For example, consider the case in which genes A, B, and C cluster tightly in a gene expression cluster map. Furthermore, genes A, B, C and D cluster tightly in the corresponding QTL interaction map. In this example, analysis of the gene expression cluster map alone suggests that genes A, B, and C form a candidate pathway group.
  • candidate pathway group comprises genes A, B, C, and D.
  • Once candidate pathway groups have been identified, multivariate statistical techniques can be used to determine whether each of the genes in the candidate pathway group affects a particular trait, such as a complex disease trait.
  • the form of multivariate statistical analysis used in some embodiments of the present invention is dependent upon the type of genotype and/or pedigree data that is available. Typically, more pedigree data is available in cases where the population to be studied is plants or animals.
  • the multivariate statistical models such as those of Jiang and Zeng, 1995, Genetics 140, pp. 1111-1127, as well as the techniques implemented in QTL Cartographer (Basten and Zeng, 1994, Zmap-a QTL cartographer, Proceedings of the 5th World Congress on Genetics Applied to Livestock Production: Computing Strategies and Software 22, Smith et al. eds., pp. 65-66, The Organizing Committee, 5th World Congress on Genetics Applied to Livestock Production, Guelph, Ontario, Canada; Basten et al., 2001, QTL Cartographer, Version 1.15, Department of Statistics, North Carolina State University, Raleigh, North Carolina).
  • CIM composite interval mapping
  • the multiple-trait extension to CIM developed by Jiang and Zeng provides a framework for testing the candidate pathway groups that are constructed using the methods of the present invention in cases where the genes in these candidate pathway groups link to the same genetic region.
  • the methods of Jiang and Zeng allow for the determination as to whether expression values (for the genes in the candidate pathway group) linking to the same region are controlled by a single gene (pleiotropy) or by two closely linked genes. If the methods of Jiang and Zeng suggest that multiple genes are actually controlled by closely linked loci (closely linked genes), then there is no support for the conclusion that the genes linking to the same region are in the same pathway.
  • the components (hierarchy) of a pathway can be deduced by testing subsets of the pathway group to see which genes have an underlying pleiotropic relationship with respect to other genes.
  • the definition of the candidate pathway group can be refined by eliminating specific genes in the candidate pathway group that do not have a pleiotropic relationship with other genes in the candidate pathway group. The idea is to determine which of the genes linking to a given region have other genes linking to their physical location, indicating the order for hierarchy and control. Presently, the practical limits are that no more than ten genes can be handled at once using multivariate methods such as the Jiang and Zeng methods.
  • the number of genes is limited by the amount of data available to fit the model, but the particular limitation is that the optimization techniques are not effective for greater than 10 dimensions. However, in some embodiments, more than 10 genes can be handled at once by implementing dimensionality reduction techniques (like principal components).
  • gene expression data 44 is collected for multiple tissue types.
  • multivariate analysis can be used to determine the true nature of a complex disease.
  • Multivariate techniques used in this embodiment of the invention are described, in part, in Williams et al., 1999, Am J Hum Genet 65(4): 1134-47; Amos et al., 1990, Am J Hum Genet 47(2): 247-54, and Jiang and Zeng, 1995, Genetics 140:1111-1127.
  • Asthma provides one example of a complex disease that can be studied using expression data from multiple tissue types. Asthma is expected to, in part, be influenced by immune system response not only in lungs but also in blood. By measuring expression of genes in the lung and in blood, the following model could be used to dissect the shared genetic effect in a model system, e.g., $y_{ji} = \mu_i + b_i x_j + d_i z_j + e_{ji}$, where
  • $y_{j1}, \ldots, y_{jm}$ consists of asthma relevant phenotypes, expression data for gene expression in the lung and expression data for gene expression in blood
  • $x_j$ is the number of QTL alleles from a specific parental line
  • $z_j$ is 1 if the individual is heterozygous for the QTL and 0 otherwise
  • $\mu_i$ represents the mean for phenotype i
  • $b_i$ and $d_i$ represent the additive and dominance effects of the QTL on phenotype i
  • $e_{ji}$ is the residual error for individual j and phenotype i.
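The genotype coding used by the terms defined above can be sketched as follows, assuming a linear form in which the expected value of a phenotype is the mean plus an additive term times the allele count plus a dominance term times the heterozygote indicator. The two-letter genotype strings, the allele label "A" for the parental line of interest, and the function names are hypothetical.

```python
def code_genotype(genotype, parental_allele="A"):
    """Return (x, z) for a two-allele genotype string such as "AB".

    x: number of QTL alleles from the chosen parental line (0, 1, or 2)
    z: 1 if the individual is heterozygous at the QTL, else 0
    """
    x = genotype.count(parental_allele)
    z = 1 if genotype[0] != genotype[1] else 0
    return x, z

def expected_phenotype(mu, b, d, genotype):
    """Expected phenotype under the assumed model mu + b*x + d*z
    (residual error omitted)."""
    x, z = code_genotype(genotype)
    return mu + b * x + d * z
```

In a multivariate setting each phenotype (clinical trait, lung expression, blood expression) would have its own mean, additive effect, and dominance effect, with the same genotype coding shared across phenotypes.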
  • kits for determining genes that are causal for traits contain microarrays, such as those described in Subsections below.
  • the microarrays contained in such kits comprise a solid phase, e.g., a surface, to which probes are hybridized or bound at a known location of the solid phase.
  • these probes consist of nucleic acids of known, different sequence, with each nucleic acid being capable of hybridizing to an RNA species or to a cDNA species derived therefrom.
  • the probes contained in the kits of this invention are nucleic acids capable of hybridizing specifically to nucleic acid sequences derived from RNA species in cells collected from an organism of interest.
  • a kit of the invention also contains one or more databases described above and in Fig. 1, encoded on computer readable medium, and/or an access authorization to use the databases described above from a remote networked computer.
  • a kit of the invention further contains software capable of being loaded into the memory of a computer system such as the one described supra, and illustrated in Fig. 1. The software contained in the kit of this invention is essentially identical to the software described above in conjunction with Fig. 1.
  • TRANSCRIPT ASSAY USING MICROARRAYS
  • The techniques described in this section are particularly useful for the determination of the expression state or the transcriptional state of a cell or cell type or any other cell sample by monitoring expression profiles. These techniques include the provision of polynucleotide probe arrays that can be used to provide simultaneous determination of the expression levels of a plurality of genes. These techniques further provide methods for designing and making such polynucleotide probe arrays.
  • the expression level of a nucleotide sequence in a gene can be measured by any high throughput technique. However measured, the result is either the absolute or relative amounts of transcripts or response data, including but not limited to values representing abundances or abundance ratios.
  • transcript arrays which are described in this subsection.
  • "transcript arrays” or “profiling arrays” are used.
  • Transcript arrays can be employed for analyzing the expression profile in a cell sample and especially for measuring the expression profile of a cell sample of a particular tissue type or developmental state or exposed to a drug of interest.
  • an expression profile is obtained by hybridizing detectably labeled polynucleotides representing the nucleotide sequences in mRNA transcripts present in a cell (e.g., fluorescently labeled cDNA synthesized from total cell mRNA) to a microarray.
  • a microarray is an array of positionally-addressable binding (e.g., hybridization) sites on a support for representing many of the nucleotide sequences in the genome of a cell or organism, preferably most or almost all of the genes. Each of such binding sites consists of polynucleotide probes bound to the predetermined region on the support.
  • Microarrays can be made in a number of ways, of which several are described herein below. However produced, microarrays share certain characteristics. The arrays are reproducible, allowing multiple copies of a given array to be produced and easily compared with each other.
  • the microarrays are made from materials that are stable under binding (e.g., nucleic acid hybridization) conditions.
  • Microarrays are preferably small, e.g., between 1 cm² and 25 cm², preferably 1 to 3 cm². However, both larger and smaller arrays are also contemplated and may be preferable, e.g., for simultaneously evaluating a very large number or very small number of different probes.
  • a given binding site or unique set of binding sites in the microarray will specifically bind (e.g., hybridize) to a nucleotide sequence in a single gene from a cell or organism (e.g., to an exon of a specific mRNA or a specific cDNA derived therefrom).
  • the microarrays used can include one or more test probes, each of which has a polynucleotide sequence that is complementary to a subsequence of RNA or DNA to be detected.
  • Each probe typically has a different nucleic acid sequence, and the position of each probe on the solid surface of the array is usually known.
  • the microarrays are preferably addressable arrays, more preferably positionally addressable arrays.
  • Each probe of the array is preferably located at a known, predetermined position on the solid support so that the identity (i.e., the sequence) of each probe can be determined from its position on the array (i.e., on the support or surface).
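  • A minimal sketch of positional addressing, assuming a simple row-major grid: each probe is assigned a fixed (row, column) position, so its identity can be looked up from the position alone. The probe names, the sequences, and the helpers `build_layout` and `probe_at` are hypothetical and are not taken from the source.

```python
from typing import Dict, Tuple

Position = Tuple[int, int]


def build_layout(probes: Dict[str, str], n_cols: int) -> Dict[Position, str]:
    """Assign each probe ID a fixed (row, column) position in insertion order."""
    layout: Dict[Position, str] = {}
    for index, probe_id in enumerate(probes):
        # divmod gives (row, column) for a row-major grid with n_cols columns
        layout[divmod(index, n_cols)] = probe_id
    return layout


def probe_at(layout: Dict[Position, str], probes: Dict[str, str],
             position: Position) -> Tuple[str, str]:
    """Recover the probe ID and its sequence from a position on the support."""
    probe_id = layout[position]
    return probe_id, probes[probe_id]


# Illustrative probes only (hypothetical names and sequences).
probes = {
    "exon1_p1": "ACGTACGTAC",
    "exon1_p2": "TTGACCGTAA",
    "exon2_p1": "GGCATCGATC",
}
layout = build_layout(probes, n_cols=2)
```

  • Here `probe_at(layout, probes, (1, 0))` identifies the third probe from its position; a real array design would also record which gene or exon each probe targets.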
  • the arrays are ordered arrays.
  • the density of probes on a microarray or a set of microarrays is 100 different (i.e., non-identical) probes per 1 cm² or higher. More preferably, a microarray used in the methods of the invention will have at least 550 probes per 1 cm², at least 1,000 probes per 1 cm², at least 1,500 probes per 1 cm² or at least 2,000 probes per 1 cm². In a particularly preferred embodiment, the microarray is a high density array, preferably having a density of at least 2,500 different probes per 1 cm².
  • the microarrays used in the invention therefore preferably contain at least 2,500, at least 5,000, at least 10,000, at least 15,000, at least 20,000, at least 25,000, at least 50,000 or at least 55,000 different (i.e., non-identical) probes.
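  • The relationship between probe density, array area, and probe count is simple arithmetic. The sketch below assumes a uniform density over the whole array; the function names are hypothetical.

```python
def probe_count(density_per_cm2: float, area_cm2: float) -> int:
    """Number of distinct probes on an array of the given area at a uniform density."""
    return int(density_per_cm2 * area_cm2)


def area_needed(target_probes: int, density_per_cm2: float) -> float:
    """Array area (cm^2) required to hold target_probes at the given density."""
    return target_probes / density_per_cm2
```

  • Under this assumption, a 1 cm² array at 2,500 probes per cm² carries 2,500 probes, and 55,000 probes at the same density need 22 cm², which lies within the 1 to 25 cm² size range given in the text.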
  • the microarray is an array (e.g., a matrix) in which each position represents a discrete binding site for a nucleotide sequence of a transcript encoded by a gene (e.g., for an exon of an mRNA or a cDNA derived therefrom).
  • the collection of binding sites on a microarray contains sets of binding sites for a plurality of genes.
  • the microarrays of the invention can comprise binding sites for products encoded by fewer than 50% of the genes in the genome of an organism.
  • the microarrays of the invention can have binding sites for the products encoded by at least 50%, at least 75%, at least 85%, at least 90%, at least 95%, at least 99% or 100% of the genes in the genome of an organism.
  • the microarrays of the invention can have binding sites for products encoded by fewer than 50%, by at least 50%, by at least 75%, by at least 85%, by at least 90%, by at least 95%, by at least 99% or by 100% of the genes expressed by a cell of an organism.
  • the binding site can be a DNA or DNA analog to which a particular RNA can specifically hybridize.
  • the DNA or DNA analog can be, e.g., a synthetic oligomer or a gene fragment, e.g. corresponding to an exon.
  • a gene or an exon in a gene is represented in the profiling arrays by a set of binding sites comprising probes with different polynucleotides that are complementary to different sequence segments of the gene or the exon.
  • Such polynucleotides are preferably of the length of 15 to 200 bases, more preferably of the length of 20 to 100 bases, most preferably 40-60 bases.
  • Each probe sequence may also comprise linker sequences in addition to the sequence that is complementary to its target sequence.
  • a linker sequence is a sequence between the sequence that is complementary to its target sequence and the surface of support.
  • the profiling arrays of the invention comprise one probe specific to each target gene or exon.
  • the profiling arrays may contain at least 2, 5, 10, 100, or 1000 or more probes specific to some target genes or exons.
  • the array may contain probes tiled across the sequence of the longest mRNA isoform of a gene at single base steps.
  • a set of polynucleotide probes of successive overlapping sequences, i.e., tiled sequences, across the genomic region containing the longest variant of an exon can be included in the exon profiling arrays.
  • the set of polynucleotide probes can comprise successive overlapping sequences at steps of a predetermined base interval, e.g., at steps of 1, 5, or 10 bases, that span, or are tiled across, the mRNA containing the longest variant.
  • Such sets of probes therefore can be used to scan the genomic region containing all variants of an exon to determine the expressed variant or variants of the exon.
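  • Tiling at a fixed base interval can be sketched as follows. This is an illustration under simple assumptions (an uppercase A/C/G/T target and a fixed probe length), not the patent's design procedure; `tile_probes` and `reverse_complement` are hypothetical helper names.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_probes(target: str, probe_length: int, step: int) -> List[Tuple[int, str]]:
    """Return (start, probe) pairs for overlapping probes tiled across target.

    Each probe is the reverse complement of a target window, so that it can
    hybridize to the target; consecutive windows start `step` bases apart.
    """
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = len(target) - probe_length
    # range() is empty when the target is shorter than one probe
    return [
        (start, reverse_complement(target[start:start + probe_length]))
        for start in range(0, last_start + 1, step)
    ]
```

  • With a step of 1 every base offset is covered, as in the single-base tiling described above; steps of 5 or 10 trade resolution for fewer probes.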
  • a set of polynucleotide probes comprising exon specific probes and/or variant junction probes can be included in the exon profiling array.
  • a variant junction probe refers to a probe specific to the junction region of the particular exon variant and the neighboring exon.
  • the probe set contains variant junction probes specifically hybridizable to each of all different splice junction sequences of the exon.
  • the probe set contains exon specific probes specifically hybridizable to the common sequences in all different variants of the exon, and/or variant junction probes specifically hybridizable to the different splice junction sequences of the exon.
  • an exon is represented in the exon profiling arrays by a probe comprising a polynucleotide that is complementary to the full length exon.
  • an exon is represented by a single binding site on the profiling arrays.
  • an exon is represented by one or more binding sites on the profiling arrays, each of the binding sites comprising a probe with a polynucleotide sequence that is complementary to an RNA fragment that is a substantial portion of the target exon.
  • the lengths of such probes are normally between 15-600 bases, preferably between 20-200 bases, more preferably between 30-100 bases, and most preferably between 40-80 bases.
  • the average length of an exon is about 200 bases (see, e.g., Lewin, Genes V, Oxford
  • a probe of length of 40-80 allows more specific binding of the exon than a probe of shorter length, thereby increasing the specificity of the probe to the target exon.
  • one or more targeted exons may have sequence lengths less than 40-80 bases. In such cases, if probes with sequences longer than the target exons are to be used, it may be desirable to design probes comprising sequences that include the entire target exon flanked by sequences from the adjacent constitutively spliced exon or exons such that the probe sequences are complementary to the corresponding sequence segments in the mRNAs.
  • using flanking sequence from the adjacent constitutively spliced exon or exons rather than the genomic flanking sequences, i.e., intron sequences, permits comparable hybridization stringency with other probes of the same length.
  • the flanking sequences used are from the adjacent constitutively spliced exon or exons that are not involved in any alternative pathways. More preferably the flanking sequences used do not comprise a significant portion of the sequence of the adjacent exon or exons so that cross-hybridization can be minimized.
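  • One way to picture the flanking strategy for a short exon is sketched below. It pads the target exon with bases from the adjacent exons (not introns) until the mRNA segment reaches the probe length. The even split between upstream and downstream padding, and the function name `padded_target_segment`, are assumptions made for illustration; the source does not specify how the flanking bases are apportioned.

```python
def padded_target_segment(upstream_exon: str, target_exon: str,
                          downstream_exon: str, probe_length: int) -> str:
    """Return an mRNA segment of probe_length that contains the whole target exon.

    Padding comes from the 3' end of the upstream exon and the 5' end of the
    downstream exon, split as evenly as the available sequence allows.
    """
    if len(target_exon) >= probe_length:
        # Simplification: a long exon needs no flanking sequence.
        return target_exon[:probe_length]
    extra = probe_length - len(target_exon)
    left = min(extra // 2, len(upstream_exon))
    right = min(extra - left, len(downstream_exon))
    # If the downstream exon was too short, take the remainder upstream.
    left = min(extra - right, len(upstream_exon))
    if left + right < extra:
        raise ValueError("adjacent exons are too short for this probe length")
    upstream_part = upstream_exon[len(upstream_exon) - left:]
    return upstream_part + target_exon + downstream_exon[:right]
```

  • The probe itself would be the complement of the returned segment. Keeping the padding short on each side follows the stated preference that flanking sequences not cover a significant portion of the adjacent exons.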
  • probes comprising flanking sequences in different alternatively spliced mRNAs are designed so that expression level of the exon expressed in different alternatively spliced mRNAs can be measured.
  • the DNA array or set of arrays can also comprise probes that are complementary to sequences spanning the junction regions of two adjacent exons.
  • such probes comprise sequences from the two exons which are not substantially overlapped with probes for each individual exon so that cross-hybridization can be minimized.
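  • A junction-spanning target segment can be sketched by joining the end of one exon to the start of the next. Centering the segment on the junction is an assumption for illustration, and `junction_segment` is a hypothetical name.

```python
def junction_segment(exon_a: str, exon_b: str, probe_length: int) -> str:
    """Return a segment spanning the exon_a/exon_b splice junction.

    Takes up to half of probe_length from the 3' end of exon_a and fills the
    rest from the 5' end of exon_b.
    """
    half = min(probe_length // 2, len(exon_a))
    left = exon_a[len(exon_a) - half:]
    right = exon_b[:probe_length - len(left)]
    return left + right
```

  • Because such a segment exists only in mRNAs where the two exons are spliced together, a probe complementary to it reports on that particular splice junction.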
  • Probes that comprise sequences from more than one exon are useful in distinguishing alternative splicing pathways and/or expression of duplicated exons in separate genes if the exons occur in one or more alternatively spliced mRNAs and/or one or more separated genes that contain the duplicated exons but not in other alternatively spliced mRNAs and/or other genes that contain the duplicated exons.
  • any of the probe schemes, supra, can be combined on the same profiling array and/or on different arrays within the same set of profiling arrays so that a more accurate determination of the expression profile for a plurality of genes can be accomplished.
  • the different probe schemes can also be used for different levels of accuracies in profiling. For example, a profiling array or array set comprising a small set of probes for each exon may be used to determine the relevant genes and/or RNA splicing pathways under certain specific conditions. An array or array set comprising larger sets of probes for the exons that are of interest is then used to more accurately determine the exon expression profile under such specific conditions.
  • the microarrays used in the invention have binding sites (i.e., probes) for sets of exons for one or more genes relevant to the action of a drug of interest or in a biological pathway of interest.
  • a "gene" is identified as a portion of DNA that is transcribed by RNA polymerase, which may include a 5' untranslated region ("UTR"), introns, exons and a 3' UTR.
  • the number of genes in a genome can be estimated from the number of mRNAs expressed by the cell or organism, or by extrapolation of a well characterized portion of the genome.
  • the number of ORFs can be determined and mRNA coding regions identified by analysis of the DNA sequence.
  • the genome of Saccharomyces cerevisiae has been completely sequenced and is reported to have approximately 6275 ORFs encoding sequences longer than 99 amino acid residues in length. Analysis of these ORFs indicates that there are 5,885 ORFs that are likely to encode protein products (Goffeau et al, 1996, Science 274: 546-567).
  • the human genome is estimated to contain approximately 30,000 to 130,000 genes (see Crollius et al, 2000, Nature Genetics 25:235-238; Ewing et al, 2000, Nature Genetics 25:232-234).
  • Genome sequences for other organisms including but not limited to Drosophila, C. elegans, plants, e.g., rice and Arabidopsis, and mammals, e.g., mouse and human, are also completed or nearly completed.
  • an array set comprising in total probes for all known or predicted exons in the genome of an organism is provided.
  • the present invention provides an array set comprising one or two probes for each known or predicted exon in the human genome.
  • when detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is hybridized to a microarray, the site on the array corresponding to an exon of a gene (i.e., capable of specifically binding the product or products of the gene expressing the exon) that is not transcribed or is removed during RNA splicing in the cell will have little or no signal (e.g., fluorescent signal), and the site corresponding to an exon of a gene for which the encoded mRNA expressing the exon is prevalent will have a relatively strong signal.
  • the relative abundance of different mRNAs produced from the same gene by alternative splicing is then determined by the signal strength pattern across the whole set of exons monitored for the gene.
  • cDNAs from cell samples from two different conditions are hybridized to the binding sites of the microarray using a two-color protocol.
  • for drug responses, one cell sample is exposed to a drug and another cell sample of the same type is not exposed to the drug.
  • for pathway responses, one cell is exposed to a pathway perturbation and another cell of the same type is not exposed to the pathway perturbation.
  • the cDNA derived from each of the two cell types is differently labeled (e.g., with Cy3 and Cy5) so that the two can be distinguished.
  • cDNA from a cell treated with a drug is synthesized using a fluorescein-labeled dNTP.
  • cDNA from a second cell, not drug-exposed, is synthesized using a rhodamine-labeled dNTP.
  • the cDNA from the drug-treated (or pathway perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from the untreated cell will fluoresce red.
  • when the drug treatment has no effect, either directly or indirectly, on the transcription and/or post-transcriptional splicing of a particular gene in a cell, the exon expression patterns will be indistinguishable in both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be equally prevalent.
  • the binding site(s) for that species of RNA will emit wavelengths characteristic of both fluorophores.
  • the exon expression pattern, as represented by the ratio of green to red fluorescence for each exon binding site, will change.
  • when the drug increases the prevalence of an mRNA, the ratio for each exon expressed in the mRNA will increase, whereas when the drug decreases the prevalence of an mRNA, the ratio for each exon expressed in the mRNA will decrease.
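  • The green-to-red comparison can be sketched as a per-exon log ratio. The log2 transform and the intensity floor that guards against division by zero are assumptions added for illustration; the source speaks only of the ratio of green to red fluorescence. The function name and intensity values are hypothetical.

```python
import math
from typing import Dict


def log2_green_red(green: Dict[str, float], red: Dict[str, float],
                   floor: float = 1.0) -> Dict[str, float]:
    """Per-exon log2(green / red), with intensities clipped at `floor`.

    green: signal from the treated sample (fluorescein-labeled in the text)
    red:   signal from the untreated sample (rhodamine-labeled in the text)
    """
    return {
        exon: math.log2(max(green[exon], floor) / max(red[exon], floor))
        for exon in green
    }


# Illustrative intensities only (not measured data).
green = {"exon1": 800.0, "exon2": 100.0, "exon3": 200.0}
red = {"exon1": 200.0, "exon2": 100.0, "exon3": 800.0}
ratios = log2_green_red(green, red)
```

  • A positive value indicates that the exon is more prevalent in the treated sample, a negative value that it is less prevalent, and a value near zero that the treatment had no detectable effect on that exon.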
  • cDNA from a single cell and compare, for example, the absolute amount of a particular exon in, e.g., a drug-treated or pathway-perturbed cell and an untreated cell.
  • labeling with more than two colors is also contemplated in the present invention. In some embodiments of the invention, at least 5, 10, 20, or 100 dyes of different colors can be used for labeling. Such labeling permits simultaneous hybridizing of the distinguishably labeled cDNA populations to the same array, and thus measuring, and optionally comparing the expression levels of, mRNA molecules derived from more than two samples.
  • Dyes that can be used include, but are not limited to, fluorescein and its derivatives, rhodamine and its derivatives, Texas Red, 5'-carboxy-fluorescein ("FMA"), 2',7'-dimethoxy-4',5'-dichloro-6-carboxy-fluorescein ("JOE"), N,N,N',N'-tetramethyl-6-carboxy-rhodamine ("TAMRA"), 6'-carboxy-X-rhodamine ("ROX"), HEX, TET, IRD40, and IRD41; cyanine dyes, including but not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes, including but not limited to BODIPY-FL, BODIPY-TR, BODIPY-TMR, BODIPY-630/650, and BODIPY-650/670; and ALEXA dyes, including but not limited to ALEXA-488, ALEXA-532, ALE
  • hybridization data are measured at a plurality of different hybridization times so that the evolution of hybridization levels to equilibrium can be determined.
  • hybridization levels are most preferably measured at hybridization times spanning the range from 0 to in excess of what is required for sampling of the bound polynucleotides (i.e., the probe or probes) by the labeled polynucleotides so that the mixture is close to or has substantially reached equilibrium, and duplexes are at concentrations dependent on affinity and abundance rather than diffusion.
  • the hybridization times are preferably short enough that irreversible binding interactions between the labeled polynucleotide and the probes and/or the surface do not occur, or are at least limited.
  • hybridization times may be approximately 0-72 hours. Appropriate hybridization times for other embodiments will depend on the particular polynucleotide sequences and probes used, and may be determined by those skilled in the art (see, e.g., Sambrook et al, Eds., 1989, Molecular Cloning: A Laboratory Manual, 2nd ed., Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York). In one embodiment, hybridization levels at different hybridization times are measured separately on different, identical microarrays.
  • the microarray is washed briefly, preferably at room temperature, in an aqueous solution of high to moderate salt concentration (e.g., 0.5 to 3 M salt concentration) under conditions which retain all bound or hybridized polynucleotides while removing all unbound polynucleotides.
  • the detectable label on the remaining, hybridized polynucleotide molecules on each probe is then measured by a method which is appropriate to the particular labeling method used.
  • the resulting hybridization levels are then combined to form a hybridization curve.
  • hybridization levels are measured in real time using a single microarray.
  • the microarray is allowed to hybridize to the sample without interruption and the microarray is interrogated at each hybridization time in a non-invasive manner.
  • one can use one array: hybridize for a short time, wash and measure the hybridization level, return the array to the same sample, hybridize for another period of time, and wash and measure again to obtain the hybridization time curve.
  • at least two hybridization levels at two different hybridization times are measured, a first one at a hybridization time that is close to the time scale of cross-hybridization equilibrium and a second one measured at a hybridization time that is longer than the first one.
  • the time scale of cross-hybridization equilibrium depends, inter alia, on sample composition and probe sequence and may be determined by one skilled in the art.
  • the first hybridization level is measured at between 1 and 10 hours, whereas the second hybridization level is measured at a hybridization time that is 2, 4, 6, 10, 12, 16, 18, 48 or 72 times as long as the first hybridization time.
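  • The two-time-point scheme and the assembly of a hybridization curve can be sketched as below. The helper names and the validation rules are hypothetical; they merely encode the ranges stated in the text.

```python
from typing import Iterable, List, Tuple

# Multiples of the first hybridization time listed in the text.
ALLOWED_MULTIPLES = (2, 4, 6, 10, 12, 16, 18, 48, 72)


def second_time(first_hours: float, multiple: int) -> float:
    """Return the second measurement time for a given first time and multiple."""
    if not 1 <= first_hours <= 10:
        raise ValueError("first hybridization time should be between 1 and 10 hours")
    if multiple not in ALLOWED_MULTIPLES:
        raise ValueError("multiple is not one of the listed values")
    return first_hours * multiple


def hybridization_curve(
    readings: Iterable[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Combine (hours, level) readings into a curve ordered by hybridization time."""
    return sorted(readings, key=lambda reading: reading[0])
```

  • For example, a first reading at 4 hours with a multiple of 4 places the second reading at 16 hours; readings taken on separate identical arrays or on one re-hybridized array can then be merged with `hybridization_curve`.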
  • the "probe" to which a particular polynucleotide molecule, such as an exon, specifically hybridizes according to the invention is a complementary polynucleotide sequence.
  • one or more probes are selected for each target exon.
  • the probes normally comprise nucleotide sequences greater than 40 bases in length.
  • the probes normally comprise nucleotide sequences of 40-60 bases.
  • the probes can also comprise sequences complementary to full length exons.
  • the lengths of exons can range from less than 50 bases to more than 200 bases. Therefore, when a probe length longer than exon is to be used, it is preferable to augment the exon sequence with adjacent constitutively spliced exon sequences such that the probe sequence is complementary to the continuous mRNA fragment that contains the target exon. This will allow comparable hybridization stringency among the probes of an exon profiling array. It will be understood that each probe sequence may also comprise linker sequences in addition to the sequence that is complementary to its target sequence.
  • the probes may comprise DNA or DNA "mimics" (e.g., derivatives and analogues) corresponding to a portion of each exon of each gene in an organism's genome.
  • the probes of the microarray are complementary RNA or RNA mimics.
  • DNA mimics are polymers composed of subunits capable of specific, Watson-Crick-like hybridization with DNA, or of specific hybridization with RNA.
  • the nucleic acids can be modified at the base moiety, at the sugar moiety, or at the phosphate backbone.
  • Exemplary DNA mimics include, e.g., phosphorothioates.
  • DNA can be obtained, e.g., by polymerase chain reaction (PCR) amplification of exon segments from genomic DNA, cDNA (e.g., by RT-PCR), or cloned sequences.
  • PCR primers are preferably chosen based on known sequence of the exons or cDNA that result in amplification of unique fragments (i.e., fragments that do not share more than 10 bases of contiguous identical sequence with any other fragment on the microarray).
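  • The uniqueness criterion (no more than 10 bases of contiguous identical sequence shared with any other fragment) can be checked with a k-mer comparison. This is a sketch under stated assumptions: it compares sequences on one strand only and ignores reverse complements, and the function names are hypothetical.

```python
from typing import Dict, List


def shares_long_run(a: str, b: str, max_shared: int = 10) -> bool:
    """True if a and b share MORE than max_shared contiguous identical bases."""
    k = max_shared + 1  # the shortest run that violates the criterion
    if len(a) < k or len(b) < k:
        return False
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_a for i in range(len(b) - k + 1))


def unique_fragments(fragments: Dict[str, str], max_shared: int = 10) -> List[str]:
    """Return names of fragments that pass the criterion against all others."""
    names = list(fragments)
    return [
        name for name in names
        if not any(
            shares_long_run(fragments[name], fragments[other], max_shared)
            for other in names if other != name
        )
    ]
```

  • Primer design software would apply a check of this kind across all candidate amplicons; the pairwise loop here is quadratic in the number of fragments and is meant only to make the criterion concrete.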
  • Computer programs that are well known in the art are useful in the design of primers with the required specificity and optimal amplification properties, such as Oligo version 5.0 (National Biosciences).
  • each probe on the microarray will be between 20 bases and 600 bases, and usually between 30 and 200 bases in length.
  • PCR methods are well known in the art, and are described, for example, in Innis et al, eds., 1990, PCR Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, CA.
  • An alternative, preferred means for generating the polynucleotide probes of the microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using H-phosphonate or phosphoramidite chemistries (Froehler et al, 1986, Nucleic Acid Res. 14:5399-5407; McBride et al, 1983, Tetrahedron Lett. 24:246-248). Synthetic sequences are typically between 15 and 600 bases in length, more typically between 20 and 100 bases, most preferably between 40 and 70 bases in length.
  • synthetic nucleic acids include non-natural bases, such as, but by no means limited to, inosine.
  • nucleic acid analogues may be used as binding sites for hybridization.
  • An example of a suitable nucleic acid analogue is peptide nucleic acid (see, e.g., Egholm et al, 1993, Nature 363:566-568; and U.S. Patent No. 5,539,083).
  • the hybridization sites (i.e., the probes) are made from plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts therefrom (Nguyen et al, 1995, Genomics 29:207-209).
  • polynucleotide probes can be deposited on a support to form the array.
  • polynucleotide probes can be synthesized directly on the support to form the array.
  • the probes are attached to a solid support or surface, which may be made, e.g., from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or other porous or nonporous material.
  • a preferred method for attaching the nucleic acids to a surface is by printing on glass plates, as is described generally by Schena et al, 1995, Science 270:467-470.
  • microarrays of cDNA (see also DeRisi et al, 1996, Nature Genetics 14:457-460; Shalon et al, 1996, Genome Res. 6:639-645; and Schena et al, 1995, Proc. Natl Acad. Sci. U.S.A. 95:10539-11286).
  • a second preferred method for making microarrays is by making high-density polynucleotide arrays.
  • oligonucleotides e.g., 60-mers
  • the array produced can be redundant, with several polynucleotide molecules per exon.
  • Other methods for making microarrays e.g., by masking (Maskos and Southern, 1992, Nucl Acids. Res. 20:1679-1684), may also be used.
  • any type of array for example, dot blots on a nylon hybridization membrane (see Sambrook et al, supra) could be used.
  • very small arrays will frequently be preferred because hybridization volumes will be smaller.
  • microarrays of the invention are manufactured by means of an ink jet printing device for oligonucleotide synthesis, e.g., using the methods and systems described by Blanchard in International Patent Publication No. WO 98/41531, published September 24, 1998; Blanchard et al, 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 111-123; and U.S. Patent No. 6,028,189 to Blanchard.
  • the polynucleotide probes in such microarrays are preferably synthesized in arrays, e.g., on a glass slide, by serially depositing individual nucleotide bases in "microdroplets" of a high surface tension solvent such as propylene carbonate.
  • the microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and are separated from each other on the microarray (e.g., by hydrophobic domains) to form circular surface tension wells which define the locations of the array elements (i.e., the different probes).
  • Polynucleotide probes are normally attached to the surface covalently at the 3' end of the polynucleotide.
  • polynucleotide probes can be attached to the surface covalently at the 5' end of the polynucleotide (see for example, Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 111-123).
  • Target polynucleotides that can be analyzed by the methods and compositions of the invention include RNA molecules such as, but by no means limited to, messenger RNA (mRNA) molecules, ribosomal RNA (rRNA) molecules, cRNA molecules (i.e., RNA molecules prepared from cDNA molecules that are transcribed in vivo) and fragments thereof.
  • Target polynucleotides which may also be analyzed by the methods and compositions of the present invention include, but are not limited to DNA molecules such as genomic DNA molecules, cDNA molecules, and fragments thereof including oligonucleotides, ESTs, STSs, etc.
  • the target polynucleotides can be from any source.
  • the target polynucleotide molecules may be naturally occurring nucleic acid molecules such as genomic or extragenomic DNA molecules isolated from an organism, or RNA molecules, such as mRNA molecules, isolated from an organism.
  • the polynucleotide molecules may be synthesized, including, e.g., nucleic acid molecules synthesized enzymatically in vivo or in vitro, such as cDNA molecules, or polynucleotide molecules synthesized by PCR, RNA molecules synthesized by in vitro transcription, etc.
  • the sample of target polynucleotides can comprise, e.g., molecules of DNA, RNA, or copolymers of DNA and RNA.
  • the target polynucleotides of the invention will correspond to particular genes or to particular gene transcripts (e.g., to particular mRNA sequences expressed in cells or to particular cDNA sequences derived from such mRNA sequences).
  • the target polynucleotides may correspond to particular fragments of a gene transcript.
  • the target polynucleotides may correspond to different exons of the same gene, e.g., so that different splice variants of that gene may be detected and/or analyzed.
  • the target polynucleotides to be analyzed are prepared in vitro from nucleic acids extracted from cells.
  • RNA is extracted from cells (e.g., total cellular RNA, poly(A)+ messenger RNA, or a fraction thereof) and messenger RNA is purified from the total extracted RNA.
  • Methods for preparing total and poly(A)+ RNA are well known in the art, and are described generally, e.g., in Sambrook et al, supra.
  • RNA is extracted from cells of the various types of interest in this invention using guanidinium thiocyanate lysis followed by CsCl centrifugation and an oligo dT purification (Chirgwin et al, 1979, Biochemistry 18:5294-5299).
  • RNA is extracted from cells using guanidinium thiocyanate lysis followed by purification on RNeasy columns (Qiagen).
  • cDNA is then synthesized from the purified mRNA using, e.g., oligo-dT or random primers.
  • the target polynucleotides are cRNA prepared from purified messenger RNA extracted from cells.
  • cRNA is defined here as RNA complementary to the source RNA.
  • the extracted RNAs are amplified using a process in which double-stranded cDNAs are synthesized from the RNAs using a primer linked to an RNA polymerase promoter in a direction capable of directing transcription of anti-sense RNA.
  • Anti-sense RNAs or cRNAs are then transcribed from the second strand of the double-stranded cDNAs using an RNA polymerase (see, e.g., U.S. Patent Nos. 5,891,636, 5,716,785; 5,545,522 and 6,132,997; see also, U.S. Patent No. 6,271,002, and U.S. Provisional Patent Application Serial No.
  • the target polynucleotides are short and/or fragmented polynucleotide molecules which are representative of the original nucleic acid population of the cell.
  • the target polynucleotides to be analyzed by the methods and compositions of the invention are preferably detectably labeled.
  • cDNA can be labeled directly, e.g., with nucleotide analogs, or indirectly, e.g., by making a second, labeled cDNA strand using the first strand as a template.
  • the double-stranded cDNA can be transcribed into cRNA and labeled.
  • the detectable label is a fluorescent label, e.g., by incorporation of nucleotide analogs.
  • radioactive isotopes include ³²P, ³⁵S, ¹⁴C, ¹⁵N and ¹²⁵I.
  • Fluorescent molecules suitable for the present invention include, but are not limited to, fluorescein and its derivatives, rhodamine and its derivatives, Texas Red, 5'-carboxy-fluorescein ("FMA"), 2',7'-dimethoxy-4',5'-dichloro-6-carboxy-fluorescein ("JOE"), N,N,N',N'-tetramethyl-6-carboxy-rhodamine ("TAMRA"), 6'-carboxy-X-rhodamine ("ROX"), HEX, TET, IRD40, and IRD41.
  • Fluorescent molecules that are suitable for the invention further include: cyanine dyes, including but not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes, including but not limited to BODIPY-FL, BODIPY-TR, BODIPY-TMR, BODIPY-
  • Electron rich indicator molecules suitable for the present invention include, but are not limited to, ferritin, hemocyanin, and colloidal gold.
  • the target polynucleotides may be labeled by specifically complexing a first group to the polynucleotide.
  • a second group, covalently linked to an indicator molecule and having an affinity for the first group, can be used to indirectly detect the target polynucleotide.
  • compounds suitable for use as a first group include, but are not limited to, biotin and iminobiotin.
  • Compounds suitable for use as a second group include, but are not limited to, avidin and streptavidin.
  • nucleic acid hybridization and wash conditions are chosen so that the polynucleotide molecules to be analyzed by the invention (referred to herein as the "target polynucleotide molecules") specifically bind or specifically hybridize to the complementary polynucleotide sequences of the array, preferably to a specific array site, wherein the complementary DNA is located.
  • Arrays containing double-stranded probe DNA situated thereon are preferably subjected to denaturing conditions to render the DNA single-stranded prior to contacting with the target polynucleotide molecules.
  • Arrays containing single-stranded probe DNA may need to be denatured prior to contacting with the target polynucleotide molecules, e.g., to remove hairpins or dimers which form due to self-complementary sequences.
  • Optimal hybridization conditions will depend on the length (e.g., oligomer versus polynucleotide greater than 200 bases) and type (e.g., RNA, or DNA) of probe and target nucleic acids.
  • General parameters for specific (i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al, (supra), and in Ausubel et al, 1987,
  • hybridization conditions are also provided in, e.g., Tijssen, 1993, Hybridization With Nucleic Acid Probes, Elsevier Science Publishers B.V. and Kricka, 1992, Nonisotopic DNA Probe Techniques, Academic Press, San Diego, CA.
  • Particularly preferred hybridization conditions for use with the screening and/or signaling chips of the present invention include hybridization at a temperature at or near the mean melting temperature of the probes (e.g., within 5 °C, more preferably within 2 °C) in 1 M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium Sarcosine and 30% formamide.
  • When detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is hybridized to a microarray, the site on the array corresponding to an exon of a gene (i.e., capable of specifically binding the product or products of the gene expressing) that is not transcribed or is removed during RNA splicing in the cell will have little or no signal (e.g., fluorescent signal), and an exon of a gene for which the encoded mRNA expressing the exon is prevalent will have a relatively strong signal.
  • the relative abundance of different mRNAs produced from the same gene by alternative splicing is then determined by the signal strength pattern across the whole set of exons monitored for the gene.
  • target sequences e.g., cDNAs or cRNAs
  • drug responses one cell sample is exposed to a drug and another cell sample of the same type is not exposed to the drug.
  • pathway responses one cell is exposed to a pathway perturbation and another cell of the same type is not exposed to the pathway perturbation.
  • the cDNA or cRNA derived from each of the two cell types are differently labeled so that they can be distinguished.
  • cDNA from a cell treated with a drug is synthesized using a fluorescein-labeled dNTP
  • cDNA from a second cell, not drug-exposed is synthesized using a rhodamine-labeled dNTP.
  • the cDNA from the drug-treated (or pathway perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from the untreated cell will fluoresce red.
  • When the drug treatment has no effect, either directly or indirectly, on the transcription and/or post-transcriptional splicing of a particular gene in a cell, the exon expression patterns will be indistinguishable in both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be equally prevalent.
  • the binding site(s) for that species of RNA will emit wavelengths characteristic of both fluorophores.
  • the exon expression pattern as represented by ratio of green to red fluorescence for each exon binding site will change.
  • When the drug increases the prevalence of an mRNA, the ratios for each exon expressed in the mRNA will increase, whereas when the drug decreases the prevalence of an mRNA, the ratio for each exon expressed in the mRNA will decrease.
  • cDNA from a single cell, and compare, for example, the absolute amount of a particular exon in, e.g., a drug-treated or pathway-perturbed cell and an untreated cell.
  • the fluorescence emissions at each site of a transcript array can be, preferably, detected by scanning confocal laser microscopy.
  • a separate scan, using the appropriate excitation line, is carried out for each of the two fluorophores used.
  • a laser can be used that allows simultaneous specimen illumination at wavelengths specific to the two fluorophores and emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al, 1996, Genome Res.
  • the arrays are scanned with a laser fluorescence scanner with a computer controlled X-Y stage and a microscope objective. Sequential excitation of the two fluorophores is achieved with a multi-line, mixed gas laser, and the emitted light is split by wavelength and detected with two photomultiplier tubes.
  • fluorescence laser scanning devices are described, e.g., in Schena et al, 1996, Genome Res. 6:639-645.
  • the fiber-optic bundle described by Ferguson et al, 1996, Nature Biotech. 14: 1681-1684 may be used to monitor mRNA abundance levels at a large number of sites simultaneously.
  • Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., using a 12 bit analog to digital board.
  • the scanned image is despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image gridding program that creates a spreadsheet of the average hybridization at each wavelength at each site. If necessary, an experimentally determined correction for "cross talk" (or overlap) between the channels for the two fluors may be made.
  • a ratio of the emission of the two fluorophores can be calculated. The ratio is independent of the absolute expression level of the cognate gene, but is useful for genes whose expression is significantly modulated by drug administration, gene deletion, or any other tested event.
  • the relative abundance of an mRNA and/or an exon expressed in an mRNA in two cells or cell lines is scored as perturbed (i.e., the abundance is different in the two sources of mRNA tested) or as not perturbed (i.e., the relative abundance is the same).
  • a difference between the two sources of RNA of at least a factor of 25% (e.g., RNA is 25% more abundant in one source than in the other source), more usually 50%, and even more often a factor of 2 (twice as abundant), 3 (three times as abundant), or 5 (five times as abundant), is scored as a perturbation.
  • Present detection methods allow reliable detection of differences on the order of 1.5-fold to 3-fold.
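The two-channel scoring described above can be sketched in a few lines. This is a minimal illustration only, not the implementation used in the invention: the function name `score_perturbation`, the default threshold, and the assumption that both intensities are already background-corrected and cross-talk-corrected are hypothetical choices made for the example.

```python
def score_perturbation(green: float, red: float, min_fold: float = 2.0) -> str:
    """Score one array site as 'perturbed' or 'not perturbed'.

    green, red: corrected fluorescence intensities for the two fluorophores
                (assumed inputs; correction is not performed here).
    min_fold:   smallest fold difference scored as a perturbation,
                e.g. 1.25 (25%), 1.5 (50%), 2, 3, or 5.
    """
    if green <= 0 or red <= 0:
        raise ValueError("intensities must be positive")
    ratio = green / red
    # Use a symmetric fold change so that increases and decreases
    # of the same magnitude are treated alike (always >= 1).
    fold = max(ratio, 1.0 / ratio)
    return "perturbed" if fold >= min_fold else "not perturbed"
```

For instance, `score_perturbation(100.0, 130.0, min_fold=1.25)` treats a 30% difference as a perturbation, while the default factor-of-2 threshold would not.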
  • the transcriptional state of a cell can be measured by other gene expression technologies known in the art.
  • Several such technologies produce pools of restriction fragments of limited complexity for electrophoretic analysis, such as methods combining double restriction enzyme digestion with phasing primers (see, e.g., European Patent 0 534858 A1, filed September 24, 1992, by Zabeau et al), or methods selecting restriction fragments with sites closest to a defined mRNA end (see, e.g., Prashar et al, 1996, Proc. Natl. Acad. Sci. USA 93:659-663).
  • Other methods statistically sample cDNA pools, such as by sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to identify each cDNA, or by sequencing short tags (e.g., 9-10 bases) that are generated at known positions relative to a defined mRNA end (see, e.g., Velculescu, 1995, Science 270:484-487).
  • aspects of the biological state other than the transcriptional state such as the translational state, the activity state, or mixed aspects can be measured.
  • gene expression data can include translational state measurements or even protein expression measurements. Details of embodiments in which aspects of the biological state other than the transcriptional state are measured are described in this section.
  5.9.1. TRANSLATIONAL STATE MEASUREMENTS
  • Measurement of the translational state can be performed according to several methods.
  • whole genome monitoring of protein can be carried out by constructing a microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome.
  • antibodies are present for a substantial fraction of the encoded proteins, or at least for those proteins relevant to the action of a drug of interest.
  • Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane, 1988, Antibodies: A Laboratory Manual, Cold Spring Harbor, New York, which is incorporated in its entirety for all purposes).
  • monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequence of the cell.
  • proteins from the cell are contacted to the array and their binding is assayed with assays known in the art.
  • proteins can be separated by two-dimensional gel electrophoresis systems.
  • Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al, 1996, Proc. Natl. Acad. Sci.
  • the resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, Western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N-terminal micro-sequencing. Using these techniques, it is possible to identify a substantial fraction of all the proteins produced under given physiological conditions, including in cells (e.g., in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.
  • the methods of the invention are applicable to any cellular constituent that can be monitored.
  • Activity measurements can be performed by any functional, biochemical, or physical means appropriate to the particular activity being characterized.
  • the activity involves a chemical transformation
  • the cellular protein can be contacted with the natural substrate(s), and the rate of transformation measured.
  • association in multimeric units for example association of an activated DNA binding complex with DNA
  • the amount of associated protein or secondary consequences of the association such as amounts of mRNA transcribed, can be measured.
  • cellular constituent measurements are derived from cellular phenotypic techniques.
  • One such cellular phenotypic technique uses cell respiration as a universal reporter.
  • A 96-well microtiter plate, in which each well contains its own unique chemistry, is provided.
  • Each unique chemistry is designed to test a particular phenotype.
  • Cells from the organism 46 (Fig. 1) of interest are pipetted into each well. If the cells exhibit the appropriate phenotype, they will respire and actively reduce a tetrazolium dye, forming a strong purple color. A weak phenotype results in a lighter color. No color means that the cells don't have the specific phenotype. Color changes may be recorded as often as several times each hour. During one incubation, more than 5,000 phenotypes can be tested. See, for example, Bochner et al, 2001, Genome Research 11, 1246-55.
  • the cellular constituents that are measured are metabolites.
  • Metabolites include, but are not limited to, amino acids, metals, soluble sugars, sugar phosphates, and complex carbohydrates.
  • Such metabolites can be measured, for example, at the whole-cell level using methods such as pyrolysis mass spectrometry (Irwin, 1982, Analytical Pyrolysis: A Comprehensive Guide, Marcel Dekker, New York; Meuzelaar et al, 1982, Pyrolysis Mass Spectrometry of Recent and Fossil Biomaterials, Elsevier, Amsterdam), Fourier-transform infrared spectrometry (Griffiths and de Haseth, 1986, Fourier transform infrared spectrometry, John Wiley, New York; Helm et al, 1991, J.
  TARGET VALIDATION
  • The methods of the present invention can be used to associate a cellular constituent with a complex trait.
  • This section discloses techniques that can be used to validate such cellular constituents identified using the techniques of the present invention.
  • gene knock-out / knock-in mice or transgenic mice are employed for such validation.
  • in vivo siRNA is used to validate such genes. See, for example, Cohen et al, 1997, J. Clin. Invest. 99, p. 1906; Xia, et al, 2002, Nature Biotechnology 20, p. 1006; Hannon, 2002, Nature 418, p.
  • association studies can be carried out in human populations to provide a source of validation in humans.
  • the term “complex trait” refers to any clinical trait T that does not exhibit classic Mendelian inheritance.
  • the term “complex trait” refers to a trait that is affected by two or more gene loci.
  • the term “complex trait” refers to a trait that is affected by two or more gene loci in addition to one or more factors including, but not limited to, age, sex, habits, and environment. See, for example, Lander and Schork, 1994, Science 265: 2037.
  • Such "complex” traits include, but are not limited to, susceptibilities to heart disease, hypertension, diabetes, obesity, cancer, and infection.
  • a complex trait is one in which there exists no genetic marker that shows perfect cosegregation with the trait due to incomplete penetrance, phenocopy, and/or nongenetic factors (e.g., age, sex, environment, and the effects of other genes). Incomplete penetrance means that some individuals who inherit a predisposing allele may not manifest the disease. Phenocopy means that some individuals who inherit no predisposing allele can nonetheless get the disease as a result of environmental or random causes.
  • the genotype at a given locus may affect the probability of disease, but not fully determine the outcome.
  • the penetrance function f(G), specifying the probability of disease for each genotype G, may also depend on nongenetic factors such as age, sex, environment, and other genes. For example, the risk of breast cancer by ages 40, 55, and 80 is 37%, 66%, and 85% in a woman carrying a mutation at the BRCA1 locus as compared with 0.4%, 3%, and 8% in a noncarrier (Easton et al, 1993, Cancer Surv. 18: 1995; Ford et al, 1994, Lancet 343: 692).
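To make the age dependence of the penetrance function concrete, the risk figures quoted above can be placed in a small lookup table. This is an illustrative sketch only: the names `PENETRANCE_PERCENT`, `penetrance`, and `relative_risk` are hypothetical, and the table holds only the three ages quoted in the text.

```python
# Cumulative breast cancer risk (percent) by age, as quoted in the text.
PENETRANCE_PERCENT = {
    "carrier": {40: 37.0, 55: 66.0, 80: 85.0},
    "noncarrier": {40: 0.4, 55: 3.0, 80: 8.0},
}


def penetrance(genotype: str, age: int) -> float:
    """Probability of disease by the given age for a genotype class."""
    return PENETRANCE_PERCENT[genotype][age] / 100.0


def relative_risk(age: int) -> float:
    """Ratio of carrier to noncarrier risk at the given age."""
    return penetrance("carrier", age) / penetrance("noncarrier", age)
```

With these figures the relative risk falls with age (roughly 92-fold by age 40 and about 11-fold by age 80), which shows why a single penetrance value per genotype can be a poor description of such a trait.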
  • genetic mapping is hampered by the fact that a predisposing allele may be present in some unaffected individuals or absent in some affected individuals.
  • a complex trait arises because any one of several genes may result in identical phenotypes (genetic heterogeneity). In cases where there is genetic heterogeneity, it may be difficult to determine whether two patients suffer from the same disease for different genetic reasons until the genes are mapped.
  • Examples of complex diseases that arise due to genetic heterogeneity in humans include polycystic kidney disease (Reeders et al, 1987, Human Genetics 76: 348), early-onset Alzheimer's disease (George-Hyslop et al, 1990, Nature 347: 194), maturity-onset diabetes of the young (Barbosa et al, 1976, Diabete Metab. 2: 160), hereditary nonpolyposis colon cancer (Fishel et al, 1993, Cell 75: 1027), ataxia telangiectasia (Jaspers and Bootsma, 1982, Proc. Natl. Acad. Sci. U.S.A.
  • An example of polygenic inheritance in humans is one form of retinitis pigmentosa, which requires the presence of heterozygous mutations at the peripherin / RDS and ROM1 genes (Kajiwara et al, 1994, Science 264: 1604). The proteins encoded by RDS and ROM1 are thought to interact in the photoreceptor outer segment disc membranes.
  • Polygenic inheritance complicates genetic mapping, because no single locus is strictly required to produce a discrete trait or a high value of a quantitative trait. In yet other embodiments, a complex trait arises due to a high frequency of disease-causing allele "D".
  • A disease-causing allele will cause difficulties in mapping even a simple trait if it occurs at high frequency in the population. That is because the expected Mendelian inheritance pattern of disease will be confounded by the problem that multiple independent copies of D may be segregating in the pedigree and that some individuals may be homozygous for D, in which case one will not observe linkage between D and a specific allele at a nearby genetic marker, because either of the two homologous chromosomes could be passed to an affected offspring. Late-onset Alzheimer's disease provides one example of the problems raised by high frequency disease-causing alleles.
  • the present invention provides an apparatus and method for associating a gene with a trait exhibited by one or more organisms in a plurality of organisms of a single species.
  • the gene is associated with the trait by identifying a biological pathway in which the gene product participates.
  • the trait of interest is a complex trait, such as a disease, e.g., a human disease.
  • exemplary diseases include asthma, ataxia telangiectasia (Jaspers and Bootsma, 1982, Proc. Natl. Acad. Sci. U.S.A.
  • bipolar disorder, common cancers, common late-onset Alzheimer's disease, diabetes, heart disease, hereditary early-onset Alzheimer's disease (George-Hyslop et al, 1990, Nature 347: 194), hereditary nonpolyposis colon cancer, hypertension, infection, maturity-onset diabetes of the young (Barbosa et al, 1976, Diabete Metab. 2: 160), mellitus, migraine, nonalcoholic fatty liver (NAFL) (Younossi, et al, 2002, Hepatology 35, 746-752), nonalcoholic steatohepatitis (NASH) (James & Day, 1998, J. Hepatol.
  LINKAGE ANALYSIS
  • This section describes a number of standard quantitative trait locus (QTL) linkage analysis algorithms that can be used in various embodiments of processing step 210 (Fig. 2) and/or processing step 1910 (Fig. 19). Such linkage analysis is also sometimes referred to as QTL analysis. See, for example, Lynch and Walsh, 1998, Genetics and Analysis of Quantitative Traits, Sinauer Associates, Sunderland, MA. The primary aim of linkage analysis is to determine whether there exist pieces of the genome that are passed down through each of several families with multiple afflicted organisms in a pattern that is consistent with a particular inheritance model and that is unlikely to occur by chance alone.
  • a QTL is a region of a genome of a species that is responsible for a percentage of variation in a phenotypic trait in the species under study.
  • the genotype of each organism at each marker in a plurality of markers in a genetic map produced by marker genotypic data is compared to a given phenotype of each organism.
  • the genetic map is created by placing genetic markers in genetic (linear) map order so that the positional relationships between markers are understood.
  • the information gained from knowing the relationships between markers that is provided by a marker map provides the setting for addressing the relationship between QTL effect and QTL location.
  • linkage analysis is based on any of the QTL detection methods disclosed or referenced in Lynch and Walsh, 1998, Genetics and Analysis of Quantitative Traits, Sinauer Associates, Inc., Sunderland, MA. 5.13.1.
  • the present invention provides no limitation on the type of phenotypic data that can be used.
  • the phenotypic data can, for example, represent a series of measurements for a quantifiable phenotypic trait in a collection of organisms.
  • quantifiable phenotypic traits can include, for example, tail length, life span, eye color, size and weight.
  • the phenotypic data can be in a binary form that tracks the absence or presence of some phenotypic trait.
  • a "1" can indicate that a particular species of the organism of interest possesses a given phenotypic trait and a "0" can indicate that a particular species of the organism of interest lacks the phenotypic trait.
  • the phenotypic trait can be any form of biological data that is representative of the phenotype of each organism in the population under study.
  • the phenotypic traits are quantified and are often referred to as quantitative phenotypes.
  • The genotype of each marker in the genetic marker map is determined for each organism in a population under study. Genotypic information is obtained from polymorphisms at each marker in the genetic map. Such polymorphisms include, but are not limited to, single nucleotide polymorphisms, microsatellite markers, restriction fragment length polymorphisms, short tandem repeats, sequence length polymorphisms, and DNA methylation patterns. Linkage analyses use the genetic map derived from marker genotypic data as the framework for location of QTL for any given quantitative trait.
  • the intervals that are defined by ordered pairs of markers are searched in increments (for example, 2 cM), and statistical methods are used to test whether a QTL is likely to be present at the location within the interval.
  • linkage analysis statistically tests for a single QTL at each increment across the ordered markers in a genetic map. The results of the tests are expressed as lod scores, which compare the value of the likelihood function under a null hypothesis (no QTL) with its value under the alternative hypothesis (QTL at the testing position) for the purpose of locating probable QTL. More details on lod scores are found in Section 5.4, as well as in Lander and Schork, 1994, Science 265, p. 2037-2048. Interval mapping searches through the ordered genetic markers in a systematic, linear (one-dimensional) fashion, testing the same null hypothesis and using the same form of likelihood at each increment.
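As a hedged illustration of how a lod score compares the two hypotheses, the sketch below computes a lod score at a single marker for a two-genotype population such as a backcross. It is a simplification rather than the interval-mapping procedure itself: it assumes normally distributed trait values with a common variance and tests only at a genotyped marker, in which case the base-10 log likelihood ratio reduces to (n/2)·log10(RSS0/RSS1). The function name `single_marker_lod` and the 0/1 genotype coding are assumptions made for the example.

```python
import math
from typing import Sequence


def single_marker_lod(genotypes: Sequence[int], phenotypes: Sequence[float]) -> float:
    """lod score for one marker with genotype classes coded 0 and 1.

    Null hypothesis (no QTL): one mean for all organisms.
    Alternative (QTL at the marker): a separate mean per genotype class.
    """
    n = len(phenotypes)
    if n == 0 or n != len(genotypes):
        raise ValueError("genotypes and phenotypes must be non-empty and equal in length")

    # Residual sum of squares under the null hypothesis.
    grand_mean = sum(phenotypes) / n
    rss0 = sum((y - grand_mean) ** 2 for y in phenotypes)

    # Residual sum of squares under the alternative hypothesis.
    rss1 = 0.0
    for g in (0, 1):
        ys = [y for gt, y in zip(genotypes, phenotypes) if gt == g]
        if ys:
            class_mean = sum(ys) / len(ys)
            rss1 += sum((y - class_mean) ** 2 for y in ys)

    if rss1 == 0.0:
        raise ValueError("perfect fit: lod score is unbounded")
    return (n / 2.0) * math.log10(rss0 / rss1)
```

A genome scan would evaluate such a statistic at each testing position and report positions whose lod score exceeds a chosen significance threshold.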
  • Exemplary crosses include backcrosses, F2 intercrosses, Ft populations (formed by randomly mating F1s for t-1 generations), F2:3 design (F2 individuals are genotyped and then selfed), and Design III (F2 from two inbred lines are backcrossed to both parental lines).
  • organisms represent a population, such as an F2 population, and pedigree data for the F2 population is known. This pedigree data is used to compute logarithm of the odds (lod) scores, as discussed in further detail below. For many organisms, including humans, manipulatable inbred lines are not available and outbred populations must be used to perform linkage analysis.
  • Linkage analysis using outbred populations detects QTLs responsible for within-population variation, whereas linkage analysis using inbred populations detects QTLs responsible for fixed differences between lines, or even different species.
  • Using an outbred population, as opposed to an inbred population, results in decreased power in QTL detection.
  • With inbred lines, all F1 parents have identical genotypes (including the same linkage phase), so all individuals are informative, and linkage disequilibrium is maximized.
  • In the absence of inbred lines, a variety of designs have been proposed for obtaining samples with the linkage disequilibrium required for linkage analysis. Typically, collections of relatives are relied upon.
  • the major difference between QTL analysis using inbred-line crosses versus outbred populations is that while the parents in the former are genetically uniform, parents in the latter are genetically variable. This distinction has several consequences. First, only a fraction of the parents from an outbred population are informative.
  • For a parent to provide linkage information, it must be heterozygous at both a marker and a linked QTL, as only in this situation can a marker-trait association be generated in the progeny. Only a fraction of random parents from an outbred population are such double heterozygotes. With inbred lines, F1s are heterozygous at all loci that differ between the crossed lines, so that all parents are fully informative. Second, there are only two alleles segregating at any locus in an inbred-line cross design, while outbred populations can be segregating any number of alleles.
  • Model-based linkage analysis assumes a model for the mode of inheritance whereas model-free linkage analysis does not assume a mode of inheritance.
  • Model-free linkage analyses are also known as allele-sharing methods and non-parametric linkage methods.
  • Model-based linkage analyses are also known as "maximum likelihood” and "lod score” methods. Either form of linkage analysis can be used in the present invention.
  • Model-based linkage analysis is most often used for dichotomous traits and requires assumptions for the trait model. These assumptions include the disease allele frequency and penetrance function.
  • model-free linkage analysis makes use of allele-sharing. Allele-sharing methods rely on the idea that relatives with similar phenotypes should have similar genotypes at a marker locus if and only if the marker is linked to the locus of interest. Linkage analyses are able to localize the locus of interest to a specific region of a chromosome, and the scope of resolution is typically limited to no less than 5 cM or roughly 5000 kb. For more information on model-based and model-free linkage analysis, see Olson et al, 1999, Statistics in Medicine 18, p. 2961-2981; Lander and Schork 1994, Science 265, p. 2037; and Elston, 1998, Genetic Epidemiology 15, p. 565, as well as the sections below.
  • MapMaker/QTL analyzes F2 or backcross data using standard interval mapping.
  • Another program is QTL Cartographer, which performs single-marker regression, interval mapping (Lander and Botstein, Id.), multiple interval mapping and composite interval mapping (Zeng, 1993, PNAS 90: 10972-10976; and Zeng, 1994, Genetics 136: 1457-1468).
  • QTL Cartographer permits analysis from F2 or backcross populations.
  • QTL Cartographer is available from http://statgen.ncsu.edu/qtlcart/cartographer.html (North Carolina State University).
  • Another program that can be used by processing step 114 is Qgene, which performs QTL mapping by either single-marker regression or interval regression (Martinez and Curnow, 1994, Heredity 73:198-206).
  • With Qgene, eleven different population types (all derived from inbreeding) can be analyzed.
  • Qgene is available from http://www.qgene.org/.
  • Yet another program is MapQTL, which conducts standard interval mapping (Lander and Botstein, Id.), multiple QTL mapping (MQM) (Jansen, 1993, Genetics 135: 205-211; Jansen, 1994, Genetics 138: 871-881), and nonparametric mapping (Kruskal-Wallis rank sum test).
  • Map Manager QT is a QTL mapping program (Manly and Olson, 1999, Mamm Genome 10: 327-334). Map Manager QT conducts single-marker regression analysis, regression-based simple interval mapping (Haley and Knott, 1992, Heredity 69, 315-324), composite interval mapping (Zeng 1993, PNAS 90: 10972-10976), and permutation tests.
  • a description of Map Manager QT is provided by the reference Manly and Olson, 1999, Overview of QTL mapping software and introduction to Map Manager QT, Mammalian Genome 10: 327-334.
  • MultiCross QTL maps QTL from crosses originating from inbred lines.
  • MultiCross QTL uses a linear regression-model approach and handles different methods such as interval mapping, all-marker mapping, and multiple QTL mapping with cofactors.
  • the program can handle a wide variety of simple mapping populations for inbred and outbred species.
  • MultiCross QTL is available from Unité de Biométrie et Intelligence Artificielle, INRA, 31326 Castanet Tolosan, France.
  • Still another program that can be used to perform linkage analysis is QTL Cafe.
  • the program can analyze most populations derived from pure line crosses such as F2 crosses, backcrosses, recombinant inbred lines, and doubled haploid lines.
  • QTL Cafe incorporates a Java implementation of Haley and Knott's flanking marker regression as well as marker regression, and can handle multiple QTLs.
  • the program allows three types of QTL analysis: single marker ANOVA, marker regression (Kearsey and Hyne, 1994, Theor. Appl. Genet., 89: 698-702), and interval mapping by regression (Haley and Knott, 1992, Heredity 69: 315-324).
  • QTL Cafe is available from http://web.bham.ac.Uk/g.g.seaton/.
  • MAPL performs QTL analysis by either interval mapping (Hayashi and Ukai, 1994, Theor. Appl. Genet. 87:1021-1027) or analysis of variance.
  • Different population types, including F2, backcross, and recombinant inbreds derived from F2 or backcross after a given number of generations of selfing, can be analyzed. Automatic grouping and ordering of numerous markers by metric multidimensional scaling is possible.
  • MAPL is available from the Institute of Statistical Genetics on Internet (ISGI), Yasuo, UKAI, http://web.bham.ac.Uk/g.g.seaton/.
  • Another program that can be used for linkage analysis is R/qtl.
  • R/qtl makes use of hidden Markov model (HMM) technology for dealing with missing genotype data.
  • R/qtl has implemented many HMM algorithms, with allowance for the presence of genotyping errors, for backcrosses, intercrosses, and phase-known four-way crosses.
  • R/qtl includes facilities for estimating genetic maps, identifying genotyping errors, and performing single-QTL genome scans and two-QTL, two-dimensional genome scans, by interval mapping with Haley-Knott regression, and multiple imputation.
  • R/qtl is available from Karl W. Broman, Johns Hopkins University, http://biosun01.biostat.jhsph.edu/~kbroman/qtl/.
  • linkage analysis comprises QTL interval mapping in accordance with algorithms derived from those first proposed by Lander and Botstein, 1989, "Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps," Genetics 121: 185-199.
  • the principle behind interval mapping is to test a model for the presence of a QTL at many positions between two mapped marker loci. The model is fit, and its goodness is tested using a technique such as the maximum likelihood method. Maximum likelihood theory assumes that when a QTL is located between two biallelic markers, the genotypes (i.e., AABB, AAbb, aaBB, aabb for doubled haploid progeny) each contain mixtures of quantitative trait locus (QTL) genotypes.
  • Maximum likelihood involves searching for QTL parameters that give the best approximation for quantitative trait distributions that are observed for each marker class. Models are evaluated by computing the likelihood of the observed distributions with and without fitting a QTL effect.
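The mixture idea can be illustrated by computing, for each flanking-marker class of a doubled haploid progeny, the probability that the organism carries a given QTL genotype at a testing position. The sketch below is a simplified illustration under stated assumptions: no crossover interference, the Haldane map function for converting map distance to recombination fraction, and a labeling convention in which the parent contributing marker alleles A and B also contributes QTL allele Q. The function names `haldane` and `prob_qq` are hypothetical.

```python
import math


def haldane(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def prob_qq(marker_class: str, left_cm: float, right_cm: float) -> float:
    """P(QTL genotype QQ | flanking marker class) for doubled haploid progeny.

    left_cm:  distance from the left marker (A/a) to the testing position.
    right_cm: distance from the testing position to the right marker (B/b).
    Both distances must be positive.
    """
    if left_cm <= 0 or right_cm <= 0:
        raise ValueError("distances must be positive")
    r1 = haldane(left_cm)
    r2 = haldane(right_cm)
    r = r1 + r2 - 2.0 * r1 * r2  # recombination fraction between the two markers
    table = {
        "AABB": (1.0 - r1) * (1.0 - r2) / (1.0 - r),
        "AAbb": (1.0 - r1) * r2 / r,
        "aaBB": r1 * (1.0 - r2) / r,
        "aabb": r1 * r2 / (1.0 - r),
    }
    return table[marker_class]
```

Each marker class is thus a mixture of QQ and qq organisms in known proportions, and the likelihood of the observed trait distribution in that class is a weighted sum of the two QTL genotype distributions.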
  • linkage analysis is performed using the algorithm of Lander, as implemented in programs such as GeneHunter.
  • interval mapping is based on regression methodology and gives estimates of QTL position and effect that are similar to those given by the maximum likelihood method. Since the QTL genotypes are unknown in mapping based on regression methodology, genotypes are replaced by probabilities estimated using genotypes at the nearest flanking markers or for all linked markers. See, e.g., Haley and Knott, 1992, Heredity 69, 315-324; and Jiang and Zeng, 1997, Genetica
  MODEL-FREE NONPARAMETRIC LINKAGE ANALYSIS
  • Model-based linkage analysis calculates a lod score that represents the chance that a given locus in the genome is genetically linked to a trait, assuming a specific mode of inheritance for the trait. Namely, the allele frequencies and penetrance values are included as parameters and are subsequently estimated. In the case of complex diseases, it is often difficult to model with any certainty all the causes of familial aggregation. In other words, when the trait exhibits non-Mendelian segregation it can be difficult to obtain reliable estimates of penetrance values, including phenocopy risks, and the allele frequency of the disease mutation.
  • Model-free linkage analyses are not based on constructing a model, but rather on rejecting a model. Specifically, one tries to prove that the inheritance pattern of a chromosomal region is not consistent with random Mendelian segregation by showing that affected relatives inherit identical copies of the region more often than expected by chance. Affected relatives should show excess allele sharing in regions linked to the QTL even in the presence of incomplete penetrance, phenocopy, genetic heterogeneity, and high-frequency disease alleles.
  • nonparametric linkage analysis involves studying affected relatives 246 (Fig. 1) in a pedigree 310 to see how often a particular copy of a chromosomal region is shared identical-by descent (IBD), that is, is inherited from a common ancestor within the pedigree. The frequency of IBD sharing at a locus can then be compared with random expectation.
  • $T(s) = \sum_{(i,j)} x_{ij}(s)$, where $x_{ij}(s)$ is the number of copies shared IBD at position s along a chromosome, and where the sum is taken over all distinct pairs (i,j) of affected relatives 246 in a pedigree 310.
  • the results from multiple families can be combined in a weighted sum T(s). Assuming random segregation, T(s) tends to a normal distribution with a mean μ and a variance σ² that can be calculated on the basis of the kinship coefficients of the relatives compared. See, for example, Blackwelder and Elston, 1985, Genet. Epidemiol.
  • Deviation from random segregation is detected when the statistic (T − μ)/σ exceeds a critical threshold.
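  • The IBD sharing statistic and its standardized form can be illustrated with a minimal sketch. It assumes, purely for illustration, that every affected pair is an independent full-sib pair, so that each pair contributes a mean of 1 and a variance of 1/2 under random segregation; general pedigrees would instead require the kinship coefficients mentioned above. The function names and the example counts are hypothetical.

```python
from math import sqrt


def ibd_sharing_statistic(ibd_counts):
    """T(s): total copies shared IBD at one position, summed over affected pairs."""
    return sum(ibd_counts)


def sib_pair_z_score(ibd_counts):
    """Standardize T(s) assuming independent full-sib pairs (an assumption).

    A sib pair shares 0, 1, or 2 copies IBD with probabilities 1/4, 1/2, 1/4
    under random segregation, giving a per-pair mean of 1 and variance of 1/2.
    """
    n_pairs = len(ibd_counts)
    t = ibd_sharing_statistic(ibd_counts)
    mu = 1.0 * n_pairs
    sigma = sqrt(0.5 * n_pairs)
    return (t - mu) / sigma


# Hypothetical IBD counts for eight affected sib pairs at one position.
counts = [2, 2, 1, 2, 1, 2, 2, 1]
z = sib_pair_z_score(counts)
print(ibd_sharing_statistic(counts), z)
```

  • A large positive value indicates more sharing than random segregation predicts; the critical threshold to apply is left to the analyst.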
  • the techniques in this section typically use an outbred population.
  • Affected sib pair analysis is one form of IBD-APM analysis (Section 5.13.7.1). For example, two sibs can show IBD sharing for zero, one, or two copies of any locus (with a 25%-50%-25% distribution expected under random segregation). If both parents are available, the data can be partitioned into separate IBD sharing for the maternal and paternal chromosome (zero or one copy, with a 50%-50% distribution expected under random segregation). In either case, excess allele sharing can be measured with a χ² test. In the ASP approach, a large number of small pedigrees (affected siblings and their parents) are used.
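  • As a rough sketch of the chi-square comparison described above, the following computes a Pearson goodness-of-fit statistic for observed sib-pair IBD counts against the 25%-50%-25% expectation. The function name and the example counts are assumptions made for illustration, and the 5% critical value quoted in the comment is the standard tabulated value for two degrees of freedom.

```python
def asp_chi_square(n0, n1, n2):
    """Pearson goodness-of-fit statistic (2 df) against 25%-50%-25% IBD sharing.

    n0, n1, n2 are the numbers of affected sib pairs sharing 0, 1, or 2
    copies identical-by-descent at the locus.
    """
    n = n0 + n1 + n2
    observed = (n0, n1, n2)
    expected = (0.25 * n, 0.50 * n, 0.25 * n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical sample of 100 affected sib pairs with excess sharing.
stat = asp_chi_square(10, 50, 40)
# The tabulated 5% critical value for a chi-square with 2 df is 5.991.
print(stat, stat > 5.991)
```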
  • DNA samples are collected from each organism and genotyped using a large collection of markers (e.g., microsatellites, SNPs). Then a check for functional polymorphism is performed. See, for example, Suarez et al, 1978, Ann. Hum. Genet. 42, p.87; Weitkamp, 1981, N. Engl. J. Med. 305, p.1301; Knapp et al, 1994, Hum. Hered. 44, p. 37; Holmans, 1993, Am. J. Hum. Genet. 52, p. 362; Rich et al, 1991, Diabetologia 34, p. 350; Owerbach and Gabbay, 1994, Am. J. Hum. Genet. 54, p.
  • the number of degrees of freedom of the t test is set at the number of independent affected pairs (defined per sibship as the number of affected individuals minus 1) in the sample instead of the number of all possible pairs. See, for example, Suarez and Eerdewegh, 1984, Am. J. Med. Genet. 18, p. 135. The techniques in this section typically use an outbred population.
  • IDENTICAL BY STATE - AFFECTED PEDIGREE MEMBER (IBS-APM) ANALYSIS / OUTBRED POPULATION
  • Another method uses a statistic that is based explicitly on IBS sharing (an IBS-APM method). See, for example, Weeks and Lange, 1988, Am J. Hum. Genet. 42, p. 315; Lange, 1986, Am. J. Hum. Genet. 39, p. 148; Jeunemaitre et al, 1992, Cell 71, p.
  • the IBS-APM techniques of Weeks and Lange, 1988, Am J. Hum. Genet. 42, p. 315; and Weeks and Lange, 1992, Am. J. Hum. Genet. 50, p. 859 are used. Such techniques use marker information of affected individuals to test whether the affected persons within a pedigree are more similar to each other at the marker locus than would be expected by chance. In some embodiments, the marker similarity is measured in terms of identity by state. In some embodiments, the APM method uses a marker allele frequency weighting function, f(p), where p is the allele frequency.
  • the first weighting function uses the allele frequencies only in calculation of the expected degree of marker allele sharing.
  • the second function is a reasonable compromise for generating a normal distribution of the test statistic while incorporating an allele frequency function.
  • the APM test statistics are sensitive to marker locus and allele frequency misspecification.
  • allele frequencies are estimated from the pedigree data using the method of Boehnke, 1991, Am J. Hum. Genet. 48, p. 22, or by studying alleles. See, also, for example, Berrettini et al, 1994, Proc. Natl. Acad. Sci. USA 91, p. 5918.
  • the significance of the APM test statistics is calculated from the theoretical (normal) distribution of the statistic.
  • replicates (e.g., 10,000) are simulated to assess the probability of observing the actual results (or a more extreme statistic) by chance. This probability is the empirical P value.
  • Each replicate is generated by simulating an unlinked marker segregating through the actual pedigrees.
  • An APM statistic is generated by analyzing the simulated data set exactly as the actual data set is analyzed. The rank of the observed statistic in the distribution of the simulated statistics determines the empirical P value.
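  • The rank-based empirical P value can be sketched as follows. The simulated statistics here are stand-in draws from a standard normal distribution rather than statistics from real simulated pedigrees, and the "+1" correction in the estimator is a common convention adopted as an assumption; it is not taken from the text. All names are hypothetical.

```python
import random


def empirical_p_value(observed, simulated):
    """Share of simulated statistics at least as extreme as the observed one.

    The +1 in numerator and denominator (an assumption) counts the observed
    data set as one replicate, so the estimate is never exactly zero.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


rng = random.Random(0)
# Stand-in for APM statistics from 10,000 replicates of an unlinked marker
# segregating through the actual pedigrees.
simulated_stats = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
p = empirical_p_value(3.0, simulated_stats)
print(p)
```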
  • the techniques in this section typically use an outbred population.
  • association tests can be done with samples of pedigrees or samples of unrelated individuals. Further, association studies can be done for a dichotomous trait (e.g., disease) or a quantitative trait. See, for example, Nepom and Ehrlich, 1991, Annu. Rev. Immunol. 9, p. 493; Strittmatter and Roses, 1996, Annu. Rev. Neurosci. 19, p. 53; Vooberg et al, 1994, Lancet 343, p. 1535; Zoller et al, Lancet 343, p. 1536; Bennet et al, 1995, Nature Genet. 9, p.
  • association studies test whether a disease and an allele show correlated occurrence across the population, whereas linkage studies determine whether there is correlated transmission within pedigrees. Whereas linkage analysis involves the pattern of transmission of gametes from one generation to the next, association is a property of the population of gametes. Association exists between alleles at two loci if the frequency, with which they occur within the same gamete, is different from the product of the allele frequencies.
  • association arises when a mutation, which causes disease, occurs at a locus at some time, tj. At that time, the disease mutation occurs on a specific genetic background composed of the alleles at all other loci; thus, the disease mutation is completely associated with the alleles of this background. As time progresses, recombination occurs between the disease locus and all other loci, causing the association to diminish. Loci that are closer to the disease locus will generally have higher levels of association, with association rapidly dropping off for markers further away.
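  • The definition of association and its decay over time can be made concrete with a small sketch. The disequilibrium coefficient is the gamete frequency minus the product of the allele frequencies, and under the textbook assumption of random mating it shrinks by a factor of (1 − r) per generation, where r is the recombination fraction between the two loci. The function names and numerical values are hypothetical.

```python
def disequilibrium(p_ab, p_a, p_b):
    """D: frequency of gametes carrying both alleles minus the product of allele frequencies."""
    return p_ab - p_a * p_b


def decayed_disequilibrium(d0, r, generations):
    """D after the given number of generations, assuming random mating."""
    return d0 * (1.0 - r) ** generations


d0 = disequilibrium(p_ab=0.30, p_a=0.40, p_b=0.50)
# A marker close to the disease locus (small r) keeps most of its association,
# while a distant marker (large r) loses it quickly.
near = decayed_disequilibrium(d0, r=0.001, generations=100)
far = decayed_disequilibrium(d0, r=0.10, generations=100)
print(d0, near, far)
```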
  • Two forms of association analysis are discussed in the sections below, population based association analysis and family based association analysis. More generally, those of skill in the art will appreciate that there are several different forms of association analysis, and all such forms of association analysis can be used in steps of the present invention that require the use of quantitative genetic analysis.
  • whole genome association studies are performed in accordance with the present invention. Two methods can be used to perform whole-genome association studies, the "direct-study" approach and the "indirect-study" approach.
  • association test takes place between a single marker (or a number of markers that are physically very close to one another, e.g., a haplotype) and the trait of interest.
  • each affected organism is matched with one or more unaffected siblings (see, for example, Curtis, 1997, Ann. Hum. Genet. 61, p. 319) or cousins (see, for example, Witte, et al, 1999, Am J. Epidemiol. 149, p. 693) and analytical techniques for matched case-control studies are used to estimate effects and to test hypotheses. See, for example, Breslow and Day, 1989, Statistical methods in cancer research I, The analysis of case-control studies 32, Lyon: IARC Scientific Publications. The following subsections describe some forms of family-based association studies. Those of skill in the art will recognize that there are numerous forms of family-based association studies and all such methodologies can be used in the present invention.
  • the haplotype relative risk test is used.
  • all marker alleles compared arise from the same person.
  • the marker alleles that parents transmit to an affected offspring (case alleles) are compared with those that they do not transmit to such an offspring (control alleles).
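  • This population can be classified into a fourfold table according to whether the transmitted allele is a marker allele (M) or some other allele ($\bar{M}$) and according to whether the nontransmitted allele is similarly M or $\bar{M}$. The four cell counts are a (transmitted M, nontransmitted M), b (transmitted M, nontransmitted $\bar{M}$), c (transmitted $\bar{M}$, nontransmitted M), and d (transmitted $\bar{M}$, nontransmitted $\bar{M}$).
  • the haplotype relative risk ratio is defined as (a+b)(b+d)/[(a+c)(c+d)].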
  • a chi-square distribution using one degree of freedom can be used to determine whether the haplotype relative risk ratio differs significantly from one. See, for example, Rudorfer, et al, 1984, Br. J. Clin. Pharmacol. 17, 433; Mueller and Young, 1997, Emery's Elements of Medical Genetics, Kalow ed., p. 169-175, Churchill Livingstone, Edinburgh; Roses, 2000, Nature 405, p. 857; and Elson, 1998, Genetic Epidemiology, 15, p. 565.
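  • The haplotype relative risk calculation can be sketched as below. The sketch assumes a fourfold table whose cells are a (transmitted M, nontransmitted M), b (transmitted M, nontransmitted other), c (transmitted other, nontransmitted M) and d (transmitted other, nontransmitted other); it forms the ratio of the odds of M among transmitted alleles to the odds of M among nontransmitted alleles, and uses an ordinary Pearson chi-square on the 2x2 table of allele counts as one possible one-degree-of-freedom test. The function names and counts are hypothetical.

```python
def haplotype_relative_risk(a, b, c, d):
    """Odds of M among transmitted alleles over odds of M among nontransmitted alleles."""
    transmitted_m, transmitted_other = a + b, c + d
    nontransmitted_m, nontransmitted_other = a + c, b + d
    return (transmitted_m * nontransmitted_other) / (nontransmitted_m * transmitted_other)


def allele_count_chi_square(a, b, c, d):
    """Pearson chi-square (1 df) comparing transmitted and nontransmitted allele counts."""
    t_m, t_o = a + b, c + d        # transmitted alleles: M, other
    n_m, n_o = a + c, b + d        # nontransmitted alleles: M, other
    total = t_m + t_o + n_m + n_o
    numerator = total * (t_m * n_o - t_o * n_m) ** 2
    denominator = (t_m + t_o) * (n_m + n_o) * (t_m + n_m) * (t_o + n_o)
    return numerator / denominator


# Hypothetical counts from 100 parents of affected offspring.
hrr = haplotype_relative_risk(10, 40, 20, 30)
chi2 = allele_count_chi_square(10, 40, 20, 30)
# The tabulated 5% critical value for a chi-square with 1 df is 3.841.
print(hrr, chi2, chi2 > 3.841)
```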
  • transmission disequilibrium test (TDT)
  • TDT considers parents who are heterozygous for an allele and evaluates the frequency with which that allele is transmitted to affected offspring.
  • the TDT differs from other model-free tests for association between specific alleles of a polymorphic marker and a disease locus. The parameters of that locus, genotypes of sampled individuals, linkage phase, and recombination frequency are not specified. Nevertheless, by considering only heterozygous parents, the TDT is specific for association between linked loci. TDT is a test of linkage and association that is valid in heterogeneous populations.
  • the genetic data consists of the marker genotypes for the parents and child.
  • the TDT is based on transmissions, to the diseased child, from heterozygous parents, or parents whose genotypes consist of different alleles.
  • the TDT counts the number of times, $n_{12}$, that $M_1M_2$ parents transmit marker allele $M_1$ to the diseased child and the number of times, $n_{21}$, that $M_2$ is transmitted. If the marker is not linked to (correlated with) the disease locus, i.e.
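  • Because the description of the test statistic is cut off in the text above, the following sketch uses the commonly cited McNemar-type form of the TDT as an assumption: the squared difference between the two transmission counts from heterozygous parents, divided by their sum, referred to a chi-square distribution with one degree of freedom. The function name and counts are hypothetical.

```python
def tdt_statistic(n12, n21):
    """McNemar-type TDT statistic from transmissions by heterozygous parents.

    n12: times allele M1 is transmitted to the affected child.
    n21: times allele M2 is transmitted to the affected child.
    """
    total = n12 + n21
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (n12 - n21) ** 2 / total


# Hypothetical counts: 60 transmissions of M1 versus 40 of M2.
stat = tdt_statistic(60, 40)
# The tabulated 5% critical value for a chi-square with 1 df is 3.841.
print(stat, stat > 3.841)
```

  • With equal transmission of the two alleles the statistic is zero, which matches the expectation for an unlinked marker.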
  • the sibship-based test is used. See, for example, Wiley, 1998, Cur. Pharmaceut. Des. 4, p. 417; Blackstock and Weir, 1999, Trends Biotechnol. 17, p. 121; Kozian and Kirschbaum, 1999, Trends Biotechnol. 17, p. 73; Rockett et al, Xenobiotica 29, p. 655; Roses, 1994, J. Neuropathol. Exp. Neurol 53, p. 429; and Roses, 2000, Nature 405, p. 857.
  • each of the genes identified in Table 6 is causal for omental fat pad mass in mice. As such, each of the genes in Table 6 (and their homologs) is a potential therapeutic target for obesity and related diseases. Section 5.15.1 provides additional evidence that inhibition of the malic enzyme Mod1 (ranked eighth in Table 6) could be an effective treatment for obesity.
  • As discussed in Section 6 below, Mod1 (SEQ ID NO: 2) has been identified as one of a number of genes that test as causative for omental fat pad mass (OFPM) in a mouse cross. Mod1 (SEQ ID NO: 2) was ranked eighth in Table 6 of Section 6 and accounts for approximately 52 percent of the genetic variation in OFPM as judged by the causality test of the present invention. Three of the six Mod1 (SEQ ID NO: 2) eQTLs overlap with three of the five cQTLs for omental fat pad mass (log of omental FPM).
  • Mod1 sits at the center of key pathways in intermediate metabolism and is regulated in liver by thyroid hormone, insulin, glucagon, androgens, fasting, high carbohydrates, low fatty acids and thiazolidinediones.
  • Mod1 (SEQ ID NO: 2) activity closely follows lipogenesis and mRNA levels are positively correlated with OFPM and a number of other measures of adiposity. Mod1 is reported to be non-essential in mouse. See for example Johnson et al, 1981, J. Hered. 72, 134-136; and Lee et al. 1980, Mol. Cell. Biochem. 30, 143-149.
  • the present invention provides methods to identify genes in the genetic network that are causative for individual traits. Briefly, this is done by selecting genes whose expression is correlated with the trait of interest and identifying amongst those that have overlapping genetics (Quantitative Trait Loci, or QTL). These are then further assessed using a causality test as described to distinguish between reactive and causative changes with respect to the clinical trait.
  • the cytosolic malic enzyme may be an excellent target for the treatment of obesity and its co-morbidities, such as diabetes, coronary artery disease, dyslipidemias (e.g., hyperlipidemia), stroke, chronic venous abnormalities, orthopedic problems, sleep apnea disorders, esophageal reflux disease, hypertension, arthritis and some forms of cancer (e.g., colorectal cancer, breast cancer, diabetes, heart disease).
  • Fig. 19A shows the quantitative trait loci (QTLs) that control genetic variation in OFPM (log of OFPM or logomen, left panel) and Mod1 (SEQ ID NO: 2) (right panel).
  • the column legends for the left panel are (Chr - chromosome, Pos(M) - position on the chromosome in Morgans from the left end, LOD - calculated Lod score). Also shown are three overlapping QTLs indicated by arrows.
  • Fig. 19B lists various traits and the number of overlapping QTLs they have with Mod1 (SEQ ID NO: 2).
  • the traits are omen - omental fat pad mass; epipa - epididymal fat pad mass; retrog - retroperitoneal fat pad mass; subc - subcutaneous fat pad mass; lep - leptin protein levels; ins - insulin protein levels; livebwt - total body weight at sacrifice; ftpsum - sum of all fat pad masses; fatbw - adiposity (ftpsum as a percentage of livebwt).
  • some of the traits are converted to the log of the values (prefix “log”) or the square root of the values (prefix “sqrt”).
  • the values are sorted by the number of overlaps with Mod1 (SEQ ID NO: 2) QTLs.
  • the livers from the mice in the BxD cross were profiled and 444 genes were found to be correlated with the OFPM trait (Pearson correlation coefficient p-values less than 0.0001), as discussed in Section 6 below.
  • QTLs for these genes were derived, followed by a test of causality as described in Section 6 below. This resulted in a list of 40 genes with two or more QTLs that are coincident with OFPM QTLs, and two or more of which tested as causal for that trait.
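  • Counting coincident QTLs between a gene expression trait and a clinical trait can be sketched as follows. The text does not state the criterion used to call two QTLs coincident, so the sketch assumes, hypothetically, that two QTLs overlap when they lie on the same chromosome within a fixed window of 0.15 Morgans. The class, function, window value and example positions are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QTL:
    chromosome: str
    position: float  # Morgans from the left end of the chromosome
    lod: float


def count_overlaps(gene_qtls, trait_qtls, window=0.15):
    """Count gene QTLs within `window` Morgans of a trait QTL on the same chromosome."""
    return sum(
        1
        for g in gene_qtls
        if any(
            g.chromosome == t.chromosome and abs(g.position - t.position) <= window
            for t in trait_qtls
        )
    )


# Hypothetical expression QTLs for one gene and clinical QTLs for one trait.
gene_qtls = [QTL("1", 0.42, 4.1), QTL("6", 0.10, 3.2), QTL("9", 0.55, 2.9)]
trait_qtls = [QTL("1", 0.40, 5.0), QTL("6", 0.60, 3.8), QTL("19", 0.30, 4.4)]
print(count_overlaps(gene_qtls, trait_qtls))
```

  • Genes could then be retained when the count is two or more, in line with the selection described above.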
  • the top panel of Fig. 20 shows a scattergram of the OFPM values in grams (X axis) versus Mod1 (SEQ ID NO: 2) mRNA levels as mlratios (Y axis).
  • the lower panel shows a comparison of Mod1 to the log of the OFPM values (LogOmen).
  • the positive correlation and causality imply that increasing Mod1 levels result in increased OFPM and therefore an inhibitor of Mod1 (SEQ ID NO: 2) activity may decrease OFPM.
  • Fig. 21 illustrates scattergrams comparing Mod1 (SEQ ID NO: 2) mlratios (Y axes) to OFPM (top left), subcutaneous fat pad mass (top right), leptin protein levels (bottom left) and insulin protein levels (bottom right), all on the X axes.
  • Fig. 22 illustrates the correlation coefficients of various measures of fat pad masses and adiposity and Mod1 (SEQ ID NO: 2) mRNA levels.
  • Figure legends for Fig. 22 are the same as for Fig. 19.
  • There are two types of NADP(+)-dependent malic enzymes, a cytosolic form (ME1) (SEQ ID NO: 1) and a mitochondrial form (ME3) (Swiss Prot accession number Q16798; Loeber et al, 1994, Biochem. J. 304: pp. 687-692; SEQ ID NO: 3; Fig. 23). These enzymes are also called NADP(+)-dependent malate dehydrogenases.
  • ME2 (EC 1.1.1.39) (SEQ ID NO: 4; Fig. 24; Swiss Prot accession number P23368; Loeber et al, 1991, Biol. Chem.
  • domain A tetramer of 60kD monomers
  • domain B (residues 131-277 and 467-538), domain C (residues 278-466), and domain D (residues 539-573).
  • MOD1 EXPRESSION AND REGULATION
  • Mod1 (SEQ ID NO: 2) is broadly expressed in monkeys with highest expression in the adrenals.
  • Fig. 25 illustrates the relative levels of expression of the cytosolic malic enzyme Mod1 (SEQ ID NO: 2) in various tissues of monkeys. Highest expression of Mod1 is in the adrenal gland, and expression in liver is somewhat lower. Most studies have concentrated on Mod1 (SEQ ID NO: 2) expression in the liver and its key role in intermediate metabolism. Mod1 (SEQ ID NO: 2) protein levels are primarily controlled by the rate of its synthesis, and this is up-regulated by high carbohydrates, low fats, insulin, thyroid hormone and androgens in vivo.
  • Mod1 (SEQ ID NO: 2) expression is repressed by the absence of thyroid hormone, starvation and glucagon via increased cAMP levels. Mod1 (SEQ ID NO: 2) is also induced by thiazolidinediones. See Hauner 2002, Diabetes Metab Res Rev 18 Suppl 2, S10-15.
  • the mitochondrion in high energy state has high levels of ATP and NADH, H+. This reduces the flow of metabolites through the TCA cycle by inhibiting isocitrate dehydrogenase.
  • Citrate diffuses into the cytosol via the tricarboxylate carrier, leading to 3 effects:
    o Citrate and ATP inhibit phosphofructokinase (PFK), thereby reducing the flux through glycolysis and redirecting flow into the pentose phosphate pathway.
    o Citrate is processed to form the precursor (acetyl CoA) of fatty acid synthesis, and to oxaloacetate, which is processed to malate and then to pyruvate.
  • Fig. 26 provides the position of Mod1 (SEQ ID NO: 2) in a schematic representation of intermediate metabolism. Above line 2602 is cytosol, below line 2602 is mitochondria.
  • Boxes 2604 show various metabolites
  • boxes 2606 show selected enzymes (PFK - phosphofructokinase, FAS - fatty acid synthase, ACC1 - acetyl coenzyme A carboxylase, Mod1 - cytosolic malic enzyme).
  • Lines show the various pathways and connections and the thickness represents the relative flux through those pathways.
  • the usage and production of NAD+/NADH, H+ and NADP+/NADPH, H+ are shown in the boxes 2608.
  • Under "high energy" conditions citrate accumulates, is transported to the cytosol, and represses PFK and activates ACC1 (indicated by red lines). This results in increased fatty acid synthesis and decreased β-oxidation of fatty acids.
  • the present invention proposes the following model whereby reduced malic enzyme activity results in decreased lipogenesis and potentially increased β-oxidation of fatty acids:
  • Decreasing levels of Mod1 (SEQ ID NO: 2) reduces the recycling of oxaloacetate in the cytosol to citrate in the mitochondrion (see Figure 26).
  • Reduced malic enzyme also reduces the upstream reaction, oxaloacetate to malate, which produces NAD+. NAD+ is required for a step in glycolysis (glyceraldehyde-3-phosphate to 1,3-bisphosphoglycerate). This further reduces the production of citrate (and ATP).
  • Reduced citrate has three effects:
    o decreased inhibition of PFK resulting in down-regulation of the pentose phosphate pathway (reducing production of NADPH, H+ which is required for FAS);
    o reduced precursor for fatty acid synthesis (acetyl CoA) and NADPH, H+. (It is estimated that 40 percent of the NADPH, H+ required by fatty acid synthesis is supplied by the malic enzyme reaction under glucose supported lipogenesis, the remaining 60 percent comes from the pentose phosphate pathway);
    o reduced activation of acetyl CoA carboxylase, the highly regulated step in fatty acid synthesis. (Reduced acetyl CoA carboxylase activity should also lower malonyl CoA. This could increase fatty acid oxidation since malonyl CoA inhibits fatty acid transport into the mitochondrion for β-oxidation.)
  • All of these effects will result in decreased fatty acid synthesis and a switch to a "low energy like state."
  • Mod1 is the eighth gene on this list and could account for 52 percent of the genetically determined variation in OFPM.
  • Mod1, or the cytosolic malic enzyme, connects the citric acid cycle and fatty acid synthesis to glycolysis, and is regulated by high and low energy states (including insulin, glucagon, thyroid hormone, low fatty acids, high carbohydrates and fasting). Despite this high degree of regulation, Mod1 has not, until now, been implicated as a key regulatory step in intermediate metabolism. Mice lacking cytosolic malic enzyme activity have been reported, suggesting that it is not essential. See, for example, Johnson et al, 1981, J.
  • Mod1 mRNA levels are positively correlated with OFPM and other measures of obesity, and the Mod1 enzyme catalyses a well characterized reaction for which many inhibitors have been identified. All of this is consistent with Mod1 being a druggable and safe target whose inhibitors may be effective anti-obesity agents.
  • orthologs can be used in secondary screens that are designed to test the selectivity of potential malic enzyme inhibitors.
  • Such orthologs include, but are not limited to, Rattus norvegicus (rat) Mod1 (Swiss Prot accession number P13697; Nikodem et al, 1989, Endocr. Res. 15:547-564), Mesembryanthemum crystallinum (common ice plant) Mod1 (Swiss Prot accession number P37223; Cushman, 1992, Eur. J. Biochem. 208:259-266), Zea mays (maize) Mod1 (Swiss Prot accession number P16243; Rothermel and Nelson, 1989, J. Biol. Chem.
  • Flaveria trinervia (clustered yellowtops) Mod1 (Swiss Prot accession number P22178; Boersch and Westhoff, 1990, FEBS Lett. 273:111-115), Escherichia coli Mod1 (Swiss Prot accession number P76558; Blattner et al, 1997, Science 277: 1453-1474), Haemophilus influenzae Mod1 (Swiss Prot accession number P43837; Fleischmann et al, 1995, Science 269, pp. 496-512), Rhizobium meliloti Mod1 (Swiss Prot accession number O30808; Mitsch, 1998, J. Biol. Chem.
  • Anas platyrhynchos Mod1 (Swiss Prot accession number P28227; Hsu et al, 1992, Biochem. J. 284:869-876), Gallus gallus (chicken) Mod1 (Swiss Prot accession number Q92060; Hodnett et al, 1996, Arch. Biochem. Biophys. 334:309-324), Columba livia (domestic pigeon) Mod1 (Swiss Prot accession number P40927; Chou et al, 1994, Arch. Biochem. Biophys.
  • Mus musculus (mouse) Mod1 (Swiss Prot accession number P06801; Bagchi, 1986, Ann. N.Y. Acad. Sci. 478:77-92), Phaseolus vulgaris (kidney bean) Mod1 (Swiss Prot accession number P12628; Walter et al, Proc. Natl. Acad. Sci. U.S.A., 1988, 85:5546-5550), Populus trichocarpa (Western balsam poplar) Mod1 (Swiss Prot accession number P34105; van Doorsselaere et al, 1991, Plant Physiol.
  • Malic enzymes include cDNAs or other nucleic acids that encode a malic enzyme.
  • Such cDNAs can include, but are not limited to, all or a portion of Homo sapiens mitochondrial NADP(+)-dependent malic enzyme 3 (NCBI accession number AY424278; SEQ ID NO: 5; Fig. 27), all or a portion of Homo sapiens mitochondrial NAD-dependent malic enzyme 2 (NCBI accession number XM_209967; SEQ ID NO: 6; Fig.
  • malic enzyme includes amino acid macromolecules that include a sequence as substantially set forth in any one of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, and SEQ ID NO: 4.
  • the invention further relates to fragments and derivatives thereof.
  • Antibodies to malic enzymes and derivatives of such antibodies are further provided by the present invention.
  • Section 5.15.2 describes malic enzymes and Table 6 of Section 6 describes a number of genes and proteins (SEQ ID NO: 8 through SEQ ID NO: 24) that are causal for an obesity-related trait in mice.
  • This invention further relates to modulation of these genes and proteins, their orthologs, their paralogs, and fragments and derivatives thereof.
  • the present invention further relates to therapeutic and diagnostic methods and compositions based on such nucleic acid sequences and/or gene products as well as antibodies that bind to such gene products. Animal models, diagnostic methods and screening methods for predisposition to obesity are also provided by the invention.
  • the invention further provides methods of treatment of obesity and obesity related diseases such as anorexia nervosa, bulimia nervosa, and cachexia using modulators of genes and gene products referenced in this section.
  • modulators e.g., inhibitors and agonists, of such genes and gene products can be identified by any method known in the art.
  • molecules can be assayed for their ability to promote or inhibit (modulate) the expression of such genes. Once modulators are identified, they can be assayed for therapeutic efficacy using any assay available in the art for obesity.
  • Modulators can be identified by screening for molecules that bind to gene products referenced in this section. Molecules that bind such gene products can be identified in many ways that are well known and routine in the art.
  • the gene products e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24 or their orthologs
  • a cell line that endogenously expresses little or none of the gene product and assaying for molecules that bind to the cells overexpressing the gene product (or cell extract from such overexpressing cells) and that do not bind to the cells not overexpressing the gene product (or cell extract from such cells), or by conjugating the gene product to a solid support (e.g., a chromatography resin), contacting the gene product conjugated to the solid support with a molecule of interest, isolating the solid support, and determining whether the molecule of interest bound to the gene product.
  • nucleic acids are provided that comprise a sequence complementary to at least 10, 25, 50, 100, or 200 nucleotides or the entire coding region of a gene encoding SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, or SEQ ID NO: 23.
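  • What it means for a nucleic acid to comprise a sequence complementary to a stretch of a coding region can be illustrated with a short sketch. It assumes perfect Watson-Crick complementarity over a contiguous stretch of DNA, which is stricter than hybridization under the stringency conditions discussed elsewhere; the function names, the placeholder target sequence and the default minimum length of 10 nucleotides are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_complementary_to_region(probe, target, min_length=10):
    """True if the probe is at least min_length nt long and is perfectly
    complementary to some contiguous stretch of the target strand."""
    probe = probe.upper()
    return len(probe) >= min_length and reverse_complement(probe) in target.upper()


# Placeholder target; a real analysis would use a coding sequence such as SEQ ID NO: 5.
target = "ATGGCCAAGTTCGGATCCAGT"
probe = reverse_complement(target[3:15])  # complementary to 12 nt of the target
print(is_complementary_to_region(probe, target))
```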
  • oligonucleotide primers for PCR amplification can be designed. PCR amplification is then used to amplify specifically the obesity related protein coding sequence, which can be cloned into an appropriate expression vector using routine techniques.
  • That vector can then be introduced into bacterial or cultured eukaryotic cells (e.g., cultured mammalian cells, insect cells, etc.) such that the gene product is expressed in the bacterial or cultured cell.
  • the gene product can then be isolated from the bacterial or eukaryotic cell culture.
  • diversity libraries such as random or combinatorial peptide or nonpeptide libraries, can be screened for molecules that specifically bind to and/or modulate the function of the gene product.
  • Many libraries are known in the art that can be used, e.g., chemically synthesized libraries, recombinant (e.g., phage display libraries), and in vitro translation-based libraries.
  • a benzodiazepine library (see e.g., Bunin et al., 1994, Proc. Natl. Acad. Sci. USA 91:4708-4712) can be adapted for use.
  • Peptoid libraries (Simon et al., 1992, Proc. Natl. Acad. Sci. USA 89:9367-9371) can also be used.
  • Another example of a library that can be used, in which the amide functionalities in peptides have been permethylated to generate a chemically transformed combinatorial library, is described by Ostresh et al. (1994, Proc. Natl. Acad. Sci. USA 91:11138-11142).
  • Screening the libraries can be accomplished by any of a variety of commonly known methods. See, e.g., the following references, which disclose screening of peptide libraries: Parmley and Smith, 1989, Adv. Exp. Med. Biol. 251:215-218; Scott and Smith, 1990, Science 249:386-390; Fowlkes et al, 1992, BioTechniques 13:422-427; Oldenburg et al, 1992, Proc. Natl. Acad. Sci.
  • screening can be carried out by contacting the library members with an obesity related gene product referenced in Section 5.15.3 (or nucleic acid or derivative) immobilized on a solid phase and harvesting those library members that bind to the protein (or nucleic acid or derivative).
  • panning techniques
  • the two-hybrid system for selecting interacting proteins in yeast can be used to identify molecules that specifically bind to a gene product referenced in Section 5.15.3 or a derivative of such gene product.
  • the invention also relates to nucleic acids hybridizable to or complementary to all or a portion of the nucleic acid sequences referenced in Section 5.15.3 under conditions of low stringency.
  • procedures using such conditions of low stringency are as follows (see also Shilo and Weinberg, 1981, Proc. Natl. Acad. Sci. U.S.A.
  • filters containing DNA are pretreated for 6 hours at 40°C in a solution containing 35% formamide, 5X SSC, 50 mM Tris-HCl (pH 7.5), 5 mM EDTA, 0.1% PVP, 0.1% Ficoll, 1% BSA, and 500 μg/ml denatured salmon sperm DNA.
  • Hybridizations are carried out in the same solution with the following modifications: 0.02% PVP, 0.02% Ficoll, 0.2% BSA, 100 μg/ml salmon sperm DNA, 10% (wt/vol) dextran sulfate, and 5-20 × 10⁶ cpm ³²P-labeled probe is used.
  • Filters are incubated in hybridization mixture for 18-20 hours at 40°C, and then washed for 1.5 hours at 55°C in a solution containing 2X SSC, 25 mM Tris-HCl (pH 7.4), 5 mM EDTA, and 0.1% SDS. The wash solution is replaced with fresh solution and incubated an additional 1.5 hours at 60°C. Filters are blotted dry and exposed for autoradiography. If necessary, filters are washed for a third time at 65-68°C and reexposed to film. Other conditions of low stringency that can be used are well known in the art (e.g., as employed for cross-species hybridizations).
  • the invention also relates to nucleic acids hybridizable to or complementary to all or a portion of the nucleic acid sequences referenced in Section 5.15.3 under conditions of high stringency.
  • procedures using such conditions of high stringency are as follows: prehybridization of filters containing DNA is carried out for 8 hours to overnight at 65°C in buffer composed of 6X SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 μg/ml denatured salmon sperm DNA.
  • Filters are hybridized for 48 hours at 65°C in prehybridization mixture containing 100 μg/ml denatured salmon sperm DNA and 5-20 × 10⁶ cpm of ³²P-labeled probe. Washing of filters is done at 37°C for one hour in a solution containing 2X SSC, 0.01% PVP, 0.01% Ficoll, and 0.01% BSA. This is followed by a wash in 0.1X SSC at 50°C for 45 minutes before autoradiography. Other conditions of high stringency that may be used are well known in the art.
  • the invention relates to nucleic acids hybridizable to or complementary to all or a portion of the nucleic acid sequences referenced in Section 5.15.3 under conditions of moderate stringency.
  • conditions of moderate stringency as known to those having ordinary skill in the art, and as defined by Sambrook et al, Molecular Cloning: A Laboratory Manual, 2nd Ed. Vol. 1, pp.
  • nucleic acid encoding a fragment or portion of a given nucleic acid sequence (e.g. a fragment of SEQ ID NO: 5) shall be construed as referring to a nucleic acid encoding only the recited fragment or portion of the specific nucleic acid and not the other contiguous portions of the nucleic acid as a continuous sequence.
  • the antibodies of the invention or fragments thereof can be produced by any method known in the art for the synthesis of antibodies, in particular, by chemical synthesis or preferably, by recombinant expression techniques.
  • Polyclonal antibodies can be produced by various procedures well known in the art.
  • a gene product of the present invention, as referenced in Section 5.15.3, or an immunogenic or antigenic fragment thereof can be administered to various host animals including, but not limited to, rabbits, mice, rats, etc. to induce the production of sera containing polyclonal antibodies specific for the obesity related gene product.
  • adjuvants can be used to increase the immunological response, depending on the host species, and include but are not limited to, Freund's (complete and incomplete), mineral gels such as aluminum hydroxide, surface active substances such as lysolecithin, pluronic polyols, polyanions, peptides, oil emulsions, keyhole limpet hemocyanins, dinitrophenol, and potentially useful human adjuvants such as BCG (bacille Calmette-Guerin) and Corynebacterium parvum. Such adjuvants are also well known in the art.
  • Monoclonal antibodies can be prepared using a wide variety of techniques known in the art including the use of hybridoma, recombinant, and phage display technologies, or a combination thereof. For example, monoclonal antibodies can be produced using hybridoma techniques including those known in the art and taught, for example, in
  • the term “monoclonal antibody” as used herein is not limited to antibodies produced through hybridoma technology.
  • the term “monoclonal antibody” refers to an antibody that is derived from a single clone, including any eukaryotic, prokaryotic, or phage clone, and not the method by which it is produced.
  • mice can be immunized with osteopontin or an immunogenic or antigenic fragment thereof and once an immune response is detected, e.g., antibodies specific for osteopontin are detected in the mouse serum, the mouse spleen is harvested and splenocytes isolated. The splenocytes are then fused by well known techniques to any suitable myeloma cells, for example cells from cell line SP20 available from the ATCC. Hybridomas are selected and cloned by limited dilution. The hybridoma clones are then assayed by methods known in the art for cells that secrete antibodies capable of binding the obesity related gene products of the present invention.
  • the present invention provides methods of generating monoclonal antibodies as well as antibodies produced by the method comprising culturing a hybridoma cell secreting an antibody of the invention wherein, preferably, the hybridoma is generated by fusing splenocytes isolated from a mouse immunized with a gene product referenced in Section 5.15.3 or an immunogenic or antigenic fragment thereof with myeloma cells and then screening the hybridomas resulting from the fusion for hybridoma clones that secrete an antibody able to bind to the subject gene product referenced in Section 5.15.3.
  • Antibody fragments that recognize specific epitopes can be generated by any technique known to those of skill in the art.
  • Fab and F(ab')2 fragments of the invention can be produced by proteolytic cleavage of immunoglobulin molecules, using enzymes such as papain (to produce Fab fragments) or pepsin (to produce F(ab')2 fragments).
  • F(ab')2 fragments contain the variable region, the light chain constant region and the CH1 domain of the heavy chain.
  • the antibodies of the present invention can also be generated using various phage display methods known in the art. In phage display methods, functional antibody domains are displayed on the surface of phage particles that carry the polynucleotide sequences encoding them.
  • DNA sequences encoding VH and VL domains are amplified from animal cDNA libraries (e.g., human or murine cDNA libraries of lymphoid tissues).
  • the DNA encoding the VH and VL domains is recombined together with a scFv linker by PCR and cloned into a phagemid vector (e.g., pCANTAB 6 or pComb 3 HSS).
  • the vector is electroporated in E. coli and the E. coli is infected with helper phage.
  • Phage used in these methods are typically filamentous phage including fd and M13 and the VH and VL domains are usually recombinantly fused to either the phage gene III or gene VIII.
  • Phage expressing an antigen binding domain that binds to an antigen of interest can be selected or identified with antigen, e.g., using labeled antigen or antigen bound or captured to a solid surface or bead.
  • Examples of phage display methods that can be used to make the antibodies of the present invention include those disclosed in Brinkman et al, 1995, J. Immunol. Methods 182:41-50; Ames et al, 1995, J. Immunol. Methods 184:177-186; Kettleborough et al, 1994, Eur. J.
  • Fab, Fab' and F(ab')2 fragments can also be employed using methods known in the art such as those disclosed in PCT publication WO 92/22324; Mullinax et al., 1992, BioTechniques 12(6):864-869; and Sawai et al, 1995, AJRI 34:26-34; and Better et al, 1988, Science 240:1041-1043 (said references incorporated by reference in their entireties).
  • PCR primers including VH or VL nucleotide sequences, a restriction site, and a flanking sequence to protect the restriction site can be used to amplify the VH or VL sequences in scFv clones.
  • the PCR amplified VH domains can be cloned into vectors expressing a VH constant region, e.g., the human gamma 4 constant region.
  • the PCR amplified VL domains can be cloned into vectors expressing a VL constant region, e.g., human kappa or lambda constant regions.
  • the vectors for expressing the VH or VL domains comprise an EF-1α promoter, a secretion signal, a cloning site for the variable domain, constant domains, and a selection marker such as neomycin.
  • the VH and VL domains can also be cloned into one vector expressing the necessary constant regions.
  • the heavy chain conversion vectors and light chain conversion vectors are then co-transfected into cell lines to generate stable or transient cell lines that express full-length antibodies, e.g., IgG, using techniques known to those of skill in the art.
  • human or chimeric antibodies. Completely human antibodies are particularly desirable for therapeutic treatment of human subjects.
  • Human antibodies can be made by a variety of methods known in the art including phage display methods described above using antibody libraries derived from human immunoglobulin sequences. See also U.S. Patent Nos.
  • Human antibodies can also be produced using transgenic mice that are incapable of expressing functional endogenous immunoglobulins, but which can express human immunoglobulin genes.
  • the human heavy and light chain immunoglobulin gene complexes can be introduced randomly or by homologous recombination into mouse embryonic stem cells.
  • the human variable region, constant region, and diversity region can be introduced into mouse embryonic stem cells in addition to the human heavy and light chain genes.
  • the mouse heavy and light chain immunoglobulin genes can be rendered non-functional separately or simultaneously with the introduction of human immunoglobulin loci by homologous recombination. In particular, homozygous deletion of the JH region prevents endogenous antibody production.
  • the modified embryonic stem cells are expanded and microinjected into blastocysts to produce chimeric mice. The chimeric mice are then bred to produce homozygous offspring that express human antibodies.
  • the transgenic mice are immunized in the normal fashion with a selected antigen, e.g., all or a portion of a polypeptide of interest.
  • Monoclonal antibodies directed against the antigen can be obtained from the immunized transgenic mice using conventional hybridoma technology.
  • the human immunoglobulin transgenes harbored by the transgenic mice rearrange during B cell differentiation, and subsequently undergo class switching and somatic mutation.
  • Lonberg and Huszar (1995, Int. Rev. Immunol. 13:65-93).
  • a chimeric antibody is a molecule in which different portions of the antibody are derived from different immunoglobulin molecules such as antibodies having a variable region derived from a human antibody and a non-human immunoglobulin constant region.
  • Methods for producing chimeric antibodies are known in the art. See e.g., Morrison, 1985, Science 229:1202; Oi et al, 1986, BioTechniques 4:214; Gillies et al, 1989, J. Immunol. Methods 125:191-202; U.S. Patent Nos. 5,807,715; 4,816,567; and 4,816,397, which are incorporated herein by reference in their entirety.
  • Chimeric antibodies comprising one or more CDRs from human species and framework regions from a non-human immunoglobulin molecule can be produced using a variety of techniques known in the art including, for example, CDR-grafting (EP 239,400; PCT publication WO 91/09967; U.S. Patent Nos. 5,225,539; 5,530,101; and 5,585,089), veneering or resurfacing (EP 592,106; EP 519,596; Padlan, 1991, Molecular Immunology 28(4/5):489-498; Studnicka et al, 1994, Protein Engineering 7(6):805-814; Roguska et al, 1994, PNAS 91:969-973), and chain shuffling (U.S. Patent No.
  • the antibodies of the invention can, in turn, be utilized to generate anti-idiotype antibodies that "mimic" one or more of the obesity related gene products of the present invention using techniques well known to those skilled in the art. (See, e.g., Greenspan & Bona, 1989, FASEB J. 7:437-444; and Nissinoff, 1991, J. Immunol. 147:2429-2438).
  • the invention provides polynucleotides comprising a nucleotide sequence encoding an antibody of the invention or a fragment thereof.
  • the invention also encompasses polynucleotides that hybridize under high stringency, intermediate or lower stringency hybridization conditions, e.g., as defined supra, to polynucleotides that encode an antibody of the invention.
  • the polynucleotides can be obtained, and the nucleotide sequence of the polynucleotides determined, by any method known in the art. Nucleotide sequences encoding these antibodies can be determined using any nucleic acid sequencing method known in the art.
  • Such a polynucleotide encoding the antibody can be assembled from chemically synthesized oligonucleotides (e.g., as described in Kutmeier et al, 1994, BioTechniques 17:242), which, briefly, involves the synthesis of overlapping oligonucleotides containing portions of the sequence encoding the antibody, annealing and ligating of those oligonucleotides, and then amplification of the ligated oligonucleotides by PCR.
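  • The overlap-based assembly described above can be sketched in Python as follows. This is a simplified illustration only: it treats every oligonucleotide as lying on one strand, joins neighbours through their shared suffix/prefix overlap, and does not model annealing, ligation, or PCR amplification. The function names, the minimum overlap of 4 bases, and any sequences passed in are hypothetical assumptions, not part of the disclosure.

```python
# Illustrative sketch only: single-strand simplification with hypothetical names.

def merge_overlapping(left: str, right: str, min_overlap: int = 4) -> str:
    """Join two fragments through the longest suffix/prefix overlap."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            # Keep the shared overlap once, then append the rest of `right`.
            return left + right[k:]
    raise ValueError("fragments do not share a sufficient overlap")


def assemble(fragments: list[str], min_overlap: int = 4) -> str:
    """Assemble ordered, overlapping fragments into one contiguous sequence."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    assembled = fragments[0].upper()
    for fragment in fragments[1:]:
        assembled = merge_overlapping(assembled, fragment.upper(), min_overlap)
    return assembled
```

  • Raising an error when no overlap is found is a deliberate choice in this sketch, so that a gap in the oligonucleotide set is reported rather than silently bridged.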
  • a polynucleotide encoding an antibody can be generated from nucleic acid from a suitable source.
  • a nucleic acid encoding the immunoglobulin can be chemically synthesized or obtained from a suitable source (e.g., an antibody cDNA library, or a cDNA library generated from, or nucleic acid, preferably poly A+ RNA, isolated from, any tissue or cells expressing the antibody, such as hybridoma cells selected to express an antibody of the invention) by PCR amplification using synthetic primers hybridizable to the 3' and 5' ends of the sequence or by cloning using an oligonucleotide probe specific for the particular gene sequence to identify, e.g., a cDNA clone from a cDNA library that encodes the antibody.
  • nucleic acids generated by PCR can then be cloned into replicable cloning vectors using any method well known in the art.
  • nucleotide sequence of the antibody can be manipulated using methods well known in the art for the manipulation of nucleotide sequences, e.g., recombinant DNA techniques, site directed mutagenesis, PCR, etc. (see, for example, the techniques described in Sambrook et al,
  • an antibody of the invention, derivative or analog thereof (e.g., a heavy or light chain of an antibody of the invention or a portion thereof or a single chain antibody of the invention), requires construction of an expression vector containing a polynucleotide that encodes the antibody.
  • once a polynucleotide encoding an antibody molecule or a heavy or light chain of an antibody, or portion thereof (preferably, but not necessarily, containing the heavy or light chain variable domain), of the invention has been obtained, the vector for the production of the antibody molecule can be produced by recombinant DNA technology using techniques well known in the art.
  • methods for preparing a protein by expressing a polynucleotide containing an antibody-encoding nucleotide sequence are described herein. Methods that are well known to those skilled in the art can be used to construct expression vectors containing antibody coding sequences and appropriate transcriptional and translational control signals. These methods include, for example, in vitro recombinant DNA techniques, synthetic techniques, and in vivo genetic recombination.
  • the invention thus provides replicable vectors comprising a nucleotide sequence encoding an antibody molecule of the invention, a heavy or light chain of an antibody, a heavy or light chain variable domain of an antibody or a portion thereof, or a heavy or light chain CDR, operably linked to a promoter.
  • Such vectors can include the nucleotide sequence encoding the constant region of the antibody molecule (see, e.g., PCT Publication WO 86/05807; PCT Publication WO 89/01036; and U.S. Patent No. 5,122,464) and the variable domain of the antibody can be cloned into such a vector for expression of the entire heavy, the entire light chain, or both the entire heavy and light chains.
  • the expression vector is transferred to a host cell by conventional techniques and the transfected cells are then cultured by conventional techniques to produce an antibody of the invention.
  • the invention includes host cells containing a polynucleotide encoding an antibody of the invention or fragments thereof, or a heavy or light chain thereof, or portion thereof, or a single chain antibody of the invention, operably linked to a heterologous promoter.
  • vectors encoding both the heavy and light chains may be co-expressed in the host cell for expression of the entire immunoglobulin molecule, as detailed below.
  • a variety of host-expression vector systems can be utilized to express the antibody molecules of the invention.
  • Such host-expression systems represent vehicles by which the coding sequences of interest can be produced and subsequently purified, but also represent cells that may, when transformed or transfected with the appropriate nucleotide coding sequences, express an antibody molecule of the invention in situ.
  • These include but are not limited to microorganisms such as bacteria (e.g., E. coli, B. subtilis) transformed with recombinant bacteriophage DNA, plasmid DNA or cosmid DNA expression vectors containing antibody coding sequences; yeast (e.g., Saccharomyces, Pichia) transformed with recombinant yeast expression vectors containing antibody coding sequences; insect cell systems infected with recombinant virus expression vectors (e.g., baculovirus) containing antibody coding sequences; plant cell systems infected with recombinant virus expression vectors (e.g., cauliflower mosaic virus, CaMV; tobacco mosaic virus, TMV) or transformed with recombinant plasmid expression vectors (e.g., Ti plasmid) containing antibody coding sequences; or mammalian cell systems (e.g., COS, CHO, BHK, 293, 3T3 cells) harboring recombinant expression constructs containing promoters derived from the genome of mammalian cells (e.g., metallothionein promoter) or from mamm
  • bacterial cells such as Escherichia coli, and more preferably, eukaryotic cells, especially for the expression of a whole recombinant antibody molecule, are used for the expression of a recombinant antibody molecule.
  • mammalian cells such as Chinese hamster ovary cells (CHO)
  • a vector such as the major intermediate early gene promoter element from human cytomegalovirus is an effective expression system for antibodies (Foecking et al, 1986, Gene 45:101; Cockett et al., 1990, Bio/Technology 8:2).
  • a number of expression vectors can be advantageously selected depending upon the use intended for the antibody molecule being expressed.
  • vectors that direct the expression of high levels of fusion protein products that are readily purified can be desirable.
  • Such vectors include, but are not limited to, the E. coli expression vector pUR278 (Ruther et al, 1983, EMBO 12:1791), in which the antibody coding sequence can be ligated individually into the vector in frame with the lac Z coding region so that a fusion protein is produced; pIN vectors (Inouye & Inouye, 1985, Nucleic Acids Res. 13:3101-3109; Van Heeke & Schuster, 1989, J. Biol. Chem.
  • pGEX vectors can also be used to express foreign polypeptides as fusion proteins with glutathione S-transferase (GST).
  • fusion proteins are soluble and can easily be purified from lysed cells by adsorption and binding to matrix glutathione agarose beads followed by elution in the presence of free glutathione.
  • the pGEX vectors are designed to include thrombin or factor Xa protease cleavage sites so that the cloned target gene product can be released from the GST moiety.
  • Autographa californica nuclear polyhedrosis virus (AcNPV)
  • the virus grows in Spodoptera frugiperda cells.
  • the antibody coding sequence can be cloned individually into non-essential regions (for example the polyhedrin gene) of the virus and placed under control of an AcNPV promoter (for example the polyhedrin promoter).
  • a number of viral-based expression systems can be utilized.
  • the antibody coding sequence of interest can be ligated to an adenovirus transcription/translation control complex, e.g., the late promoter and tripartite leader sequence. This chimeric gene can then be inserted in the adenovirus genome by in vitro or in vivo recombination.
  • Insertion in a non-essential region of the viral genome will result in a recombinant virus that is viable and capable of expressing the antibody molecule in infected hosts (e.g., see Logan & Shenk, 1984, Proc. Natl. Acad. Sci. USA 81:355-359).
  • Specific initiation signals may also be required for efficient translation of inserted antibody coding sequences. These signals include the ATG initiation codon and adjacent sequences. Furthermore, the initiation codon must be in phase with the reading frame of the desired coding sequence to ensure translation of the entire insert.
  • These exogenous translational control signals and initiation codons can be of a variety of origins, both natural and synthetic.
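  • The requirement that the initiation codon be in phase with the reading frame of the inserted coding sequence can be illustrated with a short Python check. This is a minimal sketch under stated assumptions: the construct is given as a DNA string, the positions of the ATG and of the first base of the insert are known, and the function name and arguments are hypothetical. It only tests the frame and the absence of an intervening in-frame stop codon; it says nothing about translation efficiency.

```python
# Illustrative sketch only: hypothetical helper for checking reading-frame phase.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def atg_in_phase(construct: str, atg_index: int, insert_start: int) -> bool:
    """Return True if the ATG at atg_index is in frame with the insert.

    construct    -- DNA sequence of the expression construct (5'->3')
    atg_index    -- 0-based position of the A of the initiation codon
    insert_start -- 0-based position of the first base of the coding insert
    """
    seq = construct.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG initiation codon at atg_index")
    # The insert must begin a whole number of codons downstream of the ATG.
    if insert_start < atg_index or (insert_start - atg_index) % 3 != 0:
        return False
    # An in-frame stop codon before the insert would end translation early.
    for i in range(atg_index, insert_start, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return False
    return True
```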
  • a host cell strain can be chosen that modulates the expression of the inserted sequences, or modifies and processes the gene product in the specific fashion desired. Such modifications (e.g., glycosylation) and processing (e.g., cleavage) of protein products can be important for the function of the protein.
  • Different host cells have characteristic and specific mechanisms for the post-translational processing and modification of proteins and gene products. Appropriate cell lines or host systems can be chosen to ensure the correct modification and processing of the foreign protein expressed.
  • eukaryotic host cells that possess the cellular machinery for proper processing of the primary transcript, glycosylation, and phosphorylation of the gene product can be used.
  • mammalian host cells include but are not limited to CHO, VERY, BHK, HeLa, COS, MDCK, 293, 3T3, W138, and in particular, breast cancer cell lines such as, for example, BT483, Hs578T, HTB2, BT20 and T47D, and normal mammary gland cell lines such as, for example, CRL7030 and Hs578Bst.
  • stable expression is preferred.
  • cell lines that stably express the antibody molecule can be engineered.
  • host cells can be transformed with DNA controlled by appropriate expression control elements (e.g., promoter and enhancer sequences, transcription terminators, polyadenylation sites, etc.), and a selectable marker.
  • engineered cells can be allowed to grow for 1-2 days in an enriched medium, and are then switched to a selective medium.
  • the selectable marker in the recombinant plasmid confers resistance to the selection and allows cells to stably integrate the plasmid into their chromosomes and grow to form foci which in turn can be cloned and expanded into cell lines. This method can advantageously be used to engineer cell lines that express the antibody molecule.
  • Such engineered cell lines can be particularly useful in screening and evaluation of compositions that interact directly or indirectly with the antibody molecule.
  • a number of selection systems can be used including, but not limited to, the herpes simplex virus thymidine kinase (Wigler et al, 1977, Cell 11:223), hypoxanthine-guanine phosphoribosyltransferase (Szybalska & Szybalski, 1992, Proc. Natl. Acad. Sci. USA 48:202), and adenine phosphoribosyltransferase (Lowy et al., 1980, Cell 22:8-17) genes, which can be employed in tk-, hgprt- or aprt- cells, respectively.
  • antimetabolite resistance can be used as the basis of selection for the following genes: dhfr, which confers resistance to methotrexate (Wigler et al, 1980, Proc. Natl. Acad. Sci. USA 77:357; O'Hare et al, 1981, Proc. Natl. Acad. Sci. USA 78:1527); gpt, which confers resistance to mycophenolic acid (Mulligan & Berg, 1981, Proc. Natl. Acad. Sci. USA 78:2072); neo, which confers resistance to the aminoglycoside G-418 (Wu and Wu, 1991, Biotherapy 3:87-95; Tolstoshev, 1993, Ann. Rev. Pharmacol. Toxicol. 32:573-596; Mulligan, 1993, Science 260:926-932; and Morgan and Anderson, 1993, Ann. Rev.
  • the host cell can be co-transfected with two expression vectors of the invention, the first vector encoding a heavy chain derived polypeptide and the second vector encoding a light chain derived polypeptide.
  • the two vectors can contain identical selectable markers that enable equal expression of heavy and light chain polypeptides.
  • a single vector may be used that encodes, and is capable of expressing, both heavy and light chain polypeptides.
  • the coding sequences for the heavy and light chains may comprise cDNA or genomic DNA.
  • an antibody molecule of the invention may be purified by any method known in the art for purification of an immunoglobulin molecule, for example, by chromatography (e.g., ion exchange, affinity, particularly by affinity for the specific antigen after Protein A, and sizing column chromatography), centrifugation, differential solubility, or by any other standard technique for the purification of proteins.
  • the antibodies of the present invention or fragments thereof may be fused to heterologous polypeptide sequences described herein or otherwise known in the art to facilitate purification.
  • the function of the genes referenced in Section 5.15.3 can be inhibited by use of antisense nucleic acids.
  • the present invention provides the therapeutic or prophylactic use of nucleic acids of at least six nucleotides in length that are antisense to a gene or cDNA encoding an obesity related gene product referenced in Section 5.15.3, or portions thereof.
  • an “antisense” nucleic acid as used herein refers to a nucleic acid capable of hybridizing to a portion of a nucleic acid referenced in Section 5.15.3 (preferably mRNA, e.g., the sequence of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, or SEQ ID NO: 23) by virtue of some sequence complementarity.
  • the antisense nucleic acid can be complementary to a coding and/or noncoding region of an obesity related mRNA.
  • the antisense nucleic acids can be oligonucleotides that are double-stranded or single-stranded RNA or DNA or a modification or derivative thereof, which can be directly administered to a cell, or which can be produced intracellularly by transcription of exogenous, introduced sequences.
  • the antisense nucleic acids are of at least six nucleotides and are preferably oligonucleotides (ranging from 6 to about 200 nucleotides).
  • the oligonucleotide is at least 10 nucleotides, at least 15 nucleotides, at least 100 nucleotides, or at least 200 nucleotides.
  • the oligonucleotides can be DNA or RNA or chimeric mixtures or derivatives or modified versions thereof, single-stranded or double-stranded.
  • the oligonucleotide can be modified at the base moiety, sugar moiety, or phosphate backbone.
  • the oligonucleotide can include other appending groups such as peptides, or agents facilitating transport across the cell membrane (see, e.g., Letsinger et al, 1989, Proc. Natl. Acad. Sci. U.S.A. 86: 6553-6556; Lemaitre et al, 1987, Proc. Natl. Acad. Sci. 84: 648-652; PCT Publication No.
  • the antisense oligonucleotide is provided, preferably as single-stranded DNA.
  • the oligonucleotide can be modified at any position on its structure with constituents generally known in the art.
  • the antisense oligonucleotides can comprise at least one modified base moiety that is selected from the group including, but not limited to, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine,
  • the oligonucleotide comprises at least one modified sugar moiety selected from the group including, but not limited to, arabinose, 2-fluoroarabinose, xylulose, and hexose.
  • the oligonucleotide comprises at least one modified phosphate backbone selected from the group consisting of a phosphorothioate, a phosphorodithioate, a phosphoramidothioate, a phosphoramidate, a phosphordiamidate, a methylphosphonate, an alkyl phosphotriester, a formacetal, or analogs thereof.
  • the oligonucleotide is an α-anomeric oligonucleotide.
  • An α-anomeric oligonucleotide forms specific double-stranded hybrids with complementary RNA in which, contrary to the usual β-units, the strands run parallel to each other (Gautier et al., 1987, Nucl. Acids Res. 15: 6625-6641).
  • the oligonucleotide can be conjugated to another molecule, e.g., a peptide, hybridization triggered cross-linking agent, transport agent, hybridization-triggered cleavage agent, etc.
  • Oligonucleotides may be synthesized by standard methods known in the art, e.g.
  • phosphorothioate oligonucleotides can be synthesized by the method of Stein et al. (1988, Nucl. Acids Res. 16: 3209), methylphosphonate oligonucleotides can be prepared by use of controlled pore glass polymer supports (Sarin et al, 1988, Proc. Natl. Acad. Sci. U.S.A. 85: 7448-7451), etc.
  • the antisense oligonucleotides comprise catalytic RNAs, or ribozymes (see, e.g., PCT International Publication WO 90/11364, published October 4, 1990; Sarver et al., 1990, Science 247: 1222-1225).
  • the oligonucleotide is a 2'-O-methylribonucleotide (Inoue et al, 1987, Nucl. Acids Res. 15: 6131-6148), or a chimeric RNA-DNA analog (Inoue et al, 1987, FEBS Lett. 215: 327-330).
  • antisense nucleic acids are produced intracellularly by transcription from an exogenous sequence.
  • a vector can be introduced in vivo such that it is taken up by a cell, within which cell the vector or a portion thereof is transcribed, producing an antisense nucleic acid (RNA) of the invention.
  • Such a vector would contain a sequence encoding an antisense nucleic acid.
  • Such a vector can remain episomal or become chromosomally integrated, as long as it can be transcribed to produce the desired antisense RNA.
  • Such vectors can be constructed by recombinant DNA technology methods standard in the art. Vectors can be plasmid, viral, or others known in the art, used for replication and expression in mammalian cells.
  • Expression of the sequences encoding the antisense RNAs can be by any promoter known in the art to act in mammalian, preferably human, cells. Such promoters can be inducible or constitutive. Such promoters include, but are not limited to, the SV40 early promoter region (Bernoist and Chambon, 1981, Nature 290: 304-310), the promoter contained in the 3' long terminal repeat of Rous sarcoma virus (Yamamoto et al, 1980, Cell 22: 787-797), the herpes thymidine kinase promoter (Wagner et al, 1981, Proc. Natl. Acad. Sci. U.S.A.
  • the antisense nucleic acids of the invention comprise a sequence complementary to at least a portion of an RNA transcript of a gene referenced in Section 5.15.3. However, absolute complementarity, although preferred, is not required.
  • “complementary to at least a portion of an RNA” means a sequence having sufficient complementarity to be able to hybridize with the RNA, forming a stable duplex; in the case of double-stranded antisense nucleic acids, a single strand of the duplex DNA can thus be tested, or triplex formation can be assayed.
  • the ability to hybridize will depend on both the degree of complementarity and the length of the antisense nucleic acid. Generally, the longer the hybridizing nucleic acid, the more base mismatches with an obesity related RNA (target RNA) it may contain and still form a stable duplex (or triplex, as the case may be).
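  • As a rough illustration of the complementarity discussed above, the following Python sketch derives the perfectly complementary antisense RNA for a target site and counts the base mismatches a candidate oligonucleotide would have against that site. The function names and any sequences used with them are hypothetical and are not taken from the sequences referenced in Section 5.15.3. The sketch only counts mismatches under Watson-Crick pairing; it does not predict whether a stable duplex actually forms, which, as noted above, also depends on length.

```python
# Illustrative sketch only: names and example sequences are hypothetical.
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def antisense_rna(target: str) -> str:
    """Return the fully complementary antisense RNA, written 5'->3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(target.upper()))


def count_mismatches(oligo: str, target_site: str) -> int:
    """Count positions at which the oligo fails to pair with the target site.

    Both sequences are written 5'->3' and must have the same length.
    """
    if len(oligo) != len(target_site):
        raise ValueError("oligo and target site must have equal length")
    perfect = antisense_rna(target_site)
    return sum(1 for a, b in zip(oligo.upper(), perfect) if a != b)
```

  • A mismatch count of zero corresponds to the absolute complementarity that is preferred but not required.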
  • compositions of the invention comprising an effective amount of an antisense nucleic acid in a pharmaceutically acceptable carrier can be administered in therapeutic methods of the invention.
  • the amount of antisense nucleic acid that will be effective in the treatment of a particular disorder or condition will depend on the nature of the disorder or condition, and can be determined by standard clinical techniques. Where possible, it is desirable to determine the antisense cytotoxicity in vitro, and then in useful animal model systems prior to testing and use in humans.
  • pharmaceutical compositions comprising antisense nucleic acids are administered via liposomes, microparticles, or microcapsules.
  • it may be useful to use such compositions to achieve sustained release of antisense nucleic acids.
  • Agonists include, but are not limited to, active fragments thereof (wherein a fragment is at least 10, 15, 20, 30, 50, 75, 100, or 150 amino acid portion of an obesity related gene product disclosed in Section 6.7.5) and analogs and derivatives thereof, and nucleic acids encoding any of the foregoing.
  • the nucleic acid containing all or a portion of the nucleotide sequence encoding the protein can be inserted into an appropriate expression vector, e.g., a vector that contains the necessary elements for the transcription and translation of the inserted protein coding sequence.
  • the regulatory elements are heterologous (i.e., not the native gene promoter).
  • Promoters which may be used include but are not limited to the SV40 early promoter (Bernoist and Chambon, 1981, Nature 290: 304-310), the promoter contained in the 3' long terminal repeat of Rous sarcoma virus (Yamamoto et al, 1980, Cell 22: 787-797), the herpes thymidine kinase promoter (Wagner et al, 1981, Proc. Natl. Acad. Sci.
  • promoter elements from yeast and other fungi such as the Gal4 promoter, the alcohol dehydrogenase promoter, the phosphoglycerol kinase promoter, the alkaline phosphatase promoter, and the following animal transcriptional control regions that exhibit tissue specificity and have been utilized in transgenic animals: elastase I gene control region which is active in pancreatic acinar cells (Swift et al, 1984, Cell 38: 639-646; Ornitz et al., 1986, Cold Spring Harbor Symp. Quant. Biol.
  • mouse mammary tumor virus control region which is active in testicular, breast, lymphoid and mast cells (Leder et al, 1986, Cell 45: 485-495), albumin gene control region which is active in liver (Pinckert et al, 1987, Genes and Devel. 1: 268-276), alpha-fetoprotein gene control region which is active in liver (Krumlauf et al, 1985, Mol. Cell. Biol. 5: 1639-1648; Hammer et al, 1987, Science 235: 53-58), alpha-1 antitrypsin gene control region which is active in liver (Kelsey et al, 1987, Genes and Devel.
  • beta globin gene control region which is active in myeloid cells (Mogram et al, 1985, Nature 315: 338-340; Kollias et al, 1986, Cell 46: 89-94), myelin basic protein gene control region which is active in oligodendrocyte cells of the brain (Readhead et al, 1987, Cell 48: 703-712), myosin light chain-2 gene control region which is active in skeletal muscle (Sani, 1985, Nature 314: 283-286), and gonadotrophic releasing hormone gene control region which is active in gonadotrophs of the hypothalamus (Mason et al, 1986, Science 234: 1372-1378).
  • a variety of host-vector systems can be utilized to express the protein coding sequence. These include, but are not limited to, mammalian cell systems infected with virus (e.g., vaccinia virus, adenovirus, etc.); insect cell systems infected with virus (e.g. baculovirus); microorganisms such as yeast containing yeast vectors; or bacteria transformed with bacteriophage, DNA, plasmid DNA, or cosmid DNA.
  • the expression elements of vectors vary in their strengths and specificities. Depending on the host-vector system utilized, any one
  • a gene product disclosed in Section 5.15.3, or fragment, derivative or analog thereof can be isolated and purified by standard methods including chromatography (e.g., ion exchange, affinity, and sizing column chromatography), centrifugation, differential solubility, or by any other standard technique for the purification of proteins.
  • An obesity related gene product can also be purified by any standard purification method from natural sources.
  • an obesity related gene product, analog or derivative thereof of the present invention can be synthesized by standard chemical methods known in the art (e.g., see Hunkapiller et al, 1984, Nature 310: 105-111).
  • the derivatives include less than 25 amino acid substitutions, less than 20 amino acid substitutions, less than 15 amino acid substitutions, less than 10 amino acid substitutions, less than 5 amino acid substitutions, less than 4 amino acid substitutions, less than 3 amino acid substitutions, or less than 2 amino acid substitutions relative to the original molecule.
  • the derivatives have conservative amino acid substitutions made at one or more predicted non-essential amino acid residues.
  • a “conservative amino acid substitution” is one in which the amino acid residue is replaced with an amino acid residue having a side chain with a similar charge.
  • Families of amino acid residues having side chains with similar charges have been defined in the art. These families include amino acids with basic side chains (e.g., lysine, arginine, histidine), acidic side chains (e.g., aspartic acid, glutamic acid), uncharged polar side chains (e.g., glycine, asparagine, glutamine, serine, threonine, tyrosine, cysteine), nonpolar side chains (e.g., alanine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan), beta-branched side chains (e.g., threonine, valine, isoleucine) and aromatic side chains (e.g., tyrosine, phenylalanine, tryptophan
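The side-chain families listed above can be turned into a simple lookup for checking whether a given substitution is conservative. The following is a minimal sketch, not a method taken from the specification: the names `FAMILIES`, `is_conservative`, and `classify_substitutions` are hypothetical, residues are written in one-letter code, and a substitution is treated as conservative when both residues share at least one listed family. Only the residues named in the text are included.

```python
# Side-chain families as listed in the text, in one-letter code.
# A residue may belong to more than one family (e.g. valine).
FAMILIES = {
    "basic": set("KRH"),
    "acidic": set("DE"),
    "uncharged_polar": set("GNQSTYC"),
    "nonpolar": set("AVLIPFMW"),
    "beta_branched": set("TVI"),
    "aromatic": set("YFW"),
}


def is_conservative(original: str, replacement: str) -> bool:
    """Return True if both residues share at least one side-chain family."""
    return any(
        original in family and replacement in family
        for family in FAMILIES.values()
    )


def classify_substitutions(original_seq: str, derivative_seq: str):
    """List (position, original, replacement, conservative?) for each change.

    Assumes substitutions only, so both sequences must have equal length.
    Positions are 1-based.
    """
    if len(original_seq) != len(derivative_seq):
        raise ValueError("sequences must be the same length")
    changes = []
    pairs = zip(original_seq, derivative_seq)
    for position, (a, b) in enumerate(pairs, start=1):
        if a != b:
            changes.append((position, a, b, is_conservative(a, b)))
    return changes
```

The length of the list returned by `classify_substitutions` gives the substitution count that the preceding passage bounds (less than 25, less than 20, and so on).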
  • mutations can be introduced randomly along all or part of the coding sequence, such as by saturation mutagenesis, and the resultant mutants can be screened for biological activity to identify mutants that retain activity. Following mutagenesis, the encoded protein can be expressed and the activity of the protein can be determined.
  • the gene analog, derivative or fragment thereof is encoded by a nucleotide sequence that hybridizes to the nucleotide sequence of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, or SEQ ID NO: 23 under stringent conditions, e.g., hybridization to filter-bound DNA in 6x sodium chloride/sodium citrate (SSC) at about 45 °C followed by one or more washes in 0.2x SSC/0.1% SDS at about 50-65 °C, under highly stringent conditions, e.g., hybridization to filter-bound nucleic acid in 6x SSC at about 45 °C followed by one or more washes in 0.1x SSC/0.2% SDS at about 68 °C, or under other stringent hybridization conditions
  • the analog, derivative or fragment comprises an amino acid sequence that is at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24.
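The identity thresholds above can be illustrated with a short calculation. The text does not say how percent identity is computed, so this sketch makes explicit assumptions: the two sequences are already aligned to equal length, `-` marks a gap, and identity is the number of matching residues divided by the number of aligned columns that are not gap-only. The function names are hypothetical.

```python
# Thresholds enumerated in the text ("at least 35%" ... "at least 99%").
THRESHOLDS = (35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99)


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns (assumed definition)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # skip columns that are gaps in both sequences
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def highest_threshold_met(identity: float):
    """Return the largest listed threshold satisfied, or None if below 35%."""
    met = [t for t in THRESHOLDS if identity >= t]
    return max(met) if met else None
```

Other definitions (for example, dividing by the length of the shorter sequence) give different values, so the denominator should be stated whenever an identity figure is reported.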
  • nucleic acid sequence can be mutated in vitro or in vivo, to create and/or destroy translation, initiation, and/or termination sequences, or to create variations in coding regions and/or form new restriction endonuclease sites or destroy preexisting ones, to facilitate further in vitro modification.
  • Any technique for mutagenesis known in the art can be used, including, but not limited to, chemical mutagenesis, in vitro site-directed mutagenesis (Hutchinson, C, et al, 1978, J. Biol. Chem 253:6551), use of TAB® linkers (Pharmacia), etc.
  • Manipulations of the sequence can also be made at the protein level.
  • protein fragments or other derivatives or analogs that are differentially modified during or after translation, e.g., by glycosylation, acetylation, phosphorylation, amidation, derivatization by known protecting/blocking groups, proteolytic cleavage, linkage to an antibody molecule or other cellular ligand, etc.
  • Any of numerous chemical modifications can be carried out by known techniques including, but not limited to, specific chemical cleavage by cyanogen bromide, trypsin, chymotrypsin, papain, V8 protease, NaBH4; acetylation, formylation, oxidation, reduction; metabolic synthesis in the presence of tunicamycin, etc.
  • Non-classical amino acids include but are not limited to the D-isomers of the common amino acids, α-amino isobutyric acid, 4-aminobutyric acid, Abu, 2-amino butyric acid, γ-Abu, ε-Ahx, 6-amino hexanoic acid, Aib, 2-amino isobutyric acid, 3-amino propionic acid, ornithine, norleucine, norvaline, hydroxyproline, sarcosine, citrulline, cysteic acid, t-butylglycine, t-butylalanine, phenylglycine, cyclohexylalanine, β-alanine, fluoro-amino acids, designer amino
  • the amino acids used to make the analogs and derivatives can be D (dextrorotary), L (levorotary), or some combination of D and L.
  • the derivative is a chimeric (or fusion) protein comprising a gene product referenced in Section 5.15.3 or fragment thereof (preferably consisting of at least one protein domain or protein structural motif, or at least 15, preferably 20, amino acids of the obesity related protein) joined at its amino- or carboxy-terminus via a peptide bond to an amino acid sequence of a different protein.
  • such a chimeric protein is produced by recombinant expression of a nucleic acid encoding the protein (comprising an obesity related protein-coding sequence joined in-frame to a coding sequence for a different protein).
  • Such a chimeric product can be made by ligating the appropriate nucleic acid sequences encoding the desired amino acid sequences to each other by methods known in the art, in the proper coding frame, and expressing the chimeric product by methods commonly known in the art.
  • such a chimeric product may be made by protein synthetic techniques, e.g., by use of a peptide synthesizer.
  • Chimeric genes comprising portions of a gene product referenced in Section 5.15.3 (e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24) fused to any heterologous protein-encoding sequences can be constructed.
  • the invention provides methods of treatment, prophylaxis, and amelioration of one or more symptoms associated with obesity by administering to a subject an effective amount of a modulator of a gene referenced in Section 5.15.3 (e.g., SEQ ID NO: 5, SEQ
  • the obesity related gene modulator is substantially purified (e.g., substantially free from substances that limit its effect or produce undesired side-effects).
  • the subject is preferably a mammal such as a non-primate (e.g., cows, pigs, horses, cats, dogs, rats, etc.) or a primate (e.g., monkeys or humans).
  • the subject is a human.
  • DELIVERY SYSTEMS. Various delivery systems are known and can be used to administer modulators of the invention or fragment thereof, e.g., encapsulation in liposomes, microparticles, microcapsules, recombinant cells capable of expressing a protein or antibody modulator, receptor-mediated endocytosis (see, e.g., Wu and Wu, 1987, J. Biol. Chem. 262:4429-4432), construction of a nucleic acid as part of a retroviral or other vector, etc.
  • Methods of administering a modulator, or pharmaceutical composition include, but are not limited to, parenteral administration (e.g., intradermal, intramuscular, intraperitoneal, intravenous and subcutaneous), epidural, and mucosal (e.g., intranasal and oral routes).
  • modulators of the present invention or fragments thereof, or pharmaceutical compositions are administered intramuscularly, intravenously, or subcutaneously.
  • the compositions can be administered by any convenient route, for example by infusion or bolus injection, by absorption through epithelial or mucocutaneous linings (e.g., oral mucosa, rectal and intestinal mucosa, etc.) and can be administered together with other biologically
  • Administration can be systemic or local.
  • pulmonary administration can also be employed, e.g., by use of an inhaler or nebulizer, and formulation with an aerosolizing agent. See, e.g., U.S. Patent Nos. 6,019,968, 5,985,309, 5,934,272, 5,874,064, 5,290,540, and 4,880,078, and PCT Publication No. WO 92/19244.
  • the pharmaceutical composition is delivered locally to the site of neural tissue damage, e.g., using osmotic or other types of pumps.
  • the invention also provides that the pharmaceutical composition is packaged in a hermetically sealed container such as an ampule or sachette indicating the quantity of modulator.
  • the modulator is supplied as a dry sterilized lyophilized powder or water free concentrate in a hermetically sealed container and can be reconstituted, e.g., with water or saline to the appropriate concentration for administration to a subject.
  • the modulator is supplied as a dry sterile lyophilized powder in a hermetically sealed container at a unit dosage of at least 5 mg, more preferably at least 10 mg, at least 15 mg, at least 25 mg, at least 35 mg, at least 45 mg, at least 50 mg, or at least 75 mg.
  • the liquid form is supplied in a hermetically sealed container at least 1 mg/ml, more preferably at least 2.5 mg/ml, at least 5 mg/ml, at least 8 mg/ml, at least 10 mg/ml, or at least 25 mg/ml.
  • it can be desirable to administer the pharmaceutical compositions of the invention locally to the area in need of treatment; this can be achieved by, for example, and not by way of limitation, local infusion, by injection, or by means of an implant, said implant being of a porous, non-porous, or gelatinous material, including membranes, such as sialastic membranes, or fibers.
  • a particularly useful application involves coating, imbedding or derivatizing fibers, such as collagen fibers, protein polymers, etc. with a modulator of the invention.
  • the composition can be delivered in a vesicle, in particular a liposome (see Langer, 1990, Science 249:1527-1533; Treat et al, 1989, in Liposomes in the Therapy of Infectious Disease and Cancer, Lopez-Berestein and Fidler (eds.), Liss, New York, pp.
  • the composition can be delivered in a controlled release system.
  • a pump may be used (see Langer, supra; Sefton, 1987, CRC Crit. Ref. Biomed. Eng. 14:20; Buchwald et al, 1980, Surgery 88:507; Saudek et al, 1989, N. Engl. J. Med. 321:574).
  • polymeric materials can be used (see e.g., Medical Applications of Controlled Release, Langer and Wise (eds.), CRC Press, Boca Raton, Florida (1974); Controlled Drug Bioavailability, Drug Product Design and Performance, Smolen and Ball (eds.), Wiley, New York (1984); Ranger and Peppas, 1983, J. Macromol. Sci. Rev. Macromol. Chem. 23:61; see also Levy et al, 1985, Science 228:190; During et al, 1989, Ann. Neurol. 25:351; Howard et al, 1989, J. Neurosurg. 71:105); U.S. Patent No. 5,679,377; U.S.
  • a controlled release system can be placed in proximity of the therapeutic target, i.e., nervous tissue (see, e.g., Goodson, 1984, in Medical Applications of Controlled Release, supra, vol. 2, pp. 115-138). Other controlled release systems are discussed in the review by Langer, 1990, Science 249:1527-1533.
  • the nucleic acid can be administered in vivo to promote expression of its encoded modulator by constructing it as part of an appropriate nucleic acid expression vector and administering it so that it becomes intracellular, e.g., by use of a retroviral vector (see U.S. Patent No.
  • a nucleic acid can be introduced intracellularly and incorporated within host cell DNA for expression by homologous recombination.
  • compositions of the invention comprise a prophylactically or therapeutically effective amount of an obesity related gene modulator, and a pharmaceutically acceptable carrier.
  • pharmaceutically acceptable means approved by a regulatory agency of the Federal or a state government or listed in the U.S. Pharmacopeia or other generally recognized pharmacopeia for use in animals, and more particularly in humans.
  • carrier refers to a diluent, adjuvant (e.g., Freund's adjuvant (complete and incomplete)), excipient, or vehicle with which the therapeutic is administered.
  • Such pharmaceutical carriers can be sterile liquids, such as water and oils, including those of petroleum, animal, vegetable or synthetic origin, such as peanut oil, soybean oil, mineral oil, sesame oil and the like. Water is a preferred carrier when the pharmaceutical composition is administered intravenously. Saline solutions and aqueous dextrose and glycerol solutions can also be employed as liquid carriers, particularly for injectable solutions. Suitable pharmaceutical excipients include starch, glucose, lactose, sucrose, gelatin, malt, rice, flour, chalk, silica gel, sodium stearate, glycerol monostearate, talc, sodium chloride, dried skim milk, glycerol, propylene glycol, water, ethanol and the like.
  • compositions can also contain minor amounts of wetting or emulsifying agents, or pH buffering agents.
  • These compositions can take the form of solutions, suspensions, emulsions, tablets, pills, capsules, powders, sustained-release formulations and the like.
  • Oral formulations can include standard carriers such as pharmaceutical grades of mannitol, lactose, starch, magnesium stearate, sodium saccharine, cellulose, magnesium carbonate, etc. Examples of suitable pharmaceutical carriers are described in "Remington's Pharmaceutical Sciences" by E.W. Martin.
  • Such compositions will contain a prophylactically or therapeutically effective amount of the antibody or fragment thereof, preferably in purified form, together with a suitable amount of carrier so as to provide the form for proper administration to the patient.
  • compositions for intravenous administration are solutions in sterile isotonic aqueous buffer.
  • the composition can also include a solubilizing agent and a local anesthetic such as lignocaine to ease pain at the site of the injection.
  • the ingredients of compositions of the invention are supplied either separately or mixed together in unit dosage form, for example, as a dry lyophilized powder or water free concentrate in a hermetically sealed container such as an ampoule or sachette indicating the quantity of active agent.
  • compositions of the invention can be formulated as neutral or salt forms.
  • Pharmaceutically acceptable salts include those formed with anions such as those derived from hydrochloric, phosphoric, acetic, oxalic, tartaric acids, etc., and those formed with cations such as those derived from sodium, potassium, ammonium, calcium, ferric hydroxides, isopropylamine, triethylamine, 2-ethylamino ethanol, histidine, procaine, etc.
  • the amount of the composition delivered is that amount that will be effective in the methods of treatment of the invention.
  • the compositions are delivered by gene therapy.
  • Gene therapy refers to therapy performed by the administration to a subject of an expressed or expressible nucleic acid.
  • the nucleic acids produce their encoded modulator that mediates a therapeutic effect. Any of the methods for gene therapy available in the art can be used according to the present invention. Exemplary methods are described below. For general reviews of the methods of gene therapy, see Goldspiel et al, 1993, Clinical Pharmacy 12:488-505; Wu and Wu, 1991, Biotherapy 3:87-95; Tolstoshev, 1993, Ann. Rev. Pharmacol. Toxicol.
  • a composition of the invention comprises nucleic acids encoding a modulator. These nucleic acids are part of an expression vector that expresses the modulator in a suitable host.
  • nucleic acids have promoters, preferably heterologous promoters, operably linked to the antibody coding region, the promoter being inducible or constitutive and, optionally, tissue-specific.
  • nucleic acid molecules are used in which the modulator coding sequences and any other desired sequences are flanked by regions that promote homologous recombination at a desired site in the genome, thus providing for intrachromosomal expression of the modulator encoding nucleic acids (Koller and Smithies, 1989, Proc. Natl. Acad. Sci. USA 86:8932-8935; Zijlstra et al, 1989, Nature 342:435-438).
  • the expressed antibody molecule is a single chain antibody.
  • the nucleic acid sequences include sequences encoding both the heavy and light chains, or fragments thereof, of the antibody. Delivery of the nucleic acids into a subject can be either direct, in which case the subject is directly exposed to the nucleic acid or nucleic acid-carrying vectors, or indirect, in which case cells are first transformed with the nucleic acids in vitro, then transplanted into the subject. These two approaches are known, respectively, as in vivo or ex vivo gene therapy.
  • the nucleic acid sequences are directly administered in vivo, where they are expressed to produce the encoded product.
  • microparticle bombardment (e.g., a gene gun; Biolistic, Dupont)
  • coating lipids or cell-surface receptors or transfecting agents, encapsulation in liposomes, microparticles, or microcapsules, or by administering them in linkage to a peptide which is known to enter the nucleus, by administering it in linkage to a ligand subject to receptor-mediated endocytosis (see, e.g., Wu and Wu, 1987, J. Biol. Chem. 262:4429-4432) (which can be used to target cell types specifically expressing the receptors), etc.
  • nucleic acid-ligand complexes can be formed in which the ligand comprises a fusogenic viral peptide to disrupt endosomes, allowing the nucleic acid to avoid lysosomal degradation.
  • the nucleic acid can be targeted in vivo for cell specific uptake and expression, by targeting a specific receptor (see, e.g., PCT Publications WO 92/06180; WO 92/22635; WO 92/20316; WO 93/14188; WO 93/20221).
  • the nucleic acid can be introduced intracellularly and incorporated within host cell DNA for expression, by homologous recombination (Koller and Smithies, 1989, Proc. Natl.
  • viral vectors that contain nucleic acid sequences encoding an antibody of the invention or fragments thereof are used.
  • a retroviral vector can be used (see Miller et al, 1993, Meth. Enzymol. 217:581-599). These retroviral vectors contain the components necessary for the correct packaging of the viral genome and integration into the host cell DNA.
  • the nucleic acid sequences encoding the antibody to be used in gene therapy are cloned into one or more vectors, which facilitates delivery of the gene into a subject.
  • More detail about retroviral vectors can be found in Boesen et al, 1994, Biotherapy 6:291-302, which describes the use of a retroviral vector to deliver the mdr1 gene to hematopoietic stem cells in order to make the stem cells more resistant to chemotherapy.
  • Other references illustrating the use of retroviral vectors in gene therapy are Clowes et al, 1994, J. Clin. Invest. 93:644-651; Klein et al, 1994, Blood 83:1467-1473; Salmons and Gunzberg, 1993, Human Gene Therapy 4:129-141; and Grossman and Wilson, 1993, Curr. Opin. in Genetics and Devel. 3:110-114.
  • Adenoviruses are other viral vectors that can be used in gene therapy and can be targeted to the central nervous system. Adenoviruses have the advantage of being capable of infecting non-dividing cells. Kozarsky and Wilson, 1993, Current Opinion in Genetics and Development 3:499-503 present a review of adenovirus-based gene therapy. Other instances of the use of adenoviruses in gene therapy can be found in Rosenfeld et al, 1991, Science 252:431-434; Rosenfeld et al, 1992, Cell 68:143-155; Mastrangeli et al, 1993, J. Clin. Invest.
  • Adeno-associated virus has also been proposed for use in gene therapy (Walsh et al, 1993, Proc. Soc. Exp. Biol. Med. 204:289-300; and U.S. Patent No. 5,436,146).
  • Another approach to gene therapy involves transferring a gene to cells in tissue culture by such methods as electroporation, lipofection, calcium phosphate mediated transfection, or viral infection. Usually, the method of transfer includes the transfer of a selectable marker to the cells. The cells are then placed under selection to isolate those cells that have taken up and are expressing the transferred gene.
  • the nucleic acid is introduced into a cell prior to administration in vivo of the resulting recombinant cell.
  • introduction can be carried out by any method known in the art, including but not limited to transfection, electroporation, microinjection, infection with a viral or bacteriophage vector containing the nucleic acid sequences, cell fusion, chromosome-mediated gene transfer, microcell-mediated gene transfer, spheroplast fusion, etc.
  • Numerous techniques are known in the art for the introduction of foreign genes into cells (see, e.g., Loeffler and Behr, 1993, Meth. Enzymol. 217:599-618; and Cohen et al, 1993, Meth. Enzymol.
  • the technique should provide for the stable transfer of the nucleic acid to the cell, so that the nucleic acid is expressible by the cell and preferably heritable and expressible by its cell progeny.
  • the resulting recombinant cells can be delivered to a subject by various methods known in the art.
  • Recombinant blood cells e.g., hematopoietic stem or progenitor cells
  • the amount of cells envisioned for use depends on the desired effect, patient state, etc., and can be determined by one skilled in the art.
  • Cells into which a nucleic acid can be introduced for purposes of gene therapy encompass any desired, available cell type, and include but are not limited to epithelial cells, endothelial cells, keratinocytes, fibroblasts, muscle cells, hepatocytes; blood cells such as T lymphocytes, B lymphocytes, monocytes, macrophages, neutrophils, eosinophils, megakaryocytes, granulocytes; various stem or progenitor cells, in particular hematopoietic stem or progenitor cells, e.g., as obtained from bone marrow, umbilical cord blood, peripheral blood, fetal liver, etc.
  • the cell is a neural cell.
  • the cell used for gene therapy is autologous to the subject.
  • nucleic acid sequences encoding a modulator are introduced into the cells such that they are expressible by the cells or their progeny, and the recombinant cells are then administered in vivo for therapeutic effect.
  • stem or progenitor cells are used. Any stem and/or progenitor cells that can be isolated and maintained in vitro can potentially be used in accordance with this embodiment of the present invention (see e.g., PCT Publication WO 94/08598; Stemple and Anderson, 1992, Cell 71:973-985; Rheinwald, 1980, Meth. Cell Bio.
  • the nucleic acid to be introduced for purposes of gene therapy comprises an inducible promoter operably linked to the coding region, such that expression of the nucleic acid is controllable by controlling the presence or absence of the appropriate inducer of transcription.
  • the modulators of the invention can be assayed by any method well known in the art.
  • the modulators of the invention or fragments thereof are preferably tested in vitro, and then in vivo for the desired therapeutic or prophylactic activity, prior to use in humans.
  • in vitro assays that can be used to determine whether administration of a specific composition of the present invention is indicated, include in vitro cell culture assays in which a subject tissue sample is grown in culture, and exposed to or otherwise administered a composition of the present invention, and the effect of such a composition of the present invention upon the tissue sample is observed.
  • the following subsections describe various assays that can be used to determine the efficacy of the modulators of the invention.
  • Consumption data is collected while the animals are housed in Nalgene Metabolic cages (Model #650-0100).
  • Each cage comprises subassemblies made of clear polymethylpentene (PMP), polycarbonate (PC), or stainless steel (SS).
  • the entire cylinder-shaped plastic and SS cage rests on a SS stand and houses one animal.
  • the animal is contained in the round Upper Chamber (PC) assembly (12 cm high and 20 cm in diameter) and rests on a SS floor.
  • Two subassemblies are attached to the Upper Chamber.
  • the first assembly consists of a SS feeding chamber (10 cm long, 5 cm high and 5 cm wide) with a PC feeding drawer attached to the bottom.
  • the feeding drawer has two compartments: a food storage compartment with the capacity for approximately 50 g of pulverized rat chow, and a food spillage compartment.
  • the animal is allowed access to the pulverized chow by an opening in the SS floor of the feeding chamber.
  • the floor of the feeding chamber does not allow access to the food dropped into the spillage compartment.
  • the second assembly includes a water bottle support, a PC water bottle (100 ml capacity) and a graduated water spillage collection tube.
  • the water bottle support funnels any spilled water into the water spillage collection tube.
  • the lower chamber consists of a PMP separating cone, PMP collection funnel, PMP fluid (urine) collection tube, and a PMP solid (feces) collection tube.
  • the separating cone is attached to the top of the collection funnel, which in turn is attached to the bottom of the Upper Chamber.
  • the urine runs off the separating cone onto the walls of the collection funnel and into the urine collection tube.
  • the separating cone also separates the feces and funnels it into the feces collection tube.
  • Food consumption, water consumption, and body weight are measured with an Ohaus Portable Advanced scale (±0.1 gram accuracy).
  • Procedure. Prior to the day of testing, animals are habituated to the testing apparatus by placing each animal in a Metabolic cage for one hour. On the day of the experiment, animals that were food deprived the previous night are weighed and assigned to treatment groups. Assignments are made using a quasi-random method utilizing the body weights to assure that the treatment groups have similar average body weight. Animals are then administered either vehicle (generally 0.5% methyl cellulose, MC) or test compound. At that time, the feeding drawer is filled with pulverized chow, and the filled water bottle and the empty urine and feces collection tubes are weighed. Two hours after test compound treatment, each animal is weighed and placed in a Metabolic Cage.
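The procedure calls for a quasi-random assignment that uses body weight to keep group averages similar, but it does not spell out the method. One common way to do this, shown here purely as an assumed illustration, is to rank the animals by weight, cut the ranking into blocks of one animal per group, and shuffle the group order within each block. The function names, animal identifiers, and group labels are hypothetical.

```python
import random


def assign_groups(body_weights, group_names, seed=None):
    """Weight-blocked random assignment (assumed scheme, not from the text).

    body_weights: dict mapping animal id -> body weight in grams.
    group_names:  list of treatment group labels.
    """
    rng = random.Random(seed)
    # Rank animals from heaviest to lightest.
    ranked = sorted(body_weights, key=body_weights.get, reverse=True)
    groups = {name: [] for name in group_names}
    block_size = len(group_names)
    for start in range(0, len(ranked), block_size):
        block = ranked[start:start + block_size]
        order = list(group_names)
        rng.shuffle(order)  # random group order within each weight block
        for animal, name in zip(block, order):
            groups[name].append(animal)
    return groups


def mean_weight(animals, body_weights):
    """Average body weight (g) of the listed animals."""
    return sum(body_weights[a] for a in animals) / len(animals)
```

Because each block contains animals of adjacent rank, every group receives one animal from each weight stratum, which keeps the group means close while leaving the individual assignments random.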
  • Test compound is administered orally (0.1-50 mg/kg for oral (PO) dosing) using a gavage tube connected to a 3 or 5 ml syringe at a volume of 10 ml/kg. In some instances test compound is administered by a systemic route (e.g., by intravenous injection; 0.1-20 mg/kg for i.v. dosing). Test compound for oral dosing is made into a homogenous suspension by stirring and ultrasonicating for at least one hour prior to dosing.
  • Statistical Analyses.
  • Body weight change is the difference between the body weight of the animal immediately prior to placement in the metabolic cage and its body weight at the end of the one hour test session.
  • Food consumption is the difference in the weight of the food drawer prior to testing and the weight following the one hour test session.
  • Water consumption is the difference in the weight of the water bottle prior to testing and the weight following the one hour test session.
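The three measures defined above are simple weight differences. The sketch below shows one way to compute them for a single animal. The text only says "difference", so the sign conventions are assumptions: body weight change is taken as after minus before, and consumption as before minus after, so that eating and drinking give positive values. Results are rounded to 0.1 g to match the stated scale accuracy. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionWeights:
    """Weights in grams recorded before and after the one-hour session."""
    body_before_g: float
    body_after_g: float
    food_drawer_before_g: float
    food_drawer_after_g: float
    water_bottle_before_g: float
    water_bottle_after_g: float


def session_measures(w: SessionWeights) -> dict:
    """Compute the per-animal measures, rounded to the scale's 0.1 g step."""
    return {
        # Assumed sign: positive means the animal gained weight.
        "body_weight_change_g": round(w.body_after_g - w.body_before_g, 1),
        # Assumed sign: positive means food or water was consumed.
        "food_consumed_g": round(
            w.food_drawer_before_g - w.food_drawer_after_g, 1
        ),
        "water_consumed_g": round(
            w.water_bottle_before_g - w.water_bottle_after_g, 1
        ),
    }
```

For example, `session_measures(SessionWeights(250.0, 251.2, 80.0, 76.5, 150.0, 147.3))` describes an animal that gained 1.2 g, ate 3.5 g of chow, and drank 2.7 g of water.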
  • Each cage comprises subassemblies made of clear polymethylpentene (PMP), polycarbonate (PC), or stainless steel (SS). All parts disassemble for quick and accurate data collection and for cleaning.
  • the entire cylinder-shaped plastic and SS cage rests on a SS stand and houses one animal.
  • the animal is contained in the round Upper Chamber (PC) assembly (12 cm high and 20 cm in diameter) and rests on a SS floor.
  • Two subassemblies are attached to the Upper Chamber.
  • the first assembly consists of a SS feeding chamber (10 cm long, 5 cm high and 5 cm wide) with a PC feeding drawer attached to the bottom.
  • the feeding drawer has two compartments: a food storage compartment with the capacity for approximately 50 grams of pulverized rat chow, and a food spillage compartment.
  • the animal is allowed access to the pulverized chow by an opening in the SS floor of the feeding chamber.
  • the floor of the feeding chamber does not allow access to the food dropped into the spillage compartment.
  • the second assembly includes a water bottle support, a PC water bottle (100 ml capacity) and a graduated water spillage collection tube.
  • the water bottle support funnels any spilled water into the water spillage collection tube.
  • the lower chamber consists of a PMP separating cone, PMP collection funnel, PMP fluid (urine) collection tube, and a PMP solid (feces) collection tube.
  • the separating cone is attached to the top of the collection funnel, which in turn is attached to the bottom of the Upper Chamber.
  • the urine runs off the separating cone onto the walls of the collection funnel and into the urine collection tube.
  • the separating cone also separates the feces and funnels it into the feces collection tube.
  • Food consumption, water consumption, urine excretion, feces excretion, and body weight are measured with an Ohaus Portable Advanced scale (±0.1 gram accuracy).
  • Test compound is administered orally (PO) using a gavage tube connected to a 3 or 5 ml syringe at a volume of 10 ml/kg. Test compound is made into a homogenous suspension by stirring and ultrasonicating for at least one hour prior to dosing. In some experiments, animals are tested for more than one night.
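A fixed dosing volume of 10 ml/kg ties the gavage volume to body weight and the suspension concentration to the target dose. The arithmetic can be sketched as follows; the function names are hypothetical, and the dose in mg/kg is a free parameter here (the acute assay described earlier cites 0.1-50 mg/kg for oral dosing).

```python
def gavage_volume_ml(body_weight_g: float,
                     volume_ml_per_kg: float = 10.0) -> float:
    """Volume to administer to one animal, given its body weight in grams."""
    return body_weight_g / 1000.0 * volume_ml_per_kg


def suspension_concentration_mg_per_ml(dose_mg_per_kg: float,
                                       volume_ml_per_kg: float = 10.0) -> float:
    """Concentration that delivers the target dose at the fixed volume."""
    return dose_mg_per_kg / volume_ml_per_kg
```

Under these assumptions a 250 g animal receives 2.5 ml, which fits the 3 ml syringe, and a 50 mg/kg dose requires a 5 mg/ml suspension.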
  • Water consumption is the difference in the weight of the water bottle at 1630 and the weight at 0800.
  • Fecal excretion is the difference in the weight of the empty fecal collection tube at 1630 and the weight at 0800.
  • Urinary excretion is the difference in the weight of the empty urine collection tube at 1630 and the weight at 0800.
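The three overnight measures above are simple weight differences between the 1630 and 0800 readings. A minimal sketch, assuming hypothetical weights in grams and hypothetical dictionary keys (none of these numbers come from the source):

```python
def overnight_measures(weights_1630, weights_0800):
    """Return overnight changes in grams from the two weighings.

    Each argument maps a hypothetical item name to its weight in grams.
    Absolute differences are used so the sign convention does not matter.
    """
    return {
        "water_consumption": abs(weights_1630["water_bottle"] - weights_0800["water_bottle"]),
        "fecal_excretion": abs(weights_0800["feces_tube"] - weights_1630["feces_tube"]),
        "urinary_excretion": abs(weights_0800["urine_tube"] - weights_1630["urine_tube"]),
    }


# Illustrative readings only: full bottle and empty tubes at 1630, then 0800.
weights_1630 = {"water_bottle": 150.0, "feces_tube": 12.0, "urine_tube": 15.0}
weights_0800 = {"water_bottle": 120.5, "feces_tube": 18.5, "urine_tube": 27.0}
measures = overnight_measures(weights_1630, weights_0800)
```

Because the scale is accurate to ±0.1 gram, differences smaller than that are not meaningful.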
  • RNA expression or protein expression of an open reading frame (which may be of a marker gene or may be of a gene referenced in Section 5.15.3), regulated by a promoter native to the gene referenced in Section 5.15.3 can be measured by measuring the amount or abundance of the RNA (as RNA or cDNA) or protein.
  • the assays may detect the presence of increased or decreased expression of a gene referenced in Section 5.15.3 (e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24) on the basis of increased or decreased mRNA expression (using, e.g., nucleic acid probes), increased or decreased levels of protein products (using, e.g., antibodies thereto), or increased or decreased levels of expression of a marker gene (e.g., green fluorescent protein "GFP") operably linked to the 5' promoter region in a recombinant construct.
  • the present invention envisions monitoring changes in gene expression (e.g., a gene referenced in Section 5.15.3) or marker gene expression by any expression analysis technique known to one of skill in the art, including but not limited to, differential display, serial analysis of gene expression (SAGE), nucleic acid array technology, oligonucleotide array technology, GeneChip expression analysis, dot blot hybridization, northern blot hybridization, subtractive hybridization, protein chip arrays, Western blot, immunoprecipitation followed by SDS PAGE, immunocytochemistry, proteome analysis and mass-spectrometry of two-dimensional protein gels.
  • various expression analysis techniques can be used to identify molecules that affect expression of a gene referenced in Section 5.15.3 or marker gene expression, by comparing a cell line expressing a gene disclosed in Section 5.15.3 (e.g. SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24) or a marker gene under the control of a gene promoter sequence in the absence of a test molecule to a cell line expressing the same gene or marker gene under the control of the same promoter sequence in the presence of the test molecule.
  • expression analysis techniques are used to identify a molecule that upregulates a gene referenced in Section 5.15.3 or upregulates marker gene expression upon treatment of a cell with the molecule.
5.15.17. METHODS FOR MONITORING REPORTER GENE EXPRESSION OF A GENE OF THE PRESENT INVENTION
  • the cell being assayed for reporter gene expression contains a fusion construct of at least one transcriptional promoter region for a gene disclosed in Section 5.15.3 (e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24) (also referred to herein as the test gene), or homologs of the foregoing, each operably linked to a marker gene expressing a detectable and/or selectable product.
  • Increased expression of a marker gene operably linked to a gene promoter indicates increased expression of the test gene.
  • the marker gene is a sequence encoding a detectable or selectable marker, the expression of which is regulated by at least one gene promoter region in the heterologous construct used in the present invention.
  • the assay is carried out in the absence of background levels of marker gene expression (e.g., in a cell that is mutant or otherwise lacking in the marker gene). If not already lacking in endogenous marker gene activity, cells mutant in the marker gene may be selected by known methods, or the cells can be made mutant in the marker gene by known gene-disruption methods prior to introducing the marker gene (Rothstein, 1983, Meth. Enzymol. 101:202-211).
  • a marker gene of the invention can be any gene that encodes a detectable and/or selectable product.
  • the detectable marker can be any molecule that can give rise to a detectable signal, e.g., a fluorescent protein or a protein that can be readily visualized or that is recognizable by a specific antibody or that gives rise enzymatically to a signal.
  • the selectable marker can be any molecule that can be selected for its expression, e.g., which gives cells a selective advantage over cells not having the selectable marker under appropriate (selective) conditions.
  • the selectable marker is an essential nutrient in which the cell in which the interaction assay occurs is mutant or otherwise lacks or is deficient, and the selection medium lacks such nutrient.
  • one type of marker gene is used to detect gene expression.
  • more than one type of marker gene is used to detect gene expression.
  • Preferred marker genes include but are not limited to, green fluorescent protein (GFP) (Cubitt et al, 1995, Trends Biochem. Sci. 20:448-455), red fluorescent protein, blue fluorescent protein, luciferase, LEU2, LYS2, ADE2, TRP1, CAN1, CYH2, GUS, CUP1 or chloramphenicol acetyl transferase (CAT).
  • Other marker genes include, but are not limited to, URA3, HIS3 and/or the lacZ genes (see e.g., Rose and Botstein, 1983, Meth. Enzymol.
  • detectable marker genes that can be operably linked to a glucan synthase pathway reporter gene promoter region (Alam and Cook, 1990, Anal. Biochem. 188:245-254).
  • more than one different marker gene is used to detect transcriptional activation, e.g., one encoding a detectable marker, and one or more encoding one or more different selectable marker(s), or e.g., different detectable markers.
  • Expression of the marker genes can be detected and/or selected for by techniques known in the art (see e.g. U.S. Patent Nos. 6,057,101 and 6,083,693).
  • the reporter gene construct is a chimeric reporter construct comprising a marker gene that is transcribed under the control of a gene promoter sequence comprising all or a portion of a promoter region of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24. If not already a part of the DNA sequence, the translation initiation codon, ATG, is provided in the correct reading frame upstream of the DNA sequence.
  • Vectors comprising all or portions of the gene sequences of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, or SEQ ID NO: 24 useful in the construction of recombinant reporter gene constructs and cells are provided.
  • the vectors of this invention also include those vectors comprising DNA sequences that hybridize under stringent conditions to SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, or SEQ ID NO: 24 gene sequences, and conservatively modified variations thereof.
  • the vectors of this invention may be present in transformed or transfected cells, cell lysates, or in partially purified or substantially pure forms.
  • DNA vectors may contain a means for amplifying the copy number of the gene of interest, stabilizing sequences, or alternatively may be designed to favor directed or non-directed integration into the host cell genome. Given the strategies described herein, one of skill in the art can construct a variety of vectors and nucleic acid molecules comprising functionally equivalent nucleic acids. DNA cloning and sequencing methods are well known to those of skill in the art and are described in an assortment of laboratory manuals, including Sambrook et al, 1989, supra; and Ausubel et al, 2002 Supplement.
  • Transformation and other methods of introducing nucleic acids into a host cell can be accomplished by a variety of methods that are well known in the art (see, for instance, Ausubel, supra, and Sambrook, supra).
  • S. cerevisiae cells of the invention can be transformed or transfected with an expression vector, such as a plasmid, a cosmid, or the like, wherein the expression vector comprises the DNA of interest.
  • the cells can be infected by a viral expression vector comprising the DNA or RNA of interest.
  • reporter gene expression can be monitored at the RNA or the protein level.
  • molecules that affect reporter gene expression can be identified by detecting differences in the level of marker protein expressed by cells contacted with a test molecule versus the level of marker protein expressed by cells in the absence of the test molecule.
  • Protein expression can be monitored using a variety of methods that are well known to those of skill in the art. For example, protein chips or protein microarrays (e.g., ProteinChip™, Ciphergen Biosystems) and two-dimensional electrophoresis (see e.g., U.S. Patent No. 6,064,754) can be utilized to monitor protein expression levels.
  • two-dimensional electrophoresis means a technique comprising isoelectric focusing, followed by denaturing electrophoresis, generating a two-dimensional gel (2D-gel) containing a plurality of proteins.
  • Any protocol for 2D-electrophoresis known to one of ordinary skill in the art can be used to analyze protein expression by the reporter genes of the invention.
  • 2D electrophoresis can be performed according to the methods described in O'Farrell, 1975, J. Biol. Chem. 250: 4007-4021.
Liquid High Throughput-Like Assay
  • In a preferred embodiment, a liquid high throughput-like assay is used to determine the protein expression level of a reporter gene.
  • a reporter construct is transformed into a cell strain. Cultures from solid media plates are used to inoculate liquid cultures in Casamino Acids media or an equivalent media. This liquid culture is grown and then diluted in Casamino Acids media or an equivalent media.
  • a test molecule is selected for the assay, preferably but not necessarily along with a negative control molecule. The test molecule and negative control molecule are separately added to an assay plate containing multiple wells and serially diluted (e.g., 1 to 2) into Casamino Acids media plus DMSO in sequential columns, so that each plate contains a range of concentrations of each drug. If a negative control is being used, one column of each plate may be used as a "no drug" control, containing only Casamino Acids media plus DMSO.
  • assay plates can be used, such as those with 96, 384 or 1536 well format.
  • An aliquot of liquid reporter strain is added to each well of the serial dilution plates from above and mixed.
  • the assay plates are then incubated. After incubation the assay plates are analyzed for detectable marker gene product.
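The serial dilution layout described above can be sketched as follows. This is an illustrative calculation only: the starting concentration, the function name, and the choice of the last column as the "no drug" control are assumptions, not details taken from the source.

```python
# Number of columns for the plate formats named in the text (8x12, 16x24, 32x48).
PLATE_COLUMNS = {96: 12, 384: 24, 1536: 48}


def serial_dilution(top_concentration, n_columns, dilution_factor=2.0, control_column=True):
    """Return the drug concentration in each column of an assay plate.

    Each successive column is diluted by `dilution_factor` (2.0 for a
    1-to-2 dilution). If `control_column` is True, the last column is a
    "no drug" control holding only media plus DMSO (concentration 0.0).
    """
    n_drug = n_columns - 1 if control_column else n_columns
    concentrations = [top_concentration / dilution_factor ** i for i in range(n_drug)]
    if control_column:
        concentrations.append(0.0)
    return concentrations


# Hypothetical top concentration of 100.0 (arbitrary units) on a 96-well plate.
columns = serial_dilution(top_concentration=100.0, n_columns=PLATE_COLUMNS[96])
```

With a two-fold dilution across eleven drug columns, the concentrations span roughly three orders of magnitude, which is why a single plate can cover a useful dose range.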
  • the assay plates are imaged in a Molecular Dynamics Fluorimager SI to measure the fluorescence from the GFP reporters. The results are then analyzed, as described above. If the drug is an inhibitor of the gene product (e.g., an inhibitor of e.g.
SPECIFIC EMBODIMENTS
  • One embodiment of the present invention provides a method for determining whether a candidate molecule affects a body weight disorder associated with an organism.
  • a cell from the organism is contacted with the candidate molecule.
  • the candidate molecule is recombinantly expressed within the cell.
  • In step (b) of the method, a determination is made as to whether the RNA expression or protein expression in the cell of at least one open reading frame is changed in step (a) relative to the expression of the open reading frame in the absence of the candidate molecule, where each open reading frame is regulated by a promoter native to a nucleic acid sequence selected from the group consisting of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, or SEQ ID NO: 23 and homologs (e.g., orthologs, and paralogs) of each of the foregoing.
  • the candidate molecule affects a body weight disorder associated with the organism when the RNA expression or protein expression of the at least one open reading frame is changed.
  • the candidate molecule does not affect a body weight disorder associated with the organism when the RNA expression or protein expression of the at least one open reading frame is unchanged.
  • the body weight disorder is obesity, anorexia nervosa, bulimia nervosa or cachexia.
  • the candidate molecule affects a body weight disorder associated with the organism when a cell from the organism that is contacted with the candidate molecule exhibits a lower expression level of a protein sequence in the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24 relative to a cell from the organism that is not contacted with the candidate molecule.
  • step (b) comprises determining whether RNA expression is changed.
  • step (b) comprises determining whether protein expression is changed.
  • step (b) comprises determining whether RNA or protein expression of at least two of the open reading frames is changed.
  • step (a) comprises contacting the cell with the candidate molecule and step (a) is carried out in a liquid high throughput-like assay.
  • the cell comprises a promoter region of at least one gene selected from the group consisting of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, or SEQ ID NO: 23 and homologs of each of the foregoing, each promoter region being operably linked to a marker gene.
  • step (b) comprises determining whether the RNA expression or protein expression of the marker gene(s) is changed in step (a) relative to the expression of the marker gene in the absence of the candidate molecule.
  • the marker gene is selected from the group consisting of green fluorescent protein, red fluorescent protein, blue fluorescent protein, luciferase, LEU2, LYS2, ADE2, TRP1, CAN1, CYH2, GUS, CUP1 and chloramphenicol acetyl transferase.
  • Another aspect of the invention provides a method of identifying a molecule that specifically binds to a ligand selected from the group consisting of (i) a protein encoded by a gene selected from the group consisting of SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 16, SEQ ID NO: 18, SEQ ID NO: 20, SEQ ID NO: 21, or SEQ ID NO: 23 and homologs of each of the foregoing, and (ii) a biologically active fragment of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24.
  • the method comprises (a) contacting the ligand with one or more candidate molecules under conditions conducive to binding between the lig
  • One aspect of the invention provides a method of treating or preventing a body weight disorder.
  • the method comprises administering to a subject in which treatment is desired a therapeutically effective amount of a molecule that inhibits a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs (e.g., orthologs and paralogs) thereof.
  • the subject is human.
  • the molecule that inhibits a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24 and homologs (e.g., orthologs and paralogs) thereof is selected from the group consisting of an antibody that binds to one of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24, and homologs thereof, or a fragment or derivative thereof.
  • Another aspect of the invention provides a method of treating or preventing a body weight disorder.
  • the method comprises administering to a subject in which treatment is desired a therapeutically effective amount of a molecule that enhances a function of one or more of the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, SEQ ID NO: 24 and homologs thereof.
  • the subject is human.
  • Yet another aspect of the invention provides a method of diagnosing a disease or disorder or the predisposition to the disease or disorder, where the disease or disorder is characterized by an aberrant level of one of SEQ ID NO: 1 through SEQ ID NO: 24 (or homologs thereof) in a subject.
  • the method comprises measuring the level of any one of SEQ ID NO: 1 through SEQ ID NO: 24 (or homologs thereof) in a sample derived from the subject, in which an increase or decrease in the level of one of SEQ ID NO: 1 through SEQ ID NO: 24 (or homologs thereof) in the sample, relative to the level of one of said SEQ ID NO: 1 through SEQ ID NO: 24 (or homologs thereof) found in an analogous sample not having the disease or disorder, indicates the presence of the disease or disorder in the subject.
  • the disease or disorder is a body weight disorder, such as obesity, anorexia nervosa, bulimia nervosa, or cachexia.
  • Still another aspect of the invention provides a method of diagnosing or screening for the presence of or predisposition for developing a disease or disorder involving a body weight disorder in a subject comprising detecting one or more mutations in at least one of SEQ ID NO: 1 through SEQ ID NO: 24 (or homologs thereof) in a sample derived from the subject, in which the presence of the one or more mutations indicates the presence of the disease or disorder or a predisposition for developing the disease or disorder.
TRANSGENIC ANIMALS
  • The invention also provides animal models.
  • Transgenic animals that have incorporated and express a constitutively-functional obesity related gene have use as animal models of obesity related diseases and disorders. Such animals can be used to screen for or test molecules for the ability to prevent such obesity related diseases and disorders.
  • animal models for obesity related diseases and disorders are provided. Such animals can be initially produced by promoting homologous recombination between an obesity related gene (e.g.
  • sequence inserted is a heterologous sequence, e.g., an antibiotic resistance gene.
  • this homologous recombination is carried out by transforming embryo-derived stem (ES) cells with a vector containing an insertionally inactivated gene, where the active gene encodes a particular obesity related gene, such that homologous recombination occurs; the ES cells are then injected into a blastocyst, and the blastocyst is implanted into a foster mother, followed by the birth of the chimeric animal, also called a "knockout animal," in which an obesity related gene has been inactivated (see Capecchi, 1989, Science 244: 1288-1292). The chimeric animal can be bred to produce additional knockout animals.
  • Chimeric animals can be and are preferably non-human mammals such as mice, hamsters, sheep, pigs, cattle, etc.
  • a knockout mouse is produced.
  • Such knockout animals are expected to develop or be predisposed to developing diseases or disorders involving obesity and thus can have use as animal models of such diseases and disorders, e.g., to screen for or test molecules for the ability to promote activation or proliferation and thus treat or prevent such diseases or disorders.
  • transgenic animals that have incorporated and express a constitutively-functional obesity related gene have use as animal models of diseases and disorders involving T-cell overactivation, or in which T cell activation is desired.
  • each transgenic line expressing a particular key gene under the control of the regulatory sequences of a characterizing gene is created by the introduction, for example by pronuclear injection, of a vector containing the transgene into a founder animal, such that the transgene is transmitted to offspring in the line.
  • the transgene preferably randomly integrates into the genome of the founder but in specific embodiments can be introduced by directed homologous recombination.
  • the transgene is present at a location on the chromosome other than the site of the endogenous characterizing gene.
  • homologous recombination in bacteria is used for target-directed insertion of the key gene sequence into the genomic DNA for all or a portion of the characterizing gene, including sufficient characterizing gene regulatory sequences to promote expression of the characterizing gene in its endogenous expression pattern.
  • the characterizing gene sequences are on a bacterial artificial chromosome (BAC).
  • the key gene coding sequences are inserted as a 5' fusion with the characterizing gene coding sequence such that the key gene coding sequences are inserted in frame and directly 3' from the initiation codon for the characterizing gene coding sequences.
  • the key gene coding sequences are inserted into the 3' untranslated region (UTR) of the characterizing gene and, preferably, have their own internal ribosome entry sequence (IRES).
  • the vector (preferably a BAC) comprising the key gene coding sequences and characterizing gene sequences is then introduced into the genome of a potential founder animal to generate a line of transgenic animals.
  • Potential founder animals can be screened for the selective expression of the key gene sequence in the population of cells characterized by expression of the endogenous characterizing gene.
  • Transgenic animals that exhibit appropriate expression (e.g., detectable expression of the key gene product having the same expression pattern within the animal as the endogenous characterizing gene) are selected as founders for a line of transgenic animals.
  • One aspect of the invention provides a recombinant non-human animal that is the product of a process comprising introducing a nucleic acid encoding at least a domain of one of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 10, SEQ ID NO: 13, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 19, SEQ ID NO: 22, and SEQ ID NO: 24 (or homologs thereof) into the non-human animal.
  • Hierarchical cluster analysis is a statistical method for finding relatively homogeneous clusters of elements based on measured characteristics.
  • Consider a sequence of partitions of n samples into c clusters. The first of these is a partition into n clusters, each cluster containing exactly one sample. The next is a partition into n-1 clusters, the next is a partition into n-2, and so on until the nth, in which all the samples form one cluster.
  • level one corresponds to n clusters and level n corresponds to one cluster.
  • the hierarchical clustering technique used to cluster gene analysis vectors is an agglomerative clustering procedure.
  • Agglomerative (bottom-up clustering) procedures start with n singleton clusters and form a sequence of partitions by successively merging clusters.
  • the major steps in agglomerative clustering are contained in the following procedure, where c is the desired number of final clusters, $D_i$ and $D_j$ are clusters, $x_i$ is a gene analysis vector, and there are n such vectors:
  • $a \leftarrow b$ assigns to variable a the new value b.
  • the procedure terminates when the specified number of clusters has been obtained and returns the clusters as a set of points.
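The agglomerative procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure as printed in the source: the function names, the use of Euclidean distance between vectors, and the example points are all hypothetical. The cluster distance is passed in as a parameter; the two functions shown take the smallest or the largest distance between members of the two clusters.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def d_min(cluster_a, cluster_b):
    """Nearest-member distance between two clusters."""
    return min(euclidean(a, b) for a in cluster_a for b in cluster_b)


def d_max(cluster_a, cluster_b):
    """Farthest-member distance between two clusters."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)


def agglomerate(vectors, c, cluster_distance=d_min):
    """Merge the two nearest clusters repeatedly until only c clusters remain."""
    clusters = [[v] for v in vectors]  # start with n singleton clusters
    while len(clusters) > c:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge D_j into D_i
        del clusters[j]
    return clusters


# Hypothetical two-dimensional vectors used only for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerate(points, c=3)
```

The exhaustive pair search keeps the sketch short; it is quadratic in the number of clusters at every merge, so a practical implementation would cache the distance matrix.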
  • a key point in this algorithm is how to measure the distance between two clusters $D_i$ and $D_j$.
  • When the distance between two clusters is taken to be the smallest distance between any pair of their members, the procedure is the nearest-neighbor algorithm, which is also known as the minimum algorithm. Furthermore, if the algorithm is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm.
  • the data points are nodes of a graph, with edges forming a path between the nodes in the same subset $D_i$.
  • when $d_{\min}(\cdot)$ is used to measure the distance between subsets, the nearest-neighbor nodes determine the nearest subsets.
  • the merging of $D_i$ and $D_j$ corresponds to adding an edge between the nearest pair of nodes in $D_i$ and $D_j$. Because edges linking clusters always go between distinct clusters, the resulting graph never has any closed loops or circuits; in the terminology of graph theory, this procedure generates a tree.
  • a spanning tree is a tree with a path from any node to any other node. Moreover, it can be shown that the sum of the edge lengths of the resulting tree will not exceed the sum of the edge lengths for any other spanning tree for that set of samples.
  • $d_{\min}(\cdot)$ as the distance measure
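  • When the distance between two clusters is taken to be the largest distance between any pair of their members, the procedure is the farthest-neighbor algorithm, which is also known as the maximum algorithm. If the clustering is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm.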
  • the farthest-neighbor algorithm discourages the growth of elongated clusters.
  • Application of this procedure can be thought of as producing a graph in which the edges connect all of the nodes in a cluster. In the terminology of graph theory, every cluster contains a complete subgraph. The distance between two clusters is determined by the most distant nodes in the two clusters. When the nearest clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters.
  • Hierarchical cluster analysis begins by making a pair-wise comparison of all gene analysis vectors in a set of such vectors. After evaluating similarities from all pairs of elements in the set, a distance matrix is constructed. In the distance matrix, a pair of vectors with the shortest distance (i.e. most similar values) is selected. Then, when the average linkage algorithm is used, a "node" ("cluster") is constructed by averaging the two vectors. The distance matrix is updated with the new "node" ("cluster") replacing the two joined elements, and the process is repeated n-1 times until only a single element remains.
  • A-F having the values: A{4.9}, B{8.2}, C{3.0}, D{5.2}, E{8.3}, F{2.3}.
  • the first partition using the average linkage algorithm could yield the matrix:
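The averaging procedure can be traced on the six values A-F with a short sketch. The function name and the hyphenated labels for merged nodes are assumptions made for illustration; the merge rule follows the description above, in which the closest pair is replaced by a node holding the plain average of the two values.

```python
def average_linkage_1d(values):
    """Repeatedly replace the closest pair of nodes with their average.

    `values` maps a label to a single number. Returns the list of
    partitions produced after each of the n-1 merges.
    """
    nodes = dict(values)
    partitions = []
    while len(nodes) > 1:
        labels = sorted(nodes)
        pairs = [(p, q) for i, p in enumerate(labels) for q in labels[i + 1:]]
        # Select the pair with the shortest distance (most similar values).
        a, b = min(pairs, key=lambda pq: abs(nodes[pq[0]] - nodes[pq[1]]))
        merged_value = (nodes.pop(a) + nodes.pop(b)) / 2.0
        nodes[a + "-" + b] = merged_value  # hypothetical label for the new node
        partitions.append(dict(nodes))
    return partitions


elements = {"A": 4.9, "B": 8.2, "C": 3.0, "D": 5.2, "E": 8.3, "F": 2.3}
partitions = average_linkage_1d(elements)
```

In this run the first merge joins B and E, the closest pair, into a node with value 8.25, leaving A, C, D and F unchanged. Five merges are performed in total, matching the n-1 repetitions stated above.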
  • QTL vectors and/or gene expression vectors are clustered using agglomerative hierarchical clustering with Pearson correlation coefficients.
  • similarity is determined using Pearson correlation coefficients between the QTL vectors pairs, gene expression pairs, or sets of cellular constituent measurements.
  • Other metrics that can be used, in addition to the Pearson correlation coefficient include but are not limited to, a Euclidean distance, a squared Euclidean distance, a Euclidean sum of squares, a Manhattan metric, and a squared Pearson correlation coefficient.
  • Such metrics may be computed using SAS (Statistics Analysis Systems Institute, Cary, North Carolina) or S-Plus (Statistical Sciences, Inc., Seattle, Washington).
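Several of the metrics named above can be written out directly. The sketch below uses only the Python standard library; the function names and the example vectors are illustrative assumptions, and the conversion of a correlation into a distance as one minus the coefficient is a common convention rather than something the text specifies.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors.

    Raises ZeroDivisionError if either vector has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Assumed convention: 0 for perfectly correlated vectors, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical expression vectors measured across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Here `gene_a` and `gene_b` differ in scale but are perfectly correlated, so a correlation-based metric treats them as identical while a Euclidean metric does not. That difference is the usual reason for preferring correlation when the shape of an expression profile matters more than its magnitude.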
  • the hierarchical clustering technique used to cluster QTL vectors and/or gene expression vectors is a divisive clustering procedure.
  • Divisive (top-down clustering) procedures start with all of the samples in one cluster and form the sequence by successively splitting clusters.
  • Divisive clustering techniques are classified as either a polythetic or a monothetic method.
  • a polythetic approach divides clusters into arbitrary subsets.
K-MEANS CLUSTERING
  • In k-means clustering, sets of QTL vectors, gene expression vectors, or sets of cellular constituent measurements are randomly assigned to K user specified clusters. The centroid of each cluster is computed by averaging the value of the vectors in each cluster. Then, for each i = 1, ..., N, the distance between vector $x_i$ and each of the cluster centroids is computed. Each vector $x_i$ is then reassigned to the cluster with the closest centroid. Next, the centroid of each affected cluster is recalculated. The process iterates until no more reassignments are made. See Duda et al, 2001, Pattern Classification, John Wiley & Sons, New York, NY, pp.
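The k-means iteration described above can be sketched as follows. This is a minimal pure-Python illustration, not a reference implementation: the function names, the seeded random initial assignment, the handling of an empty cluster, and the example vectors are all assumptions. All vectors are reassigned in one pass before the centroids are recomputed.

```python
import random


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, seed=0, max_iter=100):
    """Return (assignment, centroids) after iterating until no reassignment occurs."""
    rng = random.Random(seed)
    # Random initial assignment that leaves every cluster non-empty.
    assignment = [i % k for i in range(len(vectors))]
    rng.shuffle(assignment)
    centroids = []
    for _ in range(max_iter):
        centroids = []
        for cluster in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == cluster]
            # Assumed fallback: reseed an empty cluster with a random vector.
            centroids.append(mean_vector(members) if members else rng.choice(vectors))
        new_assignment = [
            min(range(k), key=lambda cluster: squared_distance(v, centroids[cluster]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # no more reassignments were made
        assignment = new_assignment
    return assignment, centroids


# Hypothetical vectors forming two well-separated groups.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(vectors, k=2)
```

The result depends on the initial assignment in general, so in practice the procedure is often run from several random starts and the partition with the smallest within-cluster distance is kept.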
  • a related approach is the fuzzy k-means clustering algorithm, which is also known as the fuzzy c-means algorithm.
  • In the fuzzy k-means clustering algorithm, the assumption that every QTL vector, gene expression vector, or set of cellular constituent measurements is in exactly one cluster at any given time is relaxed so that every vector (or set) has some graded or "fuzzy" membership in a cluster. See Duda et al, 2001, Pattern Classification, John Wiley & Sons, New York, NY, pp. 528-530.
5.16.3. JARVIS-PATRICK CLUSTERING
  • Jarvis-Patrick clustering is a nearest-neighbor non-hierarchical clustering method in which a set of objects is partitioned into clusters on the basis of the number of shared nearest-neighbors.
  • a preprocessing stage identifies the K nearest-neighbors of each object in the dataset.
  • two objects i and j join the same cluster if (i) i is one of the K nearest-neighbors of j, (ii) j is one of the K nearest-neighbors of i, and (iii) i and j have at least $k_{\min}$ of their K nearest-neighbors in common, where K and $k_{\min}$ are user-defined parameters.
  • the method has been widely applied to clustering chemical structures on the basis of fragment descriptors and has the advantage of being much less computationally demanding than hierarchical methods, and thus more suitable for large databases.
  • Jarvis-Patrick clustering may be performed using the Jarvis-Patrick Clustering Package 3.0 (Barnard Chemical Information, Ltd., Sheffield, United Kingdom).
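The three joining conditions can be sketched in a few lines of Python. This illustration is not the commercial package named above: the function names, the use of Euclidean distance, the union-find bookkeeping used to merge joined objects, and the example points are all assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jarvis_patrick(points, K, k_min):
    """Cluster objects by mutual and shared nearest neighbors.

    Returns a sorted list of clusters, each a list of indices into `points`.
    """
    n = len(points)
    # Preprocessing stage: the K nearest neighbors of each object.
    neighbors = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )
        neighbors.append(set(others[:K]))

    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbors[j] and j in neighbors[i]   # conditions (i), (ii)
            shared = len(neighbors[i] & neighbors[j])          # condition (iii)
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical points forming two tight groups of four.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = jarvis_patrick(points, K=3, k_min=2)
```

Only the neighbor lists are needed after preprocessing, not the full distance matrix, which is consistent with the low computational cost noted above.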
  • a neural network has a layered structure that includes a layer of input units (and the bias) connected by a layer of weights to a layer of output units.
  • In multilayer neural networks, there are input units, hidden units, and output units. In fact, any function from input to output can be implemented as a three-layer network. In such networks, the weights are set based on training patterns and the desired output.
  • One method for supervised training of multilayer neural networks is back-propagation. Back-propagation allows for the calculation of an effective error for each hidden unit, and thus derivation of a learning rule for the input-to-hidden weights of the neural network.
  • the basic approach to the use of neural networks is to start with an untrained network, present a training pattern to the input layer, and pass signals through the net and determine the output at the output layer. These outputs are then compared to the target values; any difference corresponds to an error.
  • This error or criterion function is some scalar function of the weights and is minimized when the network outputs match the desired outputs. Thus, the weights are adjusted to reduce this measure of error.
  • Three commonly used training protocols are stochastic, batch, and on-line. In stochastic training, patterns are chosen randomly from the training set and the network weights are updated for each pattern presentation.
  • Multilayer nonlinear networks trained by gradient descent methods such as stochastic back-propagation perform a maximum-likelihood estimation of the weight values in the model defined by the network topology.
  • batch training all patterns are presented to the network before learning takes place. Typically, in batch training, several passes are made through the training data. In online training, each pattern is presented once and only once to the net.
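The training loop described above (forward pass, comparison with targets, back-propagated error, weight update) can be sketched as follows. This is an illustrative toy under stated assumptions, not the method of any particular package: the tanh hidden units, the sigmoid output, the squared-error criterion, the learning rate, and every function name are choices made for the example. Only the stochastic and batch protocols are shown.

```python
import math
import random

def forward(x, W1, W2):
    """Pass one pattern through the net; return inputs, hidden values, output."""
    xb = list(x) + [1.0]                       # append the bias input
    h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in W1]
    hb = h + [1.0]                             # append the hidden-layer bias
    y = 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(W2, hb))))
    return xb, hb, y

def criterion(patterns, W1, W2):
    """Scalar error: mean of 0.5 * (output - target)^2 over the training set."""
    return sum(0.5 * (forward(x, W1, W2)[2] - t) ** 2
               for x, t in patterns) / len(patterns)

def gradient(x, t, W1, W2):
    """Back-propagation for one pattern: gradients for both weight layers."""
    xb, hb, y = forward(x, W1, W2)
    delta_out = (y - t) * y * (1.0 - y)
    g2 = [delta_out * v for v in hb]
    # Effective error of each hidden unit, propagated back through W2.
    delta_hid = [delta_out * W2[j] * (1.0 - hb[j] ** 2) for j in range(len(W1))]
    g1 = [[d * v for v in xb] for d in delta_hid]
    return g1, g2

def update(W1, W2, g1, g2, lr):
    """Move the weights a small step against the gradient."""
    for j, row in enumerate(W1):
        for k in range(len(row)):
            row[k] -= lr * g1[j][k]
    for j in range(len(W2)):
        W2[j] -= lr * g2[j]

def init_weights(n_in, n_hidden, rng):
    W1 = [[rng.gauss(0, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.gauss(0, 1) for _ in range(n_hidden + 1)]
    return W1, W2

def train(patterns, protocol="batch", epochs=2000, lr=0.5, n_hidden=3, seed=0):
    rng = random.Random(seed)
    W1, W2 = init_weights(len(patterns[0][0]), n_hidden, rng)
    n = len(patterns)
    for _ in range(epochs):
        if protocol == "stochastic":
            # Patterns are drawn at random; weights change after each one.
            for _ in range(n):
                x, t = rng.choice(patterns)
                g1, g2 = gradient(x, t, W1, W2)
                update(W1, W2, g1, g2, lr)
        else:
            # Batch: present all patterns, then apply one averaged update.
            grads = [gradient(x, t, W1, W2) for x, t in patterns]
            g1 = [[sum(g[0][j][k] for g in grads) / n
                   for k in range(len(W1[0]))] for j in range(len(W1))]
            g2 = [sum(g[1][j] for g in grads) / n for j in range(len(W2))]
            update(W1, W2, g1, g2, lr)
    return W1, W2
```

The two branches differ only in when the update is applied, which is the distinction the protocols draw. An on-line variant would be the stochastic branch with each pattern visited exactly once.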
  • A self-organizing map is a neural network that is based on a divisive clustering approach. The aim is to assign genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition.
  • The reference vector is then adjusted so that it is more similar to the vector of the assigned gene. That is, the reference vector is moved one distance unit along the x-axis and y-axis, so that it becomes closer to the assigned gene.
  • The other nodes are also adjusted toward the assigned gene, but are moved only one-half or one-fourth of a distance unit. This cycle is repeated hundreds of thousands of times, until the reference vectors converge to fixed values and the grid is stable. At that point, every reference vector is the center of a group of genes. Finally, the genes are mapped to the relevant partitions depending on the reference vector to which they are most similar. 6.
  • The following examples are presented by way of illustration of the invention and are not limiting. The methods outlined in Section 5.1 as well as Fig. 7 were applied to the data derived from the F2 mouse population described by Schadt et al., 2003, Nature 422, 297 and Drake et al., 2001, Physiol. Genomics 5, 205.
  • Mice were purchased from the Jackson Laboratories (Bar Harbor, ME). Females of strain C57BL/6J (B6) were mated with DBA/2J (DBA) males. F1 progeny were then intercrossed to produce F2 progeny.
  • The female F2 population (111 mice) was maintained on a high-fat, atherogenic diet for 16 weeks, starting at 12 months of age, before omental fat pad masses (OFPM) were measured and livers were extracted for gene expression profiling (step 706 below).
  • The mice were genotyped at 139 microsatellite markers uniformly distributed over the mouse genome to allow for the genetic mapping of the gene expression and disease traits.
  • A complete linkage map for all mouse chromosomes was constructed from the microsatellite markers at an average density of 12 cM using MapMaker QTL (Lincoln et al., 1993, MAPMAKER/QTL User's Manual, Whitehead Institute for Biomedical Research, Cambridge, Massachusetts).
  • The OFPM trait served as a quantitative trait in a QTL analysis using the program QTL Cartographer (Basten et al., 1999, QTL Cartographer User's Manual, Department of Statistics, North Carolina State University, Raleigh).
  • OFPM had a total of four QTL with LOD scores over 2.0, located on chromosome 1 at 95 cM, chromosome 6 at 43 cM, chromosome 9 at 8 cM, and chromosome 19 at 28 cM, with LOD scores 2.10, 2.84, 2.53, and 1.92, respectively.
  • Step 706. Expression profiling was carried out on the liver tissues extracted from the F2 population as described by Schadt et al., 2003, Nature 422, 297 using a standard 23,000-plus gene microarray manufactured by Agilent Technologies.
  • Array images were scanned using the Agilent Dual Laser Microarray scanner (Agilent Technologies) and processed as described in Hughes, 2000, Cell 102, p. 109, to obtain background noise, single-channel intensity, and associated measurement error estimates.
  • The mouse microarray contained 23,574 non-control oligonucleotide probes for mouse genes as described in Schadt et al., 2003, Nature 422, 297-302.
  • The hybridization protocol for the microarray data and the subsequent lower-level microarray analysis were carried out as described in Schadt et al., 2003, Nature 422, 297-302.
  • The single-trait QTL analysis for the gene expression and OFPM traits described in this example was also carried out as described in Schadt et al., 2003, Nature 422, 297-302.
  • The multiple interval mapping described in this example was carried out using the MImapqtl program (Zeng et al., 1999, Genet Res 74, 279-289).
  • Step 708. The cellular constituents whose abundance levels across the population significantly associate with the trait of interest were identified using the Pearson correlation coefficients between the OFPM trait and the genes that were significantly differentially expressed in at least ten percent of the samples profiled. Of the transcripts that were significantly differentially expressed in at least 10% of the samples, 438 had Pearson correlation coefficient p-values less than 0.001 (fewer than 5 would be expected by chance). This set of 438 transcripts was selected as the association set D for the OFPM trait. This set of genes represents targets for an obesity (or related disease) drug discovery program and is provided in Table 4, below. Of these, genes that include a druggable binding domain are preferred.
  • Column 1 gives the accession number for the gene.
  • Column 2 gives the p-value for the strength of correlation between OFPM and the gene expression trait.
  • Column 3 gives the official symbol associated with the gene (may be null).
  • Column 4 gives the official gene name (may be null).
  • The final column is non-null if a druggable domain was identified in the coding part of the gene, in which case the name of the druggable domain is indicated.
  • locus I solute carrier family 25 Adenine (mitochondrial carrier; nucleotide adenine nucleotide translocator
  • AK003140 4.30E-05 1110001N06Rik RIKEN cDNA 1110001N06 gene
  • AK003165 8.46E-07 G0s2 G0/G1 switch gene 2
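The selection of the association set in Step 708 can be sketched as below. This is a hedged illustration only: the function names, the dictionary layout of the expression data, and the use of the Fisher z-transformation to approximate the p-value are assumptions, since the text does not say how the Pearson p-values were computed.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0                      # a constant vector carries no signal
    return sxy / math.sqrt(sxx * syy)

def p_value(r, n):
    """Two-sided p-value from the Fisher z-transformation (an approximation)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def association_set(trait, expression, alpha=0.001):
    """Return {transcript: (r, p)} for transcripts with p below alpha.

    `trait` holds one value per animal (for example OFPM); `expression`
    maps a transcript identifier to its values in the same animal order.
    """
    n = len(trait)
    selected = {}
    for name, values in expression.items():
        r = pearson_r(trait, values)
        p = p_value(r, n)
        if p < alpha:
            selected[name] = (r, p)
    return selected

def expected_by_chance(n_tested, alpha=0.001):
    """Number of transcripts expected to pass the threshold under the null."""
    return n_tested * alpha
```

Comparing `len(association_set(...))` with `expected_by_chance(...)` gives the same kind of sanity check as the "438 observed versus fewer than 5 expected" comparison in the text.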

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • Organic Chemistry (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Hematology (AREA)
  • Urology & Nephrology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Cell Biology (AREA)
  • Tropical Medicine & Parasitology (AREA)
  • Toxicology (AREA)
  • General Engineering & Computer Science (AREA)
  • Ecology (AREA)
  • Hospice & Palliative Care (AREA)
  • Oncology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to methods, computer program products, and systems for associating a cellular constituent with a trait T exhibited by a species. A cellular constituent i is identified that has at least one abundance quantitative trait locus (eQTL) that coincides with a respective clinical quantitative trait locus (cQTL) for the trait of interest T. For each eQTL, it is determined whether (i) the genetic variation of the eQTL and (ii) the variation of the trait of interest T across a plurality of organisms are correlated, conditional on an abundance pattern of the cellular constituent i across the plurality of organisms. When (i) the genetic variation of one of the eQTL and (ii) the variation of the trait of interest T across the plurality of organisms are uncorrelated conditional on the abundance pattern of the cellular constituent i, the cellular constituent i is deemed to be causal for the trait of interest T and is therefore associated with it.
PCT/US2004/017754 2003-08-05 2004-06-04 Computer systems and methods for inferring causality from cellular constituent abundance data WO2005017652A2 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/567,282 US20070038386A1 (en) 2003-08-05 2004-06-04 Computer systems and methods for inferring casuality from cellular constituent abundance data
US11/361,871 US20060241869A1 (en) 2003-08-05 2006-02-23 Computer systems and methods for inferring causality from cellullar constituent abundance data

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US49268203P 2003-08-05 2003-08-05
US60/492,682 2003-08-05
US49747003P 2003-08-21 2003-08-21
US60/497,470 2003-08-21
US57549904P 2004-05-28 2004-05-28
US60/575,499 2004-05-28

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US10/567,282 A-371-Of-International US20070038386A1 (en) 2003-08-05 2004-06-04 Computer systems and methods for inferring casuality from cellular constituent abundance data
US11/361,871 Continuation-In-Part US20060241869A1 (en) 2003-08-05 2006-02-23 Computer systems and methods for inferring causality from cellullar constituent abundance data

Publications (2)

Publication Number Publication Date
WO2005017652A2 true WO2005017652A2 (fr) 2005-02-24
WO2005017652A3 WO2005017652A3 (fr) 2007-08-09

Family

ID=34198952

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/017754 WO2005017652A2 (fr) 2003-08-05 2004-06-04 Computer systems and methods for inferring causality from cellular constituent abundance data

Country Status (2)

Country Link
US (2) US20070038386A1 (fr)
WO (1) WO2005017652A2 (fr)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7653491B2 (en) 2002-05-20 2010-01-26 Merck & Co., Inc. Computer systems and methods for subdividing a complex disease into component diseases
US7729864B2 (en) 2003-05-30 2010-06-01 Merck Sharp & Dohme Corp. Computer systems and methods for identifying surrogate markers
US8185367B2 (en) 2004-04-30 2012-05-22 Merck Sharp & Dohme Corp. Systems and methods for reconstructing gene networks in segregating populations
WO2014066217A1 (fr) * 2012-10-23 2014-05-01 Illumina, Inc. HLA typing using selective amplification and sequencing
US8843356B2 (en) 2002-12-27 2014-09-23 Merck Sharp & Dohme Corp. Computer systems and methods for associating genes with traits using cross species data
US9530161B2 (en) 2014-02-28 2016-12-27 Ebay Inc. Automatic extraction of multilingual dictionary items from non-parallel, multilingual, semi-structured data
US9569526B2 (en) 2014-02-28 2017-02-14 Ebay Inc. Automatic machine translation using user feedback
CN106755436A (zh) * 2016-12-29 2017-05-31 北京林业大学 Method for constructing a genetic linkage map of forest tree genomic DNA methylation
US9798720B2 (en) 2008-10-24 2017-10-24 Ebay Inc. Hybrid machine translation
US9881006B2 (en) 2014-02-28 2018-01-30 Paypal, Inc. Methods for automatic generation of parallel corpora
US9940658B2 (en) 2014-02-28 2018-04-10 Paypal, Inc. Cross border transaction machine translation
CN111239083A (zh) * 2020-02-26 2020-06-05 东莞市晶博光电有限公司 Infrared transmittance testing device for mobile phone glass ink and correlation algorithm
US10913986B2 (en) 2016-02-01 2021-02-09 The Board Of Regents Of The University Of Nebraska Method of identifying important methylome features and use thereof
CN114240264A (zh) * 2022-02-24 2022-03-25 成都四方伟业软件股份有限公司 Method and device for testing causal relationships between urban management event indicators

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003253936A (ja) * 2002-02-28 2003-09-10 Sony Corp Vehicle, electronic key, data processing device, and vehicle management method
US7608458B2 (en) * 2004-02-05 2009-10-27 Medtronic, Inc. Identifying patients at risk for life threatening arrhythmias
US20050191678A1 (en) * 2004-02-12 2005-09-01 Geneob Usa Inc. Genetic predictability for acquiring a disease or condition
US8027791B2 (en) * 2004-06-23 2011-09-27 Medtronic, Inc. Self-improving classification system
US8335652B2 (en) * 2004-06-23 2012-12-18 Yougene Corp. Self-improving identification method
US20050287574A1 (en) * 2004-06-23 2005-12-29 Medtronic, Inc. Genetic diagnostic method for SCD risk stratification
US7747392B2 (en) * 2004-12-14 2010-06-29 Genomas, Inc. Physiogenomic method for predicting clinical outcomes of treatments in patients
US20060278241A1 (en) * 2004-12-14 2006-12-14 Gualberto Ruano Physiogenomic method for predicting clinical outcomes of treatments in patients
KR101346073B1 (ko) * 2005-03-07 2013-12-31 니폰 조키 세야쿠 가부시키가이샤 Method for research, determination, or evaluation
US8117203B2 (en) * 2005-07-15 2012-02-14 Fetch Technologies, Inc. Method and system for automatically extracting data from web sites
US7493324B1 (en) * 2005-12-05 2009-02-17 Verizon Services Corp. Method and computer program product for using data mining tools to automatically compare an investigated unit and a benchmark unit
US7961189B2 (en) * 2006-05-16 2011-06-14 Sony Corporation Displaying artists related to an artist of interest
US8026049B2 (en) * 2007-03-23 2011-09-27 Wisconsin Alumni Research Foundation Noninvasive measurement and identification of biomarkers in disease state
US20110143956A1 (en) * 2007-11-14 2011-06-16 Medtronic, Inc. Diagnostic Kits and Methods for SCD or SCA Therapy Selection
US20090131276A1 (en) * 2007-11-14 2009-05-21 Medtronic, Inc. Diagnostic kits and methods for scd or sca therapy selection
CN106126881A (zh) 2008-03-26 2016-11-16 赛拉诺斯股份有限公司 Computer system for characterizing the clinical outcome of a subject
US8285719B1 (en) 2008-08-08 2012-10-09 The Research Foundation Of State University Of New York System and method for probabilistic relational clustering
US20100317006A1 (en) * 2009-05-12 2010-12-16 Medtronic, Inc. Sca risk stratification by predicting patient response to anti-arrhythmics
US8631057B2 (en) * 2009-08-25 2014-01-14 International Business Machines Corporation Alignment of multiple liquid chromatography-mass spectrometry runs
WO2011088098A1 (fr) * 2010-01-12 2011-07-21 Statistical Innovations, Inc. Computer-implemented models predicting outcome variables and characterizing more fundamental underlying conditions
CA2796272C (fr) 2010-04-29 2019-10-01 The Regents Of The University Of California Pathway recognition algorithm using data integration on genomic models (PARADIGM)
US8473445B2 (en) * 2010-08-02 2013-06-25 Disney Enterprise, Inc Real-time story generation
US8930362B2 (en) * 2011-03-31 2015-01-06 Infosys Limited System and method for streak discovery and prediction
US10025877B2 (en) * 2012-06-06 2018-07-17 23Andme, Inc. Determining family connections of individuals in a database
WO2014186036A1 (fr) * 2013-03-14 2014-11-20 Allegro Diagnostics Corp. Methods for evaluating the status of chronic obstructive pulmonary disease (COPD)
US11976329B2 (en) 2013-03-15 2024-05-07 Veracyte, Inc. Methods and systems for detecting usual interstitial pneumonia
EP3140429B1 (fr) 2014-05-05 2020-02-19 Medtronic Inc. Methods for identifying and/or selecting a treatment for SCA or SCD by CRT or CRT-D
CN114606309A (zh) 2014-11-05 2022-06-10 威拉赛特公司 Diagnostic systems and methods using machine learning and high-dimensional transcriptional data
WO2016178591A2 (fr) * 2015-05-05 2016-11-10 Gene Predit, Sa Genetic markers and treatment of male obesity
WO2017049010A1 (fr) * 2015-09-15 2017-03-23 The Trustees Of Columbia University In The City Of New York Use of aldehyde dehydrogenase as a biomarker of beta cell dysfunction and loss
WO2017165803A1 (fr) * 2016-03-24 2017-09-28 President And Fellows Of Harvard College Compositions and methods useful for identifying allelic variants that modulate gene expression
WO2018094204A1 (fr) * 2016-11-17 2018-05-24 Arivale, Inc. Determining relationships between risks for biological conditions and dynamic analytes
CN106777168A (zh) * 2016-12-21 2017-05-31 深圳中兴网信科技有限公司 Data management method and data management system
JP7179766B2 (ja) * 2017-05-12 2022-11-29 ラボラトリー コーポレイション オブ アメリカ ホールディングス Systems and methods for biomarker identification
US20210105962A1 (en) * 2018-02-22 2021-04-15 Elsoms Developments Ltd Methods and compositions relating to maintainer lines
US10692254B2 (en) * 2018-03-02 2020-06-23 International Business Machines Corporation Systems and methods for constructing clinical pathways within a GUI
WO2020180424A1 (fr) 2019-03-04 2020-09-10 Iocurrents, Inc. Data compression and communication using machine learning
US20230420077A1 (en) * 2020-11-18 2023-12-28 Kiromic BioPharma, Inc. Disease-associated isoform identifier
CN113889179A (zh) * 2021-10-13 2022-01-04 山东大学 Method for predicting compound-protein interactions based on multi-view deep learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020094532A1 (en) * 2000-10-06 2002-07-18 Bader Joel S. Efficient tests of association for quantitative traits and affected-unaffected studies using pooled DNA

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0317239A3 (fr) * 1987-11-13 1990-01-17 Native Plants Incorporated Method and device for the detection of restriction fragment length polymorphisms
US5075217A (en) * 1989-04-21 1991-12-24 Marshfield Clinic Length polymorphisms in (dC-dA)n ·(dG-dT)n sequences
US5143854A (en) * 1989-06-07 1992-09-01 Affymax Technologies N.V. Large scale photolithographic solid phase synthesis of polypeptides and receptor binding screening thereof
US5578832A (en) * 1994-09-02 1996-11-26 Affymetrix, Inc. Method and apparatus for imaging a sample on a device
US5569588A (en) * 1995-08-09 1996-10-29 The Regents Of The University Of California Methods for drug screening
US6165709A (en) * 1997-02-28 2000-12-26 Fred Hutchinson Cancer Research Center Methods for drug target screening
US5965352A (en) * 1998-05-08 1999-10-12 Rosetta Inpharmatics, Inc. Methods for identifying pathways of drug action
US6324479B1 (en) * 1998-05-08 2001-11-27 Rosetta Impharmatics, Inc. Methods of determining protein activity levels using gene expression profiles
US6132969A (en) * 1998-06-19 2000-10-17 Rosetta Inpharmatics, Inc. Methods for testing biological network models
US6218122B1 (en) * 1998-06-19 2001-04-17 Rosetta Inpharmatics, Inc. Methods of monitoring disease states and therapies using gene expression profiles
US6274339B1 (en) * 1999-02-05 2001-08-14 Millennium Pharmaceuticals, Inc. Methods and compositions for the diagnosis and treatment of body weight disorders, including obesity
US6132997A (en) * 1999-05-28 2000-10-17 Agilent Technologies Method for linear mRNA amplification
US6271002B1 (en) * 1999-10-04 2001-08-07 Rosetta Inpharmatics, Inc. RNA amplification method
MXPA03000495A (es) * 2000-07-17 2004-08-12 Gricolas Su Majestad La Reina Map-based genome exploration method for identifying regulatory sites that control the level of gene transcripts and products.
US6368806B1 (en) * 2000-10-05 2002-04-09 Pioneer Hi-Bred International, Inc. Marker assisted identification of a gene associated with a phenotypic trait
JP2005508178A (ja) * 2001-11-08 2005-03-31 デヴェロゲン アクチエンゲゼルシャフト フュア エントヴィックルングスビオローギッシェ フォルシュング Men protein, GST2, Rab-RP1, Csp, F-box protein Lilina/FBL7, ABC50, coronin, Sec61α, or VhaPPA1-1, or homologous proteins, involved in the regulation of energy homeostasis
CA2474982A1 (fr) * 2002-02-01 2003-08-07 Rosetta Inpharmatics Llc Computer systems and methods for identifying genes and determining pathways associated with traits
US7653491B2 (en) * 2002-05-20 2010-01-26 Merck & Co., Inc. Computer systems and methods for subdividing a complex disease into component diseases
WO2004013727A2 (fr) * 2002-08-02 2004-02-12 Rosetta Inpharmatics Llc Computer systems and methods that use clinical and expression quantitative trait loci to associate genes with traits

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANG ET AL.: 'Multiple trait analysis of genetic mapping for quantitative trait loci' GENETICS vol. 140, July 1995, pages 1111 - 1127 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7653491B2 (en) 2002-05-20 2010-01-26 Merck & Co., Inc. Computer systems and methods for subdividing a complex disease into component diseases
US8843356B2 (en) 2002-12-27 2014-09-23 Merck Sharp & Dohme Corp. Computer systems and methods for associating genes with traits using cross species data
US7729864B2 (en) 2003-05-30 2010-06-01 Merck Sharp & Dohme Corp. Computer systems and methods for identifying surrogate markers
US8185367B2 (en) 2004-04-30 2012-05-22 Merck Sharp & Dohme Corp. Systems and methods for reconstructing gene networks in segregating populations
US9798720B2 (en) 2008-10-24 2017-10-24 Ebay Inc. Hybrid machine translation
WO2014066217A1 (fr) * 2012-10-23 2014-05-01 Illumina, Inc. HLA typing using selective amplification and sequencing
US9181583B2 (en) 2012-10-23 2015-11-10 Illumina, Inc. HLA typing using selective amplification and sequencing
EP3594362A1 (fr) * 2012-10-23 2020-01-15 Illumina, Inc. Method and systems for determining haplotypes in a sample
US10262104B2 (en) 2012-10-23 2019-04-16 Illumina, Inc. HLA typing using selective amplification and sequencing
AU2013334958B2 (en) * 2012-10-23 2018-11-15 Illumina, Inc. HLA typing using selective amplification and sequencing
US9940658B2 (en) 2014-02-28 2018-04-10 Paypal, Inc. Cross border transaction machine translation
US9881006B2 (en) 2014-02-28 2018-01-30 Paypal, Inc. Methods for automatic generation of parallel corpora
US9805031B2 (en) 2014-02-28 2017-10-31 Ebay Inc. Automatic extraction of multilingual dictionary items from non-parallel, multilingual, semi-structured data
US9569526B2 (en) 2014-02-28 2017-02-14 Ebay Inc. Automatic machine translation using user feedback
US9530161B2 (en) 2014-02-28 2016-12-27 Ebay Inc. Automatic extraction of multilingual dictionary items from non-parallel, multilingual, semi-structured data
US10913986B2 (en) 2016-02-01 2021-02-09 The Board Of Regents Of The University Of Nebraska Method of identifying important methylome features and use thereof
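CN106755436A (zh) * 2016-12-29 2017-05-31 北京林业大学 Method for constructing a genetic linkage map of forest tree genomic DNA methylation
CN111239083A (zh) * 2020-02-26 2020-06-05 东莞市晶博光电有限公司 Infrared transmittance testing device for mobile phone glass ink and correlation algorithm
CN114240264A (zh) * 2022-02-24 2022-03-25 成都四方伟业软件股份有限公司 Method and device for testing causal relationships between urban management event indicators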

Also Published As

Publication number Publication date
WO2005017652A3 (fr) 2007-08-09
US20060241869A1 (en) 2006-10-26
US20070038386A1 (en) 2007-02-15

Similar Documents

Publication Publication Date Title
US20070038386A1 (en) Computer systems and methods for inferring casuality from cellular constituent abundance data
US8843356B2 (en) Computer systems and methods for associating genes with traits using cross species data
US7729864B2 (en) Computer systems and methods for identifying surrogate markers
US7653491B2 (en) Computer systems and methods for subdividing a complex disease into component diseases
US7035739B2 (en) Computer systems and methods for identifying genes and determining pathways associated with traits
US20060111849A1 (en) Computer systems and methods that use clinical and expression quantitative trait loci to associate genes with traits
Voight et al. The metabochip, a custom genotyping array for genetic studies of metabolic, cardiovascular, and anthropometric traits
US8185367B2 (en) Systems and methods for reconstructing gene networks in segregating populations
Gerhold et al. Better therapeutics through microarrays
US7908090B2 (en) Signatures for human aging
US20060177837A1 (en) Systems and methods for identifying diagnostic indicators
Schadt et al. A new paradigm for drug discovery: integrating clinical, genetic, genomic and molecular phenotype data to identify drug targets
Aguet et al. Molecular quantitative trait loci
Dahlin et al. Integrative systems biology approaches in asthma pharmacogenomics
Flynn et al. Functional characterization of genetic variant effects on expression
Emani et al. Single-cell genomics and regulatory networks for 388 human brains
WO2019191123A1 (fr) Methods for predicting the effects of genomic variation on gene transcription
Dobbyn et al. Co-localization of Conditional eQTL and GWAS Signatures in Schizophrenia
WO2008060566A2 (fr) Biometric analysis of populations defined by homozygous marker track length
Liu et al. Brain transcriptional regulatory architecture and schizophrenia etiology converge between East Asian and European ancestral populations
Walker et al. Genetic control of gene expression and splicing in the developing human brain
Rothberg et al. Integrating expression‐based drug response and SNP‐based pharmacogenetic strategies into a single comprehensive pharmacogenomics program
Tan et al. Prioritization of genes associated with type 2 diabetes mellitus for functional studies
Czamara et al. Statistical genetic concepts in psychiatric genomics
WO2024102199A1 (fr) Methods and systems for the diagnosis and treatment of lupus based on the expression of primary immunodeficiency genes

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2007038386

Country of ref document: US

Ref document number: 10567282

Country of ref document: US

122 Ep: pct application non-entry in european phase
WWP Wipo information: published in national office

Ref document number: 10567282

Country of ref document: US