WO2004031912A2 - Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables

Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables

Publication number
WO2004031912A2
Authority
WO
WIPO (PCT)
Prior art keywords
trait
block
haplotype
markers
haplotypes
Prior art date
Application number
PCT/US2003/031186
Other languages
English (en)
Other versions
WO2004031912A3 (fr)
Inventor
Lue Ping Zhao
Original Assignee
Fred Hutchinson Cancer Research Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fred Hutchinson Cancer Research Center
Priority to AU2003282907A (patent AU2003282907A1)
Publication of WO2004031912A2
Publication of WO2004031912A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Definitions

  • the present invention relates to methods of estimating haplotype frequencies and methods of associating haplotype frequencies and environmental factors with a trait.
  • a trait may be associated with a particular SNP genotype or with a distinct combination of contiguous SNP genotypes (SNP haplotype). For example, there is a haplotype association of lipoprotein lipase with coronary heart diseases, but no association is evident with individual SNPs (Hallman et al. (1999) Ann. Hum. Genet. 63:499-510).
  • haplotypes from diploid individuals is a difficult task for individuals heterozygous at multiple SNPs. For example, if a sample has two SNP genotypes, A/C and A/G, then the two haplotypes could be either AA and CG or AG and CA.
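  • The phase ambiguity in the example above can be made concrete with a short enumeration. The following is a minimal sketch, not part of the patent text; the function name `compatible_haplotype_pairs` and the two-character genotype encoding are assumptions chosen for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Enumerate unordered haplotype pairs consistent with unphased genotypes.

    `genotypes` holds one two-character allele pair per SNP, e.g.
    ["AC", "AG"] for the genotypes A/C and A/G (hypothetical encoding).
    """
    pairs = set()
    # For each SNP, choose which allele sits on the first chromosome;
    # the other allele is then placed on the second chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


print(compatible_haplotype_pairs(["AC", "AG"]))
# [('AA', 'CG'), ('AG', 'CA')]
```

  • With two heterozygous SNPs the sketch returns exactly the two configurations named in the text; a sample that is heterozygous at only one SNP yields a single pair and is therefore unambiguous.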
  • Several approaches have been attempted to read alleles from separated chromosomes, such as dissecting a single chromosome or inserting an entire chromosome into a yeast artificial chromosome (YAC) (Green et al. (1998) in The Genetic Basis of Human Cancer, McGraw-Hill Health Professions Division, NY) or using rodent-human hybrid technique to physically separate two chromosomes (Patil et al.
  • Another approach is to infer haplotypes by genotyping first-degree relatives.
  • With parental genotype data, the phases (origin of parental copy of chromosome) of genotypes can be partially resolved (Wijsman (1987) Amer. J. Hum. Genet. 41:356-373).
  • gathering parental biological samples, if available, and genotyping them are expensive and may be ethically sensitive.
  • haplotype frequencies can be accurately estimated using expectation-maximization algorithms (EM).
  • EM: expectation-maximization algorithms
  • maximum likelihood approaches have been developed to estimate population haplotype frequencies directly from genotype data in unrelated individuals (Excoffier & Slatkin (1995) Mol. Biol. Evol 12:921-927; Hawley & Kidd (1995) J Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810).
  • Excoffier & Slatkin's EM algorithm is the most popular one and appears to require the fewest assumptions.
  • the EM algorithms are computationally burdensome when the number of SNPs exceeds 20.
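  • For orientation, a plain EM iteration for haplotype frequencies from unphased biallelic genotypes can be sketched as follows. This is a textbook-style illustration under the assumption of random pairing of haplotypes, not the code of any cited paper and not the EE method of the invention; the genotype coding (0, 1, or 2 variant alleles per SNP) and all function names are assumptions. The number of compatible phase configurations doubles with each additional heterozygous SNP, which is one reason such enumeration becomes burdensome for many SNPs.

```python
from collections import defaultdict
from itertools import product


def phase_pairs(genotype):
    """All unordered haplotype pairs compatible with one genotype vector.

    Each entry counts variant alleles at a biallelic SNP (0, 1, or 2).
    Haplotypes are tuples of 0/1 alleles.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites; het sites are overwritten
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies by EM, assuming random pairing."""
    configs = [phase_pairs(g) for g in genotypes]
    haps = sorted({h for c in configs for pair in c for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for c in configs:
            # E-step: posterior weight of each compatible haplotype pair
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in c]
            total = sum(w)
            for (a, b), wi in zip(c, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by number of chromosomes
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


# Hypothetical toy data: two SNPs, four individuals
estimates = em_haplotype_frequencies([(0, 0), (0, 0), (2, 2), (1, 1)])
for hap, f in estimates.items():
    print(hap, round(f, 4))
```

  • Note that this sketch returns point estimates only; obtaining standard errors would require an additional step such as bootstrapping.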
  • SEs: standard errors
  • haplotype frequencies and assessing the association of particular haplotypes with defined traits becomes increasingly practical as millions of SNPs are identified and are readily genotyped in population research. There exists a need for algorithms that accurately estimate haplotype frequencies from genotyped SNPs when the number of SNPs is large, and to provide SEs for the estimated haplotype frequencies. There also exists a need for methods for assessing disease associations with SNP haplotypes and environmental variables in case-control studies.
  • the invention provides a method for estimating haplotype frequencies for a set of markers in a sample population using an estimating equation (EE) method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in a sample population, wherein the phase information is incomplete, and wherein the method directly provides a standard error measurement for each estimated haplotype frequency.
  • EE: estimating equation
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and (i) estimating block haplotype frequencies for a first block, wherein a block haplotype comprises a combination of genotypes for markers in a block; (ii) estimating block haplotype frequencies for a second block; (iii) estimating combination block haplotype frequencies for a first combination block using selected and pooled block haplotypes for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein block haplotypes with greater than a predetermined minimum frequency are selected and block haplotypes that are not selected are pooled; (iv) estimating block haplotype frequencies for a third block; (v) estimating combination block haplotype frequencies for a second combination block using selected and pooled block haplotypes for the first combination block and the third block; and (vi) sequentially repeating steps (iv) and (v), wherein each repetition provides combination block haplotype frequencies for a combination block comprising an additional block.
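  • The select-and-pool bookkeeping in steps (iii) and (v) above can be sketched as follows. This is only an illustration of how block haplotypes might be retained or pooled before two blocks are combined; it does not perform the frequency estimation itself, and the names `select_and_pool`, `combination_candidates`, and the `"POOLED"` label are assumptions, not terms from the patent.

```python
def select_and_pool(block_freqs, min_freq):
    """Keep block haplotypes whose frequency exceeds `min_freq`.

    All remaining haplotypes are merged into a single pooled category
    (labelled "POOLED" here, a hypothetical name).
    """
    selected = {h: f for h, f in block_freqs.items() if f > min_freq}
    pooled = sum(f for f in block_freqs.values() if f <= min_freq)
    if pooled > 0:
        selected["POOLED"] = pooled
    return selected


def combination_candidates(left, right):
    """Candidate haplotypes for a combination block.

    Each candidate pairs one retained (or pooled) haplotype from the left
    block with one from the right block; their frequencies would then be
    estimated in the next step of the algorithm.
    """
    return [(a, b) for a in left for b in right]


# Hypothetical estimated frequencies for two three-marker blocks
block1 = {"000": 0.55, "011": 0.40, "101": 0.03, "110": 0.02}
block2 = {"001": 0.70, "110": 0.28, "111": 0.02}

kept1 = select_and_pool(block1, min_freq=0.05)
kept2 = select_and_pool(block2, min_freq=0.05)
print(kept1)
print(len(combination_candidates(kept1, kept2)), "candidate combination haplotypes")
```

  • Pooling keeps the number of candidates in each combination block small, which is what allows the step-wise procedure to scale to many markers.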
  • the invention provides a method for assessing the association of a haplotype with a trait.
  • the method comprises two parts.
  • the first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown) and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group, as described above for the methods of the first aspect of the invention.
  • the second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
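  • As a rough illustration of how a frequency difference and its standard error lead to a test statistic, consider the sketch below. It assumes the trait-positive and trait-negative groups are independent samples, so that the variances of the two estimates add; this is a simplifying assumption for illustration and not necessarily how the estimating equation algorithm of the invention computes the standard error. The function name and the numeric inputs are hypothetical.

```python
import math


def frequency_difference(p_case, se_case, p_control, se_control):
    """Difference in haplotype frequency between groups, with SE and Z.

    Assumes independent groups, so the variance of the difference is the
    sum of the two variances (an assumption of this sketch).
    """
    diff = p_case - p_control
    se_diff = math.sqrt(se_case ** 2 + se_control ** 2)
    return diff, se_diff, diff / se_diff


# Hypothetical estimates for one haplotype in cases and controls
diff, se_diff, z = frequency_difference(0.30, 0.03, 0.20, 0.04)
print(f"difference={diff:.3f}  SE={se_diff:.3f}  Z={z:.2f}")
```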
  • the second aspect of the invention provides a haplotype-based method of diagnosing an increased risk of developing a trait.
  • the method comprises four parts.
  • the first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group, as described above.
  • the second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
  • the third part of the method derives one or more haplotypes that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with a trait.
  • the method comprises two parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of one or more haplotypes or diplotypes at one or more loci, one or more environmental factors, or a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises distributing the markers in each set of markers into a plurality of blocks and determining pattern frequencies in a step-wise fashion for each set of markers in both trait-positive individuals and trait-negative individuals.
  • the forward-block algorithm comprises the steps of: (i) estimating pattern frequencies for a first block, wherein a pattern comprises a block haplotype and one or more environmental factors and a block haplotype comprises a combination of genotypes for markers in a block; (ii) estimating pattern frequencies for a second block; (iii) estimating pattern frequencies for a first combination block using selected and pooled patterns for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein patterns with greater than a predetermined minimum frequency are selected and patterns that are not selected are pooled; (iv) estimating pattern frequencies for a third block; (v) estimating pattern frequencies for a second combination block using selected and pooled patterns for the first combination block and the third block; and (vi) sequentially repeating
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third aspect of the invention provides a haplotype-based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of one or more haplotypes or diplotypes at one or more loci, one or more environmental factors, or a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises distributing the markers in each set of markers into a plurality of blocks and determining pattern frequencies in a step-wise fashion for each set of markers in both trait-positive individuals and trait-negative individuals, as described above.
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third part of the method derives one or more patterns that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes or traits.
  • the methods of the invention may be used for any study design, for example case-control studies, observational studies, cohort studies, or clinical trials.
  • the markers used in any of the methods of the invention may have two alleles (biallelic markers) or multiple alleles (multiallelic markers).
  • Representative markers include, but are not limited to, microsatellite markers or single nucleotide polymorphisms (SNPs).
  • SNPs: single nucleotide polymorphisms
  • the markers are biallelic SNPs.
  • the set of markers may comprise more than about 12 markers, such as more than about 20 markers, such as more than about 50 markers, such as more than about 100 markers, or such as several hundreds of markers.
  • each block may comprise at least about 3 markers, such as at least about 5 markers, such as at least about 10 markers, or such as at least about 20 markers. In some embodiments the genotype of at least some of the markers is unknown.
  • the sample population in any of the methods of the invention may comprise more than about 10 individuals, more than about 50 individuals, more than about 100 individuals, more than about 500 individuals, or more than about 1000 individuals.
  • the invention also provides a computer-readable medium having computer-readable instructions for performing the methods of the invention.
  • the invention also provides a computer-system having a processor, a memory, and an operating environment, the computer system operable for performing the methods of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

  • FIGURE 1 shows a schematic diagram representing the forward-block algorithm.
  • FIGURE 2 shows a schematic flowchart for the simulations.
  • FIGURE 3 shows plots of the average discrepancy of the estimates of haplotype frequencies.
  • the plots in the top panel are the average discrepancy $\sum_j |\hat{p}_j - p_j|$
  • the plots in the lower panel are the average discrepancy ($\sum_j |\hat{\sigma}_j - \sigma_j|$, summed over all common haplotypes) of the estimated standard error $\hat{\sigma}_j$ compared to the sample standard deviation of the estimate based on 500 simulations under HWE, HWD(1), and HWD(2), respectively. The simulations are carried out for a five-locus system.
  • FIGURE 4 shows the discrepancy of the estimated haplotype frequencies using the EE method.
  • the solid line represents the average discrepancy over 1,000 simulated data sets and the two dashed lines represent the 5th and 95th percentiles of the 1,000 discrepancies.
  • FIGURE 5 shows Table 1 describing estimated haplotype frequencies (standard errors) using the EE and estimated haplotype frequencies using Bayesian method
  • ARHGDIB on chromosome 12: core SNP, G4923a6, G4923a7, G4923a10, G4923a11, G4923a12, G4923a18, G4923a15, G4923a16, G4923a17, G4923a22, G4923a26, G4923a28, G4923a44, G4923a46, G4923a35, G4923a37, and G4923a38.
  • Complete genotypes for all 44 individuals are available for the 8 SNPs indicated in italics; the remaining SNPs have missing genotypes for one or more individuals (0: reference allele, 1: variant allele).
  • FIGURE 6 shows Table 2 describing estimated haplotype frequencies (standard errors) using both the EE and EM algorithms and estimated haplotype frequencies using Bayesian methods, PHASE and HAPLOTYPER, from 44 individuals with complete genotype data for 8 SNPs within the gene ARHGDIB on chromosome 12: core SNP, G4923a6, G4923a7, G4923a12, G4923a26, G4923a44, G4923a46, and G4923a35 (0: reference allele, 1: variant allele).
  • the standard errors of EM estimates of haplotype frequencies were computed from 1,000 bootstrap samples.
  • FIGURE 7 shows Table 3 describing estimated allelic frequencies among cases and controls, their allelic differences, standard errors and Z-statistics, and estimated odds ratios with bootstrap confidence intervals
  • FIGURE 8 shows Table 4 describing estimated Z-statistics, odds ratios, and their 95% confidence intervals for haplotype associations with two adjacent SNPs within APOC3.
  • FIGURES 9(a)-(d) show the distributions of maximum Z-statistics and the 90% and 95% threshold values obtained through 1,000 permutations: (a) shows the frequency of maximum Z-scores using two adjacent SNPs; (b) shows the maximized Z-scores at every SNP locus when two adjacent SNPs are used; (c) and (d) are similar to (a) and (b) using three adjacent SNPs.
  • the two threshold values in (b) are 2.86 and 2.54, and in (d) are 3.05 and 2.82.
  • FIGURE 10 shows Table 5 describing estimated Z-statistics, OR, and their 95% confidence intervals for haplotype associations at three adjacent SNPs within APOC3.
  • FIGURE 11 shows Table 6 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at four adjacent SNPs within APOC3.
  • FIGURE 12 shows Table 7 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at five adjacent SNPs within APOC3.
  • FIGURE 13 shows Table 8 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at six adjacent SNPs within APOC3.
  • FIGURE 14 shows Table 9 comparing the estimating equation approach (EE) with the EM algorithm in estimating haplotype frequencies and their standard errors.
  • FIGURE 15 shows Table 10 describing a Monte Carlo simulation study under the null hypothesis with no covariates.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates.
  • Std is the standard deviation of 1,000 estimates of parameter.
  • Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each set of simulation data.
  • FIGURE 16 shows Table 11 describing a Monte Carlo simulation study under the null hypothesis with covariates.
  • FIGURE 17 shows Table 12 describing a Monte Carlo simulation study under the null hypothesis with varying sample sizes.
  • FIGURE 18 shows Table 13 describing a Monte Carlo simulation study under the alternative hypothesis with no covariates.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates.
  • Std is the standard deviation of 1,000 estimates of parameter.
  • Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates.
  • FIGURE 19 shows Table 14 describing a Monte Carlo simulation study under the alternative hypothesis with covariates.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each simulation data.
  • FIGURE 20 shows Table 15 describing a case-control study under the null hypothesis with covariates.
  • I( ) is an indicator function;
  • h2 and h3 are the second and third common haplotypes.
  • the risk factors Gender and Age are omitted from the estimation model and only all common haplotypes are modeled in a case-control study with 300 cases and 300 controls.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates.
  • Std is the standard deviation of 1,000 estimates of parameter.
  • Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates.
  • β5 is the coefficient of I(h4) in the estimation model.
  • the invention provides a method for estimating haplotype frequencies in a population using an estimating equation method that provides a standard error measurement for each estimated haplotype frequency.
  • the invention provides a method for assessing the association of a haplotype with a trait, comprising estimating haplotype frequencies for a group of trait-positive individuals and a group of trait-negative individuals using an estimating equation method that provides a standard error measurement for each estimated haplotype frequency; and estimating the differences in haplotype frequencies for the trait-positive group and the trait-negative group.
  • the invention provides a method for assessing the association of a trait with a pattern comprising: (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and/or (3) a combination of haplotypes or diplotypes and environmental factors, using an estimating equation method that provides standard error measurements.
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes.
  • the invention also provides a computer-readable medium having computer-readable instructions for performing the methods of the invention, and a computer-system having a processor, a memory, and an operating environment, the computer system operable for performing the methods of the invention. Unless specifically defined herein, all terms used herein have the same meaning as they would to one skilled in the art of the present invention.
  • the term "genetic marker” or “marker” refers to genome-derived polynucleotides which are sufficiently polymorphic to allow a reasonable probability that a randomly selected individual will be heterozygous, and thus informative for genetic analysis by methods such as linkage analysis or association studies.
  • the human haploid genome contains a 3 × 10^9 base-long double stranded DNA shared among the 24 chromosomes. Each human individual is diploid, i.e., possesses two haploid genomes, one of paternal origin, the other of maternal origin. The sequence of the human genome varies among individuals in a population.
  • polymorphism refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. “Polymorphic” refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A “polymorphic site” is the locus at which the variation occurs.
  • a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms.
  • the term “single nucleotide polymorphism” or “SNP” refers to markers which are genome-derived polynucleotides that exhibit allelic polymorphism.
  • allele refers to a variant of a nucleotide sequence. All alleles on a single chromosome are from one of the two parents. At a given polymorphic site, any individual (diploid), can be either homozygous (twice the same allele) or heterozygous (two different alleles).
  • haplotype refers to a particular combination of alleles at multiple markers, such as SNPs, present in an individual. Haplotype information is specified by the parental origin (phase) of individual marker alleles. Across short distances (usually 10's of kbp), haplotypes are conserved over the evolutionary time-scale (Drysdale et al. (2000) Proc. Natl Acad. Sci.
  • the term “genotype” refers to the identity of the alleles present in an individual or a sample.
  • the term “genotyping” a sample or an individual for a marker involves determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker.
  • the term “diplotype” refers to the identity of the alleles on both chromosomes in an individual, i.e., a pair of haplotypes.
  • the term “Hardy-Weinberg Equilibrium” or “HWE” refers to the assumption that haplotypes are randomly paired to form individuals' diplotypes.
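  • Under the random-pairing assumption just described, diplotype frequencies follow directly from haplotype frequencies: a homozygous diplotype has frequency p_i squared and a heterozygous diplotype has frequency 2 p_i p_j. A minimal sketch, with a hypothetical function name and illustrative frequencies:

```python
from itertools import combinations_with_replacement


def diplotype_frequencies_hwe(hap_freqs):
    """Diplotype frequencies implied by random pairing of haplotypes (HWE).

    Homozygous diplotypes get p_i**2; heterozygous ones get 2 * p_i * p_j.
    """
    out = {}
    for a, b in combinations_with_replacement(sorted(hap_freqs), 2):
        pa, pb = hap_freqs[a], hap_freqs[b]
        out[(a, b)] = pa * pb if a == b else 2 * pa * pb
    return out


# Hypothetical haplotype frequencies for a two-SNP system
freqs = diplotype_frequencies_hwe({"AA": 0.5, "CG": 0.3, "AG": 0.2})
for diplotype, f in freqs.items():
    print(diplotype, round(f, 4))
```

  • The resulting diplotype frequencies sum to one whenever the haplotype frequencies do.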
  • a “trait” or “genetic trait” or “phenotype” refers to a measurable characteristic present in some individuals of a population and absent in others.
  • a given polymorphism or rare mutation can be either neutral, i.e., it has no effect on the trait, or functional, i.e., it is responsible for a particular genetic trait.
  • the preferred traits contemplated within the present invention relate to fields of therapeutic interest, for example susceptibility to a disease, or drug response reflecting drug efficacy, toxicity, or other side effects related to treatment.
  • Traits may also relate to any other desirable or undesirable characteristic in a population of individuals, such as a bovine trait to produce tender, high-quality beef (see Adam (2002) Nature 417:778) or a disease-resistance in a plant.
  • the term “trait-positive individuals” or “cases” refers to individuals in which a particular trait is present.
  • the term “trait-negative individuals” or “controls” refers to individuals in which a particular trait is absent.
  • Traits can either be "binary”, e.g. diabetic vs. non-diabetic, "quantitative”, e.g. elevated blood pressure, or "censored”, e.g., time-to-onset of a disease outcome.
  • Individuals affected by a quantitative trait can be classified according to an appropriate scale of trait values, e.g., blood pressure ranges. Each trait value range can then be analyzed as a binary trait. Individuals showing a trait value within one such range are compared to individuals showing a trait value outside of this range. In such a case, genetic analysis methods are applied to subpopulations of individuals showing trait values within defined ranges.
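  • The range-based recoding of a quantitative trait described above amounts to a simple indicator. The sketch below uses a hypothetical function name and hypothetical blood pressure values and cut-offs, chosen only for illustration.

```python
def dichotomize(values, low, high):
    """Code each trait value as 1 if it lies in [low, high), else 0."""
    return [1 if low <= v < high else 0 for v in values]


# Hypothetical systolic blood pressure readings and range
readings = [118, 142, 135, 160, 128]
print(dichotomize(readings, low=130, high=150))
# [0, 1, 1, 0, 0]
```

  • Individuals coded 1 would then be treated as trait-positive and those coded 0 as trait-negative for that range.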
  • a trait can be categorical (e.g., blood types), ordinal (e.g., stages of breast cancer), censored (e.g., observations for study participants who do not have the event of interest during the period of follow-up), or provide frequency information.
  • environmental variable refers to non-genetic contributions to the presence or absence of a trait.
  • Environmental factors include, for example, age, sex, weight, height, nutrition, life-styles, smoking, alcohol-consumption, work habits, history of medications, medical history, family history, leisure activities, etc.
  • pattern refers to either (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, or (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors.
  • the presence of a particular pattern may associate a trait with the combination of a haplotype or diplotype at one locus with a haplotype or diplotype at another locus, or it may associate a trait with one or more environmental factors, or it may associate a trait with a combination of a haplotype or diplotype at one or more loci with one or more environmental factors.
  • locus refers to the specific location of a marker in the genome.
  • the terms “significantly associated,” “significantly positively associated,” and “significantly negatively associated” all refer to statistical significance. Statistical significance is used herein as it is typically used by those with skill in the art. It is a measure of the probability that an observed difference would have been observed simply by chance and is not the result of a "real" difference between two groups, for example. Thus, the lower the probability that the observed difference would have happened by chance, the more statistically significant the difference. Statistical significance is based on p-values. A p-value < 0.05 is typically considered statistically significant, although in some instances a p-value of < 0.01 or even < 0.005 or < 0.001 is preferred.
  • a statistically significant positive association between a haplotype and a trait means that the presence of a haplotype correlates with the presence of the trait
  • a statistically significant negative association between a haplotype and a trait means that the presence of a haplotype correlates with the absence of the trait.
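  • To connect a Z-statistic with the p-value thresholds and the direction of association described above, one may convert Z to a two-sided p-value under a standard normal reference distribution. The sketch below makes that normality assumption explicitly; the function names and the default threshold are illustrative and not taken from the patent.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a Z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def association_direction(z, alpha=0.05):
    """Classify an association as positive, negative, or not significant."""
    if two_sided_p_value(z) >= alpha:
        return "not significant"
    return "positive" if z > 0 else "negative"


for z in (2.86, -3.0, 1.0):
    print(z, round(two_sided_p_value(z), 4), association_direction(z))
```

  • When many haplotypes or loci are tested at once, a fixed threshold such as 0.05 applied to each test separately would overstate significance, which is why permutation-based thresholds for the maximum Z-statistic are also used.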
  • the invention provides a method for estimating haplotype frequencies for a set of markers in a sample population using an estimating equation (EE) method, wherein the genotypes of at least some of the markers are known for each individual in a sample population, wherein the phase information for the markers is incomplete, and wherein the algorithm used directly provides a standard error measurement for each estimated haplotype frequency without using bootstrap methods.
  • genotyping protocols result purely in genotype information; they produce information about the pair of alleles an individual possesses at each locus, but not necessarily haplotype information which would reveal the alleles that have been inherited together on the same paternal or maternal chromosome. Without explicit haplotype information, there is ambiguity with respect to the origin of alleles at neighboring loci. For example, it is difficult to determine if there are differences in the frequency of certain haplotypes between individuals with a disease ("cases”) and individuals without the disease (“controls”) in the absence of haplotype information.
  • the estimation of haplotype frequencies from genotype data gathered on a sample of individuals is based on the fact that the haplotypes of some individuals in the sample are unambiguous. This allows the ambiguous haplotypes to be estimated using statistical predictions. Individuals that are unambiguous with respect to phase or haplotype information have homozygous genotypes either at all relevant loci or at all but one relevant locus. Individuals with two or more heterozygous genotypes have more than one possible haplotype configuration compatible with their genotype data, and hence are ambiguous with respect to phase or haplotype information.
  • the EE method of the first aspect of the invention is motivated by the likelihood approach for missing data problems (Efron (1994) J. Am. Stat. Assoc. 89:463-475; Heitjan (1994) Biometrika 81:701-708; Rubin (1996) J. Am. Stat. Assoc. 91:473-489).
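  • The distinction between unambiguous and ambiguous individuals can be illustrated with a minimal sketch. The genotype coding (0, 1, or 2 copies of the allele labelled 1 at each biallelic marker) and the function names are assumptions made for illustration; they are not taken from the specification.

```python
from itertools import product


def heterozygous_loci(genotype):
    """Indices of loci where the individual carries one copy of each allele."""
    return [i for i, g in enumerate(genotype) if g == 1]


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    Assumed coding (illustrative only): biallelic markers, each entry is the
    number of copies (0, 1, or 2) of the allele labelled 1.
    """
    het = heterozygous_loci(genotype)
    pairs = set()
    for phase in product((0, 1), repeat=len(het)):
        h1 = [g // 2 for g in genotype]  # homozygous loci: 0 -> 0, 2 -> 1
        h2 = list(h1)
        for locus, allele in zip(het, phase):
            h1[locus], h2[locus] = allele, 1 - allele
        # Sorting makes (h1, h2) and (h2, h1) the same unordered pair.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def is_phase_ambiguous(genotype):
    """Phase is ambiguous when two or more loci are heterozygous."""
    return len(heterozygous_loci(genotype)) >= 2
```

  • With k heterozygous loci (k at least 1), the sketch returns 2^(k-1) unordered pairs, so an individual with zero or one heterozygous locus has exactly one compatible pair.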
  • the phase of a heterozygous SNP is unresolved, and may be treated as missing data.
  • the missingness of phase is Missing At Random (Heitjan (1994) Biometrika 81:701-708) in the sense that the missing mechanism depends on observed genotype data, rather than on the missing phase information. Note that if the parental origin (phase) of an allele at any single heterozygous locus is assumed to be fixed, it may serve as a reference phase for other SNPs within the same individual.
  • the EE method of the first aspect of the invention is described in EXAMPLE 1.
  • the derivations of the likelihood and estimating equations are described in the first section of EXAMPLE 1.
  • the likelihood function used in the method of the invention is essentially the same as the likelihood function that was derived in Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921-927. This likelihood function was also used in the two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
  • the EE method uses a forward-block computational algorithm, which is a computationally efficient algorithm that permits the resolution of phase information when the number of SNPs is larger than about 20.
  • the forward-block algorithm used in the EE method of the invention is described in the second section of EXAMPLE 1.
  • the forward-block algorithm carries out computations in a stepwise fashion as shown in FIGURE 1.
  • the set of markers (e.g., SNPs) is divided into several blocks.
  • a block may contain from about 3 to about 100 markers.
  • haplotype frequencies for the first two blocks are estimated separately.
  • the first two blocks are then joined and the haplotype frequencies for the enlarged block are estimated using the estimation results of the first two blocks as initial values.
  • haplotype frequencies for the next single block are estimated, and the single block is added to the enlarged block.
  • the estimations are done sequentially for each enlarged block and next single block. This process is continued until all the blocks are joined.
  • haplotypes with insignificant frequencies are filtered out.
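  • The stepwise flow described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of EXAMPLE 1: a plain EM update stands in for the estimating-equation solver, each joined block is restarted from uniform values rather than from the block estimates, and low-frequency haplotypes are simply dropped. All function names, the genotype coding (copies of allele 1 per biallelic marker), and the `min_freq` parameter are hypothetical.

```python
from itertools import product


def em_frequencies(genotypes, candidates, n_iter=200):
    """Estimate haplotype frequencies over a fixed candidate set (EM stand-in)."""
    freq = {h: 1.0 / len(candidates) for h in candidates}
    for _ in range(n_iter):
        counts = dict.fromkeys(freq, 0.0)
        for g in genotypes:
            # Ordered pairs (h1, h2) of candidates whose allele sums equal g.
            pairs = []
            for h1 in freq:
                h2 = tuple(x - a for x, a in zip(g, h1))
                if h2 in freq:
                    pairs.append((h1, h2))
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            if total == 0:
                continue  # genotype not explained by the retained candidates
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        n = sum(counts.values())
        if n == 0:
            break
        freq = {h: c / n for h, c in counts.items()}
    return freq


def forward_block(genotypes, block_size, min_freq=0.01):
    """Join blocks left to right, filtering rare haplotypes at each step."""
    n_markers = len(genotypes[0])
    blocks = [(s, min(s + block_size, n_markers))
              for s in range(0, n_markers, block_size)]

    def estimate(lo, hi, candidates):
        sub = [tuple(g[lo:hi]) for g in genotypes]
        freq = em_frequencies(sub, candidates)
        return {h: f for h, f in freq.items() if f >= min_freq}

    start, end = blocks[0]
    joined = estimate(start, end, list(product((0, 1), repeat=end - start)))
    for lo, hi in blocks[1:]:
        single = estimate(lo, hi, list(product((0, 1), repeat=hi - lo)))
        # Candidates for the enlarged block: concatenations of retained haplotypes.
        candidates = [a + b for a in joined for b in single]
        joined = estimate(start, hi, candidates)
    return joined
```

  • Restricting the enlarged block to concatenations of retained haplotypes is what keeps the candidate set small as markers are added; the full space of 2^m haplotypes is never enumerated for large m.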
  • the markers used in the methods of the invention may be biallelic or multiallelic markers.
  • Exemplary markers are microsatellite markers and SNPs.
  • the markers are biallelic SNPs.
  • the set of markers may comprise more than about 12 markers, more than about 20 markers, more than about 50 markers, more than about 100 markers, between about 200 and about 1000 markers, or several thousands of markers.
  • the genotype of at least some of the markers is unknown.
  • the sample population may comprise more than about 40 individuals, more than about 100 individuals, more than about 500 individuals, or more than 1000 individuals.
  • the EE method of the invention directly provides a standard error measurement for each estimated haplotype frequency using estimating equation theory and can also handle data sets that include missing genotype values.
  • the estimation of the standard error for any haplotype frequency is shown in the third section of EXAMPLE 1.
  • the adjustment of the covariance matrix for missing genotypes is described in the fourth section of EXAMPLE 1.
  • the EE method also permits the inference of haplotypes of individuals.
  • the phase of an individual's genotype can be predicted to provide a pair of haplotypes with a probability statement, as shown in the fifth section of EXAMPLE 1.
  • the EE method accurately estimates haplotype frequencies and their standard errors (SEs) under the assumption of Hardy-Weinberg Equilibrium (HWE) and under conditions when the model assumption is violated, as shown in EXAMPLE 2.
  • the estimated haplotype frequencies and their SEs using the EE method are consistent even when the sample size is small (see FIGURES 2 and 3).
  • Another simulation study shows that the haplotype frequencies estimated using the EE method are accurate even when the number of markers (e.g., SNPs) is large, as shown in EXAMPLE 2 and FIGURE 4.
  • the number of individuals can be several thousands, and the number of markers can be several hundreds.
  • the estimation of haplotype frequencies from a dataset of 632 individuals and 296 markers was completed in approximately 2 minutes, as shown in EXAMPLE 2.
  • the estimated haplotype frequencies and their SEs using the EE method of the invention compare favorably with those calculated with three existing methods: Arlequin software (Schneider et al. (2000) http//lgb.unige.ch/arelquin), which is an implementation of the EM method proposed in Excoffier & Slatkin (1995), PHASE software, which is an implementation of the Bayesian method (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989), and HAPLOTYPER software, which is an implementation of the Bayesian method (Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
  • Utilizing the forward-block computational algorithm and the estimating equation technique, the EE method is able to handle data with large numbers of SNPs and to estimate SEs directly without using bootstraps. Unlike the EM method, the EE method can also handle data with missing genotypes. Compared with the EM method for estimating haplotype frequencies, the EE method yields consistently smaller estimates for SEs, as documented in EXAMPLE 4 and FIGURE 14. In addition, the computations using the EE method were at least 120 times faster than those using the EM algorithm in a dataset including 6 markers and 437 individuals, as shown in EXAMPLE 4.
  • haplotype frequencies can be estimated using a Bayesian approach.
  • two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169) were proposed to reconstruct individuals' haplotypes from their genotypes.
  • both maximum likelihood and Bayesian methods can be used to reconstruct individuals' haplotypes and estimate haplotype frequencies.
  • an additional assumption is required for the conditional distribution of individual haplotype given all haplotypes in the first Bayesian approach (PHASE, Stephens et al. (2001) Amer. J. Hum.
  • An extension of the EE method may be used if an individual's parental genotypes are also available.
  • the accuracy of the estimates may be improved by incorporating parental genotypes in estimating equation (9) in EXAMPLE 1.
  • the improvement can be very valuable, especially for estimating frequencies of rare haplotypes, which is not reliable unless a very large number of individuals are genotyped.
  • the EE method may be used to estimate haplotype frequencies for multi-allelic markers, such as microsatellite markers.
  • the EE method may also be modified by incorporating available population genetic models to estimate haplotype frequencies for an isolated population.
  • the EE method may be used together with the forward block computational algorithm to systematically identify haplotype blocks such that there are only limited haplotypes within each block. With identified blocks and haplotypes, it is possible to determine which SNPs are essential for defining the haplotypes and which SNPs are redundant, obviating the need to genotype non-essential SNPs.
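  • Once the haplotypes within a block are known, the essential SNPs can be found by searching for the smallest marker subset that still tells all haplotypes apart. The brute-force search below is a sketch of that idea only; the specification does not state how essential SNPs are selected, and the function name and example haplotypes are hypothetical.

```python
from itertools import combinations


def essential_markers(haplotypes):
    """Smallest set of marker indices that distinguishes every haplotype.

    Exhaustive search over subsets in order of increasing size; suitable
    only for the small number of markers found within a single block.
    """
    n_markers = len(haplotypes[0])
    distinct = set(haplotypes)
    for size in range(n_markers + 1):
        for subset in combinations(range(n_markers), size):
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset
    return tuple(range(n_markers))


# Three block haplotypes over three SNPs; SNPs 1 and 2 carry the same information.
example_block = [(0, 0, 0), (0, 1, 1), (1, 1, 1)]
example_tags = essential_markers(example_block)
```

  • In this example the markers at indices 0 and 1 suffice, so the third SNP is redundant for defining the haplotypes.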
  • the estimating equation approach may be used for assessing trait associations with haplotypes.
  • SNP genotypes are used to study their association with complex traits, such as cancers or coronary heart diseases.
  • a widely accepted design strategy is the case-control study, which has been established through decades of epidemiological research (Breslow & Day (1980) Statistical Methods in Cancer Research, Int'l Agency for Research on Cancer, Lyon; Schlesselman (1982) Case-Control Studies: Design, Conduct, Analysis, Oxford University Press, NY).
  • a typical case-control study identifies a sample of trait-positive individuals ("cases") and a sample of matched, trait-negative individuals (“controls”) from a well-defined population.
  • some traits are associated with haplotypes, rather than with any individual marker within the haplotype. For example, Hallman and co-workers noted a haplotype association of lipoprotein lipase with coronary heart diseases (Hallman et al. (1999) Ann.
  • Without relying on experimental methods or on parental data, another class of methods infers haplotypes from multiple markers. Unambiguous haplotypes among cases and controls are identified, ignoring the remaining haplotypes (Haviland et al. (1995) Ann. Hum. Genet. 59:211-231). As expected, ignoring partially informative haplotypes implies a loss of efficiency, and this difficulty becomes more significant as the number of sites assayed increases. Typically, such a method is applicable to at most three or four markers. To incorporate partially observed haplotypes, another method is to infer haplotypes based upon empirical distributions, tolerating a degree of misclassification error (Hallman et al. (1999) Ann. Hum. Genet.
  • the method of the second aspect of the invention uses an estimating equation approach for analyzing case-control marker data to associate haplotypes with a trait, as shown in EXAMPLES 4 and 5.
  • the method infers haplotype frequencies statistically, and correlates these with case-control status, thereby establishing associations of particular haplotypes with the disease phenotype.
  • the method of the invention has two parts. In the first part, the method estimates haplotype frequencies for a set of markers with incomplete phase information in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation algorithm, which directly provides a standard error measurement for each estimated haplotype frequency.
  • a forward-block algorithm carries out computation of this part of the method in a stepwise fashion as shown in FIGURE 1.
  • the set of markers (e.g., SNPs) is divided into several blocks.
  • a block may contain between about 3 and about 100 markers.
  • haplotype frequencies for the first two blocks are estimated separately. The first two blocks are then joined and the haplotype frequencies for the enlarged block are estimated using the estimation results of the first two blocks as initial values.
  • haplotype frequencies for the next single block are estimated, and the single block is added to the enlarged block. The estimations are done sequentially for each enlarged block and next single block. This process is continued until all the blocks are joined. At each step, haplotypes with insignificant frequencies are filtered out.
  • the EE method of the invention directly provides a standard error measurement for each estimated haplotype frequency using estimating equation theory and can also handle data sets that include missing genotype values.
  • the estimation of the standard error for any haplotype frequency is shown in the third section of EXAMPLE 1.
  • the adjustment of the covariance matrix for missing genotypes is described in the fourth section of EXAMPLE 1.
  • the method estimates the differences in frequencies of individual haplotypes for the set of markers in the trait-positive group and in the trait-negative group using an estimating equation algorithm, which directly provides a standard error measurement for each estimated difference in haplotype frequency.
  • the invention provides a haplotype-based method for diagnosing an increased risk of developing a trait.
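  • An estimated difference and its standard error translate into a Z-statistic and a p-value in the usual way. The sketch below assumes the case and control estimates are independent, so that the standard error of the difference is obtained by combining the two group standard errors; the method itself provides the standard error of the difference directly. The function name and the numeric inputs are hypothetical.

```python
import math


def frequency_difference_test(p_case, se_case, p_control, se_control):
    """Z-statistic and two-sided normal p-value for a frequency difference.

    Assumes independent case and control samples (an assumption of this
    sketch, not a statement of the estimating-equation derivation).
    """
    diff = p_case - p_control
    se_diff = math.sqrt(se_case ** 2 + se_control ** 2)
    z = diff / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail area
    return diff, se_diff, z, p_value


# Hypothetical estimates for one haplotype in cases and in controls.
example_diff, example_se, example_z, example_p = frequency_difference_test(
    0.30, 0.03, 0.20, 0.04
)
```

  • With these inputs the difference is 0.10 with a standard error of 0.05, giving a Z-statistic of 2 and a two-sided p-value just under 0.05.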
  • the method comprises four parts.
  • the first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group.
  • haplotype frequencies are estimated for the subset of markers in the first block.
  • the second step comprises estimating haplotype frequencies for the subset of markers in the second block.
  • the haplotype frequencies are estimated from a combination of selected and pooled haplotypes for the first block and selected and pooled haplotypes for the second block, wherein a selected block haplotype has greater than a predetermined minimum frequency and a pooled block haplotype is a haplotype that is not selected.
  • Step four comprises sequentially repeating steps two and three, wherein during each repetition the pooled and selected haplotypes for one additional block are added to the selected and pooled haplotypes for the previous combination of blocks.
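  • The split into selected and pooled haplotypes can be sketched in a few lines. The function name, the string labels for haplotypes, and the threshold value are hypothetical; only the rule itself (selected means above the predetermined minimum frequency, pooled means everything else) comes from the description above.

```python
def select_and_pool(freq, min_freq):
    """Split block haplotypes into selected ones and a single pooled category.

    Returns the selected haplotypes with their frequencies and the total
    frequency mass of the haplotypes that were not selected.
    """
    selected = {h: f for h, f in freq.items() if f > min_freq}
    pooled = sum(f for f in freq.values() if f <= min_freq)
    return selected, pooled


# Hypothetical block estimates for a two-SNP block.
example_selected, example_pooled = select_and_pool(
    {"AC": 0.60, "GT": 0.37, "AT": 0.02, "GC": 0.01}, min_freq=0.05
)
```

  • Keeping the pooled mass as one category preserves the total frequency while preventing rare haplotypes from multiplying the number of combinations considered when blocks are joined.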
  • the second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
  • the third part of the method derives one or more haplotypes that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the likelihood derivation used in the EE method in a case-control study is described in the first section of EXAMPLE 4.
  • the motivation of the estimating equation is described in the second section of EXAMPLE 4.
  • the strategy used is to search for haplotype associations progressively, in multiple steps, as shown in the third section of EXAMPLE 4.
  • the permutation of case/control status under the null hypothesis that markers and their haplotypes do not associate with disease status, which is used to compute the maximum Z-statistic, is described in the fourth section of EXAMPLE 4.
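  • The permutation idea can be sketched as follows, under simplifying assumptions that are not part of EXAMPLE 4: each individual's haplotype copy counts are treated as known, and the Z-statistic for each haplotype uses a pooled binomial variance instead of the estimating-equation standard error. Function names, the seed, and the example data are hypothetical.

```python
import math
import random


def max_abs_z(dosages, labels):
    """Largest |Z| over haplotypes for the case-control frequency difference."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    best = 0.0
    for j in range(len(dosages[0])):
        case_copies = sum(d[j] for d, y in zip(dosages, labels) if y == 1)
        ctrl_copies = sum(d[j] for d, y in zip(dosages, labels) if y == 0)
        pooled = (case_copies + ctrl_copies) / (2 * len(labels))
        var = pooled * (1 - pooled) * (1 / (2 * n_case) + 1 / (2 * n_ctrl))
        if var > 0:
            diff = case_copies / (2 * n_case) - ctrl_copies / (2 * n_ctrl)
            best = max(best, abs(diff) / math.sqrt(var))
    return best


def permutation_p_value(dosages, labels, n_perm=200, seed=0):
    """Share of label permutations whose max |Z| reaches the observed value."""
    rng = random.Random(seed)
    observed = max_abs_z(dosages, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case/control status under the null
        if max_abs_z(dosages, shuffled) >= observed - 1e-12:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical data: copies of two haplotypes per individual, 1 = case, 0 = control.
example_dosages = [(2, 0)] * 20 + [(0, 2)] * 20
example_labels = [1] * 20 + [0] * 20
example_perm_p = permutation_p_value(example_dosages, example_labels)
```

  • Taking the maximum over all haplotypes before comparing with the permutation distribution is what accounts for testing many haplotypes at once.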
  • An efficient computational implementation of the estimating equation method for use in the case-control study is described in the fifth section of EXAMPLE 4.
  • FIGURES 7-13 show the estimated allele or haplotype frequencies, estimated odds ratios (ORs), and their 95% confidence intervals for cases and controls.
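  • For orientation, an odds ratio and an approximate 95% confidence interval can be computed from a two-by-two table of chromosome counts with the standard log-odds (Woolf) approximation. This is a textbook calculation swapped in for illustration; it treats haplotype counts as observed and is not the estimating-equation interval used for the figures. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(case_with, case_without, control_with, control_without, z=1.96):
    """Odds ratio with a Woolf-type confidence interval from 2x2 counts."""
    odds_ratio = (case_with * control_without) / (case_without * control_with)
    se_log = math.sqrt(
        1 / case_with + 1 / case_without + 1 / control_with + 1 / control_without
    )
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)


# Hypothetical counts of chromosomes carrying / not carrying a haplotype.
example_or, example_lower, example_upper = odds_ratio_ci(30, 70, 20, 80)
```

  • An interval that includes 1, as in this example, indicates that the data are compatible with no association at the 5% level.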
  • This study shows the feasibility of using the method of the invention with many markers.
  • This EE approach for case-control studies has several important features. First, it does not require family data to infer haplotype information. Of course, if family data are available, they can be used to improve haplotype information.
  • the method estimates haplotype frequencies, and establishes their associations (but not necessarily causal associations) with a phenotype, without the need to assay haplotypes directly.
  • this method is easily generalized to incorporate many markers, and this feature is important in view of increasing knowledge of SNPs in the genome and early evidence for haplotype specific effects.
  • this method is more efficient than the analysis with individual markers, because the actual number of common haplotypes is much smaller than the theoretically possible number of haplotypes.
  • Sixth, because of its computational efficiency, this method is easily scaled to deal with a large number of SNPs (e.g., >100) collected from a large number of subjects (e.g., >1000).
  • This method of the second aspect of the invention, if applied only to controls, has a close connection with Expectation-Maximization (EM) algorithms developed for estimating haplotype frequencies in a general population (Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921-927; Hawley & Kidd (1995) J. Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810).
  • the EE method has the advantage of readily estimating standard errors for haplotype frequency estimates, while an EM algorithm may not.
  • In Arlequin, Excoffier and co-workers used a parametric bootstrap method to estimate standard errors. As expected, the computing burden from the bootstrap is substantial: it is as much as 120 times slower than the EE method.
  • a modification of the EE method may be used if an individual's parental genotypes are also available.
  • the accuracy of the estimates may be improved by incorporating parental genotypes in estimating equation (23) in EXAMPLE 4.
  • the application of such a modified likelihood function will improve the efficiency in deriving distributions of haplotypes and hence the efficiency of the estimation.
  • the improvement can be particularly valuable for estimating frequencies of rare haplotypes, which is not reliable unless a very large number of individuals are genotyped.
  • the EE method may be used to estimate haplotype frequencies for multi-allelic markers, such as microsatellite markers.
  • the method may be modified by incorporating population genetic models.
  • the method of the invention may be extended to analyze quantitative phenotypes.
  • the invention provides a method for assessing trait associations with haplotypes or diplotypes and environmental factors in case-control studies. Interactions between genes and environmental factors have been of interest to genetic epidemiology (Khoury et al. (1993) Fundamentals of Genetic Epidemiology, Oxford University Press, New York). In recent years, researchers in pharmacological research have been very interested in studying the interactions of drugs and genes, a field known as pharmacogenomics. Additionally, researchers in clinical sciences have been interested in personalized medicine in the sense that physicians can prescribe medical treatment based upon patients' genotypes. Characterization of the association between traits and genes, independently and/or interactively with environmental factors, can also improve the prediction, diagnosis, and prognosis of disease or other traits in an individual.
  • a gene may also interact with another gene at a functional level (epistasis). Given the current understanding of genetic regulatory circuitry, it is believed that multiple candidate genes, rather than a single gene, likely play a role in most chronic diseases. Consequently, multiple genes may jointly penetrate to disease phenotype in the form of gene/gene interactions. The interactions could occur as haplotype-haplotype interaction, or as diplotype-diplotype interaction.
  • the method of the invention treats haplotypes, if unknown, as latent variables and constructs an estimating equation by integrating out these latent haplotypes, resulting in an estimating equation for estimating association parameters of interest.
  • the various notations and assumptions, and the estimation procedure, are shown in EXAMPLE 6.
  • the derivation of the estimating equation is described in EXAMPLE 7. Under the assumption that the traits are uncommon, this conditional probability may be approximated as described in EXAMPLE 8.
  • the method of the invention provides an assessment of associations of a trait with haplotypes or with diplotypes, as shown in EXAMPLE 6.
  • the method estimates the association of a trait with a combination of haplotypes or diplotypes and environmental factors.
  • the method also provides estimates of associations of interacting haplotypes or diplotypes with a trait.
  • case/control status is permuted under the null hypothesis that markers and their haplotypes do not associate with trait status to compute the Z-statistic over all selected haplotypes, as shown in EXAMPLE 6.
  • the estimated association parameters are unbiased and inferences retain the desired false error rate, provided that the logistic regression model in equation (26) holds.
  • the method for associating one or more haplotypes or diplotypes for a set of markers at one or more loci and one or more environmental factors with a trait comprises two parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises a series of steps for both individuals in the trait-positive group and individuals in the trait-negative group.
  • each set of markers is divided into a plurality of blocks.
  • a block may contain between about 3 and about 100 markers.
  • the first part of the method comprises the following steps: (1) estimating pattern frequencies for block 1, wherein a pattern comprises a haplotype with or without the one or more environmental factors; (2) estimating pattern frequencies for block 2 haplotypes; (3) estimating the pattern frequencies for a combination of the first and the second blocks using selected and pooled patterns for block 1 and selected and pooled patterns for block 2, wherein a selected pattern has greater than a predetermined minimum frequency and a pooled pattern is a haplotype that is not selected; (4) sequentially repeating steps (2) and (3), wherein during each repetition the patterns for one additional block are added to the previous combination of blocks.
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third aspect of the invention provides a haplotype-based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises a series of steps for both individuals in the trait-positive group and individuals in the trait-negative group.
  • each set of markers is divided into a plurality of blocks.
  • a block may contain between about 3 and about 10 markers.
  • the first part of the method comprises the following steps: (1) estimating pattern frequencies for block 1 haplotypes with or without the one or more environmental factors, wherein a block 1 haplotype comprises the subset of markers in a first block; (2) estimating pattern frequencies for block 2 haplotypes with or without the one or more environmental factors, wherein a block 2 haplotype comprises the subset of markers in a second block; (3) estimating the pattern frequencies for a combination of selected and pooled block 1 haplotypes with or without environmental factors and selected and pooled block 2 haplotypes with or without environmental factors, wherein a selected block haplotype has greater than a predetermined minimum frequency and a pooled block haplotype is a haplotype that is not selected; (4) sequentially repeating steps (2) and (3), wherein during each repetition the pooled and selected haplotypes with or
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third part of the method derives one or more patterns that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the validity of the method of this aspect of the invention was established using
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes.
  • Yet another aspect of the invention provides computer programs and systems for implementing the haplotype-based methods of the invention.
  • the computation steps of the previous methods are implemented on a computer system or on one or more networked computer systems in order to provide a powerful and convenient facility for forming and testing network models of biological systems.
  • the computer system can include but is not limited to a hand-held device, a server computer, a desktop personal computer, a portable computer or a mobile telephone.
  • a representative computer system is a single hardware platform comprising internal components and being linked to external components. The internal components of this computer system include processor elements interconnected with main memory.
  • the computer system includes a processing unit, a display, an input/output (I/O) interface and a mass memory, all connected via a communication bus, or other communication device.
  • the I/O interface includes hardware and software components that facilitate interaction with a variety of the monitoring devices via a variety of communication protocols including TCP/IP, X10, digital I/O, RS-232, RS-485 and the like. Additionally, the I/O interface facilitates communication via a variety of communication mediums including telephone land lines, wireless networks (including cellular, digital and radio networks), cable networks and the like.
  • the I/O interface is implemented as a layer between the server hardware and software applications. It will be understood by one skilled in the relevant art that alternative interface configurations can be practiced with the present invention.
  • the external components include mass storage.
  • the mass memory generally comprises a RAM, ROM, and a permanent mass storage device, such as a hard disk drive, tape drive, optical drive, floppy disk drive, or combination thereof.
  • the mass memory stores an operating system for controlling the operation of the premises server. It will be appreciated that this component can comprise a general-purpose server operating system as is known to those skilled in the art, such as UNIX, LINUX, or Microsoft WINDOWS NT.
  • the memory also includes a WWW browser, such as Netscape's NAVIGATOR or Microsoft's Internet Explorer browsers, for accessing the WWW.
  • This mass storage can be one or more hard disks (which are typically packaged together with the processor and memory).
  • The computer system is also linked to other local computer systems, remote computer systems, or wide area communication networks, such as the Internet. This network link allows the computer system to share data and processing tasks with other computer systems.
  • a software component represents the operating system, which is responsible for managing the computer system and its network interconnections. This operating system can be, e.g., of the Microsoft Windows family, a UNIX operating system, or a LINUX-based operating system. Another software component represents common languages and functions conveniently present on this system to assist programs implementing the methods specific to this invention.
  • Languages that can be used to program the analytic methods of this invention include C, C++, or, less preferably, JAVA.
  • the methods of this invention are programmed in mathematical software packages which allow symbolic entry of equations and high-level specification of processing, including algorithms to be used, thereby freeing a user of the need to procedurally program individual equations or algorithms.
  • Such packages include, e.g., MATLAB from Mathworks (Natick, MA), MATHEMATICA from Wolfram Research (Champaign, Ill.), and MATHCAD from Mathsoft (Cambridge, MA).
  • the analytical methods of the invention can be programmed in a procedural language or symbolic package.
  • the mass memory also stores program code and data for interfacing with various premises monitoring devices, for processing the monitoring device data and for transmitting the data to a central server. More specifically, the mass memory stores a device interface application in accordance with the present invention for obtaining monitoring device data from a variety of devices and for manipulating the data for processing by the central server.
  • the device interface application comprises computer-executable instructions which, when executed by the premises server, obtain and transmit device data as will be explained below in greater detail.
  • the mass memory also stores a data transmittal application program for transmitting the device data to a central server and to facilitate communication between the central server and the monitoring devices. It will be appreciated that these components can be stored on a computer-readable medium and loaded into the memory of the premises server using a drive mechanism associated with the computer-readable medium, such as a floppy, CD-ROM, DVD-ROM drive, or network drive.
  • This example describes the estimating equation (EE) method, along with a computationally efficient algorithm.
  • genotype $g_i$ can be represented by a pair of haplotypes, namely $(H_i^1, H_i^2)$. Under the assumption of the Hardy-Weinberg
  • the function $f(H; \theta)$ may be represented by a multinomial distribution function, given as equation (4).
  • This likelihood function is essentially the same as the likelihood function that was derived in Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921-927. This likelihood function was also used in the two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
  • V is the covariance matrix of the multinomial distribution, which is positive semi-definite and constant over all individuals. Consequently, solving the score estimating equations in (6) for $\theta$ is equivalent to solving the following equations, which are called estimating equations and denoted by $U(\theta)$,
  • we first find the partial haplotypes corresponding to the genotype data $g_i$, $i = 1, \ldots, n$, in the first block.
  • R is the number of haplotypes whose estimated frequencies are above ⁇ in the first block.
  • ⁇ Jl2k ⁇ ⁇ k * ⁇ J ⁇ (k+l)2k
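  • $\Sigma(\theta) = \left(\dfrac{\partial U(\theta)}{\partial \theta}\right)^{-1} \operatorname{var}[U(\theta)] \left[\left(\dfrac{\partial U(\theta)}{\partial \theta}\right)^{-1}\right]'$ (11)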
  • var [[( ?)] ⁇ E p (F t ⁇ g x ; ⁇ E p (F, '
  • the covariance matrix of $\hat\theta$ can be estimated by $\Sigma(\hat\theta)$, and the standard error of $\hat\theta$ is estimated by the square roots of the diagonal elements of $\Sigma(\hat\theta)$.
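The standard-error step above can be sketched numerically. The following is a minimal illustration only, assuming the usual sandwich form (inverse derivative matrix, variance of the estimating function, transposed inverse derivative matrix); the helper names and the toy matrices are hypothetical and are not taken from the patent.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def mat_inv(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sandwich_covariance(d_u, var_u):
    """Assumed sandwich form: D^{-1} V (D^{-1})'."""
    d_inv = mat_inv(d_u)
    return mat_mul(mat_mul(d_inv, var_u), transpose(d_inv))

def standard_errors(cov):
    """Square roots of the diagonal elements of a covariance matrix."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]

# Toy, made-up inputs for two free haplotype frequencies.
d_u = [[-4.0, 0.0], [0.0, -2.0]]     # derivative matrix of U at the estimate
var_u = [[4.0, 0.0], [0.0, 1.0]]     # estimated variance of U
se = standard_errors(sandwich_covariance(d_u, var_u))
```

In practice the derivative and variance matrices would come from the fitted estimating function; plain lists are used here only to keep the sketch dependency-free.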
  • genotypes on some loci cannot be determined and are therefore treated as missing data.
  • the missing genotype is associated with the choice of probe sequences, experimental conditions or random fluctuations. It is reasonable to assume that the missing mechanism is either Missing
  • haplotypes have been identified as disease-associated haplotypes and if the penetrance from the haplotypes to the disease phenotype is specified, an inference of haplotypes from an individual's genotypes at multiple SNP loci can be used to predict the individual's risk of having the disease.
  • phase $f(p_i \mid g_i; \theta)$ given in equation (8), we can predict the phase of an individual's genotype and, therefore, provide a pair of haplotypes with a probability statement.
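The phase prediction described above can be illustrated with a small enumeration. This is a sketch under stated assumptions: biallelic loci coded 0/1, a genotype given as the count of allele 1 at each locus, Hardy-Weinberg equilibrium so that a pair's weight is the product of its two haplotype frequencies, and no missing genotypes. The function names and the toy frequencies are hypothetical.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple with the count of allele 1 (0, 1, or 2) at each locus.
    """
    het = [j for j, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for j, b in zip(het, bits):
            h1[j], h2[j] = b, 1 - b          # heterozygous loci carry opposite alleles
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def phase_posterior(genotype, freqs):
    """Posterior probability of each compatible pair given haplotype frequencies."""
    pairs = compatible_pairs(genotype)
    weights = [freqs.get(h1, 0.0) * freqs.get(h2, 0.0) for h1, h2 in pairs]
    total = sum(weights)
    if total == 0:
        raise ValueError("genotype is incompatible with the supplied frequencies")
    return {pair: w / total for pair, w in zip(pairs, weights)}

# Toy frequencies for a two-locus system (made up for illustration).
freqs = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}
posterior = phase_posterior((1, 1), freqs)   # doubly heterozygous individual
best_pair = max(posterior, key=posterior.get)
```

For a doubly heterozygous individual there are two candidate phases, and the one built from the two frequent haplotypes receives most of the posterior weight.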
  • EXAMPLE 2 Simulation Study This example describes a simulation study to assess the accuracy of estimates under the HWE model assumption and when the model assumption is violated, and when the number of SNPs is large.
  • Sample haplotype frequencies are estimated from the sampled individuals' genotypes with known phase.
  • Step E we estimate haplotype frequencies and standard errors (SE) from the sampled individuals' genotypes using the EE method discussed in EXAMPLE 1 and ignoring the phase information.
  • Step G and Step E we repeat Step G and Step E 500 times after each Step G. Then, we repeat Step G 20 times.
  • HWD(1): we assume that one out of five SNP loci carries a lethal mutant allele and that individuals with a heterozygous genotype (0/1) at that locus have a 50% chance for survival and individuals with a homozygous variant genotype (1/1) have no chance for survival.
  • HWD(2): we assume that one haplotype is associated with a disease and that individuals with one copy of the haplotype have a 75% chance for survival and individuals with two copies have a 50% chance for survival.
  • the simulation results are shown in FIGURE 3.
  • the top panels present the average discrepancy of the estimated haplotype frequencies from the true population frequencies (dashed line) and the average discrepancy of the estimated haplotype frequencies from the sample frequencies (solid line) under the assumption of HWE, HWD(1), and HWD(2), respectively.
  • the discrepancy of the estimated haplotype frequencies from the true population frequencies assesses the overall validity of the final estimates of haplotype frequencies with respect to the true population values.
  • the discrepancy of the estimated haplotype frequencies from the sample frequencies assesses the accuracy of the estimation method. The difference between the two discrepancies is due to sampling error. The difference approaches zero as the sample size approaches infinity.
  • the simulation results show that the estimated haplotype frequencies by the EE method are consistent even when sample sizes are small.
  • the discrepancies are not affected significantly by departures from HWE. These observations are consistent with those reported for the EM method by Fallin and Schork (Fallin & Schork (2000) Amer. J. Hum. Genet. 67:947-959).
  • the bottom panels present the average discrepancy of the estimated SE using the EE method from the sample SD of the estimates. The small discrepancy indicates that the EE method gives correct estimates of standard errors. The discrepancy is not affected by departures from HWE.
  • the average number of common haplotypes for sample sizes of 30, 50, 100, 150, 200, 500, and 1,000 is 2.3, 3.6, 4.7, 5.0, 5.3, 11.8, and 20.9 under HWE, 2.1, 3.3, 4.3, 4.8, 5.1, 10.5, and 17.2 under HWD(1), and 2.4, 3.6, 4.6, 5.0, 5.2, 10.7, and 20.0 under HWD(2).
  • the number of common haplotypes increases with sample size because a haplotype is considered "common" if its estimated frequency multiplied by twice the sample size is at least 5.
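The "common haplotype" rule stated above is simple enough to write down directly. The sketch below uses hypothetical function names and shows how the implied frequency cutoff shrinks as the sample grows, which is why more haplotypes qualify as common in larger samples.

```python
def is_common(estimated_freq, n_individuals, min_expected_count=5):
    """A haplotype is common if freq * 2n (expected chromosome count) is at least 5."""
    return estimated_freq * 2 * n_individuals >= min_expected_count

def common_frequency_cutoff(n_individuals, min_expected_count=5):
    """Smallest estimated frequency that still counts as common for a given sample size."""
    return min_expected_count / (2 * n_individuals)

# Cutoffs for the sample sizes used in the simulation study.
cutoffs = {n: common_frequency_cutoff(n) for n in (30, 50, 100, 150, 200, 500, 1000)}
```

With 30 individuals a haplotype needs an estimated frequency of roughly 8.3% to be common, whereas with 1,000 individuals 0.25% suffices.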
  • Step S we generate genotypes for 100 individuals. Each individual's genotype is generated from the generated haplotype population under the assumption of HWE. The rest of the simulation procedure is as discussed at the beginning of this example. The simulation result is shown in FIGURE 4. Similarly, the top panels show the average discrepancy of the estimated haplotype frequencies from the population haplotype frequencies (dashed line) and the average discrepancy of the estimated haplotype frequencies from the sample haplotype frequencies (solid line).
  • the bottom panels show the average discrepancy of the estimated standard errors of the estimates of haplotype frequencies from the sample standard deviation of the estimates of 500 replicates.
  • the average number of common haplotypes observed from 100 individuals' genotypes is between 6 and 12 regardless of the recombination rate.
  • the running time depends linearly on the number of SNPs, the number of subjects and complexity of the data set.
  • SNPs were with complete genotypes on all 44 individuals, the sample size used in that study. The rest of the SNPs were missing genotypes in one or more individuals. Because only our EE method and HAPLOTYPER can handle the missing genotype data, we first analyzed the whole data set using the two methods. The EE method found 20 haplotypes and HAPLOTYPER found 21 haplotypes.
  • Sixteen haplotypes were found by both methods.
  • the haplotypes that were found by only one method had estimated frequencies less than 1%.
  • the SEs of the estimates for common haplotypes were estimated by the EE method and are given in parentheses after the corresponding estimates of haplotype frequency. Comparing the estimates of haplotype frequencies using the EE method and HAPLOTYPER, one can see that HAPLOTYPER produced a higher estimate for the more frequent haplotype than the EE method.
  • the reason for this phenomenon is that the haplotype frequencies are estimated from the best reconstruction of individuals' haplotypes. Because each individual contributes paired haplotypes, the frequencies of other frequent haplotypes might be underestimated. If we use the reconstructed individuals' haplotypes for a subsequent analysis (here the estimate of population haplotype frequencies), misclassification errors can result in biased estimates.
  • EXAMPLE 4 The Use of the EE Method in a Case-Control Study 1.
  • the phase (parental origin) of an individual allele, e.g., g l , within a subject may or may not be known.
  • $p_i = (p_{i1}, p_{i2}, \ldots, p_{iq})$ denote a vector of phase indicators; p..
  • denotes parameters of interest to be estimated from the data, including haplotype frequencies and their differences between cases and within controls. From the likelihood (13), it is clear that the derivation of the likelihood function is equivalent to the derivation of an individual distributional function.
  • the distributional function $f(g_i \mid p_i, d_i)$ specifies the association of genotypes and hence haplotypes with the disease phenotype.
  • ⁇ d ( ) may be represented by a multinomial distribution function:
  • $\theta_t(d_i)$ denotes the frequency of the $t$th haplotype
  • $\theta_t(d_i) = \alpha_t + \beta_t d_i$, (18) where $\alpha_t$ represents the frequency of the $t$th haplotype in controls, and $(\alpha_t + \beta_t)$ represents the frequency in cases.
  • the difference $\beta_t$, if not equal to zero, indicates that the $t$th haplotype frequency is different in cases and controls.
  • odds ratio is more commonly used as a measure of association.
  • the above likelihood function encompasses the haplotype likelihood function derived by Excoffier and Slatkin (1995), when the disease status is ignored.
  • the score estimating equation (21) can be solved for the maximum likelihood estimate, but the computational burden of enumerating all possible phases
  • the estimate and its standard error can then be used to construct a 95% confidence interval for the difference between cases and controls of the $t$th haplotype as $[\hat\beta_t - 1.96\,\hat\sigma_{\beta_t},\ \hat\beta_t + 1.96\,\hat\sigma_{\beta_t}]$.
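The interval above is a standard Wald interval, so it can be sketched in a few lines. The function names and the example numbers below are hypothetical; the estimate is read as the case-control difference in frequency for one haplotype, and the Z-statistic is taken to be the estimate divided by its standard error.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Interval estimate +/- z * SE; z = 1.96 gives the 95% interval."""
    return (estimate - z * std_error, estimate + z * std_error)

def z_statistic(estimate, std_error):
    """Estimate divided by its standard error."""
    return estimate / std_error

# Made-up example: a 5-point frequency difference with a standard error of 2 points.
diff, se_diff = 0.05, 0.02
lower, upper = wald_interval(diff, se_diff)
z_value = z_statistic(diff, se_diff)
excludes_zero = lower > 0 or upper < 0
```

An interval that excludes zero corresponds to a haplotype whose frequency differs between cases and controls at the 5% level.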
  • Constructing a test statistic based upon the estimating function U( ⁇ , ⁇ ) is also possible.
  • When analyzing multiple SNPs, one has to rely on a systematic strategy for analyzing all potential haplotype-based associations. While an optimal strategy remains to be investigated, we consider an analytic strategy that searches for haplotype associations progressively, in multiple steps. The first step is to examine allelic association with the case/control disease phenotype. The second step is to correlate haplotypes of two adjacent SNPs with the disease phenotype. Progressively, one can add more SNPs in assessing associations of phenotype with haplotypes of three, four, or more adjacent SNPs. The primary rationale is that physical adjacency implies linkage disequilibrium and hence haplotype-based associations with adjacent SNPs.
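The progressive search over adjacent SNPs can be laid out as a plan of locus windows. This sketch only enumerates which loci would be analyzed together at each step; the function names are hypothetical and the association test itself is not shown.

```python
def adjacent_windows(n_loci, width):
    """All windows of `width` adjacent loci, using 1-based locus numbers."""
    return [tuple(range(start, start + width))
            for start in range(1, n_loci - width + 2)]

def progressive_search_plan(n_loci, max_width=None):
    """Map each window width (1 = single-SNP step) to its list of adjacent windows."""
    max_width = n_loci if max_width is None else max_width
    return {width: adjacent_windows(n_loci, width)
            for width in range(1, max_width + 1)}

# Six loci: single SNPs first, then pairs, triples, and so on.
plan = progressive_search_plan(6)
```

For six loci this gives five adjacent pairs and four adjacent triples, so the number of windows per step falls as the window widens.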
  • equation (25) is iterated until convergence in all parameters is reached, thus obtaining the estimate ⁇ .
  • a similar procedure can be applied to cases to yield the estimate $(\hat\alpha + \hat\beta)$.
  • the choice of the initial value is important, given the high dimensionality of many haplotypes.
  • the second issue relates to the convergence criteria.
  • convergence in the estimating function $U(\alpha, \beta)$ (23)
  • convergence in estimates we examine whether the estimated parameters stabilize and if the increments of updating parameters approach zero.
  • the third technical issue relates to the numerical derivative matrix of the estimating function $U(\alpha, \beta)$.
  • $\partial U(\alpha, \beta)/\partial\alpha$: we calculate the difference of $U(\alpha + \delta\alpha, \beta)$ and $U(\alpha, \beta)$ divided by a small $\delta\alpha$, where $\delta$ may be chosen as $10^{-5}$.
  • the primary reason for choosing $\delta\alpha$ instead of just $\delta$ is to ensure precision when $\alpha$ is small.
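A forward-difference derivative matrix of this kind can be sketched as follows. One reading of the passage above, adopted here as an assumption, is that the step for each parameter is scaled by the parameter's own value so that small parameters get proportionally small steps. The function names, the fallback for a parameter equal to zero, and the toy estimating function are all hypothetical.

```python
def numerical_jacobian(u_func, params, delta=1e-5):
    """Forward-difference derivative matrix of u_func at params.

    The step for parameter j is delta * params[j] (a relative step);
    an absolute step of delta is used if the parameter is exactly zero.
    """
    base = u_func(params)
    jac = [[0.0] * len(params) for _ in base]
    for j, value in enumerate(params):
        step = delta * value if value != 0 else delta
        shifted = list(params)
        shifted[j] = value + step
        moved = u_func(shifted)
        for i in range(len(base)):
            jac[i][j] = (moved[i] - base[i]) / step
    return jac

def toy_u(p):
    """Stand-in estimating function with a known analytic derivative."""
    return [p[0] ** 2 - 4.0, p[0] * p[1] - 6.0]

jac = numerical_jacobian(toy_u, [2.0, 3.0])   # analytic value: [[4, 0], [3, 2]]
```

The resulting matrix would feed the iterative update of the parameters, and the same routine could be reused for the standard-error calculation.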
  • EXAMPLE 5 This example compares the use of the EE method and the EM method using data from an actual case-control study.
  • DNA samples from all 779 patients were genotyped using a panel of biallelic sites that are candidate markers for atherosclerotic disease risk.
  • This panel represents an expansion of a previously described assay (Cheng et al. (1999) Genome Res. 9:936-949). Briefly, the assay uses multiplex PCR and immobilized sequence-specific probes to detect amplified alleles. All probes have been validated using DNAs of known genotype, as confirmed by sequencing. Most of the candidate genes in the panel were represented by one or two polymorphisms, but multiple sites were typed in a few genes, including apolipoprotein CIII (APOC3), a gene encoding a component of plasma lipoproteins.
  • APOC3 apolipoprotein CIII
  • APOC3 was represented by six biallelic sites: three in the promoter region, one in exon 3, and two in the 3'-untranslated region of exon 4 (Cheng et al. (1999) Genome Res. 9:936-949). APOC3 thus offered us an opportunity to illustrate the estimating equation approach.
  • Paired SNP loci are listed row-wise, (1,2), (2,3), (3,4), (4,5), (5,6), where the locus numbers are associated with C(-641)A, C(-482)T, T(-455)C, C1100T, C3175G, T3206G, respectively.
  • All possible haplotypes, except the reference haplotype 11, are listed column-wise.
  • haplotype 00 for locus 1,2 is equivalent to the haplotype AT at C(-641)A, C(-482)T.
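The numeric haplotype codes can be translated back to alleles mechanically. The sketch below assumes, based on the example of haplotype 00 corresponding to AT, that digit 1 stands for the first-listed allele of each site and digit 0 for the second-listed allele; the table layout and function name are hypothetical.

```python
# Locus number -> (site name, allele coded 1, allele coded 0).
# The coding direction is inferred from the worked example, not stated outright.
LOCI = {
    1: ("C(-641)A", "C", "A"),
    2: ("C(-482)T", "C", "T"),
    3: ("T(-455)C", "T", "C"),
    4: ("C1100T", "C", "T"),
    5: ("C3175G", "C", "G"),
    6: ("T3206G", "T", "G"),
}

def decode_haplotype(code, loci):
    """Translate a 0/1 haplotype code over the given loci into an allele string."""
    if len(code) != len(loci):
        raise ValueError("code length must match the number of loci")
    alleles = []
    for digit, locus in zip(code, loci):
        _, first_allele, second_allele = LOCI[locus]
        if digit == "1":
            alleles.append(first_allele)
        elif digit == "0":
            alleles.append(second_allele)
        else:
            raise ValueError("haplotype codes may contain only 0 and 1")
    return "".join(alleles)

pair_example = decode_haplotype("00", (1, 2))      # the example given in the text
reference_pair = decode_haplotype("11", (1, 2))    # the reference haplotype
```

Under this coding the reference haplotype 11 at loci 1 and 2 reads CC.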
  • FIGURE 9(a) shows the distribution of maximized test statistics after 1,000 permutations. It appears that the estimated maximum test statistic centers around 2 with a heavy tail towards the right. Within the same 1,000 permutations, we computed maximum Z-statistics at every locus (FIGURE 9(b), solid line). Based upon this distribution, we estimated the threshold values for 90% and 95% significance levels (dotted lines). At loci 4 and 5, the observed Z-statistic remains significant at the 90% level, but not at the 95% level. That is, the statistical significance associated with the haplotype TC at C1100T and C3175G is between 90% and 95%.
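The permutation idea behind these thresholds can be sketched as follows. This is an illustration only: it swaps in a simple two-proportion Z-statistic on 0/1 carrier indicators in place of the estimating-equation Z-statistic, takes the maximum over all columns rather than per locus, and uses made-up data. All names are hypothetical.

```python
import math
import random

def two_proportion_z(indicator, status):
    """Z-statistic comparing carrier proportions in cases (1) and controls (0)."""
    n_case = sum(status)
    n_ctrl = len(status) - n_case
    p_case = sum(x for x, d in zip(indicator, status) if d == 1) / n_case
    p_ctrl = sum(x for x, d in zip(indicator, status) if d == 0) / n_ctrl
    pooled = sum(indicator) / len(indicator)
    var = pooled * (1 - pooled) * (1 / n_case + 1 / n_ctrl)
    return 0.0 if var == 0 else (p_case - p_ctrl) / math.sqrt(var)

def max_abs_z(columns, status):
    """Largest absolute Z-statistic over all candidate haplotype columns."""
    return max(abs(two_proportion_z(col, status)) for col in columns)

def permutation_thresholds(columns, status, n_perm=1000, levels=(0.90, 0.95), seed=0):
    """Empirical quantiles of the maximum statistic under permuted case/control labels."""
    rng = random.Random(seed)
    shuffled = list(status)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max_abs_z(columns, shuffled))
    maxima.sort()
    return {lv: maxima[min(n_perm - 1, math.ceil(lv * n_perm) - 1)] for lv in levels}

# Made-up data: 10 cases followed by 10 controls, two candidate haplotypes.
status = [1] * 10 + [0] * 10
columns = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
]
observed = max_abs_z(columns, status)
thresholds = permutation_thresholds(columns, status, n_perm=200)
```

Comparing the observed maximum with the permutation quantiles accounts for testing several haplotypes at once, which is the purpose of taking the maximum inside each permutation.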
  • Table 5 in FIGURE 10 shows Z-statistics, odds ratios and their 95% confidence intervals for haplotype associations with three adjacent SNPs.
  • Table 5 lists four possible adjacencies, i.e., 123, 234, 345 and 456, in the columns, while all possible haplotypes, excluding 111, which is the reference haplotype, are listed by row.
  • Rare haplotypes were merged with more common haplotypes, and are thus omitted from the table.
  • the rare haplotype 110 at locus 123 i.e., haplotype CCC at C(-641)A, C(-482)T and T(-455)C, was merged with other uncommon haplotypes.
  • haplotype 010 at locus 456 i.e., haplotype TCG at C1100T, C3175G and T3206G
  • FIGURE 9(c) the distribution of maximum Z statistics
  • FIG. 9(d) the threshold values at 90% and 95% levels
  • haplotype 0110 at locus 3456 i.e., haplotype CCCG at T(-455)C, C1100T, C3175G and T3206G, has a reduced risk associated with the phenotype.
  • Haplotype TCG (010) which was significant at locus 456, extends to haplotypes CTCG (0010) or TTCG (1010) at locus 3456. That neither haplotype was significantly associated with the phenotype is inconsistent with TCG at locus 456 identified earlier.
  • Table 7 in FIGURE 12 and Table 8 in FIGURE 13 show Z statistics, odds ratios and their confidence intervals for haplotype associations with five and six SNP loci, respectively. It appears that maximum Z-statistics become much smaller, even though the 95% CIs seem to suggest several marginally significant associations.
  • This exercise shows the feasibility of performing haplotype analysis with many SNPs. Indeed, when the number of SNP loci increases, the total number of haplotypes does not increase exponentially with the number of loci. The number of common haplotypes appears to be around 10, even though the theoretical number of all possible haplotypes could be far greater. This result is consistent with the recent observations by Drysdale et al. ((2000) Proc. Natl. Acad. Sci. U.S.A. 97:10483-10488).
  • Table 9 in FIGURE 14 lists estimated haplotype frequencies and their standard errors. Estimated frequencies by both methods are identical up to the 5th decimal place, so only one column of haplotype frequencies is shown, in the second column of Table 9.
  • the third and fourth columns in Table 9 show estimated standard errors by the EM algorithm and the EE approach. While the estimates obtained by the EM algorithm and the EE approach are largely consistent, the EE approach yields consistently smaller estimates for standard errors. The smaller standard errors reflect the efficiency of the EE approach in estimating haplotype frequencies. Nevertheless, it is quite comforting to learn that both the EM algorithm and the EE approach yield comparable results. However, the EM approach was computationally less efficient. In this application, the EE computations were at least 120 times faster than EM. Computation speed is important in the design of analytic strategies for dealing with complex situations: 1) dealing with many more SNPs, 2) testing many combinations of SNP loci, adjacent or non-adjacent, and 3) using a permutation method to estimate significance level.
  • EXAMPLE 6 Trait Associations with SNP Haplotypes and Environmental Variables This example describes the application of the EE method for assessing trait associations with SNP haplotypes and environmental variables.
  • the analytic objective is to correlate haplotypes of multiple SNPs with the disease phenotype, i.e., the penetrance from haplotypes and covariates to disease phenotype.
  • penetrance function we assume a logistic regression that relates haplotypes and covariables with disease phenotype via:
  • $I(h_i^1, h_i^2, x_i, \beta)$ is a function of haplotypes and covariables $(h_i^1, h_i^2, x_i)$ with $\beta$ as regression coefficients.
  • this function $I(h_i^1, h_i^2, x_i, \beta)$ may take the following form:
  • the rationale for choosing the logistic regression function includes: 1) regression coefficients $(\beta_1, \beta_2)$ are readily interpretable as log odds ratios, and log odds ratios approximate log relative risk when the disease incidence rate is low (Prentice & Pyke (1979) Biometrika 66(3):403-11); 2) the logistic regression technique has been well studied in the biostatistical literature and its statistical properties are well known; and 3) logistic regression is routinely applied in epidemiological studies to interpret results from case-control studies (Rothman & Greenland (1998) Modern Epidemiology, Lippincott-Raven Publishers, Philadelphia), and is thus readily accepted to study gene/environmental interactions.
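The logistic penetrance model and the log-odds-ratio interpretation of its coefficients can be illustrated numerically. This sketch assumes a hypothetical additive coding in which each of the two haplotypes contributes its own coefficient and covariates enter linearly; the specific linear predictor used in the source is not reproduced here, and all names and values are made up.

```python
import math

def penetrance(intercept, linear_predictor):
    """Logistic disease probability: exp(a + I) / (1 + exp(a + I))."""
    return 1.0 / (1.0 + math.exp(-(intercept + linear_predictor)))

def additive_predictor(hap1, hap2, covariates, hap_effects, cov_effects):
    """Hypothetical additive coding: haplotype effects add up, plus covariate terms."""
    hap_part = hap_effects.get(hap1, 0.0) + hap_effects.get(hap2, 0.0)
    cov_part = sum(b * x for b, x in zip(cov_effects, covariates))
    return hap_part + cov_part

def odds(p):
    return p / (1.0 - p)

# Made-up values: one risk haplotype with log odds ratio 0.7, no covariates.
hap_effects = {"risk": 0.7}
p_noncarrier = penetrance(-3.0, additive_predictor("ref", "ref", [], hap_effects, []))
p_carrier = penetrance(-3.0, additive_predictor("risk", "ref", [], hap_effects, []))
odds_ratio = odds(p_carrier) / odds(p_noncarrier)
relative_risk = p_carrier / p_noncarrier
```

With a low baseline probability the relative risk stays close to the odds ratio, which is the first rationale listed above.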
  • the logistic regression defined by equation (26) is a penetrance function quantifying the disease probability given paired haplotypes and covariates, and is not directly estimable from case-control studies.
  • the intercept $\alpha$, specifying the baseline prevalence rate, is unspecified in case-control studies (Prentice & Pyke (1979) Biometrika 66:403-11; Whittemore (1995) Biometrika 82:57-67).
  • the analytic objective is to estimate parameters ( , ?) via an estimating equation, the derivation for which is detailed in EXAMPLE 7.
  • an estimating function for ( ⁇ , ⁇ ) if phases were known. Since phases are largely unknown, one can treat them as latent variables. After obtaining a posterior distribution of phases given all observed data (phenotypes, genotypes and covariates), one can integrate out these latent haplotypes via the conditional expectation of the estimating function given the observed data. Setting the integrated estimating function to zero results in an estimating equation for estimating ( ⁇ , ⁇ ) . 2.
  • $X_i$ is the partial derivative of $I(h_i^1, h_i^2, x_i, \beta)$ with respect to $\beta$
  • and $f(p_i \mid g_i, d_i, x_i)$ is the posterior probability of latent phases given observed data.
  • 0 is a matrix of appropriate dimension
  • conditional means, variances and covariances are computed in the usual way.
  • Raphson method to estimate all parameters that satisfy the estimating equation (32). Starting from an initial value ( ⁇ (0) , ⁇ (0) , ⁇ (0) ) , one can iterate to a new value
  • the covariance matrix is easily estimable, and may be written as
  • I( ⁇ , ⁇ , ⁇ ) ⁇ var[u( ⁇ , ⁇ , ⁇ )]l( ⁇ , ⁇ , ⁇ )- 1 , (34) in which all quantities are evaluated at their respective estimates with a variance matrix va ⁇ [u( ⁇ , ⁇ , ⁇ )] is estimated by ⁇ . i
  • test statistic based upon the estimating function U( ⁇ , ⁇ , ⁇ ) (32).
  • Haplotype-Specific Effects An immediate interest is to assess the disease associations with haplotypes.
  • Let h denote H common haplotypes of interest.
  • To test this null hypothesis one may use the above Z-statistics (35).
  • Diplotype-Specific Effects: While haplotype-based associations are of interest, the disease association could also be genotype-specific, that is, the disease associations are with one or more genotypes, formed by paired haplotypes (diplotypes). Diplotype-based associations may be categorized into four different penetrance modes: dominant, recessive, additive, or co-dominant. To capture the mode of diplotype associations, one needs to re-code corresponding genotypes under each mode of penetrance. Suppose that h is the target haplotype of interest. Under a dominant mode, we use the following indicator function,
  • $K(h_i^1, h_i^2) = 1$ if one of $h_i^1$ and $h_i^2$ equals $h$
  • $I(h_i^1, h_i^2, x_i, \beta) = \beta_{11} K_1(h_i^1, h_i^2) + \beta_{12} K_2(h_i^1, h_i^2) + \beta_2' x_i$ and
  • $h_1$ and $h_2$ are two co-dominant haplotypes under consideration. Indeed, this model encompasses both dominant and additive modes. Specifically, if one of the two coefficients $(\beta_{11}, \beta_{12})$ equals zero, the model implies the dominant mode. If the two coefficients $(\beta_{11}, \beta_{12})$ are equal, the above model implies the additive mode.
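The re-coding of diplotypes under different penetrance modes can be sketched with three small functions. The dominant indicator follows the description above (at least one copy of the target haplotype); the recessive and additive codings use the conventional genetic definitions, which is an assumption since their formulas are not shown here. Function names are hypothetical.

```python
def dominant_code(hap1, hap2, target):
    """1 if at least one of the two haplotypes equals the target, else 0."""
    return int(hap1 == target or hap2 == target)

def recessive_code(hap1, hap2, target):
    """1 only if both haplotypes equal the target (conventional definition)."""
    return int(hap1 == target and hap2 == target)

def additive_code(hap1, hap2, target):
    """Number of copies of the target haplotype: 0, 1, or 2 (conventional definition)."""
    return int(hap1 == target) + int(hap2 == target)

# Codes for the three possible carrier states of a target haplotype "TC".
diplotypes = [("CC", "CC"), ("TC", "CC"), ("TC", "TC")]
coding_table = {
    pair: (dominant_code(*pair, "TC"),
           recessive_code(*pair, "TC"),
           additive_code(*pair, "TC"))
    for pair in diplotypes
}
```

The resulting indicator would enter the linear predictor in place of the raw haplotype pair, so the same fitting procedure can be rerun under each mode.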
  • $\beta_3 = (\beta_{31}, \ldots, \beta_{3H})'$ quantifies the interactions of all candidate haplotypes with the dosage. Indeed, one can postulate other models to depict interactions that may be dominant, recessive or co-dominant, besides the additive mode described above. Interactions among Candidate Genes: In addition to gene/environmental interactions, one candidate gene may also interact with another candidate gene at a functional level, which is known as epistasis. The interactions could occur as haplotype-haplotype interaction, or as genotype-genotype interaction.
  • K 2 (h] 2 ,h 2 2 ) are two indicator functions for candidate genotype 1 and 2, respectively, and the log odds ratio ⁇ i quantifies the interaction of interest.
  • TDT Transmission Disequilibrium Test
  • the idea is to permute case/control status under the null hypothesis that SNPs and their haplotypes do not associate with disease status.
  • This example describes the derivation of the estimating equations.
  • EXAMPLE 8 Approximation of Conditional Probability This example describes the approximation of the conditional probability.
  • the probability function may be written as ft ⁇
  • the marginal disease probability Pr(-/, l
  • the disease probability is approximated by $\exp[\alpha + I(h_i^1, h_i^2, x_i, \beta)]$
  • $\Pr(d_i = 1 \mid h_i^1, h_i^2, x_i) = \dfrac{\exp[\alpha + I(h_i^1, h_i^2, x_i, \beta)]}{1 + \exp[\alpha + I(h_i^1, h_i^2, x_i, \beta)]} \approx \exp[\alpha + I(h_i^1, h_i^2, x_i, \beta)]$, because $\exp[\alpha + I(h_i^1, h_i^2, x_i, \beta)]$ is much smaller than one. Substituting these approximations into the above probability function, one obtains, for cases, an approximated function,
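The quality of this rare-disease approximation is easy to check numerically. In the sketch below, `eta` is a hypothetical name for the intercept plus linear predictor; the relative error of replacing the logistic probability by the plain exponential works out to the exponential itself, so the approximation is good exactly when that quantity is much smaller than one.

```python
import math

def logistic_probability(eta):
    """Exact disease probability exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def rare_disease_approximation(eta):
    """Approximate disease probability exp(eta), valid when exp(eta) << 1."""
    return math.exp(eta)

def relative_error(eta):
    """(approximation - exact) / exact; algebraically this equals exp(eta)."""
    exact = logistic_probability(eta)
    return (rare_disease_approximation(eta) - exact) / exact

# Relative error for a rare disease (eta = -5) and a common one (eta = -1).
error_rare = relative_error(-5.0)
error_common = relative_error(-1.0)
```

At `eta = -5` the error is under one percent, while at `eta = -1` it is over a third, which marks where the approximation stops being reasonable.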
  • f(p t ⁇ g,,d l ,x t ) may be represented by exp ⁇ /?, 1 ,/?, 2 ⁇ ,, ⁇ )]/ ⁇ , 1 ,/?, 2 ) f(P, ⁇ g handedd constrainx,) ⁇
  • EXAMPLE 9 Derivation of Derivative Matrix for the Joint Estimating Equation This example describes the derivation of the derivative matrix for the joint estimating equations.
  • ⁇ Ai d i- ⁇ dfiPAgi ⁇ A - ⁇ EiX ⁇ ' ⁇ XA ⁇ d ⁇ g ⁇ + ⁇ ⁇ X ⁇ - ⁇ -fip ⁇ g ⁇ x,)
  • This example describes a Monte Carlo simulation study.
  • the first statistic is the bias in estimating log odds ratios.
  • $\hat\beta_j^r$ denote the $j$th estimated regression coefficient in (26) from the $r$th replicate.
  • the second statistic is the bias in estimating standard errors.
  • $SE_j^r = SE(\hat\beta_j^r)$ denote the estimated standard error for the $j$th log odds ratio $\hat\beta_j^r$ from the $r$th replicate.
  • the third statistic is the coverage probability, measuring how frequently the confidence interval $[\hat\beta_j^r - z_\epsilon SE_j^r,\ \hat\beta_j^r + z_\epsilon SE_j^r]$ at the significance level of $\epsilon$ covers the true value $\beta_j$.
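The three summary statistics can be computed from a list of replicate results as sketched below. The names are hypothetical, the bias in standard errors is taken here as the mean estimated standard error minus the sample standard deviation of the estimates (one common convention, assumed rather than quoted), and the replicate values are made up.

```python
import math

def mean_bias(estimates, true_value):
    """Average estimate minus the true log odds ratio."""
    return sum(estimates) / len(estimates) - true_value

def se_bias(estimates, std_errors):
    """Mean estimated SE minus the sample SD of the estimates (assumed convention)."""
    mean = sum(estimates) / len(estimates)
    sample_sd = math.sqrt(sum((b - mean) ** 2 for b in estimates) / (len(estimates) - 1))
    return sum(std_errors) / len(std_errors) - sample_sd

def coverage_probability(estimates, std_errors, true_value, z=1.96):
    """Fraction of replicates whose interval estimate +/- z * SE contains the truth."""
    hits = sum(1 for b, se in zip(estimates, std_errors)
               if b - z * se <= true_value <= b + z * se)
    return hits / len(estimates)

# Four made-up replicates under the null hypothesis (true log odds ratio of 0).
estimates = [0.1, -0.2, 0.5, 0.0]
std_errors = [0.1, 0.1, 0.1, 0.1]
bias = mean_bias(estimates, 0.0)
coverage = coverage_probability(estimates, std_errors, 0.0)
```

A real simulation would use hundreds of replicates, at which point a well-calibrated method should show coverage near the nominal level.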
  • the rationale for choosing the coverage probability is that it is a reliable measure under both the null and alternative hypotheses. If the estimation and inference are appropriate, the coverage probability should be around $1 - \epsilon$. 3. Under the Null Hypothesis with No Covariates
  • For consistency throughout the simulation studies, we always use the haplotype with the highest haplotype frequency as the reference. In the current simulation, we focus on three haplotypes, which are considered common haplotypes in the entire population. In this case, we simulate phenotypes in the general population by the following penetrance function:

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Ecology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides methods for estimating haplotype frequencies using an estimating equation method and a forward block algorithm. These methods provide a standard error measure for each estimated haplotype frequency. The methods can be used, for example, to estimate a set of haplotype frequencies in a population, to assess the association of a haplotype with a trait, and to assess the association of haplotypes for one or more sets of markers and one or more environmental factors with a trait. The invention also provides a computer-readable medium having computer-readable instructions for carrying out the methods of the invention, and a computer system for carrying out the methods of the invention.
PCT/US2003/031186 2002-10-01 2003-10-01 Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables WO2004031912A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003282907A AU2003282907A1 (en) 2002-10-01 2003-10-01 Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US41502802P 2002-10-01 2002-10-01
US60/415,028 2002-10-01

Publications (2)

Publication Number Publication Date
WO2004031912A2 true WO2004031912A2 (fr) 2004-04-15
WO2004031912A3 WO2004031912A3 (fr) 2004-08-05

Family

ID=32069800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/031186 WO2004031912A2 (fr) 2002-10-01 2003-10-01 Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables

Country Status (2)

Country Link
AU (1) AU2003282907A1 (fr)
WO (1) WO2004031912A2 (fr)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7797302B2 (en) 2007-03-16 2010-09-14 Expanse Networks, Inc. Compiling co-associating bioattributes
US7917438B2 (en) 2008-09-10 2011-03-29 Expanse Networks, Inc. System for secure mobile healthcare selection
US8200509B2 (en) 2008-09-10 2012-06-12 Expanse Networks, Inc. Masked data record access
US20130332081A1 (en) * 2010-09-09 2013-12-12 Omicia Inc Variant annotation, analysis and selection tool
WO2014089356A1 (fr) * 2012-12-05 2014-06-12 Genepeeks, Inc. Système et procédé de prédiction informatique de l'expression de phénotypes monogéniques
US8788286B2 (en) 2007-08-08 2014-07-22 Expanse Bioinformatics, Inc. Side effects prediction using co-associating bioattributes
WO2014144032A2 (fr) * 2013-03-15 2014-09-18 The Broad Institute, Inc. Systèmes et procédés pour identifier des gènes significativement mutés
US9031870B2 (en) 2008-12-30 2015-05-12 Expanse Bioinformatics, Inc. Pangenetic web user behavior prediction system
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DRYSDALE C.M. ET AL: 'Complex promoter and coding region beta2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness' PNAS vol. 97, no. 19, 12 September 2000, pages 10483 - 10488, XP002940094 *
EXCOFFIER L. ET AL: 'Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population' MOLECULAR BIOLOGY AND EVOLUTION vol. 12, no. 5, 1995, pages 921 - 927, XP002953528 *
FALLIN D. ET AL: 'Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data' J. OF HUMAN GENETICS vol. 67, October 2000, pages 947 - 959, XP002951850 *
NIU T. ET AL: 'Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms' AM.J. OF HUMAN GENETICS vol. 70, January 2002, pages 157 - 169, XP002978163 *
PATIL N. ET AL: 'Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21' SCIENCE vol. 294, 23 November 2001, pages 1719 - 1723, XP002965310 *
STEPHENS M. ET AL: 'A new statistical method for haplotype reconstruction from population data' AM. J. OF HUMAN GENETICS vol. 68, April 2001, pages 978 - 989, XP002955780 *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10991467B2 (en) 2007-03-16 2021-04-27 Expanse Bioinformatics, Inc. Treatment determination and impact analysis
US11735323B2 (en) 2007-03-16 2023-08-22 23Andme, Inc. Computer implemented identification of genetic similarity
US7844609B2 (en) 2007-03-16 2010-11-30 Expanse Networks, Inc. Attribute combination discovery
US12106862B2 (en) 2007-03-16 2024-10-01 23Andme, Inc. Determination and display of likelihoods over time of developing age-associated disease
US7933912B2 (en) 2007-03-16 2011-04-26 Expanse Networks, Inc. Compiling co-associating bioattributes using expanded bioattribute profiles
US7941329B2 (en) 2007-03-16 2011-05-10 Expanse Networks, Inc. Insurance optimization and longevity analysis
US7941434B2 (en) 2007-03-16 2011-05-10 Expanse Networks, Inc. Efficiently compiling co-associating bioattributes
US8024348B2 (en) 2007-03-16 2011-09-20 Expanse Networks, Inc. Expanding attribute profiles
US8051033B2 (en) 2007-03-16 2011-11-01 Expanse Networks, Inc. Predisposition prediction using attribute combinations
US8099424B2 (en) 2007-03-16 2012-01-17 Expanse Networks, Inc. Treatment determination and impact analysis
US11791054B2 (en) 2007-03-16 2023-10-17 23Andme, Inc. Comparison and identification of attribute similarity based on genetic markers
US8209319B2 (en) 2007-03-16 2012-06-26 Expanse Networks, Inc. Compiling co-associating bioattributes
US8606761B2 (en) 2007-03-16 2013-12-10 Expanse Bioinformatics, Inc. Lifestyle optimization and behavior modification
US10957455B2 (en) 2007-03-16 2021-03-23 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US10896233B2 (en) 2007-03-16 2021-01-19 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US11621089B2 (en) 2007-03-16 2023-04-04 23Andme, Inc. Attribute combination discovery for predisposition determination of health conditions
US11600393B2 (en) 2007-03-16 2023-03-07 23Andme, Inc. Computer implemented modeling and prediction of phenotypes
US11581098B2 (en) 2007-03-16 2023-02-14 23Andme, Inc. Computer implemented predisposition prediction in a genetics platform
US11581096B2 (en) 2007-03-16 2023-02-14 23Andme, Inc. Attribute identification based on seeded learning
US11545269B2 (en) 2007-03-16 2023-01-03 23Andme, Inc. Computer implemented identification of genetic similarity
US10379812B2 (en) 2007-03-16 2019-08-13 Expanse Bioinformatics, Inc. Treatment determination and impact analysis
US10803134B2 (en) 2007-03-16 2020-10-13 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US11515047B2 (en) 2007-03-16 2022-11-29 23Andme, Inc. Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
US7818310B2 (en) 2007-03-16 2010-10-19 Expanse Networks, Inc. Predisposition modification
US7797302B2 (en) 2007-03-16 2010-09-14 Expanse Networks, Inc. Compiling co-associating bioattributes
US11495360B2 (en) 2007-03-16 2022-11-08 23Andme, Inc. Computer implemented identification of treatments for predicted predispositions with clinician assistance
US11482340B1 (en) 2007-03-16 2022-10-25 23Andme, Inc. Attribute combination discovery for predisposition determination of health conditions
US11348691B1 (en) 2007-03-16 2022-05-31 23Andme, Inc. Computer implemented predisposition prediction in a genetics platform
US11348692B1 (en) 2007-03-16 2022-05-31 23Andme, Inc. Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
US8788286B2 (en) 2007-08-08 2014-07-22 Expanse Bioinformatics, Inc. Side effects prediction using co-associating bioattributes
US7917438B2 (en) 2008-09-10 2011-03-29 Expanse Networks, Inc. System for secure mobile healthcare selection
US8200509B2 (en) 2008-09-10 2012-06-12 Expanse Networks, Inc. Masked data record access
US9031870B2 (en) 2008-12-30 2015-05-12 Expanse Bioinformatics, Inc. Pangenetic web user behavior prediction system
US11003694B2 (en) 2008-12-30 2021-05-11 Expanse Bioinformatics Learning systems for pangenetic-based recommendations
US11514085B2 (en) 2008-12-30 2022-11-29 23Andme, Inc. Learning system for pangenetic-based recommendations
US11657902B2 (en) 2008-12-31 2023-05-23 23Andme, Inc. Finding relatives in a database
US11935628B2 (en) 2008-12-31 2024-03-19 23Andme, Inc. Finding relatives in a database
US11468971B2 (en) 2008-12-31 2022-10-11 23Andme, Inc. Ancestry finder
US11776662B2 (en) 2008-12-31 2023-10-03 23Andme, Inc. Finding relatives in a database
US11508461B2 (en) 2008-12-31 2022-11-22 23Andme, Inc. Finding relatives in a database
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database
US12100487B2 (en) 2008-12-31 2024-09-24 23Andme, Inc. Finding relatives in a database
US20130332081A1 (en) * 2010-09-09 2013-12-12 Omicia Inc Variant annotation, analysis and selection tool
US20150317432A1 (en) * 2012-12-05 2015-11-05 Genepeeks, Inc. System and method for the computational prediction of expression of single-gene phenotypes
WO2014089356A1 (en) * 2012-12-05 2014-06-12 Genepeeks, Inc. System and method for the computational prediction of expression of single-gene phenotypes
US11545235B2 (en) 2012-12-05 2023-01-03 Ancestry.Com Dna, Llc System and method for the computational prediction of expression of single-gene phenotypes
WO2014144032A3 (en) * 2013-03-15 2014-11-06 The Broad Institute, Inc. Systems and methods for identifying significantly mutated genes
WO2014144032A2 (en) * 2013-03-15 2014-09-18 The Broad Institute, Inc. Systems and methods for identifying significantly mutated genes

Also Published As

Publication number Publication date
AU2003282907A1 (en) 2004-04-23
WO2004031912A3 (fr) 2004-08-05
AU2003282907A8 (en) 2004-04-23

Similar Documents

Publication Publication Date Title
Choin et al. Genomic insights into population history and biological adaptation in Oceania
Speidel et al. A method for genome-wide genealogy estimation for thousands of samples
Palamara et al. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability
Zhao et al. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies
Gao et al. New software for the fast estimation of population recombination rates (FastEPRR) in the genomic era
Pe'er et al. Evaluating and improving power in whole-genome association studies using fixed marker sets
Ngo et al. A diagnostic ceiling for exome sequencing in cerebellar ataxia and related neurological disorders
Glémin et al. Quantification of GC-biased gene conversion in the human genome
Zaitlen et al. Leveraging genetic variability across populations for the identification of causal variants
Flutre et al. A statistical framework for joint eQTL analysis in multiple tissues
Carlson et al. Mapping complex disease loci in whole-genome association studies
Dumas et al. Direct quantitative trait locus mapping of mammalian metabolic phenotypes in diabetic and normoglycemic rat models
Ptak et al. Evidence for population growth in humans is confounded by fine-scale population structure
Servin et al. Imputation-based analysis of association studies: candidate regions and quantitative traits
Sun et al. eQTL mapping using RNA-seq data
Munafo et al. Meta-analysis of genetic association studies
Prabhu et al. Ultrafast genome-wide scan for SNP–SNP interactions in common complex disease
Furlotte et al. Efficient multiple-trait association and estimation of genetic correlation using the matrix-variate linear mixed model
Hung et al. Analysis of microarray and RNA-seq expression profiling data
Chou et al. A combined reference panel from the 1000 Genomes and UK10K projects improved rare variant imputation in European and Chinese samples
Liu et al. Estimating genetic effects and quantifying missing heritability explained by identified rare-variant associations
Paşaniuc et al. Accurate estimation of expression levels of homologous genes in RNA-seq experiments
Paschou et al. Intra-and interpopulation genotype reconstruction from tagging SNPs
Rodriguez et al. Parente2: a fast and accurate method for detecting identity by descent
Leache et al. Comparative species divergence across eight triplets of spiny lizards (Sceloporus) using genomic sequence data

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP

WWW WIPO information: withdrawn in national office

Country of ref document: JP