EP1350212A2 - Composite haplotype counts for multiple loci and alleles and association tests with continuous or discrete phenotypes
- Publication number
- EP1350212A2 (application EP01988909A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- computer
- haplotype
- haplotypes
- program code
- readable program
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B20/40—Population genetics; Linkage disequilibrium
Definitions
- This invention relates to data processing systems, methods and computer program products, and more particularly to bioinformatic systems, methods and computer program products.
- Case-control data may not contain complete information about gametic phase, or haplotype, of the individuals. Nevertheless, haplotypes can be useful for fine mapping of disease susceptibility genes, for at least several reasons. First, despite the fact that the haplotype generally is unobservable, in many cases the haplotypes can be reasonably inferred from genotypes. Second, if recombination in the neighborhood of the disease-causing mutation is rare, then the haplotype of the original carrier may remain largely intact for many generations. Thus, a haplotype can be a good surrogate for a disease susceptibility gene.
- Sasieni, From Genotypes to Genes: Doubling the Sample Size, Biometrics, 53, 1997, pp. 1253-1261, discusses the case of a binary (e.g., diseased/non-diseased) trait.
- Sasieni's paper discloses equivalence between two methods for disease association, where one model uses alleles as observations (2n), and the other uses individuals as observations (n).
- haplotype frequency inference when only single-locus genotypes are scored.
- Embodiments of the invention associate haplotype frequencies for a plurality of individuals with a continuous trait.
- Each individual includes a pair of chromosomes having a plurality of markers thereon.
- Each marker has a pair of alleles for an individual.
- a haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome.
- a subset of markers from the set of markers that may correlate with the continuous trait is selected.
- a value of the continuous trait, and a pair of alleles for each of the markers in the subset of markers, is obtained for each individual.
- probabilities of haplotypes that are compatible with the alleles in the subset of markers are determined.
- a regression is performed on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlation between the continuous trait and the haplotypes.
- a regression is performed by sampling a first haplotype from the haplotypes that are compatible with the individual's set of alleles, from the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for each individual, to thereby define a second haplotype which is determined by the sampling of the first haplotype.
- the value of the continuous trait for the individual is assigned to both the first haplotype and the second haplotype, to thereby define a doubled sample size.
- An analysis of variance then is performed, by comparing average values of the trait among the sampled first and second haplotypes for all the individuals.
- the sampling a first haplotype, assigning the value of the continuous trait and performing an analysis of variance, are repeatedly performed, to obtain a distribution of correlations of the continuous trait and the haplotype.
- a value then is determined from the distribution that identifies a significance of the correlation.
- the above-described analysis of variance may be performed by defining a design matrix of first and second indicator values having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows.
- a regression is then performed on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.
- the value that is determined from the distribution can be a median that is determined from the distribution that identifies a significance of the correlation.
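The sample, assign, test, repeat and summarize loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy data, and the choice of summarizing the F statistics (rather than their p-values) by the median are all assumptions. Sampling a haplotype pair in one draw is used here as equivalent to sampling the first haplotype and letting it determine the second.

```python
import random
from statistics import median

def f_statistic(labels, values):
    """One-way ANOVA F statistic comparing mean trait values among haplotype labels."""
    groups = {}
    for h, v in zip(labels, values):
        groups.setdefault(h, []).append(v)
    grand = sum(values) / len(values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    dfb, dfw = len(groups) - 1, len(values) - len(groups)
    if dfb == 0 or dfw == 0 or ssw == 0:
        return 0.0  # guard for degenerate samples (illustrative choice)
    return (ssb / dfb) / (ssw / dfw)

def sampled_f_distribution(configs, traits, n_samples=200, seed=0):
    """configs[i] lists ((h1, h2), probability) for individual i (hypothetical layout)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_samples):
        labels, values = [], []
        for options, t in zip(configs, traits):
            pairs = [p for p, _ in options]
            probs = [w for _, w in options]
            h1, h2 = rng.choices(pairs, weights=probs, k=1)[0]
            labels += [h1, h2]   # both haplotypes of the individual
            values += [t, t]     # trait value assigned to both: doubled sample
        stats.append(f_statistic(labels, values))
    return stats

# Toy data (made up for illustration only).
ambiguous = [(("AB", "ab"), 0.9), (("Ab", "aB"), 0.1)]
configs = [[(("AB", "AB"), 1.0)], ambiguous, [(("ab", "ab"), 1.0)],
           ambiguous, [(("AB", "Ab"), 1.0)], [(("ab", "aB"), 1.0)]]
traits = [1.0, 2.1, 3.2, 1.9, 1.2, 2.8]
stats = sampled_f_distribution(configs, traits, n_samples=50, seed=1)
summary = median(stats)
```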
- a regression is performed by assigning a rank of significance for each haplotype in the set. For each individual, a first haplotype is sampled from the haplotypes that are compatible with the individual's set of alleles, to thereby define a second haplotype which is determined by the sampling of the first haplotype. The value of the continuous trait for the individual is assigned to both the first haplotype and the second haplotype, to thereby define a doubled sample size. A one degree of freedom regression is performed on the ranks for the sampled first and second haplotypes for all the individuals.
- the sampling a first haplotype, assigning the value of the continuous trait and performing a one degree of freedom regression are repeatedly performed to obtain a distribution of the correlation of the continuous trait in the haplotypes.
- a value is determined from the distribution that identifies a significance of the correlation. For example, a median may be determined from the distribution.
- the one degree of freedom regression may be performed by defining a design matrix having two columns of the ranks of the first and second haplotypes, and having two rows for each individual. A regression is performed on the design matrix, to thereby define a correlation value between the value of the continuous trait and the haplotypes.
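A one degree of freedom regression on haplotype ranks might look like the sketch below. It assumes the ranks are already given (the text does not say how they are assigned), treats each sampled haplotype as one observation carrying the individual's trait value, and fits a simple linear regression. The name `rank_regression` and the toy data are hypothetical.

```python
def rank_regression(pairs, traits, rank):
    """Regress doubled trait values on haplotype ranks; return (slope, correlation)."""
    x, y = [], []
    for (h1, h2), t in zip(pairs, traits):
        x += [rank[h1], rank[h2]]  # two rows per individual
        y += [t, t]                # trait value assigned to both haplotypes
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

# Toy data (made up): three haplotypes with assumed ranks 1..3.
pairs = [("h1", "h1"), ("h1", "h2"), ("h2", "h3"), ("h3", "h3")]
traits = [1.0, 1.5, 2.5, 3.0]
slope, r = rank_regression(pairs, traits, {"h1": 1, "h2": 2, "h3": 3})
```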
- a regression is performed by relating the value of the continuous trait for each individual to a vector of estimated frequencies of all haplotypes.
- a multiple regression is performed of the trait values on the vectors of estimated frequencies, to thereby determine correlations between the continuous trait and the haplotypes.
- probabilities of haplotypes that are compatible with the alleles in the subset of markers are determined by a haplotype-response association test on unrelated individuals. Additionally, the probabilities that haplotypes are compatible with alleles may be determined by obtaining a composite haplotype.
- Figure 1 is a block diagram of data processing systems according to embodiments of the present invention.
- Figures 2-6 are flowcharts of methods, systems and/or computer program products according to embodiments of the present invention.
- Figures 7A-7J graphically illustrate simulated correlations between continuous traits and haplotypes according to embodiments of the invention.
- Figures 8A-8J graphically illustrate simulated correlations between continuous traits and haplotypes according to other embodiments of the invention.
- Figures 9A-9C graphically illustrate simulated correlations between traits and haplotypes according to embodiments of the invention.
- Figures 10A-10C graphically illustrate the distribution of the difference P ⁇ B - P, ' . B for three penetrance matrices.
- These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the block diagrams and/or flowchart block or blocks.
- the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented method such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the block diagrams and/or flowchart block or blocks.
- alleles are an alternative form of a gene. Alleles may result from at least one mutation in the nucleic acid sequence and may result in altered mRNAs or polypeptides whose structure or function may or may not be altered. A natural or recombinant gene may have none, one, or many allelic forms. Common mutational changes which can give rise to alleles are generally ascribed to natural deletions, additions, or substitutions of nucleotides. These types of changes may occur alone, or in combination with the others, one or more times in a given sequence.
- Chromosomes are the self-replicating genetic structures of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes.
- Continuous trait refers to a common detectable phenotypic variation of a particular inherited characteristic in individuals. Examples of continuous traits include blood pressure, testosterone levels, hair count, efficacy of a drug and body mass index.
- Continuous traits may be contrasted with binary traits such as diseased/not diseased.
- the traits have an associated genetic marker.
- haplotype is a combination of alleles, which tend to be inherited together.
- Haplotype frequencies refers to the number of occurrences of a haplotype.
- “Individuals” refer to persons or organisms.
- a "marker" is an identifiable physical location on a chromosome whose inheritance can be monitored. Markers can be, for example, a restriction enzyme cutting site, an expressed region of DNA (genes), or any segment of DNA with or without known coding function, whose pattern of inheritance can be determined.
- haplotype frequencies can be estimated through expectation-maximization (E-M), and each individual in a sample is expanded into all possible haplotype configurations with corresponding probabilities.
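For the simplest case of two bi-allelic markers, the E-M step can be sketched as below. Only the double heterozygote (Aa/Bb) has an ambiguous phase, so the expectation step splits those individuals between the AB/ab and Ab/aB configurations in proportion to the current frequency estimates. The function names and the `counts[i][j]` layout (i copies of allele A, j copies of allele B) are assumptions made for this illustration, and the sketch assumes Hardy-Weinberg proportions, as E-M haplotype inference conventionally does.

```python
HAPS = ("AB", "Ab", "aB", "ab")

def unambiguous_counts(counts):
    """Haplotype counts from genotypes whose phase is known (all but Aa/Bb)."""
    c = dict.fromkeys(HAPS, 0.0)
    for i in range(3):
        for j in range(3):
            if (i, j) == (1, 1):
                continue  # double heterozygote: phase unknown
            m = counts[i][j]
            if i == 2:    # AA: both haplotypes carry A
                c["AB"] += m * j
                c["Ab"] += m * (2 - j)
            elif i == 0:  # aa: both haplotypes carry a
                c["aB"] += m * j
                c["ab"] += m * (2 - j)
            else:         # Aa with BB or bb
                b = "B" if j == 2 else "b"
                c["A" + b] += m
                c["a" + b] += m
    return c

def em_two_locus(counts, iters=100):
    """E-M estimates of the four haplotype frequencies."""
    n = sum(map(sum, counts))
    base = unambiguous_counts(counts)
    p = dict.fromkeys(HAPS, 0.25)
    for _ in range(iters):
        cis, trans = p["AB"] * p["ab"], p["Ab"] * p["aB"]
        # E-step: probability that a double heterozygote is AB/ab
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = dict(base)
        dh = counts[1][1]
        expected["AB"] += dh * w
        expected["ab"] += dh * w
        expected["Ab"] += dh * (1 - w)
        expected["aB"] += dh * (1 - w)
        # M-step: renormalize over the 2n gametes
        p = {h: expected[h] / (2 * n) for h in HAPS}
    return p

# Toy counts (made up): 10 aa/bb, 20 Aa/Bb, 10 AA/BB.
freqs = em_two_locus([[10, 0, 0], [0, 20, 0], [0, 0, 10]])
```

The weight `w` in the final iteration is also the per-individual probability of the AB/ab configuration, which is what the expansion of each individual into haplotype configurations would use.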
- Embodiments of the invention then will be confirmed to have type I error control, and also can have excellent power.
- An application to gene mapping using epidemiologic data with adjacent markers then will be described, showing that embodiments of the invention can be used to improve the efficiency of genome scans by incorporating information from consecutive markers.
- Embodiments of the invention can be more computationally efficient than conventional techniques that use time-consuming resampling coupled with numerical optimization at each step.
- Embodiments of the invention can allow an optimization (haplotype frequencies inference) to be performed only once. Then, each individual in a sample can be expanded into all consistent haplotype configurations with corresponding probabilities, and regression can be used to relate these probabilities to the response. This efficiency can allow embodiments of the present invention to be applied to whole genome scans.
- the present invention allows continuous traits to be studied, while also allowing both discrete and continuous traits to be studied within a unified regression framework.
- the present invention may be embodied in a data processing system such as illustrated in Figure 1 .
- the data processing system 24 may be configured with computational, storage and control program resources for associating haplotype frequencies for a plurality of individuals with a continuous trait, in accordance with embodiments of the present invention.
- the data processing system 24 may be contained in one or more enterprise, personal and/or pervasive computing devices, that may communicate over a network which may be a wired and/or wireless, public and/or private, local and/or wide area network such as the World Wide Web and/or a sneaker network using portable media.
- communication may take place via an Application Program Interface (API).
- embodiments of the data processing system 24 may include input device(s) 52, such as a keyboard or keypad, a display 54, and a memory 56 that communicate with a processor 58.
- the data processing system 24 may further include a storage system 62, a speaker 64, and an input/output (I/O) data port(s) 66 that also communicate with the processor 58.
- the storage system 62 may include removable and/or fixed media, such as floppy disks, ZIP drives, hard disks, or the like, as well as virtual storage, such as a RAMDISK.
- the I/O data port(s) 66 may be used to transfer information between the data processing system 24 and another computer system or a network (e.g., the Internet).
- These components may be conventional components such as those used in many conventional computing devices, which may be configured to operate as described herein.
- the memory 56 may include an operating system to manage the data processing system resources and one or more application programs for associating haplotype frequencies for a plurality of individuals with a continuous trait, according to embodiments of the present invention.
- Figure 2 is a flowchart of methods, systems and/or computer program products 200 for associating haplotype frequencies with continuous traits according to embodiments of the present invention. It will be understood that these systems, methods and/or computer program products 200 may be stored in the memory 56 of Figure 1 and may execute on the processor 58 of Figure 1. It also will be understood that each individual includes a pair of chromosomes having a plurality of markers thereon. Each marker includes a pair of alleles for an individual. A haplotype comprises a combination of alleles for a set of markers on a predetermined chromosome.
- a subset of markers is selected from the set of markers that may correlate with the continuous trait.
- the selection of a subset of markers may be determined empirically and/or theoretically based on available literature, studies and/or other techniques.
- the selection of a subset of markers that may correlate with the continuous trait is well known to those having skill in the art and need not be described further herein.
- a value of the continuous trait and the pair of alleles for each of the markers in the subset of markers is obtained.
- the value of the continuous trait and the pair of alleles for each of the markers may be obtained through clinical trials or other studies that may involve a control group and a sample group.
- the obtaining a value of a continuous trait and a pair of alleles for each of the markers in the subset of markers is well known to those having skill in the art and need not be described further herein.
- Blocks 210 and 220 may be embodied by storing data in the memory 56 of the data processing system 24, regardless of the source of the data or the manner in which it was obtained or derived.
- Table I illustrates an example of data that may be stored in memory as a result of performing the operations at Blocks 210 and 220. As shown in Table I, data for N individuals is stored.
- Figure 3 is a block diagram of operations for performing regression analysis on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes (Block 240 of Figure 2) according to embodiments 240' of the invention.
- a first haplotype from the haplotypes that are compatible with the individual set of alleles is sampled from the probability distribution determined at Block 230, to thereby define a second haplotype which is determined by the sampling of the first haplotype.
- the value of the continuous trait for the individual is assigned to both the first haplotype and the second haplotype, to thereby define a doubled sample size.
- Figure 4 is a block diagram of embodiments of performing an analysis of variance (Block 330 of Figure 3).
- an analysis of variance may be performed by defining a design matrix of first and second indicator values (such as 0 and 1) having two rows for each individual, where the second indicator value is associated with the first and second haplotypes and remaining positions in the design matrix are set to the first indicator value in the two rows.
- a regression is then performed on the design matrix, to thereby identify a correlation value between the value of the continuous trait and the first and second haplotypes.
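The design matrix described above can be sketched as follows, assuming indicator values 0 and 1 and one column per haplotype. Each individual contributes two rows, one per sampled haplotype, and the trait value is repeated for both. The F statistic is computed here through the equivalent one-way ANOVA on the indicator columns. The function names and toy data are hypothetical.

```python
def doubled_design(pairs, traits, haplotypes):
    """Build a 0/1 design matrix with two rows per individual."""
    col = {h: k for k, h in enumerate(haplotypes)}
    X, y = [], []
    for (h1, h2), t in zip(pairs, traits):
        for h in (h1, h2):
            row = [0] * len(haplotypes)  # first indicator value everywhere
            row[col[h]] = 1              # second indicator value marks the haplotype
            X.append(row)
            y.append(t)
    return X, y

def anova_f(X, y):
    """One-way ANOVA F statistic with groups defined by the indicator columns."""
    groups = [[yi for row, yi in zip(X, y) if row[c] == 1] for c in range(len(X[0]))]
    groups = [g for g in groups if g]
    grand = sum(y) / len(y)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    dfb, dfw = len(groups) - 1, len(y) - len(groups)
    return (ssb / dfb) / (ssw / dfw)

# Toy data (made up): four individuals with sampled haplotype pairs.
haplotypes = ["AB", "Ab", "aB", "ab"]
pairs = [("AB", "AB"), ("AB", "ab"), ("ab", "ab"), ("Ab", "ab")]
traits = [1.0, 2.0, 3.0, 2.5]
X, y = doubled_design(pairs, traits, haplotypes)
F = anova_f(X, y)
```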
- Figure 5 is a flowchart of other embodiments of performing a regression on the probabilities of haplotypes that are compatible with the alleles in the subset of markers for all the individuals, to determine correlations between the continuous trait and the haplotypes (Block 240 of Figure 2).
- Embodiments 240" of Figure 5 first assign a rank of significance for each haplotype in the set, at Block 510. Operations corresponding to Blocks 310 and 320 of Figure 3 then are performed. Then, at Block 520, a one degree of freedom regression is performed on the ranks for the sampled first and second haplotypes for all the individuals.
- At Block 340, the operations of Blocks 310, 320 and 520 are then repeatedly performed for all haplotypes, to obtain a distribution of the correlation of the continuous trait and the haplotypes. Then at Block 350, a value is determined from the distribution that identifies the significance of the correlation.
- Referring now to Figure 6, other embodiments of performing regression analysis on the probabilities of haplotypes that are compatible with the alleles in the subset of markers, for all the individuals, to determine correlations between the continuous trait and the haplotypes (Block 240 of Figure 2) are described. As shown in Figure 6, these embodiments 240'" relate the value of the continuous trait of each individual to a vector of estimated frequencies of all haplotypes (Block 610). Then, at Block 620, a multiple regression of the trait values is performed on the vectors of estimated frequencies.
- allelic versus genotypic tests for the case-control design and bi-allelic markers were studied.
- a genotypic test for association can operate on a 2 x 3 contingency table of individuals, classified according to their genotypes and the affection status. The total count of such a table is n.
- An allelic test would operate on a 2 x 2 table of allele counts versus affection status. Thus, each individual would contribute two alleles to the table, and the total count becomes 2n.
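The relationship between the two tables can be made concrete with a short sketch. The function names and the counts are hypothetical; the point is only that every individual contributes two alleles, so the table total goes from n to 2n.

```python
def genotype_table(cases, controls):
    """2 x 3 table: rows are (cases, controls), columns are (AA, Aa, aa) counts."""
    return [list(cases), list(controls)]

def allele_table(cases, controls):
    """2 x 2 table of allele counts (A, a); each individual contributes two alleles."""
    def alleles(row):
        n_AA, n_Aa, n_aa = row
        return [2 * n_AA + n_Aa, 2 * n_aa + n_Aa]
    return [alleles(cases), alleles(controls)]

# Made-up genotype counts (AA, Aa, aa).
cases = (30, 50, 20)
controls = (20, 50, 30)
g = genotype_table(cases, controls)
a = allele_table(cases, controls)
n = sum(map(sum, g))            # total count of the genotypic table
total_alleles = sum(map(sum, a))  # total count of the allelic table
```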
- the test implicitly assumes that the allele counts are binomially distributed, and thus may require that the population is in Hardy-Weinberg Equilibrium (HWE).
- Sasieni described that Armitage's trend test addresses essentially the same question; however, it does not "double" the data, and therefore can be applied to samples from non-randomly mating populations. Sasieni also provided explicit expressions for odds ratios comparing heterozygous and homozygous cases and argued that the genotypic test is sometimes a better choice, since it allows testing of genotypic effects not explained by alleles, or "dominance deviations". See the above-cited Weir et al. 1977 publication.
- Equation (1) is an Analysis of Variance (ANOVA) model relating response to allele class.
- An alternative model to Equation (1) with similar asymptotic properties and well-known finite-sample properties now will be described.
- the model is an n-dimensional regression model
- $Y = D\nu + \varepsilon \quad (2)$
- where $Y_i$ is the trait value for individual $i$, $D' = (D_1, D_2, \ldots, D_n)$, $D_i = (D_{i1}, D_{i2}, \ldots, D_{iL})$, and where
- Equation (2) may have the usual validity (or lack thereof, in cases of lack of fit) of standard regression models, whereas Equation (1) may seem unrealistic since the observations are simply doubled. Nevertheless, it will be shown that these models can produce equivalent F statistics when HWE holds.
- Equation (2) is exactly Equation (3) with $d_{jk} \equiv 0$.
- Equation (3) may lack sensitivity in cases of dominance effects ($d_{jk} \neq 0$).
- the test for $\{H_0 : \nu_j \equiv 0 \text{ and } d_{jk} \equiv 0\}$ may lose power because of the large numerator degrees of freedom ($L(L+1)/2 - 1$).
- the additive Equation (2) may be preferable despite possible lack of fit.
- the F test uses:
- In Equations (1) and (2), the "alleles" can denote multi-locus haplotypes rather than single-locus alleles.
- the parameter $\nu_j$ refers to the main effect of haplotype $j$.
- the haplotypes are generally unobservable, and therefore missing data methods may be used for their estimation.
- the first basic type is like the ANOVA model Equation (1), but where the haplotype classes are generated at random from a distribution inferred through the observed single-locus genotypes; results are then averaged over random haplotype generations.
- the second basic type is like the regression model Equation (2), where instead of using actual haplotype frequencies (0, 1, 2) for person i, the expected haplotype frequencies (given the observed single-locus genotypes) are used.
- E-M Expectation-Maximization
- haplotype frequencies: real values
- vectors of possible haplotypes: vectors of integers
- the mapping may be more conveniently implemented through associative arrays, such as generic "map" from the C++ Standard Template Library. This can make the algorithm completely general with respect to the value of L.
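As an illustration of the associative-array idea, the sketch below uses a Python dictionary keyed by tuples of integers, which plays the same role here as a C++ `std::map` keyed by integer vectors. The key can have any length, so nothing in the code depends on the number of markers. The function name and the integer allele coding are assumptions.

```python
from collections import defaultdict

def haplotype_frequencies(haplotypes):
    """Map each haplotype (a vector of integer allele codes) to its relative frequency."""
    counts = defaultdict(int)
    for h in haplotypes:
        counts[tuple(h)] += 1  # tuples are hashable, so any vector length works
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

# Toy haplotypes over three markers (made up); alleles coded as integers.
observed = [[0, 1, 1], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
freq = haplotype_frequencies(observed)
```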
- a model specified by (1) is formed, and a test statistic (F) for the importance of including the genotype is calculated (Block 330);
- yet another embodiment is to perform a multiple regression, based on n observations instead of 2n, (Block 620) directly on the set of per-person expected haplotype frequencies (Block 610).
- This embodiment is motivated by Equation (2), where the traits are regressed on the observed frequencies. If all elements in the matrix D in Equation (2) are divided by two, then they can be considered as probabilities for the individuals to have a particular allele. In the single-locus case, the identification of alleles may be certain, and so 0, 0.5, and 1 generally are the only values possible.
- For E-M inferred haplotypes, the corresponding model is:
- Frequencies for haplotypes incompatible with the ith individual's single-locus genotypes are set to zero. Also, haplotypes with expected counts that are less than one are removed from consideration.
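The per-person rows of expected haplotype frequencies can be sketched as below. Each individual's row is built from that individual's compatible haplotype configurations and their probabilities, so incompatible haplotypes stay at zero, and a phase-known individual gets only the values 0, 0.5 and 1. The second helper drops haplotypes whose expected count over the sample is below one. The function names, data layout and toy probabilities are assumptions.

```python
def expected_frequency_vector(options, haplotypes):
    """options: list of ((h1, h2), probability); returns one row summing to 1."""
    vec = dict.fromkeys(haplotypes, 0.0)  # incompatible haplotypes remain zero
    for (h1, h2), prob in options:
        vec[h1] += 0.5 * prob
        vec[h2] += 0.5 * prob
    return [vec[h] for h in haplotypes]

def drop_rare(rows, haplotypes, min_count=1.0):
    """Remove haplotypes whose expected count (2 x summed proportion) is below min_count."""
    keep = [k for k in range(len(haplotypes))
            if sum(2 * r[k] for r in rows) >= min_count]
    return [[r[k] for k in keep] for r in rows], [haplotypes[k] for k in keep]

haplotypes = ["AB", "Ab", "aB", "ab"]
# Toy individuals (made up): one homozygote and one double heterozygote.
rows = [
    expected_frequency_vector([(("AB", "AB"), 1.0)], haplotypes),
    expected_frequency_vector([(("AB", "ab"), 0.9), (("Ab", "aB"), 0.1)], haplotypes),
]
reduced_rows, kept = drop_rare(rows, haplotypes)
```

The reduced rows would then serve as the predictors in the multiple regression on n observations.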
- the test can be made more robust by permuting the vector $(Y_1, \ldots, Y_n)$ independently of the haplotype frequency data.
- the final p-value is the proportion of permutations that yield an F-statistic p-value that is no larger than the original F-statistic p-value.
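The permutation scheme can be sketched as follows. The trait vector is shuffled while the predictor data stay fixed, and the final p-value is the share of permutations whose p-value is no larger than the original one. Because the standard library has no F distribution, `toy_pvalue` is only a stand-in that decreases as the association gets stronger; in practice it would be replaced by the F-statistic p-value of the fitted model. All names and data here are hypothetical.

```python
import random

def permutation_pvalue(y, x, pvalue_fn, n_perm=1000, seed=0):
    """Proportion of permutations of y whose p-value is <= the observed p-value."""
    rng = random.Random(seed)
    observed = pvalue_fn(y, x)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permute responses independently of the predictors
        if pvalue_fn(y_perm, x) <= observed:
            hits += 1
    return hits / n_perm

def toy_pvalue(y, x):
    """Stand-in for an F-statistic p-value: 1 minus the squared correlation."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy * sxy / (sxx * syy)

# Toy data (made up): expected frequency of one haplotype and a trait value.
x = [0, 0.5, 1, 0, 0.5, 1, 1, 0]
y = [0.1, 1.0, 2.1, 0.0, 1.1, 1.9, 2.0, 0.2]
p_final = permutation_pvalue(y, x, toy_pvalue, n_perm=1000, seed=0)
```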
- $n_{jk}/n = p_{jk} + o_p(1) \quad (6)$
- $p_{jk}$ is the population proportion of individuals with genotype $(j, k)$
- "$o_p(1)$" denotes a term that converges to 0 in probability.
- there are $n_{A_j} = n_{jj} + (n_{j1} + \cdots + n_{jj} + \cdots + n_{jL})$ gametes having allele $j$.
- $n_{A_j}/(2n) = p_j + o_p(1) \quad (9)$
- Equations (6)-(9) concern the behavior of the n j and the p j .
- $n^{1/2}\,\bar{Y}_A \to_d U \quad (11)$ where $\bar{Y}_A$ denotes the vector of allelic averages, where $U$ denotes a multivariate normal $L$-vector, and where $\to_d$ denotes convergence in distribution. Also,
- Equations (14), (15) and (16) need to be demonstrated.
- the $SSA_i$ are expressed as quadratic forms in $Y$, and the difference of the defining matrices is then examined.
- SSA, Y;[D A (D'D) -'D.-D ⁇ D ⁇ nVjY,
- Equation (16) uses Equation (18).
- Equation (16) follows by noting that $n^{1/2}\,\bar{Y}_A$ converges in distribution and that the elements of $B_n$ converge in probability, and Equation (4) is finally proven.
- Simulation experiments were conducted using actual programs, by running them multiple times in a UNIX shell script loop, together with programs simulating the data sets.
- Type-I error rate has been studied using one, two, three and five bi-allelic markers to infer haplotype frequencies. All three embodiments (Figures 3, 5 and 6) can provide good size of the test even for small sample sizes. Between 5,000 and 10,000 simulations were used for calculating expected proportions of rejections, with 10 restarts and 1,000 samples for probability sampling (for calculating the median in embodiment 1).
- markers were allowed to be unlinked and the response to follow different models, including binary, Gamma(10,5) distributed, Normal(0,1), mixture of two normals
- the mean of the distribution was equal to one for one of the homozygotes, and zero for the two other genotypes.
- 100 individuals were sampled, and embodiment 1 and embodiment 3 regressions were started at the beginning of the chromosome.
- a sliding window of one to seven markers was moved toward the end, calculating p (model p-value), and plotting -ln p against the marker number, as shown in Figures 7A-7J.
- Figures 8A-8J are an independent repetition of the same simulation experiment, but with a sample size of 50. The actual polymorphism causing the shift in the response mean was removed from the data, thus was assumed "unobserved".
- embodiments of the present invention appear to be quite robust, and can perform well under small sample sizes and various response models, even for binary data.
- embodiments of the invention can be used with case-control data as well as with continuous traits.
- the population simulation results described above are quite encouraging. Single-marker peaks around the true location are somewhat ragged, because of the stochastic differences in allele frequencies and amount of linkage disequilibrium with the disease gene. Some of the -ln p variation for embodiment 1 might also be due to the stochastic nature of the E-M ANOVA. At each window, 10 initial restarts and 3200 samples were used to build the F-statistic distribution for embodiment 1.
- Haplotype-based tests using continuous phenotype and E-M based frequencies therefore can be powerful and valid tests for association.
- Models based on individuals (n observations) or gametes (2n observations) can be null hypothesis-equivalent in the case of known gametic phase.
- Embodiments of the invention can be used as a screening tool for localizing genetic effects and/or for detecting epistatic effects involving candidate genes. Marker/disease and/or marker/trait associations can be uncovered.
- Systems, methods and/or computer program products according to embodiments of the invention can be efficient, and can allow rapid processing of large amounts of genetic data, including whole genome scans with dense maps of genetic markers.
- Embodiments of the invention can extend the idea of composite haplotypes to an arbitrary number of markers and alleles and can provide an efficient algorithm for calculating composite haplotype frequencies.
- embodiments of the invention can:
- Embodiments of the invention may be distinguished from a conventional E-M algorithm for at least one or more of the following reasons: 1. Calculations of composite frequencies do not require the HWE assumption. This may be an important distinction between E-M-based and composite methods, since Hardy-Weinberg disequilibrium (HWD) may be expected for haplotypes related to the response. In the presence of the HWE, however, the composite haplotype frequencies may lead to an unbiased estimate of LD.
- E-M estimates the frequencies for the whole sample. This means that abundant haplotypes with response values from one tail of the distribution can affect probabilities of ambiguous haplotype configurations of the other tail, and thus can mask conceivable effects of haplotypes of the other tail.
- Composite frequency calculations can be much faster.
- the amount of computing for a particular haplotype type can depend linearly on the sample size.
- Asymptotic tests may fail if the E-M is run separately for different categories of response, but the composite haplotype method may not be prone to this.
- Shuffling tests with E-M have been suggested. However, they may be notoriously slow, because they may require a new E-M estimation each time the response is scrambled.
- There are many biologically plausible situations when genetic contribution to response is not determined solely by haplotypes. Rather, it can be important which alleles an individual has at a particular set of markers. Embodiments of the invention can capture the combination of both situations: presence of particular haplotypes as well as a particular set of alleles at different markers.
- Figures 9A-9C are an example of this, simulated under the assumption that pairs of haplotypes forming a genotype may additionally contribute to the response beyond what is explained by individual haplotypes.
- the functional (response-related) region extends up to the 50th marker, and the height of the peak reflects the statistical strength of the method.
- the single-marker approach (Figure 9A) does not do well in comparison with either E-M-inferred haplotypes
- the multilocus, multiple allele definition derives from counting numbers of genotypes compatible with a particular haplotype.
- the amount of uncertainty is a function of numbers of distinct haplotypes that each genotype could expand into. This uncertainty defines weights for multilocus genotype contributions.
- For a multilocus genotype $g_i$, define $H(g_i)$ to be the number of single-locus heterozygotes in $g_i$. Then the weights are given by:
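To make the quantity $H(g_i)$ concrete, the sketch below counts the heterozygous loci in an unphased multilocus genotype and enumerates the distinct haplotype pairs the genotype could expand into. It does not reproduce the weight formula itself; it only shows the standard fact that a genotype with H heterozygous loci is compatible with $2^{H-1}$ unordered haplotype pairs when H is at least 1, and with one pair when H is 0. The function names and genotype encoding are assumptions.

```python
from itertools import product

def n_heterozygous(genotype):
    """H(g): number of single-locus heterozygotes in an unphased genotype."""
    return sum(1 for a, b in genotype if a != b)

def compatible_pairs(genotype):
    """Enumerate distinct unordered haplotype pairs consistent with the genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(locus[c] for locus, c in zip(genotype, choice))
        h2 = tuple(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))  # ignore the order within a pair
    return sorted(pairs)

# Toy genotype (made up): heterozygous at two of three loci.
g = (("A", "a"), ("B", "b"), ("C", "C"))
H = n_heterozygous(g)
pairs = compatible_pairs(g)
```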
- n the sample size
- per-individual conditional probabilities are computed. They are computed from additive contribution of pairs of composite haplotypes. Specifically, for composite haplotypes $h_k$ and $h_j$, with frequencies $(p_k, p_j)$, the conditional probability of the pair $(h_k, h_j)$ for the $i$-th individual with genotype $g_i$ is:
- CH Composite Haplotypes
- the CH embodiments introduced here can be used as a general test for association of di-genic counts with the phenotype.
- the comparisons presented here include the binary phenotype, so that the CH performance can be compared with an EM-based Likelihood-Ratio Test (LRT). Note, however, that the power of CH can be increased if the data sets used are not dichotomized and the continuous phenotype is assumed.
- LRT Likelihood- Ratio Test
- Embodiments of the present invention can allow identification of composite haplotypes with a user-specified threshold frequency ($t$) by randomly reconstructing pairs of haplotypes for each individual $W$ times and keeping a list of observed haplotypes with the corresponding frequencies.
- the number $W$ is determined by the tolerable error associated with the binomial $(nW, t)$ random variable. Thus, the speed of these embodiments may be affected very little by J.
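The random-reconstruction idea above can be sketched as follows. This is a minimal illustration, not the patented procedure: the genotype encoding (one unordered allele pair per locus), the function names `random_phase` and `frequent_haplotypes`, and the choice to phase every locus independently with probability 1/2 are all assumptions made for the example.

```python
import random
from collections import Counter

def random_phase(genotype, rng):
    """Randomly split an unphased genotype into two haplotypes.

    genotype: sequence of (allele, allele) pairs, one pair per locus.
    Assumption: each locus is phased independently with probability 1/2.
    """
    hap1, hap2 = [], []
    for a, b in genotype:
        if rng.random() < 0.5:
            a, b = b, a
        hap1.append(a)
        hap2.append(b)
    return tuple(hap1), tuple(hap2)

def frequent_haplotypes(genotypes, W, t, seed=0):
    """Reconstruct haplotype pairs W times per individual and keep the
    haplotypes whose observed frequency is at least the threshold t."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(W):
        for g in genotypes:
            counts.update(random_phase(g, rng))  # adds both haplotypes
    total = 2 * W * len(genotypes)  # two haplotypes per reconstruction
    return {h: c / total for h, c in counts.items() if c / total >= t}

if __name__ == "__main__":
    sample = [(("A", "a"), ("B", "b")), (("A", "A"), ("B", "B"))]
    print(frequent_haplotypes(sample, W=100, t=0.05))
```

Because only haplotypes that actually appear in a reconstruction are stored, the size of the table depends on the sample and on the number of repetitions rather than on the number of theoretically possible haplotypes.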
- P . B , P ⁇ ' , B are the frequencies of A, B alleles that reside on two different gametes in contrast to P AB ,P .iB , that measure their joint frequency on the same gamete.
- This "intra-gametic" frequency can also be written as a product of A, B allele frequencies plus the deviation (D 4 B ) unexplained by the product. Generally, this deviation is not zero if the HWE at the haplotype level does not hold.
- P lB - P iB > 0 generally P, B - P i u > 0 . Therefore a test that ⁇ 1B - ⁇ ⁇ B ⁇ 0, is next considered which may be the basis of the CH embodiments.
- $f_5$ is the frequency of the AB/ab genotype and $f_6$ is the frequency of the Ab/aB genotype.
- the missing gametic phase implies that only the sum $(f_5 + f_6)$ can be observed.
- conditional probabilities may be observed:
- N) ( . ; - ⁇
- ⁇ AB 2 Yx(g I Y) + Pr(g I Y) + ?v(g 4 I y) + 1 (Pr( 5 1 Y) + Pr(g 6
- Figures 10A-10C are a numerical illustration of this observation, obtained for three penetrance matrices: (1,1,0,1,1,0,0,0,0), (1,1/2,0,1/2,1/2,0,0,0,0,0), (1/2,1,0,1,1,0,0,0,0,0) corresponding to Figures 10A, 10B and 10C, respectively.
- Each histogram is based on 50,000 observations and was obtained by sampling four haplotype frequencies from a uniform distribution and computing $f_1, \ldots, f_{10}$ from the Hardy-Weinberg proportions. Only the last example has non-zero (9%) probability of P,, ⁇ -P.
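The sampling step described above can be sketched in a few lines. The example assumes that "uniform" means the flat Dirichlet(1,1,1,1) distribution over the four haplotype frequencies, and it assumes a particular ordering of the ten two-locus genotypes; only the positions of AB/ab (fifth) and Ab/aB (sixth) are taken from the text, the rest of the ordering and the function names are illustrative.

```python
import random

def sample_haplotype_freqs(rng):
    """Draw (p_AB, p_Ab, p_aB, p_ab) from Dirichlet(1,1,1,1) by
    normalizing four independent Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(4)]
    total = sum(draws)
    return [d / total for d in draws]

def hw_genotype_freqs(p_AB, p_Ab, p_aB, p_ab):
    """Ten phase-known two-locus genotype frequencies under
    Hardy-Weinberg proportions at the haplotype level."""
    return [
        p_AB ** 2,          # f1:  AB/AB
        2 * p_AB * p_Ab,    # f2:  AB/Ab
        p_Ab ** 2,          # f3:  Ab/Ab
        2 * p_AB * p_aB,    # f4:  AB/aB
        2 * p_AB * p_ab,    # f5:  AB/ab
        2 * p_Ab * p_aB,    # f6:  Ab/aB
        2 * p_Ab * p_ab,    # f7:  Ab/ab
        p_aB ** 2,          # f8:  aB/aB
        2 * p_aB * p_ab,    # f9:  aB/ab
        p_ab ** 2,          # f10: ab/ab
    ]

if __name__ == "__main__":
    rng = random.Random(0)
    freqs = sample_haplotype_freqs(rng)
    print(hw_genotype_freqs(*freqs))
```

The ten values always sum to one because they are the expansion of the squared sum of the four haplotype frequencies.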
- $n_{AaBb}$ is the number of individuals with genotype Aa/Bb, $n_{AB}$ is the di-genic count, and $n$ is the sample size.
- the composite disequilibrium is calculated using the sum of inter- and intra-gametic components:
- P +P intra-gametic components as the work may be in the term of p B - — ⁇ l - ⁇ - .
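For orientation, the standard composite disequilibrium estimator for two biallelic loci, as given in Weir's treatment of composite linkage disequilibrium, can be computed from unphased genotypes alone. It is assumed here, not confirmed by the text, that this is the quantity the passage refers to. In that estimator the di-genic count is $n_{AB} = 2n_{AABB} + n_{AABb} + n_{AaBB} + \tfrac{1}{2} n_{AaBb}$ and the composite disequilibrium is $n_{AB}/n - 2 p_A p_B$. The dosage encoding and function names below are illustrative.

```python
def digenic_count(dosages):
    """Di-genic count n_AB from unphased two-locus genotypes.

    dosages: list of (copies of A, copies of B) tuples, each in {0, 1, 2}.
    Double heterozygotes contribute 1/2 because their phase is unknown.
    """
    contribution = {(2, 2): 2.0, (2, 1): 1.0, (1, 2): 1.0, (1, 1): 0.5}
    return sum(contribution.get(d, 0.0) for d in dosages)

def composite_disequilibrium(dosages):
    """Composite disequilibrium: n_AB / n - 2 * p_A * p_B."""
    n = len(dosages)
    p_A = sum(a for a, _ in dosages) / (2 * n)
    p_B = sum(b for _, b in dosages) / (2 * n)
    return digenic_count(dosages) / n - 2 * p_A * p_B

if __name__ == "__main__":
    sample = [(2, 2)] * 5 + [(0, 0)] * 5
    print(composite_disequilibrium(sample))
```

The estimate is zero in expectation when alleles at the two loci are independent both within and between gametes, which is why it does not require the Hardy-Weinberg assumption.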
- Sample composite haplotype counts are calculated from summing over individual contributions:
- $n_{abc\ldots} = \sum_{i=1}^{n} w(g_i)\, I(a, b, c, \ldots \in g_i)$,
- n the sample size
- $I(\cdot)$ the indicator function
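The weighted sum over individuals can be sketched as below. The weight formula itself is not recoverable from the text, so the weight is passed in as a function; `placeholder_weight`, which uses one over the number of distinct haplotypes a genotype can expand into, is a hypothetical stand-in and should not be read as the patent's definition. The genotype encoding and the other names are likewise illustrative.

```python
def n_heterozygous(genotype):
    """H(g): number of single-locus heterozygotes in a genotype."""
    return sum(1 for a, b in genotype if a != b)

def placeholder_weight(genotype):
    """ASSUMPTION: hypothetical weight 1 / 2**H(g), where 2**H(g) is the
    number of distinct haplotypes the genotype can expand into."""
    return 1.0 / (2 ** n_heterozygous(genotype))

def contains(genotype, alleles):
    """Indicator I(.): True if every listed allele (one per locus)
    is present in the genotype at the corresponding locus."""
    return all(allele in pair for allele, pair in zip(alleles, genotype))

def composite_count(genotypes, alleles, weight=placeholder_weight):
    """Sum of w(g_i) over individuals whose genotype is compatible with
    the allele combination."""
    return sum(weight(g) for g in genotypes if contains(g, alleles))

if __name__ == "__main__":
    sample = [
        (("A", "A"), ("B", "B")),
        (("A", "a"), ("B", "b")),
        (("a", "a"), ("B", "b")),
    ]
    print(composite_count(sample, ("A", "B")))
```

Keeping the weight as a parameter makes it straightforward to substitute the actual weighting scheme once it is known.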
- conditional composite frequencies $x_j$, for an individual $j$ with the multi-locus genotype $g_j$.
- These frequencies are estimates of conditional probabilities of composite haplotypes given the genotype $g_j$.
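One common way to obtain conditional probabilities of haplotype pairs given an unphased genotype is to weight each compatible pair by the product of its haplotype frequencies and normalize over all compatible pairs. The sketch below follows that reading; the random-pairing assumption behind the product, the dictionary of frequencies, and the function names are assumptions of this example rather than details taken from the text.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype
    given as one (allele, allele) pair per locus."""
    pairs = set()
    options = [((a, b), (b, a)) for a, b in genotype]
    for choice in product(*options):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def pair_probabilities(genotype, freqs):
    """Conditional probability of each compatible haplotype pair.

    freqs: dict mapping haplotype tuples to population frequencies.
    Incompatible pairs have probability zero and are simply omitted.
    """
    weights = {}
    for h1, h2 in compatible_pairs(genotype):
        w = freqs.get(h1, 0.0) * freqs.get(h2, 0.0)
        if h1 != h2:
            w *= 2.0  # a pair of distinct haplotypes arises in two orders
        weights[(h1, h2)] = w
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no compatible pair has positive frequency")
    return {pair: w / total for pair, w in weights.items()}

if __name__ == "__main__":
    freqs = {("A", "B"): 0.4, ("a", "b"): 0.3,
             ("A", "b"): 0.2, ("a", "B"): 0.1}
    print(pair_probabilities((("A", "a"), ("B", "b")), freqs))
```

For a double heterozygote only two pairs are compatible, AB with ab and Ab with aB, so the normalization runs over just those two terms.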
- the length of the vector is the number of haplotypes that will be used for the test of association with the phenotype. For example, a minimum required sample frequency may be used as a threshold to reduce the number of haplotypes for the analysis.
- the $k$-th component of the vector, corresponding to the $k$-th haplotype, is calculated as:
- ⁇ [ denotes frequency of the composite haplotype that is complementary to the haplotype hi.
- the complementary haplotype is determined by the genotype, given the first haplotype, hi.
- the probability $\Pr(g \mid h_k, h_l)$ is either zero or one, so the sum in the denominator is over the haplotype pairs compatible with the genotype. Denoting the vector of phenotype values (not necessarily binary) by Y and letting
- This model explicitly assigns different penetrance values for genotypes that contain the AB haplotype.
- a p (a a a a b a a a a)
- Population haplotype frequencies for each of 10,000 simulations were generated by (a) sampling from the multivariate uniform distribution, Dirichlet(1,1,1,1), with ten di-locus population genotype frequencies obtained assuming HWE; and (b) by sampling ten-locus genotypes directly from the multivariate uniform distribution.
- the second way permits genotypes to deviate from the HWE proportions.
- Rejection sampling was used to obtain pre-specified values of LD (0 to 0.3 and 0.5 to 1 of the maximum possible value) and HW disequilibrium (0.5 to 1 of the maximum possible value). Samples of 50 and 100 individuals were obtained by multinomial sampling from the population frequencies.
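The rejection step and the multinomial draw can be sketched as follows. The example assumes that LD is measured as $|D|/D_{\max}$ for two biallelic loci and that candidate haplotype frequencies come from a flat Dirichlet distribution; both choices, along with the function names and the retry limit, are assumptions of this illustration. The Hardy-Weinberg disequilibrium constraint mentioned in the text is not modeled.

```python
import random

def dirichlet_uniform(rng, k=4):
    """Flat Dirichlet draw via normalized Gamma(1, 1) variates."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def relative_ld(p_AB, p_Ab, p_aB, p_ab):
    """|D| as a fraction of its maximum possible value."""
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return abs(d) / d_max if d_max > 0 else 0.0

def sample_freqs_with_ld(rng, lo, hi, max_tries=100000):
    """Rejection sampling: redraw until relative LD falls in [lo, hi]."""
    for _ in range(max_tries):
        freqs = dirichlet_uniform(rng)
        if lo <= relative_ld(*freqs) <= hi:
            return freqs
    raise RuntimeError("no draw satisfied the LD constraint")

def multinomial_sample(rng, probs, n):
    """Counts per category for n individuals drawn with the given
    category probabilities (for example, genotype frequencies)."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts

if __name__ == "__main__":
    rng = random.Random(1)
    print(sample_freqs_with_ld(rng, 0.5, 1.0))
    print(multinomial_sample(rng, [0.1] * 10, 50))
```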
- the number of migrants was equal to 5% of the isolate size.
- the initial values of population allele frequencies were sampled from the uniform(0, 1) distribution. Recombination was modeled assuming no interference.
- the final generation of the isolate consisted of 10,000 individuals. 100 individuals were sampled for the subsequent analysis and 512 separate evolutions were performed.
- the response was modeled by assigning a genotypic value, $G_k \sim N(0,1)$, to each genotype in the response region defined by ten consecutive SNPs. These SNPs were assumed unobserved, and genotypes of two to eight SNPs that were 0.025 cM away from the response region were used for the analysis.
- the phenotypic value was dichotomized about the sample mean prior to the analysis.
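A minimal sketch of this response model follows. It assumes that the phenotypic value equals the genotypic value with no added environmental noise, which the text does not state either way; the function name and the use of hashable genotype labels are illustrative.

```python
import random

def simulate_binary_phenotype(genotypes, seed=0):
    """Assign each distinct genotype a value G_k ~ N(0, 1), then
    dichotomize individual values about the sample mean.

    genotypes: list of hashable genotype labels, one per individual.
    ASSUMPTION: phenotype equals the genotypic value (no noise term).
    """
    rng = random.Random(seed)
    value_of = {}
    values = []
    for g in genotypes:
        if g not in value_of:
            value_of[g] = rng.gauss(0.0, 1.0)
        values.append(value_of[g])
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]

if __name__ == "__main__":
    print(simulate_binary_phenotype(["g1", "g2", "g1", "g3", "g2"]))
```

Individuals sharing a genotype always receive the same binary label under this model, since the only source of variation is the genotypic value.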
- the average LD between adjacent markers was 0.35 as measured by the correlation coefficient.
- the program "FAST EH+" was used to carry out the LRT. See, Zhao et al.
- model A did not reveal higher power for the EM-based test. Under HWD (Table XV) the power of the LRT appears to be slightly affected. Table XV. Power values for the LRT and the CH, two-locus simulations, HWD, LD range: 0.5 to 1 of the maximum value, and the sample size of 50.
- CH shows small improvement in power. Similar results were observed for smaller values of LD, 0 to 0.3 of the maximum value, with higher power for both tests (data not shown). This can be attributed to reduction of haplotype diversity caused by high values of LD.
- Table XVII presents results from multi-locus simulations. The one to seven markers used in the analysis did not directly affect the phenotype; therefore the power values reflect the strength of the LD between the "unobserved" functional region and these markers. Power values are clearly higher for the CH, with the largest value (90%) observed for five-marker composite haplotypes. Although the permutation test is most likely to have the correct size, the validity of the CH test was verified under the null hypothesis. For each of 10,000 simulations, population haplotype frequencies were sampled from the Dirichlet distribution and multinomial samples of genotypes of various sizes $n$ were obtained. These simulations were performed for normally distributed and binary Y and haplotype sizes of one to ten.
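A permutation test of the kind mentioned above can be sketched generically. The statistic used here, the absolute correlation between a per-individual haplotype score and the phenotype, is a stand-in chosen for the example and is not the CH statistic itself; the function names and the add-one p-value convention are likewise assumptions.

```python
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Shuffle the phenotype to build the null distribution of the
    statistic; the observed arrangement counts as one permutation."""
    rng = random.Random(seed)
    observed = abs_correlation(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs_correlation(x, y_perm) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    x = [0.0, 0.5, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
    y = [0, 1, 1, 0, 0, 1, 1, 0]
    print(permutation_p_value(x, y))
```

Because the phenotype labels are exchangeable under the null hypothesis, the test keeps its nominal size without distributional assumptions, and it applies equally to binary and continuous Y.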
- CH shows a small improvement in power when the size of the haplotype is increased.
- Table XVII. Power values for the LRT and the CH, 512 multi-locus forward simulations, sample size of 100.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Engineering & Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Theoretical Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Molecular Biology (AREA)
- Ecology (AREA)
- Physiology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention makes it possible to associate haplotype frequencies with a continuous trait for a plurality of individuals. Each individual has a pair of chromosomes carrying a plurality of markers. Each marker has a pair of alleles for an individual. A haplotype comprises a combination of alleles for a set of markers on a given chromosome. From the set of markers, a subset of markers that may be correlated with the continuous trait is selected. For each individual, a continuous trait value and a pair of alleles for each marker in the subset are obtained. For each individual, probabilities of the haplotypes compatible with the alleles of the marker subset are determined. Finally, a regression is performed on the probabilities of the haplotypes compatible with the alleles of the marker subset over all individuals to determine the correlation between the continuous trait and the haplotypes.
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US69474800A | 2000-10-23 | 2000-10-23 | |
US694748 | 2000-10-23 | ||
US28878901P | 2001-04-13 | 2001-04-13 | |
US288789P | 2001-04-13 | ||
US32734801P | 2001-10-04 | 2001-10-04 | |
US327348P | 2001-10-04 | ||
PCT/US2001/045393 WO2002035442A2 (fr) | 2000-10-23 | 2001-10-22 | Composite haplotype counts for multiple loci and alleles and association tests with continuous or discrete phenotypes |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1350212A2 (fr) | 2003-10-08 |
Family
ID=27617543
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP01988909A Withdrawn EP1350212A2 (fr) | 2000-10-23 | 2001-10-22 | Composite haplotype counts for multiple loci and alleles and association tests with continuous or discrete phenotypes |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP1350212A2 (fr) |
WO (1) | WO2002035442A2 (fr) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FI114551B (fi) * | 2001-06-13 | 2004-11-15 | Licentia Oy | Method, storage medium and computer system for gene localization from chromosome and phenotype data |
US7107155B2 (en) | 2001-12-03 | 2006-09-12 | Dnaprint Genomics, Inc. | Methods for the identification of genetic features for complex genetics classifiers |
US11545235B2 (en) | 2012-12-05 | 2023-01-03 | Ancestry.Com Dna, Llc | System and method for the computational prediction of expression of single-gene phenotypes |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1071817A2 (fr) * | 1998-04-21 | 2001-01-31 | Genset | Biallelic markers for use in constructing a high density disequilibrium map of the human genome |
US6525185B1 (en) * | 1998-05-07 | 2003-02-25 | Affymetrix, Inc. | Polymorphisms associated with hypertension |
GB9904585D0 (en) * | 1999-02-26 | 1999-04-21 | Gemini Research Limited | Clinical and diagnostic database |
US20020077775A1 (en) * | 2000-05-25 | 2002-06-20 | Schork Nicholas J. | Methods of DNA marker-based genetic analysis using estimated haplotype frequencies and uses thereof |
GB0021667D0 (en) * | 2000-09-04 | 2000-10-18 | Glaxo Group Ltd | Genetic study |
-
2001
- 2001-10-22 WO PCT/US2001/045393 patent/WO2002035442A2/fr not_active Application Discontinuation
- 2001-10-22 EP EP01988909A patent/EP1350212A2/fr not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
See references of WO0235442A2 * |
Also Published As
Publication number | Publication date |
---|---|
WO2002035442A2 (fr) | 2002-05-02 |
WO2002035442A3 (fr) | 2003-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Casillas et al. | Molecular population genetics | |
Tang et al. | Estimation of individual admixture: analytical and study design considerations | |
Gompert et al. | Detection of individual ploidy levels with genotyping‐by‐sequencing (GBS) analysis | |
Griffiths et al. | Ancestral inference from samples of DNA sequences with recombination | |
Seltman et al. | Evolutionary‐based association analysis using haplotype data | |
Ellegren et al. | Mutation rate variation in the mammalian genome | |
Marchini et al. | A comparison of phasing algorithms for trios and unrelated individuals | |
AU783215B2 (en) | Methods of DNA marker-based genetic analysis using estimated haplotype frequencies and uses thereof | |
De Iorio et al. | Importance sampling on coalescent histories. I | |
Warmuth et al. | Genotype‐free estimation of allele frequencies reduces bias and improves demographic inference from RADSeq data | |
Cartwright et al. | A family-based probabilistic method for capturing de novo mutations from high-throughput short-read sequencing data | |
Zhang et al. | HTreeQA: using semi-perfect phylogeny trees in quantitative trait loci study on genotype data | |
Ignatieva et al. | The distribution of branch duration and detection of inversions in ancestral recombination graphs | |
Sevon et al. | TreeDT: tree pattern mining for gene mapping | |
Wu | Inference of population admixture network from local gene genealogies: a coalescent-based maximum likelihood approach | |
Rasmussen et al. | Inferring drift, genetic differentiation, and admixture graphs from low-depth sequencing data | |
WO2002035442A2 (fr) | Composite haplotype counts for multiple loci and alleles and association tests with continuous or discrete phenotypes | |
Marsh et al. | Biases in ARG-based inference of historical population size in populations experiencing selection | |
Wu et al. | BAM: A block-based Bayesian method for detecting genome-wide associations with multiple diseases | |
Halperin et al. | HAPLOFREQ—estimating haplotype frequencies efficiently | |
CN111739584B (zh) | Method and device for constructing a genotyping evaluation model for PGT-M detection | |
Bafna et al. | Inference about recombination from haplotype data: lower bounds and recombination hotspots | |
Struett et al. | Inference of evolutionary transitions to self-fertilization using whole-genome sequences | |
Rosenthal et al. | Joint linkage and segregation analysis under multiallelic trait inheritance: simplifying interpretations for complex traits | |
Kang et al. | Inference of population mutation rate and detection of segregating sites from next-generation sequence data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| | 17P | Request for examination filed | Effective date: 20030519 |
| | AK | Designated contracting states | Kind code of ref document: A2. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
| | AX | Request for extension of the European patent | Extension state: AL LT LV MK RO SI |
| | 17Q | First examination report despatched | Effective date: 20031205 |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| | 18D | Application deemed to be withdrawn | Effective date: 20040416 |