WO2006104263A2 - Statistical genetics analysis system, statistical genetics analysis method, and statistical genetics analysis program - Google Patents

Publication number
WO2006104263A2
Authority
WO
WIPO (PCT)
Prior art keywords
loci
specific
haplotype frequencies
data
computed
Application number
PCT/JP2006/307287
Other languages
French (fr)
Other versions
WO2006104263A8 (en)
Inventor
Suenori Chiku
Teruhiko Yoshida
Hiromi Sakamoto
Original Assignee
Mizuho Information & Research Institute, Inc.
Application filed by Mizuho Information & Research Institute, Inc.
Priority to EP06731235A (EP1864235A2)
Publication of WO2006104263A2
Publication of WO2006104263A8

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • STATISTICAL GENETICS ANALYSIS SYSTEM, STATISTICAL GENETICS ANALYSIS METHOD, AND STATISTICAL GENETICS ANALYSIS PROGRAM
  • The present invention relates to a statistical genetics analysis system, a statistical genetics analysis method, and a statistical genetics analysis program for estimating haplotype frequencies between two loci and computing linkage disequilibrium (LD) indices between two loci.
  • LD linkage disequilibrium
  • A conventional procedure for computing LD indices typically includes (1) estimation of haplotype frequencies between two loci, (2) computation of LD indices such as D' and ρ², and (3) estimation of confidence intervals using a bootstrap or a likelihood.
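Step (2) of the conventional procedure can be illustrated with a minimal sketch. It assumes two biallelic loci with alleles coded 0 and 1, and uses the standard textbook definitions of D, D' and the squared correlation (ρ², also written r²). The function name `ld_indices` and the dictionary layout are illustrative assumptions, not taken from the patent.

```python
def ld_indices(hap_freq):
    """Return (D, D', rho^2) from two-locus haplotype frequencies.

    hap_freq maps (allele_at_locus_1, allele_at_locus_2) -> frequency,
    with alleles coded 0/1 and the four frequencies summing to 1.
    """
    p_a = hap_freq[(1, 0)] + hap_freq[(1, 1)]  # allele 1 frequency, locus 1
    p_b = hap_freq[(0, 1)] + hap_freq[(1, 1)]  # allele 1 frequency, locus 2
    d = hap_freq[(1, 1)] - p_a * p_b           # raw disequilibrium coefficient

    # D' rescales D by its largest attainable magnitude given the allele
    # frequencies; the sign of D is kept in this sketch.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # Squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    rho2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, rho2
```

For example, `ld_indices({(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5})` describes two loci in complete LD, while four equal frequencies of 0.25 describe linkage equilibrium.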
  • Various algorithms have been proposed for estimating haplotype frequencies between two loci. For example, Japanese Laid-Open Patent Publication No. 2004-192018 describes a method for estimating haplotype frequencies for a group in accordance with an expectation maximization algorithm using as input values genotype pool information accumulating genotype information relating to a plurality of subjects included in the group.
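As a rough illustration of expectation maximization for haplotype frequencies, the following sketch handles the simplest case: two biallelic loci and unphased genotypes, assuming Hardy-Weinberg equilibrium. It is a generic textbook EM, not the method of the cited publication; the function name `em_two_locus` and the genotype encoding are assumptions made for this example.

```python
def em_two_locus(genotypes, n_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: non-empty list of (g1, g2), where each g is the number of
    copies (0, 1 or 2) of allele 1 an individual carries at that locus.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # uniform starting values
    n = len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 11/00 or 10/01;
                # split it according to the current frequency estimates.
                w_cis = freq[(1, 1)] * freq[(0, 0)]
                w_trans = freq[(1, 0)] * freq[(0, 1)]
                total = w_cis + w_trans
                p_cis = w_cis / total if total > 0 else 0.5
                counts[(1, 1)] += p_cis
                counts[(0, 0)] += p_cis
                counts[(1, 0)] += 1 - p_cis
                counts[(0, 1)] += 1 - p_cis
            else:
                # At least one locus is homozygous, so the phase is known.
                a = (1 if g1 == 2 else 0, 1 if g1 >= 1 else 0)
                b = (1 if g2 == 2 else 0, 1 if g2 >= 1 else 0)
                counts[(a[0], b[0])] += 1
                counts[(a[1], b[1])] += 1
        # M-step: expected haplotype counts divided by 2n chromosomes.
        new_freq = {h: counts[h] / (2 * n) for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq
```

A call such as `em_two_locus([(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 2)` resolves the two double heterozygotes almost entirely as 11/00, because the unambiguous individuals carry only those two haplotypes.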
  • A first aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals.
  • the statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies.
  • the computer also functions as a means for estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
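The conversion of multi-loci haplotype frequencies into two-loci haplotype frequencies can be read as a marginalization: frequencies of all multi-loci haplotypes that share the same alleles at the two specific loci are summed. The sketch below assumes that reading; the function name `marginalize_to_pair` and the tuple encoding of haplotypes are illustrative, not taken from the patent.

```python
def marginalize_to_pair(multi_freq, i, j):
    """Collapse multi-locus haplotype frequencies onto loci i and j.

    multi_freq maps a haplotype (tuple of alleles, one per locus) to its
    estimated frequency. The result maps (allele_i, allele_j) to the sum
    of the frequencies of all haplotypes carrying that allele pair.
    """
    pair_freq = {}
    for haplotype, frequency in multi_freq.items():
        key = (haplotype[i], haplotype[j])
        pair_freq[key] = pair_freq.get(key, 0.0) + frequency
    return pair_freq
```

With three-locus estimates `{(0, 0, 0): 0.4, (0, 1, 0): 0.1, (1, 0, 1): 0.2, (1, 1, 1): 0.3}`, calling `marginalize_to_pair(freqs, 0, 2)` sums over the middle locus and leaves only the haplotypes (0, 0) and (1, 1).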
  • A second aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals.
  • the statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variance
  • the computer also functions as a means for comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
  • A third aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals.
  • the statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices.
  • the computer further functions as a means for estimating linkage disequilib
  • A fourth aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals.
  • the statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed
  • the computer further functions as a means for comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
  • A fifth aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals.
  • the statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci.
  • the computer further functions as a means for estimating individual
  • A sixth aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals.
  • the statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data
  • the computer further functions as a means for comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
  • A seventh aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies.
  • The computer further executes the step of estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
  • An eighth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci
  • the computer further executes the step of comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
  • A ninth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices.
  • the computer further executes the step of estimating linkage disequilibrium indices based on the maximum likelihood estimates of the link
  • A tenth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loc
  • the computer further executes the step of comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
  • An eleventh aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci.
  • the computer further executes the step of estimating individual's diplotype posterior probabilities between the two specific loci
  • A twelfth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into
  • the computer further executes the step of comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
  • A thirteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies.
  • the program further causes the computer to function as a means for estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
  • A fourteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing
  • the program further causes the computer to function as a means for comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
  • A fifteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices.
  • the program further causes the computer to function as a means for estimating link
  • A sixteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the compute
  • the program further causes the computer to function as a means for comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
  • A seventeenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci.
  • the program further causes the computer to function as a means for estimating individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
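The computation of an individual's diplotype posterior probabilities, and their conversion to the two specific loci, might look like the following sketch. It assumes biallelic loci coded 0/1, Hardy-Weinberg proportions for diplotypes, and genotypes given as allele-1 counts per locus. The function names `diplotype_posteriors` and `project_diplotypes` are hypothetical and chosen for this example only.

```python
from itertools import combinations_with_replacement


def diplotype_posteriors(hap_freq, genotype):
    """Posterior probability of each diplotype compatible with a genotype.

    hap_freq maps haplotype tuples (0/1 alleles) to estimated frequencies;
    genotype gives the allele-1 count (0, 1 or 2) at every locus.
    """
    weights = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freq), 2):
        compatible = all(a + b == g for a, b, g in zip(h1, h2, genotype))
        if compatible:
            # Unordered pairs of distinct haplotypes arise in two ways.
            w = hap_freq[h1] * hap_freq[h2] * (1 if h1 == h2 else 2)
            if w > 0:
                weights[(h1, h2)] = w
    total = sum(weights.values())
    if total == 0:
        return {}
    return {dip: w / total for dip, w in weights.items()}


def project_diplotypes(posteriors, i, j):
    """Sum multi-locus diplotype posteriors onto the two loci i and j."""
    projected = {}
    for (h1, h2), prob in posteriors.items():
        key = tuple(sorted([(h1[i], h1[j]), (h2[i], h2[j])]))
        projected[key] = projected.get(key, 0.0) + prob
    return projected
```

For a triple heterozygote with genotype `(1, 1, 1)` and frequencies `{(0, 0, 0): 0.4, (1, 1, 1): 0.4, (1, 0, 0): 0.1, (0, 1, 1): 0.1}`, two diplotypes are compatible, and projecting onto loci 0 and 2 keeps them distinct as the cis and trans two-locus configurations.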
  • An eighteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer.
  • the program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-
  • the program further causes the computer to function as a means for comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
  • Fig. 1 is a schematic diagram of a system according to a preferred embodiment of the present invention.
  • Figs. 2 and 3 are explanatory diagrams showing the processing procedure of the preferred embodiment.
  • Figs. 4 and 5 are explanatory diagrams showing haplotypes including two subject loci.
  • Figs. 6a through 6d are charts showing results of analysis using simulation data.
  • A preferred embodiment of the present invention will now be described with reference to Figs. 1 to 4.
  • The preferred embodiment will be described as a statistical genetics analysis system, a statistical genetics analysis method, and a statistical genetics analysis program for obtaining haplotype frequencies between two specific loci (two-loci haplotype frequencies), diplotype posterior probabilities of each individual (individual's diplotype posterior probabilities), and linkage disequilibrium (LD) indices based on genotype data of multi-loci.
  • two-loci haplotype frequencies: haplotype frequencies between two specific loci
  • individual's diplotype posterior probabilities: diplotype posterior probabilities of each individual
  • LD: linkage disequilibrium
  • haplotype frequencies of multi-loci including two subject loci are computed, and haplotype frequencies between the two subject loci are computed using the multi-loci haplotype frequencies. In this way, information of multi-loci is used to evaluate LD between two loci.
  • using multi-loci increases both the amount of information and the number of parameters, so the estimation accuracy may actually decrease when the haplotype is too long.
  • a haplotype with an optimum length is believed to exist for each different amount of data.
  • haplotype frequencies of multi-loci are used to compute haplotype frequencies between two loci
  • a set of multi-loci having appropriate lengths or located at appropriate positions must be selected.
  • maximum likelihood estimates are evaluated to check their validity, and an estimate with the highest accuracy is selected from values for which validity has been verified.
  • the evaluation of the validity of maximum likelihood estimates is performed by constructing confidence intervals with various methods and thereby evaluating the validity of the model or the adequacy of the sample size. When the number of samples is small or when the validity of the model fails to be verified, such facts can be detected as distortions in the confidence intervals.
  • confidence intervals are constructed using an observed information matrix, an empirical information matrix, a nonparametric bootstrap, and a parametric bootstrap. Further, variances obtained using these methods are compared with one another to evaluate the validity of the model or the adequacy of the sample size.
  • the selection of the estimate with the highest accuracy is performed by comparing the confidence intervals estimated using sets of loci of various numbers each including two subject loci, and adopting the shortest confidence interval. As a result, the estimation result with the highest accuracy is adopted.
  • diplotype posterior probabilities of each individual and LD indices are also estimated in the same manner as described above using sets of multi-loci each including two subject loci.
  • a “haplotype” is a combination of alleles at a plurality of loci.
  • a “haplotype frequency” is a frequency at which a predetermined allele combination at a plurality of loci (haplotype) appears.
  • a combination of haplotypes of an individual is referred to as a “diplotype”.
  • a haplotype cannot be directly obtained through normal genotyping and is thus obtained through estimation.
  • the LD index computation device 20 includes a control unit 30.
  • the control unit 30 includes a control unit (CPU), a storage unit (RAM, ROM, etc.), and a communication unit, which are not shown in Fig. 1.
  • the control unit 30 performs processing that will be described later.
  • the LD index computation device 20 has the function of computing haplotype frequencies between two loci using multi-loci, individual's diplotype posterior probabilities, and LD indices.
  • This control unit 30 includes a successive multi-loci data generation unit 31, a confidence interval estimation unit 32, and a confidence interval estimation result comparison unit 33.
  • the successive multi-loci data generation unit 31 generates data of possible sets of successive multi-loci each including two subject loci (successive multi-loci data) based on multi-loci genotype data of individuals.
  • the successive multi-loci data generation unit 31 functions as a multi-loci data generation means as recited in the claims. Further, the successive multi-loci data corresponds to multi-loci data as recited in the claims.
  • the confidence interval estimation unit 32 estimates confidence intervals using a collection of successive multi-loci data.
  • the confidence interval estimation unit 32 functions as a confidence interval estimation means as recited in the claims.
  • the confidence interval estimation unit 32 includes a maximum likelihood estimation unit 41, an observed information matrix processing unit 42, an empirical information matrix processing unit 43, a nonparametric bootstrap processing unit 44, a parametric bootstrap processing unit 45, and a confidence verification unit 46.
  • the maximum likelihood estimation unit 41 performs maximum likelihood estimation of haplotype frequencies.
  • the maximum likelihood estimation unit 41 converts multi-loci haplotype frequencies into two-loci haplotype frequencies using the estimation result, computes LD indices, and computes individual's diplotype posterior probabilities.
  • the observed information matrix processing unit 42 obtains a variance through an observed information matrix using a collection of successive multi-loci data. Further, the observed information matrix processing unit 42 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies.
  • the empirical information matrix processing unit 43 obtains a variance through an empirical information matrix using a collection of successive multi-loci data. Further, the empirical information matrix processing unit 43 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies.
  • the nonparametric bootstrap processing unit 44 obtains a variance through a nonparametric bootstrap method using a collection of successive multi-loci data, and computes confidence intervals of haplotype frequencies, two-loci haplotype frequencies, individual's diplotype posterior probabilities, and LD indices.
  • the parametric bootstrap processing unit 45 obtains the variance through a parametric bootstrap method using a collection of successive multi-loci data. Further, the parametric bootstrap processing unit 45 computes confidence intervals of haplotype frequencies, two-loci haplotype frequencies, individual's diplotype posterior probabilities, and LD indices.
  • the confidence verification unit 46 performs confidence verification based on comparison among the variances obtained using the methods described above and based on adjustments of the confidence intervals obtained using the methods described above. Further, the confidence verification unit 46 determines whether maximum likelihood estimates fall within the confidence intervals. Next, based on these results, the confidence verification unit 46 determines whether the estimation results are to be adopted. The confidence interval estimation results are stored in a confidence interval estimation result storage unit 53 only when they are determined to be adopted.
  • the confidence interval estimation result comparison unit 33 compares the confidence interval estimation results for each set of successive multi-loci, and specifies and outputs two-loci haplotype frequencies, individual's diplotype posterior probabilities, and the LD indices associated with the shortest confidence interval.
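The comparison step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`loci`, `estimate`, `ci`, `passed`) and the function name are hypothetical, and `passed` stands in for the outcome of the confidence verification.

```python
def select_shortest_interval(results):
    """Pick the verified result whose confidence interval is shortest.

    results: list of dicts with hypothetical keys
      'loci'     - identifier of the set of successive loci used
      'estimate' - estimated two-loci quantity (e.g. a haplotype frequency)
      'ci'       - (lower, upper) confidence interval
      'passed'   - True if the estimate passed confidence verification
    Returns the chosen dict, or None if nothing passed verification.
    """
    valid = [r for r in results if r["passed"]]
    if not valid:
        return None
    # The shortest interval is taken as the most accurate estimate.
    return min(valid, key=lambda r: r["ci"][1] - r["ci"][0])
```

Results that failed verification are excluded before the comparison, mirroring the statement that only adopted results are stored and compared.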
  • control unit 30 is connected to a storage unit 50, which is formed by a RAM, a ROM, a hard disk, etc.
  • the storage unit 50 functions as an individual's multi-loci genotype data storage unit 51, a successive multi-loci data storage unit 52, and a confidence interval estimation result storage unit 53 serving as a confidence interval estimation result storage unit, etc.
  • the individual's multi-loci genotype data storage unit 51 stores genotype data of multi-loci of a plurality of individuals (individual's multi-loci genotype data).
  • the multi-loci genotype data of each individual is input and stored via an input unit 61.
  • the successive multi-loci data storage unit 52 stores genotype data of each possible set of successive multi-loci including two subject loci (successive multi-loci data), which is generated based on the multi-loci genotype data of each individual.
  • the successive multi-loci data is stored when the data is generated by the successive multi-loci data generation unit 31 in accordance with processing that will be described later.
  • the confidence interval estimation result storage unit 53 stores data relating to each confidence interval computed by the confidence interval estimation unit 32. More specifically, for a two-loci haplotype frequency, data for specifying a set of multi-loci for which a confidence interval of the two-loci haplotype frequency is obtained, the confidence interval of the two-loci haplotype frequency, and data relating to the two-loci haplotype frequency are stored in a manner that these data elements are associated with one another. For each LD index, data for specifying a set of multi-loci for which a confidence interval of the LD index is obtained, the confidence interval of the LD index, and data relating to the LD index are stored in a manner that these data elements are associated with one another.
  • data for specifying a set of multi-loci for which a confidence interval of the individual's two-loci diplotype posterior probability is obtained, the confidence interval of the individual's two-loci diplotype posterior probability, and data relating to the individual's two-loci diplotype posterior probability are stored in a manner that these data elements are associated with one another.
  • data for specifying a set of multi-loci, the confidence interval of the multi-loci haplotype frequency, and data relating to the multi-loci haplotype frequency are stored in a manner that these data elements are associated with one another.
  • data for specifying a set of multi-loci, the confidence interval of the individual's multi-loci diplotype posterior probability, and data relating to the individual's multi-loci diplotype posterior probability are stored in a manner that these data elements are associated with one another.
  • a flag indicating this (verification error flag) is added to the data.
  • a flag indicating this (confidence interval error flag) is added to the data.
  • control unit 30 is connected to the input unit 61, which is formed by an input unit for external data, such as a keyboard and a mouse, and is connected to an output unit 62, such as a display device.
  • the input unit 61 is used when, for example, multi-loci genotype data of an individual is input, or when an allowable range of two edge values of a confidence interval is set.
  • the output unit 62 outputs two-loci haplotype frequencies, individual's diplotype posterior probabilities, and the LD indices associated with the shortest confidence interval.
  • the user sets two subject loci using the input unit 61.
  • the control unit 30 of the LD index computation device 20 stores data relating to the two set subject loci into an internal storage unit of the control unit 30, which is not shown.
  • the user also sets an allowable range of two edge values of a confidence interval using the input unit 61.
  • the control unit 30 of the LD index computation device 20 stores data relating to the set allowable range of the two edge values of the confidence interval into the storage unit of the control unit 30 (not shown).
  • a collection of multi-loci genotype data of individuals is input into the LD index computation device 20 using the input unit 61.
  • the LD index computation device 20 stores the multi-loci genotype data of each individual into the individual's multi-loci genotype data storage unit 51.
  • the control unit 30 of the LD index computation device 20 reads the collection of the multi-loci genotype data of individuals from the genotype data storage unit 51 as shown in Fig. 2.
  • control unit 30 generates genotype data of possible sets of successive multi-loci including the two subject loci (successive multi-loci data) (step S1-1).
  • haplotype frequencies between two specific loci and the LD indices are evaluated.
  • haplotype frequencies of all sets of loci including these two loci are used to estimate confidence intervals, and the shortest confidence interval of the estimated confidence intervals is adopted as an estimate.
  • SNPs: single nucleotide polymorphisms
  • the number of degrees of freedom of a haplotype doubles each time a locus is added. Accordingly, the estimation accuracy of an extremely long haplotype is low.
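As a rough illustration of this growth, the sketch below counts the free haplotype-frequency parameters for a given number of loci. It assumes biallelic loci (such as SNPs) by default, and the function name is hypothetical.

```python
def free_haplotype_parameters(n_loci, alleles_per_locus=2):
    """Number of free haplotype-frequency parameters for n_loci loci.

    Assumes every locus has the same number of alleles (2 for SNPs).
    One parameter is removed because the frequencies must sum to 1.
    """
    n_haplotypes = alleles_per_locus ** n_loci
    return n_haplotypes - 1
```

With the same number of individuals, each added locus roughly doubles the number of frequencies to be estimated, which is why a cap on haplotype length can be useful.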
  • a maximum value may be set for the length of a haplotype, and all haplotype frequencies that fall within this range defined using the set maximum value may be evaluated.
  • Fig. 4 shows examples of sets of multi-loci each of which includes two subject loci. Although a case in which the two subject loci are successive is described here, the two subject loci do not necessarily have to be successive. A case in which the two subject loci are not successive will be described later.
  • Two loci i − 1 and i are the subject loci, and data of each of possible sets of successive loci including the two loci (successive multi-loci data) is generated. More specifically, symbol (a) shows a haplotype that is formed by two loci i − 1 and i. Symbol (b) shows a haplotype that is formed by three loci i − 2, i − 1, and i. Symbol (c) shows a haplotype that is formed by three loci i − 1, i, and i + 1. Symbol (d) shows a haplotype that is formed by four loci i − 3, i − 2, i − 1, and i.
  • Symbol (e) shows a haplotype that is formed by four loci i − 2, i − 1, i, and i + 1.
  • Symbol (f) shows a haplotype that is formed by four loci i − 1, i, i + 1, and i + 2.
  • Symbol (g) shows a haplotype that is formed by five loci i − 3, i − 2, i − 1, i, and i + 1.
  • in step S1-1, the control unit 30 generates successive multi-loci data for one of the possible sets of multi-loci each including the two subject loci based on multi-loci genotype data of each individual.
  • the generated successive multi-loci data is stored in the successive multi-loci data storage unit 52.
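The enumeration of successive sets of loci described for Fig. 4 can be sketched as below. This is an illustrative sketch only: the 0-based indexing, the `max_len` cap, and the function name are assumptions rather than details given in the text.

```python
def successive_locus_sets(i, j, n_loci, max_len):
    """Enumerate windows of successive loci that contain both subject loci.

    i, j    : 0-based indices of the two subject loci
    n_loci  : total number of genotyped loci
    max_len : maximum window length to consider (assumed user setting)
    Returns a list of (start, end) index pairs, inclusive, shortest first.
    """
    lo, hi = min(i, j), max(i, j)
    windows = []
    for start in range(0, lo + 1):
        for end in range(hi, n_loci):
            if end - start + 1 <= max_len:
                windows.append((start, end))
    # Order by length, then by start position.
    windows.sort(key=lambda w: (w[1] - w[0] + 1, w[0]))
    return windows
```

For subject loci 3 and 4 among 8 loci with `max_len=4`, the result lists the two-locus window first, then both three-locus windows, then the three four-locus windows, which corresponds to patterns (a) through (f).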
  • control unit 30 inputs the successive multi-loci data generated in the manner described above into a confidence interval estimation module, and estimates confidence intervals (step S1-2). This processing will be described below.
  • control unit 30 performs maximum likelihood estimation of haplotype frequencies in the manner described below (step S2-1).
  • Index i denotes an identifier of the data.
  • the frequency of allele $a$ at locus $l$ is expressed as $p^l_a$.
  • a frequency of haplotype $q$ in a certain interval in the group is expressed as $h_q$.
  • $h_q$ is hereafter used also to denote an identifier of a haplotype.
  • $v_p = \{h_q, h_r\}$. (2)
  • An appearance probability of genotype data $V_i$ can be obtained by summing $P(v_p)$ over possible haplotype combinations $v_p$, which is given by $P(V_i) = \sum_{v_p \in V_i} P(v_p)$.
  • a value of $\{h_q\}$ that maximizes $L$ is retrieved by performing a search under the constraint condition of $\sum_q h_q = 1$.
  • an EM (expectation maximization) algorithm is used here.
  • the EM algorithm iterates the operation of estimating complete-data from incomplete-data for maximizing the likelihood of the complete-data.
  • the incomplete-data in the present analysis is genotype data $V_i$, whose phase is unknown.
  • the complete-data is data of a certain set of haplotypes $v_p$.
  • data with the same genotype is estimated to be data of the same set of haplotypes.
  • the incomplete-data and the complete-data are summarized in the table below.
  • $K$ is the number of kinds of $V_i$
  • $P$ is the number of kinds of $v_p$.
  • $C$ is a constant that does not depend on $h_q$.
  • $n_q = \sum_p \gamma_q(v_p)\, n_{v_p}$ (8) is satisfied.
  • $\gamma_q(v_p)$ is the number of $h_q$ included in $v_p$.
  • $N = n_{V_1} + \cdots + n_{V_K}$, and is the number of individuals for which data is used.
  • the EM algorithm iterates E step (expectation step) performed using eq.(9) for obtaining an expectation of complete-data and the above M step (11).
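The E and M steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the document's implementation: loci are assumed biallelic, a genotype is encoded as the count (0, 1, or 2) of allele 1 at each locus, missing data is not handled, a fixed number of iterations replaces a convergence test, and all function names are hypothetical.

```python
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    het = [k for k, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # allele carried at homozygous loci
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for k, b in zip(het, bits):
            h1[k], h2[k] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=200):
    """Estimate haplotype frequencies by EM under Hardy-Weinberg equilibrium."""
    n_loci = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_loci))
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E step: weight each candidate pair by its HWE probability.
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M step: expected haplotype counts divided by 2N.
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freq
```

When every genotype has at most one heterozygous locus the phase is unambiguous, so the estimate reduces to simple haplotype counting.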
  • an estimate of an individual's diplotype posterior probability is obtained from the posterior probability expressed below in accordance with Bayes' theorem.
  • $V_i$ is a collection of genotype data of individual $i$
  • $v_p$ is a certain diplotype.
  • the summation over possible diplotypes for data $V_i$ is performed.
  • An appearance probability of $v_p$ is expressed using the above eq.(3) relating to an appearance probability on the assumption of the Hardy-Weinberg equilibrium (HWE).
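The Bayes computation for one individual can be sketched as below, assuming the candidate diplotypes compatible with the individual's genotype have already been listed and that diplotype probabilities follow Hardy-Weinberg proportions. The function name and the haplotype labels in the usage note are hypothetical.

```python
def diplotype_posteriors(candidate_diplotypes, freq):
    """Posterior probability of each candidate diplotype for one individual.

    candidate_diplotypes: list of (haplotype, haplotype) pairs compatible
                          with the individual's genotype data
    freq: dict mapping haplotype -> estimated frequency
    """
    weights = []
    for a, b in candidate_diplotypes:
        multiplicity = 1 if a == b else 2   # HWE: h_q^2 or 2 h_q h_r
        weights.append(multiplicity * freq[a] * freq[b])
    total = sum(weights)                     # sum over possible diplotypes
    return [w / total for w in weights]
```

For a double heterozygote with frequencies AB = 0.4, ab = 0.3, Ab = 0.2, aB = 0.1, the candidates (AB, ab) and (Ab, aB) receive weights 0.24 and 0.04, so the posteriors are 6/7 and 1/7.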
  • diplotype posterior probabilities of two subject loci are estimated using multi-loci in the same manner as the estimation of haplotype frequencies.
  • because diplotype posterior probabilities of two loci including missing data may be estimated with less variation, the use of multi-loci would improve the estimation accuracy of the diplotype posterior probabilities.
  • the multi-loci haplotype frequencies are converted into the two-loci haplotype frequency using the expression below.
  • a two-loci haplotype frequency is normally obtained by performing a summation over the $L - 2$ loci other than the subject loci, as in the above eq.(13).
  • Symbol $\sum'$ indicates that the summation is over all haplotypes whose alleles at subject loci $i$ and $j$ are $a_i$ and $a_j$.
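The conversion from multi-loci to two-loci haplotype frequencies is a marginalization, which can be sketched as follows. Haplotypes are assumed to be tuples of alleles indexed by position within the locus set; the function name is hypothetical.

```python
from collections import defaultdict

def two_locus_frequencies(multi_freq, i, j):
    """Sum multi-loci haplotype frequencies over all loci except i and j.

    multi_freq: dict mapping a haplotype (tuple of alleles) -> frequency
    i, j      : positions of the two subject loci within the tuple
    Returns a dict mapping (allele at i, allele at j) -> frequency.
    """
    out = defaultdict(float)
    for hap, f in multi_freq.items():
        out[(hap[i], hap[j])] += f
    return dict(out)
```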
  • the LD indices between two loci are computed, using the two-loci haplotype frequencies that are estimated in the manner described above (step S2-4).
  • $\rho^2$ and $D'$ are used as LD indices.
  • among LD indices, including $\rho^2$ and $D'$, some do not have uniform definitions.
  • Various indices used in the present analysis are defined below.
  • $h_{a_1 a_2}$ denotes a frequency of a haplotype formed by allele $a_1$ at locus 1 and allele $a_2$ at locus 2
  • $p_{a_1}$ denotes a frequency of allele $a_1$ at locus 1.
  • $\bar{a}_1$ denotes an allele other than allele $a_1$ at locus 1.
  • $D'$, $\rho^2$, $\delta$, $d$, and $Q$ are all defined using $D$ in eq.(14). A value of $D$ is determined only by using a haplotype frequency between two loci.
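For two biallelic loci, commonly used definitions of $D$, $D'$, and the squared correlation can be sketched as below. Because LD index definitions are not uniform, these formulas are an assumption and may differ in detail from the definitions adopted in this document; alleles are encoded as 0 and 1 and the function name is hypothetical.

```python
def ld_indices(h):
    """Compute D, D' and the squared correlation from two-loci frequencies.

    h: dict mapping (allele at locus 1, allele at locus 2) -> frequency,
       with alleles encoded as 0 or 1 and frequencies summing to 1.
    """
    p = h[(0, 0)] + h[(0, 1)]   # frequency of allele 0 at locus 1
    q = h[(0, 0)] + h[(1, 0)]   # frequency of allele 0 at locus 2
    if min(p, 1 - p, q, 1 - q) <= 0:
        raise ValueError("both loci must be polymorphic")
    d = h[(0, 0)] - p * q
    # D' scales D by its largest attainable magnitude given p and q.
    if d > 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    d_prime = d / d_max
    rho2 = d * d / (p * (1 - p) * q * (1 - q))
    return d, d_prime, rho2
```

Here $D'$ keeps the sign of $D$; some definitions report its absolute value instead.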
  • the processing in the above steps S2-1 to S2-4 is executed by an MLE (maximum likelihood estimation) module.
  • MLE: maximum likelihood estimation
  • This MLE module is also used in the nonparametric bootstrap and the parametric bootstrap, which will be described later.
  • the observed information matrix processing unit 42 of the control unit 30 computes an observed information matrix in the manner described below (step S2-5).
  • the maximum likelihood estimation method is known to have asymptotic efficiency.
  • the asymptotic efficiency means that an estimate approaches a true value with a minimum variance as the number of samples increases, as in the expression below. $\hat{\theta} \sim N(\theta_0, I_F^{-1}(\theta_0))$, as $N \to \infty$. (20)
  • $I_F$ is a matrix called a Fisher information matrix, and is defined as the negative of the expectation of the second derivative of the log likelihood function.
  • the confidence of an estimate is obtained by evaluating the information matrix.
  • such computation is usually difficult (because it requires an expectation to be computed and information relating to a true parameter to be used), and some approximation methods have been proposed.
  • as approximations of the Fisher information matrix, an observed information matrix and an empirical information matrix have been proposed. In the present analysis, these two information matrices are computed.
  • the expression to be evaluated is $\partial P(v_p)/\partial h_q$.
  • One haplotype needs to be deleted based on the condition of constraint.
  • the haplotype to be deleted is $h_Q$, according to the following expression,
  • the estimate $\hat{h}$ is a saddle point of the likelihood when an eigenvalue of the observed information matrix is negative.
  • the value of $\hat{h}$ needs to be changed slightly in the direction of the eigenvector of the negative eigenvalue, and the EM algorithm needs to be started again.
  • the EM algorithm alone fails to determine whether the converged point is a maximum value or a saddle point.
  • the maximum value is checked by examining the eigenvalue of the observed information matrix.
  • not all eigenvalues are necessarily positive when the parameter converges on the boundary of the domain of definition.
  • the observed information matrix processing unit 42 of the control unit 30 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies in the manner described below (step S2-6).
  • An information matrix to be computed is a $(Q - 1) \times (Q - 1)$ matrix (one dimension is removed based on the condition of constraint).
  • confidence intervals of the $(Q - 1)$-dimensional parameters are constructed in accordance with the basic principle.
  • $R^2(\Delta h) = \sum_{q,q'} \Delta h_q\, I_{qq'}\, \Delta h_{q'}$, (31) is defined from the expression of an exponent of a multivariate normal distribution in eq.(20) relating to the asymptotic efficiency of the maximum likelihood estimation method.
  • $\Delta h = h - h^0$ is satisfied.
  • $I_{qq'}$ is an estimated information matrix. This reveals that $R^2$ follows the $\chi^2$ distribution with $Q - 1$ degrees of freedom, from eq.(20) relating to the asymptotic efficiency of the maximum likelihood estimation method.
  • the $(1 - \alpha)$ confidence region is defined by,
  • a confidence interval of each value of $h_q$ is defined as one side of a multidimensional rectangular parallelepiped in which this multidimensional ellipsoid is inscribed.
  • $\nabla R^2(\Delta h)$ is a normal vector of $R^2(\Delta h)$.
  • a confidence interval of $h_q$ is obtained as a point at which this normal vector becomes parallel to the $h_q$ axis.
  • a unit vector of the $h_q$ axis is defined by $e_q$, and $\Delta h_q$ satisfies the following equations.
  • the Q-th parameter that is deleted based on the condition of constraint is obtained using the condition of constraint as follows,
  • confidence intervals of haplotype frequencies are obtained using the observed information matrix result according to eq.(42) above relating to the confidence interval of $h_Q$.
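The geometric construction above (the box circumscribing the confidence ellipsoid) leads to a standard closed form, sketched here as a worked derivation. The notation $C = I^{-1}$ for the inverse of the estimated information matrix and $c$ for the chi-square critical value is introduced only for this sketch, and the result is not claimed to reproduce the document's numbered equations term by term.

```latex
% Assumed notation: C = I^{-1}, c = \chi^2_{Q-1}(\alpha) (critical value).
% Extreme value of \Delta h_q on the ellipsoid \Delta h^{t} I \Delta h = c:
\max_{\Delta h}\; e_q^{t}\,\Delta h
\quad \text{subject to} \quad \Delta h^{t}\, I\, \Delta h = c
\;\Longrightarrow\;
e_q = \lambda\, I\, \Delta h
\;\Longrightarrow\;
\Delta h = \lambda^{-1}\, C\, e_q ,
\qquad
\Delta h_q = \pm \sqrt{c\; C_{qq}} .

% For the deleted parameter, \Delta h_Q = -\sum_{q=1}^{Q-1} \Delta h_q, so
|\Delta h_Q| \;\le\; \sqrt{\, c \sum_{q,q'=1}^{Q-1} C_{qq'} \,} .
```

The first line expresses the condition that the normal vector of the ellipsoid is parallel to the $h_q$ axis; the half-width of each interval then depends only on the corresponding diagonal element of the inverse information matrix.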
  • An average of multi-loci haplotype frequencies is represented by $\bar{h}_q$.
  • a sample variance is represented by $\Sigma_{qq'}$. In this case, the following is satisfied.
  • a partial set of subject haplotypes is $s$
  • a summation of $h_q$ in set $s$ is $h_s$, which is expressed as
  • results of information matrices (an observed information matrix and an empirical information matrix) or bootstrap methods described later (a nonparametric bootstrap method and a parametric bootstrap method) are each converted into information of two loci.
  • the result of an observed information matrix is converted into information of two loci.
  • a confidence interval of a two-loci haplotype is computed using the conversion result.
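Since a two-loci haplotype frequency is a sum of multi-loci haplotype frequencies over a subset $s$, its variance is the sum of the corresponding variance and covariance entries. The sketch below illustrates this standard identity; the nested-list matrix representation and the function name are assumptions.

```python
def subset_sum_variance(cov, subset):
    """Variance of the sum of the haplotype frequencies indexed by subset.

    cov   : square variance-covariance matrix as a list of lists
    subset: indices of the multi-loci haplotypes that share the same
            alleles at the two subject loci
    """
    # Var(sum_q h_q) = sum over q, q' in the subset of Cov(h_q, h_q').
    return sum(cov[q][r] for q in subset for r in subset)
```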
  • the empirical information matrix processing unit 43 of the control unit 30 computes an empirical information matrix in the manner described below (step S2-7).
  • the empirical information matrix is computed using a method obtained by simplifying the computation method of the observed information matrix described above.
  • $I_e(\theta; y)$ is referred to as an empirical information matrix.
  • the empirical information matrix is computed using eq.(53) relating to $(I_e)_{q,r}$.
  • the empirical information matrix processing unit 43 of the control unit 30 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies (step S2-8). This processing is performed in the same manner as in the above step S2-6 using the result of the empirical information matrix in step S2-7.
  • the nonparametric bootstrap processing unit 44 of the control unit 30 performs a nonparametric bootstrap in the manner described below (step S2-9).
  • the bootstrap method is a method for generating a sample from given data using random numbers and estimating a variance or a confidence interval for a statistic.
  • a basic concept of the bootstrap method is not to compute a true parameter but to estimate the relationship (distribution) between an estimate and a true parameter from data or from a sample obtained from an estimate (referred to as a bootstrap sample).
  • the bootstrap method is roughly divided into a nonparametric bootstrap and a parametric bootstrap depending on how a bootstrap sample is generated.
  • $\#\{y_i \le y\}$ is the number of data elements less than or equal to $y$
  • $H(y - y_i)$ is a function being 1 when $y \ge y_i$ and being 0 when $y < y_i$.
  • a bootstrap sample is generated from this distribution. More specifically, the same number of data elements as the number of given data elements are randomly drawn from the given data elements with replacement.
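Resampling with replacement can be sketched as follows. The function name, the `seed` argument, and the use of Python's `random` module are choices made for this illustration only.

```python
import random

def nonparametric_bootstrap_samples(data, n_samples, seed=0):
    """Draw bootstrap samples from the observed data with replacement.

    data     : list of observed records (e.g. per-individual genotypes)
    n_samples: number of bootstrap samples B
    Each sample has the same size as the original data.
    """
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)]
            for _ in range(n_samples)]
```

Each bootstrap sample would then be passed through the same maximum likelihood estimation as the original data.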
  • the parametric bootstrap method can be used when a model of distribution is given. Data is extracted randomly from this distribution by establishing a virtual bootstrap world with a parameter estimated from data being used as a true parameter. With the maximum likelihood method, a probability function including a parameter is usually given. Thus, this method can be used.
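A parametric bootstrap sample can be sketched by drawing two haplotypes per individual from the estimated frequencies and recording only the resulting unphased genotype. The sketch assumes biallelic loci with haplotypes encoded as tuples of 0/1, independent draws of the two haplotypes (Hardy-Weinberg equilibrium), and a hypothetical function name.

```python
import random

def parametric_bootstrap_sample(freq, n_individuals, seed=0):
    """Generate genotype data from estimated haplotype frequencies.

    freq: dict mapping a haplotype (tuple of 0/1 alleles) -> frequency,
          treated as the true parameter of the bootstrap world
    Returns a list of genotypes, each a tuple of allele-1 counts per locus.
    """
    rng = random.Random(seed)
    haps = list(freq)
    weights = [freq[h] for h in haps]
    sample = []
    for _ in range(n_individuals):
        # Two independent haplotype draws correspond to the HWE model.
        h1, h2 = rng.choices(haps, weights=weights, k=2)
        sample.append(tuple(a + b for a, b in zip(h1, h2)))
    return sample
```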
  • the statistical processing performed after a bootstrap sample is obtained is the same for both the nonparametric bootstrap method and the parametric bootstrap method.
  • Comparing the results of the nonparametric bootstrap and the parametric bootstrap provides one method for determining the adequacy of the model.
  • $Y = \{V_1, V_2, \ldots, V_N\}$. (55)
  • Each $V_i$ denotes genotype data of $L$ loci. Haplotype frequencies of the group are expressed below.
  • $h = \{h_1, h_2, \ldots, h_Q\}$. (56)
  • Each $h_q$ denotes a haplotype frequency labeled with $q$.
  • the HWE is a model set for performing maximum likelihood estimation. Haplotype frequencies obtained directly by performing maximum likelihood estimation from data are shown below.
  • the $b$-th bootstrap sample is expressed as
  • each $V_i^{*b}$ denotes one element of eq.(55) relating to a collection of observed data with the nonparametric method, and denotes genotype data obtained from a set of haplotypes generated according to a frequency distribution obtained using eq.(57) relating to haplotype frequencies obtained directly by performing maximum likelihood estimation from data with the parametric method.
  • Haplotype frequencies obtained by performing maximum likelihood estimation using each $V^{*b}$ are given by
  • $se(h_q)$ is not usually used to construct a confidence interval. It is used only as a guideline for the degree of variation of $h_q$ when the bootstrap method is used. The central limit theorem is not used in most cases. A bias occurring when the bootstrap method is used is defined below.
  • Bias correction with the bootstrap method is usually extremely dangerous and is not recommended in most cases. In most cases, such bias correction may be performed only to check whether a bias falls within a standard error range or a confidence interval. When the bias is too large, the estimation method itself needs to be checked. For example, the number of samples may be small, or the assumption may be wrong.
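The bootstrap standard error and bias of a single haplotype frequency can be sketched with their usual textbook definitions, which are assumed here because the corresponding equations are not reproduced in the text. The function name is hypothetical.

```python
import math

def bootstrap_se_and_bias(estimate, bootstrap_estimates):
    """Standard error and bias of an estimate from its bootstrap replicates.

    estimate           : value estimated from the original data
    bootstrap_estimates: values estimated from each bootstrap sample
    """
    b = len(bootstrap_estimates)
    mean = sum(bootstrap_estimates) / b
    var = sum((x - mean) ** 2 for x in bootstrap_estimates) / (b - 1)
    # Bias is reported for diagnosis only; it is not used to correct.
    return math.sqrt(var), mean - estimate
```

In line with the caution above, the bias is returned so that it can be compared against the standard error, not subtracted from the estimate.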
  • the nonparametric bootstrap processing unit 44 of the control unit 30 computes confidence intervals of multi-loci haplotype frequencies, two-loci haplotype frequencies, individual's multi-loci diplotype posterior probabilities, individual's two-loci diplotype posterior probabilities, and LD indices in the manner described below (step S2-10).
  • a method for obtaining a confidence interval by sorting estimates computed from bootstrap samples is referred to as a percentile method.
  • the shape of a confidence region is determined directly by a parameter for which sorting is performed. Because this method directly refers to the distribution tail, the number of bootstrap samples $B$ needs to be large. Even when $B$ is around 2000, results of multivariate analysis seem to vary greatly. Further, because the bias correction is not performed at all, the convergence of an obtained confidence interval is not satisfactory.
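A one-dimensional percentile interval can be sketched as below. The rule for picking the order statistics (floor for the lower edge, ceiling for the upper edge) is one of several conventions and is an assumption of this sketch, as is the function name.

```python
import math

def percentile_interval(bootstrap_estimates, alpha=0.05):
    """Percentile confidence interval from sorted bootstrap estimates.

    Returns the (lower, upper) order statistics that cut off roughly
    alpha/2 of the replicates in each tail.
    """
    values = sorted(bootstrap_estimates)
    b = len(values)
    lower = values[int(math.floor((alpha / 2) * (b - 1)))]
    upper = values[int(math.ceil((1 - alpha / 2) * (b - 1)))]
    return lower, upper
```

Because the interval edges are individual order statistics in the tails, they fluctuate noticeably unless the number of replicates is large.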
  • a variance needs to be known to compute the $\chi^2$ statistic.
  • a Mahalanobis distance (described in detail later), which is obtained by replacing a variance matrix with a sample variance obtained using the bootstrap method with reference to the $t$ statistic, will be discussed here. This is computed in the manner described below.
  • $\Sigma$ denotes a sample variance matrix in a bootstrap sample
  • each element $\Sigma_{qq'}$ is computed in the manner described below.
  • a percentile confidence interval can be constructed by directly sorting each element of $h_q$.
  • a confidence interval constructed in this way is shorter than the confidence interval constructed using $r^2$ described above.
  • This confidence interval is not a confidence interval constructed in the entire multidimension but is a confidence interval constructed in one dimension by projecting data on a certain axis. This method fails to consider the entire multidimension (referred to as a "1-dim percentile" herein).
  • when variance matrix $\Sigma$ is singular (when its rank is reduced), there exists no inverse matrix of that matrix. This is equivalent to zero eigenvalues existing when a variance matrix is diagonalized.
  • the variance matrix is a diagonal matrix.
  • matrix $\Lambda'$ of $Q' \times Q'$, which is obtained by reducing the size of matrix $\Lambda$ by an amount corresponding to the number of the zero eigenvalues, is generated.
  • Matrix $\Lambda'$ and the $Q'$-dimensional vector $y'$ excluding elements corresponding to the zero eigenvalues of $y$ are used to compute $r^2$ as follows.
  • $\Sigma^{-1} = P \Lambda^{-1} P^{t}$.
  • $r^2$ is evaluated using eq.(63) although the rank of matrix $\Sigma$ is reduced.
  • a variance matrix usually has at least one zero eigenvalue based on the condition of constraint of haplotype frequencies. Because rounding errors are generated in numerical computation, an eigenvalue that is far smaller than the maximum eigenvalue needs to be regarded as a zero eigenvalue.
  • Confidence intervals of haplotype frequencies are computed using eq.(42) relating to the confidence interval of $h_Q$ described above.
  • a variance matrix is converted into a variance matrix of two loci, and that result is used to compute confidence intervals of two-loci haplotype frequencies.
  • Confidence intervals of individual's diplotype posterior probabilities are constructed using the maximum likelihood estimation and the bootstrap method. Confidence intervals of individual's multi-loci diplotype posterior probabilities and confidence intervals of individual's two-loci diplotype posterior probabilities are constructed in the same manner as the haplotype frequency estimation.
  • the parametric bootstrap processing unit 45 of the control unit 30 performs the parametric bootstrap described above (step S2-11).
  • the parametric bootstrap processing unit 45 of the control unit 30 computes confidence intervals of multi-loci haplotype frequencies, two-loci haplotype frequencies, individual's multi-loci diplotype posterior probabilities, individual's two-loci diplotype posterior probabilities, and LD indices (step S2-12). This processing is performed using results of the parametric bootstrap in the same manner as the computation of the confidence intervals of the multi-loci haplotype frequencies, the two-loci haplotype frequencies, the individual's multi-loci diplotype posterior probabilities, the individual's two-loci diplotype posterior probabilities, and the LD indices performed in step S2-10 described above.
  • the confidence verification unit 46 of the control unit 30 evaluates the confidence intervals obtained in the above processes in the manner described below (S2-13).
  • the distribution also has asymptotic efficiency.
  • a variance of each estimate is at its minimum when the data amount is large.
  • the maximum likelihood method is shown to be one of the best estimation methods if the data amount is so large that estimation performed has asymptotic efficiency. This may be confirmed by comparing variances obtained using the information matrices (the observed information matrix and the empirical information matrix) and variances obtained using the bootstrap methods. This is because the variances obtained using the bootstrap methods are variances of direct estimates and the variances obtained using the information matrices are variances obtained based on the assumption of the asymptotic normality.
  • the validity of the model is verified by comparing the variance obtained using the nonparametric bootstrap method and the variance obtained using the parametric bootstrap method.
  • the asymptotic normality is verified by comparing the variance obtained using the observed information matrix and the variance obtained using the nonparametric bootstrap method.
  • (1) the test using the variance ratio (F-test) and (2) the adjustment of the allowable confidence interval are performed.
  • In the F-test, for example, the variance obtained using the information matrix is given 1 degree of freedom, and the variance obtained using the bootstrap method is given degrees of freedom equal to the bootstrap sample number minus one.
  • This verification involves multiple comparisons.
  • the level of significance considering the Bonferroni correction needs to be set in advance.
  • determination is performed as to whether the two edge values of the confidence interval obtained using each method fall within the allowable range of the two edge values of the confidence interval, which is specified in advance by the user.
  • This method does not compare variances directly. However, it is effective when the estimation accuracy can be considered sufficiently high as long as the confidence interval falls within the tolerable value range set based on the amount of data.
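The allowable-range check and the Bonferroni-adjusted significance level mentioned above might be sketched in Python as follows. This is only an illustration: the function names, the representation of intervals as (lower, upper) tuples, the reading of the "allowable range" as one permitted range per edge, and all numeric values are assumptions that do not come from the source.

```python
def bonferroni_level(alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def edges_within_allowable(interval, allowable_lower, allowable_upper):
    """Check that both edges of a confidence interval lie in user-set ranges.

    interval        -- (lower, upper) edges obtained with one method
    allowable_lower -- (min, max) range permitted for the lower edge
    allowable_upper -- (min, max) range permitted for the upper edge
    """
    lower, upper = interval
    return (allowable_lower[0] <= lower <= allowable_lower[1]
            and allowable_upper[0] <= upper <= allowable_upper[1])


# Hypothetical intervals for one haplotype frequency from four methods.
intervals = {
    "observed_information": (0.18, 0.26),
    "empirical_information": (0.17, 0.27),
    "nonparametric_bootstrap": (0.16, 0.27),
    "parametric_bootstrap": (0.10, 0.35),
}
passed = {name: edges_within_allowable(ci, (0.15, 0.20), (0.25, 0.30))
          for name, ci in intervals.items()}

# Three variance comparisons are made, so each is tested at alpha / 3.
per_test_alpha = bonferroni_level(0.05, 3)
```

In this made-up example only the parametric bootstrap interval falls outside the user-specified ranges, so only that result would be flagged.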
  • the verification with this maximum likelihood estimation method is performed for estimation of each of a plurality of sets of loci.
  • In the evaluation of the confidence intervals of haplotype frequencies, individual's diplotype posterior probabilities, and LD indices described below, only the results that have passed each evaluation of this verification are used.
  • the verification with this maximum likelihood estimation method is performed for the variance of the multi-loci haplotype frequencies and the variance of the two-loci haplotype frequencies.
  • confidence intervals of haplotype frequencies of an entire specified interval (successive multi-loci)
  • confidence intervals of haplotype frequencies between two loci
  • confidence intervals of individual's diplotype posterior probabilities of an entire specified interval
  • confidence intervals of individual's diplotype posterior probabilities between two loci
  • confidence intervals of LD indices will be separately described in detail.
  • data specifying the set of the multi-loci, data relating to the confidence intervals of the multi-loci haplotype frequencies, and data relating to the multi-loci haplotype frequencies are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • a flag indicating this (verification error flag) is added and stored together with the above data.
  • a flag indicating this (confidence interval error flag) is added and stored together with the above data.
  • data specifying the set of the multi-loci for which the confidence intervals of the two-loci haplotype frequencies are computed, data relating to the confidence intervals of the two-loci haplotype frequencies, and data relating to the two-loci haplotype frequencies are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • a flag indicating this (verification error flag) is added and stored together with the above data.
  • a flag indicating this (confidence interval error flag) is added and stored together with the above data.
  • data specifying the set of the multi-loci, data relating to the confidence intervals of the individual's multi-loci diplotype posterior probabilities, and data relating to the individual's multi-loci diplotype posterior probabilities are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • a flag indicating this (verification error flag) is added and stored together with the above data.
  • a flag indicating this (confidence interval error flag) is added and stored together with the above data.
  • data specifying the set of the multi-loci for which the confidence intervals of the individual's two-loci diplotype posterior probabilities are computed, data relating to the confidence intervals of the individual's two-loci diplotype posterior probabilities, and data relating to the individual's two-loci diplotype posterior probabilities are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • a flag indicating this (verification error flag) is added and stored together with the above data.
  • a flag indicating this (confidence interval error flag) is added and stored together with the above data.
  • indices ρ² and D' are evaluated using (1) the BCa method in accordance with the nonparametric bootstrap method and (2) the BCa method in accordance with the parametric bootstrap method. More specifically, "[4-3-1] Verification of Estimation Confidence" described above is first performed for the variances of the haplotype frequencies that are converted into information of two loci.
  • the confidence interval estimation unit 32 repeats the processing of step S1-1 and step S1-2 described above for possible sets of successive multi-loci including two subject loci (successive multi-loci) in a specified range (e.g., in a range defined using a set maximum length) until the processing is completed (until the processing reaches termination in step S1-1).
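The enumeration of possible sets of successive multi-loci within a maximum length can be illustrated with a short Python sketch. The function name, the 0-based locus indexing, and the (start, end) window representation are assumptions made here for illustration and are not taken from the source.

```python
def successive_locus_sets(j, k, n_loci, max_length):
    """List windows of consecutive loci (0-based indices) containing j and k.

    Each window is an inclusive (start, end) pair whose length does not
    exceed max_length.
    """
    lo, hi = min(j, k), max(j, k)
    windows = []
    for start in range(0, lo + 1):
        for end in range(hi, n_loci):
            if end - start + 1 <= max_length:
                windows.append((start, end))
    return windows


# Two adjacent subject loci (indices 3 and 4) among 12 loci, at most 4 loci per set.
windows = successive_locus_sets(j=3, k=4, n_loci=12, max_length=4)
```

Each returned window would then be passed to the estimation steps in turn.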
  • the confidence interval estimation result comparison unit 33 performs comparison of data of the confidence interval estimation results of the two-loci haplotype frequencies, the confidence interval estimation results of the individual's two-loci diplotype posterior probabilities, the confidence interval estimation results of the LD indices stored in the confidence interval estimation result storage unit 53 in the manner described below.
  • data stored in the confidence interval estimation result storage unit 53 only data for which neither a verification error flag nor a confidence interval error flag is set is used for the comparison. More specifically, data used here for the comparison is data satisfying the evaluation conditions in step S2-13 described above.
  • the confidence interval estimation results of the two-loci haplotype frequencies are compared with one another, and the shortest confidence interval is specified.
  • the specified shortest confidence interval, the two-loci haplotype frequency stored as being associated with this confidence interval, and data specifying the set of the multi-loci for which the confidence interval of this two-loci haplotype frequency is obtained are output to the output unit 62 in a manner that the data of the specified confidence interval can be specified.
  • For the confidence interval estimation results of the individual's two-loci diplotype posterior probabilities, the confidence interval estimation results of the individual's two-loci diplotype posterior probabilities of each set of successive multi-loci for which neither a verification error flag nor a confidence interval error flag is set are compared with one another, and the shortest confidence interval is specified.
  • the specified shortest confidence interval, the individual's two-loci diplotype posterior probabilities stored as being associated with this confidence interval, and data specifying the set of the multi-loci for which the confidence intervals of these individual's two-loci diplotype posterior probabilities are obtained are output to the output unit 62 in a manner that the data of the specified confidence interval can be specified.
  • For the confidence interval estimation results of the LD indices, the confidence interval estimation results of each two-loci LD index of each set of successive multi-loci for which neither a verification error flag nor a confidence interval error flag is set are compared with one another, and the shortest confidence interval is specified.
  • the specified shortest confidence interval, the LD indices stored as being associated with this confidence interval, and data specifying the set of the multi-loci for which the confidence intervals of these LD indices are obtained are output to the output unit 62 in a manner that the data of the specified confidence interval can be specified.
  • the shortest confidence intervals of the two-loci haplotype frequencies, the individual's two-loci diplotype posterior probabilities, and the LD indices are output in a manner that the shortest confidence intervals can be specified.
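The comparison step, which skips flagged results and adopts the shortest remaining confidence interval, might look like the following Python sketch. The `Result` record, its field names, and the example values are hypothetical; the source only states that the data elements and flags are stored in association with one another.

```python
from dataclasses import dataclass


@dataclass
class Result:
    loci: tuple                      # set of multi-loci used for the estimate
    estimate: float                  # e.g. a two-loci haplotype frequency
    lower: float                     # lower edge of the confidence interval
    upper: float                     # upper edge of the confidence interval
    verification_error: bool = False
    confidence_interval_error: bool = False


def shortest_valid_interval(results):
    """Return the unflagged result with the shortest interval, or None."""
    valid = [r for r in results
             if not (r.verification_error or r.confidence_interval_error)]
    if not valid:
        return None
    return min(valid, key=lambda r: r.upper - r.lower)


# Hypothetical stored results for the same two subject loci.
stored = [
    Result(loci=(3, 4), estimate=0.21, lower=0.12, upper=0.30),
    Result(loci=(2, 3, 4, 5), estimate=0.22, lower=0.17, upper=0.27),
    Result(loci=(1, 2, 3, 4, 5, 6), estimate=0.25, lower=0.23, upper=0.26,
           verification_error=True),
]
adopted = shortest_valid_interval(stored)
```

The six-locus result has the narrowest interval but carries a verification error flag, so the four-locus result is adopted here.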
  • the multi-loci haplotype frequencies, the two-loci haplotype frequencies, the individual's multi-loci diplotype posterior probabilities, the individual's two-loci diplotype posterior probabilities, the LD indices, the confidence intervals for these, and the data specifying the set of the multi-loci for which these confidence intervals are obtained are output as described above in the preferred embodiment. Further, for the data for which a verification error flag or a confidence interval error flag is set, the reason why the data is not used for the comparison is output based on the flag in a manner that the reason can be specified.
  • the LD index computation device 20 generates successive multi-loci data, that is, multi-loci genotype data of possible sets of successive multi-loci including two specific loci, from a collection of multi-loci genotype data of individuals.
  • maximum likelihood estimates of haplotype frequencies of successive multi-loci including the two specific loci are computed using the successive multi-loci data for each set of successive multi-loci, and the maximum likelihood estimates of the multi-loci haplotype frequencies are converted into two-loci haplotype frequencies.
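As a simplified illustration of maximum likelihood estimation of haplotype frequencies, the following Python sketch applies the expectation-maximization (EM) algorithm to just two biallelic loci. It is not the multi-loci procedure of the embodiment: it assumes Hardy-Weinberg proportions, complete genotype data, and a genotype encoding (copies of allele A and of allele B) chosen here for brevity. The function name and parameters are illustrative.

```python
def em_two_locus(genotype_counts, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies for two biallelic loci.

    genotype_counts maps (x, y) to the number of individuals carrying
    x copies of allele A at locus 1 and y copies of allele B at locus 2.
    Only the double heterozygote (1, 1) has ambiguous phase (AB/ab or Ab/aB).
    """
    n_hap = 2 * sum(genotype_counts.values())
    fixed = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    for (x, y), count in genotype_counts.items():
        if (x, y) == (1, 1):
            continue
        # With at least one homozygous locus the haplotype pair is determined.
        l1 = ["A"] * x + ["a"] * (2 - x)
        l2 = ["B"] * y + ["b"] * (2 - y)
        for hap in (l1[0] + l2[0], l1[1] + l2[1]):
            fixed[hap] += count
    n_dh = genotype_counts.get((1, 1), 0)

    freqs = {hap: 0.25 for hap in fixed}
    for _ in range(n_iter):
        # E-step: expected share of double heterozygotes with phase AB/ab.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        share = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from the expected haplotype counts.
        new = {
            "AB": (fixed["AB"] + n_dh * share) / n_hap,
            "ab": (fixed["ab"] + n_dh * share) / n_hap,
            "Ab": (fixed["Ab"] + n_dh * (1 - share)) / n_hap,
            "aB": (fixed["aB"] + n_dh * (1 - share)) / n_hap,
        }
        change = max(abs(new[hap] - freqs[hap]) for hap in freqs)
        freqs = new
        if change < tol:
            break
    return freqs


unambiguous = em_two_locus({(2, 2): 5, (0, 0): 5})
with_double_hets = em_two_locus({(2, 2): 4, (0, 0): 4, (1, 1): 2})
```

With more than two loci the same idea applies, but the E-step must weight every haplotype pair compatible with each multi-loci genotype.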
  • Variances and confidence intervals are computed using the successive multi-loci data with a plurality of different methods, and are each converted into information of the two specific loci. Then, verification is performed by comparing the variances of the haplotype frequencies between the two specific loci computed using the different methods. Further, determination is performed as to whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed using predetermined ones of the different methods. Based on the results of the verification and the determination, the confidence intervals and the corresponding two-loci haplotype frequencies are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • the confidence intervals which are stored in the confidence interval estimation result storage unit 53 are compared with one another, and the confidence interval to be adopted is specified. Then, the two-loci haplotype frequencies stored as being associated with the adopted confidence interval are specified.
  • haplotype frequencies between two loci can be obtained using genotype data of multi-loci. This enables analysis to be performed effectively using experimental data.
  • the validity can be evaluated accurately by constructing confidence intervals using various methods.
  • the confidence intervals obtained using various numbers of loci including the two specific loci are compared with one another. As a result, the two-loci haplotype frequencies with higher accuracy are obtained.
  • variances and confidence intervals are computed using the observed information matrix, the empirical information matrix, the nonparametric bootstrap method, and the parametric bootstrap method.
  • the variance obtained using the observed information matrix and the variance obtained using the empirical information matrix are compared.
  • the variance obtained using the nonparametric bootstrap method and the variance obtained using the parametric bootstrap method are compared.
  • the variance obtained using the observed information matrix and the variance obtained using the nonparametric bootstrap method are compared. Then, verification is performed based on these comparison results.
  • variances and confidence intervals of haplotype frequencies are obtained using the observed information matrix, the empirical information matrix, the nonparametric bootstrap method, and the parametric bootstrap method.
  • the variance obtained using the observed information matrix and the variance obtained using the empirical information matrix are compared. This enables the law of large numbers to be checked.
  • the variance obtained using the nonparametric bootstrap method and the variance obtained using the parametric bootstrap method are compared. This enables the validity of the model to be verified.
  • the variance obtained using the observed information matrix and the variance obtained using the nonparametric bootstrap method are compared. This enables the asymptotic normality to be verified.
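The variance comparisons listed above can be illustrated with a deliberately simplified Python sketch. Instead of haplotype frequencies estimated by maximum likelihood, it uses a single allele proportion, for which the information-based variance is p(1 - p)/n. The sample, the number of bootstrap replicates, and all function names are assumptions made for illustration.

```python
import random


def information_variance(successes, n):
    """Asymptotic (information-based) variance of a proportion: p(1 - p) / n."""
    p = successes / n
    return p * (1.0 - p) / n


def sample_variance(values):
    """Unbiased sample variance of a list of estimates."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def nonparametric_bootstrap_variance(sample, n_boot, rng):
    """Variance of the proportion over resamples drawn with replacement."""
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    return sample_variance(estimates)


def parametric_bootstrap_variance(p_hat, n, n_boot, rng):
    """Variance of the proportion over samples simulated from the fitted model."""
    estimates = []
    for _ in range(n_boot):
        carriers = sum(1 for _ in range(n) if rng.random() < p_hat)
        estimates.append(carriers / n)
    return sample_variance(estimates)


rng = random.Random(0)
sample = [1] * 60 + [0] * 140      # 200 chromosomes, 60 carry the allele
p_hat = sum(sample) / len(sample)

var_info = information_variance(sum(sample), len(sample))
var_np = nonparametric_bootstrap_variance(sample, 1000, rng)
var_p = parametric_bootstrap_variance(p_hat, len(sample), 1000, rng)

ratio_model = var_np / var_p         # near 1 is consistent with a valid model
ratio_normality = var_info / var_np  # near 1 is consistent with asymptotic normality
```

In the embodiment such ratios would be assessed with the F-test or the allowable-range check rather than by inspection.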
  • a maximum likelihood estimate of each LD index is computed using the two-loci haplotype frequencies based on successive multi-loci data of each set of successive multi-loci. Variances and confidence intervals of the LD indices are computed. Then, determination is performed as to whether the maximum likelihood estimates of the LD indices fall within the confidence intervals of the LD indices. Based on the results of the verification and the results of the determination as to whether the maximum likelihood estimates fall within the confidence intervals of the LD indices, the confidence intervals of the LD indices and the corresponding LD indices are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • the confidence intervals of the LD indices which are stored in the confidence interval estimation result storage unit 53 are compared with one another, and the confidence interval of the LD indices to be adopted is specified. Then, the LD indices stored as being associated with the adopted confidence interval are specified.
  • LD indices between two loci can be obtained using multi-loci genotype data. This enables analysis to be performed effectively using experimental data. Further, for the results for which the validity of the maximum likelihood estimate is verified, the confidence intervals computed using various numbers of loci including the two specific loci are compared with one another, so that the LD indices with higher accuracy are obtained.
  • variances and confidence intervals of each LD index are computed separately using the BCa method according to the nonparametric bootstrap method and the BCa method according to the parametric bootstrap method.
  • variances and confidence intervals of the LD indices can be obtained separately using the BCa method according to the nonparametric bootstrap method and the BCa method according to the parametric bootstrap method.
  • the confidence intervals of the LD indices that are obtained separately can be used to evaluate the validity of the maximum likelihood estimate of each LD index.
  • the individual's multi-loci diplotype posterior probabilities are computed for each set of successive multi-loci using successive multi-loci data based on the results of maximum likelihood estimation of the haplotype frequencies obtained by the maximum likelihood estimation process, and are converted into the individual's diplotype posterior probabilities between the two specific loci.
  • Variances and confidence intervals of the multi-loci diplotype posterior probabilities are computed using the successive multi-loci data, and are converted into information of the two specific loci. Then, determination is performed as to whether the maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci.
  • the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and the corresponding individual's diplotype posterior probabilities between the two specific loci are stored into the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another.
  • the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit 53 are compared with one another, and the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci to be adopted are specified. Then, the individual's diplotype posterior probabilities between the two specific loci stored as being associated with the adopted confidence intervals are specified.
  • an individual's diplotype posterior probabilities between two loci can be obtained using multi-loci genotype data. This enables analysis to be performed effectively using experimental data. For the results for which the validity of the maximum likelihood estimate is verified, the confidence intervals obtained using various numbers of loci including the two specific loci are compared with one another, so that the diplotype posterior probabilities with higher accuracy are obtained.
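For two loci, an individual's diplotype posterior probabilities follow from the haplotype frequencies by Bayes' rule. The Python sketch below assumes Hardy-Weinberg proportions and a hypothetical genotype encoding (one pair of allele labels per locus); the function name and the example frequencies are illustrative only.

```python
def diplotype_posteriors(genotype, hap_freqs):
    """Posterior probability of each diplotype compatible with a genotype.

    genotype  -- ((a1, a2), (b1, b2)): unordered allele pairs at two loci
    hap_freqs -- dict mapping two-locus haplotypes such as "AB" to frequency
    """
    (a1, a2), (b1, b2) = genotype
    # The two possible phasings; they coincide if either locus is homozygous.
    candidates = {
        tuple(sorted([a1 + b1, a2 + b2])),
        tuple(sorted([a1 + b2, a2 + b1])),
    }
    weights = {}
    for h1, h2 in candidates:
        weight = hap_freqs.get(h1, 0.0) * hap_freqs.get(h2, 0.0)
        if h1 != h2:
            weight *= 2.0          # two orderings of a heterozygous diplotype
        weights[(h1, h2)] = weight
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("genotype is incompatible with the haplotype frequencies")
    return {diplotype: w / total for diplotype, w in weights.items()}


freqs = {"AB": 0.4, "Ab": 0.1, "aB": 0.2, "ab": 0.3}
double_het = diplotype_posteriors((("A", "a"), ("B", "b")), freqs)
single_het = diplotype_posteriors((("A", "A"), ("B", "b")), freqs)
```

Only the double heterozygote has more than one compatible diplotype, so it is the only case in which the posterior is not concentrated on a single pair.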
  • variances and confidence intervals of the diplotype posterior probabilities are computed separately using the nonparametric bootstrap method and the parametric bootstrap method.
  • variances and confidence intervals of the diplotype posterior probabilities can be obtained separately using the nonparametric bootstrap method and the parametric bootstrap method.
  • the confidence intervals of the diplotype posterior probabilities obtained separately can be used to evaluate the validity of the maximum likelihood estimates of the diplotype posterior probabilities.
  • the observed information matrix, the empirical information matrix, the nonparametric bootstrap method, and the parametric bootstrap method are used to compute variances and confidence intervals, and the obtained variances are compared with one another to verify the estimation confidence.
  • the methods for obtaining the variances and the confidence intervals should not be limited to those methods but may be other methods that can evaluate the validity of the model or the sample amount.
  • the methods other than the above listed methods may be used to compute variances and confidence intervals, and the computed variances and confidence intervals may be used to evaluate the validity of the model or the sample amount.
  • the BCa method in accordance with the nonparametric bootstrap method and the BCa method in accordance with the parametric bootstrap method are used to compute variances and confidence intervals of the LD indices.
  • the methods for computing variances and confidence intervals for evaluating the LD indices should not be limited to those methods. Methods other than the above listed methods may be used to compute variances and confidence intervals and the computed variances and confidence intervals may be used in evaluation.
  • the different methods are used to obtain variances and confidence intervals of haplotype frequencies between two specific loci.
  • the variances of the haplotype frequencies between the two specific loci are compared with one another to perform verification, and determination is performed as to whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals.
  • the two-loci haplotype frequencies to be adopted are specified from those adopted based on the results of the verification and the determination based on the comparison among the confidence intervals. Only one of the verification performed by comparing variances and the determination as to whether the maximum likelihood estimates of the two specific loci haplotype frequencies fall within the confidence intervals may be performed, or none of the verification and the determination may be performed. In this case, verification is performed using another method, and the two-loci haplotype frequencies are estimated based on the result of verification performed using the other method.
  • the different methods are used to obtain variances of haplotype frequencies between two specific loci.
  • the variances are compared with one another to perform verification, confidence intervals of the LD indices are obtained, and determination is performed as to whether the maximum likelihood estimate of each LD index falls within the confidence intervals.
  • the LD indices to be adopted are specified from those adopted based on the results of the verification and the determination based on the comparison among the confidence intervals. Only one of the verification performed by comparing variances and the determination as to whether the maximum likelihood estimates of the LD indices fall within the confidence intervals may be performed, or none of the verification and the determination may be performed. In this case, verification is performed using another method, and the LD indices are estimated based on the result of verification performed using the other method.
  • the different methods are used to obtain variances of haplotype frequencies between two specific loci.
  • the variances are compared with one another to perform verification.
  • confidence intervals of the two specific loci diplotype posterior probabilities are obtained, and determination is performed as to whether the maximum likelihood estimates of the two specific loci diplotype posterior probabilities fall within the confidence intervals.
  • the two specific loci diplotype posterior probabilities to be adopted are specified from those adopted based on the results of the verification and the determination based on the comparison among the confidence intervals. Only one of the verification performed by comparing variances and the determination as to whether the maximum likelihood estimates of the two specific loci diplotype posterior probabilities fall within the confidence intervals may be performed, or none of the verification and the determination may be performed. In this case, verification is performed using another method, and the two specific loci diplotype posterior probabilities are estimated based on the result of verification performed using the other method.
  • when the user specifies two subject loci via input in "[1] User Setting" in the above embodiment, the user may specify either two successive loci or two loci that are not successive as the two subject loci.
  • In "[2] Input of Collection of Multi-Loci Genotype Data of Individuals" in the above embodiment, a collection of multi-loci genotype data of individuals for sets of multi-loci including the specified two subject loci is input.
  • genotype data of each possible set of multi-loci including the two subject loci is generated.
  • locus data for each of multi-loci in the same set may be generated in the order of positions of the loci in the set.
  • each of the multi-loci for which such locus data is generated may be successive or may not be successive (may be intermittent) in the original set of the multi-loci.
  • Fig. 5 shows examples of sets of multi-loci, each of which includes two subject loci that are not successive.
  • Two loci that are not successive are the subject loci, and data of possible sets of multi-loci including the two loci (multi-loci data described in the claims) is generated.
  • Symbol (a) shows a haplotype that is formed by two loci j and k.
  • Symbol (b) shows a haplotype that is formed by three loci j, j + 1, and k.
  • Symbol (c) shows a haplotype that is formed by four loci j, j + 1, k, and k + 1.
  • Symbol (d) shows a haplotype that is formed by four loci j - 1, j, k - 1, and k.
  • Symbol (e) shows a haplotype that is formed by five loci j - 1, j, j + 1, k, and k + 1.
  • Symbol (f) shows a haplotype that is formed by six loci j - 1, j, j + 1, k - 1, k, and k + 1.
  • Symbol (g) shows a haplotype that is formed by six loci j, j + 1, j + 2, k - 1, k, and k + 1.
  • the same processing as in the above embodiment is performed using the "genotype data of possible sets of multi-loci including the two subject loci" generated in this way. More specifically, the "genotype data of possible sets of multi-loci including the two subject loci" generated in this way corresponds to the "successive multi-loci data" in the above embodiment.
  • multi-loci haplotype frequencies, two-loci haplotype frequencies, two-loci LD indices, individual's multi-loci diplotype posterior probabilities, individual's two-loci diplotype posterior probabilities, and their variances and confidence intervals are computed for the "possible sets of multi-loci including the two subject loci" and the "two subject loci" in the same manner as in the above embodiment.
  • the evaluation of the confidence intervals and the comparison of the confidence interval estimation results are performed in the same manner as in the above embodiment.
  • the result of the statistical genetics analysis is output.
  • the user is enabled to obtain the result of the statistical genetics analysis by specifying the two subject loci not only when the two subject loci are successive but also when the two subject loci are not successive.
  • the above-described processing is performed by the LD index computation device 20. Instead of this, the same processing may be performed in a distributed environment.
  • simulation data was generated and analyzed.
  • the number of loci was 12
  • the number of haplotypes was 7, and the number of alleles at each locus was 2, and the parameters were generated using the haplotype frequency greater than or equal to 1% and the allele frequency greater than or equal to 5%.
  • data of each of 100 individuals was given missing portions with missing rates of 0%, 5%, and 10%. The data for each of the 100 individuals was analyzed, and the analysis results were compiled.
  • Fig. 6 shows the results of analysis performed using the generated simulation data with the maximum haplotype length being set at 2 (two loci) and 6 (six loci).
  • Fig. 6 shows the results of estimation for which the validity of the haplotype frequency estimation was not checked.
  • Fig. 6 (a) shows the estimation accuracy comparison of the two-loci haplotype frequencies.
  • Fig. 6(b) shows the estimation accuracy comparison of the two-loci diplotypes.
  • Fig. 6(c) shows the estimation accuracy comparison of ρ² (r²).
  • Fig. 6(d) shows the estimation accuracy comparison of D'.
  • the results are categorized into three groups according to the difference between the true frequency generated by simulation and the frequency estimated using the statistical genetics analysis system of the present invention: the difference is 0.005 or less, the difference is greater than 0.005 and 0.015 or less, or the difference is greater than 0.015.
  • each index is compared with its true value, and the results are categorized into three groups: the index differs from the true value by 0.005 or less, by more than 0.005 and 0.015 or less, or by more than 0.015.

Abstract

An LD index computation device generates genotype data of possible multiple loci including two specific loci with a collection of multi-loci genotype data of individuals. The LD index computation device computes, for each possible multiple loci including the two specific loci, maximum likelihood estimates of haplotype frequencies and converts the computed maximum likelihood estimates into haplotype frequencies between two loci. The LD index computation device computes variances and confidence intervals through a plurality of different methods and converts the computed variances and the computed confidence intervals into information of two loci. The LD index computation device compares variances relating to two loci computed through the different methods, determines whether the maximum likelihood estimates fall within the confidence intervals, and stores confidence intervals and corresponding two-loci haplotype frequencies in a confidence interval estimation result storage unit. The LD index computation device compares confidence intervals stored in the confidence interval estimation result storage unit for each possible multiple loci including the two specific loci to specify a confidence interval that is to be adopted and corresponding two-loci haplotype frequencies. As a result, even when the amount of data is small or the data includes missing data, the statistical genetics analysis is performed with relatively high accuracy.

Description

DESCRIPTION
STATISTICAL GENETICS ANALYSIS SYSTEM, STATISTICAL GENETICS ANALYSIS METHOD, AND STATISTICAL GENETICS ANALYSIS PROGRAM
TECHNICAL FIELD
The present invention relates to a statistical genetics analysis system, a statistical genetics analysis method, and a statistical genetics analysis program for estimating haplotype frequencies between two loci and computing linkage disequilibrium (LD) indices between two loci.
BACKGROUND ART
A conventional procedure for computing LD indices typically includes (1) estimation of haplotype frequencies between two loci, (2) computation of LD indices such as D' and ρ², and (3) estimation of confidence intervals using a bootstrap or a likelihood. Various algorithms have been proposed for estimating haplotype frequencies between two loci. For example, Japanese Laid-Open Patent Publication No. 2004-192018 describes a method for estimating haplotype frequencies for a group in accordance with an expectation maximization algorithm using as input values genotype pool information accumulating genotype information relating to a plurality of subjects included in the group.
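Step (2) of the conventional procedure can be illustrated with the standard definitions of the LD indices for two biallelic loci. The Python sketch below uses the usual formulas for D, D' and r² (the squared correlation index); the function name and the haplotype labels "AB", "Ab", "aB", "ab" are assumptions made for illustration, and the absolute value of D' is returned.

```python
def ld_indices(freqs):
    """Compute D, |D'| and r^2 from two-locus haplotype frequencies.

    freqs maps the haplotypes "AB", "Ab", "aB", "ab" to frequencies
    that sum to 1.
    """
    p_a = freqs["AB"] + freqs["Ab"]      # frequency of allele A at locus 1
    p_b = freqs["AB"] + freqs["aB"]      # frequency of allele B at locus 2
    d = freqs["AB"] - p_a * p_b          # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


complete_ld = ld_indices({"AB": 0.5, "Ab": 0.0, "aB": 0.0, "ab": 0.5})
equilibrium = ld_indices({"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25})
```

The first example represents complete disequilibrium (both indices equal 1) and the second represents linkage equilibrium (both indices equal 0).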
However, conventionally proposed methods for estimating the haplotype frequencies between two loci all use data for only two loci as experimental data. When data for two loci is used to estimate the haplotype frequencies, the estimation result has relatively high accuracy only when the amount of data is sufficiently large. The estimation result also fails to have relatively high accuracy when the data includes missing data. During actual collection of data, however, the amount of data obtained is often insufficient and the obtained data may include missing data.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a statistical genetics analysis system, a statistical genetics analysis method, and a statistical genetics analysis program for performing a statistical genetics analysis with relatively high accuracy even when the amount of data is small or the data includes missing data.
A first aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals. The statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies. The computer also functions as a means for estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
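The generation of the possible multiple loci and the conversion of multi-loci haplotype frequencies into two-loci haplotype frequencies may be pictured with the following minimal sketch. It assumes that loci are identified by integer indices, that the pair of specific loci alone counts as one of the possible sets, and that haplotype frequencies are held in a dictionary keyed by allele tuples. The names locus_sets, marginalize, and max_size are hypothetical and do not appear in the specification.

```python
from itertools import combinations


def locus_sets(n_loci, i, j, max_size=None):
    """Yield every sorted tuple of locus indices that contains both i and j."""
    others = [k for k in range(n_loci) if k not in (i, j)]
    limit = len(others) if max_size is None else max_size - 2
    for extra in range(limit + 1):
        for combo in combinations(others, extra):
            yield tuple(sorted((i, j) + combo))


def marginalize(hap_freq, loci, i, j):
    """Sum multi-locus haplotype frequencies down to loci i and j.

    hap_freq: dict mapping an allele tuple (one allele per entry of
    `loci`, in the same order) to its estimated frequency.
    """
    pi, pj = loci.index(i), loci.index(j)
    out = {}
    for hap, f in hap_freq.items():
        key = (hap[pi], hap[pj])
        out[key] = out.get(key, 0.0) + f
    return out
```

For four loci and the specific loci 0 and 2, locus_sets(4, 0, 2) yields (0, 2), (0, 1, 2), (0, 2, 3), and (0, 1, 2, 3); each set would be analyzed separately and its result converted by marginalize.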
A second aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals. The statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for determining whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed through a predetermined one of the plurality of different methods, and a process for storing in a confidence interval estimation result storage unit the confidence intervals of the haplotype frequencies between the two specific loci and corresponding two-loci haplotype frequencies in an associated manner based on the comparison process and the determination process. 
The computer also functions as a means for comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
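The comparison, determination, and adoption processes of the second aspect may be sketched as follows. This passage does not state the criteria, so the sketch assumes, purely for illustration, that two variance estimates agree when their difference is within a relative tolerance and that the narrowest stored confidence interval is the one adopted. The names keep_candidate, adopt, and rel_tol, and the dictionary keys, are hypothetical.

```python
def keep_candidate(mle, var_a, var_b, ci, rel_tol=0.5):
    """Decide whether a result for one locus set should be stored.

    Assumed rule: the variances from two methods must agree within
    rel_tol, and the maximum likelihood estimate must lie inside the
    confidence interval ci = (lower, upper).
    """
    lower, upper = ci
    agree = abs(var_a - var_b) <= rel_tol * max(var_a, var_b)
    return agree and lower <= mle <= upper


def adopt(candidates):
    """Pick the stored candidate with the narrowest confidence interval.

    candidates: list of dicts with keys 'loci', 'freq', and 'ci'.
    Narrowest-interval selection is an assumption of this sketch.
    """
    return min(candidates, key=lambda c: c["ci"][1] - c["ci"][0])
```

The two-loci haplotype frequency stored with the adopted interval is then read from the returned entry, for example adopt(stored)["freq"].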
A third aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals. The statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices. The computer further functions as a means for estimating linkage disequilibrium indices based on the maximum likelihood estimates of the linkage disequilibrium indices stored for each of the possible multiple loci including the two specific loci.
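Computing linkage disequilibrium indices from converted two-loci haplotype frequencies can be illustrated with the standard plug-in formulas. The sketch below assumes two biallelic loci with haplotypes 00, 01, 10, and 11, and returns D, D', and a squared-correlation index (named r2 here, which is an assumption about notation). The function name ld_indices is hypothetical.

```python
def ld_indices(f00, f01, f10, f11):
    """Return (D, D', r2) from two-locus haplotype frequencies."""
    p_a = f10 + f11  # frequency of allele 1 at the first locus
    p_b = f01 + f11  # frequency of allele 1 at the second locus
    d = f11 - p_a * p_b
    # D' scales D by its largest attainable magnitude given the allele
    # frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

Because the indices are functions of the haplotype frequencies, substituting maximum likelihood estimates of the frequencies gives maximum likelihood estimates of the indices.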
A fourth aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals. The statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for computing variances and confidence intervals of linkage disequilibrium indices, a process for determining whether maximum likelihood estimates of the linkage disequilibrium indices fall within the confidence intervals of the linkage disequilibrium indices, a process for storing confidence intervals of the linkage disequilibrium indices and corresponding linkage disequilibrium indices into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process. 
The computer further functions as a means for comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
A fifth aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals. The statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci. The computer further functions as a means for estimating individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
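The computation of an individual's diplotype posterior probabilities and their conversion to the two specific loci may be sketched as follows. The sketch assumes biallelic loci, an unphased genotype given as counts of allele 1 per locus with no missing values, and Hardy-Weinberg proportions, so that each haplotype pair is weighted by the product of its haplotype frequencies. The names diplotype_posteriors and to_two_loci are hypothetical.

```python
def diplotype_posteriors(genotype, hap_freq):
    """Posterior probability of each haplotype pair given one genotype.

    genotype: tuple of allele-1 counts (0, 1, or 2), one per locus.
    hap_freq: dict mapping allele tuples to haplotype frequencies.
    Returns a dict keyed by (h1, h2) with h1 <= h2.
    """
    haps = sorted(hap_freq)
    weights = {}
    for a, h1 in enumerate(haps):
        for h2 in haps[a:]:
            # Keep only pairs whose alleles add up to the genotype.
            if all(x + y == g for x, y, g in zip(h1, h2, genotype)):
                w = hap_freq[h1] * hap_freq[h2]
                if h1 != h2:
                    w *= 2  # two orderings of a heterozygous pair
                weights[(h1, h2)] = w
    total = sum(weights.values())
    if total <= 0:
        return {}
    return {d: w / total for d, w in weights.items()}


def to_two_loci(posteriors, pi, pj):
    """Collapse multi-locus diplotype posteriors onto positions pi and pj."""
    out = {}
    for (h1, h2), p in posteriors.items():
        key = tuple(sorted(((h1[pi], h1[pj]), (h2[pi], h2[pj]))))
        out[key] = out.get(key, 0.0) + p
    return out
```

A third locus can resolve phase that is ambiguous from the two specific loci alone, which is one way the multi-loci data may sharpen the two-loci posterior probabilities.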
A sixth aspect of the present invention is a statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals. The statistical genetics analysis system includes a computer functioning as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed with the plurality of different methods, a process for computing variances and confidence intervals of diplotype posterior probabilities of the multiple loci including the two specific loci using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for determining whether maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci, and a process for storing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and corresponding individual's diplotype posterior probabilities between the two specific loci into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process. The computer further functions as a means for comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
A seventh aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies. The computer further executes the step of estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
An eighth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for determining whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed through a predetermined one of the plurality of different methods, and a process for storing in a confidence interval estimation result storage unit the confidence intervals of the haplotype frequencies between the two specific loci and corresponding two-loci haplotype frequencies in an associated manner based on the comparison process and the determination process. 
The computer further executes the step of comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
A ninth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices. The computer further executes the step of estimating linkage disequilibrium indices based on the maximum likelihood estimates of the linkage disequilibrium indices stored for each of the possible multiple loci including the two specific loci.
A tenth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for computing variances and confidence intervals of linkage disequilibrium indices, a process for determining whether maximum likelihood estimates of the linkage disequilibrium indices fall within the confidence intervals of the linkage disequilibrium indices, a process for storing confidence intervals of the linkage disequilibrium indices and corresponding linkage disequilibrium indices into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process. 
The computer further executes the step of comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
An eleventh aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci. The computer further executes the step of estimating individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
A twelfth aspect of the present invention is a method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The computer executes the steps of generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed with the plurality of different methods, a process for computing variances and confidence intervals of diplotype posterior probabilities of the multiple loci including the two specific loci using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for determining whether maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci, and a process for storing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and corresponding individual's diplotype posterior probabilities between the two specific loci into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process. The computer further executes the step of comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
A thirteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies. The program further causes the computer to function as a means for estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
A fourteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for determining whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed through a predetermined one of the plurality of different methods, and a process for storing in a confidence interval estimation result storage unit the confidence intervals of the haplotype frequencies between the two specific loci and corresponding two-loci haplotype frequencies in an associated manner based on the comparison process and the determination process. 
The program further causes the computer to function as a means for comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
A fifteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices. The program further causes the computer to function as a means for estimating linkage disequilibrium indices based on the maximum likelihood estimates of the linkage disequilibrium indices stored for each of the possible multiple loci including the two specific loci.
A sixteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for computing variances and confidence intervals of linkage disequilibrium indices, a process for determining whether maximum likelihood estimates of the linkage disequilibrium indices fall within the confidence intervals of the linkage disequilibrium indices, a process for storing confidence intervals of the linkage disequilibrium indices and corresponding linkage disequilibrium indices into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process. 
The program further causes the computer to function as a means for comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
A seventeenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci. The program further causes the computer to function as a means for estimating individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
An eighteenth aspect of the present invention is a statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer. The program causes the computer to function as a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals, and a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed with the plurality of different methods, a process for computing variances and confidence intervals of diplotype posterior probabilities of the multiple loci including the two specific loci using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for determining whether maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci, and a process for storing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and corresponding individual's diplotype posterior probabilities between the two specific loci into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process. The program further causes the computer to function as a means for comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
Other aspects and advantages of the present invention will become apparent from the following description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention, together with objects and advantages thereof, may best be understood by reference to the following description of the presently preferred embodiments together with the accompanying drawings in which:
Fig. 1 is a schematic diagram of a system according to a preferred embodiment of the present invention;
Figs. 2 and 3 are explanatory diagrams showing the processing procedure of the preferred embodiment;
Figs. 4 and 5 are explanatory diagrams showing haplotypes including two subject loci; and
Figs. 6a through 6d are charts showing results of analysis using simulation data.
BEST MODE FOR CARRYING OUT THE INVENTION
A preferred embodiment of the present invention will now be described with reference to Figs. 1 to 4. The preferred embodiment will be described as a statistical genetics analysis system, a statistical genetics analysis method, and a statistical genetics analysis program for obtaining haplotype frequencies between two specific loci (two-loci haplotype frequencies), diplotype posterior probabilities of each individual (individual's diplotype posterior probabilities), and linkage disequilibrium (LD) indices based on genotype data of multi-loci.
[Overview of Preferred Embodiment]
First, an overview of the processing performed in the preferred embodiment will be discussed.
In the preferred embodiment, haplotype frequencies of multi-loci including two subject loci are computed, and haplotype frequencies between the two subject loci are computed using the multi-loci haplotype frequencies. In this way, information of multi-loci is used to evaluate LD between two loci.
However, the use of multi-loci increases the amount of information and the number of parameters, and the estimation accuracy may rather be lowered when the haplotype is too long. A haplotype with an optimum length is believed to exist for each different amount of data. Thus, when haplotype frequencies of multi-loci are used to compute haplotype frequencies between two loci, a set of multi-loci having appropriate lengths or located at appropriate positions must be selected. In the preferred embodiment, to select the multi-loci used to compute haplotype frequencies between two loci, maximum likelihood estimates are evaluated to check their validity, and an estimate with the highest accuracy is selected from values for which validity has been verified.
The evaluation of the validity of maximum likelihood estimates is performed by constructing confidence intervals with various methods and thereby evaluating the validity of the model or its sample amount. When the number of samples is small or when the validity of the model fails to be verified, such facts can be detected as distortions in the confidence intervals. In the preferred embodiment, confidence intervals are constructed using an observed information matrix, an empirical information matrix, a nonparametric bootstrap, and a parametric bootstrap. Further, variances obtained using these methods are compared with one another to evaluate the validity of the model or its sample amount.
The selection of the estimate with the highest accuracy is performed by comparing the confidence intervals estimated using sets of loci of various numbers each including two subject loci, and adopting the shortest confidence interval. As a result, the estimation result with the highest accuracy is adopted.
Diplotype posterior probabilities of each individual and LD indices are also estimated in the same manner as described above using sets of multi-loci each including two subject loci.
A "haplotype" is a combination of alleles at a plurality of loci. A "haplotype frequency" is a frequency at which a predetermined allele combination at a plurality of loci (haplotype) appears. In this specification, a combination of haplotypes of an individual is referred to as a "diplotype". A haplotype cannot be directly obtained through normal genotyping and is thus obtained through estimation.
[Structure of LD Index Computation Device 20]
Next, the structure of an LD index computation device 20 used in the preferred embodiment will be described.
As shown in Fig. 1, the LD index computation device 20 includes a control unit 30. The control unit 30 includes a control unit (CPU), a storage unit (RAM, ROM, etc.), and a communication unit, which are not shown in Fig. 1. The control unit 30 performs processing that will be described later. The LD index computation device 20 has the function of computing haplotype frequencies between two loci using multi-loci, individual's diplotype posterior probabilities, and LD indices. This control unit 30 includes a successive multi-loci data generation unit 31, a confidence interval estimation unit 32, and a confidence interval estimation result comparison unit 33.
The successive multi-loci data generation unit 31 generates data of possible sets of successive multi-loci each including two subject loci (successive multi-loci data) based on multi-loci genotype data of individuals. The successive multi-loci data generation unit 31 functions as a multi-loci data generation means as recited in the claims. Further, the successive multi-loci data corresponds to multi-loci data as recited in the claims.
The confidence interval estimation unit 32 estimates confidence intervals using a collection of successive multi-loci data. The confidence interval estimation unit 32 functions as a confidence interval estimation means as recited in the claims. The confidence interval estimation unit 32 includes a maximum likelihood estimation unit 41, an observed information matrix processing unit 42, an empirical information matrix processing unit 43, a nonparametric bootstrap processing unit 44, a parametric bootstrap processing unit 45, and a confidence verification unit 46.
The maximum likelihood estimation unit 41 performs maximum likelihood estimation of haplotype frequencies. The maximum likelihood estimation unit 41 converts multi-loci haplotype frequencies into two-loci haplotype frequencies using the estimation result, computes LD indices, and computes individual's diplotype posterior probabilities.
The observed information matrix processing unit 42 obtains a variance through an observed information matrix using a collection of successive multi-loci data. Further, the observed information matrix processing unit 42 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies.
The empirical information matrix processing unit 43 obtains a variance through an empirical information matrix using a collection of successive multi-loci data. Further, the empirical information matrix processing unit 43 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies. The nonparametric bootstrap processing unit 44 obtains a variance through a nonparametric bootstrap method using a collection of successive multi-loci data, and computes confidence intervals of haplotype frequencies, two-loci haplotype frequencies, individual's diplotype posterior probabilities, and LD indices.
The parametric bootstrap processing unit 45 obtains the variance through a parametric bootstrap method using a collection of successive multi-loci data. Further, the parametric bootstrap processing unit 45 computes confidence intervals of haplotype frequencies, two-loci haplotype frequencies, individual's diplotype posterior probabilities, and LD indices.
The confidence verification unit 46 performs confidence verification based on comparison among the variances obtained using the methods described above and based on adjustments of the confidence intervals obtained using the methods described above. Further, the confidence verification unit 46 determines whether maximum likelihood estimates fall within the confidence intervals. Next, based on these results, the confidence verification unit 46 determines whether the estimation results are to be adopted. The confidence interval estimation results are stored in a confidence interval estimation result storage unit 53 only when they are determined to be adopted.
The confidence interval estimation result comparison unit 33 compares the confidence interval estimation results for each set of successive multi-loci, and specifies and outputs two-loci haplotype frequencies, individual's diplotype posterior probabilities, and the LD indices associated with the shortest confidence interval.
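The comparison performed by unit 33 can be illustrated with a minimal sketch. The record type `IntervalResult`, its fields, and the function `select_shortest` are hypothetical names introduced only for this illustration; the sketch assumes that each stored result carries one confidence interval and a flag telling whether the verification step adopted it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IntervalResult:
    """One stored estimation result (hypothetical layout, for illustration only)."""
    loci: Tuple[int, ...]   # set of successive loci that produced this result
    estimate: float         # e.g. a two-loci haplotype frequency
    lower: float            # lower confidence limit
    upper: float            # upper confidence limit
    adopted: bool = True    # False if the confidence verification rejected it


def select_shortest(results: List[IntervalResult]) -> Optional[IntervalResult]:
    """Return the adopted result whose confidence interval is shortest."""
    candidates = [r for r in results if r.adopted]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.upper - r.lower)
```

In this sketch a rejected result never wins, even when its interval is the narrowest, which mirrors the rule that only adopted results are stored and compared.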
Further, the control unit 30 is connected to a storage unit 50, which is formed by a RAM, a ROM, a hard disk etc. The storage unit 50 functions as an individual's multi-loci genotype data storage unit 51, a successive multi-loci data storage unit 52, and a confidence interval estimation result storage unit 53 serving as a confidence interval estimation result storage unit, etc.
The individual's multi-loci genotype data storage unit 51 stores genotype data of multi-loci of a plurality of individuals (individual's multi-loci genotype data). In the preferred embodiment, the multi-loci genotype data of each individual is input and stored via an input unit 61.
The successive multi-loci data storage unit 52 stores genotype data of each possible set of successive multi-loci including two subject loci (successive multi-loci data), which is generated based on the multi-loci genotype data of each individual. The successive multi-loci data is stored when the data is generated by the successive multi-loci data generation unit 31 in accordance with processing that will be described later.
The confidence interval estimation result storage unit 53 stores data relating to each confidence interval computed by the confidence interval estimation unit 32. More specifically, for a two-loci haplotype frequency, data for specifying a set of multi-loci for which a confidence interval of the two-loci haplotype frequency is obtained, the confidence interval of the two-loci haplotype frequency, and data relating to the two-loci haplotype frequency are stored in a manner that these data elements are associated with one another. For each LD index, data for specifying a set of multi-loci for which a confidence interval of the LD index is obtained, the confidence interval of the LD index, and data relating to the LD index are stored in a manner that these data elements are associated with one another. For an individual's two-loci diplotype posterior probability, data for specifying a set of multi-loci for which a confidence interval of the individual's two-loci diplotype posterior probability is obtained, the confidence interval of the individual's two-loci diplotype posterior probability, and data relating to the individual's two-loci diplotype posterior probability are stored in a manner that these data elements are associated with one another. For a multi-loci haplotype frequency, data for specifying a set of multi-loci, the confidence interval of the multi-loci haplotype frequency, and data relating to the multi-loci haplotype frequency are stored in a manner that these data elements are associated with one another. For an individual's multi-loci diplotype posterior probability, data for specifying a set of multi-loci, the confidence interval of the individual's multi-loci diplotype posterior probability, and data relating to the individual's multi-loci diplotype posterior probability are stored in a manner that these data elements are associated with one another.
When verification conditions for estimation confidence described later are not satisfied, a flag indicating this (verification error flag) is added to the data. When a maximum likelihood estimate fails to fall within each confidence interval obtained, a flag indicating this (confidence interval error flag) is added to the data.
Further, the control unit 30 is connected to the input unit 61, which is formed by an input unit for external data, such as a keyboard and a mouse, and is connected to an output unit 62, such as a display device.
The input unit 61 is used when, for example, multi-loci genotype data of an individual is input, or when an allowable range of two edge values of a confidence interval is set. The output unit 62 outputs two-loci haplotype frequencies, individual's diplotype posterior probabilities, and the LD indices associated with the shortest confidence interval.
[Processing Procedures]
Next, the processing procedures for computing two-loci haplotype frequencies, individual's diplotype posterior probabilities, and LD indices using multi-loci, will be described with reference to Figs. 2 and 3.
[1] User Setting
First, the user sets two subject loci using the input unit 61. The control unit 30 of the LD index computation device 20 stores data relating to the two set subject loci into an internal storage unit of the control unit 30, which is not shown. The user also sets an allowable range of two edge values of a confidence interval using the input unit 61. The control unit 30 of the LD index computation device 20 stores data relating to the set allowable range of the two edge values of the confidence interval into the storage unit of the control unit 30 (not shown).
[2] Input of Collection of Multi-Loci Genotype Data of Individuals
In the preferred embodiment, a collection of multi-loci genotype data of individuals is input into the LD index computation device 20 using the input unit 61. The LD index computation device 20 stores the multi-loci genotype data of each individual into the individual's multi-loci genotype data storage unit 51. Next, when a request to start an LD index computation process is input, the control unit 30 of the LD index computation device 20 reads the collection of the multi-loci genotype data of individuals from the genotype data storage unit 51 as shown in Fig. 2.
[3] Generation of Data of Possible Sets of Successive Loci Including Two Subject Loci
Next, the control unit 30 generates genotype data of possible sets of successive multi-loci including the two subject loci (successive multi-loci data) (step S1-1).
In the preferred embodiment, haplotype frequencies between two specific loci and the LD indices are evaluated. In the evaluation, haplotype frequencies of all sets of loci including these two loci are used to estimate confidence intervals, and the estimate associated with the shortest of the estimated confidence intervals is adopted. In reality, for SNPs (single nucleotide polymorphisms), the degree of freedom of a haplotype is increased twofold each time a locus is added. Accordingly, the estimation accuracy of an extremely long haplotype is low. To prevent the estimation accuracy from being lowered, for example, a maximum value may be set for the length of a haplotype, and all haplotype frequencies that fall within this range defined using the set maximum value may be evaluated.
Fig. 4 shows examples of sets of multi-loci each of which includes two subject loci. Although a case in which the two subject loci are successive is described here, the two subject loci do not necessarily have to be successive. A case in which the two subject loci are not successive will be described later.
Two loci, i − 1 and i, are the subject loci, and data of each of the possible sets of successive loci including the two loci (successive multi-loci data) is generated. More specifically, symbol (a) shows a haplotype that is formed by two loci i − 1 and i. Symbol (b) shows a haplotype that is formed by three loci i − 2, i − 1, and i. Symbol (c) shows a haplotype that is formed by three loci i − 1, i, and i + 1. Symbol (d) shows a haplotype that is formed by four loci i − 3, i − 2, i − 1, and i. Symbol (e) shows a haplotype that is formed by four loci i − 2, i − 1, i, and i + 1. Symbol (f) shows a haplotype that is formed by four loci i − 1, i, i + 1, and i + 2. Symbol (g) shows a haplotype that is formed by five loci i − 3, i − 2, i − 1, i, and i + 1.
In step S1-1, the control unit 30 generates successive multi-loci data for one of the possible sets of multi-loci each including the two subject loci based on multi-loci genotype data of each individual. The generated successive multi-loci data is stored in the successive multi-loci data storage unit 52.
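The enumeration of successive locus sets described above can be sketched as follows. This is only an illustration under stated assumptions: loci are numbered from 0, `max_len` is the optional maximum haplotype length mentioned earlier, and the function names `successive_locus_sets` and `extract` are hypothetical.

```python
from typing import List, Sequence, Tuple


def successive_locus_sets(i: int, j: int, n_loci: int, max_len: int) -> List[Tuple[int, ...]]:
    """Every run of consecutive locus indices that contains both subject loci
    i and j and has at most max_len loci (0-based indices, an assumption)."""
    lo, hi = min(i, j), max(i, j)
    sets = []
    for start in range(0, lo + 1):
        for end in range(hi, n_loci):
            if end - start + 1 <= max_len:
                sets.append(tuple(range(start, end + 1)))
    # Shorter sets first, then leftmost first, like (a), (b), (c), ... in Fig. 4.
    sets.sort(key=lambda s: (len(s), s[0]))
    return sets


def extract(genotypes: Sequence[Sequence[str]], locus_set: Sequence[int]) -> List[List[str]]:
    """Cut the columns of one locus set out of every individual's genotype row."""
    return [[row[l] for l in locus_set] for row in genotypes]
```

With subject loci 3 and 4 among 8 loci and a maximum length of 4, the sketch yields six sets, in the same order as haplotypes (a) to (f) of Fig. 4.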
[4] Estimation of Confidence Intervals
Next, the control unit 30 inputs the successive multi-loci data generated in the manner described above into a confidence interval estimation module, and estimates confidence intervals (step S1-2). This processing will be described below.
[4-1] Maximum Likelihood Estimation of Haplotype Frequencies
First, the control unit 30 performs maximum likelihood estimation of haplotype frequencies in the manner described below (step S2-1).
[4-1-1] Overview of Maximum Likelihood Estimation of Haplotype Frequencies
First, the overview of maximum likelihood estimation of haplotype frequencies will be described.
The genotype data of L loci given below will now be discussed.
Figure imgf000024_0001
Index $i$ denotes an identifier of the data. The frequency of allele $a$ at locus $l$ is expressed as $p^l_a$. A frequency of haplotype $q$ in a certain interval in the group is expressed as $h_q$. The frequency of a haplotype of loci from locus $l$ to locus $m$, $\{a_l, \dots, a_m\}$, is expressed as $h_q = h_{a_l, \dots, a_m}$. For simplification of the symbols, $h_q$ is hereafter used also to denote an identifier of a haplotype. Although information relating to an allele set also has index $q$, $h_q$ as a haplotype identifier clarifies that it is a haplotype, and would not cause any confusion with the notation of an allele set. A combination of haplotypes given below will now be discussed.
$v_p = \{h_q, h_r\}.$ (2)
In this case, based on the assumption of the Hardy-Weinberg equilibrium (HWE), an appearance probability of this set of haplotypes is given by
$P(v_p) = (2 - \delta_{q,r}) h_q h_r.$ (3)
Here, $\delta_{q,r}$ is the Kronecker $\delta$, and is 1 when $q = r$ and otherwise 0. An appearance probability of genotype data $V_i$ can be obtained by summing $P(v_p)$ over possible haplotype combinations $v_p$, which is given by
$P(V_i) = \sum_{v_p \in V_i} P(v_p).$ (4)
When the genotype data includes missing data, all possible polymorphisms for this locus are covered. For example, when locus 1 has a polymorphism of A and T, and locus 2 has a polymorphism of G and C, possible haplotypes are $h_{AG}$, $h_{AC}$, $h_{TG}$, and $h_{TC}$. When genotype data $V_1$ is $V_1 = \{A/T, G/C\}$, possible sets of haplotypes are expressed using $P(v_p)$ as $\{2h_{AG}h_{TC}, 2h_{AC}h_{TG}\}$. When genotype data $V_2$ including a missing allele X is $V_2 = \{A/A, G/X\}$, possible sets of haplotypes are expressed as
$\{h_{AG}h_{AG}, 2h_{AG}h_{AC}\}.$
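The enumeration of haplotype pairs that are compatible with an unphased genotype, including a missing allele, can be sketched as below. The data layout (one allele pair per locus, `"X"` for a missing allele, haplotypes written as strings) and the function names are assumptions made for this illustration, not the layout used by the device.

```python
from itertools import product

MISSING = "X"  # assumed marker for a missing allele


def compatible_diplotypes(genotype, alleles):
    """Enumerate unordered haplotype pairs {h_q, h_r} compatible with a genotype.

    genotype: list of (allele, allele) per locus; alleles: allowed alleles per locus.
    """
    per_locus = []
    for (a, b), allowed in zip(genotype, alleles):
        opts_a = list(allowed) if a == MISSING else [a]
        opts_b = list(allowed) if b == MISSING else [b]
        pairs = set()
        for x in opts_a:
            for y in opts_b:
                pairs.add((x, y))  # both orderings, since the phase is unknown
                pairs.add((y, x))
        per_locus.append(sorted(pairs))
    diplotypes = set()
    for combo in product(*per_locus):
        h1 = "".join(x for x, _ in combo)
        h2 = "".join(y for _, y in combo)
        diplotypes.add(tuple(sorted((h1, h2))))
    return sorted(diplotypes)


def diplotype_prob(dip, freq):
    """Appearance probability of one haplotype pair under HWE: (2 - delta) h_q h_r."""
    q, r = dip
    return (1.0 if q == r else 2.0) * freq.get(q, 0.0) * freq.get(r, 0.0)


def genotype_prob(genotype, alleles, freq):
    """Appearance probability of a genotype: sum over its compatible pairs."""
    return sum(diplotype_prob(d, freq) for d in compatible_diplotypes(genotype, alleles))
```

For the two genotypes of the example in the text, A/T with G/C and A/A with G/X, the sketch returns the pairs {AG, TC}, {AC, TG} and {AG, AG}, {AG, AC}, respectively.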
Appearance probability (likelihood) L of the entire data is given by
Figure imgf000025_0002
According to a maximum likelihood method, a value of {hq} that maximizes L (in reality, log L) is retrieved by performing a search under the constraint condition of
Figure imgf000025_0003
[4-1-2] EM Algorithm
As a method for obtaining the value of $\{h_q\}$ that maximizes eq.(5) relating to appearance probability (likelihood) $L$ of the entire data, an EM (expectation maximization) algorithm is used here. The EM algorithm iterates the operation of estimating complete-data from incomplete-data for maximizing the likelihood of the complete-data. The incomplete-data in the present analysis is genotype data $V_i$ of which phase is unknown. The complete-data is data of a certain set of haplotypes $v_p$. Thus, data with the same genotype is estimated to be data of the same set of haplotypes. The incomplete-data and the complete-data are summarized in the table below.
Figure imgf000026_0001
Here, K is the number of kinds of Vi, and P is the number of kinds of vp.
In this model, a log likelihood of incomplete-data is expressed as
$\ln L \equiv \sum_i \ln \sum_{v_p \in V_i} P(v_p).$ (6)
A log likelihood of complete-data is expressed as
$\ln L_c = \sum_p n_{v_p} \ln P(v_p) = \sum_q n_q \ln h_q + C.$ (7)
Here, $C$ is a number that is not dependent on $h_q$, and
$n_q = \sum_p \tau_q(v_p) n_{v_p}$ (8)
is satisfied. Here, $\tau_q(v_p)$ is the number of $h_q$ included in $v_p$.
An expectation of the complete-data from the incomplete-data is expressed as
$\langle n_{v_p} \rangle = \sum_i n_i P(v_p \mid V_i).$ (9)
Figure imgf000026_0003
An expectation of $n_q$ expressed below is computed in reality.
$\langle n_q \rangle = \sum_p \tau_q(v_p) \langle n_{v_p} \rangle.$ (10)
Further, this expectation of $n_q$ is substituted as $n_q$ in eq.(7) with respect to the complete data. As a result, a set of parameters $h_q$ for maximizing the likelihood is obtained. In this case, M step (maximization step) is extremely simplified, and the computation below is performed.
$h_q = \frac{\langle n_q \rangle}{\sum_{q'} \langle n_{q'} \rangle} = \frac{\langle n_q \rangle}{2N}.$ (11)
Here, $N = n_1 + \cdots + n_K$, and is the number of individuals for which data is used.
More specifically, in the preferred embodiment, the EM algorithm iterates E step (expectation step) performed using eq.(9) for obtaining an expectation of complete-data and the above M step (11).
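The alternation of the E step and the M step can be sketched as follows. This is a minimal illustration, not the implementation of the maximum likelihood estimation unit 41: it assumes that the compatible haplotype pairs of each distinct genotype have already been enumerated, that estimation starts from uniform frequencies, and that iteration stops on a small change in the frequencies. The function name, the input layout, and the stopping rule are all assumptions.

```python
def em_haplotype_frequencies(data, haplotypes, n_iter=200, tol=1e-12):
    """EM estimate of haplotype frequencies under HWE.

    data: list of (n_i, [(h_q, h_r), ...]) giving, for each distinct genotype,
          the number of individuals and the haplotype pairs compatible with it.
    haplotypes: list of all haplotype identifiers.
    """
    h = {q: 1.0 / len(haplotypes) for q in haplotypes}  # uniform start (assumption)
    two_n = 2.0 * sum(n_i for n_i, _ in data)
    for _ in range(n_iter):
        expected = {q: 0.0 for q in haplotypes}  # expected haplotype counts <n_q>
        for n_i, dips in data:
            # E step: posterior weight of each compatible pair given the genotype.
            probs = [(1.0 if q == r else 2.0) * h[q] * h[r] for q, r in dips]
            total = sum(probs)
            if total == 0.0:
                continue  # genotype impossible under current h; skipped in this sketch
            for (q, r), p in zip(dips, probs):
                w = n_i * p / total
                expected[q] += w
                expected[r] += w  # a homozygous pair counts its haplotype twice
        # M step: normalized expected counts.
        new_h = {q: expected[q] / two_n for q in haplotypes}
        delta = max(abs(new_h[q] - h[q]) for q in haplotypes)
        h = new_h
        if delta < tol:
            break
    return h
```

For example, with two individuals that are unambiguously AG/AG and one double heterozygote compatible with {AG, TC} or {AC, TG}, the iteration moves almost all weight of the ambiguous individual to {AG, TC}, because AG is already supported by the unambiguous individuals.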
[4-1-3] Computation of Individual's Diplotype Posterior Probabilities
Next, diplotype posterior probabilities of each individual are computed (step S2-2).
An estimate of an individual's diplotype posterior probability is obtained from a posterior probability expressed below in accordance with Bayes' theorem.
$P(v_p \mid V_i) = \frac{P(v_p)}{\sum_{v_{p'} \in V_i} P(v_{p'})}.$ (12)
Here, $V_i$ is a collection of genotype data of individual $i$, and $v_p$ is a certain diplotype. The summation over possible diplotypes for data $V_i$ is performed. An appearance probability of $v_p$ is expressed using the above eq.(3) relating to an appearance probability on the assumption of the HWE.
For diplotype posterior probabilities, diplotypes of two subject loci are estimated using multi-loci in the same manner as the estimation of haplotype frequencies. In particular, although diplotype posterior probabilities of two loci including missing data may be estimated with less variation, the use of multi-loci would improve the estimation accuracy of the diplotype posterior probabilities.
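A minimal sketch of the posterior computation and of its conversion to the two subject loci is given below. It assumes haplotypes are written as strings with one character per locus, so that a multi-loci diplotype can be projected onto loci `i` and `j` by picking characters; this representation and both function names are hypothetical.

```python
def diplotype_posteriors(dips, h):
    """Posterior probability of each compatible haplotype pair given the genotype,
    using the HWE appearance probability (2 - delta) h_q h_r as the weight."""
    weights = [(1.0 if q == r else 2.0) * h[q] * h[r] for q, r in dips]
    total = sum(weights)
    return {d: w / total for d, w in zip(dips, weights)}


def collapse_to_two_loci(posteriors, i, j):
    """Sum multi-loci diplotype posteriors that agree at subject loci i and j."""
    out = {}
    for (q, r), p in posteriors.items():
        key = tuple(sorted((q[i] + q[j], r[i] + r[j])))
        out[key] = out.get(key, 0.0) + p
    return out
```

Several multi-loci diplotypes usually collapse onto the same two-loci diplotype, so the two-loci posterior is a sum of multi-loci posteriors.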
[4-1-4] Conversion from Multi-Loci Result into Two-Loci Haplotype Frequencies
Further, the maximum likelihood estimation unit 41 of the control unit 30 converts estimation results of multi-loci haplotype frequencies into two-loci haplotype frequencies (step S2-3).
The multi-loci haplotype frequencies are converted into the two-loci haplotype frequency using the expression below.
$h_{a_i a_j} = \sum'_{a_1, \dots, a_L} h_{a_1 \cdots a_i \cdots a_j \cdots a_L}.$ (13)
A two-loci haplotype frequency is normally obtained by performing a summation over $L - 2$ loci that are loci excluding subject loci in the manner using the above eq.(13). Symbol $\sum'$ indicates that its summation is for all haplotypes of which alleles at subject loci $i$ and $j$ are $a_i$ and $a_j$.
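The summation over the non-subject loci can be sketched in a few lines. The sketch assumes haplotypes are stored as strings with one character per locus; that representation and the function name are assumptions for illustration.

```python
def two_locus_frequencies(h, i, j):
    """Marginalize multi-loci haplotype frequencies onto subject loci i and j
    by summing over every haplotype that carries the same alleles there."""
    out = {}
    for hap, f in h.items():
        key = hap[i] + hap[j]
        out[key] = out.get(key, 0.0) + f
    return out
```

Because every multi-loci haplotype contributes to exactly one two-loci haplotype, the converted frequencies still sum to the same total as the input.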
[4-1-5] Computation of LD Indices
Next, the LD indices between two loci are computed, using the two-loci haplotype frequencies that are estimated in the manner described above (step S2-4). In the present analysis, $\rho^2$ and $D'$ are used as LD indices.
[4-1-5-1] LD Indices
Although there are various LD indices including indices $\rho^2$ and $D'$, some LD indices do not have uniform definitions. Various indices used in the present analysis are defined below.
$D = h_{a_1 a_2} - p_{a_1} p_{a_2},$ (14)
$D' = D / D_{\max}, \quad D_{\max} = \begin{cases} \min(p_{a_1} p_{\bar{a}_2},\ p_{\bar{a}_1} p_{a_2}), & D > 0 \\ -\min(p_{a_1} p_{a_2},\ p_{\bar{a}_1} p_{\bar{a}_2}), & D < 0 \end{cases}$ (15)
Figure imgf000028_0001
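$\delta = \frac{D}{p_{a_2} h_{\bar{a}_1 \bar{a}_2}},$ (17)
$d = \frac{D}{p_{a_2} p_{\bar{a}_2}},$ (18)
$Q = \frac{D}{h_{a_1 a_2} h_{\bar{a}_1 \bar{a}_2} + h_{a_1 \bar{a}_2} h_{\bar{a}_1 a_2}}.$ (19)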
Here, $h_{a_1 a_2}$ denotes a frequency of a haplotype formed by allele $a_1$ at locus 1 and allele $a_2$ at locus 2, and $p_{a_1}$ denotes a frequency of allele $a_1$ at locus 1. Further, $\bar{a}_1$ denotes an allele other than allele $a_1$ at locus 1. Symbols $D'$, $\rho^2$, $\delta$, $d$, and $Q$ are all defined using $D$ in eq.(14). A value of $D$ is determined only by using a haplotype frequency between two loci.
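As an illustration of how such indices follow from the four two-loci haplotype frequencies of a pair of biallelic loci, the sketch below computes the coefficient D, a normalized D', and a squared correlation. Two points are assumptions rather than statements of the source: D' is normalized so that it is non-negative, and the squared correlation uses the common definition D² divided by the product of the four allele frequencies. The function name and argument order are hypothetical.

```python
def ld_indices(h11, h12, h21, h22):
    """Return (D, D', rho^2) from two-loci haplotype frequencies.

    h11: allele 1 at both loci; h12: allele 1 at locus 1 with the other allele
    at locus 2; h21 and h22 analogously. The frequencies should sum to 1.
    """
    p1 = h11 + h12          # frequency of allele 1 at locus 1
    p2 = h11 + h21          # frequency of allele 1 at locus 2
    d = h11 - p1 * p2
    if d > 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    elif d < 0:
        d_max = -min(p1 * p2, (1 - p1) * (1 - p2))  # negative, so D' >= 0
    else:
        d_max = 0.0
    d_prime = d / d_max if d_max != 0 else 0.0
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    rho2 = d * d / denom if denom > 0 else 0.0  # common r^2 definition (assumption)
    return d, d_prime, rho2
```

Two perfectly associated loci with haplotype frequencies 0.5, 0, 0, 0.5 give D' = 1 and a squared correlation of 1, whereas four equal frequencies of 0.25 give zero for all three values.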
[4-1-6] MLE Module
The processing in the above steps S2-1 to S2-4 is executed by an MLE (maximum likelihood estimation) module. This MLE module is also used in the nonparametric bootstrap and the parametric bootstrap, which will be described later.
[4-2] Estimation of Confidence Intervals
Next, an observed information matrix process, an empirical information matrix process, a nonparametric bootstrap process, and a parametric bootstrap process are performed. Each of these processes will now be described.
[4-2-1] Observed Information Matrix Process
[4-2-1-1] Computation of Observed Information Matrix
Next, the observed information matrix processing unit 42 of the control unit 30 computes an observed information matrix in the manner described below (step S2-5).
[4-2-1-1-1] Information Matrix
An information matrix will first be described.
The maximum likelihood estimation method is known to have asymptotic efficiency. The asymptotic efficiency means that an estimate approaches a true value with a minimum variance as the number of samples increases as in the expression below.
$\hat{\theta} \sim N(\theta_0, I_F^{-1}(\theta_0))$, as $N \to \infty$. (20)
Here, $I_F$ is a matrix called a Fisher information matrix, and is defined as an expectation of a second derivative of a log likelihood function. Thus, the confidence of an estimate is obtained by evaluating the information matrix. However, such computation is usually difficult (because it requires an expectation to be computed and information relating to a true parameter to be used), and some approximation methods have been proposed. As approximations of the Fisher information matrix, an observed information matrix and an empirical information matrix have been proposed. In the present analysis, these two information matrices are computed.
[4-2-1-1-2] Observed Information Matrix
Computation of the Fisher information matrix is usually complicated. Thus, in most cases, an expectation is not used and a Hessian at a maximum likelihood estimate is used. In other words, the observed information matrix expressed below is computed.
$I(\hat{\theta}; y) = -\left.\frac{\partial^2 \ln L(\theta)}{\partial \theta_i \partial \theta_j}\right|_{\theta = \hat{\theta}}.$ (21)
When the observed information matrix of this model is $I_o$, the expression below is obtained by computing a second derivative of the above eq.(6) relating to a log likelihood of incomplete-data.
$(I_o)_{q,r}$
Figure imgf000030_0001
Here,
$\frac{\partial P(V_k)}{\partial h_q} = \sum_{v_p \in V_k} \frac{\partial P(v_p)}{\partial h_q}.$ (23)
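Thus, the expression to be evaluated is $\partial P(v_p)/\partial h_q$. One haplotype needs to be deleted based on the condition of constraint. When the haplotype to be deleted is $h_Q$, according to the following expression,
$P(v_p) = (2 - \delta_{a,b}) h_a h_b,$ (24)
the expressions,
$\frac{\partial P(v_p)}{\partial h_q} = (2 - \delta_{a,b})(\delta_{b,q} h_a + \delta_{a,q} h_b)$, if $a \neq Q$ and $b \neq Q$, (25)
$\frac{\partial P(v_p)}{\partial h_q} = -2h_a + 2\delta_{a,q} h_Q$, if $a \neq Q$ and $b = Q$, (26)
$\frac{\partial P(v_p)}{\partial h_q} = -2h_Q$, if $a = Q$ and $b = Q$, (27)
are satisfied. Further, in the second derivation, the expressions
$\frac{\partial^2 P(v_p)}{\partial h_q \partial h_r} = (2 - \delta_{a,b})(\delta_{b,q}\delta_{a,r} + \delta_{a,q}\delta_{b,r})$, if $a \neq Q$ and $b \neq Q$, (28)
$\frac{\partial^2 P(v_p)}{\partial h_q \partial h_r} = -2(\delta_{a,r} + \delta_{a,q})$, if $a \neq Q$ and $b = Q$, (29)
$\frac{\partial^2 P(v_p)}{\partial h_q \partial h_r} = 2$, if $a = Q$ and $b = Q$, (30)
are satisfied.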
Once the observed information matrix, which is a Hessian, is evaluated, the estimate θ is found to be a saddle point of the likelihood when any eigenvalue is negative. In that case, the value of θ needs to be changed slightly in the direction of the eigenvector of the negative eigenvalue, and the EM algorithm needs to be started again. The EM algorithm alone fails to determine whether the converged point is a maximum value or a saddle point. Thus, the maximum value is checked by examining the eigenvalues of the observed information matrix. However, all eigenvalues may not necessarily be positive when the parameter converges on the edge of the domain of definition.
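The first derivatives of the pair probability, with the last haplotype eliminated through the constraint that all frequencies sum to 1, can be checked numerically. The sketch below is an illustration under stated assumptions: haplotypes are indexed 0 to Q − 1, the last index is the eliminated one, and the names `prob`, `dprob`, and `numeric_dprob` are hypothetical. The analytic branches follow directly from differentiating (2 − δ) h_a h_b after substituting the constraint.

```python
def prob(a, b, free):
    """P(v_p) = (2 - delta_ab) h_a h_b, with the last frequency set by the constraint."""
    h = list(free) + [1.0 - sum(free)]
    return (1.0 if a == b else 2.0) * h[a] * h[b]


def dprob(a, b, q, free):
    """Analytic derivative of P(v_p) with respect to the free parameter h_q."""
    h = list(free) + [1.0 - sum(free)]
    last = len(free)  # index of the eliminated haplotype
    delta = lambda x, y: 1.0 if x == y else 0.0
    if a > b:
        a, b = b, a  # order the pair so that only b can be the eliminated index
    if b != last:    # neither haplotype is the eliminated one
        return (2.0 - delta(a, b)) * (delta(b, q) * h[a] + delta(a, q) * h[b])
    if a != last:    # exactly one haplotype is the eliminated one
        return -2.0 * h[a] + 2.0 * delta(a, q) * h[last]
    return -2.0 * h[last]  # both haplotypes are the eliminated one


def numeric_dprob(a, b, q, free, eps=1e-6):
    """Central finite difference, used only to cross-check the analytic form."""
    up = list(free)
    up[q] += eps
    dn = list(free)
    dn[q] -= eps
    return (prob(a, b, up) - prob(a, b, dn)) / (2.0 * eps)
```

Because the pair probability is quadratic in the free parameters, the central difference agrees with the analytic derivative up to rounding error.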
[4-2-1-2] Computation of Confidence Intervals of Haplotype Frequencies and Two-Loci Haplotype Frequencies
Next, the observed information matrix processing unit 42 of the control unit 30 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies in the manner described below (step S2-6).
First, a method for constructing confidence intervals will be described.
[4-2-1-2-1] Q - 1 Dimensional Parameters
An information matrix to be computed is a matrix of $(Q - 1) \times (Q - 1)$ (one is reduced based on the condition of constraint). First, confidence intervals of the $Q - 1$ dimensional parameters are constructed in accordance with the basic principle.
To define a multidimensional confidence region, not only significance level $\alpha$ but also the shape of the region needs to be defined. A multidimensional ellipsoid with $\chi^2$ being constant is used. The statistic
$R^2(\delta h) \equiv \sum_{qq'} \delta h_q I_{qq'} \delta h_{q'},$ (31)
is defined from the expression of an exponent of a multivariate normal distribution in eq.(20) relating to the asymptotic efficiency of the maximum likelihood estimation method. Here, $\delta h = h - h^0$ is satisfied. Further, $I_{qq'}$ is an estimated information matrix. This reveals that $R^2$ is determined in accordance with the $\chi^2$ distribution with the degree of freedom being $Q - 1$ from eq.(20) relating to the asymptotic efficiency of the maximum likelihood estimation method. Thus, when the $\alpha$ point of the $\chi^2_{Q-1}$ distribution is $\chi^2_{Q-1}(\alpha)$, the confidence region $(1 - \alpha)$ is defined by,
Figure imgf000032_0001
A confidence interval of each value of $h_q$ is defined as one side of a multidimensional rectangular parallelepiped in which this multidimensional ellipsoid is inscribed. $\nabla R^2(\delta h)$ is a normal vector of $R^2(\delta h)$. Thus, a confidence interval of $h_q$ is obtained as a point at which this normal vector becomes parallel to the $h_q$ axis. A unit vector of the $h_q$ axis is defined by $e^q$, and $\delta h_q$ satisfies the following equations.
$I\,\delta h = c\,e^q,$ (33)
Figure imgf000032_0002
Here, $c$ is a certain constant, and $\sigma^2_{qq'} = (I^{-1})_{qq'}$. Using eq.(32) relating to the confidence region $(1 - \alpha)$, $c$ is expressed as
$c^2 \sum_{qq'} \sum_{ab} e^q_a \sigma^2_{aq} I_{qq'} \sigma^2_{q'b} e^q_b = \chi^2_{Q-1}(\alpha),$ (35)
Figure imgf000032_0003
and
Figure imgf000032_0004
When c is substituted in eq. (34) relating to δhq, the confidence limit point is given by
Figure imgf000032_0005
Accordingly, the confidence intervals of the $Q - 1$ dimensional parameters are obtained as
Figure imgf000032_0006
[4-2-1-2-2] Parameter of Q-th Dimension
The Q-th parameter that is deleted based on the condition of constraint is obtained using the condition of constraint as follows,
$h_Q = 1 - \sum_{q=1}^{Q-1} h_q.$ (40)
To obtain the confidence interval of $h_Q$, maximum values and minimum values of $\sum_{q=1}^{Q-1} h_q$ in the ellipsoid defined using eq.(32) relating to the confidence region $(1 - \alpha)$ need to be obtained. A plane that intersects with all axes at $\pi/4$ in the $Q - 1$ dimensional space, $\sum_q h_q = c$, is obtained as points being in contact with the ellipsoid with $\chi^2$ being constant. A vector vertical to this plane is expressed as $\vec{v} = (1, \dots, 1)$. Thus, when this expression is substituted in eq.(36) relating to $c$, $c$ is obtained as
Figure imgf000033_0001
Here, $\sigma^2_{QQ} \equiv \sum_{qq'} \sigma^2_{qq'}$. Then, $c$ is substituted in eq.(34) in which $e^q_a \to v_a$ and the summation over values of $q$ is performed. Based on the constraint condition, $h_Q = 1 - \sum_q h_q$, as a result, the confidence interval of $h_Q$ is finally obtained as described below.
Figure imgf000033_0002
$\sigma^2_{QQ}$, which is defined as $\sigma^2_{QQ} \equiv \sum_{qq'} \sigma^2_{qq'}$, in fact is a variance of $h_Q$. In reality, due to $\sum_q h_q = 1$, the following is satisfied.
Figure imgf000033_0003
Further, $\sigma^2_{qQ}$ is obtained as described below.
Figure imgf000033_0004
$$= E_F\left[(h_q - h_q^0)(h_Q - h_Q^0)\right]$$
$$= E_F\left[(h_q - h_q^0)\left\{\left(1 - \sum_{q'=1}^{Q-1} h_{q'}\right) - \left(1 - \sum_{q'=1}^{Q-1} h_{q'}^0\right)\right\}\right]$$
Figure imgf000034_0001
As a result, the same variance matrix of the Q x Q dimension is obtained irrespective of which q is deleted based on the constraint condition.
[4-2-1-2-3] Computation of Confidence Intervals of Haplotype Frequencies
As described above, confidence intervals of haplotype frequencies are obtained using the observed information matrix result according to eq. (42) above relating to the confidence interval of $h_Q$.
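The derivation above can be sketched in code. Because eqs. (38), (39) and (42) survive only as images here, the half-width used below, $\sqrt{\chi^2_{Q-1}(\alpha)\,\sigma^2_{qq}}$, is one reading obtained by substituting c back into eq. (34), not a transcription of those equations. The function name and the argument `chi2_crit` (the critical value, supplied by the caller) are hypothetical.

```python
import numpy as np

def info_matrix_intervals(h_hat, info, chi2_crit):
    """Confidence intervals of all Q haplotype frequencies.

    h_hat     : length-Q estimates (summing to 1)
    info      : (Q-1) x (Q-1) information matrix of the first Q-1 frequencies
    chi2_crit : critical value chi^2_{Q-1}(alpha), supplied by the caller
    """
    h_hat = np.asarray(h_hat, dtype=float)
    sigma2 = np.linalg.inv(np.asarray(info, dtype=float))  # sigma^2 = I^{-1}
    var = np.empty(h_hat.size)
    var[:-1] = np.diag(sigma2)   # variances of h_1 .. h_{Q-1}
    var[-1] = sigma2.sum()       # sigma^2_QQ = sum over q, q' of sigma^2_{qq'}
    half_width = np.sqrt(chi2_crit * var)
    # Note: the limits are not clipped to [0, 1] in this sketch.
    return h_hat - half_width, h_hat + half_width
```

For a multinomial model the variance of the Q-th frequency obtained this way agrees with the variance of any other frequency, which is the symmetry the text points out.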
[4-2-1-2-4] Computation of Confidence Intervals of Two-Loci Haplotype Frequencies
To convert a result of an information matrix or a result of a bootstrap method into information of two loci, it is convenient if a variance matrix can be converted.
The average of the multi-loci haplotype frequencies is represented by $\bar{h}_q$, and the sample variance by $\sigma_{qq'}$. In this case, the following is satisfied.
$$\sigma_{qq'} = \overline{(h_q - \bar{h}_q)(h_{q'} - \bar{h}_{q'})}. \tag{45}$$
Figure imgf000034_0002
Here, s denotes a subset of the subject haplotypes, and the sum of $h_q$ over the set s is $f_s$, which is expressed as
$$f_s = \sum_{q \in s} h_q. \tag{46}$$
The average of $f_s$ is computed as shown below.
Figure imgf000034_0003
Further, its variance matrix is computed as shown below.
Figure imgf000034_0004
Figure imgf000035_0001
Using this, results of information matrices (an observed information matrix and an empirical information matrix) or bootstrap methods described later (a nonparametric bootstrap method and a parametric bootstrap method) are each converted into information of two loci.
The result of an observed information matrix is converted into information of two loci. A confidence interval of a two-loci haplotype is computed using the conversion result.
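A minimal sketch of this conversion, assuming each multi-loci haplotype is given as a tuple of alleles and that the subset s collects all haplotypes sharing the same allele pair at the two loci of interest. Since sums of frequencies are linear in h, the variance matrix of the sums follows by summing the corresponding elements of the multi-loci variance matrix. The function and argument names are hypothetical.

```python
import numpy as np

def collapse_to_two_loci(haplotypes, h, sigma, locus_a, locus_b):
    """Sum multi-loci haplotype frequencies and their variance matrix
    over the subsets s that share alleles at two chosen loci."""
    pairs = sorted({(hp[locus_a], hp[locus_b]) for hp in haplotypes})
    # A[s, q] = 1 when multi-loci haplotype q belongs to subset s.
    A = np.array([[1.0 if (hp[locus_a], hp[locus_b]) == pair else 0.0
                   for hp in haplotypes] for pair in pairs])
    f = A @ np.asarray(h, dtype=float)                  # f_s = sum_{q in s} h_q
    cov_f = A @ np.asarray(sigma, dtype=float) @ A.T    # sum of sigma_{qq'} over s, s'
    return pairs, f, cov_f
```

The same function applies unchanged whether `sigma` comes from an information matrix or from bootstrap estimates.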
[4-2-2] Empirical Information Matrix Process
[4-2-2-1] Computation of Empirical Information Matrix
Next, the empirical information matrix processing unit 43 of the control unit 30 computes an empirical information matrix in the manner described below (step S2-7).
First, computation of the empirical information matrix will be described.
The empirical information matrix is computed using a method obtained by simplifying the computation method of the observed information matrix described above.
The likelihood may be written as
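$$\ln L(\theta) = \sum_{i=1}^{N} \ln f(y_i; \theta). \tag{49}$$
In this case, if the data are independent and identically distributed (i.i.d.), $I(\theta; y)$ may be simplified. Based on this assumption, the Hessian can be written as
$$I(\theta; y)_{pq} = -\sum_{i=1}^{N} \frac{\partial^2 \ln f(y_i; \theta)}{\partial\theta_p\,\partial\theta_q} = \sum_{i=1}^{N} \frac{\partial \ln f(y_i; \theta)}{\partial\theta_p} \frac{\partial \ln f(y_i; \theta)}{\partial\theta_q} - \sum_{i=1}^{N} \frac{1}{f(y_i; \theta)} \frac{\partial^2 f(y_i; \theta)}{\partial\theta_p\,\partial\theta_q}. \tag{50}$$
The expectation of the final term is zero (differentiation of the normalization), as shown below.
$$\int dy\, f(y; \theta)\, \frac{1}{f(y; \theta)} \frac{\partial^2 f(y; \theta)}{\partial\theta_p\,\partial\theta_q} = \int dy\, \frac{\partial^2 f(y; \theta)}{\partial\theta_p\,\partial\theta_q} = 0. \tag{51}$$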
Thus, when the number of data elements N is sufficiently large, in accordance with the law of large numbers, the above eq.(50) is given by
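$$I(\theta; y)_{pq} \simeq \sum_{i=1}^{N} \frac{\partial \ln f(y_i; \theta)}{\partial\theta_p} \frac{\partial \ln f(y_i; \theta)}{\partial\theta_q} \equiv I_e(\theta; y)_{pq}. \tag{52}$$
Here, $I_e(\theta; y)$ is referred to as an empirical information matrix.
In this model, using eq. (22), which relates to $(I_o)_{q,r}$, the following is expressed.
$$(I_e)_{q,r} = \sum_{k} \frac{1}{P(V_k)^2} \frac{\partial P(V_k)}{\partial h_q} \frac{\partial P(V_k)}{\partial h_r}. \tag{53}$$
In other words, the empirical information matrix is computed using eq. (53) relating to $(I_e)_{q,r}$.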
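As an illustration, the empirical information matrix is the sum over data elements of the outer product of the per-observation score vectors (first derivatives of $\ln f$). The sketch below assumes the scores are already available as an N x P array; the Bernoulli helper is a toy stand-in, not the haplotype model of this document, and both function names are hypothetical.

```python
import numpy as np

def empirical_information(scores):
    """Sum over observations of the outer product of score vectors.

    scores : array of shape (N, P); row i holds d ln f(y_i; theta) / d theta.
    """
    s = np.asarray(scores, dtype=float)
    return s.T @ s

def bernoulli_scores(y, theta):
    """Toy example: scores of a one-parameter Bernoulli model."""
    y = np.asarray(y, dtype=float)
    return (y / theta - (1.0 - y) / (1.0 - theta)).reshape(-1, 1)
```

For the Bernoulli toy model evaluated at the maximum likelihood estimate, the empirical and observed information coincide, which makes it a convenient check.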
[4-2-2-2] Computation of Confidence Intervals of Haplotype Frequencies and Two-Loci Haplotype Frequencies
Next, the empirical information matrix processing unit 43 of the control unit 30 computes confidence intervals of haplotype frequencies and two-loci haplotype frequencies (step S2-8). This processing is performed in the same manner as in the above step S2-6 using the result of the empirical information matrix in step S2-7.
[4-2-3] Nonparametric Bootstrap Process
[4-2-3-1] Nonparametric Bootstrap
Next, the nonparametric bootstrap processing unit 44 of the control unit 30 performs a nonparametric bootstrap in the manner described below (step S2-9).
[4-2-3-1-1] Overview of Bootstrap Method
Theoretically constructing a standard error or a confidence interval is extremely difficult in the case of a complicated statistic. The bootstrap method is a method for generating a sample from given data using random numbers and estimating a variance or a confidence interval for a statistic. A basic concept of the bootstrap method is not to compute a true parameter but to estimate the relationship (distribution) between an estimate and a true parameter from data or from a sample obtained from an estimate (referred to as a bootstrap sample). The bootstrap method is roughly divided into a nonparametric bootstrap and a parametric bootstrap depending on how a bootstrap sample is generated.
With the nonparametric bootstrap method, empirical distribution obtained from data is computed as a cumulative distribution function expressed as
$$\hat{F}(y) = \frac{\#\{y_j \le y\}}{N} = \frac{1}{N} \sum_{i=1}^{N} H(y - y_i). \tag{54}$$
Here, $\#\{y_j \le y\}$ is the number of data elements less than or equal to y, and $H(y - y_i)$ is a step function that is 1 when $y \ge y_i$ and 0 when $y < y_i$. A bootstrap sample is generated from this distribution. More specifically, the same number of data elements as in the given data are randomly drawn from the given data with replacement.
The parametric bootstrap method can be used when a model of distribution is given. Data is extracted randomly from this distribution by establishing a virtual bootstrap world with a parameter estimated from data being used as a true parameter. With the maximum likelihood method, a probability function including a parameter is usually given. Thus, this method can be used.
The statistical processing performed after a bootstrap sample is obtained is the same for both the nonparametric bootstrap method and the parametric bootstrap method.
Comparing the results of the nonparametric bootstrap and the parametric bootstrap provides one method for assessing how well the model fits the data.
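The nonparametric resampling step can be sketched as follows, assuming the data are held in an array whose first axis indexes individuals. The function names and the fixed default seed are illustrative choices only.

```python
import numpy as np

def ecdf(data, y):
    """Empirical distribution function: fraction of elements <= y."""
    data = np.asarray(data)
    return np.count_nonzero(data <= y) / data.size

def nonparametric_bootstrap_samples(data, n_boot, seed=0):
    """Yield n_boot bootstrap samples, each drawn with replacement
    and having the same number of elements as the given data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        yield data[idx]
```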
[4-2-3-1-2] Definitions etc. of Data Used in Bootstrap Method
A collection of observed data is expressed below.
$$Y = \{V_1, V_2, \ldots, V_N\}. \tag{55}$$
Each $V_i$ denotes genotype data of L loci. Haplotype frequencies of the group are expressed below.
$$h = \{h_1, h_2, \ldots, h_Q\}. \tag{56}$$
Each $h_q$ denotes the frequency of the haplotype labeled q. The HWE is the model assumed for performing maximum likelihood estimation. Haplotype frequencies obtained directly by performing maximum likelihood estimation from the data are shown below.
$$\hat{h} = \{\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_Q\}. \tag{57}$$
The b-th bootstrap sample is expressed as
Figure imgf000038_0001
With the nonparametric method, each $V_i^{*(b)}$ denotes one element of eq. (55) relating to the collection of observed data. With the parametric method, each $V_i^{*(b)}$ denotes genotype data obtained from a set of haplotypes generated according to the frequency distribution of eq. (57) relating to the haplotype frequencies obtained directly by performing maximum likelihood estimation from the data. Haplotype frequencies obtained by performing maximum likelihood estimation using each bootstrap sample are given by
$$\hat{h}^*(b) = \{\hat{h}^*_1(b), \hat{h}^*_2(b), \ldots, \hat{h}^*_Q(b)\}. \tag{59}$$
Further, the average of the bootstrap estimates is defined as
$$\hat{h}^*_q(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \hat{h}^*_q(b). \tag{60}$$
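For the parametric case, one bootstrap sample can be generated as sketched below. The sketch assumes biallelic loci coded 0/1, so that an unphased genotype is the per-locus allele count, and draws the two haplotypes of each individual independently from the estimated frequencies, which is what the HWE model implies. The function name and data layout are hypothetical.

```python
import numpy as np

def parametric_bootstrap_sample(haplotypes, h_hat, n_individuals, rng):
    """Generate unphased genotype data for one parametric bootstrap sample.

    haplotypes    : array of shape (Q, L), alleles coded 0/1
    h_hat         : length-Q estimated haplotype frequencies (summing to 1)
    n_individuals : number of individuals N in the sample
    rng           : numpy.random.Generator
    """
    hap = np.asarray(haplotypes)
    p = np.asarray(h_hat, dtype=float)
    # Two independent haplotype draws per individual (HWE assumption).
    idx = rng.choice(len(hap), size=(n_individuals, 2), p=p)
    # Unphased genotype: allele count (0, 1 or 2) at each locus.
    return hap[idx[:, 0]] + hap[idx[:, 1]]
```

Each generated sample would then be passed to the same maximum likelihood estimation as the observed data to obtain the bootstrap estimates of eq. (59).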
[4-2-3-1-3] Basic Statistic
A standard error occurring when the bootstrap method is used is computed using the expression below.
Figure imgf000038_0002
Here, $se(\hat{h}_q)$ is not usually used to construct a confidence interval. It is used only as a guide to the degree of variation of $\hat{h}_q$ under the bootstrap method. The central limit theorem is not used in most cases. The bias under the bootstrap method is defined below.
$$\mathrm{bias}(\hat{h}_q) = \hat{h}^*_q(\cdot) - \hat{h}_q. \tag{62}$$
Bias correction with the bootstrap method is usually extremely dangerous and is not recommended in most cases. In most cases, the bias is computed only to check whether it falls within the standard error range or the confidence interval. When the bias is too large, the estimation method itself needs to be checked. For example, the number of samples may be small, or the assumption may be wrong.
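A short sketch of these two basic statistics, assuming the bootstrap estimates are stacked in a B x Q array. Eq. (61) is not legible in this copy, so the B − 1 denominator used for the standard error is an assumption (the usual bootstrap convention); the function name is hypothetical.

```python
import numpy as np

def bootstrap_se_and_bias(boot_estimates, h_hat):
    """Bootstrap standard error and bias of each haplotype frequency.

    boot_estimates : array of shape (B, Q), one row per bootstrap sample
    h_hat          : length-Q maximum likelihood estimates from the data
    """
    boot = np.asarray(boot_estimates, dtype=float)
    mean_star = boot.mean(axis=0)                   # average of bootstrap estimates, eq. (60)
    se = boot.std(axis=0, ddof=1)                   # assumption: B - 1 denominator
    bias = mean_star - np.asarray(h_hat, dtype=float)  # eq. (62)
    return se, bias
```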
[4-2-3-2] Computation of Confidence Intervals of Haplotype Frequencies, Two-Loci Haplotype Frequencies, Individual's Diplotype Posterior Probabilities, and LD indices
Next, the nonparametric bootstrap processing unit 44 of the control unit 30 computes confidence intervals of multi-loci haplotype frequencies, two-loci haplotype frequencies, individual's multi-loci diplotype posterior probabilities, individual's two-loci diplotype posterior probabilities, and LD indices in the manner described below (step S2-10).
[4-2-3-2-1] Estimation of Confidence Intervals
A method for obtaining a confidence interval by sorting estimates computed from bootstrap samples is referred to as a percentile method. In multivariate analysis, the shape of a confidence region is determined directly by the parameter on which sorting is performed. Because this method directly refers to the distribution tail, the number of bootstrap samples B needs to be large. Even when B is around 2000, the results of multivariate analysis appear to vary greatly. Further, because bias correction is not performed at all, the convergence of the obtained confidence interval is not satisfactory.
A variance needs to be known to compute the $\chi^2$ statistic. Here, however, a Mahalanobis distance (described in detail later) is used instead, which is obtained, by analogy with the t statistic, by replacing the variance matrix with the sample variance obtained using the bootstrap method. This is computed in the manner described below.
Figure imgf000040_0001
Here, Σ denotes a sample variance matrix in a bootstrap sample, and each element $\sigma_{qq'}$ is computed in the manner described below.
Figure imgf000040_0002
The rank of this matrix is reduced by the constraint condition. However, the above eq. (63) relating to $r^{(b)}$ can still be used to compute $r^{(b)}$ by simply evaluating the Q × Q matrix and obtaining $\Sigma^{-1}$ as a generalized inverse matrix.
In the above eq. (63) relating to $r^{(b)}$, $(\hat{h}^*(b) - \hat{h}^*(\cdot))$ may be used instead of $(\hat{h}^*(b) - \hat{h})$. However, in the world of the bootstrap method, the true parameter is $\hat{h}$, and hence $\hat{h}$ is used.
In the estimation of haplotype frequencies, the steps below are executed to construct a $(1 - 2\alpha)$ confidence interval using eq. (63) above relating to $r^{(b)}$.
1. Sort $r^{(b)}$ in ascending order.
2. Set the $B \cdot (1 - 2\alpha)$-th $r^{(b)}$ as $r_{1-2\alpha}$.
3. Construct the confidence interval of $h_q$ using the maximum and minimum values of $\hat{h}^*_q(b)$ over all b satisfying $r^{(b)} \le r_{1-2\alpha}$.
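The three steps above can be sketched as follows. Two points are assumptions rather than something the text fixes: the sample variance matrix is normalized by 1/B (any constant factor rescales all distances equally and so leaves the ordering unchanged), and the generalized inverse is taken with NumPy's pseudo-inverse using a relative cutoff of 1e-10. The function name is hypothetical.

```python
import numpy as np

def mahalanobis_percentile_interval(boot_estimates, h_hat, alpha=0.025):
    """(1 - 2*alpha) confidence limits from Mahalanobis-sorted bootstrap estimates.

    boot_estimates : array of shape (B, Q)
    h_hat          : length-Q maximum likelihood estimates from the data
    """
    boot = np.asarray(boot_estimates, dtype=float)
    dev = boot - np.asarray(h_hat, dtype=float)   # deviations from h_hat
    n_boot = boot.shape[0]
    sigma = dev.T @ dev / n_boot                  # assumption: 1/B normalization
    # Generalized inverse; eigenvalues below 1e-10 of the largest are dropped.
    sigma_inv = np.linalg.pinv(sigma, 1e-10, hermitian=True)
    r = np.einsum("bi,ij,bj->b", dev, sigma_inv, dev)
    order = np.argsort(r)                         # step 1: ascending r
    k = max(int(np.floor(n_boot * (1 - 2 * alpha))), 1)  # step 2: cutoff rank
    kept = boot[order[:k]]                        # step 3: samples within the cutoff
    return kept.min(axis=0), kept.max(axis=0), r
```

With this normalization the mean of the distances equals the rank of the variance matrix, which gives a quick sanity check on the generalized inverse.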
As methods other than this, a percentile confidence interval can be constructed by directly sorting each element of hq. However, a confidence interval constructed in this way is shorter than the confidence interval constructed using r^ described above. This confidence interval is not a confidence interval constructed in the entire multidimension but is a confidence interval constructed in one dimension by projecting data on a certain axis. This method fails to consider the entire multidimension (referred to as a "1-dim percentile" herein).
[4-2-3-2-2] Computation of Mahalanobis Distance in Singular Variance Matrix
When the variance matrix Σ is singular (when its rank is reduced), no inverse matrix of Σ exists. This is equivalent to zero eigenvalues appearing when the variance matrix is diagonalized. In general, when matrix Σ is a symmetric matrix, there exist an orthogonal matrix P and a diagonal matrix Λ that satisfy $\Sigma = P \Lambda P^t$. When no zero eigenvalues exist, an inverse matrix of Λ exists. Thus, the inverse matrix of Σ is given by $\Sigma^{-1} = P \Lambda^{-1} P^t$. Further, based on
$$\Lambda = P^t \Sigma P = P^t \left\{ \sum_{b=1}^{B} (\hat{h}^*(b) - \hat{h})(\hat{h}^*(b) - \hat{h})^t \right\} P, \tag{65}$$
the transformation $y = P^t h$ may be performed. After this transformation, the variance matrix is a diagonal matrix.
When zero eigenvalues exist, a matrix Λ' of Q' × Q', which is obtained by reducing the size of matrix Λ by the number of zero eigenvalues, is generated. Matrix Λ' and the Q'-dimensional vector y', which excludes the elements of y corresponding to the zero eigenvalues, are used to compute $r^{(b)}$ as follows.
$$r^{(b)} = (\hat{y}^{\prime *}(b) - \hat{y}^{\prime})^t\, \Lambda^{\prime -1}\, (\hat{y}^{\prime *}(b) - \hat{y}^{\prime}). \tag{66}$$
However, when the inverse matrix of Λ in which the parts corresponding to the zero eigenvalues are set to zero is written as $\Lambda^{-1}$, the corresponding parts contribute zero. As a result, the above expression is transformed into
$$\begin{aligned} r^{(b)} &= (\hat{y}^*(b) - \hat{y})^t\, \Lambda^{-1}\, (\hat{y}^*(b) - \hat{y}) \\ &= (\hat{h}^*(b) - \hat{h})^t\, P \Lambda^{-1} P^t\, (\hat{h}^*(b) - \hat{h}) \\ &= (\hat{h}^*(b) - \hat{h})^t\, \Sigma^{-1}\, (\hat{h}^*(b) - \hat{h}). \end{aligned} \tag{67}$$
Here, the inverse matrix $\Sigma^{-1}$ of Σ is defined as $\Sigma^{-1} = P \Lambda^{-1} P^t$. As a result, when the inverse of the variance matrix is defined using this generalized inverse matrix, $r^{(b)}$ can be evaluated using eq. (63) even though the rank of matrix Σ is reduced.
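A minimal sketch of this generalized inverse, built from the eigendecomposition of the symmetric variance matrix. The relative threshold `rel_tol` below which an eigenvalue is treated as zero is an assumed value, and the function name is hypothetical.

```python
import numpy as np

def generalized_inverse(sigma, rel_tol=1e-10):
    """Return P Lambda^{-1} P^t with near-zero eigenvalues inverted to zero,
    together with the number Q' of eigenvalues that were kept."""
    eigval, P = np.linalg.eigh(np.asarray(sigma, dtype=float))
    keep = eigval > rel_tol * eigval.max()   # eigenvalues regarded as nonzero
    inv_eigval = np.zeros_like(eigval)
    inv_eigval[keep] = 1.0 / eigval[keep]    # parts for zero eigenvalues stay zero
    return (P * inv_eigval) @ P.T, int(keep.sum())
```

For a variance matrix of frequencies that sum to one, the returned rank is at most Q − 1, reflecting the constraint condition.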
A variance matrix usually has at least one zero eigenvalue because of the constraint condition on the haplotype frequencies. Because rounding errors arise in numerical computation, an eigenvalue that is extremely small compared with the maximum eigenvalue needs to be regarded as a zero eigenvalue.
[4-2-3-2-3] Computation of Confidence Intervals of Haplotype Frequencies
Confidence intervals of haplotype frequencies are computed using eq. (42) relating to the confidence interval of $h_Q$ described above.
[4-2-3-2-4] Computation of Confidence Intervals of Two-Loci Haplotype Frequencies
In the same manner as the computation of the confidence intervals of the two-loci haplotype frequencies described in step S2-6 above, the variance matrix is converted into information of two loci, and that result is used to compute confidence intervals of two-loci haplotype frequencies.
[4-2-3-2-5] Computation of Confidence Intervals of Individual's Diplotype Posterior Probabilities
Confidence intervals of individual's diplotype posterior probabilities are constructed using the maximum likelihood estimation and the bootstrap method. Confidence intervals of individual's multi-loci diplotype posterior probabilities and confidence intervals of individual's two-loci diplotype posterior probabilities are constructed in the same manner as the haplotype frequency estimation.
[4-2-3-2-6] Computation of Confidence Intervals of LD Indices
In the present analysis, confidence intervals of $\rho^2$ and $D'$, which are LD indices, are constructed using a bias-corrected and accelerated (BCa) method. The BCa method is a well-known method obtained by adding bias correction and distribution shape correction to the percentile method.
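For reference, the standard BCa construction can be sketched as below. This is the textbook form of the method (bias correction from the fraction of bootstrap values below the estimate, acceleration from jackknife values), not a transcription of this document's implementation. The function name is hypothetical, and `stat` stands for any scalar statistic of the data, for example a function returning an LD index.

```python
import numpy as np
from statistics import NormalDist

def bca_interval(data, stat, boot_stats, alpha=0.025):
    """Bias-corrected and accelerated (1 - 2*alpha) bootstrap interval.

    data       : observed data, first axis indexes observations
    stat       : function mapping a data array to a scalar statistic
    boot_stats : statistic evaluated on each bootstrap sample
    """
    nd = NormalDist()
    data = np.asarray(data)
    boot_stats = np.asarray(boot_stats, dtype=float)
    n_boot = boot_stats.size
    theta_hat = stat(data)
    # Bias correction: normal quantile of the fraction of bootstrap values
    # below the estimate (clamped away from 0 and 1).
    frac = float(np.mean(boot_stats < theta_hat))
    frac = min(max(frac, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(frac)
    # Acceleration: skewness of the leave-one-out (jackknife) values.
    jack = np.array([stat(np.delete(data, i, axis=0)) for i in range(len(data))])
    d = jack.mean() - jack
    denom = 6.0 * np.sum(d ** 2) ** 1.5
    accel = float(np.sum(d ** 3) / denom) if denom > 0 else 0.0

    def adjusted_level(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    levels = [adjusted_level(alpha), adjusted_level(1.0 - alpha)]
    lower, upper = np.quantile(boot_stats, levels)
    return lower, upper
```

When both corrections vanish (z0 = 0 and zero acceleration), the adjusted levels reduce to alpha and 1 − alpha, and the interval coincides with the plain percentile interval.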
[4-2-4] Parametric Bootstrap Process
[4-2-4-1] Parametric Bootstrap
Next, the parametric bootstrap processing unit 45 of the control unit 30 performs the parametric bootstrap described above (step S2-11).
[4-2-4-2] Computation of Confidence Intervals of Haplotype Frequencies, Two-Loci Haplotype Frequencies, Individual's Diplotype Posterior Probabilities, and LD Indices
Next, the parametric bootstrap processing unit 45 of the control unit 30 computes confidence intervals of multi-loci haplotype frequencies, two-loci haplotype frequencies, individual's multi-loci diplotype posterior probabilities, individual's two-loci diplotype posterior probabilities, and LD indices (step S2-12). This processing is performed using results of the parametric bootstrap in the same manner as the computation of the confidence intervals of the multi-loci haplotype frequencies, the two-loci haplotype frequencies, the individual's multi-loci diplotype posterior probabilities, the individual's two-loci diplotype posterior probabilities, and the LD indices performed in step S2-10 described above.
[4-3] Evaluation of Confidence Intervals
Next, the confidence verification unit 46 of the control unit 30 evaluates the confidence intervals obtained in the above processes in the manner described below (step S2-13).
[4-3-1] Verification of Estimation Confidence
With the maximum likelihood method, the distribution of estimates asymptotically follows the normal distribution as described above. The estimator is also asymptotically efficient, so the variance of each estimate approaches its minimum when the amount of data is large. Thus, the maximum likelihood method is one of the best estimation methods provided that the amount of data is large enough for the estimation to be asymptotically efficient. This may be confirmed by comparing variances obtained using the information matrices (the observed information matrix and the empirical information matrix) and variances obtained using the bootstrap methods. This is because the variances obtained using the bootstrap methods are variances of direct estimates and the variances obtained using the information matrices are variances obtained based on the assumption of the asymptotic normality.
In the present analysis, variances obtained using the above-described four methods, namely, (1) the observed information matrix, (2) the empirical information matrix, (3) the nonparametric bootstrap method, and (4) the parametric bootstrap method, are compared, so that (1) the law of large numbers, (2) the validity of the model, and (3) the asymptotic normality are verified.
The law of large numbers is verified by comparing the variance obtained using the observed information matrix and the variance obtained using the empirical information matrix.
The validity of the model is verified by comparing the variance obtained using the nonparametric bootstrap method and the variance obtained using the parametric bootstrap method.
The asymptotic normality is verified by comparing the variance obtained using the observed information matrix and the variance obtained using the nonparametric bootstrap method.
As methods for such verification in the preferred embodiment, (1) the test using the variance ratio (F-test) and (2) the adjustment of the allowable confidence interval are performed. In verification using the F-test, for example, the variance obtained using the information matrix is given one degree of freedom, and the variance obtained using the bootstrap method is given degrees of freedom equal to the number of bootstrap samples minus one. This verification involves multiple comparisons. Thus, a level of significance that takes the Bonferroni correction into account needs to be set in advance.
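The variance-ratio check can be sketched as follows, using SciPy's F distribution. The two-sided p-value and the decision rule are assumptions about how the test would be applied; the degrees of freedom are left as parameters so that the values given in the text (1, and the number of bootstrap samples minus one) can be passed in. The function name is hypothetical.

```python
from scipy.stats import f as f_dist

def variance_ratio_test(var_a, var_b, df_a, df_b, alpha=0.05, n_comparisons=1):
    """Two-sided F-test of var_a / var_b with a Bonferroni-corrected level.

    Returns the ratio, the p-value, and True when the two variances
    are NOT judged to differ at the corrected level.
    """
    ratio = var_a / var_b
    cdf = f_dist.cdf(ratio, df_a, df_b)
    p_value = 2.0 * min(cdf, 1.0 - cdf)   # assumption: two-sided test
    level = alpha / n_comparisons         # Bonferroni correction
    return ratio, p_value, bool(p_value >= level)
```

With three comparisons, as in the three verifications listed above, `n_comparisons=3` would divide the nominal level by three.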
In the adjustment of the allowable confidence interval, a determination is made as to whether the two edge values of the confidence interval obtained using each method fall within the allowable range of the two edge values of the confidence interval, which is specified in advance by the user. This method does not compare variances directly. However, it is effective when the estimation accuracy can be regarded as sufficiently high as long as the confidence interval falls within the tolerable range set based on the amount of data.
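One possible reading of this check is sketched below: the user supplies an allowable range for the lower edge and another for the upper edge, and an interval passes when each edge lies in its range. The text does not spell out the exact form of the allowable range, so this layout and the function name are assumptions.

```python
def within_allowable_range(interval, allowed_lower, allowed_upper):
    """Check both edges of a confidence interval against user-set ranges.

    interval      : (lower, upper) confidence limits from one method
    allowed_lower : (min, max) allowable range of the lower edge
    allowed_upper : (min, max) allowable range of the upper edge
    """
    lower, upper = interval
    lower_min, lower_max = allowed_lower
    upper_min, upper_max = allowed_upper
    return (lower_min <= lower <= lower_max) and (upper_min <= upper <= upper_max)
```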
The verification with this maximum likelihood estimation method is performed for estimation of each of a plurality of sets of loci. In evaluation of confidence intervals of haplotype frequencies, individual's diplotype posterior probabilities, and LD indices described below, only the results that have passed each evaluation of this verification are used.
The verification with this maximum likelihood estimation method is performed for the variance of the multi-loci haplotype frequencies and the variance of the two-loci haplotype frequencies.
Hereafter, confidence intervals of haplotype frequencies of an entire specified interval (successive multi-loci), confidence intervals of haplotype frequencies between two loci, confidence intervals of individual's diplotype posterior probabilities of an entire specified interval (successive multi-loci), confidence intervals of individual's diplotype posterior probabilities between two loci, and confidence intervals of LD indices will be separately described in detail.
[4-3-2] Haplotype Frequencies
For the confidence intervals of the haplotype frequencies of the entire specified interval (successive multi-loci), "[4-3-1] Verification of Estimation Confidence" described above is first performed for the variances of the multi-loci haplotype frequencies. Next, for each of the confidence interval computation results satisfying each evaluation condition of "[4-3-1] Verification of Estimation Confidence" described above performed for the variances of the multi-loci haplotype frequencies, determination is performed as to whether the maximum likelihood estimates of the multi-loci haplotype frequencies fall within the confidence intervals computed using the nonparametric bootstrap method and the parametric bootstrap method. Then, data specifying the set of the multi-loci, data relating to the confidence intervals of the multi-loci haplotype frequencies, and data relating to the multi-loci haplotype frequencies are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. Here, for each of the results failing to satisfy the evaluation conditions of "[4-3-1] Verification of Estimation Confidence" described above, a flag indicating this (verification error flag) is added and stored together with the above data. Further, when the maximum likelihood estimates of the multi-loci haplotype frequencies fail to fall within the above confidence intervals, a flag indicating this (confidence interval error flag) is added and stored together with the above data.
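The storing-with-flags logic described here can be sketched as follows. The record layout, field names and function name are hypothetical; only the two flags and the order of the checks (verification first, then containment of the maximum likelihood estimate in the bootstrap intervals) are taken from the text.

```python
from dataclasses import dataclass

@dataclass
class IntervalResult:
    loci_set: tuple                  # data specifying the set of multi-loci
    estimate: float                  # maximum likelihood estimate
    interval: tuple                  # (lower, upper) confidence interval
    verification_error: bool = False
    confidence_interval_error: bool = False

def evaluate_result(loci_set, estimate, interval, verification_passed,
                    bootstrap_intervals):
    """Build the stored record and set the error flags.

    bootstrap_intervals : (lower, upper) pairs from the nonparametric and
                          parametric bootstrap methods
    """
    result = IntervalResult(tuple(loci_set), estimate, tuple(interval))
    if not verification_passed:
        result.verification_error = True
    else:
        # Containment is determined only for results that passed verification.
        inside = all(lo <= estimate <= hi for lo, hi in bootstrap_intervals)
        result.confidence_interval_error = not inside
    return result
```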
For the confidence intervals of the two-loci haplotype frequencies, "[4-3-1] Verification of Estimation Confidence" described above is first performed for the variances of the haplotype frequencies that are converted into information of two loci. Next, for each of the confidence interval computation results satisfying each evaluation condition of "[4-3-1] Verification of Estimation Confidence" described above performed for the variances of the two-loci haplotype frequencies, determination is performed as to whether the maximum likelihood estimates of the two-loci haplotype frequencies fall within the confidence intervals computed using the nonparametric bootstrap method and the parametric bootstrap method. Then, data specifying the set of the multi-loci for which the confidence intervals of the two-loci haplotype frequencies are computed, data relating to the confidence intervals of the two-loci haplotype frequencies, and data relating to the two-loci haplotype frequencies are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. Here, for each of the results failing to satisfy the evaluation conditions of "[4-3-1] Verification of Estimation Confidence" described above, a flag indicating this (verification error flag) is added and stored together with the above data. Further, when the maximum likelihood estimates of the two-loci haplotype frequencies fail to fall within the above confidence intervals, a flag indicating this (confidence interval error flag) is added and stored together with the above data.
[4-3-3] Individual's Diplotype Posterior Probabilities
For the diplotype posterior probabilities of the entire specified interval (successive multi-loci) of each individual, "[4-3-1] Verification of Estimation Confidence" described above is first performed for the variances of the multi-loci haplotype frequencies. Next, for each of the confidence interval computation results passing each evaluation of "[4-3-1] Verification of Estimation Confidence" described above performed for the variances of the multi-loci haplotype frequencies, a determination is performed as to whether the maximum likelihood estimates of the individual's multi-loci diplotype posterior probabilities fall within the confidence intervals computed using the nonparametric bootstrap method and the parametric bootstrap method. Then, data specifying the set of the multi-loci, data relating to the confidence intervals of the individual's multi-loci diplotype posterior probabilities, and data relating to the individual's multi-loci diplotype posterior probabilities are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. Here, for each of the results failing to satisfy the evaluation conditions of "[4-3-1] Verification of Estimation Confidence" described above, a flag indicating this (verification error flag) is added and stored together with the above data. Further, when the maximum likelihood estimates of the individual's multi-loci diplotype posterior probabilities fail to fall within the above confidence intervals, a flag indicating this (confidence interval error flag) is added and stored together with the above data.
For the confidence interval of the diplotype posterior probabilities of the two loci of each individual, "[4-3-1] Verification of Estimation Confidence" described above is first performed for the variances of the haplotype frequencies that are converted into information of two loci. Next, for each of the confidence interval computation results passing each evaluation of "[4-3-1] Verification of Estimation Confidence" described above performed for the variances of the two-loci haplotype frequencies, a determination is performed as to whether the maximum likelihood estimates of the individual's two-loci diplotype posterior probabilities fall within the confidence intervals computed using the nonparametric bootstrap method and the parametric bootstrap method. Then, data specifying the set of the multi-loci for which the confidence intervals of the individual's two-loci diplotype posterior probabilities are computed, data relating to the confidence intervals of the individual's two-loci diplotype posterior probabilities, and data relating to the individual's two-loci diplotype posterior probabilities are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. For each of the results failing to satisfy the evaluation conditions of "[4-3-1] Verification of Estimation Confidence" described above, a flag indicating this (verification error flag) is added and stored together with the above data. Further, when the maximum likelihood estimates of the individual's two-loci diplotype posterior probabilities fail to fall within the above confidence intervals, a flag indicating this (confidence interval error flag) is added and stored together with the above data.
[4-3-4] LD Indices
For the LD indices, indices $\rho^2$ and $D'$ are evaluated using (1) the BCa method in accordance with the nonparametric bootstrap method and (2) the BCa method in accordance with the parametric bootstrap method. More specifically, "[4-3-1] Verification of Estimation Confidence" described above is first performed for the variances of the haplotype frequencies that are converted into information of two loci. Next, for each of the confidence interval computation results passing each evaluation of "[4-3-1] Verification of Estimation Confidence" described above performed for the variances of the two-loci haplotype frequencies, a determination is performed as to whether the maximum likelihood estimate of each LD index falls within the confidence intervals computed using the BCa method in accordance with the nonparametric bootstrap method and the BCa method in accordance with the parametric bootstrap method. Then, data specifying the set of the multi-loci for which the confidence intervals of each LD index are computed, data relating to the confidence intervals of each LD index, and data relating to each LD index are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. For each of the results failing to satisfy the evaluation conditions of "[4-3-1] Verification of Estimation Confidence" described above, a flag indicating this (verification error flag) is added and stored together with the above data. Further, when the maximum likelihood estimate of each LD index fails to fall within the above confidence intervals, a flag indicating this (confidence interval error flag) is added and stored together with the above data.
[5] Generation of Successive Multi-Loci Data and Estimation of Confidence Intervals for All Possible Sets of Successive Multi-Loci Including Two Subject Loci in Specified Range
The confidence interval estimation unit 32 repeats the processing of step S1-1 and step S1-2 described above for the possible sets of successive multi-loci including the two subject loci in a specified range (e.g., in a range defined using a set maximum length) until the processing is completed (until the processing reaches termination in step S1-1).
[6] Comparison of Confidence Interval Estimation Results
Next, the confidence interval estimation result comparison unit 33 of the control unit 30 compares the confidence interval estimation results computed for each of the sets of successive multi-loci using the above methods (step S1-3). More specifically, the confidence interval estimation result comparison unit 33 compares, in the manner described below, the confidence interval estimation results of the two-loci haplotype frequencies, the confidence interval estimation results of the individual's two-loci diplotype posterior probabilities, and the confidence interval estimation results of the LD indices stored in the confidence interval estimation result storage unit 53. Among the data stored in the confidence interval estimation result storage unit 53, only data for which neither a verification error flag nor a confidence interval error flag is set is used for the comparison. In other words, the data used for the comparison is the data satisfying the evaluation conditions in step S2-13 described above.
[6-1] Comparison of Confidence Interval Estimation Results of Haplotype Frequencies
For the confidence interval estimation results of the two-loci haplotype frequencies, the confidence interval estimation results of the two-loci haplotype frequencies of each set of successive multi-loci for which neither a verification error flag nor a confidence interval error flag is set are compared with one another, and the shortest confidence interval is specified. The specified shortest confidence interval, the two-loci haplotype frequency stored as being associated with this confidence interval, and data specifying the set of the multi-loci for which the confidence interval of this two-loci haplotype frequency is obtained are output to the output unit 62 in a manner that allows the specified confidence interval to be identified.
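The selection described here can be sketched as below, assuming each stored result is a dictionary with an `interval` pair and the two flags. The dictionary keys and the function name are hypothetical.

```python
def shortest_unflagged(results):
    """Return the stored result with the shortest confidence interval among
    those with neither error flag set, or None when no such result exists."""
    usable = [r for r in results
              if not r["verification_error"]
              and not r["confidence_interval_error"]]
    if not usable:
        return None
    return min(usable, key=lambda r: r["interval"][1] - r["interval"][0])
```

The same selection would apply to the individual's two-loci diplotype posterior probabilities and to the LD indices.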
[6-2] Comparison of Confidence Interval Estimation Results of Individual's Diplotype Posterior Probabilities
For the confidence interval estimation results of the individual's two-loci diplotype posterior probabilities, the confidence interval estimation results of the individual's two-loci diplotype posterior probabilities of each set of successive multi-loci for which neither a verification error flag nor a confidence interval error flag is set are compared with one another, and the shortest confidence interval is specified. The specified shortest confidence interval, the individual's two-loci diplotype posterior probabilities stored as being associated with this confidence interval, and data specifying the set of the multi-loci for which the confidence intervals of these individual's two-loci diplotype posterior probabilities are obtained are output to the output unit 62 in a manner that allows the specified confidence interval to be identified.
[6-3] Comparison of Confidence Interval Estimation Results of LD Indices
For the confidence interval estimation results of the LD indices, the confidence interval estimation results of each two-loci LD index of each set of successive multi-loci for which neither a verification error flag nor a confidence interval error flag is set are compared with one another, and the shortest confidence interval is specified. The specified shortest confidence interval, the LD indices stored as being associated with this confidence interval, and data specifying the set of the multi-loci for which the confidence intervals of these LD indices are obtained are output to the output unit 62 in a manner that allows the specified confidence interval to be identified.
The shortest confidence intervals of the two-loci haplotype frequencies, the individual's two-loci diplotype posterior probabilities, and the LD indices are output in a manner that the shortest confidence intervals can be specified. The multi-loci haplotype frequencies, the two-loci haplotype frequencies, the individual's multi-loci diplotype posterior probabilities, the individual's two-loci diplotype posterior probabilities, the LD indices, the confidence intervals for these, and the data specifying the set of the multi-loci for which these confidence intervals are obtained are output as described above in the preferred embodiment. Further, for the data for which a verification error flag or a confidence interval error flag is set, the reason why the data is not used for the comparison is output based on the flag in a manner that the reason can be specified.
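The comparison described in [6-1] to [6-3] can be illustrated by the following minimal sketch. The record layout (dictionary keys such as `loci_set`, `ci_low`, and `verification_error`) and the function name `select_shortest_ci` are hypothetical and are not taken from the embodiment; the sketch only assumes that each set of multi-loci yields one confidence interval together with its two error flags.

```python
def select_shortest_ci(results):
    """Return the record with the shortest confidence interval among the
    records for which neither error flag is set, or None if none is usable."""
    usable = [
        r for r in results
        if not r["verification_error"] and not r["ci_error"]
    ]
    if not usable:
        return None
    # The interval length is the upper limit minus the lower limit.
    return min(usable, key=lambda r: r["ci_high"] - r["ci_low"])


# Hypothetical results for three sets of multi-loci including the two specific loci.
results = [
    {"loci_set": (4, 5), "estimate": 0.30, "ci_low": 0.20, "ci_high": 0.40,
     "verification_error": False, "ci_error": False},
    {"loci_set": (3, 4, 5), "estimate": 0.31, "ci_low": 0.25, "ci_high": 0.37,
     "verification_error": False, "ci_error": False},
    {"loci_set": (3, 4, 5, 6), "estimate": 0.32, "ci_low": 0.30, "ci_high": 0.33,
     "verification_error": True, "ci_error": False},
]
best = select_shortest_ci(results)
print(best["loci_set"], best["estimate"])
```

In this example the four-loci set has the narrowest interval but is excluded because its verification error flag is set, so the three-loci set is adopted.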
As described above, the preferred embodiment has the advantages described below. In the above embodiment, the LD index computation device 20 uses a collection of multi-loci genotype data of individuals to generate successive multi-loci data, that is, genotype data of each possible set of successive multi-loci including two specific loci. Maximum likelihood estimates of haplotype frequencies of successive multi-loci including the two specific loci are computed using the successive multi-loci data for each set of successive multi-loci, and the maximum likelihood estimates of the multi-loci haplotype frequencies are converted into two-loci haplotype frequencies. Variances and confidence intervals are computed using the successive multi-loci data with a plurality of different methods, and are each converted into information of the two specific loci. Then, verification is performed by comparing the variances of the haplotype frequencies between the two specific loci computed using the different methods. Further, determination is performed as to whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed using predetermined ones of the different methods. Based on the results of the verification and the determination, the confidence intervals and the corresponding two-loci haplotype frequencies are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. Based on the results of the verification and the determination performed for each set of successive multi-loci, the confidence intervals stored in the confidence interval estimation result storage unit 53 are compared with one another, and the confidence interval to be adopted is specified. 
Then, the two-loci haplotype frequencies stored as being associated with the adopted confidence interval are specified.
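The conversion of multi-loci haplotype frequencies into two-loci haplotype frequencies can be sketched as follows, on the assumption that the conversion is a marginal sum over the loci other than the two specific loci. The function name `to_two_loci`, the string encoding of haplotypes, and the example frequencies are hypothetical.

```python
from collections import defaultdict


def to_two_loci(hap_freqs, loci, locus_a, locus_b):
    """Sum multi-loci haplotype frequencies over all loci except locus_a and
    locus_b, giving haplotype frequencies between those two loci."""
    ia, ib = loci.index(locus_a), loci.index(locus_b)
    two = defaultdict(float)
    for hap, freq in hap_freqs.items():
        # Keep only the alleles carried at the two specific loci.
        two[(hap[ia], hap[ib])] += freq
    return dict(two)


# Hypothetical three-loci haplotypes; each character is the allele at one locus.
loci = ["j-1", "j", "k"]
freqs = {"AAB": 0.4, "BAB": 0.1, "ABA": 0.3, "BBB": 0.2}
two = to_two_loci(freqs, loci, "j", "k")
print(two)
```

Here the haplotypes `AAB` and `BAB` differ only at locus j-1, so their frequencies are added to give the frequency of the two-loci haplotype carrying A at locus j and B at locus k.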
In this way, haplotype frequencies between two loci can be obtained using genotype data of multi-loci. This enables analysis that makes effective use of experimental data. The validity of the maximum likelihood estimate can be evaluated accurately by constructing confidence intervals using various methods. Thus, when the number of samples is small or the validity of the model fails to be verified, such facts can be detected as distortions of the confidence intervals. For the results for which the validity of the maximum likelihood estimate is verified, the confidence intervals obtained using various numbers of loci including the two specific loci are compared with one another. As a result, the two-loci haplotype frequencies with higher accuracy are obtained.
In the above embodiment, variances and confidence intervals are computed using the observed information matrix, the empirical information matrix, the nonparametric bootstrap method, and the parametric bootstrap method. The variance obtained using the observed information matrix and the variance obtained using the empirical information matrix are compared. The variance obtained using the nonparametric bootstrap method and the variance obtained using the parametric bootstrap method are compared. The variance obtained using the observed information matrix and the variance obtained using the nonparametric bootstrap method are compared. Then, verification is performed based on these comparison results.
As a result, variances and confidence intervals of haplotype frequencies are obtained using the observed information matrix, the empirical information matrix, the nonparametric bootstrap method, and the parametric bootstrap method. The variance obtained using the observed information matrix and the variance obtained using the empirical information matrix are compared. This enables the law of large numbers to be checked. The variance obtained using the nonparametric bootstrap method and the variance obtained using the parametric bootstrap method are compared. This enables the validity of the model to be verified. The variance obtained using the observed information matrix and the variance obtained using the nonparametric bootstrap method are compared. This enables the asymptotic normality to be verified.
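The three comparisons may be sketched as follows. The agreement criterion (a relative tolerance, here 20%) is an assumption made only for illustration; this passage does not state the threshold, and the function names and variance values are hypothetical.

```python
def close(v1, v2, rel_tol=0.2):
    """True if two variances agree within a relative tolerance (assumed criterion)."""
    return abs(v1 - v2) <= rel_tol * max(abs(v1), abs(v2))


def verify_variances(var_obs, var_emp, var_npb, var_pb, rel_tol=0.2):
    """Run the three pairwise comparisons and label what each one checks."""
    return {
        # observed information matrix vs empirical information matrix
        "law_of_large_numbers": close(var_obs, var_emp, rel_tol),
        # nonparametric bootstrap vs parametric bootstrap
        "model_validity": close(var_npb, var_pb, rel_tol),
        # observed information matrix vs nonparametric bootstrap
        "asymptotic_normality": close(var_obs, var_npb, rel_tol),
    }


# Hypothetical variances of one two-loci haplotype frequency.
checks = verify_variances(var_obs=0.0021, var_emp=0.0022,
                          var_npb=0.0030, var_pb=0.0023)
verification_error = not all(checks.values())
print(checks, verification_error)
```

With these values the two information-matrix variances agree, but the nonparametric bootstrap variance departs from both the parametric bootstrap variance and the observed-information variance, so a verification error flag would be set.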
In the above embodiment, a maximum likelihood estimate of each LD index is computed using the two-loci haplotype frequencies based on successive multi-loci data of each set of successive multi-loci. Variances and confidence intervals of the LD indices are computed. Then, determination is performed as to whether the maximum likelihood estimates of the LD indices fall within the confidence intervals of the LD indices. Based on the results of the verification and the results of the determination as to whether the maximum likelihood estimates fall within the confidence intervals of the LD indices, the confidence intervals of the LD indices and the corresponding LD indices are stored in the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. Based on the results of this verification and the determination performed for each set of successive multi-loci, the confidence intervals of the LD indices stored in the confidence interval estimation result storage unit 53 are compared with one another, and the confidence interval of the LD indices to be adopted is specified. Then, the LD indices stored as being associated with the adopted confidence interval are specified.
As a result, LD indices between two loci can be obtained using multi-loci genotype data. This enables analysis that makes effective use of experimental data. Further, for the results for which the validity of the maximum likelihood estimate is verified, the confidence intervals computed using various numbers of loci including the two specific loci are compared with one another, so that the LD indices with higher accuracy are obtained.
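As an illustration of computing LD indices from two-loci haplotype frequencies, the following sketch uses the commonly used textbook definitions of D, D', and r². Whether the embodiment uses the signed or the absolute value of D' is not stated in this passage; the signed value is returned here as an assumption. The function name `ld_indices` and the allele labels are hypothetical.

```python
def ld_indices(freqs):
    """Compute D, D' and r^2 for two biallelic loci.

    freqs maps the two-loci haplotypes 'AB', 'Ab', 'aB', 'ab' to frequencies.
    """
    p_a = freqs["AB"] + freqs["Ab"]   # frequency of allele A at the first locus
    p_b = freqs["AB"] + freqs["aB"]   # frequency of allele B at the second locus
    d = freqs["AB"] - p_a * p_b
    # D' divides D by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


d, d_prime, r2 = ld_indices({"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3})
print(round(d, 4), round(d_prime, 4), round(r2, 4))
```

Because the indices are functions of the two-loci haplotype frequencies only, the same function can be applied to the frequencies obtained from each set of multi-loci.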
In the above embodiment, variances and confidence intervals of each LD index are computed separately using the BCa method according to the nonparametric bootstrap method and the BCa method according to the parametric bootstrap method.
Thus, variances and confidence intervals of the LD indices can be obtained separately using the BCa method according to the nonparametric bootstrap method and the BCa method according to the parametric bootstrap method. The confidence intervals of the LD indices that are obtained separately can be used to evaluate the validity of the maximum likelihood estimate of each LD index.
In the above embodiment, the individual's multi-loci diplotype posterior probabilities are computed for each set of successive multi-loci using successive multi-loci data based on the results of maximum likelihood estimation of the haplotype frequencies obtained by the maximum likelihood estimation process, and are converted into the individual's diplotype posterior probabilities between the two specific loci. Variances and confidence intervals of the multi-loci diplotype posterior probabilities are computed using the successive multi-loci data, and are converted into information of the two specific loci. Then, determination is performed as to whether the maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci. Then, based on the results of the verification and the results of the determination as to whether the maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci, the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and the corresponding individual's diplotype posterior probabilities between the two specific loci are stored into the confidence interval estimation result storage unit 53 in a manner that these data elements are associated with one another. 
Based on the results of the verification and the results of the determination performed for each set of successive multi-loci, the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit 53 are compared with one another, and the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci to be adopted are specified. Then, the individual's diplotype posterior probabilities between the two specific loci stored as being associated with the adopted confidence intervals are specified.
As a result, an individual's diplotype posterior probabilities between two loci can be obtained using multi-loci genotype data. This enables analysis to be performed effectively using experimental data. For the results for which the validity of the maximum likelihood estimate is verified, the confidence intervals obtained using various numbers of loci including the two specific loci are compared with one another, so that the diplotype posterior probabilities with higher accuracy are obtained.
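The computation of an individual's diplotype posterior probabilities and their conversion to the two specific loci can be sketched as follows. The sketch assumes Hardy-Weinberg proportions for the prior probability of a diplotype and a fully observed genotype (no missing data); both are simplifying assumptions, and all function names and frequencies are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def genotype_of(h1, h2):
    """Unphased genotype implied by a pair of haplotypes."""
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))


def diplotype_posteriors(hap_freqs, genotype):
    """Posterior probability of each diplotype consistent with the genotype."""
    weights = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        if genotype_of(h1, h2) == genotype:
            # A heterozygous diplotype can arise in two parental orders.
            mult = 1.0 if h1 == h2 else 2.0
            weights[(h1, h2)] = mult * hap_freqs[h1] * hap_freqs[h2]
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()} if total > 0 else {}


def to_two_loci_diplotypes(posteriors, ia, ib):
    """Sum the posteriors of multi-loci diplotypes sharing a two-loci diplotype."""
    out = defaultdict(float)
    for (h1, h2), p in posteriors.items():
        key = tuple(sorted((h1[ia] + h1[ib], h2[ia] + h2[ib])))
        out[key] += p
    return dict(out)


hap_freqs = {"AAA": 0.4, "BBB": 0.3, "ABB": 0.2, "BAA": 0.1}
genotype = (("A", "B"), ("A", "B"), ("A", "B"))   # heterozygous at all three loci
post = diplotype_posteriors(hap_freqs, genotype)
post_12 = to_two_loci_diplotypes(post, 1, 2)      # second and third loci
post_01 = to_two_loci_diplotypes(post, 0, 1)      # first and second loci
print(post, post_12, post_01)
```

In this example both consistent three-loci diplotypes imply the same diplotype between the second and third loci, so that two-loci diplotype receives a posterior probability of 1, whereas the diplotype between the first and second loci remains uncertain.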
In the above embodiment, variances and confidence intervals of the diplotype posterior probabilities are computed separately using the nonparametric bootstrap method and the parametric bootstrap method.
As a result, variances and confidence intervals of the diplotype posterior probabilities can be obtained separately using the nonparametric bootstrap method and the parametric bootstrap method. The confidence intervals of the diplotype posterior probabilities obtained separately can be used to evaluate the validity of the maximum likelihood estimates of the diplotype posterior probabilities.
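The difference between the nonparametric and the parametric bootstrap can be illustrated with a deliberately simplified sketch. Instead of diplotype posterior probabilities, the statistic here is the frequency of a single directly observed haplotype, and a plain percentile interval is used rather than the BCa method; both substitutions are assumptions made to keep the example short. All names and the sample data are hypothetical.

```python
import random
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def nonparametric_bootstrap(sample, n_boot, rng):
    """Resample the observed data with replacement and recompute the estimate."""
    n = len(sample)
    return [mean([sample[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]


def parametric_bootstrap(p_hat, n, n_boot, rng):
    """Simulate new data from the fitted model and recompute the estimate."""
    return [sum(rng.random() < p_hat for _ in range(n)) / n
            for _ in range(n_boot)]


def percentile_interval(reps, level=0.95):
    s = sorted(reps)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return lo, hi


rng = random.Random(0)
sample = [1] * 60 + [0] * 140          # 1 = chromosome carries the haplotype
p_hat = mean(sample)                   # maximum likelihood estimate, 0.3
np_reps = nonparametric_bootstrap(sample, 500, rng)
p_reps = parametric_bootstrap(p_hat, len(sample), 500, rng)
var_np = statistics.variance(np_reps)
var_p = statistics.variance(p_reps)
lo, hi = percentile_interval(np_reps)
mle_inside = lo <= p_hat <= hi         # determination step for the interval
print(var_np, var_p, (lo, hi), mle_inside)
```

When the model is adequate, the two variances should be similar; a marked disagreement between them is the kind of discrepancy the verification is intended to detect.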
The above embodiment may be modified in the following forms.
In the above embodiment, the observed information matrix, the empirical information matrix, the nonparametric bootstrap method, and the parametric bootstrap method are used to compute variances and confidence intervals, and the obtained variances are compared with one another to verify the estimation confidence. The methods for obtaining the variances and the confidence intervals should not be limited to those methods but may be other methods that can evaluate the validity of the model or the sample amount. The methods other than the above listed methods may be used to compute variances and confidence intervals, and the computed variances and confidence intervals may be used to evaluate the validity of the model or the sample amount.
In the above embodiment, the BCa method in accordance with the nonparametric bootstrap method and the BCa method in accordance with the parametric bootstrap method are used to compute variances and confidence intervals of the LD indices. However, the methods for computing variances and confidence intervals for evaluating the LD indices should not be limited to those methods. Methods other than the above listed methods may be used to compute variances and confidence intervals and the computed variances and confidence intervals may be used in evaluation.
In the above embodiment, the different methods are used to obtain variances and confidence intervals of haplotype frequencies between two specific loci. The variances of the haplotype frequencies between the two specific loci are compared with one another to perform verification, and determination is performed as to whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals. Then, the two-loci haplotype frequencies to be adopted are specified from those adopted based on the results of the verification and the determination based on the comparison among the confidence intervals. Only one of the verification performed by comparing variances and the determination as to whether the maximum likelihood estimates of the two specific loci haplotype frequencies fall within the confidence intervals may be performed, or none of the verification and the determination may be performed. In this case, verification is performed using another method, and the two-loci haplotype frequencies are estimated based on the result of verification performed using the other method.
In the above embodiment, the different methods are used to obtain variances of haplotype frequencies between two specific loci. The variances are compared with one another to perform verification, confidence intervals of the LD indices are obtained, and determination is performed as to whether the maximum likelihood estimate of each LD index falls within the confidence intervals. Then, the LD indices to be adopted are specified from those adopted based on the results of the verification and the determination based on the comparison among the confidence intervals. Only one of the verification performed by comparing variances and the determination as to whether the maximum likelihood estimates of the LD indices fall within the confidence intervals may be performed, or none of the verification and the determination may be performed. In this case, verification is performed using another method, and the LD indices are estimated based on the result of verification performed using the other method.
In the above embodiment, the different methods are used to obtain variances of haplotype frequencies between two specific loci. The variances are compared with one another to perform verification. Further, confidence intervals of the two specific loci diplotype posterior probabilities are obtained, and determination is performed as to whether the maximum likelihood estimates of the two specific loci diplotype posterior probabilities fall within the confidence intervals. Then, the two specific loci diplotype posterior probabilities to be adopted are specified from those adopted based on the results of the verification and the determination based on the comparison among the confidence intervals. Only one of the verification performed by comparing variances and the determination as to whether the maximum likelihood estimates of the two specific loci diplotype posterior probabilities fall within the confidence intervals may be performed, or none of the verification and the determination may be performed. In this case, verification is performed using another method, and the two specific loci diplotype posterior probabilities are estimated based on the result of verification performed using the other method.
Although the above embodiment describes a case in which the two subject loci are successive as exemplified in Fig. 4, the two subject loci do not necessarily have to be successive. Hereafter, a case in which the two subject loci are not successive will be described. Even when the two subject loci are not successive, the statistical genetics analysis is performed in the same manner by performing the same processing as the processing performed when the two subject loci are successive.
More specifically, when the user specifies two subject loci via input in "[1] User Setting" in the above embodiment, the user may specify either two successive loci or two loci that are not successive as the two subject loci. In "[2] Input of Collection of Multi-Loci Genotype Data of Individuals" in the above embodiment, a collection of multi-loci genotype data of individuals for sets of multi-loci including the specified two subject loci is input.
In "[3] Generation of Data of Possible Sets of Successive Loci Including Two subject loci", genotype data of each possible set of multi-loci including the two subject loci (multi-loci data as recited in the claims) is generated. Here, as the "genotype data of each possible set of multi-loci including the two subject loci", locus data for each of the multi-loci in the same set may be generated in the order of positions of the loci in the set. In other words, the multi-loci for which such locus data is generated may be successive or may not be successive (may be intermittent) in the original set of the multi-loci.
Fig. 5 shows examples of sets of multi-loci, each of which includes two subject loci that are not successive. Two loci that are not successive (loci j and k) are the subject loci, and data of possible sets of multi-loci including the two loci (multi-loci data described in the claims) is generated. More specifically, symbol (a) shows a haplotype that is formed by two loci j and k. Symbol (b) shows a haplotype that is formed by three loci j, j + 1, and k. Symbol (c) shows a haplotype that is formed by four loci j, j + 1, k, and k + 1. Symbol (d) shows a haplotype that is formed by four loci j - 1, j, k - 1, and k. Symbol (e) shows a haplotype that is formed by five loci j - 1, j, j + 1, k, and k + 1. Symbol (f) shows a haplotype that is formed by six loci j - 1, j, j + 1, k - 1, k, and k + 1. Symbol (g) shows a haplotype that is formed by six loci j, j + 1, j + 2, k - 1, k, and k + 1.
The same processing as in the above embodiment is performed using the "genotype data of possible sets of multi-loci including the two subject loci" generated in this way. More specifically, the "genotype data of possible sets of multi-loci including the two subject loci" generated in this way corresponds to the "successive multi-loci data" in the above embodiment.
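One way to enumerate such sets of multi-loci is sketched below. The sketch assumes, as one possible reading of the examples in Fig. 5, that each set is the union of a run of consecutive loci containing locus j and a run of consecutive loci containing locus k, with the total number of loci not exceeding a maximum haplotype length. This enumeration rule, the zero-based locus indices, and the function names are assumptions and not the method defined in the embodiment.

```python
def runs_containing(pos, n_loci, max_len):
    """All runs (start, end) of consecutive loci that contain pos and have
    at most max_len loci."""
    runs = []
    for start in range(max(0, pos - max_len + 1), pos + 1):
        for end in range(pos, min(n_loci - 1, start + max_len - 1) + 1):
            runs.append((start, end))
    return runs


def candidate_loci_sets(j, k, n_loci, max_len):
    """Sets of loci formed from a run around j and a run around k."""
    sets = set()
    for sj, ej in runs_containing(j, n_loci, max_len):
        for sk, ek in runs_containing(k, n_loci, max_len):
            loci = sorted(set(range(sj, ej + 1)) | set(range(sk, ek + 1)))
            if len(loci) <= max_len:
                sets.add(tuple(loci))
    # Shorter sets first, then in order of position.
    return sorted(sets, key=lambda s: (len(s), s))


sets = candidate_loci_sets(j=2, k=6, n_loci=10, max_len=3)
print(sets)
```

When the two subject loci are adjacent, the two runs always merge into a single run, so the same function yields only sets of successive loci, which corresponds to the case exemplified in Fig. 4.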
In this way, multi-loci haplotype frequencies, two-loci haplotype frequencies, two-loci LD indices, individual's multi-loci diplotype posterior probabilities, and individual's two-loci diplotype posterior probabilities, and their variances and confidence intervals are computed for the "possible sets of multi-loci including the two subject loci" and the "two subject loci" in the same manner as in the above embodiment. Using these results, the evaluation of the confidence intervals and the comparison of the confidence interval estimation results are performed in the same manner as in the above embodiment. Based on these results, the result of the statistical genetics analysis is output. Thus, in the same manner as in the above embodiment, the user is enabled to obtain the result of the statistical genetics analysis by specifying the two subject loci not only when the two subject loci are successive but also when the two subject loci are not successive.
In the above embodiment, the above-described processing is performed by the LD index computation device 20. Instead of this, the same processing may be performed in a distributed environment.
[Example]
The present invention will now be described in more detail by way of examples. However, the examples described below should not limit the present invention in any way.
To verify that the use of multi-loci improves the estimation accuracy of the haplotype frequencies, the LD indices, and the individual's diplotypes, simulation data was generated and analyzed. In the simulation data, the number of loci was 12, the number of haplotypes was 7, and the number of alleles at each locus was 2, and the parameters were generated such that each haplotype frequency was greater than or equal to 1% and each allele frequency was greater than or equal to 5%. Based on such data, data of each of 100 individuals was given missing portions with missing rates of 0%, 5%, and 10%. The data for each of the 100 individuals was analyzed, and the analysis results were compiled.
[Analysis Method]
Fig. 6 shows the results of analysis performed using the generated simulation data with the maximum haplotype length being set at 2 (two loci) and 6 (six loci). Here, "maximum haplotype length" means that computation was performed for each possible haplotype that includes the two subject loci, with lengths ranging up to the maximum haplotype length. Fig. 6 shows the results of estimation for which the validity of the haplotype frequency estimation was not checked.
Fig. 6(a) shows the estimation accuracy comparison of the two-loci haplotype frequencies. Fig. 6(b) shows the estimation accuracy comparison of the two-loci diplotypes. Fig. 6(c) shows the estimation accuracy comparison of ρ² (r²). Fig. 6(d) shows the estimation accuracy comparison of D'. In Fig. 6(a) showing the estimation accuracy comparison of the two-loci haplotype frequencies, the results are classified into three categories: the difference between the true frequency generated by simulation and the frequency estimated using the statistical genetics analysis system of the present invention is 0.005 or less; the difference is greater than 0.005 and 0.015 or less; and the difference is greater than 0.015. In Fig. 6(b) showing the estimation accuracy comparison of the two-loci diplotypes, the results are classified into three categories: the true diplotype is correctly estimated at a posterior probability of 1.0 (match); the true diplotype is correctly estimated not at a posterior probability of 1.0 but at the maximum posterior probability (maximum estimation); and the diplotype with the maximum posterior probability differs from the true diplotype (mismatch). In Fig. 6(c) showing the estimation accuracy comparison of ρ² (r²) and Fig. 6(d) showing the estimation accuracy comparison of D', each index is compared with its true value, and the results are classified into three categories: the index is estimated with a difference of 0.005 or less; the index is estimated with a difference greater than 0.005 and 0.015 or less; and the index differs from the true value by more than 0.015.
As shown in Fig. 6, the results of the analysis using such simulation data revealed that the estimation accuracy was higher when six loci were used than when two loci were used for the two-loci haplotype frequencies, the two-loci diplotypes, and the indices ρ² (r²) and D'.
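The categories used in Fig. 6 can be expressed by the following sketch. The thresholds 0.005 and 0.015 and the three diplotype categories are taken from the description above; the function names and the example values are hypothetical.

```python
def accuracy_category(true_value, estimate):
    """Category of an estimated frequency or LD index by absolute difference."""
    diff = abs(true_value - estimate)
    if diff <= 0.005:
        return "diff <= 0.005"
    if diff <= 0.015:
        return "0.005 < diff <= 0.015"
    return "diff > 0.015"


def diplotype_category(true_diplotype, posteriors):
    """Category of a diplotype estimate from its posterior probabilities."""
    best = max(posteriors, key=posteriors.get)
    if best != true_diplotype:
        return "mismatch"
    return "match" if posteriors[best] == 1.0 else "maximum estimation"


print(accuracy_category(0.30, 0.302))
print(accuracy_category(0.30, 0.31))
print(accuracy_category(0.30, 0.35))
print(diplotype_category(("AA", "BB"), {("AA", "BB"): 0.8, ("AB", "BA"): 0.2}))
```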

Claims

1. A statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals, wherein the statistical genetics analysis system includes a computer, the statistical genetics analysis system being characterized in that: the computer functions as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two loci haplotype frequencies; and a means for estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
2. A statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals, wherein the statistical genetics analysis system includes a computer, the statistical genetics analysis system being characterized in that: the computer functions as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for determining whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed through a predetermined one of the plurality of different methods, and a process for storing in a confidence interval estimation result storage unit the confidence intervals of the haplotype frequencies between the two specific loci and corresponding two-loci haplotype frequencies in an associated manner based on the comparison process and the determination process; and a means for 
comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
3. A statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals, wherein the statistical genetics analysis system includes a computer, the statistical genetics analysis system being characterized in that: the computer functions as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices; and a means for estimating linkage disequilibrium indices based on the maximum likelihood estimates of the linkage disequilibrium indices stored for each of the possible multiple loci including the two specific loci.
4. A statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals, wherein the statistical genetics analysis system includes a computer, the statistical genetics analysis system being characterized in that: the computer functions as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for computing variances and confidence intervals of linkage disequilibrium indices, a process for determining whether maximum likelihood estimates of the linkage disequilibrium indices fall within the confidence intervals of the linkage disequilibrium indices, a process for storing confidence intervals of the linkage disequilibrium indices and corresponding linkage disequilibrium indices into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process; 
and a means for comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence interval.
5. A statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals, wherein the statistical genetics analysis system includes a computer, the statistical genetics analysis system being characterized in that: the computer functions as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci; and a means for estimating an individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
6. A statistical genetics analysis system for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals, wherein the statistical genetics analysis system includes a computer, the statistical genetics analysis system being characterized in that: the computer functions as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed with the plurality of different methods, a process for computing variances and confidence intervals of diplotype posterior probabilities of the multiple loci including the two specific loci using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for determining whether maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the 
individual's diplotype posterior probabilities between the two specific loci, and a process for storing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and corresponding individual's diplotype posterior probabilities between the two specific loci into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process; and a means for comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
7. The statistical genetics analysis system according to any one of claims 2, 4, and 6, characterized in that: the plurality of different methods for computing the variances and the confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci include an observed information matrix, an empirical information matrix, a nonparametric bootstrap method, and a parametric bootstrap method; and the comparison process includes comparison of variances obtained through the observed information matrix and variances obtained through the empirical information matrix, comparison of variances obtained through the nonparametric bootstrap method and variances obtained through the parametric bootstrap method, and comparison of variances obtained through the observed information matrix and variances obtained through the nonparametric bootstrap method.
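The variance comparison recited in claim 7 can be sketched in a deliberately simplified setting. The sketch below assumes that haplotype phase is known, so that the maximum likelihood estimate of a haplotype frequency is just the sample proportion and the information-based variance reduces to p(1 - p)/n. The claimed system works on unphased genotype data, where each bootstrap replicate would require re-estimating the frequencies. All names and the sample are hypothetical.

```python
import random
from statistics import variance


def freq(sample, hap):
    """Sample proportion of a haplotype (the MLE when phase is known)."""
    return sample.count(hap) / len(sample)


def information_var(sample, hap):
    """Variance from the inverse information of a binomial proportion."""
    p = freq(sample, hap)
    return p * (1.0 - p) / len(sample)


def nonparametric_bootstrap_var(sample, hap, n_boot, rng):
    """Resample the observed haplotypes with replacement."""
    n = len(sample)
    reps = [freq(rng.choices(sample, k=n), hap) for _ in range(n_boot)]
    return variance(reps)


def parametric_bootstrap_var(sample, hap, n_boot, rng):
    """Simulate new samples from the fitted haplotype frequencies."""
    n = len(sample)
    haps = sorted(set(sample))
    weights = [freq(sample, h) for h in haps]
    reps = [freq(rng.choices(haps, weights=weights, k=n), hap)
            for _ in range(n_boot)]
    return variance(reps)


# Hypothetical phased sample of 200 two-locus haplotypes.
sample = ["00"] * 90 + ["01"] * 30 + ["10"] * 20 + ["11"] * 60
rng = random.Random(0)
v_info = information_var(sample, "11")
v_np = nonparametric_bootstrap_var(sample, "11", 2000, rng)
v_p = parametric_bootstrap_var(sample, "11", 2000, rng)
# Ratios near 1 indicate that the methods agree.
ratios = (v_np / v_info, v_p / v_np, v_p / v_info)
```

With known phase the two bootstrap schemes draw from the same distribution, so close agreement is expected; large discrepancies between methods are more informative in the unphased case, where the estimates depend on the inferred phase.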
8. The statistical genetics analysis system according to claim 4, characterized in that: the variances and the confidence intervals of the linkage disequilibrium indices are computed through methods including a BCa method in accordance with a nonparametric bootstrap method and a BCa method in accordance with a parametric bootstrap method.
9. The statistical genetics analysis system according to claim 6, characterized in that: the variances and the confidence intervals of the diplotype posterior probabilities are computed through methods including a nonparametric bootstrap method and a parametric bootstrap method.
10. A method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the method being characterized in that: the computer executes the steps of: generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies; and estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
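The first two processes of claim 10 (maximum likelihood estimation of multi-locus haplotype frequencies, then conversion to two-locus frequencies) might look like the sketch below. It uses an expectation-maximization (EM) iteration, a common way to obtain such estimates from unphased genotypes; this section does not state which algorithm the applicant uses, so EM is an assumption here, as are the biallelic `0`/`1` coding, Hardy-Weinberg proportions, and all function names. The final step of combining stored results across locus sets is not shown because its rule is not specified in this claim.

```python
from itertools import product


def phasings(genotype):
    """Unordered haplotype pairs consistent with per-locus allele-'1' counts."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = ["1" if g == 2 else "0" for g in genotype]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies from unphased genotypes."""
    n_loci = len(genotypes[0])
    haps = ["".join(bits) for bits in product("01", repeat=n_loci)]
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    expansions = [phasings(g) for g in genotypes]
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        # E-step: distribute each individual over its compatible phasings.
        for pairs in expansions:
            weights = [freq[a] * freq[b] * (1.0 if a == b else 2.0)
                       for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by chromosome count.
        new = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:
            break
    return freq


def marginalize(freq, i, j):
    """Convert multi-locus haplotype frequencies to frequencies at loci i, j."""
    out = {}
    for h, f in freq.items():
        key = h[i] + h[j]
        out[key] = out.get(key, 0.0) + f
    return out


# Hypothetical three-locus genotypes; each has at most one heterozygous
# locus, so phase is unambiguous and the result can be checked by hand.
genotypes = [(0, 0, 0), (0, 0, 0), (2, 2, 2), (1, 0, 0), (2, 1, 2)]
freq3 = em_haplotype_frequencies(genotypes)
freq2 = marginalize(freq3, 0, 2)
```

In the claimed method this estimate-and-convert pair would be repeated for each candidate set of loci containing the two specific loci, with each two-locus result stored for the later estimation step.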
11. A method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the method being characterized in that: the computer executes the steps of: generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for determining whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed through a predetermined one of the plurality of different methods, and a process for storing in a confidence interval estimation result storage unit the confidence intervals of the haplotype frequencies between the two specific loci and corresponding two-loci haplotype frequencies in an associated manner based on the comparison process and the determination process; and comparing confidence intervals of the haplotype frequencies between the two specific loci stored in the confidence interval estimation result 
storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
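The determination and adoption steps of claim 11 can be illustrated with a small sketch. The claim does not state the criterion by which stored confidence intervals are compared, so the rule used here (keep only locus sets whose estimate lies inside its own interval, then adopt the narrowest interval) is purely an assumption for illustration. The function name, the layout of the stored records, and the numbers are hypothetical.

```python
def adopt_interval(stored):
    """Pick one stored (estimate, interval) record across locus sets.

    stored maps a locus set (tuple of locus indices) to
    (estimate, (lower, upper)). Assumed rule: discard records whose
    estimate falls outside its interval, then adopt the narrowest interval.
    Returns (locus_set, (estimate, (lower, upper))) or None.
    """
    candidates = {
        loci: (est, ci)
        for loci, (est, ci) in stored.items()
        if ci[0] <= est <= ci[1]
    }
    if not candidates:
        return None
    best = min(candidates, key=lambda k: candidates[k][1][1] - candidates[k][1][0])
    return best, candidates[best]


# Hypothetical stored results for one two-locus haplotype frequency,
# each obtained from a different set of loci containing loci 0 and 2.
stored = {
    (0, 1, 2): (0.40, (0.31, 0.49)),
    (0, 2, 3): (0.42, (0.36, 0.47)),
    (0, 2, 4): (0.55, (0.40, 0.50)),  # estimate lies outside its interval
}
adopted = adopt_interval(stored)
```

Keeping the estimate and its interval together in one record mirrors the claim's requirement that they be stored "in an associated manner", so that adopting an interval also identifies the frequency to report.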
12. A method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the method being characterized in that: the computer executes the steps of: generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices; and estimating linkage disequilibrium indices based on the maximum likelihood estimates of the linkage disequilibrium indices stored for each of the possible multiple loci including the two specific loci.
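Claim 12 computes linkage disequilibrium indices from the converted two-locus haplotype frequencies. This section does not say which indices are meant; the sketch below assumes the widely used D, D' and r² for two biallelic loci, with haplotypes keyed as two-character strings of `0`/`1`. The function name and the input values are hypothetical.

```python
def ld_indices(freq):
    """Return (D, D', r^2) from two-locus haplotype frequencies.

    freq maps "00", "01", "10", "11" to frequencies summing to 1.
    D' is returned with the sign of D.
    """
    p11 = freq.get("11", 0.0)
    p10 = freq.get("10", 0.0)
    p01 = freq.get("01", 0.0)
    pa = p11 + p10  # frequency of allele '1' at the first locus
    pb = p11 + p01  # frequency of allele '1' at the second locus
    d = p11 - pa * pb
    # D is normalized by its largest attainable magnitude given pa and pb.
    if d >= 0:
        d_max = min(pa * (1.0 - pb), (1.0 - pa) * pb)
    else:
        d_max = min(pa * pb, (1.0 - pa) * (1.0 - pb))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = pa * (1.0 - pa) * pb * (1.0 - pb)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical two-locus haplotype frequencies (illustrative values only).
d, d_prime, r2 = ld_indices({"00": 0.5, "10": 0.1, "11": 0.4, "01": 0.0})
```

Because these indices are functions of the haplotype frequencies alone, plugging in maximum likelihood frequency estimates yields maximum likelihood estimates of the indices, which matches the wording of the claim.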
13. A method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the method being characterized in that: the computer executes the steps of: generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for computing variances and confidence intervals of linkage disequilibrium indices, a process for determining whether maximum likelihood estimates of the linkage disequilibrium indices fall within the confidence intervals of the linkage disequilibrium indices, a process for storing confidence intervals of the linkage disequilibrium indices and corresponding linkage disequilibrium indices into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process; and comparing confidence intervals of the linkage disequilibrium indices stored in the confidence interval estimation result 
storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
14. A method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the method being characterized in that: the computer executes the steps of: generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci; and estimating individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
15. A method for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the method being characterized in that: the computer executes the steps of: generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed with the plurality of different methods, a process for computing variances and confidence intervals of diplotype posterior probabilities of the multiple loci including the two specific loci using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for determining whether maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between the two specific loci, and a process for storing confidence intervals of the 
individual's diplotype posterior probabilities between the two specific loci and corresponding individual's diplotype posterior probabilities between the two specific loci into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process; and comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
16. A statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the program being characterized in that: the program causes the computer to function as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, and a process for storing the converted two-loci haplotype frequencies; and a means for estimating haplotype frequencies between the two specific loci based on the two-loci haplotype frequencies stored for each of the possible multiple loci including the two specific loci.
17. A statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the program being characterized in that: the program causes the computer to function as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing variances and confidence intervals of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for determining whether the maximum likelihood estimates of the haplotype frequencies between the two specific loci fall within the confidence intervals of the haplotype frequencies between the two specific loci computed through a predetermined one of the plurality of different methods, and a process for storing in a confidence interval estimation result storage unit the confidence intervals of the haplotype frequencies between the two specific loci and corresponding two-loci haplotype frequencies in an associated manner based on the comparison process and the determination process; and a means for comparing confidence intervals of the haplotype frequencies between 
the two specific loci stored in the confidence interval estimation result storage unit for the possible multiple loci including the two specific loci and specifying confidence intervals that are adopted to specify two-loci haplotype frequencies stored in association with the specified confidence intervals.
18. A statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the program being characterized in that: the program causes the computer to function as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, and a process for storing the computed maximum likelihood estimates of the linkage disequilibrium indices; and a means for estimating linkage disequilibrium indices based on the maximum likelihood estimates of the linkage disequilibrium indices stored for each of the possible multiple loci including the two specific loci.
19. A statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the program being characterized in that: the program causes the computer to function as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for converting the computed maximum likelihood estimates of the multi-loci haplotype frequencies into haplotype frequencies between two loci, a process for computing maximum likelihood estimates of linkage disequilibrium indices using the converted two-loci haplotype frequencies, a process for computing variances of the haplotype frequencies of the multiple loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed through the plurality of different methods, a process for computing variances and confidence intervals of linkage disequilibrium indices, a process for determining whether maximum likelihood estimates of the linkage disequilibrium indices fall within the confidence intervals of the linkage disequilibrium indices, a process for storing confidence intervals of the linkage disequilibrium indices and corresponding linkage disequilibrium indices into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process; and a means for comparing confidence intervals of the linkage 
disequilibrium indices stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci and specifying confidence intervals of the linkage disequilibrium indices that are adopted to specify linkage disequilibrium indices stored in association with the specified confidence intervals.
20. A statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the program being characterized in that: the program causes the computer to function as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, and a process for storing the converted individual's diplotype posterior probabilities between the two specific loci; and a means for estimating individual's diplotype posterior probabilities between the two specific loci based on the individual's diplotype posterior probabilities between the two specific loci stored for each of the possible multiple loci including the two specific loci.
21. A statistical genetics analysis program for performing a statistical genetics analysis with a collection of multi-loci genotype data of individuals using a computer, the program being characterized in that: the program causes the computer to function as: a means for generating multi-loci data, which is genotype data of possible multiple loci including two specific loci, based on the multi-loci genotype data of individuals; a means for performing a process for computing maximum likelihood estimates of haplotype frequencies of the multiple loci including the two specific loci with the multi-loci data for each of the possible multiple loci including the two specific loci, a process for computing individual's diplotype posterior probabilities of the multiple loci including the two specific loci based on a result of the computed maximum likelihood estimates of multi-loci haplotype frequencies and converting the computed posterior probabilities into individual's diplotype posterior probabilities between the two specific loci, a process for computing variances of haplotype frequencies of the multi-loci including the two specific loci through a plurality of different methods using the multi-loci data and converting the computed variances into information relating to the two specific loci, a process for comparing variances of the haplotype frequencies between the two specific loci computed with the plurality of different methods, a process for computing variances and confidence intervals of diplotype posterior probabilities of the multiple loci including the two specific loci using the multi-loci data and converting the computed variances and the computed confidence intervals into information relating to the two specific loci, a process for determining whether maximum likelihood estimates of the individual's diplotype posterior probabilities between the two specific loci fall within the confidence intervals of the individual's diplotype posterior probabilities between 
the two specific loci, and a process for storing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci and corresponding individual's diplotype posterior probabilities between the two specific loci into a confidence interval estimation result storage unit in an associated manner based on the comparison process and the determination process; and a means for comparing confidence intervals of the individual's diplotype posterior probabilities between the two specific loci stored in the confidence interval estimation result storage unit for each of the possible multiple loci including the two specific loci to specify confidence intervals of the individual's diplotype posterior probabilities between the two specific loci that are to be adopted and specify individual's diplotype posterior probabilities between the two specific loci stored in association with the specified confidence intervals.
PCT/JP2006/307287 2005-03-31 2006-03-30 Statistical genetics analysis system, statistical genetics analysis method, and statistical genetics analysis program WO2006104263A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP06731235A EP1864235A2 (en) 2005-03-31 2006-03-30 Statistical genetics analysis system, statistical genetics analysis method, and statistical genetics analysis program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005102107 2005-03-31
JP2005-102107 2005-03-31

Publications (2)

Publication Number Publication Date
WO2006104263A2 WO2006104263A2 (en) 2006-10-05
WO2006104263A8 WO2006104263A8 (en) 2008-01-24

Family

ID=36954711

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/307287 WO2006104263A2 (en) 2005-03-31 2006-03-30 Statistical genetics analysis system, statistical genetics analysis method, and statistical genetics analysis program

Country Status (2)

Country Link
EP (1) EP1864235A2 (en)
WO (1) WO2006104263A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020077775A1 (en) * 2000-05-25 2002-06-20 Schork Nicholas J. Methods of DNA marker-based genetic analysis using estimated haplotype frequencies and uses thereof
GB0021667D0 (en) * 2000-09-04 2000-10-18 Glaxo Group Ltd Genetic study

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No Search *

Also Published As

Publication number Publication date
EP1864235A2 (en) 2007-12-12
WO2006104263A8 (en) 2008-01-24


Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2006731235

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

WWP Wipo information: published in national office

Ref document number: 2006731235

Country of ref document: EP