JPWO2005030952A1 - Haplotype analysis method

Info

Publication number: JPWO2005030952A1
Application number: JP2005514327A
Authority: JP
Inventors: 関根　章博; 飯田　有俊; 斎藤　督; 中村　祐輔; 鎌谷　直之
Original assignee: StaGen Co Ltd; RIKEN
Current assignee: StaGen Co Ltd; RIKEN
Priority date: 2003-09-30
Filing date: 2004-09-30
Publication date: 2006-12-07
Also published as: WO2005030952A1

Abstract

The present invention provides a haplotype analysis method comprising the following steps: (a) detecting genetic polymorphisms in at least one of the drug-related genes shown in Table 1, obtained from a test population; (b) processing the detection information to select Common polymorphisms; (c) constructing haplotype blocks using the selected Common polymorphisms; (d) identifying the Common polymorphisms within the haplotype blocks; and (e) identifying tag polymorphisms from the Common polymorphisms within the haplotype blocks.

Description

The present invention relates to a haplotype analysis method.

The response to a drug differs widely from one individual to another. A drug may markedly improve one patient's symptoms, while in another patient it may have no effect or may cause serious side effects. Analyzing the differences in drug response among individuals is therefore extremely important for selecting treatment methods and drugs.
Individual differences in drug response must be considered alongside sex, age, disease and other factors, but they are at least partly attributable to genetic information.
Earlier studies showed that genetic variation in pseudocholinesterase and N-acetylesterase affects the metabolism and the effects of succinylcholine and isoniazid, respectively. Associations between drug response and genetic variation have since been reported for other enzymes, receptors and transporters (Evans, W. E. & McLeod, H. L. Pharmacogenomics: drug disposition, drug targets, and side effects. N. Engl. J. Med. 348, 538-549 (2003)).
With advances in genetic engineering, genetic variation relevant to drugs can now be analyzed more readily at the nucleotide-sequence level. For example, such associations have been shown between the thiopurine S-methyltransferase gene and 6-mercaptopurine, the dihydropyrimidine dehydrogenase gene and 5-fluorouracil, the cytochrome P-450 enzyme genes (CYP2D6, CYP2C9, CYP2C19) and debrisoquine-containing drugs, the multidrug resistance gene and anti-HIV therapy, the angiotensin-converting enzyme (ACE) gene and ACE inhibitors, the β₂-adrenergic receptor gene and agonists, and the estrogen receptor gene and estrogen.
Recently, research on the relationship between genetic variation and drug response has entered a new stage. Rapid progress in human genome research has made it possible to genotype large numbers of polymorphisms, such as single nucleotide polymorphisms (SNPs), in a short time. Furthermore, of an estimated 10,000,000 common SNPs, 4,000,000 have already been identified. As a result, rather than testing polymorphisms or genes one at a time, it has become possible to systematically identify the genetic variation that influences the effects of various drugs by testing many polymorphisms thought to be related to drug response.
The question then is which polymorphisms of which genes should be examined. The genes examined encode the enzymes involved in the metabolism of exogenous drugs, transporters, and drug targets. When a drug is administered, it is absorbed and distributed to its site of action, where it interacts with its target, and it is then metabolized and excreted. Individual differences in drug response arise at each of these stages. It is now clear that virtually every one of these pathways is subject to genetic variation.
When data on many loci in the human genome are analyzed to relate genetic variation to phenotype, knowledge of population genetics is required in addition to molecular genetics. Recent studies of data on many SNP loci from many individuals have revealed that the human genome has a haplotype block (linkage disequilibrium (LD) block) structure (Kruglyak, L. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139-144 (1999)). That is, LD is strong within a block and the number of major haplotypes is limited, and genes involved in drug absorption, distribution, metabolism and the like are thought to have a similar haplotype block structure.
Because haplotypes or diplotypes (combinations of haplotypes), rather than SNP genotypes, have often been shown to be the main determinants of phenotype (Judson, R. & Stephens, J. C. Notes from the SNP vs. haplotype front. Pharmacogenomics 2, 7-10 (2001)), knowledge of haplotype blocks is essential. For example, haplotype analysis is useful for predicting the outcome of drug treatment with β₂-adrenergic agonists, methotrexate and sulfasalazine.
Furthermore, the concept of the haplotype is essential for fully understanding the relationship between SNPs and phenotypes. To establish a basis for research on the relationship between drug response and genetic variation, accurate knowledge of polymorphic loci, their distribution and their allele frequencies is required. This requires, in addition to single-locus data, not only estimating haplotype frequencies but also constructing haplotype blocks.
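The statement that LD is strong within a block can be made concrete with the standard pairwise LD measures D, D' and r². The following is a minimal sketch, assuming phased two-locus haplotypes are already available as two-character strings; the function name `ld_stats` and the input layout are illustrative assumptions and are not taken from this document.

```python
from collections import Counter


def ld_stats(haplotypes):
    """Return (D, |D'|, r^2) for two biallelic loci.

    `haplotypes` holds one 2-character string per chromosome, e.g. "AG"
    means allele A at locus 1 and allele G at locus 2 (assumed layout).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    alleles1 = sorted({h[0] for h in haplotypes})
    alleles2 = sorted({h[1] for h in haplotypes})
    if len(alleles1) != 2 or len(alleles2) != 2:
        raise ValueError("both loci must be biallelic and polymorphic")

    a, b = alleles1[0], alleles2[0]  # reference allele at each locus
    p_a = sum(c for h, c in counts.items() if h[0] == a) / n
    p_b = sum(c for h, c in counts.items() if h[1] == b) / n
    p_ab = counts[a + b] / n

    # D is the excess of the observed haplotype frequency over the
    # frequency expected if the two loci were independent.
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Two loci that always travel together: complete LD.
    print(ld_stats(["AG"] * 6 + ["CT"] * 4))
```

Values of |D'| and r² close to 1 for most SNP pairs in a region are the usual sign of a block; any cutoff used to call a block would be a separate choice that this sketch does not make.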

An object of the present invention is to provide a haplotype analysis method. Specifically, the object is to determine the genotypes of individuals' DNA for polymorphisms in genes encoding drug-related proteins, such as drug-metabolizing enzymes and transporters, and, based on the knowledge obtained from that analysis, to establish an optimized method for population-based pharmacogenetic research.
As a result of intensive research to solve the above problem, in which SNPs in genes encoding drug-related proteins such as drug-metabolizing enzymes and transporters were analyzed in a large population, the present inventors found that the problem can be solved, and thereby completed the present invention.
That is, the present invention is as follows.
(1) A haplotype analysis method comprising the following steps:
(a) detecting genetic polymorphisms in at least one of the drug-related genes shown in Table 1, obtained from a test population;
(b) processing the detection information to select Common polymorphisms;
(c) constructing haplotype blocks using the selected Common polymorphisms;
(d) identifying the Common polymorphisms within the haplotype blocks; and
(e) identifying tag polymorphisms from the Common polymorphisms within the haplotype blocks.
(2) The method according to (1), wherein the genetic polymorphism is a single nucleotide polymorphism, a polymorphism resulting from deletion, substitution or insertion of a plurality of bases, or a VNTR or microsatellite polymorphism.
(3) The method according to (1), wherein the tag polymorphism is at least one selected from those shown in the “htSNPs” column of the “Block” section of Table 3.
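Step (e) of (1) can be illustrated with a small search for tag polymorphisms. The sketch below assumes the major haplotypes of one block are given as equal-length allele strings over the block's Common polymorphisms, and it uses a plain exhaustive search for the smallest set of sites that distinguishes them. The function name `select_tag_snps` and the exhaustive strategy are assumptions made for illustration; this passage does not state which selection algorithm is used.

```python
from itertools import combinations


def select_tag_snps(haplotypes):
    """Return the smallest tuple of site indices that separates all haplotypes.

    `haplotypes` is a list of distinct, equal-length strings; position i of
    each string is the allele at the i-th Common polymorphism of the block.
    """
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    if len(set(haplotypes)) != len(haplotypes):
        raise ValueError("haplotypes must be distinct")
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("haplotypes must have equal length")

    # Try subsets in order of increasing size, so the first hit is minimal.
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            projected = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(projected) == len(haplotypes):
                return sites
    # Unreachable: distinct haplotypes always differ on the full site set.
    raise AssertionError("no separating subset found")


if __name__ == "__main__":
    block = ["ACGT", "ACGA", "TCGT"]  # three hypothetical major haplotypes
    print(select_tag_snps(block))     # sites 0 and 3 are enough here
```

Exhaustive search is only practical for the small number of sites typical of a single block; a larger block would call for a greedy or otherwise bounded search.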
(4) A haplotype analysis method comprising the following steps:
(a) detecting genetic polymorphisms in at least one of the drug-related genes shown in Table 1, obtained from a test population;
(b) processing the detection information to select Common polymorphisms;
(c) constructing haplotype blocks using the selected Common polymorphisms; and
(d) identifying the Common polymorphisms outside the haplotype blocks.
(5) The method according to (4), wherein the genetic polymorphism is a single nucleotide polymorphism, a polymorphism resulting from deletion, substitution or insertion of a plurality of bases, or a VNTR or microsatellite polymorphism.
(6) The method according to (4), wherein the Common polymorphism outside the haplotype blocks is at least one selected from those shown in the “Between” section of Table 3.
(7) A haplotype analysis method comprising the following steps:
(a) detecting genetic polymorphisms in drug-related genes obtained from a test population;
(b) processing the detection information to select Common polymorphisms;
(c) constructing haplotype blocks using the selected Common polymorphisms;
(d) identifying Rare polymorphisms within and/or outside the haplotype blocks; and
(e) assigning the Rare polymorphisms to major haplotypes.
(8) The method according to (7), wherein the drug-related gene is at least one selected from those shown in Table 1.
(9) The method according to (7), wherein the genetic polymorphism is a single nucleotide polymorphism, a polymorphism resulting from deletion, substitution or insertion of a plurality of bases, or a VNTR or microsatellite polymorphism.
(10) The method according to (7), wherein the Rare polymorphism is at least one selected from those shown in Table 2.
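The split between Common and Rare polymorphisms that runs through (1) to (10) can be sketched as a filter on minor allele frequency. The cutoff is not stated in this passage, so the `threshold=0.10` default below is a hypothetical placeholder, as are the function names and the input layout (one list of observed alleles per locus).

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most frequent allele (0.0 if monomorphic)."""
    counts = sorted(Counter(alleles).values(), reverse=True)
    if len(counts) < 2:
        return 0.0
    return counts[1] / sum(counts)


def classify_polymorphisms(loci, threshold=0.10):
    """Split loci into (common, rare) by minor allele frequency.

    `loci` maps a locus name to the list of alleles observed across all
    chromosomes in the test population. `threshold` is an assumed cutoff,
    not a value taken from this document. Monomorphic loci are skipped
    because they are not polymorphisms.
    """
    common, rare = [], []
    for name, alleles in loci.items():
        maf = minor_allele_frequency(alleles)
        if maf == 0.0:
            continue
        (common if maf >= threshold else rare).append(name)
    return common, rare


if __name__ == "__main__":
    observed = {
        "snp1": list("AAAAAGGGGG"),            # MAF 0.50
        "snp2": list("A" * 19 + "G"),          # MAF 0.05
        "snp3": list("CCCC"),                  # monomorphic
    }
    print(classify_polymorphisms(observed))
```

Only the loci returned as common would feed block construction in step (c); the rare ones are the candidates for assignment to major haplotypes in step (e) of (7).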
(11) A method for analyzing the association between haplotypes and susceptibility to a drug or foreign substance, or susceptibility to a disease, comprising the following steps:
(a) identifying tag polymorphisms by the method according to any one of (1) to (3) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) using the identified tag polymorphisms or a combination thereof, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.
(12) A method for analyzing the association between haplotypes and susceptibility to a drug or foreign substance, or susceptibility to a disease, comprising the following steps:
(a) identifying Common polymorphisms outside the blocks by the method according to any one of (4) to (6) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the identified Common polymorphisms, comparing the frequency of a polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype.
(13) The method according to (11), further comprising the following steps:
(a) identifying Common polymorphisms outside the blocks by the method according to any one of (4) to (6) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the identified Common polymorphisms, comparing the frequency of a polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype.
(14) A method for analyzing the association between haplotypes and susceptibility to a drug or foreign substance, or susceptibility to a disease, comprising the following steps:
(a) assigning Rare polymorphisms to specific major haplotypes by the method according to any one of (7) to (10) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the assigned Rare polymorphisms, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.
(15) The method according to any one of (11) to (13), further comprising the following steps:
(a) assigning Rare polymorphisms to specific major haplotypes by the method according to any one of (7) to (10) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the assigned Rare polymorphisms, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.
The above methods include a method of analyzing the association between haplotypes and susceptibility to a drug or foreign substance, or susceptibility to a disease, using a combination of a tag polymorphism and at least one Rare polymorphism.
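The frequency comparison in step (b) of (11) to (15) can be sketched as a test on a 2×2 table of haplotype carriers versus non-carriers in two phenotype groups. The example below uses Pearson's chi-square test without continuity correction, which is one common choice and is an assumption here; this passage does not name a statistical test. The function name and the counts in the usage example are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for [[a, b], [c, d]].

    a, b: carriers and non-carriers of the haplotype in phenotype group 1
    c, d: carriers and non-carriers of the haplotype in phenotype group 2
    Returns (chi-square statistic, p-value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column total must be positive")

    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 responders and 10 of 100
    # non-responders carry the haplotype.
    chi2, p = chi_square_2x2(30, 70, 10, 90)
    print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

With small counts an exact test would be the safer choice, and testing many haplotypes at once would require a multiple-testing correction; neither is covered by this sketch.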
(16) A method for estimating polymorphisms associated with susceptibility to a drug or foreign substance, or susceptibility to a disease, comprising the following steps:
(a) identifying tag polymorphisms by the method according to any one of (1) to (3) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease;
(b) using the identified tag polymorphisms or a combination thereof, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype;
(c) selecting the blocks involved in the difference in frequency;
(d) selecting the polymorphisms present in the selected blocks; and
(e) estimating the polymorphisms associated with the difference in frequency.
(17) A method for estimating polymorphisms associated with susceptibility to a drug or foreign substance, or susceptibility to a disease, comprising the following steps:
(a) identifying Common polymorphisms outside the blocks by the method according to any one of (4) to (6) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease;
(b) for the identified Common polymorphisms, comparing the frequency of a polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype; and
(c) estimating the polymorphisms associated with the difference in frequency.
(18) The method according to (16), further comprising the following steps:
(a) identifying Common polymorphisms outside the blocks by the method according to any one of (4) to (6) in drug-related genes collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease;
(b) for the identified Common polymorphisms, comparing the frequency of a polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype; and
(c) estimating the polymorphisms associated with the difference in frequency.
(19) A method for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following steps:
(A) assigning a Rare polymorphism to a specific major haplotype, by the method according to any one of (7) to (10), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) for the assigned Rare polymorphism, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype; and
(C) estimating a polymorphism associated with the frequency difference.
(20) The method according to any one of (16) to (18), further comprising the following steps:
(A) assigning a Rare polymorphism to a specific major haplotype, by the method according to any one of (7) to (10), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) for the assigned Rare polymorphism, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype; and
(C) estimating a polymorphism associated with the frequency difference.
This method includes a method of estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility using a combination of a tag polymorphism and at least one Rare polymorphism.
(21) A method for analyzing individual differences in phenotype related to drug or foreign substance susceptibility or disease susceptibility, comprising the step of associating an individual with a phenotype using, as an index, the analysis result obtained by the method according to any one of (11) to (15) or the estimation result obtained by the method according to any one of (16) to (20).
(22) The method according to any one of (11) to (21), wherein the drug sensitivity is sensitivity related to pharmacokinetics, drug effectiveness, or drug side effects.
(23) The method according to (22), wherein the pharmacokinetics are dynamics related to drug absorption, distribution, metabolism or excretion.
(24) The method according to (22), wherein the pharmacokinetics is kinetics related to the blood concentration of the drug.
(25) The method according to any one of (11) to (21), wherein the disease susceptibility is the presence or absence of morbidity, or the degree thereof.
(26) The method according to any one of (11) to (21), wherein the disease is at least one selected from the group consisting of malignant tumors, immune system diseases, circulatory system diseases, metabolic system diseases, renal and urinary system diseases, respiratory system diseases, and musculoskeletal diseases.
(27) A method for predicting drug or foreign substance sensitivity or disease using the analysis result or estimation result obtained by the method according to any one of (11) to (26) as an index.
(28) A method for selecting a drug for preventing or treating a disease and/or a method of preventing or treating a disease, using as an index the analysis result or the estimation result obtained by the method according to any one of (11) to (26).
(29) A method for determining an appropriate dose of a drug for preventing or treating a disease, using the analysis result or the estimation result obtained by the method according to any one of (11) to (26) as an index.
(30) A method for analyzing a drug-drug interaction using an analysis result or an estimation result obtained by the method according to any one of (11) to (26) as an index.
(31) A method for determining a related polymorphism related to drug or foreign substance or disease susceptibility using the analysis result or estimation result obtained by the method according to any one of (11) to (26) as an index.
(32) A program for analyzing a haplotype, the program causing a computer to function as the following means:
(A) means for detecting a genetic polymorphism in at least one of the drug-related genes shown in Table 1 obtained from a test population;
(B) means for processing the detection information and selecting a Common polymorphism;
(C) means for constructing a haplotype block using the identified Common polymorphism;
(D) means for identifying a Common polymorphism within the haplotype block; and
(E) means for identifying a tag polymorphism from the Common polymorphism within the haplotype block.
(33) The program according to (32), wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.
(34) The program according to (32), wherein the tag polymorphism is at least one selected from those shown in the “htSNPs” column in the “Block” section of Table 3.
(35) A program for analyzing a haplotype, the program causing a computer to function as the following means:
(A) means for detecting a genetic polymorphism in at least one of the drug-related genes shown in Table 1 obtained from a test population;
(B) means for processing the detection information and selecting a Common polymorphism;
(C) means for constructing a haplotype block using the identified Common polymorphism; and
(D) means for identifying a Common polymorphism outside the haplotype block.
(36) The program according to (35), wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.
(37) The program according to (35), wherein the common polymorphism outside the haplotype block is at least one selected from those shown in the “Between” section of Table 3.
(38) A program for analyzing a haplotype, the program causing a computer to function as the following means:
(A) means for detecting a genetic polymorphism in a drug-related gene obtained from a test population;
(B) means for processing the detection information and selecting a Common polymorphism;
(C) means for constructing a haplotype block using the identified Common polymorphism;
(D) means for identifying a Rare polymorphism within and/or outside the haplotype block; and
(E) means for assigning the Rare polymorphism to a major haplotype.
(39) The program according to (38), wherein the drug-related gene is at least one selected from those shown in Table 1.
(40) The program according to (38), wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.
(41) The program according to (38), wherein the Rare polymorphism is at least one selected from those shown in Table 2.
(42) A program for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, the program causing a computer to function as the following means:
(A) means for identifying a tag polymorphism, by the program according to any one of (32) to (34), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, using the identified tag polymorphism or a combination thereof, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype.
(43) A program for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, the program causing a computer to function as the following means:
(A) means for identifying a Common polymorphism outside the block, by the program according to any one of (35) to (37), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype.
(44) The program according to (42), further including the following means:
(A) means for identifying a Common polymorphism outside the block, by the program according to any one of (35) to (37), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype.
(45) A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to any one of (38) to (41), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype.
(46) The program according to any one of (42) to (44), further including the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to any one of (38) to (41), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype.
This program includes a program for analyzing the association between drug or foreign substance susceptibility or disease susceptibility and a haplotype using a combination of a tag polymorphism and at least one Rare polymorphism.
(47) A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(A) means for identifying a tag polymorphism, by the program according to any one of (32) to (34), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, using the identified tag polymorphism or a combination thereof, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype;
(C) means for selecting a block involved in the frequency difference;
(D) means for selecting a polymorphism present in the selected block; and
(E) means for estimating a polymorphism associated with the frequency difference.
(48) A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(A) means for identifying a Common polymorphism outside the block, by the program according to any one of (35) to (37), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
(49) The program according to (47), further including the following means:
(A) means for identifying a Common polymorphism outside the block, by the program according to any one of (35) to (37), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
(50) A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to any one of (38) to (41), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
(51) The program according to any one of (47) to (49), further including the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to any one of (38) to (41), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
This program includes a program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility using a combination of a tag polymorphism and at least one Rare polymorphism.
(52) A program for analyzing individual differences in phenotype related to drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as means for associating an individual with a phenotype using, as an index, the analysis result obtained by the program according to any one of (42) to (46) or the estimation result obtained by the program according to any one of (47) to (51).
(53) The program according to any one of (47) to (52), wherein the drug sensitivity is sensitivity related to pharmacokinetics, drug effectiveness, or drug side effects.
(54) The program according to (53), wherein the pharmacokinetics are dynamics relating to drug absorption, distribution, metabolism, or excretion.
(55) The program according to (53), wherein the pharmacokinetics is a kinetics related to a blood concentration of the drug.
(56) The program according to any one of (47) to (52), wherein the disease susceptibility is the presence or absence of morbidity, or the degree thereof.
(57) The program according to any one of (47) to (52), wherein the disease is at least one selected from the group consisting of malignant tumors, immune system diseases, circulatory system diseases, metabolic system diseases, renal and urinary system diseases, respiratory system diseases, and musculoskeletal diseases.
(58) A program for causing a computer to function as means for predicting drug or foreign substance susceptibility or disease, using as an index the analysis result or the estimation result obtained by the program according to any one of (42) to (57).
(59) A program for causing a computer to function as means for selecting a drug for preventing or treating a disease and/or a method of preventing or treating a disease, using as an index the analysis result or the estimation result obtained by the program according to any one of (42) to (57).
(60) A program for causing a computer to function as means for determining an appropriate dose of a drug for preventing or treating a disease, using as an index the analysis result or the estimation result obtained by the program according to any one of (42) to (57).
(61) A program for causing a computer to function as means for analyzing an interaction between drugs, using as an index the analysis result or the estimation result obtained by the program according to any one of (42) to (57).
(62) A program for causing a computer to function as means for determining a related polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, using as an index the analysis result or the estimation result obtained by the program according to any one of (42) to (57).
(63) A computer-readable recording medium on which the program according to any one of (32) to (62) is recorded.
(64) A haplotype including at least one Rare polymorphism shown in Table 4.
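The processing recited in items (32) and (38) can be illustrated with a short Python sketch. It separates Common from Rare polymorphisms by minor allele frequency, picks a smallest set of Common sites that still distinguishes the observed haplotypes (used here as tag polymorphisms), and assigns the minor allele of a Rare polymorphism to the tag haplotype that carries it most often. This is only a sketch under stated assumptions, not the implementation of the invention: haplotypes are assumed to be already phased, biallelic, and confined to a single block; the 0.1 threshold is borrowed from the description of FIG. 11 (blocks built from SNPs with a minor allele frequency of 0.1 or more) and is an assumption here; all function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def minor_allele_frequency(haplotypes, site):
    """Frequency of the less common allele at one site (0.0 if monomorphic)."""
    counts = Counter(h[site] for h in haplotypes)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(haplotypes)


def split_common_rare(haplotypes, threshold=0.1):
    """Partition site indices into Common (MAF >= threshold) and Rare (0 < MAF < threshold).

    The threshold value is an assumption made for this sketch.
    """
    common, rare = [], []
    for site in range(len(haplotypes[0])):
        maf = minor_allele_frequency(haplotypes, site)
        if maf >= threshold:
            common.append(site)
        elif maf > 0:
            rare.append(site)
    return common, rare


def select_tag_sites(haplotypes, common_sites):
    """Smallest subset of Common sites that separates every distinct Common-site haplotype.

    Exhaustive search; adequate only for the small blocks of this illustration.
    """
    distinct = {tuple(h[s] for s in common_sites) for h in haplotypes}
    for size in range(len(common_sites) + 1):
        for subset in combinations(range(len(common_sites)), size):
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):
                return [common_sites[i] for i in subset]
    return list(common_sites)


def assign_rare_allele(haplotypes, rare_site, tag_sites):
    """Tag-site haplotype that most often carries the minor allele of a Rare site."""
    counts = Counter(h[rare_site] for h in haplotypes)
    minor = min(counts, key=counts.get)
    carriers = Counter(
        tuple(h[s] for s in tag_sites) for h in haplotypes if h[rare_site] == minor
    )
    return carriers.most_common(1)[0][0]


# Hypothetical phased haplotypes over four sites (20 chromosomes).
sample = ["AAGC"] * 9 + ["AAGT"] * 1 + ["GAAC"] * 6 + ["GGAC"] * 4

common_sites, rare_sites = split_common_rare(sample)
tag_sites = select_tag_sites(sample, common_sites)
major_haplotype = assign_rare_allele(sample, rare_sites[0], tag_sites)

print("Common sites:", common_sites)
print("Rare sites:", rare_sites)
print("Tag sites:", tag_sites)
print("Rare allele assigned to tag haplotype:", major_haplotype)
```

In this sample, sites 0 and 2 always vary together, so one of them is redundant and two tag sites are enough to tell the three Common haplotypes apart. With real data the haplotypes would first have to be inferred from genotypes, which this sketch does not attempt.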
Furthermore, the present invention provides a haplotype analysis device (haplotype analysis system) for use in the above method. For example, the following invention is provided.
(65) A haplotype analysis device comprising the following means:
(A) means for detecting a genetic polymorphism in at least one of the drug-related genes shown in Table 1 obtained from a test population;
(B) means for processing the detection information and selecting a Common polymorphism;
(C) means for constructing a haplotype block using the identified Common polymorphism;
(D) means for identifying a Common polymorphism within the haplotype block; and
(E) means for identifying a tag polymorphism from the Common polymorphism within the haplotype block.
(66) A haplotype analysis device comprising the following means:
(A) means for detecting a genetic polymorphism in at least one of the drug-related genes shown in Table 1 obtained from a test population;
(B) means for processing the detection information and selecting a Common polymorphism;
(C) means for constructing a haplotype block using the identified Common polymorphism; and
(D) means for identifying a Common polymorphism outside the haplotype block.
(67) A haplotype analysis device comprising the following means:
(A) means for detecting a genetic polymorphism in a drug-related gene obtained from a test population;
(B) means for processing the detection information and selecting a Common polymorphism;
(C) means for constructing a haplotype block using the identified Common polymorphism;
(D) means for identifying a Rare polymorphism within and/or outside the haplotype block; and
(E) means for assigning the Rare polymorphism to a major haplotype.
(68) A device for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, comprising the following means:
(A) means for identifying a tag polymorphism, by the device according to (65), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, using the identified tag polymorphism or a combination thereof, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype.
(69) A device for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, comprising the following means:
(A) means for identifying a Common polymorphism outside the block, by the device according to (66), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype.
(70) The device according to (68), further comprising the following means:
(A) means for identifying a Common polymorphism outside the block, by the device according to (66), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype.
(71) A device for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, comprising the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the device according to (67), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype.
(72) The device according to any one of (68) to (70), further comprising the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the device according to (67), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease; and
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype.
This device includes a device for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype using a combination of a tag polymorphism and at least one Rare polymorphism.
(73) A device for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following means:
(A) means for identifying a tag polymorphism, by the device according to (65), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, using the identified tag polymorphism or a combination thereof, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype;
(C) means for selecting a block involved in the frequency difference;
(D) means for selecting a polymorphism present in the selected block; and
(E) means for estimating a polymorphism associated with the frequency difference.
(74) A device for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following means:
(A) means for identifying a Common polymorphism outside the block, by the device according to (66), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
(75) The device according to (73), further comprising the following means:
(A) means for identifying a Common polymorphism outside the block, by the device according to (66), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the identified Common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
(76) A device for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the device according to (67), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
(77) The device according to any one of (73) to (75), further comprising the following means:
(A) means for assigning a Rare polymorphism to a specific major haplotype, by the device according to (67), for a drug-related gene collected from a test population exposed to or possibly exposed to a drug or a foreign substance, or from a test population exposed to a risk factor of a disease;
(B) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with that in a test population having another phenotype; and
(C) means for estimating a polymorphism associated with the frequency difference.
This device includes a device for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility using a combination of a tag polymorphism and at least one Rare polymorphism.
(78) A device for analyzing individual differences in phenotype related to drug or foreign substance susceptibility or disease susceptibility, comprising means for associating an individual with a phenotype using, as an index, the analysis result obtained by the device according to any one of (68) to (72) or the estimation result obtained by the device according to any one of (73) to (77).
(79) A device for predicting drug or foreign substance sensitivity or disease using as an index the analysis result or the estimation result obtained by the device according to any one of (68) to (78).
(80) A drug for preventing or treating a disease and / or a method for preventing or treating a disease, using as an index the analysis result or the estimation result obtained by the apparatus according to any one of (68) to (78) A device that selects the law.
(81) An apparatus for determining an appropriate dose of a drug for preventing or treating a disease using the analysis result or the estimation result obtained by the apparatus according to any one of (68) to (78) as an index.
(82) A device for analyzing an interaction between drugs using an analysis result or an estimation result obtained by the device according to any one of (68) to (78) as an index.
(83) An apparatus for determining an associated polymorphism related to drug or foreign substance susceptibility or disease susceptibility, using the analysis result or the estimation result obtained by the apparatus according to any one of (68) to (78) as an index.

FIG. 1 is a diagram showing a flowchart for executing the analysis method of the present invention.
FIG. 2 is a schematic diagram of a haplotype block.
FIG. 3 is a diagram illustrating the relationship between haplotypes and phenotypes.
FIG. 4 is a diagram showing an outline of haplotype block construction.
FIG. 5 is a diagram showing haplotypes including Rare SNPs.
FIG. 6 is a block diagram of means included in a computer for executing the program of the present invention.
FIG. 7 is a diagram showing the results of statistically classifying 2603 autosomal SNPs.
FIG. 8 shows a histogram of minor allele frequencies of autosomal SNPs.
FIG. 9 is a diagram showing a histogram of synonymous and non-synonymous SNPs having different minor allele frequencies.
FIG. 10 is a diagram showing a histogram of the number of haplotype blocks per autosomal gene.
FIG. 11 is a diagram showing a histogram of the length of autosomal haplotype blocks constructed using SNPs when the minor allele frequency is 0.1 or more.
FIG. 12a is a diagram showing a ratio of minor allele frequencies assigned to a majority-assigned haplotype for Rare SNPs having different frequencies.
FIG. 12b is a diagram showing a ratio of minor allele frequencies assigned to a majority-assigned haplotype for Rare SNPs having different frequencies.
FIG. 13 is a diagram showing the results of statistically classifying 4,104 autosomal SNPs.
FIG. 14 is a diagram illustrating a method of assigning low-frequency alleles of Rare SNPs to haplotypes configured by htSNPs.
FIG. 15 is a diagram illustrating a ratio of Aj (haplotype constructed by htSNP) to which a low frequency allele (X) of Rare SNPs is assigned.
FIG. 16 is a diagram showing the proportion of haplotypes Aj constructed from htSNPs that carry the low-frequency allele (X) of Rare SNPs.
FIG. 17 is a diagram showing the ratio between the proportion of the high-frequency allele (complement of X) of Rare SNPs assigned to Aj and the proportion of the low-frequency allele (X) of Rare SNPs assigned to Aj.
FIG. 18 is a diagram showing the probability that a test comparing the frequencies of haplotypes composed of htSNPs is significant when the frequency of the phenotype-related low-frequency allele of Rare SNPs differs between affected individuals and the control population.
FIG. 19 is a conceptual diagram of a complete haplotype and an incomplete haplotype.
FIG. 20 is a map of the NAT2 gene and 24 surrounding SNPs. The NAT2 gene is located at 8p22 and consists of two exons. The upper diagram shows the positions of the 24 SNPs described in Table 12. SNPs drawn with bold lines in the upper diagram are those extracted as htSNPs. Part of the upper diagram is enlarged in the lower diagram. The six SNPs shown in the lower diagram (SNP no. 6-11) lie in the coding region of exon 2; of these, SNP no. 7, 9, 10, and 11 (drawn with bold lines in the lower diagram) cause amino acid substitutions and are related to the phenotype.
FIG. 21 is a characteristic diagram showing the relationship between q_+ and r_+ when λ is 0.1, 0.01, and 0.001 with equal numbers of cases and controls.
FIG. 22 is a characteristic diagram for λ = 0.01 with the ratio of cases to controls (case/control) set to 1, 5, and 10.
FIG. 23 is a histogram of the statistic -2 log(L_0max / L_max) obtained when data generated under the null hypothesis are analyzed.
FIG. 24 is a characteristic diagram showing the relationship between q_+ / q_- (relative risk) and the statistic -2 log(L_0max / L_max) for data generated under the alternative hypothesis.
FIG. 25 is a characteristic diagram showing the increase in statistical power with increasing q_+ / q_- (relative risk) and sample size N.
FIG. 26 is a characteristic diagram showing the relationship between the frequency of “haplotype related to phenotype” and the power of detection.
FIG. 27 is a characteristic diagram showing the relationship between the number of cases or controls and the type I error when the haplotype frequency data of the SAA gene are used.
FIG. 28 is a characteristic diagram showing the relationship between the number of cases or controls and the type I error when the haplotype frequency data of ART are used.
FIG. 29 is a characteristic diagram showing simulation results for the type I error under the null hypothesis with the present algorithm and with the separate3 method.

Explanation of symbols

601: CPU, 602: ROM, 603: RAM, 604: input unit, 605: transmission / reception unit, 606: output unit, 607: HDD, 608: CD-ROM drive, 609: network line

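The present invention is described in detail below.
1. Summary of the present invention
The genome contains a large number of genetic polymorphisms, of which SNPs are representative. Using these polymorphisms as indices, differences in drug efficacy, or in the presence or severity of side effects, can be examined polymorphism by polymorphism. Attributes of an individual other than its genotype, such as drug efficacy or the presence or severity of side effects, are called phenotypes. However, examining each of these many polymorphisms one by one is laborious and inefficient. Moreover, unknown polymorphisms may exist and may be related to the phenotype.
In a diploid individual such as a human, a polymorphism located at a specific position on a chromosome (called a locus) has a unit called a genotype. The genotype consists of a combination of two gene units, one derived from each parent. Such a gene unit is called an allele. For multiple loci linked on the same chromosome, the combination of alleles derived from one parent is called a haplotype.
Alleles at multiple loci lying close together on a chromosome are known to be inherited from generation to generation as a single set, and this set is the haplotype described above.
In general, the size of a mating population is finite, and population-genetic considerations predict that a finite mating population contains far fewer haplotypes than would be expected from the number of alleles at the loci concerned. Indeed, real human populations often contain far fewer haplotypes than expected from the number of alleles. When the alleles at two loci are independent of each other on the chromosomes of a population, the loci are said to be in linkage equilibrium. In linkage equilibrium, the frequency of a haplotype equals the product of the allele frequencies at each locus. When, in contrast, the alleles at two loci are not independent, the loci are said to be in linkage disequilibrium. Under linkage disequilibrium, haplotype frequencies deviate from the product of the allele frequencies at each locus.
A chromosomal region in which linkage disequilibrium among nearby loci is extremely strong, so that only a small number of haplotypes exist in the population within that region, is called a haplotype block. Because recombination rarely occurs between loci inside a haplotype block, a haplotype within a block is in most cases passed between generations intact, as a unit.
To determine the association between genetic polymorphisms and phenotypes such as drug sensitivity, a single locus can be considered in isolation, but a phenotype is often associated not with a single locus but with a haplotype as a unit. Such haplotype-level effects on the phenotype are seen when two or more loci interact. Furthermore, the genotype at an unknown polymorphic locus cannot be examined in the first place. However, as described later, the low-frequency allele of a Rare polymorphism is often represented by a single haplotype. Haplotypes are therefore used to find associations between genetic polymorphisms and phenotypes such as disease or drug sensitivity, and linkage disequilibrium can be exploited to determine the polymorphisms truly associated with drug response. No concrete method for doing so has been presented to date. The present invention therefore provides a concrete method for examining drug response using haplotypes.
In the present invention, haplotype blocks and tags are constructed to make the analysis of associations between phenotypes and genetic polymorphisms more efficient. That is, regardless of whether a given polymorphism is associated with the phenotype (for example, whether it is causal), polymorphisms present on a haplotype block at a predetermined frequency are selected as landmarks, and using the selected polymorphisms as tags allows still more efficient analysis. In the present invention, such a landmark is called a "tag polymorphism". The association with the phenotype is analyzed using haplotypes composed of tag polymorphisms (called "tag haplotypes") or combinations thereof. A "tag haplotype" means the minimum combination of polymorphisms that can account for haplotypes of a given frequency. The given frequency is, for example, at least 70%, preferably 80%, more preferably 90%, and still more preferably 95%.
In other words, suppose that haplotypes are distinguished using all common polymorphisms (Common polymorphisms) in a block. Tag polymorphisms are selected from among those polymorphisms, and they can distinguish haplotypes with nearly the same efficiency (at least 70%, preferably 95%) as when all common polymorphisms are used. All combinations of tag polymorphisms for which this frequency lies between 70% and 95% are included in the present invention. A tag haplotype is a haplotype composed only of tag polymorphisms, and each tag haplotype corresponds almost one-to-one to a haplotype defined with all common polymorphisms. In rare cases several haplotypes may correspond to one tag haplotype, but all such haplotypes together amount to less than 5%.
The combination of tag haplotypes is not limited; at least one suffices. All combinations of the tag polymorphisms contained in a block can also be used.
In the present invention, in addition to the tag polymorphisms or combinations thereof, polymorphisms that were not incorporated into a haplotype block, that is, polymorphisms lying outside the blocks (called "out-of-block polymorphisms"), can also be combined with the tag polymorphisms.
Furthermore, the association with the phenotype is analyzed for polymorphisms whose minor allele frequency in the population is below a certain value at the time of haplotype block construction (called "Rare polymorphisms"), or for combinations thereof.
In this specification, SNPs are used where appropriate as an example of polymorphisms, but polymorphisms are not limited to SNPs. The notation "SNPs" may also denote a single SNP.
As shown in FIG. 1, genes to be subjected to haplotype analysis are first collected from a test population, and SNP information is obtained from those genes (FIG. 1, S1). The "test population" here is the population from which the basic SNP information used to examine associations with phenotypes is obtained, and it preferably contains many individuals. In the present invention the population contains 100 or more, preferably 200 or more, individuals.
Next, using the detected SNP information, the frequency with which each SNP occurs in the test population is calculated, and SNPs shared at or above a certain frequency (for example, 0.1) are selected as "Common SNPs" (S2). "Common" means "frequently occurring". The counterpart of Common SNPs is "Rare SNPs" (meaning uncommon SNPs; also called "Uncommon SNPs"), which are identified as SNPs present at less than the given frequency (for example, 0.1) (S3). Rare SNPs are described in detail later. For the Common SNPs, haplotype blocks are constructed according to a predetermined algorithm (S4), and SNPs inside the blocks are distinguished from SNPs outside them (S5). When in-block SNPs are used, the Common SNPs are identified (S6) and tag SNPs (also called "htSNPs") are selected from among them (S7). These tag SNPs are used to examine associations with phenotypes such as drug efficacy and side effects in (i) a test population exposed to a drug or foreign substance, (ii) a test population that may be exposed to a drug or foreign substance, or (iii) a test population exposed to a risk factor for a disease (populations (i) to (iii) are called case populations). A "risk factor for a disease" means a circumstance that gives rise to a disease such as a tumor, and includes an individual's genetic background, sex, age, preferences or lifestyle habits (for example, alcohol and tobacco), and chemical substances.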
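The Common/Rare split of steps S2 and S3 can be illustrated with a minimal sketch. It assumes biallelic SNPs, genotypes given as two-character strings, and the example threshold of 0.1; the function names are hypothetical and are not part of the program described in this specification.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Minor allele frequency of one biallelic SNP.

    genotypes: list of two-character strings such as "GA", one per individual.
    A monomorphic site has no minor allele and is reported as 0.0.
    """
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())


def classify_snps(genotype_table, threshold=0.1):
    """Split SNPs into Common (MAF >= threshold) and Rare (MAF < threshold)."""
    common, rare = [], []
    for snp, genotypes in genotype_table.items():
        if minor_allele_frequency(genotypes) >= threshold:
            common.append(snp)
        else:
            rare.append(snp)
    return common, rare


# Hypothetical data: 10 individuals typed at two SNPs.
table = {
    "SNP_A": ["GG"] * 6 + ["GA"] * 3 + ["AA"],  # A: 5 of 20 alleles
    "SNP_B": ["CC"] * 9 + ["CT"],               # T: 1 of 20 alleles
}
common_snps, rare_snps = classify_snps(table)
```

With these data SNP_A has a minor allele frequency of 0.25 and is treated as Common, whereas SNP_B has 0.05 and is treated as Rare.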
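Using these tag SNPs, or a combination of tag SNPs and Rare SNPs, the frequency of haplotypes, or of individuals having such haplotypes, is compared between case populations, or between a case population and a control population, to test for association with the phenotype (S8).
When out-of-block SNPs are used (S9), their association with the phenotype can be tested alone or in combination with tag SNPs. That is, the frequency of the polymorphisms (SNPs), or of individuals having such SNPs, is compared between case populations, or between a case population and a control population, to test for association with the phenotype.
FIG. 2 is a schematic diagram showing 16 SNPs on a gene X. The 16 SNPs are carried by at least a certain proportion of individuals in the whole population and are therefore Common SNPs (Common polymorphisms). Assuming that SNP1 to SNP6, SNP8 to SNP10, and SNP12 to SNP15 each form haplotypes, blocks A, B, and C are constructed, respectively (FIG. 2). SNP7, SNP11, and SNP16 are out-of-block SNPs.
FIG. 3 shows haplotype block A extracted from FIG. 2. As shown in FIG. 3, suppose that SNP1 is G/A, SNP2 is G/A, SNP3 is T/C, SNP4 is G/A, SNP5 is A/T, and SNP6 is C/G, and that four haplotype patterns, a to d, exist. Haplotype a is "G-A-T-G-T-C", b is "G-A-T-A-A-C", c is "A-A-C-G-A-C", and d is "G-A-C-A-T-C". The tag SNPs are assumed to be SNP2, SNP3, and SNP5 (the SNPs marked with a star in FIG. 3).
Block A merely illustrates a haplotype consisting of six SNPs; the number of SNPs is not limited to six.
If N SNPs are assumed to exist in one haplotype block, the number of SNP combinations, that is, of possible haplotype patterns, is 2^N. According to population genetics, however, chance plays a strong role in the growth or disappearance of polymorphisms and haplotypes in a finite population. The haplotype patterns that actually occur therefore do not cover all SNP combinations and are fewer in number. In the example of FIG. 3, block A has four haplotype patterns rather than 2^6 = 64. Using haplotypes thus allows the relationship between phenotype and haplotype to be analyzed efficiently with few combinations, which is more efficient than analyzing polymorphisms one by one.
Furthermore, as described above, when analyzing the association between haplotypes and phenotypes, the present invention specifies which polymorphisms, and which combinations of them, should be analyzed from among the many polymorphisms in a haplotype block. For example, it provides the combination of SNPs (htSNPs, out-of-block SNPs, Rare SNPs, and so on) whose information should be used. The htSNPs, out-of-block SNPs, Rare SNPs, or combinations thereof narrowed down in this way are then used as markers to analyze the association between one phenotype and another (FIG. 1, S8).
In practice, the genotypes of individuals who took the drug are examined using all htSNPs and all out-of-block Common SNPs, and their association with the phenotype is analyzed. For the htSNPs, the association between the phenotype and the haplotypes formed from the htSNPs (tag haplotypes) is tested; for the out-of-block Common SNPs, the association between each SNP and the phenotype is tested. The association test uses the htSNPs, out-of-block SNPs, and Rare SNPs to compare, with respect to a specific phenotype, the frequency of haplotypes or of individuals having such haplotypes, or the frequency of SNPs or of individuals having such SNPs (FIG. 1, S8). This makes clear which of the countless SNPs should be analyzed, that is, the scope of the search.
When a correlation is found between a tag haplotype and the phenotype, all polymorphisms contained in the haplotype block involved in that association, including the htSNPs, are then selected and analyzed exhaustively to estimate or identify the polymorphism associated with the phenotype (FIG. 1, S10).
With reference to FIG. 3, consider the case in which, after a drug is administered to subjects, the group with no or few side effects is taken as the safe group and the group with strong side effects as the side-effect group, and haplotype frequencies are compared between the two groups. Suppose that, among the tag haplotypes formed by SNP2 and SNP3, the frequency of A-T is 80% in the safe group but 20% in the side-effect group, and the frequency of A-C is 20% in the safe group but 80% in the side-effect group.
Whether the haplotype frequencies differ between the safe group and the side-effect group is then tested with a statistical test, for example the chi-square test of independence.
The haplotype patterns that account for this difference are the combination of haplotypes a and b, whose tag haplotype is A-T (frequency 20% in the side-effect group), and the combination of haplotypes c and d, whose tag haplotype is A-C (frequency 80% in the side-effect group).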
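The chi-square test of independence mentioned above can be sketched for the 80%/20% example as follows. The group sizes are not given in the text, so 100 haplotypes per group is an assumption made purely for illustration; the function name is hypothetical, and the p-value uses the standard relation for one degree of freedom.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p_value); for 1 degree of freedom the upper-tail
    probability of chi2 equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Rows: safe group, side-effect group; columns: tag haplotypes A-T, A-C.
# Assumed counts: 100 haplotypes per group at 80%/20% and 20%/80%.
chi2, p_value = chi_square_2x2(80, 20, 20, 80)
```

Under these assumed counts the statistic is 72, far beyond any conventional significance threshold, so the two groups would be judged to differ in tag haplotype frequency.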
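Because an association has been found between the phenotype and a specific tag haplotype (the "A-C" combination of haplotype c and the "A-C" combination of haplotype d) in a specific block (block A), the haplotype block involved in this association (block A) is selected and, starting from the tag SNPs, all polymorphisms present in haplotypes c and d are selected. That is, from the major haplotypes constructed in advance for that block, the haplotypes containing the associated tag haplotype are selected, and it is then judged whether those haplotypes are associated with the phenotype. "Major haplotypes" means haplotypes selected in descending order of frequency in the population until their total reaches a certain proportion (for example, 95%) or more.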
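The definition of major haplotypes given above can be written as a short sketch: haplotypes are taken in descending order of frequency until the cumulative frequency reaches the chosen proportion. The function name and the example frequencies are hypothetical.

```python
def select_major_haplotypes(frequencies, coverage=0.95):
    """Return haplotypes in descending frequency order until their
    cumulative frequency reaches `coverage`.

    frequencies: dict mapping haplotype label -> population frequency.
    """
    selected, total = [], 0.0
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    for haplotype, freq in ranked:
        # A small tolerance guards against floating-point rounding at the boundary.
        if total >= coverage - 1e-12:
            break
        selected.append(haplotype)
        total += freq
    return selected


# Hypothetical estimated haplotype frequencies for one block.
example = {"a": 0.50, "b": 0.30, "c": 0.12, "d": 0.05, "e": 0.03}
major = select_major_haplotypes(example)
```

Here the first three haplotypes cover only 92%, so the fourth is added to reach 97%, and the fifth is left out.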
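In the above example, the association with the phenotype is analyzed for all SNPs in the block, starting from SNP2, SNP3, or SNP5. The haplotypes selected here are those (SNP1 to SNP6) that include all Common SNPs in the block, not only the htSNPs.
To judge whether Rare SNPs are associated with the phenotype, the associated block can also be sequenced to find the relevant Rare SNPs. When a specific out-of-block Common SNP is associated with the phenotype, its vicinity is sequenced to find the associated SNP.
Through the above process, when there is a true association between in-block Common SNPs or in-block Rare SNPs and the phenotype, the SNPs responsible for the association can be found. Whether an association exists is judged by the following criteria.
(a) Whether any of the Common SNPs constituting a specific haplotype, or a combination of those Common SNPs, is associated with the phenotype.
(b) Whether a Rare SNP assigned to a specific haplotype is associated with the phenotype.
Identifying the SNPs associated with the difference reveals the SNPs associated with the underlying disease, which enables diagnosis, drug selection, or identification of the causal SNPs.
Each step of the method of the present invention, or each means of the system of the present invention, is described in detail below.
2. Genetic polymorphisms
Genetic polymorphisms include single nucleotide polymorphisms, insertion/deletion polymorphisms, and polymorphisms arising from differences in the number of repeats of a base sequence. A single nucleotide polymorphism (SNP) generally means a polymorphism in which one specific base in a gene or its complementary strand (complementary sequence) is replaced by another base; in the present invention, however, the term also covers polymorphisms in which that one base is deleted and polymorphisms in which one further base is inserted at that base.
An insertion/deletion polymorphism is a polymorphism in which multiple bases (for example, two to several tens of bases) are deleted or inserted; cases in which several hundred to several thousand bases are deleted or inserted also exist. A polymorphism arising from differences in repeat number is one in which a sequence of two to several tens of bases is repeated and the number of repeats differs between individuals. Those with repeat units of several to several tens of bases are called VNTRs (variable number of tandem repeats), and those with units of about 2 to 4 bases are called microsatellite polymorphisms. In VNTRs and microsatellite polymorphisms, variation arises because the number of repeats differs between the alleles of individuals.
Genetic polymorphism information can be obtained with common polymorphism detection methods, for example sequencing, PCR-based methods, fragment length polymorphism assays, methods based on hybridization with allele-specific oligonucleotides as templates (for example, the TaqMan PCR method, the Invader method, and the DNA chip method), methods using a primer extension reaction, the MALDI-TOF/MS method, and the DNA chip method. PCR and sequencing can be used to detect any type of genetic polymorphism; the other methods are used mainly to detect SNPs.
The TaqMan PCR method uses fluorescently labeled allele-specific oligonucleotides together with a PCR reaction by Taq DNA polymerase (Livak, K. J. Genet. Anal. 14, 143 (1999); Morris T. et al., J. Clin. Microbiol. 34, 2933 (1996)). The Invader method combines hybridization to the template DNA of two reporter probes, each specific to one allele of the SNP, and one invader probe, with cleavage of the DNA by an enzyme having a special endonuclease activity that recognizes and cuts a DNA structure (Livak, K. J. Biomol. Eng. 14, 143-149 (1999); Morris T. et al., J. Clin. Microbiol. 34, 2933 (1996); Lyamichev, V. et al., Science, 260, 778-783 (1993), etc.).
As a method using a primer extension reaction, the SniPer method, for example, can also be adopted. The SniPer method is based on a technique called RCA (rolling circle amplification), in which DNA polymerase moves along a circular single-stranded DNA template while continuously synthesizing complementary DNA. In this method, a SNP is typed by measuring the presence or absence of the color reaction that occurs when DNA amplification takes place (Lizardi, P. M. et al., Nature Genet., 19, 225-232 (1998); Piated, A. S. et al., Nature Biotech., 16, 359-363 (1998)).
The sequencing method amplifies the region containing the polymorphism by PCR and sequences the DNA using Dye Terminator or the like, thereby analyzing the frequency of genetic polymorphisms (SNPs in particular).
The MALDI-TOF/MS method uses a mass spectrometer and genotypes SNPs essentially by exploiting the mass difference between different single bases. There are methods using PCR amplification and methods using multiplexing (Haff, L. A., Smirnov, I. P., Genome Res., 7, 378- (1997); Little, D. P. et al. Eur. J. Clinica. Chem. Clin. Biochem., 35, 545- (1997); Ross, P., et al. Nat Biotechnol., 16, 1347- (1998)).
The DNA chip method arrays and immobilizes many kinds of DNA probes on a substrate such as glass, hybridizes labeled DNA to them, and detects the label signal (fluorescence, for example) on the probes, thereby discriminating perfect matches from single-base mismatches by hybridization.
Commercial kits can be used for the analysis of polymorphism information, of which SNPs are representative, and the analysis can be automated.
The genes analyzed in the present invention are not particularly limited. For example, at least one of the following genes can be analyzed (Table 1). In Example 5, the present invention is carried out using NAT2, an enzyme involved in the metabolism of sulfasalazine. Like NAT2, alcohol dehydrogenase (ADH), for example, is a widely known enzyme and can be analyzed in the present invention.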

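Each gene shown in Table 1 is considered to perform some function between administration of a drug and its excretion after absorption, distribution, and metabolism. In some cases this function involves multiple genes acting together, with one gene, when expressed, forming a network with other genes. Accordingly, in the present invention, polymorphism information and phenotype may be determined by analyzing not just one gene but all, or a subset, of the genes shown in Table 1.
Table 2 shows examples of information on SNPs that are present in the genes shown in Table 1 and are analyzed in the present invention.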

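In Table 2, the "number" column gives the identification number of the SNP within its block (for example, L1, L2). "contig pos" is a number indicating the position on the genome at which the SNP lies. "rs number" is the number under which the SNP information is registered at NCBI (National Center for Biotechnology Information). SNP information can be obtained by entering the rs number on the NCBI website (http://www.ncbi.nlm.nih.gov/). For example, selecting "SNP" as the item on the search screen of the NCBI website and entering "rs2293004" yields the information page for that SNP (URL below).
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2293004
The SNP information for rs2293004 at the above URL is "G/A".
"RK number" is an identification number assigned by the applicant. In the "SNP" column the alleles are shown separated by "/". In the "Block" column, htSNPs are marked "T". In the rightmost field of the "Block" column, "C" denotes a Common SNP and "R" a Rare SNP.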
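3. Haplotype analysis
(1) Construction of haplotype blocks
As shown in FIG. 4, consider constructing haplotype blocks by performing linkage disequilibrium analysis on all combinations of nine SNPs on a gene Y. Criteria are established by calculating, in a given population, a measure of linkage disequilibrium between one SNP and the other SNPs and the number of major haplotypes obtained when information from multiple adjacent SNPs is used, and haplotype blocks are constructed using these criteria as indices.
A previously reported method can be used to construct haplotype blocks (Avi-Itzhak, H. I. et al., Pac. Symp. Biocomput 466-477 (2003)).
First, haplotypes are estimated as described above using adjacent Common SNPs at linked SNP loci (FIG. 1, S1, S2). The haplotype frequency in the population is usually set to 5% or more, preferably 5% to 20%, and more preferably 5% to 10%, but is not limited to these values. As a result, the haplotype frequencies in the population are estimated, and from them the value of D' is estimated. D' is an index of linkage disequilibrium between two SNPs. Pairs whose estimated D' is, for example, 0.90 or more (the threshold is not limited to 0.90) are sought. When such a pair of Common SNPs is found, it is used as a seed for haplotype block formation, and the block is extended as far as possible in the 5' and 3' directions.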
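The index D' used to find seed pairs can be computed from two-locus haplotype frequencies as sketched below. This uses the standard definition of D' (the linkage disequilibrium coefficient D normalized by its maximum attainable magnitude), which the text itself does not spell out; the function name and argument layout are assumptions for illustration.

```python
def d_prime(f00, f01, f10, f11):
    """|D'| between two biallelic loci from haplotype frequencies.

    fxy is the frequency of the haplotype carrying allele x at locus 1 and
    allele y at locus 2 (alleles coded 0/1); the four values sum to 1.
    """
    p1 = f10 + f11          # frequency of allele 1 at locus 1
    q1 = f01 + f11          # frequency of allele 1 at locus 2
    d = f11 - p1 * q1       # deviation from the product of allele frequencies
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    return abs(d) / d_max if d_max > 0 else 0.0


# Only two of the four possible haplotypes exist: complete disequilibrium.
complete = d_prime(0.5, 0.0, 0.0, 0.5)
# Haplotype frequencies equal the product of allele frequencies: equilibrium.
equilibrium = d_prime(0.25, 0.25, 0.25, 0.25)
```

A pair such as the first example (D' of 1) would pass the 0.90 threshold and serve as a seed, whereas the second (D' of 0) would not.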
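The region containing the Common SNPs in strong linkage disequilibrium is taken as a provisional haplotype block, and the haplotypes estimated so far at 5% or more (the threshold is not limited to 5%) are taken as provisional major haplotypes. One Common SNP adjacent to the provisional haplotype block is added to the Common SNPs in the block, and haplotype estimation is performed again.
If, as a result, no haplotype of 5% or more (not limited to 5%) arises other than the existing provisional major haplotypes, the provisional haplotype block is extended by adding the new Common SNP. If such a haplotype does newly arise, extension of the haplotype block is considered to have stopped in that direction. For example, as shown in FIG. 4A, when SNP1 and SNP2 are judged to be in strong linkage disequilibrium, SNP1 and SNP2 are provisionally taken to form a haplotype (a); next, whether this haplotype block and SNP3 are in strong linkage disequilibrium is analyzed (b), and linkage disequilibrium with each successive downstream SNP is analyzed in turn (c). If adding SNP5 to the block formed by SNP1 to SNP4 produces a new major haplotype of 5% or more, SNP1 to SNP4 are judged to form one block (d). In the same way, whether the block of SNP1 to SNP4 is in strong linkage disequilibrium with SNP6, SNP7, ... SNP9 is analyzed. The analysis is then repeated starting from SNP2 (e-g). This procedure is carried out for all combinations up to SNP9 (a-i) to construct the haplotype blocks.
The analysis is performed in both the 5' and 3' directions; if extension of the haplotype block has not stopped in a given direction, one adjacent Common SNP in that direction is added and the step is repeated.
When extension has stopped in both directions, the provisional haplotype block at that stage is taken as the haplotype block (FIG. 1, S4).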
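The extension rule described above can be sketched in simplified form. Two assumptions are made here that go beyond the text: phased haplotypes are taken as already known (the described procedure instead re-estimates haplotype frequencies from genotypes at each step), and "a new haplotype of 5% or more arises" is read as "the number of haplotypes at or above the threshold increases". The function names are hypothetical.

```python
from collections import Counter


def major_haplotype_count(haplotypes, start, end, threshold=0.05):
    """Number of distinct haplotypes over SNPs [start, end) whose
    frequency is at least `threshold`."""
    counts = Counter(h[start:end] for h in haplotypes)
    total = len(haplotypes)
    return sum(1 for c in counts.values() if c / total >= threshold)


def extend_block(haplotypes, seed_start, seed_end, threshold=0.05):
    """Extend a seed block [seed_start, seed_end) in the 3' and then the 5'
    direction while adding a SNP creates no additional major haplotype."""
    n_snps = len(haplotypes[0])
    start, end = seed_start, seed_end
    while end < n_snps and (
        major_haplotype_count(haplotypes, start, end + 1, threshold)
        <= major_haplotype_count(haplotypes, start, end, threshold)
    ):
        end += 1  # 3' extension accepted
    while start > 0 and (
        major_haplotype_count(haplotypes, start - 1, end, threshold)
        <= major_haplotype_count(haplotypes, start, end, threshold)
    ):
        start -= 1  # 5' extension accepted
    return start, end


# Hypothetical phased haplotypes over five SNPs (0 = major, 1 = minor allele).
phased = ["00000"] * 10 + ["11110"] * 6 + ["11111"] * 4
block = extend_block(phased, 0, 2)
```

In this example, as in the FIG. 4A illustration, adding the fifth SNP splits one major haplotype into two, so the block stops at the first four SNPs.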
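In general, the block structure can be read from the matrix shown in FIG. 4B. Differences in color or color intensity in the cells of the matrix (81 cells in FIG. 4B) indicate which SNPs form a block with which, and how the probability of linkage differs.
(2) Selection of tag haplotypes
Tag SNPs (htSNPs) are selected from the blocks constructed as described above (FIG. 1, S7). The htSNPs are those whose combination can account for at least a certain proportion (for example, 90%) of haplotypes; they are the minimum set of Common SNPs that can distinguish the major haplotypes. After SNPs inside and outside the blocks have been distinguished (FIG. 1, S5, S6), tag selection is carried out on the in-block Common SNPs by a predetermined computer program, and the selected SNPs are identified as htSNPs (FIG. 1, S7). Examples of htSNPs are shown in Table 3. The identified htSNPs can be used alone or in combinations of two or more.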
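The idea of htSNPs as the minimum set of Common SNPs that distinguishes the major haplotypes can be illustrated with an exhaustive search. This is only a sketch: the predetermined computer program referred to above is not described in this passage, the brute-force search is a stand-in chosen for clarity, and the function name and example haplotypes are hypothetical.

```python
from itertools import combinations


def select_tag_snps(major_haplotypes):
    """Smallest set of SNP positions whose alleles alone distinguish
    every major haplotype (first such set in index order)."""
    n_snps = len(major_haplotypes[0])
    n_distinct = len(set(major_haplotypes))
    for size in range(1, n_snps + 1):
        for subset in combinations(range(n_snps), size):
            projected = {tuple(h[i] for i in subset) for h in major_haplotypes}
            if len(projected) == n_distinct:
                return subset
    return tuple(range(n_snps))


# Hypothetical major haplotypes of one block (0 = major, 1 = minor allele).
haplotypes = ["0000", "0011", "1100", "1111"]
tags = select_tag_snps(haplotypes)
```

Here the first and third SNPs together separate all four haplotypes, so two SNPs can stand in for four. The search is exponential in the number of SNPs and is practical only for small blocks.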
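In Table 3, the SNPs labeled "htSNPs" under "Block#" (# denotes a serial number) are tag SNPs. The SNPs listed under "Between" are out-of-block SNPs (described later). The number in the leftmost column is an identification number assigned by the applicant (corresponding to "RK number" in Table 2), and the symbol in the second column from the left (for example, "DDOST" or "NDUFS5") is the gene name. The subsequent columns give, in order, one allele of the SNP, its frequency, the other allele, its frequency, the position of the SNP on the genome, and the rs number; the rightmost column gives the alias of the gene. The alias is shown even when it is the same as the gene name in the second column.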

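In Table 3, "del" means deletion; for example, a SNP shown as "CT/del" is either CT or a deletion. "ins" means insertion; for example, a SNP shown as "G/ins" is one in which G is inserted at that site. A SNP shown with bases in parentheses followed by a number, such as "(TC)2-3", is a polymorphism in which the bases in parentheses are repeated that number of times (here, (TC) repeated 2 to 3 times).
(3) Selection of out-of-block SNPs
A haplotype block is defined at construction as a set of Common SNPs that are in linkage disequilibrium at or above a certain strength. During construction, therefore, some SNPs fall outside that criterion and are left out of the haplotype blocks (for example, SNP7, SNP11, and SNP16 in FIG. 2). SNPs judged to be in weak linkage disequilibrium and left out of the blocks in this way are called "Out-of-block SNPs". Examples are shown under "Between" in Table 3 above.
In the present invention, not only in-block SNPs but also such out-of-block SNPs can be analyzed, alone or in combinations of several (FIG. 1, S9). Out-of-block SNP data can also be added to the data of previously constructed haplotype blocks to analyze the association with the phenotype.
(4) Analysis of Rare SNPs
(i) Concept of Rare SNPs
As shown in FIGS. 2 and 5, suppose that four SNPs, SNP12 (G/A), SNP13 (C/T), SNP14 (C/A), and SNP15 (T/A), form a haplotype block (block C) with four major haplotypes: haplotype a ("A-T-A-A"), haplotype b ("A-T-C-T"), haplotype c ("G-C-C-T"), and haplotype d ("G-C-A-T"). Assume that the block is constructed from SNPs satisfying the condition that the allele frequency is at or above a certain value (for example, 0.1 or more). SNPs whose minor allele frequency is below 0.1 then do not contribute to the construction of the haplotype block.
Such non-contributing SNPs, however, can also be used to establish associations with the phenotype. These SNPs are called "Rare SNPs" (FIG. 1, S3). Rare SNPs are likewise judged to lie either outside or inside a block, and the two kinds can be used separately (FIG. 1, S11, S12, S14).
In FIG. 5, suppose that a Rare SNP with a minor allele frequency of 4% (below 0.1) lies between SNP13 and SNP14 (the "T" of SNP17). This Rare SNP lies within block C, typically on one specific haplotype (for example, haplotype d in FIG. 5) and not on the others (haplotypes a, b, and c in FIG. 5), so that it becomes a component of that specific haplotype. Placing a Rare SNP on a specific major haplotype in this way is called "assigning the Rare SNP" (FIG. 1, S13). Rare SNPs are assigned to the major haplotypes within a block (FIG. 1, S12, S13). Because a population carrying a haplotype with an assigned Rare SNP is often associated with a phenotype, Rare SNPs can be used for disease diagnosis, drug selection, and identification of causal SNPs.
Accordingly, after Rare SNP information is obtained in S1 and S3 of FIG. 1, one can proceed directly to the phenotype association test (FIG. 1, S8), or carry out the association test (FIG. 1, S8) together with the htSNP and out-of-block SNP information obtained in S7 and S9 of FIG. 1. When Rare SNPs and htSNPs are combined, for example, a combination of all htSNPs in the block with at least one Rare SNP (preferably one Rare SNP) is preferred. The combination of Rare SNPs is not particularly limited; any combination can be used, including in-block Rare SNPs as well as out-of-block Rare SNPs.
Examples of Rare SNPs are shown in Table 2 (the SNPs marked "R" in the rightmost field).
In Table 2, Rare SNPs and Rare Assign Block information are given below the list for each block.
Haplotypes containing at least one of these Rare SNPs are also included in the present invention.
Because a SNP is biallelic, it can be represented with two digits such as 0 and 1 (in parentheses in the lower part of FIG. 5). In FIG. 5, 0 denotes the major allele (the allele of higher frequency) and 1 the minor allele (the allele of lower frequency). The Rare SNP is assigned to haplotype d, which can be written as "11(1)10" (the parenthesized "(1)" being the Rare SNP). The digits are not limited to 0 and 1, nor is the notation limited to digits.
(ii) Method for determining the assignment of Rare SNPs
The method for determining the assignment of Rare SNPs to major haplotypes (FIG. 1, S13) is described concretely below.
First, the haplotype frequencies of the population are estimated from the genotype data of individuals in the population for the Common SNPs in one haplotype block (for example, SNPs with a minor allele frequency of 10% or more, although the threshold is not limited to 10%), using the method for estimating population haplotype frequencies described above. From the estimated frequencies, haplotypes are selected in descending order of frequency until their total reaches 95% or more, and these are termed major haplotypes. The value is not limited to 95%.
Next, one Rare SNP (assumed to have a minor allele frequency below 10%, although the threshold is not limited to 10%) is added to the Common SNPs (n in number) in that haplotype block, and haplotypes are estimated from the genotypes of the n+1 SNPs.
This estimation yields the haplotype frequencies of the population. If only one haplotype in the population carries the minor allele of the Rare SNP, the minor allele of the Rare SNP has been assigned to a single haplotype.
If the major haplotype to which the minor allele of the Rare SNP is assigned cannot be determined in this way from the estimated haplotype frequencies, the following method is used.
The diplotype configuration of each individual estimated by the above method is expressed as the posterior probability distribution of that individual's diplotype configuration (the combination, or ordered pair, of the two haplotypes the individual carries) under the maximum-likelihood estimate of the population haplotype frequencies (Θ). Individuals for whom the posterior probability of a diplotype configuration containing a haplotype with the minor allele of the Rare SNP is 10% or more (not limited to 10%) are examined, and from the estimated haplotypes of those individuals the haplotype to which most of the minor alleles of the Rare SNP are assigned is determined. This haplotype is called the "majority-assigned haplotype". The proportion of minor alleles of the Rare SNP assigned to the majority-assigned haplotype is then calculated from the estimated diplotype configurations of the individuals.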
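The method for determining the assignment of Rare SNPs to the major haplotypes formed by Common SNPs within a haplotype block is described below.
First, the method for estimating the population haplotype frequencies and the individual diplotype configurations needed for assigning Rare SNPs is described. This method was reported by Kitamura et al. and Ito et al. (Kitamura Y. et al. (2002) Ann. Hum. Genet 66: 183-193; Ito T. et al. (2003) Am. J. Hum. Genet. 72: 384-398). The data given are the genotypes (combinations, not ordered pairs) at multiple linked loci in multiple individuals belonging to a specific population. The quantities to be estimated are the haplotype frequencies of the population, which are the parameters, and the diplotype configuration of each individual (the combination, or ordered pair, of the two haplotypes the individual carries).
Consider the following experiment. For L linked loci, the frequencies of all possible haplotypes (2^L for SNP loci) are determined non-deterministically. Let these frequencies be Θ = (θ_1, ..., θ_M), where M is the number of possible haplotypes and θ_i (i = 1 to M) is the frequency of the i-th haplotype. According to Θ, two haplotypes are distributed non-deterministically to each of n individuals. Each individual is given two ordered haplotypes, which we call an ordered diplotype configuration. Let d_i be the ordered diplotype configuration given to the i-th individual, and let D = (d_1, d_2, ..., d_n).
One outcome of such an experiment is represented by (Θ, D), and once an outcome is fixed, the genotypes of all individuals at all L loci are fixed. "(Θ, D)" means Θ and also D. In fact, the genotypes of all individuals are determined by D alone (independently of Θ).
Because Θ is a continuous quantity, the sample space is uncountably infinite. Conditional on Θ, d_1, d_2, ..., d_n are mutually independent.
That is, the probability P(D|Θ) of D given Θ is
$$P(D \mid \Theta) = \prod_{i=1}^{n} P(d_i \mid \Theta).$$

Assuming Hardy-Weinberg equilibrium, if the haplotypes constituting d_i are the j-th and the k-th, the probability P(d_i|Θ) of d_i given Θ is
$$P(d_i \mid \Theta) = \theta_j \theta_k.$$

Let the observed genotypes of all individuals be G = (g_1, ..., g_n), where g_k, the genotype of the k-th individual at the L loci, is g_k = (g_k1, ..., g_kL), and g_kl is the genotype (a combination, not an ordered pair) of the k-th individual at locus l.
The likelihood function is simply the conditional probability, given Θ, of the set of outcomes (Θ, D) in the sample space Ω that are consistent with G:
$$L(\Theta) = P(G \mid \Theta) = \prod_{k=1}^{n} P(g_k \mid \Theta),$$
[P(g_k|Θ) is simply the probability of the set of events d_k that are consistent with g_k (d_k ∩ g_k).]
$$P(g_k \mid \Theta) = \sum_{(i,j) \in Q_k} \theta_i \theta_j,$$
where Q_k is the set of ordered pairs of indices of the haplotypes constituting the d_k consistent with g_k. The likelihood function is therefore
$$L(\Theta) = \prod_{k=1}^{n} \sum_{(i,j) \in Q_k} \theta_i \theta_j,$$
which is to be maximized over Θ.
This is maximized with the EM algorithm, and the probability distribution of d_k under the maximum-likelihood estimate is obtained. The EM algorithm (Expectation-Maximization algorithm) is a standard algorithm used to find the distribution of haplotype frequencies under which the observed SNP data are most likely. The maximum-likelihood estimate is the value of θ for which the likelihood of the data is greatest. The program implementing this is LDSUPPORT (Kitamura Y. et al. (2002) Ann. Hum. Genet. 66: 183-193).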

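is obtained.
The maximum-likelihood estimator of Θ satisfies the following equation.
$$\hat{\theta}_i = \frac{1}{2n} \sum_{D \text{ consistent with } G} n_i(D)\, P(D \mid G, \hat{\Theta}),$$
where n_i is the number of copies of the i-th haplotype under a specific D. The sum in the above expression is thus the expected number of copies of the i-th haplotype, averaged over all D consistent with G. This expresses the internal consistency of the maximum-likelihood estimate, and it gives rise to the following iterative EM procedure.
For all i (i = 1, 2, ..., M),
$$\theta_i^{(t+1)} = \frac{1}{2n} \sum_{D \text{ consistent with } G} n_i(D)\, P(D \mid G, \Theta^{(t)})$$
(where θ_i^(t) is the frequency of the i-th haplotype at step t)
is computed.
This is repeated and stopped when Θ^(t) has converged. The posterior probability of d_k under the maximum-likelihood estimator, for d_k composed of the i-th and j-th haplotypes with (i, j) in Q_k,
$$P(d_k \mid g_k, \hat{\Theta}) = \frac{\hat{\theta}_i \hat{\theta}_j}{\sum_{(i',j') \in Q_k} \hat{\theta}_{i'} \hat{\theta}_{j'}}$$
is also computed.
It can be seen that this method estimates complete information from information that is incomplete because each individual carries two haplotypes. The program ldpooled generalizes and extends this to a method in which many haplotypes are mixed into many small pools, each containing 2m haplotypes, and the complete information is estimated from the resulting incomplete information (Ito T. et al. (2003) Am. J. Hum. Genet. 72: 384-398). That is, ordinary haplotype estimation is simply the special case m = 1 of estimation from pools of 2m haplotypes.
The above is the method for estimating population haplotype frequencies and individual diplotype configurations from the genotypes of many individuals at multiple linked loci. It has been shown to give accurate estimates when linkage disequilibrium among the linked loci is strong (for example, when the loci are SNPs within one haplotype block).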
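The EM iteration described above can be illustrated with a minimal sketch for biallelic loci. It is not the LDSUPPORT program; it only follows the outline in the text (Hardy-Weinberg probabilities for ordered haplotype pairs, posterior weighting of the pairs consistent with each genotype, and re-estimation of the frequencies as expected counts divided by 2n). The genotype coding, function names, and stopping rule are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with one unphased genotype.

    genotype: tuple with the minor-allele count (0, 1 or 2) at each locus.
    """
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        else:  # heterozygous: phase unknown, so both orders are possible
            per_locus.append([(0, 1), (1, 0)])
    return [
        (tuple(a for a, _ in combo), tuple(b for _, b in combo))
        for combo in product(*per_locus)
    ]


def em_haplotype_frequencies(genotypes, max_iter=500, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    n_loci = len(genotypes[0])
    haplotypes = list(product((0, 1), repeat=n_loci))
    theta = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    pair_sets = [compatible_pairs(g) for g in genotypes]
    for _ in range(max_iter):
        expected = defaultdict(float)
        for pairs in pair_sets:
            # E-step: posterior weight of each ordered pair given the genotype
            weights = [theta[h1] * theta[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                expected[h1] += w / total
                expected[h2] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes
        new_theta = {h: expected[h] / (2 * len(genotypes)) for h in haplotypes}
        change = max(abs(new_theta[h] - theta[h]) for h in haplotypes)
        theta = new_theta
        if change < tol:
            break
    return theta


# Hypothetical data for two SNPs: four 0/0 homozygotes, four 1/1 homozygotes
# and one double heterozygote whose phase is ambiguous.
data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)]
estimated = em_haplotype_frequencies(data)
```

In this example the unambiguous individuals carry only the 0-0 and 1-1 haplotypes, so the iteration resolves the double heterozygote as 0-0 / 1-1 and the estimated frequencies of the 0-1 and 1-0 haplotypes shrink toward zero. Enumerating all 2^L haplotypes is feasible only for a small number of loci.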
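Haplotype estimation is most commonly performed with the EM algorithm described above, but other methods are also known, including the Clark method (Clark AG (1990) Mol. Biol. Evol. 7: 111-122), the PHASE method (Stephens M. et al. (2001) Am. J. Hum. Genet. 68: 978-989), and the PL method (Niu T. et al. (2002) Am. J. Hum. Genet. 70: 157-169); the method used for assigning Rare SNPs is not limited to the EM method.
When the assignment of Rare SNPs within haplotype blocks to the major haplotypes of those blocks was analyzed in this way, the proportion of minor alleles of a Rare SNP assigned to the majority-assigned haplotype was in most cases close to 100%. Some Rare SNPs were assigned to two or more major haplotypes, but the proportion of their minor alleles assigned to haplotypes other than the majority-assigned haplotype was low.
(iii) Method for analyzing in-block Rare SNPs
As described above, for Rare SNPs within a haplotype block, most of the minor alleles were found to be assigned to a single major haplotype. In such cases, even if a Rare SNP related to the phenotype cannot itself be detected, it can be sought using the major haplotype as a tag.
Known causes of severe adverse drug reactions include dihydropyrimidine dehydrogenase deficiency, which is related to the side effects of 5-fluorouracil and its derivatives, and thiopurine methyltransferase deficiency, which is related to the side effects of 6-mercaptopurine and its derivatives. Rare SNPs are involved in these. When the cause is already known, as in these cases, the causal gene and the responsible mutation are easily identified, but in many cases the causal (associated) gene and the responsible mutation are unknown. Moreover, because the minor allele frequency of such SNPs is low, they are often missed in SNP screening.
In such cases a method can be devised that detects Rare SNPs using major haplotypes as tags. That is, Common SNPs (or, as described later, htSNPs) are used to compare the frequency of haplotypes, or of individuals having such haplotypes, between the population with severe side effects and a control population. For the haplotype block of a gene carrying a haplotype that differs between the populations, all individuals who showed severe side effects are sequenced. In this way, Rare SNPs that cause severe side effects but could not be found at the screening stage can be discovered.
(iv) Method for analyzing out-of-block Rare SNPs
The most difficult analysis is considered to be the search for Rare SNPs outside haplotype blocks that are related to a phenotype (for example, severe side effects).
As noted above, Rare SNPs generally cannot be found by screening, so an approach that uses Common SNPs as tags to find them is desirable.
Linkage disequilibrium between SNPs is observed even outside blocks. Therefore, for out-of-block Common SNPs, frequencies are compared between the population with severe side effects and the control population. If there is a significant difference between the two populations, Rare SNPs related to the side effect can be found by sequencing samples from individuals with severe side effects around that Common SNP.
(v) Analysis using combinations of htSNPs and Rare SNPs
In the present invention, combining htSNPs with Rare SNPs allows haplotypes to be estimated more accurately. The combination is preferably that of all htSNPs in the block with one of the Rare SNPs.
The relationship between Rare SNPs and major haplotypes is analyzed as follows.
To describe the relationship between Rare SNPs and haplotypes, a probability space must be defined. For this purpose, the present invention adopts the concepts of the "complete haplotype" and the "incomplete haplotype".
Suppose there is a haplotype block containing N linked polymorphic loci in total. Let the i-th complete haplotype be "H_i" (FIG. 19A). A "complete haplotype" is the list of alleles obtained when one allele is specified at each of the N loci (N alleles in total). Let the set of complete haplotypes carried by all individuals in the population be the sample space Ω (FIG. 19A). H_i is now redefined as the subset of Ω whose elements are H_i. In other words, the i-th complete haplotype list in the sample space Ω is first defined as H_i, and the set whose elements are such H_i is then redefined as "H_i".
Let "X" be one low-frequency allele at one of the N loci in the block (FIG. 19A). The set of complete haplotypes in the population whose list contains X (the low-frequency allele) is then redefined as X (the set). This redefined X is a subset of Ω (FIG. 19A).
Let A_i be an incomplete haplotype, for example the i-th haplotype defined only by the htSNPs in the block (FIG. 19B). An "incomplete haplotype" is the list of alleles obtained when alleles are specified at fewer than all N loci. For example, A_2 consists of the two alleles marked ht among four loci. The set of complete haplotypes in the population whose list at the htSNP loci is A_i is then redefined as A_i. A haplotype defined only by htSNPs is part of a complete haplotype and is therefore an incomplete haplotype. With these definitions, the complete haplotypes H_i, the incomplete haplotypes A_i, and the low-frequency allele X at one SNP locus are all defined as subsets of Ω and can therefore be interpreted as events on the same probability space. The complementary events, namely the sets
$$\bar{H}_i, \quad \bar{A}_i, \quad \bar{X}$$
can also be defined. Such concepts are difficult to define in any other way.
The relationship between one Rare SNP and the haplotypes formed by the htSNPs in the block is analyzed as follows: haplotypes are estimated using all htSNPs in the block together with one of the Rare SNPs (defined as frequency < 0.1).
Haplotypes can be estimated with the LDSUPPORT (Kitamura et al. 2002) or PHASE (Stephens et al. 2001) software.
"LDSUPPORT" is a program that simultaneously estimates the population haplotype frequencies and the diplotype distribution of each individual (the posterior distribution of diplotype configurations) based on the expectation-maximization (EM) algorithm, whereas "PHASE" estimates haplotypes using a Markov chain Monte Carlo method and a coalescence model. After haplotype estimation, the following four joint probabilities:
$$\text{(a) } P(A_i, X), \quad \text{(b) } P(\bar{A}_i, X), \quad \text{(c) } P(A_i, \bar{X}), \quad \text{(d) } P(\bar{A}_i, \bar{X})$$
are calculated for each A_i.
"P(A_i, X)" in (a) is the probability of A_i and X; similarly, (b) is the probability of not A_i and X, (c) the probability of A_i and not X, and (d) the probability of neither A_i nor X.
Using the probabilities so defined, the following can be calculated in the present invention.
$$P(A_i \mid X) = \frac{P(A_i, X)}{P(A_i, X) + P(\bar{A}_i, X)}$$
P(A_i|X) is the probability of A_i given X; in other words, of the haplotypes carrying X, whether in the set A_i or in its complement, it is the proportion that lie in A_i.
Let A_j be the set that maximizes P(A_i|X) among the haplotypes; A_j can be chosen using the above expression. This is the incomplete haplotype to which X is assigned. The expression for choosing A_j is as follows.
$$A_j = \underset{A_i}{\arg\max}\; P(A_i \mid X)$$
This expression is calculated as a measure of the assignment of a Rare SNP to an incomplete haplotype. The procedure for assigning Rare SNPs to incomplete haplotypes is explained in detail in Example 3, taking one block as an example.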
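The choice of A_j can be sketched as follows, starting from haplotype frequencies estimated over the htSNPs plus one Rare SNP. The input layout (a dictionary keyed by tag haplotype and Rare SNP allele), the function name, and the example frequencies are hypothetical; the calculation itself is the conditional probability P(A_i|X) described above, with P(X) obtained as the sum of P(A_i, X) over all A_i.

```python
from collections import defaultdict


def assign_rare_allele(joint_frequencies):
    """Find the incomplete (tag) haplotype A_j that maximizes P(A_i | X).

    joint_frequencies: dict mapping (tag_haplotype, rare_allele) -> frequency,
    where rare_allele is 1 for the low-frequency allele X and 0 otherwise.
    Returns (A_j, {A_i: P(A_i | X)}).
    """
    p_ai_and_x = defaultdict(float)
    for (tag_haplotype, rare_allele), freq in joint_frequencies.items():
        if rare_allele == 1:
            p_ai_and_x[tag_haplotype] += freq
    p_x = sum(p_ai_and_x.values())
    if p_x == 0:
        raise ValueError("the low-frequency allele X does not occur")
    conditional = {a: f / p_x for a, f in p_ai_and_x.items()}
    best = max(conditional, key=conditional.get)
    return best, conditional


# Hypothetical estimated frequencies: two htSNPs (coded 0/1) plus one Rare SNP.
estimated = {
    ("00", 0): 0.500,
    ("01", 0): 0.300,
    ("11", 0): 0.160,
    ("11", 1): 0.035,
    ("01", 1): 0.005,
}
a_j, p_given_x = assign_rare_allele(estimated)
```

In this example the low-frequency allele has a total frequency of 0.04, of which 87.5% lies on tag haplotype "11", so "11" plays the role of A_j.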
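(vi) Simulation for detecting phenotype-related Rare SNPs with haplotypes formed by htSNPs
Rare SNPs have a high probability of being associated with a phenotype, but in the present invention the relationship with the phenotype can also be analyzed using tag haplotypes instead of the Rare SNPs themselves.
The probability of finding significance when A_j is used in place of X depends on quantities such as the following:

Here, M_1 and M_2 are the numbers of affected individuals and controls, respectively, and

is a complement set.
The algorithm for the simulation is as follows.
In this simulation, only X (that is, the locus of the low-frequency allele of a given Rare SNP) was assumed to be directly related to the phenotype. The incomplete haplotype A_j to which most of X is assigned may also be related to ψ (the set of complete haplotypes in affected individuals), but only through the relationship between X and ψ. Because X is assumed to be related to the phenotype, its frequency is considered to differ between the affected and unaffected populations.
That is,

In a case-control study, it is the difference between these frequencies that is tested. Here, r is defined as the following ratio.

The present inventors considered whether A_j is a good marker for detecting X. X [...] ψ

Because X is the only locus directly related to the phenotype, and the relationship between A_j and the phenotype exists only through X, the following equality holds.

Bayes' theorem gives the following equality.

From equality (3), equality (5) becomes:

As with P(A_j|X, ψ), the following expression:

[...]. The reason is that the following expression:

can be calculated from (1) and (2).

The reason is that P(ψ) is very small, so that

is the only case considered. That is, the frequency of X in the control population was taken to be the same as its frequency in the population as a whole. Equalities (1) and (2) therefore become:

was calculated for each Rare SNP, and

was calculated. The results of calculating the following expression:

are illustrated in the Examples.
Before the simulation, various combinations of values were given to the ratio r and to the numbers of affected individuals and controls, M_1 and M_2, respectively. In the simulation, 2M_1 haplotypes were given to the affected population from a binomial distribution with frequency parameter P(A_j|ψ), and 2M_2 haplotypes were given to the control population from a binomial distribution with the frequency parameter:


[...] was performed on the [...]×2 contingency table. The proportion of replicates showing P < 0.01 was taken as the empirical power. For each Rare SNP, 5,000 replicates were run. This test aims to detect the difference between

Therefore, the ratio

is extremely important. From equations (7) and (8), this ratio becomes:

Ratio (9) therefore depends on


If this holds, ratio (9) becomes 1. In that case, the difference between

can no longer be detected. The simulation software ANASSIGN was written in C by the present inventors.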
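The structure of such a simulation (binomial sampling of 2M_1 and 2M_2 haplotypes, a test on the resulting contingency table, and the proportion of replicates with P < 0.01 as the empirical power) can be sketched as follows. This is not the ANASSIGN program. The two frequency parameters are passed in directly rather than derived from r, because the equations that link them are not recoverable from the text, and the chi-square test is an assumption, since the name of the test is also lost. All function names are hypothetical.

```python
import math
import random


def chi_square_p_value(a, b, c, d):
    """P-value of the Pearson chi-square test (1 df) on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a degenerate table carries no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))


def empirical_power(p_case, p_control, m1, m2, replicates=5000, alpha=0.01, seed=0):
    """Proportion of simulated case-control samples in which the frequency of
    haplotype A_j differs significantly (P < alpha).

    p_case, p_control: frequency of A_j among case and control haplotypes.
    m1, m2: numbers of cases and controls (2*m1 and 2*m2 haplotypes are drawn).
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(replicates):
        case_hits = sum(rng.random() < p_case for _ in range(2 * m1))
        control_hits = sum(rng.random() < p_control for _ in range(2 * m2))
        p_value = chi_square_p_value(
            case_hits, 2 * m1 - case_hits, control_hits, 2 * m2 - control_hits
        )
        if p_value < alpha:
            significant += 1
    return significant / replicates


# A large frequency difference should almost always be detected, whereas
# equal frequencies should be flagged in roughly alpha of the replicates.
power_large_difference = empirical_power(0.5, 0.1, 100, 100, replicates=200)
power_no_difference = empirical_power(0.2, 0.2, 100, 100, replicates=200)
```

The replicate count is reduced to 200 here only to keep the example fast; the text uses 5,000 replicates per Rare SNP.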
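4. Association analysis with phenotypes
Phenotypes include phenotypes relating to sensitivity to drugs or foreign substances and phenotypes relating to disease.
Phenotypes relating to sensitivity to drugs or foreign substances include pharmacokinetics (blood concentration in particular), drug efficacy (including disappearance of disease markers and improvement of clinical symptoms), disease susceptibility and its degree, and the presence and severity of side effects. Pharmacokinetics means the behavior of a drug in the body after administration (for example, its blood concentration) and includes absorption, distribution, metabolism, and excretion. A foreign substance means any non-physiological substance that enters the human body orally or parenterally and that is a determinant of susceptibility to illness. Examples include chemical substances in tobacco, air pollutants, drinking-water pollutants, food additives, and agricultural chemicals. Sugar in the case of diabetes, salt in the case of hypertension, and also viruses, bacteria-derived substances (endotoxin, etc.), environmental hormones, and various genetic predispositions may also act as foreign substances.
Phenotypes relating to disease include the pathological condition when suffering from a disease such as cancer, whether the disease develops on exposure to a risk factor such as a carcinogen, the presence of complications of the disease, and the presence of recurrence. Diseases include, for example, malignant tumors, immune diseases, cardiovascular diseases, metabolic diseases, renal and urological diseases, respiratory diseases, and musculoskeletal diseases, and one or more of these diseases can be analyzed. For lifestyle-related diseases such as diabetes and hypertension, phenotypes include hyperglycemia, the presence of diabetic complications, systolic blood pressure, diastolic blood pressure, and the presence of heart failure as a complication of hypertension.
Disease susceptibility includes whether, and to what degree, an individual is likely to contract a disease.
Using the above phenotypes as indices, associations can be examined between one test group and another, or between a test group and a control group.
5. Use of analysis results
The results obtained as above can be used in methods for predicting sensitivity to drugs or foreign substances, for selecting treatment or prevention methods, for selecting diagnostic markers, for selecting drugs, and for determining the appropriate dose of a drug for preventing or treating a disease.
They can also be applied to methods for examining drug-drug interactions. A "drug-drug interaction" means an effect that, when multiple drugs are administered at the same time, differs qualitatively or quantitatively from the effect expected when each drug is administered independently. As a result, a particular side effect may be enhanced or efficacy diminished. Because taking multiple drugs can cause problems such as enhanced side effects or diminished efficacy, this method is useful for identifying the SNPs responsible.
Furthermore, the analysis results can be used in methods for identifying associated polymorphisms. An "associated polymorphism" means a polymorphism that causes a given phenotype, or one that alters the phenotype quantitatively or qualitatively through various mechanisms; it can be used to identify the causal locus of a disease or to analyze genes related to sensitivity to foreign substances or drugs.
6. Computer program
FIG. 6 shows an example configuration of the means by which a computer executes the haplotype analysis program of the present invention.
As shown in FIG. 6, the haplotype analysis system of the present invention comprises a CPU 601, a ROM 602, a RAM 603, an input unit 604, a transmission/reception unit 605, an output unit 606, a hard disk drive (HDD) 607, a CD-ROM drive 608, and so on.
The CPU 601 (also called an MPU) controls the entire polymorphism analysis data processing system according to a program stored in the information storage means of the host computer (for example, a magnetic and/or optical recording medium). It supplies information received from the input unit 604 and elsewhere to the output unit 606. It can also execute analysis processing based on information received through the network line 609, such as SNP information from NCBI (http://www.ncbi.nlm.nih.gov/). The input unit 604 is a keyboard, mouse, or the like, and is operated to enter the conditions or data needed for the analysis. The ROM 602 stores programs and the like that direct the processing needed to operate the analysis system of the present invention. The RAM 603 temporarily stores the data needed to execute processing in the analysis system.
The transmission/reception unit 605 performs information communication (data transmission and reception) with the network line 609 and the like on instruction from the CPU 601; examples include a modem and a router. The output unit 606 displays the polymorphism analysis data of genes entered from the input unit 604, and various other conditions, on instruction from the CPU 601 (for example, on a display screen or printer). The CD-ROM drive 608 reads, on instruction from the CPU 601, the programs or data stored on a CD-ROM for operating the analysis system and stores them, for example, in the RAM 603. A writable CD-R or CD-RW can be used as the recording medium instead of a CD-ROM, in which case a CD-R or CD-RW drive is provided instead of the CD-ROM drive 608. Media such as DVD and MO, with corresponding drives, may also be used.
The program of the present invention can be written in, for example, C, Java, Perl, Fortran, or Pascal, and is designed to be cross-platform. The software can therefore run on Windows (registered trademark) 95/98/2000/XP and the like, Linux, UNIX (registered trademark), and Macintosh.
7. Computer recording medium
The program of the present invention can be stored on a computer-readable recording medium or in storage means connectable to a computer. A computer recording medium or storage means containing the program of the present invention is also included in the present invention. Examples include magnetic media (flexible disks, hard disks, etc.), optical media (CD, DVD, etc.), and magneto-optical media (MO, MD).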
以下、実施例により本発明をさらに具体的に説明する。但し、本発明はこれら実施例に限定されるものではない。Hereinafter, the present invention will be described in detail.
1. Summary of the present invention
There are many gene polymorphisms, such as SNPs, on the genome. Using these polymorphisms as indices, it is possible to examine, for each polymorphism, drug efficacy, the presence or absence of side effects, or differences in their strength. An attribute of an individual other than genotype, such as drug efficacy, the presence or absence of a side effect, or its strength, is called a phenotype. However, examining each of these many polymorphisms one by one is labor-intensive and inefficient. Moreover, an unknown polymorphism may exist and be related to the phenotype.
In a diploid individual such as a human, a polymorphism present at a specific position on a chromosome (called a locus) has a unit called the genotype. The genotype is a combination of two gene units, one derived from each parent; each such unit is called an allele. For multiple linked loci on the same chromosome, the combination of alleles derived from one parent is called a haplotype.
It is known that alleles at multiple nearby loci on a chromosome are inherited from one generation to the next as a set, and this set is the haplotype.
In general, the size of a mating population is finite, and population genetics suggests that a finite mating population contains far fewer haplotypes than would be expected from the number of alleles at each locus. Indeed, actual human populations often contain far fewer haplotypes than expected from the number of alleles. When the alleles at two loci on a chromosome are independent of each other in a population, the loci are said to be in linkage equilibrium. In linkage equilibrium, the haplotype frequency equals the product of the allele frequencies at each locus. In contrast, when the alleles at two loci are not independent, the loci are said to be in linkage disequilibrium, and the haplotype frequency deviates from the product of the allele frequencies at each locus.
For loci located near one another on a chromosome, a region in which linkage disequilibrium between the loci is extremely strong, and in which only a few haplotypes exist in the population, is called a haplotype block. Because recombination rarely occurs between loci within a haplotype block, a haplotype within the block is almost always inherited across generations as a unit.
To determine the association between genetic polymorphisms and phenotypes such as drug susceptibility, one could consider a single locus at a time, but a phenotype is often associated not with a single locus but with a haplotype. Such haplotype effects on a phenotype are seen when two or more loci interact. Moreover, the genotype of an unknown polymorphic locus cannot be examined in the first place. However, as described later, the low-frequency allele of a Rare polymorphism is often carried on a single haplotype. Haplotypes can therefore be used to find associations between genetic polymorphisms and phenotypes such as disease and drug susceptibility, and linkage disequilibrium can be exploited to identify the polymorphisms that are truly related to drug responsiveness. No specific method for doing so has been available, however. The present invention therefore provides a specific method for examining drug responsiveness using haplotypes.
In the present invention, the analysis of the relationship between phenotype and gene polymorphism is made more efficient by constructing haplotype blocks and tags. That is, regardless of whether a given polymorphism is associated with the phenotype (for example, whether or not it is causative), a polymorphism present at a predetermined frequency on the haplotype block is selected as a landmark, and using the selected polymorphism as a tag makes the analysis more efficient. In the present invention, this landmark is referred to as a “tag polymorphism”. A haplotype composed of tag polymorphisms (referred to as a “tag haplotype”), or a combination of such haplotypes, is then used to analyze the relationship with the phenotype. A “tag haplotype” means the minimum combination of polymorphisms that can account for haplotypes making up a given frequency. Examples of that frequency are at least 70%, preferably 80%, more preferably 90%, and still more preferably 95%.
In other words, suppose that all Common polymorphisms in a block are used to distinguish haplotypes. Tag polymorphisms are selected from among those polymorphisms and are the polymorphisms that can distinguish haplotypes with nearly the same efficiency (at least 70%, preferably 95%) as when all Common polymorphisms are used. All combinations of tag polymorphisms that define a frequency of 70 to 95% are included in the present invention. A tag haplotype is a haplotype composed only of tag polymorphisms, and each tag haplotype corresponds almost one-to-one with a haplotype defined using all Common polymorphisms. In rare cases multiple haplotypes may correspond to a single tag haplotype, but the summed frequency of all such haplotypes is less than 5%.
The combination of tag haplotypes is not limited, and at least one tag haplotype is sufficient. All combinations of tag polymorphisms contained within a block can also be used.
In the present invention, in addition to the tag polymorphisms or combinations thereof, a polymorphism that is not incorporated into a haplotype block, that is, a polymorphism lying outside the blocks (referred to as an “out-of-block polymorphism”), can be combined with the tag polymorphisms.
Furthermore, once the haplotype blocks have been constructed, the association between the phenotype and polymorphisms whose minor allele frequency in the population is below a certain value (referred to as “Rare polymorphisms”), or combinations thereof, is analyzed.
In the present specification, SNPs are described as examples of polymorphisms where appropriate, but polymorphisms are not limited to SNPs. In this specification, the term “SNPs” may also denote a single SNP.
As shown in FIG. 1, first, the genes to be subjected to haplotype analysis are collected from a test population, and SNP information is obtained from those genes (FIG. 1, S1). The “test population” here is the population from which the basic SNP information used for examining association with the phenotype is obtained, and it preferably contains a large number of individuals. In the present invention, the number of individuals in the population is 100 or more, preferably 200 or more.
Next, using the information on the detected SNPs, the proportion of the test population carrying each SNP is calculated, and SNPs present at or above a certain frequency (for example, 0.1) are selected as “Common SNPs” (S2). The counterpart of Common SNPs is “Rare SNPs” (which may also be called “Uncommon SNPs”), specified as SNPs present below that frequency (for example, below 0.1) (S3). The details of Rare SNPs are described later. A haplotype block is then constructed from the Common SNPs according to a predetermined algorithm (S4), and SNPs inside the block are distinguished from SNPs outside it (S5). When in-block SNPs are used, the Common SNPs are specified (S6), and tag SNPs (also referred to as “htSNPs”) are selected from them (S7). Tag SNPs are the SNPs used to investigate the relationship with phenotypes such as drug efficacy and side effects in (i) a test population exposed to a drug or foreign substance, (ii) a test population that may be exposed to a drug or foreign substance, or (iii) a test population exposed to disease risk factors (the populations of (i) to (iii) are referred to as case populations). A “disease risk factor” means a circumstance that causes a disease such as a tumor, and includes an individual's genetic background, sex, age, preferences or lifestyle (for example, alcohol and tobacco), chemical substances, and the like.
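The Common/Rare split of steps S2 and S3 can be sketched as follows. This is a minimal illustration, assuming biallelic SNPs and allele observations already collected per chromosome; the function names, the sample data, and the choice to leave monomorphic sites unclassified are assumptions made for the example and are not taken from the specification.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Frequency of the second most common allele among the observed chromosomes.

    `alleles` is any iterable of single-allele observations (here a string,
    one character per chromosome). Monomorphic sites return 0.0.
    """
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    total = sum(n for _, n in counts)
    return counts[1][1] / total

def classify_snps(allele_table, threshold=0.1):
    """Split SNP ids into (common, rare) by minor allele frequency.

    The 0.1 threshold is the example value given in the text; monomorphic
    sites (frequency 0) are left out of both lists, which is an assumption.
    """
    common, rare = [], []
    for snp_id, alleles in allele_table.items():
        maf = minor_allele_frequency(alleles)
        if maf >= threshold:
            common.append(snp_id)
        elif maf > 0:
            rare.append(snp_id)
    return common, rare

# Hypothetical observations for 100 chromosomes per SNP.
allele_table = {
    "SNP_a": "G" * 70 + "A" * 30,  # minor allele frequency 0.30
    "SNP_b": "C" * 96 + "T" * 4,   # minor allele frequency 0.04
    "SNP_c": "T" * 100,            # monomorphic
}
print(classify_snps(allele_table))
```

With the hypothetical data above, only SNP_a passes the 0.1 threshold and SNP_b falls into the Rare list.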
Using these tag SNPs, or a combination of tag SNPs and Rare SNPs, the haplotype frequencies, or the frequencies of individuals carrying those haplotypes, are compared between case populations, or between a case population and a control population, and tested for association with the phenotype (S8).
On the other hand, when out-of-block SNPs are used (S9), the association with the phenotype can be tested with them alone or in combination with the tag SNPs. That is, the frequencies of the polymorphisms (SNPs), or of individuals carrying those SNPs, are compared between case populations, or between a case population and a control population, and tested for association with the phenotype.
FIG. 2 is a schematic diagram showing 16 SNPs on a certain gene X. All sixteen are SNPs carried by at least a certain proportion of the population, that is, Common SNPs (Common polymorphisms). Assuming that SNP1 to SNP6, SNP8 to SNP10, and SNP12 to SNP15 each form haplotypes, block A, block B, and block C are constructed (FIG. 2). SNP7, SNP11, and SNP16 are out-of-block SNPs.
FIG. 3 shows haplotype block A extracted from FIG. 2. As shown in FIG. 3, SNP1 is G/A, SNP2 is G/A, SNP3 is T/C, SNP4 is G/A, SNP5 is A/T, and SNP6 is C/G, and four haplotype patterns, a to d, are assumed to exist. Haplotype a is “G-A-T-G-T-C”, b is “G-A-T-A-A-C”, c is “A-A-C-G-A-C”, and d is “G-A-C-A-T-C”. The tag SNPs are assumed to be SNP2, SNP3, and SNP5 (marked with * in FIG. 3).
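How the four haplotypes of block A reduce to tag haplotypes can be illustrated with a short sketch. It assumes that haplotype d reads G-A-C-A-T-C and uses the tag positions stated for FIG. 3; the function names are hypothetical.

```python
# Haplotypes of block A as listed for FIG. 3 (SNP1 to SNP6, left to right).
HAPLOTYPES = {
    "a": "GATGTC",
    "b": "GATAAC",
    "c": "AACGAC",
    "d": "GACATC",  # assumed reading of the garbled "GAC-ATC"
}

def tag_haplotype(haplotype, tag_positions):
    """Alleles of one haplotype at the given 1-based SNP positions."""
    return "-".join(haplotype[p - 1] for p in tag_positions)

def project(haplotypes, tag_positions):
    """Map each haplotype name to its tag haplotype."""
    return {name: tag_haplotype(h, tag_positions) for name, h in haplotypes.items()}

tags = project(HAPLOTYPES, (2, 3, 5))
print(tags)
# Every haplotype maps to a different tag haplotype, so SNP2, SNP3 and SNP5
# are enough to tell a, b, c and d apart.
print(len(set(tags.values())) == len(HAPLOTYPES))
```

Projecting onto SNP2 and SNP3 only groups a with b (A-T) and c with d (A-C), which is the grouping used in the side-effect example later in this section.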
Block A is an example of a haplotype composed of six SNPs; the number of SNPs is not limited to six.
Here, if one haplotype block contains N SNPs, the number of possible combinations of SNPs, that is, the number of possible haplotype patterns, is 2^N. However, according to population genetics, chance strongly influences the spread or disappearance of polymorphisms and haplotypes in a finite population. The number of haplotype patterns actually observed therefore does not equal the number of all SNP combinations but is smaller. In the example shown in FIG. 3, block A has not 2^6 = 64 haplotype patterns but only four. By using haplotypes, the relationship between phenotype and haplotype can thus be analyzed efficiently with a small number of combinations, and more efficiently than by analyzing each polymorphism individually.
Furthermore, as described above, the present invention specifies which of the many polymorphisms in a haplotype block, and which combinations of them, should be used when analyzing the relationship between haplotype and phenotype. For example, it provides the combination of SNPs (htSNPs, out-of-block SNPs, Rare SNPs, etc.) whose information should be used. The relationship between one phenotype and another is then analyzed using the narrowed-down htSNPs, out-of-block SNPs, Rare SNPs, or combinations thereof as markers (FIG. 1, S8).
In practice, the genotypes of individuals taking the drug are determined for all htSNPs and out-of-block Common SNPs, and the relationship with the phenotype is analyzed. For htSNPs, the relationship between the phenotype and the haplotypes formed by the htSNPs (tag haplotypes) is tested; for out-of-block Common SNPs, the relationship between each SNP and the phenotype is tested. The association test uses the htSNPs, out-of-block SNPs, and Rare SNPs to compare, with respect to a specific phenotype, the haplotype frequencies or the frequencies of individuals carrying those haplotypes, or the SNP frequencies or the frequencies of individuals carrying those SNPs (FIG. 1, S8). This narrows down which of the innumerable SNPs should be analyzed.
If the tag haplotype is associated with the phenotype, all polymorphisms in the haplotype block involved in that association, including the htSNPs, are selected and analyzed comprehensively to estimate or identify the polymorphism related to the phenotype (FIG. 1, S10).
In FIG. 3, suppose that, after a drug is administered to subjects, the group with no or few side effects is called the safety group and the group with strong side effects is called the side effect group, and that haplotype frequencies in the two groups are compared. For the tag haplotypes formed by SNP2 and SNP3, the frequency of A-T is 80% in the safety group but 20% in the side effect group, whereas the frequency of A-C is 20% in the safety group but 80% in the side effect group.
Whether the haplotype frequencies differ between the safety group and the side effect group is then assessed by a statistical test, for example the chi-square test of independence.
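As an illustration of such a test, the sketch below applies the chi-square test of independence to a 2 x 2 table of tag haplotype counts. Only the 80%/20% frequencies come from the example; the group size of 100 chromosomes per group is an assumption made so that counts can be formed, and the function name is hypothetical. No continuity correction is applied.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2 x 2 table.

    `table` is [[a, b], [c, d]] with rows = groups and columns = tag haplotypes.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: safety group, side effect group. Columns: A-T, A-C.
# Counts assume 100 chromosomes per group (hypothetical).
counts = [[80, 20],
          [20, 80]]
chi2, p = chi_square_2x2(counts)
print(chi2, p)
```

With these assumed counts the statistic is 72 on one degree of freedom, so the difference in tag haplotype frequency between the two groups would be judged significant at any conventional level.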
The haplotype patterns matching these differences are haplotypes a and b, which carry tag haplotype A-T (20% frequency in the side effect group), and haplotypes c and d, which carry tag haplotype A-C (80% frequency in the side effect group).
Because an association was found between the phenotype and a specific tag haplotype (the “A-C” combination carried by haplotypes c and d) of a specific block (block A), the haplotype block involved in the association (block A) is selected, and, starting from the tag SNPs, all polymorphisms present in haplotypes c and d are selected. That is, the haplotypes containing the specific associated tag haplotype are selected from the major haplotypes constructed in advance for the block, and it is determined whether those haplotypes are related to the phenotype. “Major haplotypes” means the haplotypes selected in descending order of population frequency until their summed frequency reaches 95% or more.
In the above example, the relationship with the phenotype is analyzed for all SNPs in the block, starting from SNP2, SNP3, or SNP5. The haplotypes selected at this point are haplotypes (SNP1 to SNP6) that include all Common SNPs in the block, not only the htSNPs.
To determine whether Rare SNPs are associated with the phenotype, the associated block can be sequenced to find the relevant Rare SNPs. If a specific out-of-block Common SNP is associated with the phenotype, sequencing the surrounding region will reveal the relevant SNPs.
Through the above process, when a true association exists between the phenotype and in-block Common SNPs or in-block Rare SNPs, the SNPs responsible for the association can be found. Whether such an association exists is judged by the following criteria.
(A) Whether any of the Common SNPs constituting the specific haplotype, or a combination of those Common SNPs, is associated with the phenotype.
(B) Whether Rare SNPs assigned to the specific haplotype are associated with the phenotype.
By identifying the SNPs responsible for the difference, SNPs related to the cause of a disease can be found, enabling diagnosis, drug selection, or identification of causal SNPs.
Each step of the method of the present invention or each means of the system of the present invention will be described in detail below.
2. Genetic polymorphism
Gene polymorphisms include single nucleotide polymorphisms, insertion/deletion polymorphisms, and polymorphisms arising from differences in the number of repeats of a base sequence. Single nucleotide polymorphisms (SNPs) generally mean polymorphisms in which one specific base in a gene or its complementary strand region is replaced with another base; in addition to polymorphisms resulting from such substitution, polymorphisms resulting from deletion of that base and from insertion of one base at that position are also included.
Insertion/deletion polymorphisms are polymorphisms caused by deletion or insertion of multiple bases (for example, two to several tens of bases); in some, several hundred to several thousand bases are deleted or inserted. Polymorphisms arising from differences in repeat number are sequences in which a unit of two to several tens of bases is repeated, with the number of repeats varying among individuals. Those with a repeat unit of several to several tens of bases are called VNTRs (variable number of tandem repeats), and those with a unit of about 2 to 4 bases are called microsatellite polymorphisms. In VNTRs and microsatellite polymorphisms, variation arises from differences in repeat number between the alleles of individuals.
Gene polymorphism information can be obtained with general gene polymorphism detection methods. For example, sequencing, PCR, fragment length polymorphism assays, hybridization methods using allele-specific oligonucleotides (e.g., the TaqMan PCR method, the invader method, the DNA chip method), methods using primer extension reactions, the MALDI-TOF/MS method, and the like can be employed. PCR and sequencing can be used to detect any gene polymorphism, and the other methods are used mainly to detect SNPs.
The TaqMan PCR method uses PCR with fluorescently labeled allele-specific oligonucleotides and Taq DNA polymerase (Livak, K. J., Genet. Anal. 14, 143 (1999); Morris, T. et al., J. Clin. Microbiol. 34, 2933 (1996)). The invader method combines hybridization of two reporter probes, each specific to one allele of the SNP, and one invader probe to the template DNA with cleavage of the DNA by an enzyme having a special endonuclease activity that recognizes and cleaves the resulting DNA structure (Livak, K. J., Biomol. Eng. 14, 143-149 (1999); Morris, T. et al., J. Clin. Microbiol. 34, 2933 (1996); Lyamichev, V. et al., Science, 260, 778-783 (1993), etc.).
As a method using a primer extension reaction, the SniPer method, for example, can be adopted. The SniPer method is based on a technique called RCA (rolling circle amplification), in which DNA polymerase continuously synthesizes complementary DNA while moving along a circular single-stranded DNA template. In this method, SNPs are determined by measuring the presence or absence of a color reaction that occurs when DNA amplification takes place (Lizardi, P. M. et al., Nature Genet., 19, 225-232 (1998); Piated, A. S. et al., Nature Biotech., 16, 359-363 (1998)).
The sequencing method analyzes the frequency of gene polymorphisms (especially SNPs) by amplifying a region containing the polymorphism by PCR and sequencing the DNA using dye terminators or the like.
The MALDI-TOF/MS method uses a mass spectrometer and, in essence, genotypes SNPs from the difference in mass between different single bases. There are methods using PCR amplification and multiplex methods (Haff, L. A., Smirnov, I. P., Genome Res., 7, 378- (1997); Little, D. P. et al., Eur. J. Clin. Chem. Clin. Biochem., 35, 545- (1997); Ross, P. et al., Nat. Biotechnol., 16, 1347- (1998)).
In the DNA chip method, many kinds of DNA probes are arrayed and fixed on a substrate such as glass, labeled DNA is hybridized to them, and the label signal (fluorescence, etc.) on each probe is detected. Perfect matches and single-base mismatches are thereby distinguished by hybridization.
Commercially available kits can be used to analyze polymorphism information such as SNPs, and the analysis can be automated.
In the present invention, the gene to be analyzed is not particularly limited. For example, at least one of the following genes can be analyzed (Table 1). In Example 5, the present invention is carried out using NAT2, an enzyme involved in sulfasalazine metabolism. Like NAT2, alcohol dehydrogenase (ADH), for example, is a well-known enzyme and can be analyzed in the present invention.

Each gene shown in Table 1 is considered to perform some function between the time a drug is administered and the time it is excreted after absorption, distribution, and metabolism. In some cases a single gene carries out this function; in others, a gene forms a network with other genes and multiple genes act in concert. Accordingly, in the present invention, the relationship between polymorphism information and phenotype may be determined not only by analyzing a single gene but also by analyzing all or several of the genes shown in Table 1.
Table 2 shows an example of information on SNPs existing in the genes shown in Table 1 and analyzed in the present invention.

In Table 2, the “number” column shows the identification number (for example, L1, L2) of each SNP in the block. “Contig pos” is a number indicating the position of the SNP on the genome. “Rs number” is the number under which the SNP information is registered at NCBI (National Center for Biotechnology Information). SNP information can be obtained by entering the rs number on the NCBI website (http://www.ncbi.nlm.nih.gov/). For example, entering “rs2293004” with the item “SNP” selected on the NCBI search screen displays an information screen for that SNP at the following URL.
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2293004
The SNP information for rs2293004 at that URL is “G/A”.
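The lookup address follows a fixed pattern, so it can be built from an rs number mechanically. The sketch below only reproduces the URL pattern quoted above; the function name is hypothetical, and whether the address still resolves on the current NCBI site is not checked here.

```python
def ncbi_snp_url(rs_number):
    """Build the SNP information URL quoted in the text from an rs number.

    Accepts "rs2293004" or "2293004". Raises ValueError for anything else.
    """
    text = rs_number.strip().lower()
    if text.startswith("rs"):
        text = text[2:]
    if not text.isdigit():
        raise ValueError("expected an rs number such as 'rs2293004'")
    return "http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=" + text

print(ncbi_snp_url("rs2293004"))
```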
The “RK number” is an identification number assigned by the applicant. In the “SNP” column, the alleles are shown separated by “/”. In the “Block” column, htSNPs are indicated by “T”. In the column to the right of “Block”, “C” denotes a Common SNP and “R” denotes a Rare SNP.
3. Haplotype analysis
(1) Construction of haplotype blocks
As shown in FIG. 4, consider a case in which linkage disequilibrium analysis is performed on all combinations of nine SNPs on gene Y to construct haplotype blocks. A criterion is established by calculating a measure of linkage disequilibrium between each SNP and the other SNPs in the population, together with the number of major haplotypes obtained when information from multiple adjacent SNPs is used, and haplotype blocks are constructed using that criterion as the index.
A previously reported technique can be used to construct haplotype blocks (Avi-Itzhak, H. I. et al., Pac. Symp. Biocomput. 466-477 (2003)).
First, haplotype estimation is performed using adjacent Common SNPs at linked SNP loci (FIG. 1, S1, S2). The haplotype frequency in the population is usually 5% or more, preferably 5% to 20%, more preferably 5% to 10%, but is not limited to these values. The haplotype frequencies of the population are thereby estimated, and D′ is estimated from them. D′ is a measure of linkage disequilibrium between two SNPs. Pairs whose estimated D′ is 0.90 or more (the threshold is not limited to 0.90) are searched for. When such a pair of Common SNPs is found, it is used as the seed of a haplotype block, and the block is extended as far as possible in the 5′ and 3′ directions.
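The text does not spell out how D′ is computed. The sketch below uses the standard normalized disequilibrium coefficient (Lewontin's D′) for two biallelic SNPs, taking estimated two-locus haplotype frequencies as input; the 0/1 allele coding, the function name, and the example frequencies are assumptions made for illustration.

```python
def d_prime(hap_freq):
    """Absolute value of Lewontin's D' for two biallelic loci.

    `hap_freq` maps (allele_at_locus_1, allele_at_locus_2), each coded 0 or 1,
    to the estimated haplotype frequency; missing haplotypes count as 0.
    """
    p11 = hap_freq.get((1, 1), 0.0)
    p1 = p11 + hap_freq.get((1, 0), 0.0)  # frequency of allele 1 at locus 1
    q1 = p11 + hap_freq.get((0, 1), 0.0)  # frequency of allele 1 at locus 2
    d = p11 - p1 * q1                     # deviation from linkage equilibrium
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    return 0.0 if d_max == 0 else abs(d) / d_max

# Hypothetical estimates: haplotype (1, 0) is never observed.
strong = {(0, 0): 0.6, (0, 1): 0.1, (1, 1): 0.3}
# Hypothetical estimates under linkage equilibrium.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(d_prime(strong) >= 0.90)       # pair qualifies as a block seed
print(d_prime(independent) >= 0.90)  # pair does not qualify
```

A pair is treated as a seed when the value reaches the 0.90 example threshold; when one of the four haplotypes is absent, as in the first data set, the normalized value is at its maximum.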
The region containing the Common SNPs in strong linkage disequilibrium is defined as a provisional haplotype block, and haplotypes whose frequency is estimated at 5% or more (not limited to 5%) in the haplotype estimation so far are defined as provisional major haplotypes. A Common SNP adjacent to the provisional haplotype block is then added to the Common SNPs in the block, and haplotype estimation is performed again.
If no haplotype with a frequency of 5% or more (not limited to 5%) arises other than the provisional major haplotypes so far, the block is extended by adding the new Common SNP to the provisional haplotype block. If such a haplotype does arise, extension of the block is considered to have stopped in that direction. For example, as shown in FIG. 4A, when SNP1 and SNP2 are judged to be in strong linkage disequilibrium, SNP1 and SNP2 are provisionally assumed to form a haplotype block (a); whether that block and SNP3 are in strong linkage disequilibrium is then analyzed (b), and linkage disequilibrium with each downstream SNP is analyzed in turn (c). If adding SNP5 to the block formed by SNP1 to SNP4 produces a new major haplotype with a frequency of 5% or more, SNP1 to SNP4 are judged to form one block (d). Whether the block of SNP1 to SNP4 is in strong linkage disequilibrium with SNP6, SNP7, ..., SNP9 is analyzed in the same way. The analysis is then repeated starting from SNP2 (e to g). This operation is carried out for all combinations up to SNP9 (a to i), and the haplotype blocks are constructed.
The above analysis is performed in both the 5′ and 3′ directions. If extension of the haplotype block has not stopped in a given direction, one adjacent SNP in that direction is added and the steps are repeated.
When extension of the haplotype block has stopped in both directions, the provisional haplotype block at that stage is taken as the haplotype block (FIG. 1, S4).
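The extension rule can be sketched as follows. This is a simplified illustration rather than the reported algorithm: it assumes phased chromosomes are already available (the text instead estimates haplotypes from genotypes), it interprets "a new major haplotype arises" as "the number of haplotypes at or above 5% increases when the adjacent SNP is added", and all names and data are hypothetical.

```python
from collections import Counter

def count_major(chromosomes, start, end, min_freq=0.05):
    """Number of distinct haplotypes over SNP columns [start, end) with frequency >= min_freq."""
    counts = Counter(c[start:end] for c in chromosomes)
    return sum(1 for k in counts.values() if k / len(chromosomes) >= min_freq)

def extend_block(chromosomes, seed_start, seed_end, min_freq=0.05):
    """Extend a seed [seed_start, seed_end) in the 3' and then the 5' direction.

    Extension in a direction stops as soon as adding the adjacent SNP would
    increase the number of major haplotypes.
    """
    n_snps = len(chromosomes[0])
    start, end = seed_start, seed_end
    while end < n_snps and (count_major(chromosomes, start, end + 1, min_freq)
                            <= count_major(chromosomes, start, end, min_freq)):
        end += 1
    while start > 0 and (count_major(chromosomes, start - 1, end, min_freq)
                         <= count_major(chromosomes, start, end, min_freq)):
        start -= 1
    return start, end

# Hypothetical phased data for five SNPs (0 = major allele, 1 = minor allele):
# SNP1 to SNP4 travel together, SNP5 varies independently of them.
chromosomes = ["00000"] * 25 + ["00001"] * 25 + ["11110"] * 25 + ["11111"] * 25
print(extend_block(chromosomes, 0, 2))  # seed = SNP1 and SNP2
```

With this data the seed of SNP1 and SNP2 grows to cover SNP1 to SNP4 (columns 0 to 3) and stops at SNP5, mirroring the FIG. 4A narrative. The counting criterion can miss a new major haplotype if another one drops below the threshold at the same step, so it should be read as an approximation.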
In general, the result of block construction can be read from the matrix shown in FIG. 4B. Each cell of the matrix (81 cells in FIG. 4B) can indicate, by differences in color or color density, how strongly each pair of SNPs is linked and thus which SNPs form a block.
(2) Selection of tag haplotype
Tag SNPs (htSNPs) are selected from the blocks constructed as described above (FIG. 1, S7). The htSNPs are the minimum set of Common SNPs that, in combination, can account for at least a certain proportion (for example, 90%) of haplotypes and can distinguish the major haplotypes. Tag selection is performed after SNPs inside and outside the block have been distinguished (FIG. 1, S5, S6); a predetermined computer program is run on the in-block Common SNPs, and the htSNPs are identified (FIG. 1, S7). Examples of htSNPs are shown in Table 3. The identified htSNPs can be used alone or in combinations of two or more.
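The specification does not disclose the selection program itself. One way to realize "the minimum set of Common SNPs that accounts for a given proportion of haplotypes" is an exhaustive search over SNP subsets of increasing size, sketched below. The haplotypes are those of block C in FIG. 5; their frequencies, the function names, and the brute-force strategy are assumptions made for illustration.

```python
from itertools import combinations

def resolved_fraction(hap_freq, positions):
    """Summed frequency of haplotypes whose alleles at `positions` are unique to them."""
    groups = {}
    for hap, freq in hap_freq.items():
        key = tuple(hap[i] for i in positions)
        groups.setdefault(key, []).append(freq)
    return sum(freqs[0] for freqs in groups.values() if len(freqs) == 1)

def select_tag_snps(hap_freq, snp_names, min_fraction=0.9):
    """Smallest SNP subset that uniquely identifies haplotypes totalling >= min_fraction."""
    n = len(snp_names)
    for size in range(1, n + 1):
        for positions in combinations(range(n), size):
            if resolved_fraction(hap_freq, positions) >= min_fraction:
                return [snp_names[i] for i in positions]
    return list(snp_names)  # fall back to all SNPs if no subset reaches the target

# Block C haplotypes a to d (SNP12 to SNP15) with hypothetical frequencies.
hap_freq = {"ATAA": 0.40, "ATCT": 0.30, "GCCT": 0.20, "GCAT": 0.10}
snp_names = ["SNP12", "SNP13", "SNP14", "SNP15"]
print(select_tag_snps(hap_freq, snp_names))
```

For these inputs no single SNP separates the four haplotypes, and the first pair that does is SNP12 with SNP14. The search is exponential in the number of SNPs, so it is only practical for small blocks.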
In Table 3, the SNPs listed as “htSNPs” under “Block #” (# is a serial number) are tag SNPs. The SNPs listed under “Between” are out-of-block SNPs (described later). The numbers in the leftmost column are identification numbers assigned by the applicant (corresponding to “RK number” in Table 2), and the symbols in the second column from the left (for example, “DDOST”, “NDUFS5”) are gene names. The subsequent columns give one allele of the SNP, its frequency, the other allele, its frequency, the position of the SNP on the genome, and the rs number; the rightmost column gives gene aliases. The alias is shown even when it is the same as the gene name in the second column.

In Table 3, “del” means a deletion; for example, a SNP shown as “CT/del” is either CT or a deletion. “Ins” means an insertion; for example, a SNP shown as “G/ins” is one in which G is inserted at that position. A SNP written with parentheses and numbers, such as “(TC)2-3”, is a polymorphism in which the bases in parentheses are repeated the indicated number of times (in this case, TC repeated 2 to 3 times).
(3) Selection of out-of-block SNPs
A haplotype block is defined, when constructed, as a set of Common SNPs in linkage disequilibrium whose strength meets or exceeds a certain index value. Constructing haplotype blocks therefore leaves some SNPs that fall outside that index range and outside the blocks (for example, SNP7, SNP11, and SNP16 in FIG. 2). Such SNPs, judged to be in weak linkage disequilibrium and excluded from the blocks, are referred to as “out-of-block SNPs” or “out-block SNPs”. Examples of out-of-block SNPs are shown in the “Between” section of Table 3 above.
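Once the block boundaries are known, the SNPs left outside the blocks follow directly. The sketch below uses the FIG. 2 layout (16 Common SNPs, blocks A, B, and C); the function name and the representation of a block as a (first, last) pair of SNP numbers are assumptions.

```python
def out_of_block_snps(n_snps, blocks):
    """Return the 1-based SNP numbers that fall in none of the blocks.

    `blocks` is an iterable of (first, last) SNP numbers, both inclusive.
    """
    in_block = set()
    for first, last in blocks:
        in_block.update(range(first, last + 1))
    return [i for i in range(1, n_snps + 1) if i not in in_block]

# Blocks as drawn in FIG. 2: A = SNP1-6, B = SNP8-10, C = SNP12-15.
blocks = {"A": (1, 6), "B": (8, 10), "C": (12, 15)}
print(out_of_block_snps(16, blocks.values()))  # [7, 11, 16]
```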
In the present invention, not only in-block SNPs but also out-of-block SNPs can be analyzed, alone or in combination, as described above (FIG. 1, S9). The relationship with the phenotype can also be analyzed by adding out-of-block SNP data to haplotype block data constructed in advance.
(4) Analysis of Rare SNPs
(i) Concept of Rare SNPs
As shown in FIG. 2 and FIG. 5, assume that the SNPs at four positions, SNP12 (G/A), SNP13 (C/T), SNP14 (C/A), and SNP15 (T/A), form a haplotype block (block C) with four major haplotypes: haplotype a (“A-T-A-A”), haplotype b (“A-T-C-T”), haplotype c (“G-C-C-T”), and haplotype d (“G-C-A-T”). These haplotypes are assumed to be constructed from SNPs whose minor allele frequency is at or above a certain value (for example, 0.1 or higher). SNPs whose minor allele frequency is below 0.1 therefore do not contribute to construction of the haplotype block.
However, such non-contributing SNPs can also be examined for association with phenotypes. These SNPs are referred to as “Rare SNPs” (FIG. 1, S3). Whether a Rare SNP lies inside or outside a block is also determined, and the two cases can be distinguished and used accordingly (FIG. 1, S11, S12, S14).
In FIG. 5, assume that a Rare SNP with a minor allele frequency of 4% (below 0.1), the “T” allele of SNP17, lies between SNP13 and SNP14. If this Rare SNP lies within block C on a specific haplotype (for example, haplotype d in FIG. 5) and not on the other haplotypes (haplotypes a, b, and c in FIG. 5), the Rare SNP is a component of that haplotype. Positioning a Rare SNP on a specific major haplotype in this way is referred to as “assigning the Rare SNP” (FIG. 1, S13). Rare SNPs are assigned to the major haplotypes in the block (FIG. 1, S12, S13). Because populations carrying haplotypes with assigned Rare SNPs are often associated with phenotypes, Rare SNPs can be used for disease diagnosis, drug selection, and identification of causative SNPs.
Therefore, after Rare SNP information is obtained in S1 and S3 of FIG. 1, one can proceed directly to the phenotype association test (FIG. 1, S8). The phenotype association test (FIG. 1, S8) can also be performed in combination with htSNP information. For example, when Rare SNPs and htSNPs are combined, a combination of all htSNPs in the block with at least one Rare SNP (preferably one Rare SNP) is preferred. The combinations of Rare SNPs are not particularly limited, and all combinations can be used, including Rare SNPs outside the block as well as those inside it.
Examples of Rare SNPs are shown in Table 2 (SNPs labeled “R” in the rightmost column).
In Table 2, below the list for each block, the Rare SNPs and the blocks to which they are assigned (Rare Assign Block) are given.
Haplotypes containing at least one of these Rare SNPs are also included in the present invention.
Because a SNP is biallelic, it can be represented by two numbers such as 0 and 1 (shown in parentheses in the lower part of FIG. 5). In FIG. 5, 0 represents the major allele (the allele with the higher frequency) and 1 represents the minor allele (the allele with the lower frequency). The Rare SNP is assigned to haplotype d, which can be written “11(1)10” (the “(1)” in parentheses is the Rare SNP). The numbers are not limited to 0 and 1, nor is the notation limited to one using numbers.
(Ii) A method for determining allocation of Rare SNPs
A method for determining the assignment of Rare SNPs to major haplotypes (FIG. 1, S13) will be specifically described.
First, using the genotype data of the individuals in a population for the Common SNPs in one haplotype block (for example, SNPs with a minor allele frequency of 10% or more, although the threshold is not limited to 10%), the haplotype frequencies of the population are estimated by the haplotype frequency estimation method described below. Then, from the estimated haplotype frequencies, haplotypes are selected in descending order of frequency until their total reaches 95% or more, and these are called the major haplotypes. The threshold is not limited to 95%.
Next, one Rare SNP (here assumed to have a minor allele frequency below 10%, although the threshold is not limited to 10%) is added to the n Common SNPs in the haplotype block, and haplotype estimation is performed using the genotypes of the individuals at the n + 1 SNPs.
This haplotype estimation yields the haplotype frequencies of the population. If only one haplotype in the population contains the minor allele of the Rare SNP, the minor allele of the Rare SNP is assigned to that haplotype.
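The selection of major haplotypes by cumulative frequency can be sketched as follows. This is an illustrative Python sketch only: the function name `select_major_haplotypes` and the example frequencies are assumptions, and the 95% threshold is a parameter because the text states that the value is not fixed.

```python
def select_major_haplotypes(freqs, coverage=0.95):
    """Pick haplotypes in descending order of frequency until the running
    total reaches `coverage`.

    freqs maps a haplotype label to its estimated population frequency.
    Returns the selected haplotypes and the total frequency they explain.
    """
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    selected, total = [], 0.0
    for haplotype, frequency in ranked:
        if total >= coverage:
            break
        selected.append(haplotype)
        total += frequency
    return selected, total


# Hypothetical estimated frequencies for a four-SNP block.
freqs = {"0000": 0.52, "1100": 0.25, "1111": 0.15, "0011": 0.05, "1010": 0.03}
major, covered = select_major_haplotypes(freqs)
print(major, round(covered, 2))
```

With these hypothetical values the first four haplotypes are selected, because the three most frequent ones explain only 92% of the population.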
When the main haplotype to which the minor allele of the Rare SNPs is assigned is not determined by the above method based on the estimation result of the haplotype frequency of the group, the following method is used.
The diplotype form of each individual estimated by the above method is obtained as a posterior probability distribution over the individual's diplotype forms (a combination or permutation of the two haplotypes carried by the individual) under the maximum likelihood estimate of the population haplotype frequencies (Θ). Individuals whose estimated diplotype forms include, with some posterior probability, a haplotype carrying the minor allele of the Rare SNP are examined, and from the estimated haplotype information of those individuals the haplotype to which most of the minor alleles of the Rare SNP are assigned is identified. This haplotype is referred to as the “majority-assigned haplotype”. The proportion of minor alleles of the Rare SNP that are assigned to the majority-assigned haplotype is then calculated from the estimated diplotype forms of the individuals.
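The calculation of the majority-assigned haplotype and of the proportion of minor alleles assigned to it can be sketched in Python as below. This is a sketch under stated assumptions, not the procedure of any particular program: the input format (one posterior distribution over ordered haplotype pairs per individual), the function name `majority_assigned`, and the example data are all hypothetical. Minor alleles are counted in expectation, weighted by the posterior probability of each diplotype form.

```python
from collections import defaultdict


def majority_assigned(diplotype_posteriors, rare_pos):
    """Find the haplotype background carrying most Rare SNP minor alleles.

    diplotype_posteriors: one dict per individual, mapping an ordered pair of
    haplotype strings (0/1 notation over n + 1 SNPs) to its posterior probability.
    rare_pos: index of the Rare SNP within each haplotype string.
    Returns (background haplotype with the Rare SNP masked as "-", proportion).
    """
    counts = defaultdict(float)
    for posterior in diplotype_posteriors:
        for pair, prob in posterior.items():
            for hap in pair:
                if hap[rare_pos] == "1":
                    background = hap[:rare_pos] + "-" + hap[rare_pos + 1:]
                    counts[background] += prob
    if not counts:
        raise ValueError("no minor allele of the Rare SNP was observed")
    total = sum(counts.values())
    best = max(counts, key=counts.get)
    return best, counts[best] / total


# Hypothetical posteriors for three carriers; the Rare SNP is the third position.
posteriors = [
    {("11110", "00000"): 1.0},
    {("11110", "11000"): 1.0},
    {("11110", "00000"): 0.9, ("11010", "00100"): 0.1},
]
best, ratio = majority_assigned(posteriors, rare_pos=2)
print(best, round(ratio, 3))
```

In this hypothetical data set the expected number of minor alleles is 3.0, of which 2.9 fall on the background “11-10”, so the proportion is about 0.967.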
A method for determining the assignment of Rare SNPs to the main haplotypes formed by Common SNPs in the haplotype block will be described below.
First, the method for estimating the haplotype frequencies of a population and the diplotype form of each individual, which is required for the assignment of Rare SNPs, is described. This approach is the one reported by Kitamura et al. and Ito et al. (Kitamura Y. et al. (2002) Ann. Hum. Genet. 66: 183-193; Ito T. et al. (2003) Am. J. Hum. Genet. 72: 384-398). In this method, the given data are the genotypes (combinations, not permutations) at a plurality of linked loci in a plurality of individuals belonging to a specific population. What is to be estimated is the haplotype frequencies of the population, as parameters, and the diplotype form of each individual (a combination or permutation of the two haplotypes carried by the individual).
Consider the following experiment. For L linked loci, the frequencies of all possible haplotypes ($2^L$ for SNP loci) are determined stochastically. These frequencies are written $\Theta = (\theta_1, \ldots, \theta_M)$, where $M$ is the number of possible haplotypes and $\theta_i$ is the frequency of the i-th haplotype ($i = 1, \ldots, M$). Two haplotypes are then distributed stochastically to each of n individuals according to Θ. Each individual is thus given two ordered haplotypes; this is called a permutation diplotype form. Let $d_i$ be the permutation diplotype form given to the i-th individual, and let $D = (d_1, d_2, \ldots, d_n)$.
One result of such an experiment is represented by (Θ, D), and once one result is determined, the genotype of all individuals is determined for all L loci. “(Θ, D)” means Θ and D. In practice, the genotype of all individuals is determined solely by D (not by Θ).
Since Θ is a continuous quantity, the sample space is uncountably infinite. Given Θ, $d_1, d_2, \ldots, d_n$ are independent of each other.
That is, if the probability of D under the condition Θ is written P(D | Θ), then

$$P(D \mid \Theta) = \prod_{i=1}^{n} P(d_i \mid \Theta).$$

Assuming Hardy-Weinberg equilibrium, and letting the haplotypes composing $d_i$ be the j-th and the k-th, the probability $P(d_i \mid \Theta)$ of $d_i$ under the condition Θ is

$$P(d_i \mid \Theta) = \theta_j \theta_k.$$

The observed genotypes of all individuals are written $G = (g_1, \ldots, g_n)$, where $g_k$ is the genotype of the k-th individual at the L loci, $g_k = (g_{k1}, \ldots, g_{kL})$, and $g_{kl}$ is the genotype (a combination, not a permutation) of the k-th individual at the l-th locus.
The likelihood function is the conditional probability, under Θ, of the set of those outcomes (Θ, D) in the sample space Ω that are consistent with G:

$$L(\Theta) = P(G \mid \Theta) = \prod_{k=1}^{n} P(g_k \mid \Theta),$$

where $P(g_k \mid \Theta)$ is the probability of the set of events $d_k$ that are consistent with $g_k$:

$$P(g_k \mid \Theta) = \sum_{(i,j) \in Q_k} \theta_i \theta_j.$$

Here $Q_k$ is the set of permutations of haplotype numbers that constitute a $d_k$ consistent with $g_k$. That is, the likelihood function is

$$L(\Theta) = \prod_{k=1}^{n} \sum_{(i,j) \in Q_k} \theta_i \theta_j,$$

and this is to be maximized over Θ.
This is maximized by the EM algorithm, and the probability distribution of $d_k$ under the maximum likelihood estimate is obtained. The EM (expectation-maximization) algorithm is a standard algorithm, used here to find the haplotype frequency distribution under which the observed SNP data are most likely. The maximum likelihood estimate is the value of Θ that maximizes the likelihood of the data. The program implementing this is LDSUPPORT (Kitamura Y. et al. (2002) Ann. Hum. Genet. 66: 183-193).
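The likelihood described above can be evaluated directly for small examples. The following Python sketch enumerates, for each unphased genotype, every ordered haplotype pair consistent with it (the set written Q_k in the text) and sums the products of haplotype frequencies under Hardy-Weinberg equilibrium. The genotype coding (0, 1 or 2 copies of the minor allele per locus), the function names, and the example frequencies are assumptions made for illustration.

```python
import math
from itertools import product


def compatible_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype: tuple giving the number of minor alleles (0, 1 or 2) at each locus.
    """
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        else:  # heterozygous: either phase is possible
            per_locus.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_locus):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def log_likelihood(genotypes, theta):
    """Log of the product over individuals of sum over Q_k of theta_i * theta_j."""
    total = 0.0
    for g in genotypes:
        p = sum(theta.get(h1, 0.0) * theta.get(h2, 0.0) for h1, h2 in compatible_pairs(g))
        total += math.log(p)
    return total


# Hypothetical haplotype frequencies for two SNP loci.
theta = {(0, 0): 0.5, (1, 1): 0.3, (0, 1): 0.1, (1, 0): 0.1}
print(math.exp(log_likelihood([(1, 1)], theta)))
```

For a double heterozygote there are four ordered pairs, so with these hypothetical frequencies the genotype probability is 2 × 0.5 × 0.3 + 2 × 0.1 × 0.1 = 0.32.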

It becomes.
The maximum likelihood estimator $\hat{\Theta}$ of Θ satisfies the following equation:

$$\hat{\theta}_i = \frac{1}{2n} E\left[ n_i \mid G, \hat{\Theta} \right],$$

where $n_i$ is the number of copies of the i-th haplotype under a particular D. The expectation in the above equation is thus the number of copies of the i-th haplotype averaged over all D that are consistent with G. This expresses the internal consistency of the maximum likelihood estimate and leads to the following iterative EM procedure.
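For all i (i = 1, 2, ..., M), compute

$$\theta_i^{(t+1)} = \frac{1}{2n} E\left[ n_i \mid G, \Theta^{(t)} \right]$$

(where $\theta_i^{(t)}$ is the frequency of the i-th haplotype at the t-th step).
Repeat this and stop when $\Theta^{(t)}$ has converged. Then calculate the posterior probability of $d_k$ under the maximum likelihood estimate:

$$P(d_k \mid g_k, \hat{\Theta}) = \frac{P(d_k \mid \hat{\Theta})}{\sum_{d \in Q_k} P(d \mid \hat{\Theta})}, \qquad d_k \in Q_k.$$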
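The EM iteration and the posterior diplotype distribution can be sketched in Python as follows. This is a generic, minimal EM implementation written for illustration, not the LDSUPPORT program: the genotype coding (0, 1 or 2 minor alleles per locus), the uniform starting values, the convergence tolerance, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs consistent with an unphased genotype (0/1/2 coding)."""
    choices = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
    pairs = []
    for combo in product(*(choices[g] for g in genotype)):
        pairs.append((tuple(a for a, _ in combo), tuple(b for _, b in combo)))
    return pairs


def em_haplotype_frequencies(genotypes, max_iter=200, tol=1e-10):
    """Estimate haplotype frequencies and posterior diplotype distributions by EM."""
    n = len(genotypes)
    pair_sets = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_sets for pair in pairs for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)

    def posterior(pairs):
        # Posterior of each ordered pair is proportional to theta_i * theta_j.
        weights = [theta[h1] * theta[h2] for h1, h2 in pairs]
        norm = sum(weights)
        return {pair: w / norm for pair, w in zip(pairs, weights)}

    for _ in range(max_iter):
        expected = defaultdict(float)  # E-step: expected haplotype counts
        for pairs in pair_sets:
            for (h1, h2), prob in posterior(pairs).items():
                expected[h1] += prob
                expected[h2] += prob
        new_theta = {h: expected[h] / (2 * n) for h in haps}  # M-step
        change = max(abs(new_theta[h] - theta[h]) for h in haps)
        theta = new_theta
        if change < tol:
            break
    return theta, [posterior(pairs) for pairs in pair_sets]


# Hypothetical data: 4 individuals 00/00, 3 individuals 11/11, 3 double heterozygotes.
genotypes = [(0, 0)] * 4 + [(2, 2)] * 3 + [(1, 1)] * 3
theta, posteriors = em_haplotype_frequencies(genotypes)
print({h: round(f, 4) for h, f in theta.items()})
```

In this hypothetical data set the double heterozygotes are resolved almost entirely as the pair of haplotypes (0, 0) and (1, 1), so the estimated frequencies approach 0.55 and 0.45, and the two orderings of that pair each receive a posterior probability close to 0.5.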
This method can be regarded as estimating complete information from incomplete information, the information being incomplete because each individual carries two haplotypes. The ldpooled program generalizes this: many haplotypes are mixed into many small pools, each containing 2m haplotypes, and complete information is estimated from the incomplete information (Ito T. et al. (2003) Am. J. Hum. Genet. 72: 384-398). In other words, ordinary haplotype estimation is the special case m = 1 of estimation from pools of 2m haplotypes.
The above is a method for estimating the haplotype frequencies of a population and the diplotype form of each individual from genotype data at a plurality of linked loci in a large number of individuals. This approach has been shown to give accurate estimates when linkage disequilibrium between the linked loci is strong (for example, for SNPs within a single haplotype block).
Haplotype estimation is commonly performed with the EM algorithm described above, but other methods are also known, such as the Clark method (Clark A.G. (1990) Mol. Biol. Evol. 7: 111-122), the PHASE method (Stephens M. et al. (2001) Am. J. Hum. Genet. 68: 978-989), and the PL method (Niu T. et al. (2002) Am. J. Hum. Genet. 70: 157-169). The technique used for assigning Rare SNPs is therefore not limited to the EM method.
When the assignment of Rare SNPs in a haplotype block to the major haplotypes of that block was analyzed by such a method, the proportion of minor alleles of a Rare SNP assigned to the majority-assigned haplotype was shown to be close to 100% in many cases. Some Rare SNPs are assigned to two or more major haplotypes, but the proportion of minor alleles assigned to haplotypes other than the majority-assigned haplotype is low.
(iii) In-block Rare SNPs analysis method
As described above, for Rare SNPs within a haplotype block, it was found that most of the minor alleles were assigned to one major haplotype. In such a case, even if Rare SNPs related to the phenotype cannot be found, it is possible to search for it using the main haplotype as a tag.
Known causes of serious drug side effects include dihydropyrimidine dehydrogenase deficiency, which is related to the side effects of 5-fluorouracil and its derivatives, and thiopurine methyltransferase deficiency, which is related to the side effects of 6-mercaptopurine and its derivatives. These are related to Rare SNPs. When the cause is already known, the causative gene and the responsible mutation are easy to identify, but in many cases the causative (related) gene and the responsible mutation are unknown. In addition, because the minor alleles of these SNPs have a low frequency, they are often missed by SNP screening.
In such a case, a technique can be constructed for detecting Rare SNPs using a major haplotype as a tag. That is, Common SNPs (or htSNPs, as described later) are used to compare haplotype frequencies, or the frequencies of individuals carrying those haplotypes, between the severe side effect population and the control population. For a gene in which a haplotype shows a difference between the populations, the haplotype block is then sequenced in all individuals showing severe side effects. In this way, Rare SNPs that cause severe side effects but were not found at the screening stage can be discovered.
(iv) Out-of-block Rare SNPs analysis method
What is considered to be most difficult to analyze is a technique for searching Rare SNPs outside haplotype blocks related to phenotype (for example, severe side effects).
As described above, in general, there are many cases where Rare SNPs cannot be found by screening. Therefore, it is desirable to use Common SNPs as a tag for discovery.
However, linkage disequilibrium between SNPs is also observed outside the blocks. Therefore, the allele frequencies of out-of-block Common SNPs are compared between the severe side effect population and the control population. If there is a significant difference between the two populations, Rare SNPs related to the side effects can be found by sequencing the region around those Common SNPs in samples from individuals with severe side effects.
(v) Analysis by combination of htSNPs and Rare SNPs
In the present invention, haplotype estimation can be performed more accurately by a combination of htSNPs and Rare SNPs. The combination of htSNPs and Rare SNPs is preferably a combination of all htSNPs in the block and one of the Rare SNPs.
The analysis method of the relationship between Rare SNPs and main haplotypes is as follows.
In order to describe the relationship between Rare SNPs and haplotypes, it is necessary to define a probability space. Therefore, in the present invention, concepts called “complete haplotype” and “incomplete haplotype” are adopted.
Suppose there is a haplotype block in which a total of N polymorphic loci are linked. The i-th complete haplotype is denoted “Hi” (FIG. 19A). A “complete haplotype” is the list of alleles obtained when each of the N loci carries one allele (N alleles in total). The set of complete haplotypes carried by all individuals in the population is defined as the sample space Ω (FIG. 19A). Hi is then redefined as the subset of Ω whose elements are Hi. That is, the list forming the i-th complete haplotype in the sample space Ω is first defined as Hi, and the set of elements having that list is then redefined as “Hi”.
Further, one low-frequency allele at one of the N loci in the block is denoted “X” (FIG. 19A). The set of complete haplotypes in the population whose list contains X (here X means the low-frequency allele) is then redefined as X (here X means a set). This redefined X is a subset of Ω (FIG. 19A).
On the other hand, let Ai be one of the incomplete haplotypes; for example, Ai is the i-th haplotype defined only by the htSNPs in the block (FIG. 19B). An “incomplete haplotype” is a list of fewer than N alleles over the N loci. For example, A₂ refers to the alleles at the two loci labeled ht among the four loci. The set of complete haplotypes in the population that carry the list Ai at the htSNP loci is then redefined as Ai. A haplotype defined only by htSNPs is an incomplete haplotype because it is a part of a complete haplotype. With these definitions, the complete haplotype Hi, the incomplete haplotype Ai, and the low-frequency allele X at one SNP locus are all subsets of Ω and can therefore be interpreted as events defined on the same probability space. The complementary events of these events,

$$\overline{H_i}, \quad \overline{A_i}, \quad \overline{X},$$

can also be defined. Such concepts are difficult to define by other methods.
The relationship between one Rare SNP and the haplotypes formed by the htSNPs in the block is analyzed as follows. Haplotype estimation is performed using all htSNPs in the block and one Rare SNP (defined as frequency < 0.1).
For haplotype estimation, LDSUPPORT (Kitamura et al. 2002) or PHASE (Stephens et al. 2001) software can be used.
“LDSUPPORT” is a program that simultaneously estimates the haplotype frequencies of a population and the diplotype distribution of each individual (the posterior diplotype distribution) based on the expectation-maximization (EM) algorithm, whereas “PHASE” is a program that estimates haplotypes using the Markov chain Monte Carlo method and a coalescent model. After the haplotype estimation is finished, the following four joint probabilities are calculated for each Ai:

$$\text{(a) } P(A_i, X), \quad \text{(b) } P(\overline{A_i}, X), \quad \text{(c) } P(A_i, \overline{X}), \quad \text{(d) } P(\overline{A_i}, \overline{X}).$$

Here, (a) “P(Ai, X)” is the probability of being Ai and X; (b) is the probability of being X but not Ai; (c) is the probability of being Ai but not X; and (d) is the probability of being neither Ai nor X.
Using the probabilities defined in this way, the following calculation can be performed in the present invention:

$$P(A_i \mid X) = \frac{P(A_i, X)}{P(A_i, X) + P(\overline{A_i}, X)}.$$

P(Ai | X) is the probability of Ai under the condition X. In other words, among the complete haplotypes that contain X (those in the set Ai and those in its complement), it is the proportion that belong to Ai.
Among the incomplete haplotypes Ai, the one that maximizes P(Ai | X) is denoted Aj. This is the incomplete haplotype to which X is assigned. Aj is chosen by:

$$A_j = \underset{A_i}{\arg\max}\; P(A_i \mid X).$$

The maximized value P(Aj | X) serves as a measure of the assignment of the Rare SNP to the incomplete haplotype. The method for assigning Rare SNPs to incomplete haplotypes is described in detail for one block in Example 3.
(vi) Simulation for detecting Rare SNPs related to a phenotype with haplotypes formed by htSNPs
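The choice of the incomplete haplotype with the highest conditional probability given X can be illustrated by a short Python sketch. It assumes that the joint probabilities P(Ai, X) have already been estimated for every incomplete haplotype Ai and that the Ai partition the sample space, so that P(X) is the sum of P(Ai, X) over i. The function name `assign_rare_allele` and the numerical values are hypothetical.

```python
def assign_rare_allele(joint_with_x):
    """Select the incomplete haplotype A_j that maximizes P(A_i | X).

    joint_with_x maps each incomplete haplotype label A_i to P(A_i, X).
    Because the A_i are assumed to partition the sample space,
    P(X) = sum_i P(A_i, X), which equals P(A_i, X) + P(not A_i, X) for any i.
    Returns (label of A_j, P(A_j | X)).
    """
    p_x = sum(joint_with_x.values())
    if p_x <= 0.0:
        raise ValueError("the low-frequency allele X has zero estimated probability")
    conditional = {a: p / p_x for a, p in joint_with_x.items()}
    best = max(conditional, key=conditional.get)
    return best, conditional[best]


# Hypothetical joint probabilities for a Rare SNP with P(X) = 0.04.
joint = {"A1": 0.001, "A2": 0.036, "A3": 0.003}
label, p = assign_rare_allele(joint)
print(label, round(p, 3))
```

With these hypothetical values, 90% of the copies of X lie on A2, so A2 is selected as the haplotype to which X is assigned.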
Rare SNPs have a high probability of being related to phenotypes, but in the present invention, it is possible to analyze the relationship with phenotypes by using tag haplotypes without using Rare SNPs.
The probability of finding significance by using Aj instead of X depends on, among others, the following quantities:

Here M₁ and M₂ are the numbers of affected individuals and controls, respectively, and the following denotes a complementary set:

The algorithm for the simulation is as follows.
In this simulation, it was assumed that only X (that is, the low-frequency allele of a given Rare SNP) is directly related to the phenotype. The incomplete haplotype Aj to which most of X is assigned may also be related to ψ (the set of complete haplotypes carried by affected individuals), but only through the relationship between X and ψ. Because X is assumed to be related to the phenotype, its frequency is expected to differ between the affected and unaffected populations.
That is,

Case-control studies test for these frequency differences. Here, r is defined to represent the following ratio.

The inventor considered whether Aj is a good marker for detecting X. X is ψ

Since X is the only locus that is directly related to the phenotype, and the relationship between Aj and the phenotype is only through X, the following equation holds:

The following equation is obtained by Bayes' theorem.

From equation (3), equation (5) becomes:

Similar to P (Aj | X, ψ), the following formula:

It is important. The reason is:

Is calculated from (1) and (2).

The reason is that P (ψ) is very small.

This is because only the case is considered. That is, the frequency of X in the control group was the same as the frequency in the group. Thus, equations (1) and (2) are as follows:

For each Rare SNP, the following was calculated:

The calculated values of the following formula are illustrated in the Examples:

Before the simulation, various combinations of values were given to the ratio r and to the numbers of affected individuals and controls, M₁ and M₂. In the simulation, 2M₁ haplotypes were drawn for the affected population from a binomial distribution with frequency parameter P(Aj | ψ), and 2M₂ haplotypes were drawn for the control population from a binomial distribution with the following frequency parameter:


A χ² test on a contingency table was used. The proportion of iterations showing P < 0.01 was taken as the empirical power. For each Rare SNP, 5,000 iterations were performed. The purpose of this test is to detect the difference between the following:

Therefore, the following ratio is extremely important:

From equations (7) and (8), this ratio is as follows:

Therefore, the ratio (9) depends on the following:

If the following holds, the ratio (9) is 1:

In that case, the difference between the following can no longer be detected:

The simulation software ANASSIG was written by the present inventors in C.
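The simulation outlined above (binomial sampling of haplotypes in affected individuals and controls, a χ² test on the resulting 2 × 2 table, and the proportion of iterations with P < 0.01 taken as empirical power) can be sketched in Python. This is not the ANASSIG program: the function names, the use of Pearson's χ² statistic without continuity correction, and the example frequencies are assumptions, and the two haplotype frequencies are passed in directly as hypothetical inputs rather than derived from the ratio r.

```python
import math
import random


def chi2_2x2_pvalue(a, b, c, d):
    """P-value of Pearson's chi-square test (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a margin is empty, so no difference can be shown
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def empirical_power(p_case, p_control, m_case, m_control,
                    iterations=5000, alpha=0.01, seed=0):
    """Fraction of simulated case-control samples in which the test gives P < alpha.

    p_case, p_control: frequency of the tag haplotype in affected individuals
    and in controls (hypothetical inputs).
    m_case, m_control: numbers of individuals; each contributes two haplotypes.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        case_count = sum(rng.random() < p_case for _ in range(2 * m_case))
        control_count = sum(rng.random() < p_control for _ in range(2 * m_control))
        p_value = chi2_2x2_pvalue(case_count, 2 * m_case - case_count,
                                  control_count, 2 * m_control - control_count)
        if p_value < alpha:
            hits += 1
    return hits / iterations


# Hypothetical setting: tag haplotype frequency 0.10 in cases and 0.04 in controls.
print(empirical_power(0.10, 0.04, m_case=200, m_control=200, iterations=1000))
```

When the two frequencies are equal, the returned value estimates the false-positive rate of the test instead of its power, which gives a simple sanity check of the sketch.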
4. Analysis of association with phenotype
Phenotypes include phenotypes related to sensitivity to a drug or a foreign substance and phenotypes related to a disease.
Phenotypes related to sensitivity to drugs or foreign substances include pharmacokinetics (especially blood concentration), drug efficacy (including disappearance of disease markers and improvement of clinical symptoms), disease susceptibility and its degree, and the presence or absence of side effects and their severity. Pharmacokinetics refers to the in vivo behavior of a drug after administration (for example, its blood concentration) and includes absorption, distribution, metabolism, and excretion. Foreign substances are substances other than physiological substances that enter the human body orally or parenterally and that act as determinants of susceptibility to disease. Examples include chemical substances contained in tobacco, air pollutants, drinking water pollutants, food additives, and agricultural chemicals. Sugar in the case of diabetes, salt in the case of hypertension, viruses, bacteria-derived substances (such as endotoxin), and environmental hormones may also be foreign substances, depending on various genetic predispositions.
Disease phenotypes include the pathology of a disease such as cancer, the presence or absence of onset after exposure to a risk factor such as a carcinogen, the presence or absence of complications of a disease, and the presence or absence of recurrence. Examples of diseases include malignant tumors, immune system diseases, circulatory system diseases, metabolic diseases, renal and urinary system diseases, respiratory system diseases, and musculoskeletal diseases, and one or more of these diseases can be the subject of analysis. For lifestyle-related diseases such as diabetes and hypertension, examples include hyperglycemia, the presence or absence of diabetic complications, systolic blood pressure, diastolic blood pressure, and the presence or absence of heart failure as a complication of hypertension.
Disease susceptibility includes the presence or absence, or the degree, of the possibility of suffering from a disease.
Using the phenotype as an index, the association can be examined between test groups, or between a test group and a control group.
5. Use of analysis results
The results of the analysis described above can be used in a method for predicting sensitivity to a drug or a foreign substance, a method for selecting a treatment or prevention method, a method for selecting a diagnostic marker, a method for selecting a drug, and a method for determining an appropriate dose of a drug for the prevention or treatment of a disease.
They can also be applied to a method for examining drug-drug interactions. A “drug interaction” is an effect, observed when multiple drugs are administered at the same time, that differs qualitatively or quantitatively from the effects expected when each drug is administered alone. As a result, specific side effects may be enhanced or efficacy may be reduced. Because taking multiple drugs can cause problems such as increased side effects and diminished efficacy, this method is useful for identifying the SNPs involved.
Furthermore, the analysis results can be used in a method for identifying related polymorphisms. A “related polymorphism” is a polymorphism that causes a phenotype or that changes the phenotype quantitatively or qualitatively through various mechanisms. The results can also be used to identify disease loci and to analyze genes related to susceptibility to foreign substances or drugs.
6. Computer program
FIG. 6 shows a configuration example of the means for causing a computer to execute the haplotype analysis program of the present invention.
As shown in FIG. 6, the haplotype analysis system of the present invention includes a CPU 601, a ROM 602, a RAM 603, an input unit 604, an information communication transmission/reception unit 605, an output unit 606, a hard disk drive (HDD) 607, a CD-ROM drive 608, and the like.
The CPU 601 (also referred to as MPU) controls the entire polymorphism analysis data processing system according to a program stored in information storage means (for example, magnetic and / or optical recording medium) of the host computer. Then, the information received from the input unit 604 or the like is supplied to the output unit 606. Also, analysis processing can be executed based on information received through the network line 609. Examples of the information received through the network line 609 include SNP information from NCBI (http://www.ncbi.nlm.nih.gov/). The input unit 604 is a keyboard, a mouse, or the like, and is operated when inputting conditions or data necessary for executing analysis processing. The ROM 602 stores a program for instructing processing necessary for the operation of the analysis processing system of the present invention. The RAM 603 temporarily stores data necessary for executing processing in the analysis processing system.
The transmission/reception unit 605 executes information communication (data transmission and reception) with the network line 609 or the like based on a command from the CPU 601; examples include a modem and a router. The output unit 606 (for example, a display screen or a printer) displays the gene polymorphism analysis data and other conditions input from the input unit 604, based on a command from the CPU 601. The CD-ROM drive 608 reads, based on an instruction from the CPU 601, a program or data stored on the CD-ROM for operating the analysis processing system and stores it in, for example, the RAM 603. A recordable CD-R or rewritable CD-RW can also be used as the recording medium instead of the CD-ROM, in which case a CD-R or CD-RW drive is provided instead of the CD-ROM drive 608. Besides the above media, a DVD, an MO, or the like may be used, with a corresponding drive provided.
The program of the present invention can be written in, for example, C, Java, Perl, Fortran, or Pascal, and is designed to be cross-platform. This software can therefore be run on Windows (registered trademark) 95/98/2000/XP, Linux, UNIX (registered trademark), and Macintosh.
7. Computer recording media
The program of the present invention can be stored in a computer-readable recording medium or a storage means that can be connected to a computer. A computer recording medium or storage means containing the program of the present invention is also included in the present invention. Examples of the recording medium or storage means include a magnetic medium (flexible disk, hard disk, etc.), an optical medium (CD, DVD, etc.), a magneto-optical medium (MO, MD), and the like.
Hereinafter, the present invention will be described more specifically with reference to examples. However, the present invention is not limited to these examples.

本実施例は、７５２名の集団によるハプロタイプ解析を行なったものである。
（１）材料及び方法
本実施例は、東京女子医大のゲノム倫理委員会及びファーマＳＮＰコンソーシアムのゲノム倫理委員会による認定のもとに行われたものである。ボランティアから得られたＤＮＡを被検対象（母集団）とした。被験者からインフォームドコンセントを得た。総計１０３２名のボランティアを集め、７５２名の被験者からＤＮＡを無作為に選択した。７５２名の被験者の中で、男性は４４９名、女性は３０３名であった。被験者の年齢は男性が３６．１±１１．５歳であり、女性が４０．６±１１．３歳である。
前記表１に示す１４７個の遺伝子の３０３６個のＳＮＰｓについて、７５２名の日本人のＤＮＡの遺伝子型を特定した。これらの遺伝子は薬剤関連遺伝子、又は薬剤関連遺伝子候補のいずれかである。表１は、遺伝子、染色体位置、及び本発明におけるそれぞれの遺伝子中のＳＮＰｓの数のリストである。
表１に示すように、１４１個の遺伝子の２９９９個のＳＮＰ位置は常染色体上にあり、６個の遺伝子の３７個のＳＮＰｓはＸ連鎖であった。遺伝子のいくつかは薬剤反応と臨床的に関連することが知られており、他のものはトランスポーター、酸化還元酵素、各種転移酵素及び他のタンパク質である。
（２）遺伝子タイピング
インベーダーアッセイは、構造特異的切断酵素及びユニバーサル蛍光共鳴エネルギー転移（ＦＲＥＴ）システムと組み合わせたものである。アレル特異的オリゴヌクレオチド対及びインベーダープローブを設計し、インベーダーアッセイキット（ＴｈｉｒｄＷａｖｅＴｅｃｈｎｏｌｏｇｉｅｓ）を用いてＳＮＰの遺伝子タイピング（ＳＮＰｓの検出）を行なった（Ｋｗｉａｔｋｏｗｓｋｉ，Ｒ．Ｗ．ｅｔａｌ．，Ｍｏｌ．Ｄｉａｇｎ．，４，３５３−３６４（１９９９））。タイピングエラーの頻度は、経験的に０．０００１０４５であった。
（３）ハプロタイプブロックの構築
ハプロタイプブロックを構築するため、アルゴリズムは公知方法に基づき（Ｚｈａｎｇ，Ｋ．ｅｔａｌ．，Ｐｒｏｃ．Ｎａｔｌ．Ａｃａｄ．Ｓｃｉ．ＵＳＡ，９９，７３３５−７３３９（２００２）；Ｇａｂｒｉｅｌ，Ｓ．Ｂ．ｅｔａｌ．，Ｓｃｉｅｎｃｅ，２９６，２２２５−２２２９（２００２））、以下の通りブロックを構築した。
全てのペアをなすＤ’値が０．９以上となる初期インターバルを構築した。ハプロタイプがこのインターバル内に推定されるときは、頻度が５％以上の主要ハプロタイプが多少存在した。主要ハプロタイプの合計頻度は９０％以上であった。上述のように作製した初期インターバルに対し、隣接するＳＮＰを加えてハプロタイプの推測を行った。推測により新たに主要ハプロタイプが出現しない場合は、新規ＳＮＰは、新規インターバルを作成するために別のＳＮＰｓに加えた。隣接するＳＮＰの包含関係が新たな主要ハプロイドを生ずるまで、この手法を５’及び３’方向のそれぞれについて繰返した。２個のＳＮＰｓを除く全てのＳＮＰｓについて得られたインターバルを試験し、主要ハプロタイプの数を増加する包含関係をハプロタイプブロックと定義した。
（４）ハプロタイプ頻度の推定及びｈｔＳＮＰの選択
ハプロタイプブロック内のハプロタイプについての推定は、アレル頻度が０．１以上のＳＮＰｓのみを用い、また、ＥＭアルゴリズムを実行するソフトウェアＬＤＳＵＰＰＯＲＴ（ＫｉｔａｍｕｒａＹ．ｅｔａｌ．，Ａｎｎ．Ｈｕｍ．Ｇｅｎｅｔ．，６６，１８３−１９３（２００２））を用いて行なった。
ブロック内の全てのハプロタイプのうち、９５％以上又は９０％以上を説明する主要ハプロタイプを用いて、ｈｔＳＮＰｓを選択した。ｈｔＳＮＰｓの選択方法は、基本的には公知方法（Ｄａｌｙ，Ｍ．Ｊ．ｅｔａｌ，，Ｎａｔ．Ｇｅｎｅｔ．，２９，２２９−２３２（２００１）；Ｐａｔｉｌ，Ｎ．ｅｔａｌ．，Ｓｃｉｅｎｃｅ，２９４，１７１９−１７２３（２００１）；Ｊｏｈｎｓｏｎ，Ｇ．Ｃ．ｅｔａｌ．，Ｎａｔ．Ｇｅｎｅｔ．，２９，２３３−２３７（２００１））と同一とした。具体的には、Ａｖｉ−ＩｔｚｈａｋらのＰｈａｓｅｒＩＩ法（Ａｖｉ−Ｉｔｚｈａｋ，Ｈ．Ｉ．ｅｔａｌ．，Ｐａｃ．Ｓｙｍｐ．Ｂｉｏｃｏｍｐｕｔ．，４６６−４７７（２００３））を用いた。
（５）主要ハプロタイプへのＲａｒｅＳＮＰｓの割当て
ＲａｒｅＳＮＰｓをもつ対立遺伝子の主要ハプロタイプへの割当ては以下のように行った。すなわち、ＬＤＳＵＰＰＯＲＴ又はＰＨＡＳＥのいずれかにより、ブロック内の０．１以上のマイナーアレル頻度をもつ全てのＣｏｍｍｏｎＳＮＰｓに加え、ＲａｒｅＳＮＰｓのデータも用いて、ハプロタイプの推定を行った。ＬＤＳＵＰＰＯＲＴではＥＭアルゴリズムに基づき、それぞれの被験者の集団ハプロタイプ頻度とディプロタイプ分布（ディプロタイプ構成後の分布）を推測し、ＰＨＡＳＥではマルコフ連鎖モンテカルロ法とハプロタイプ推定の融合モデルを用いた。ハプロタイプ推定後、ＲａｒｅＳＮＰｓのマイナーアレルがどの程度の割合で単一の主要ハプロタイプへ割当られたかを試験した。
（６）結果
（６−１）概要
本実施例においては、２９９９個の常染色体上のＳＮＰのうち、２６０３ＳＮＰｓについてハプロタイプブロック構築を行い解析した（図７）。それらのＳＮＰｓのうち、ＣｏｍｍｏｎＳＮＰｓ（サンプル内の割合が０．１以上のもの）は２０８５個、ＲａｒｅＳＮＰｓ（サンプル内の割合が０．１未満のもの）は５１８個であった。
２０８５個のＣｏｍｍｏｎＳＮＰｓを用いてハプロタイプ構築を行った結果、１３１個の常染色体遺伝子中、全部で２４５個のブロックを同定した。２０８５個のＣｏｍｍｏｎＳＮＰｓ中、ブロック内ＣｏｍｍｏｎＳＮＰｓは１９８２個（ＣｏｍｍｏｎＳＮＰｓ全体の９５．１％）であった。ブロック外ＳＮＰｓは１０３個（４．９％）であった。さらに、ブロック内ＣｏｍｍｏｎＳＮＰｓを用い、ブロックごとにｈｔＳＮＰｓを選択した。ｈｔＳＮＰｓの選択には、ハプロタイプの９５％以上を説明するために必要な最低限のＳＮＰを選択した。この結果、ｈｔＳＮＰｓの総数は６５８個になった。従って、ブロック内ＣｏｍｍｏｎＳＮＰｓの１９８２個が３３．２％に絞り込まれたことになる。
以上より、ブロック内のＣｏｍｍｏｎＳＮＰｓについてはｈｔＳＮＰｓ６５８個を用いて解析し、ブロック外ＣｏｍｍｏｎＳＮＰｓについては１０３個すべてを用いて解析することができる。これらのＳＮＰの数は、最初のＣｏｍｍｏｎＳＮＰｓ数２０８５個の３６．５％に相当する（図７）。
一方、ＲａｒｅＳＮＰｓは５１８個存在し（図７）、その内訳は、ＣｏｍｍｏｎＳＮＰｓを用いて構築されたハプロタイプブロック内に存在するものが３３４個（６４．５％）であり、ブロック外に存在するものが１８４個（３５．５％）であった。３３４個のブロック内ＲａｒｅＳＮＰｓについては、それぞれのＳＮＰのｍｉｎｏｒａｌｌｅｌｅのほとんどは一つのハプロタイプにアサインされた（ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ）。
以上より、ＣｏｍｍｏｎＳＮＰｓと薬剤反応性などの表現型との間に真の相関があればｈｔＳＮＰｓとブロック外ＣｏｍｍｏｎＳＮＰｓを合計した全体の約３６．５％のＳＮＰｓの検索により検出できる。また、同様の解析によりブロック内ＲａｒｅＳＮＰｓと表現型の真の相関も検出できる。
本発明において構築したブロック及びｈｔＳＮＰｓの例を前記表２〜３に示す。
（６−２）常染色体上のＳＮＰｓ頻度の分布：
本実施例において、ＳＮＰｓは９６の染色体に見出されているが、多くのＳＮＰｓが１，５０４個の常染色体中の１％以内に存在していた。すなわち、試験した１０９個（３．６％）の常染色体ＳＮＰｓは、１，５０４個の染色体中の１％以内に存在していた。図８は、１，５０４個の染色体の２，９９９個の常染色体ＳＮＰｓ全てに対するマイナーアレル頻度のヒストグラムを示す。
（６−３）コーディング配列中の塩基の変化の影響
３，０３６個のＳＮＰｓのうち、２２９個はコーディング領域に存在し、塩基置換の結果、すなわち、生じた同義又は非同義変化が１８３個存在した。それらのうち、９０個は非同義置換であり、９３個は同義置換であった。図９は、異なるマイナーなアレル頻度をもつ非同義及び同義置換ＳＮＰｓのヒストグラムを示す。
また、同義置換と非同義置換との間において、頻度に差があるか否かをマンホイットニー検定（Ｍａｎｎ−Ｗｈｉｔｎｅｙ’ｓｔｅｓｔ）により試験した。その結果、非同義置換ＳＮＰｓの頻度は同義置換ＳＮＰｓの頻度よりも低かった（Ｐ＜０．０５）。
（６−４）ハプロタイプブロックの構築：
マイナーアレルの頻度が０．１以上であるＳＮＰｓを用いて、１３１個の常染色体遺伝子に合計２４５個のブロックを同定した。本実施例では、同一の均一な集団から多くの被験者（７５２名）を用いて評価しているので、かなり信頼できるものである。
図１０は、常染色体遺伝子あたりのブロック数のヒストグラムを示す。常染色体遺伝子あたりのブロック数は、１．７０±６．３４（平均±Ｓ．Ｄ．）であった。いくつかの遺伝子は互いに近接して存在し、一の遺伝子から他の遺伝子にわたりハプロタイプブロックが構築された。すなわち、以下の場合において、ブロックに１個以上の遺伝子が含まれていた；ＡＢＣＧ５−ＡＢＣＧ８、ＡＤＨ４−ＡＤＨ６−ＡＤＨ１−ＡＤＨ２、ＡＬＰＨ３Ｂ１−ＮＤＵＦＳ８、ＣＹＰ３Ａ５−ＣＹＰ３Ａ７−ＣＹＰ３Ａ４、ＭＧＳＴ３−ＡＬＤＨ９Ａ１、ＧＳＴＰｉ−ＮＤＵＦＶ１、ＡＢＣＢ４−ＡＢＣＢ１、ＮＤＵＦＡ７−ＮＤＵＦＢ７及びＣＹＰ４Ｆ８−ＣＹＰ４Ｆ３。
（６−５）ブロックの長さ：
常染色体遺伝子の合計２４５個のブロックのうち、１１２個のブロックにおいて、ブロックの５’又は３’端は、試験したＳＮＰｓのセットをもつ領域の端と同一であった。そのようなブロックの端は真の端の場合もあれば、さらに広がっているために真の端ではない場合もある。残りの１１７個のブロックでは、５’端も３’端も領域の端とは同一ではなかった。それらのうち、両端のヌクレオチドの位置が９７個のブロックで明らかとなった。図１１は、マイナーアレル頻度が０．１以上であるＳＮＰｓを用いて構築した常染色体ブロックの長さのヒストグラムを示す。
ブロックのサイズは０．１−７７．９ｋｂであり、平均１１．５ｋｂ（Ｓ．Ｄ．１０．９ｋｂ）、中央値８．５ｋｂであった。最も大きいブロックはＡＬＤＨ１Ａ２内に存在し、７７．９ｋｂの長さを有していた。また、２番目に大きいブロックは４３．３ｋｂであり、ＡＢＣＧ５からＡＢＣＧ８まで伸びていた。
（６−６）ブロック内主要ハプロタイプの数
ハプロタイプブロックを用いることの利点は、ブロック内の可能なハプロタイプの合計数は多数あっても、ほとんどのブロック内ハプロタイプを限定された少数の主要ハプロタイプにより説明できることである。そこで、それぞれのブロックに対して、ブロック内の全てのハプロタイプの９０％以上及び９５％以上を説明する主要ハプロタイプの数を計算した。その結果、上記９０％以上及び９５％以上を説明する主要ハプロタイプは、それぞれ３．３７±０．８５（中央値３）、４．０５±１．２１（中央値４）であった。
（６−７）ブロック内主要ハプロタイプを表すのに必要なｈｔＳＮＰｓの数：
ほとんどのブロック内ハプロタイプを説明する主要ハプロタイプの数を限定する場合は、ｈｔＳＮＰｓを用いることができる。本発明者は、各ハプロタイプブロックについて、全ての主要ハプロタイプを示すｈｔＳＮＰｓを選択した。そのようなｈｔＳＮＰｓを用いて、全ての主要ハプロタイプを互いに区別することができる。このことは、その他にマーカーを追加しても、主要ハプロタイプによって説明されるハプロタイプの割合がそれほど増加しなかったことを意味する。
９０％以上及び９５％以上のハプロタイプを説明する主要ハプロタイプについて必要とされるｈｔＳＮＰｓの数は、それぞれ２．２５±０．７９（中央値２）、２．６９±１．００（中央値３）であった。
ブロック内ＣｏｍｍｏｎＳＮＰｓ（１９８２個）からｈｔＳＮＰｓを選択すると、ｈｔＳＮＰｓの数は全部で６５８個であった（図７）。
（６−８）ブロック内の主要ハプロタイプへのそれぞれのＲａｒｅＳＮＰｓのマイナーなアレルの割り当て：
ＲａｒｅＳＮＰｓをマイナーアレル頻度が０．１未満のＳＮＰとして定義する場合、５１８個のＲａｒｅＳＮＰｓが得られた（図７）。これらのＲａｒｅＳＮＰｓのうち、３３４個（６４．５％）がブロック内ＲａｒｅＳＮＰｓであり、１８４個（３５．５％）がブロック外ＲａｒｅＳＮＰｓであった（図７）。
本発明において解析されたＲａｒｅＳＮＰｓの一例を表２に示す。
また、下記の表４は、主要ハプロタイプへＲａｒｅＳＮＰｓのマイナーアレルを割り当てるためにＥＭに基づくアルゴリズムが導入されたソフトウェアＡｓｓｉｇｎＨａｐｌｏの出力例を示す。

典型的には、ＲａｒｅＳＮＰｓのマイナーアレル全てが単一の主要ハプロタイプにａｓｓｉｇｎされた（表４）。ここで、ＲａｒｅＳＮＰｓ２１８のマイナーアレルは３個だけ存在し、３個のマイナーアレル全てがハプロタイプ「１１１１１０１１１１１０（１）１１１０」にａｓｓｉｇｎされた（括弧内の数字はＲａｒｅＳＮＰｓを示す）。このことは、ＲａｒｅＳＮＰｓのマイナーアレル「（１）」は、単一の主要ハプロタイプ「１１１１１０１１１１１０（−）１１１０」にａｓｓｉｇｎされたことを示す（「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」）。
しかしながら、表５に示すように、全てのＲａｒｅＳＮＰｓのマイナーアレルが「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」にａｓｓｉｇｎされるわけではなかった。この場合、ディプロタイプ形２、３及び４は１つの事象に集中せず、アルゴリズムによって、ＲａｒｅＳＮＰｓのマイナーアレル１２個が単一のハプロタイプ００（１）００００にａｓｓｉｇｎされる可能性が示された。

以下の表６に示すように、ＲａｒｅＳＮＰｓのマイナーアレル１３７個全てが単一のハプロタイプ０００００（１）００にａｓｓｉｇｎされた。但し、４個のマイナーアレルについては１１１１１（１）１０にａｓｓｉｇｎされる可能性があった。

ＲａｒｅＳＮＰｓのマイナーアレルをもつ被験者のディプロタイプ構成の可能性の一つがマイナーアレルの「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」への割当てを示す場合は、そのようなマイナーなアレルは「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」へ割当られると考えられる。
しかし、ＲａｒｅＳＮＰｓのマイナーアレル全てが単一の「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」に割当てられるわけではない場合が存在した（表７）。

この例では、ＲａｒｅＳＮＰｓのマイナーアレル２６個のうち２４個が単一の「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」１１０１１（１）０１０１１１１にａｓｓｉｇｎされたのに対し、２個のマイナーアレルはａｓｓｉｇｎすることができなかった。２個のマイナーアレルのうちの一方は、ディプロタイプ形が０１０１１（１）０１０１１１１／０００００００００００００の被験者のものであり、他方はディプロタイプ形が１１０１１（１）０１０１０１１／０００００００００００００のものであった。これらについてはミスタイピング、組換え、あるいは突然変異のいずれかの原因によるものと思われる。
図１２ａは、異なる頻度のＲａｒｅＳＮＰｓに対して、「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」にａｓｓｉｇｎされたマイナーアレルの割合を示す。これらのデータはＲａｒｅＳＮＰｓの殆どが単一の「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」に割当てられたことを示す。
図１２ｂは、データをソフトウェアＰＨＡＳＥにより解析されたときの結果であり、異なる頻度のＲａｒｅＳＮＰｓに対して、「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」にａｓｓｉｇｎされたマイナーアレルの割合を示す。これらの結果もまた、ＲａｒｅＳＮＰｓの殆どが単一の「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」にａｓｓｉｇｎされる傾向にあることが示された。
ブロックに隣接するＳＮＰｓについて、主要ハプロタイプへの割当を検討した結果、ＲａｒｅＳＮＰｓのマイナーアレルの大多数が必ずしも単一の主要ハプロタイプへ割当られるわけではなかった。従って、「ｍａｊｏｒｉｔｙ−ａｓｓｉｇｎｅｄｈａｐｌｏｔｙｐｅ」が存在することは、ブロック内のＲａｒｅＳＮＰｓが有する特徴であるといえる。
本実施例においては多数の試料を用いてＲａｒｅＳＮＰｓの解析を行なったため、ＲａｒｅＳＮＰｓのマイナーアレルの頻度は従来よりも精度が高いことに注目するべきである。従って、本発明によって得られたデータは、ＲａｒｅＳＮＰｓ及び主要ハプロタイプを用いて、表現型との関連性を包括的に試験するための極めて有用なツールとなるものである。
In this example, haplotype analysis was performed using a population of 752 people.
(1) Materials and Methods
This example was conducted with the approval of the Genomic Ethics Committee of Tokyo Women's Medical University and the Genomic Ethics Committee of the Pharma SNP Consortium. DNA obtained from volunteers was used as the test material (population). Informed consent was obtained from the subjects. A total of 1032 volunteers were recruited, and DNA from 752 randomly selected subjects was used. The 752 subjects comprised 449 males and 303 females. The age of the subjects was 36.1 ± 11.5 years for men and 40.6 ± 11.3 years for women.
Genotypes were determined in DNA from 752 Japanese subjects for 3036 SNPs in the 147 genes shown in Table 1. These genes are either drug-related genes or candidate drug-related genes. Table 1 lists the genes, their chromosomal locations, and the number of SNPs in each gene in the present invention.
As shown in Table 1, 2999 SNPs in 141 genes were autosomal, and 37 SNPs in 6 genes were X-linked. Some of the genes are known to be clinically related to drug response; others encode transporters, oxidoreductases, various transferases, and other proteins.
(2) Genotyping
The Invader assay combines a structure-specific cleavage enzyme with a universal fluorescence resonance energy transfer (FRET) system. Allele-specific oligonucleotide pairs and invader probes were designed, and SNP genotyping (detection of SNPs) was performed using an Invader assay kit (Third Wave Technologies) (Kwiatkowski, R.W. et al., Mol. Diagnostics, 4, 353-364 (1999)). The empirically determined frequency of typing errors was 0.0001045.
(3) Construction of haplotype blocks
Haplotype blocks were constructed as follows, using an algorithm based on known methods (Zhang, K. et al., Proc. Natl. Acad. Sci. USA, 99, 7335-7339 (2002); Gabriel, S. B. et al., Science, 296, 2225-2229 (2002)).
An initial interval was constructed in which all pairwise D' values were 0.9 or greater. When haplotypes were estimated within this interval, there were several major haplotypes with a frequency of 5% or more, and the total frequency of the major haplotypes exceeded 90%. Haplotypes were then re-estimated after adding an adjacent SNP to the initial interval. If no new major haplotype appeared in the estimation, the new SNP was included to create a new interval. This procedure was repeated in each of the 5' and 3' directions until the inclusion of an adjacent SNP resulted in a new major haplotype. The resulting interval, containing all tested SNPs except the two SNPs (one in each direction) whose inclusion increased the number of major haplotypes, was defined as the haplotype block.
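The initial-interval criterion above can be illustrated with a short sketch. It assumes that phased haplotype frequencies are already available and uses the standard definition of Lewontin's D'; the function names and the toy frequencies are hypothetical and are not taken from the specification.

```python
from itertools import combinations

def d_prime(p_ab, p_a, p_b):
    """|D'| for two biallelic SNPs, given the frequency of the haplotype
    carrying allele A and allele B (p_ab) and the two allele frequencies."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0

def pairwise_d_prime(haps, i, j):
    """haps maps haplotype strings over '0'/'1' to their frequencies."""
    p_a = sum(f for h, f in haps.items() if h[i] == "1")
    p_b = sum(f for h, f in haps.items() if h[j] == "1")
    p_ab = sum(f for h, f in haps.items() if h[i] == "1" and h[j] == "1")
    return d_prime(p_ab, p_a, p_b)

def is_initial_interval(haps, threshold=0.9):
    """True if every pair of SNPs in the interval has |D'| >= threshold."""
    n = len(next(iter(haps)))
    return all(pairwise_d_prime(haps, i, j) >= threshold
               for i, j in combinations(range(n), 2))

# Hypothetical haplotype frequencies for a three-SNP interval.
toy_interval = {"000": 0.5, "111": 0.3, "011": 0.2}
print(is_initial_interval(toy_interval))
```

Extending the interval in the 5' and 3' directions would repeat haplotype estimation after each added SNP, which this sketch does not attempt.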
(4) Estimation of haplotype frequency and selection of htSNPs
To estimate haplotypes within a haplotype block, only SNPs with a minor allele frequency of 0.1 or more were used, together with the software LDSUPPORT (Kitamura, Y. et al., Ann. Hum. Genet., 66, 183-193 (2002)).
htSNPs were selected using the major haplotypes that together account for 95% or more, or 90% or more, of all haplotypes in the block. The selection of htSNPs basically followed known methods (Daly, M.J. et al., Nat. Genet., 29, 229-232 (2001); Patil, N. et al., Science, 294, 1719-1723 (2001); Johnson, G.C. et al., Nat. Genet., 29, 233-237 (2001)). Specifically, the Phaser II method of Avi-Itzhak et al. (Avi-Itzhak, H.I. et al., Pac. Symp. Biocomput., 466-477 (2003)) was used.
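To make the idea of htSNP selection concrete, the following is a minimal brute-force sketch: it looks for a smallest set of loci whose alleles distinguish all major haplotypes from one another. This is not the Phaser II method cited above; the function name and the example haplotypes are hypothetical.

```python
from itertools import combinations

def select_htsnps(major_haplotypes):
    """Return the indices of a smallest set of loci that tells all
    major haplotypes apart (exhaustive search, small blocks only)."""
    n = len(major_haplotypes[0])
    n_distinct = len(set(major_haplotypes))
    for k in range(1, n + 1):
        for loci in combinations(range(n), k):
            patterns = {tuple(h[i] for i in loci) for h in major_haplotypes}
            if len(patterns) == n_distinct:
                return loci
    return tuple(range(n))

# Hypothetical major haplotypes of a four-SNP block.
majors = ["0000", "0011", "1100", "1111"]
print(select_htsnps(majors))
```

An exhaustive search is only practical for blocks with few SNPs, which is consistent with the small block sizes reported in this example.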
(5) Assignment of Rare SNPs to major haplotypes
Minor alleles of Rare SNPs were assigned to major haplotypes as follows. Haplotypes were estimated with either LDSUPPORT or PHASE, using the data for the Rare SNPs in addition to all Common SNPs in the block having a minor allele frequency of 0.1 or more. LDSUPPORT estimated the haplotype frequencies and the diplotype distribution (the posterior distribution of diplotype configurations) of each subject based on the EM algorithm, whereas PHASE used a model combining the Markov chain Monte Carlo method with haplotype estimation. After haplotype estimation, the extent to which the minor alleles of each Rare SNP were assigned to a single major haplotype was examined.
(6) Results
(6-1) Summary
In this example, of the 2999 autosomal SNPs, 2603 SNPs were used for block construction and analysis (FIG. 7). Of these, 2085 were Common SNPs (minor allele frequency in the sample of 0.1 or more) and 518 were Rare SNPs (minor allele frequency in the sample of less than 0.1).
As a result of haplotype block construction using the 2085 Common SNPs, a total of 245 blocks were identified in 131 autosomal genes. Of the 2085 Common SNPs, 1982 (95.1%) were intra-block Common SNPs and 103 (4.9%) were out-of-block Common SNPs. Furthermore, htSNPs were selected for each block from the intra-block Common SNPs, choosing the minimum number of SNPs necessary to explain 95% or more of the haplotypes. As a result, the total number of htSNPs was 658. The 1982 intra-block Common SNPs were thus narrowed down to 33.2%.
As described above, the common SNPs in the block can be analyzed using 658 htSNPs, and the non-block common SNPs can be analyzed using all 103. The number of these SNPs corresponds to 36.5% of the initial number of common SNPs of 2085 (FIG. 7).
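The two reduction figures quoted above follow directly from the SNP counts reported in this summary. The short check below only reproduces that arithmetic; the variable names are illustrative.

```python
# Counts reported in the summary (FIG. 7).
intra_block_common = 1982   # Common SNPs inside haplotype blocks
out_of_block_common = 103   # Common SNPs outside any block
ht_snps = 658               # htSNPs selected from the intra-block Common SNPs

total_common = intra_block_common + out_of_block_common
frac_within_blocks = ht_snps / intra_block_common
frac_overall = (ht_snps + out_of_block_common) / total_common

print(total_common)
print(f"{100 * frac_within_blocks:.1f}% of intra-block Common SNPs")
print(f"{100 * frac_overall:.1f}% of all Common SNPs")
```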
On the other hand, there were 518 Rare SNPs (FIG. 7), of which 334 (64.5%) were inside the haplotype blocks constructed using Common SNPs and 184 (35.5%) were outside the blocks. For the 334 intra-block Rare SNPs, most of the minor alleles of each SNP were assigned to one haplotype (the majority-assigned haplotype).
From the above, if there is a true correlation between the common SNPs and phenotypes such as drug reactivity, it can be detected by searching about 36.5% of SNPs in total of htSNPs and non-block common SNPs. Moreover, true correlation between intra-block Rare SNPs and phenotypes can also be detected by similar analysis.
Examples of blocks and htSNPs constructed in the present invention are shown in Tables 2 to 3 above.
(6-2) Distribution of SNP frequency on autosome:
In this example, the SNPs had been discovered using 96 chromosomes, but many SNPs had a minor allele frequency of 1% or less among the 1,504 autosomal chromosomes. Specifically, 109 (3.6%) of the autosomal SNPs tested had a minor allele frequency of 1% or less in the 1,504 chromosomes. FIG. 8 shows a histogram of the minor allele frequencies of all 2,999 autosomal SNPs in the 1,504 chromosomes.
(6-3) Effect of base changes in coding sequences
Of the 3,036 SNPs, 229 were located in coding regions, and 183 of these base substitutions resulted in synonymous or non-synonymous changes. Of those, 90 were non-synonymous substitutions and 93 were synonymous substitutions. FIG. 9 shows a histogram of the non-synonymous and synonymous SNPs by minor allele frequency.
In addition, whether the minor allele frequency differed between synonymous and non-synonymous substitutions was tested with the Mann-Whitney test. As a result, the frequency of non-synonymous SNPs was lower than that of synonymous SNPs (P < 0.05).
(6-4) Construction of haplotype block:
Using SNPs with a minor allele frequency of 0.1 or higher, a total of 245 blocks were identified in 131 autosomal genes. Because a large number of subjects (752) from the same homogeneous population were evaluated in this example, the result is considered quite reliable.
FIG. 10 shows a histogram of the number of blocks per autosomal gene. The number of blocks per autosomal gene was 1.70 ± 6.34 (mean ± SD). Several genes were located close to one another, and some haplotype blocks extended from one gene into another. That is, in the following cases a block spanned more than one gene: ABCG5-ABCG8, ADH4-ADH6-ADH1-ADH2, ALPH3B1-NDUFS8, CYP3A5-CYP3A7-CYP3A4, MGST3-ALDH9A1, GSTPi-NDUFV1, ABCB4-ABCB1, NDUFA7-NDUFB7 and CYP4F8-CYP4F3.
(6-5) Block length:
Of the total 245 blocks of autosomal genes, in 112 blocks, the 5 ′ or 3 ′ end of the block was identical to the end of the region with the set of SNPs tested. The end of such a block may be the true end, or it may be wider and not the true end. In the remaining 117 blocks, neither the 5 'end nor the 3' end was the same as the region end. Among them, the nucleotide positions at both ends were revealed in 97 blocks. FIG. 11 shows a histogram of the length of autosomal blocks constructed using SNPs with a minor allele frequency of 0.1 or higher.
The block size was 0.1-77.9 kb, the average was 11.5 kb (SD 10.9 kb), and the median was 8.5 kb. The largest block was present in ALDH1A2 and had a length of 77.9 kb. The second largest block was 43.3 kb, extending from ABCG5 to ABCG8.
(6-6) Number of major haplotypes in a block
The advantage of using haplotype blocks is that, even though the total number of possible haplotypes in a block is large, most haplotypes in the block can be explained by a small number of major haplotypes. Thus, for each block, the number of major haplotypes that account for 90% or more and 95% or more of all haplotypes in the block was calculated. As a result, the numbers of major haplotypes explaining 90% or more and 95% or more were 3.37 ± 0.85 (median 3) and 4.05 ± 1.21 (median 4), respectively.
(6-7) Number of htSNPs required to represent the major haplotypes in the block:
HtSNPs can be used to limit the number of major haplotypes that account for most intra-block haplotypes. The inventor selected htSNPs representing all major haplotypes for each haplotype block. Such htSNPs can be used to distinguish all major haplotypes from each other. This means that adding other markers did not significantly increase the proportion of haplotypes explained by the major haplotypes.
The numbers of htSNPs required for the major haplotypes accounting for 90% or more and 95% or more of the haplotypes were 2.25 ± 0.79 (median 2) and 2.69 ± 1.00 (median 3), respectively.
When htSNPs were selected from the intra-block Common SNPs (1982), the total number of htSNPs was 658 (FIG. 7).
(6-8) Assignment of minor alleles of each Rare SNPs to the major haplotypes in the block:
When defining Rare SNPs as SNPs with a minor allele frequency of less than 0.1, 518 Rare SNPs were obtained (FIG. 7). Of these Rare SNPs, 334 (64.5%) were intra-block Rare SNPs and 184 (35.5%) were out-of-block Rare SNPs (FIG. 7).
An example of Rare SNPs analyzed in the present invention is shown in Table 2.
Table 4 below shows an output example of the software AssignHaplo in which an algorithm based on EM is introduced in order to assign minor alleles of Rare SNPs to major haplotypes.

Typically, all minor alleles of a Rare SNP were assigned to a single major haplotype (Table 4). Here, Rare SNP 218 has only three minor alleles, and all three were assigned to the haplotype “111110111110(1)1110” (the number in parentheses indicates the Rare SNP). This indicates that the minor allele “(1)” of the Rare SNP was assigned to a single major haplotype “111110111110(−)1110” (the “majority-assigned haplotype”).
However, as shown in Table 5, not all minor alleles of Rare SNPs were assigned to the “majority-assigned haplotype”. In this case, diplotype configurations 2, 3, and 4 did not concentrate on one event, and the algorithm indicated that the 12 minor alleles of the Rare SNP could be assigned to a single haplotype, 00(1)0000.

As shown in Table 6 below, all 137 minor alleles of the Rare SNP were assigned to a single haplotype, 00000(1)00. However, four of the minor alleles could also have been assigned to 11111(1)10.

If one of the possible diplotype configurations of a subject carrying a minor allele of a Rare SNP indicates assignment of the minor allele to the “majority-assigned haplotype”, that minor allele is considered to be assigned to the “majority-assigned haplotype”.
However, there were cases where not all minor alleles of Rare SNPs were assigned to a single “majority-assigned haplotype” (Table 7).

In this example, 24 of the 26 minor alleles of the Rare SNP were assigned to a single “majority-assigned haplotype”, 11011(1)0101111, whereas the remaining two minor alleles could not be assigned. One of the two was from a subject with the diplotype configuration 01011(1)0101111/0000000000000, and the other was from a subject with the diplotype configuration 11011(1)0101011/0000000000000. These are thought to be due to mistyping, recombination, or mutation.
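The notion of a “majority-assigned haplotype” can be sketched as a simple count over the haplotype backgrounds on which the minor allele of one Rare SNP was found. The sketch assumes the phased backgrounds are already available from a haplotype-estimation program; the labels are hypothetical placeholders, and only the 24-of-26 split comes from the text above.

```python
from collections import Counter

def majority_assigned(backgrounds):
    """Given the haplotype background of every copy of a Rare SNP's minor
    allele, return the most frequent background and the fraction of minor
    alleles assigned to it."""
    counts = Counter(backgrounds)
    haplotype, n = counts.most_common(1)[0]
    return haplotype, n / len(backgrounds)

# 26 minor alleles: 24 on one major haplotype, 2 on other backgrounds
# (labels are placeholders, not the haplotype strings of Table 7).
backgrounds = ["major"] * 24 + ["other_1", "other_2"]
haplotype, fraction = majority_assigned(backgrounds)
print(haplotype, round(fraction, 3))
```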
FIG. 12a shows, for Rare SNPs of different frequencies, the percentage of minor alleles assigned to the “majority-assigned haplotype”. These data indicate that the minor alleles of most Rare SNPs were assigned to a single “majority-assigned haplotype”.
FIG. 12b shows the corresponding results when the data were analyzed with the software PHASE, again as the percentage of minor alleles assigned to the “majority-assigned haplotype” for Rare SNPs of different frequencies. These results also indicate that most Rare SNPs tend to be assigned to a single “majority-assigned haplotype”.
When assignment to major haplotypes was examined for SNPs adjacent to the blocks, the majority of the minor alleles of Rare SNPs were not necessarily assigned to a single major haplotype. Therefore, the existence of a “majority-assigned haplotype” can be said to be a characteristic of intra-block Rare SNPs.
In this example, because Rare SNPs were analyzed using a large number of samples, it should be noted that the minor allele frequencies of the Rare SNPs were determined with higher accuracy than before. Therefore, the data obtained by the present invention are a very useful tool for comprehensively testing associations with phenotypes using Rare SNPs and major haplotypes.

ハプロタイプと、スルファサラジンに対する副作用との関係
スルファサラジン（ｓｕｌｆａｓａｌａｚｉｎｅ）は、潰瘍性大腸炎治療剤の１つであり、限局性腸炎、非特異性大腸炎などにも効能・効果を有する。
本実施例においては、どのようなハプロタイプを有する個体が副作用と相関するかについて試験を行った。
スルファサラジンの副作用には、例えば投与量依存性副作用として悪心、嘔吐、頭痛、溶血性貧血、巨赤芽球性貧血などが挙げられる。
スルファサラジンの投与によって上記副作用の少なくとも１つが表れた被験者から薬物代謝酵素遺伝子のＳＮＰｓのタイピングを行い、ハプロタイプ解析を行った。
表２に記載のすべての薬物代謝酵素遺伝子の中からから副作用と関連すると考えられる遺伝子を解析し、ＮＡＴ２遺伝子を選択した。そして、ＮＡＴ２遺伝子から２４個のＳＮＰｓを選出した（表８）。

表８において「ＣＤ」はコーディング領域のＳＮＰｓを表わす。ｈｔＳＮＰｓ及びＲａｒｅＳＮＰｓに「＋」印を付しておいた。表１に示す２４ＳＮＰｓからＲａｒｅＳＮＰｓを除いた１８ＳＮＰｓのハプロタイプを表９に示す。

数字の「０」はアレル１を、「１」はアレル２を表わす。
そして、表８に記載の４個のｈｔＳＮＰｓを表１０に示す。

また、下記表１１は、１〜４個のｈｔＳＮＰｓをアレル１又はアレル２のどちらかに特定したときのハプロタイプの組合せを表わす。

「＊」印は、アレル１及びアレル２のどちらでもよいことを意味する。表１１に示すハプロタイプの組合せを用いて、一のハプロタイプ（例えば「＊０＊１」）と残りのすべてのハプロタイプ（「＊０＊１」以外のすべてのハプロタイプ）との間で頻度の有意差検定を行った。
その結果、ハプロタイプ「＊０＊１」を有する被験者が他のハプロタイプと比較して最も有意差があり、このハプロタイプを有する被験者が、スルファラジンの毒性が最も強く出るものであることが分かった。
Relationship between haplotypes and side effects of sulfasalazine
Sulfasalazine is one of the therapeutic agents for ulcerative colitis, and is also indicated for regional enteritis, nonspecific colitis and the like.
In this example, we examined which haplotypes carried by individuals correlate with side effects.
Examples of the side effects of sulfasalazine include nausea, vomiting, headache, hemolytic anemia, megaloblastic anemia and the like as dose-dependent side effects.
SNPs of drug metabolizing enzyme genes were typed from subjects who exhibited at least one of the above-mentioned side effects by administration of sulfasalazine, and haplotype analysis was performed.
From all the drug-metabolizing enzyme genes listed in Table 2, a gene considered to be associated with side effects was analyzed, and the NAT2 gene was selected. Then, 24 SNPs were selected from the NAT2 gene (Table 8).

In Table 8, “CD” represents SNPs in the coding region. htSNPs and Rare SNPs were marked with a “+” sign. Table 9 shows haplotypes of 18 SNPs obtained by removing Rare SNPs from 24 SNPs shown in Table 1.

The number “0” represents allele 1 and “1” represents allele 2.
The four htSNPs listed in Table 8 are shown in Table 10.

Table 11 below shows combinations of haplotypes when 1 to 4 htSNPs are specified as either allele 1 or allele 2.

The “*” mark means that either allele 1 or allele 2 may be present. Using the haplotype combinations shown in Table 11, a test for a significant difference in frequency was performed between one haplotype (e.g. “*0*1”) and all remaining haplotypes (all haplotypes other than “*0*1”).
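The comparison of one wildcard haplotype against all remaining haplotypes can be sketched as follows. The sketch assumes two groups of phased htSNP haplotypes (for example, subjects with and without side effects) and uses an uncorrected 2x2 chi-square statistic; the specification does not state which test was used, and the function names and data are hypothetical.

```python
def matches(haplotype, pattern):
    """True if the haplotype fits the pattern, where '*' matches either allele."""
    return len(haplotype) == len(pattern) and all(
        p in ("*", h) for h, p in zip(haplotype, pattern))

def two_by_two(group1, group2, pattern):
    """Counts of matching / non-matching haplotypes in each group."""
    a = sum(matches(h, pattern) for h in group1)
    c = sum(matches(h, pattern) for h in group2)
    return a, len(group1) - a, c, len(group2) - c

def chi_square(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical haplotypes over four htSNPs (0 = allele 1, 1 = allele 2).
with_side_effects = ["0001", "1011", "1111"]
without_side_effects = ["1100", "0001"]
table = two_by_two(with_side_effects, without_side_effects, "*0*1")
print(table, chi_square(*table))
```

Repeating this for every pattern in Table 11 and ranking the statistics would identify the pattern with the most significant difference.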
As a result, the haplotype “*0*1” showed the most significant difference compared with the other haplotypes, and subjects having this haplotype were found to be those in whom the toxicity of sulfasalazine appears most strongly.

ＲａｒｅＳＮＰｓと主要ハプロタイプとの関係．
本実施例では、ＲａｒｅＳＮＰｓのうちの一つをブロック内から選択し、Ｐ（Ａｉ｜Ｘ）をｉ個のすべてのハプロタイプについて計算した。そして、Ｐ（Ａｊ｜Ｘ）を最大化するｉ＝ｊを選択し、すべてのＲａｒｅＳＮＰｓについて、それぞれＰ（Ａｊ｜Ｘ）を計算した。
図１３は、本実施例において４，１０４個の常染色体ＳＮＰｓを統計的に分類した結果を示す図であり、図７に示す２０８５個の常染色体ＳＮＰｓに、更に数を増やして解析した結果である。
図１４は、ＬＤブロックの内部のＲａｒｅＳＮＰｓのｈｔＳＮＰにより構成されたハプロタイプへの割り当ての解析法を示す図であり、Ｐ（Ａｊ｜Ｘ）を計算する詳しい方法と、ＲａｒｅＳＮＰｓの低頻度アレルをｈｔＳＮＰにより構成されたハプロタイプに割り当てる方法を詳しく解説した。図１４では、ＤＰＹＤ（ｄｉｈｙｄｒｏｐｙｒｉｍｉｄｉｎｅｄｅｈｙｄｒｏｇｅｎａｓｅ）の中の一つのブロック（ブロック４）について解析した。
図１４において、パネル（Ａ）は、染色体１ｐに存在するＤＰＹＤ（ｄｉｈｙｄｒｏｐｙｒｉｍｉｄｉｎｅｄｅｈｙｄｒｏｇｅｎａｓｅ）遺伝子のデータを用いたときのＳＮＰのタイピングを行った結果の模式図である。ＤＰＹＤ遺伝子のすべてのＳＮＰ座位のうち、低頻度アレルの頻度が≧０．１のｃｏｍｍｏｎＳＮＰのみを用いて構築したブロックの模式図をパネル（Ａ）の下に棒線で示した（図１４（Ｂ））。
図１４（Ｃ）は、構築された１４のブロックの中で、９個のｃｏｍｍｏｎＳＮＰの存在するブロック４を例として用いたときの頻度を示す。このブロックは９個のｃｏｍｍｏｎＳＮＰに加えて、５個のＲａｒｅＳＮＰｓ（「ｕｎｃｏｍｍｏｎ」と表示）を含んでいる。（Ｃ）には、後に選択するｈｔＳＮＰ（ｈｔ）（Ｅの項に書く）も示してある。
（Ｄ）は、９個のｃｏｍｍｏｎＳＮＰのみを用いてハプロタイプを推定した結果を示す。一つのハプロタイプは、数字の「１」または「２」のリストとして示され、括弧の中にその頻度を示した。
（Ｅ）は、推定されたハプロタイプのデータ（パネルＤ）を用いてｈｔＳＮＰを選択したことを示す。ブロック４については、座位２，６，１２（パネルＤでは「^＊」で示す）をｈｔＳＮＰとして選択した。
（Ｆ）は、３つの選択されたｈｔＳＮＰに加え、ＲａｒｅＳＮＰｓのうちの一つ（パネルＣの４，５，７，８，または９である）をブロック４より選択し、割り当ての解析を行ったことを示す。この場合は、座位５をＲａｒｅＳＮＰｓとして選択した。座位２，６，１２のｈｔＳＮＰｓ及び座位５のＲａｒｅＳＮＰｓの遺伝子型データ（パネルＣを参照）を用い、ハプロタイプ推定を行った。推定されたハプロタイプは「１」又は「２」により表示される４つの番号のリストで示される。このリストで、ＲａｒｅＳＮＰｓ座位（この場合座位５）のアレルは括弧の中の数（ｕ）で表されている。ここでわかることは、（ｕ）が（１）を持ったハプロタイプは一つしか無いということである。即ちそのハプロタイプは「２（１）１２」である。従って、ハプロタイプ「２（−）１２」が「Ａｊ」であるとされる。即ち、Ａｊは、ＲａｒｅＳＮＰｓの低頻度アレルの大部分が割り当てられる不完全ハプロタイプと同定されるのである。
すべてのハプロタイプは、（１）が存在するか否か（Ｘ又はＸの補集合）、そして「２（−）１２」というハプロタイプが存在するか否か（Ａｊ又はＡｊの補集合）により分類される。即ち、次の４つのカテゴリーができる。

である。同じカテゴリーに分類されたハプロタイプは、頻度が合計され、

が計算される。
座位５については次のような推定頻度が得られる。

である。
即ち、座位５については

であり、これは座位５のすべての低頻度アレルがハプロタイプ２（−）１２（Ａｊ）に割り当てられたことを示す。
（Ｇ）は、座位８がＲａｒｅＳＮＰｓとして選択されたときのハプロタイプ及び頻度を示す。ハプロタイプは、座位２，６，１２のｈｔＳＮＰｓ、および座位８のＲａｒｅＳＮＰｓの遺伝子型データを用いて推定した。推定されたハプロタイプは、「１」又は「２」の数により表示される４つのリストにより示され、座位８のアレルは括弧内に示される（ｕ）。推定されたハプロタイプの中で（２）をもつものは「１１（２）１」及び「１２（２）２」の２つである。そのハプロタイプの頻度を比較すると、「１１（２）１」は０．０００７であるのに対し、「１２（２）２」は０．０７９１である。従って、「１１（２）１」よりも、「１２（２）２」の方がＲａｒｅＳＮＰｓの低頻度アレルの大部分が割り当てられるハプロタイプであるといえるため、「１２（２）２」をＡｊと判定した。
すべてのハプロタイプが（２）のあり（Ｘ）又はなし（Ｘの補集合）、そしてハプロタイプ１２（−）２のあり（Ａｊ）又はなし（Ａｊの補集合）により４つのカテゴリーに分類され、同じカテゴリーに属するハプロタイプの頻度は加えられ、

が計算される。
座位８については以下の推定された確率が得られた。即ち、

である。
このように座位８については

と計算され、これは座位８における低頻度ハプロタイプの大半は１２（−）２（Ａｊ）に割り当てられたことを示している。
このように計算されたＰ（Ａｊ｜Ｘ）を、ブロック内のすべてのＲａｒｅＳＮＰｓの（低頻度アレル）頻度Ｐ（Ｘ）に対してプロットした。
図１５は、ＲａｒｅＳＮＰｓの低頻度アレル（Ｘ）が割り当てられるＡｊ（ｈｔＳＮＰにより構築されたハプロタイプ）の割合を示す。即ち、図１５はＰ（Ｘ）とＰ（Ａｊ｜Ｘ）の関係を示す。Ｐ（Ａｊ｜Ｘ）を計算する方法の詳細は図１４に記載した。Ｐ（Ｘ）はＲａｒｅＳＮＰｓの低頻度アレルの頻度を示す。
この結果は、Ｐ（Ｘ）の値にかかわらずＰ（Ａｊ｜Ｘ）はほとんどの例で１に近いということである。即ち、Ｐ（Ａｊ｜Ｘ）の平均±ＳＤは０．９４３±０．１１７であり、ＲａｒｅＳＮＰｓの８３．９％（４５９）についての平均±ＳＤは＞０．９であった。このようなデータは、ほとんどの例で、それぞれのＲａｒｅＳＮＰｓは一つのｈｔＳＮＰのアレルで定義される不完全ハプロタイプに割り当てられることを示す。
次に、本発明者はＰ（Ｘ｜Ａｊ）を、ブロック内のすべてのＲａｒｅＳＮＰｓについて計算し、それをＰ（Ｘ）に対してプロットした（図１６）。
図１６は、ｈｔＳＮＰにより構築されたハプロタイプＡｊのうち、ＲａｒｅＳＮＰｓの低頻度アレル（Ｘ）を保有する割合、即ちＰ（Ｘ｜Ａｊ）を示す。
Ｐ（Ｘ｜Ａｊ）は下記式のように計算した。

また、以下の確率：

を計算する方法の詳細は図１４に記した。
その結果、Ｐ（Ｘ｜Ａｊ）は０＜Ｐ（Ｘ）＜０．０３（Ｐ＜０．０００００１，ｎ＝２３３；Ｓｐｅａｒｍａｎ’ｓｒａｎｋｃｏｒｒｅｌａｔｉｏｎｃｏｅｆｆｉｃｉｅｎｔ）の領域ではＰ（Ｘ）と正の相関があるが、それ以外の領域（０．０３≦Ｐ（Ｘ）＜１）ではそうではないということがわかった（Ｐ＝０．０５０，ｎ＝３１４）．
Ｘが最初に突然変異で生じた時にはＰ（Ｘ｜Ａｊ）は恐らく非常に小さかったであろうし、Ｐ（Ａｊ｜Ｘ）は恐らく１であったろう。時間がたつにつれて、もしＸが消滅しなければＰ（Ｘ）とＰ（Ｘ｜Ａｊ）は増加し、Ｐ（Ａｊ｜Ｘ）は低下するはずである。従って、Ｐ（Ｘ）＜０．０３のときのＰ（Ａｊ｜Ｘ）、およびＰ（Ｘ｜Ａｊ）のデータは、Ｘが生じた時の状態を反映していると考えられる。
図１７は、ＲａｒｅＳＮＰｓの高頻度アレル（Ｘの補集合）のうちＡｊに割り当てられる割合と、ＲａｒｅＳＮＰｓの低頻度アレル（Ｘ）のうちＡｊに割り当てられる割合との比、即ち以下の比：

を示す。

Relationship between Rare SNPs and major haplotypes.
In this example, one of the Rare SNPs was selected from within a block, and P(Ai | X) was calculated for all haplotypes Ai. Then, the i = j that maximizes P(Ai | X) was selected, and P(Aj | X) was calculated for each of the Rare SNPs.
FIG. 13 is a diagram showing the results of statistically classifying 4,104 autosomal SNPs in this example; it is the result of an analysis in which the number of SNPs was further increased from the 2085 autosomal SNPs shown in FIG. 7.
FIG. 14 is a diagram showing the method of analyzing the assignment of Rare SNPs inside LD blocks to haplotypes constructed from htSNPs. It explains in detail how P(Aj | X) is calculated and how the low-frequency allele of a Rare SNP is assigned to a haplotype constructed from htSNPs. In FIG. 14, one block (block 4) in DPYD (dihydropyrimidine dehydrogenase) was analyzed.
In FIG. 14, panel (A) is a schematic diagram of the results of SNP typing using data of DPYD (dihydropyrimidine dehydrogenase) gene present on chromosome 1p. Of all the SNP loci of the DPYD gene, a schematic diagram of a block constructed using only common SNPs with a frequency of low-frequency alleles ≧ 0.1 is shown by a bar line below the panel (A) (FIG. 14 ( B)).
FIG. 14(C) shows the frequencies for block 4, which contains nine Common SNPs, used as an example among the 14 blocks constructed. In addition to the nine Common SNPs, this block contains five Rare SNPs (labeled “uncommon”). Panel (C) also shows the htSNPs (ht) that are selected later (described under (E)).
(D) shows the result of estimating the haplotype using only 9 common SNPs. One haplotype is shown as a list of numbers “1” or “2”, with the frequency in parentheses.
(E) shows that htSNPs were selected using the estimated haplotype data (panel D). For block 4, loci 2, 6, and 12 (indicated by “^*” in panel D) were selected as htSNPs.
(F) shows the assignment analysis performed after selecting, in addition to the three selected htSNPs, one of the Rare SNPs from block 4 (locus 4, 5, 7, 8, or 9 in panel C). In this case, locus 5 was selected as the Rare SNP. Haplotypes were estimated using the genotype data (see panel C) for the htSNPs at loci 2, 6, and 12 and the Rare SNP at locus 5. Each estimated haplotype is shown as a list of four numbers, each “1” or “2”. In this list, the allele at the Rare SNP locus (here, locus 5) is shown as the number in parentheses (u). As can be seen, there is only one haplotype in which (u) is (1), namely “2(1)12”. Therefore, the haplotype “2(−)12” is taken to be “Aj”. That is, Aj is identified as the incomplete haplotype to which most of the low-frequency alleles of the Rare SNP are assigned.
All haplotypes are classified according to whether (1) is present (X or the complement of X) and whether the haplotype “2(−)12” is present (Aj or the complement of Aj). That is, the following four categories are created: X and Aj; X and the complement of Aj; the complement of X and Aj; and the complement of X and the complement of Aj. The frequencies of haplotypes classified into the same category are summed, and the probability of each of the four categories is calculated.
The following estimated frequencies were obtained for locus 5.

That is, for locus 5, P(Aj | X) = 1, which indicates that all low-frequency alleles at locus 5 were assigned to haplotype 2(−)12 (Aj).
(G) shows the haplotypes and frequencies when locus 8 was selected as the Rare SNP. Haplotypes were estimated using the genotype data for the htSNPs at loci 2, 6, and 12 and the Rare SNP at locus 8. Each estimated haplotype is shown as a list of four numbers, each “1” or “2”, with the allele at locus 8 shown in parentheses (u). Among the estimated haplotypes, two carry (2): “11(2)1” and “12(2)2”. Comparing their frequencies, “11(2)1” is 0.0007 whereas “12(2)2” is 0.0791. Since “12(2)2” rather than “11(2)1” is therefore the haplotype to which most of the low-frequency alleles of the Rare SNP are assigned, “12(2)2”, that is, the incomplete haplotype “12(−)2”, was determined to be Aj.
All haplotypes are classified into four categories according to whether (2) is present (X) or absent (the complement of X) and whether haplotype 12(−)2 is present (Aj) or absent (the complement of Aj); the frequencies of haplotypes belonging to the same category are summed, and the probability of each category is calculated.
The following estimated probabilities were obtained for locus 8.

Thus, P(Aj | X) was calculated for locus 8, and the result indicates that the majority of the low-frequency alleles at locus 8 were assigned to 12(−)2 (Aj).
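The four-category calculation for locus 8 can be sketched as below. Only the two frequencies 0.0007 for 11(2)1 and 0.0791 for 12(2)2 come from the text; the remaining haplotypes and their frequencies are hypothetical fillers chosen so that the total is 1, and the function names are illustrative. Haplotypes are written as four-character strings whose third character is the allele at locus 8.

```python
def classify(haps, rare_index, minor_allele, aj):
    """Sum haplotype frequencies into the four categories defined by
    X / not X (minor allele present or absent at the Rare SNP) and
    Aj / not Aj (htSNP background equal to aj or not)."""
    cats = {("X", "Aj"): 0.0, ("X", "notAj"): 0.0,
            ("notX", "Aj"): 0.0, ("notX", "notAj"): 0.0}
    for hap, freq in haps.items():
        x = "X" if hap[rare_index] == minor_allele else "notX"
        background = hap[:rare_index] + hap[rare_index + 1:]
        a = "Aj" if background == aj else "notAj"
        cats[(x, a)] += freq
    return cats

def p_aj_given_x(c):
    return c[("X", "Aj")] / (c[("X", "Aj")] + c[("X", "notAj")])

def p_x_given_aj(c):
    return c[("X", "Aj")] / (c[("X", "Aj")] + c[("notX", "Aj")])

haps = {
    "1121": 0.0007,  # 11(2)1, frequency from the text
    "1222": 0.0791,  # 12(2)2, frequency from the text
    "1212": 0.3000,  # hypothetical: Aj background without the minor allele
    "1111": 0.4000,  # hypothetical
    "2112": 0.2202,  # hypothetical
}
cats = classify(haps, rare_index=2, minor_allele="2", aj="122")
p_x = cats[("X", "Aj")] + cats[("X", "notAj")]
print(round(p_x, 4), round(p_aj_given_x(cats), 4))
```

With the two stated frequencies, P(Aj | X) = 0.0791 / (0.0791 + 0.0007), or about 0.991, which agrees with the statement that most low-frequency alleles at locus 8 fall on 12(−)2. The value of P(X | Aj) printed by this sketch would depend on the hypothetical filler frequencies and is therefore not meaningful.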
P(Aj | X) calculated in this way was plotted against the (low-frequency allele) frequency P(X) of all Rare SNPs in the blocks.
FIG. 15 shows the proportion of the low-frequency alleles (X) of Rare SNPs that are assigned to Aj (the haplotype constructed from htSNPs). That is, FIG. 15 shows the relationship between P(X) and P(Aj | X). Details of the method of calculating P(Aj | X) are shown in FIG. 14. P(X) indicates the frequency of the low-frequency allele of each Rare SNP.
The result is that P(Aj | X) is close to 1 in most cases regardless of the value of P(X). That is, the mean ± SD of P(Aj | X) was 0.943 ± 0.117, and P(Aj | X) was > 0.9 for 83.9% (459) of the Rare SNPs. These data indicate that, in most cases, each Rare SNP is assigned to a single incomplete haplotype defined by htSNP alleles.
Next, we calculated P(X | Aj) for all Rare SNPs in the blocks and plotted it against P(X) (FIG. 16).
FIG. 16 shows the proportion of haplotypes Aj constructed from htSNPs that carry the low-frequency allele (X) of the Rare SNP, that is, P(X | Aj).
P(X | Aj) was calculated as follows.

The following probabilities:

were calculated by the method detailed in FIG. 14.
As a result, P(X | Aj) was positively correlated with P(X) in the region 0 < P(X) < 0.03 (P < 0.000001, n = 233; Spearman's rank correlation coefficient), but not in the remaining region (0.03 ≦ P(X) < 1) (P = 0.050, n = 314).
When X first arose by mutation, P(X | Aj) was probably very small and P(Aj | X) was probably 1. Over time, if X does not disappear, P(X) and P(X | Aj) should increase and P(Aj | X) should decrease. Therefore, the data for P(Aj | X) and P(X | Aj) when P(X) < 0.03 are considered to reflect the state at the time X arose.
FIG. 17 shows the ratio of the proportion of the high-frequency alleles (complement of X) of Rare SNPs assigned to Aj to the proportion of the low-frequency alleles (X) of Rare SNPs assigned to Aj, that is, the ratio P(Aj | complement of X) / P(Aj | X).

表現型に関係したＲａｒｅＳＮＰｓをｈｔＳＮＰにより構成されたハプロタイプを用いて検出する確率
ｈｔＳＮＰのみを用いて構成されたハプロタイプ（不完全ハプロタイプ）は、主要ハプロタイプの検出に有効である。しかし、ｈｔＳＮＰや不完全ハプロタイプが表現型に関係したＲａｒｅＳＮＰｓの検出にどれほど有効かは不明である。本発明者は、得られたハプロタイプデータを用いてこの問題を研究した。
薬物代謝酵素欠損症のホモ接合体の個体は、薬物を投与したときに重症の副作用が生じることがわかっている。このような場合、原因である低頻度アレルＸは罹患者集団において増加していると思われる。即ち、

よりかなり高いと予想される。しかし、上記の条件はＸを検出するために不完全ハプロタイプが有効であるための十分条件ではない。表現型に関係したＲａｒｅＳＮＰｓ，Ｘではなく、不完全ハプロタイプが有効であるためには、

が異なっていることが重要である。
そのため、本発明者はすべてのＸに対して以下の比：

を計算し、Ｐ（Ｘ）に対する上記比を図１７に示した。
その結果、

よりも多くの場合かなり高く、特にＰ（Ｘ）＞０．０２の場合に高いことが分かった。実際に、

の平均±ＳＤは０．２００±０．２３０であった。
従って、Ｘを使わなくても、Ａｊを、Ｘを探索するためのマーカーとして用いることができる。
また、図１８は、ｈｔＳＮＰで構成されたハプロタイプの頻度の違いを比較する検定が有意となる確率を示す図であって、罹患者とコントロール集団の間で表現型に関係するＲａｒｅＳＮＰｓの低頻度アレルの頻度が異なるときの図である。
以下の比：

を８に、有意水準を０．０１に、罹患者集団とコントロール集団における人数それぞれＭ_１、Ｍ_２を５０と５００にセットした。
図１８に示す結果は、Ｐ（Ｘ）、ｒ、Ｍ_１及びＭ_２が十分に大きいときは、表現型に関連するＸをＡｊによって検出できることを示す。例えばＰ（Ｘ）＜０．０３のときは、確率は０．２３９±０．３１３（平均±ＳＤ，ｎ＝２３３）であるのに対し、Ｐ（Ｘ）≧０．０３のときは、確率は０．８８５±０．２５１（平均±ＳＤ，ｎ＝３１４）であった（図１８）。確率はＰ（Ｘ）とそのパラメーター（ｒ，Ｍ_１及びＭ_２）に依存し、これらのパラメーター及びＰ（Ｘ）（ＲａｒｅＳＮＰｓの低頻度アレルの頻度）のいずれかが上昇するにつれて確率も上昇した。図１８において使用した条件において、ほとんどのブロック内ＲａｒｅＳＮＰｓは、ＲａｒｅＳＮＰｓの頻度が０．０３を超えるとｈｔＳＮＰｓハプロタイプによって同定されることがわかった。
Probability of detecting Rare SNPs related to a phenotype using haplotypes constructed from htSNPs
Haplotypes constructed using only htSNPs (incomplete haplotypes) are effective for detecting major haplotypes. However, it is unclear how effective htSNPs and incomplete haplotypes are in detecting Rare SNPs related to a phenotype. The inventor studied this problem using the haplotype data obtained.
It is known that individuals homozygous for a drug-metabolizing enzyme deficiency develop severe side effects when the drug is administered. In such cases, the causal low-frequency allele X is thought to be increased in the affected population. That is, the frequency of X in the affected population is expected to be considerably higher than its frequency in the control population. However, this condition is not sufficient for incomplete haplotypes to be effective for detecting X. For incomplete haplotypes, rather than the phenotype-related Rare SNP X itself, to be effective, it is important that P(Aj | X) and P(Aj | complement of X) differ.
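Therefore, the inventor calculated the ratio P(Aj | complement of X) / P(Aj | X) for all X, and FIG. 17 shows this ratio plotted against P(X).
As a result, P(Aj | X) was found to be considerably higher than P(Aj | complement of X) in many cases, especially when P(X) > 0.02. In fact, the mean ± SD of the ratio was 0.200 ± 0.230.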
Therefore, Aj can be used as a marker for searching for X without using X.
FIG. 18 shows the probability that a test comparing the frequencies of haplotypes constructed from htSNPs is significant when the frequency of the low-frequency allele of a phenotype-related Rare SNP differs between the affected and control populations.
The following ratio:

was set to 8, the significance level was set to 0.01, and the numbers of individuals in the affected and control populations, M₁ and M₂, were set to 50 and 500, respectively.
The results shown in FIG. 18 indicate that X associated with the phenotype can be detected by means of Aj when P(X), r, M₁ and M₂ are sufficiently large. For example, when P(X) < 0.03, the probability was 0.239 ± 0.313 (mean ± SD, n = 233), whereas when P(X) ≧ 0.03, the probability was 0.885 ± 0.251 (mean ± SD, n = 314) (FIG. 18). The probability depends on P(X) and the parameters (r, M₁ and M₂), and it increased as any of these parameters or P(X) (the frequency of the low-frequency allele of the Rare SNP) increased. Under the conditions used in FIG. 18, most intra-block Rare SNPs were found to be identified by the htSNP haplotypes when the frequency of the Rare SNP exceeded 0.03.
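The dependence of the detection probability on P(X) can be explored with a small simulation. This is only a sketch under stated assumptions, not the calculation used for FIG. 18: it assumes that r is the ratio of the frequency of X in the affected population to that in the control population, that Aj frequencies follow from P(Aj | X) and P(Aj | complement of X), and that significance is judged with a 2x2 chi-square test on chromosome counts. The defaults 0.943 and 0.2 are the mean values reported above, used purely as illustrative inputs.

```python
import random

def aj_frequency(p_x, p_aj_x, p_aj_not_x):
    """Frequency of the incomplete haplotype Aj in a population."""
    return p_aj_x * p_x + p_aj_not_x * (1 - p_x)

def chi_square(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def detection_probability(p_x, r=8, m1=50, m2=500, p_aj_x=0.943,
                          ratio=0.2, crit=6.635, n_sim=300, seed=0):
    """Estimated probability that the Aj frequency difference between
    m1 affected and m2 control individuals is significant at the 0.01
    level (crit is the 1% critical value of chi-square with 1 df)."""
    rng = random.Random(seed)
    p_aj_not_x = ratio * p_aj_x
    q_case = aj_frequency(min(1.0, r * p_x), p_aj_x, p_aj_not_x)  # assumed meaning of r
    q_ctrl = aj_frequency(p_x, p_aj_x, p_aj_not_x)
    n1, n2 = 2 * m1, 2 * m2  # two chromosomes per individual
    hits = 0
    for _ in range(n_sim):
        a = sum(rng.random() < q_case for _ in range(n1))
        c = sum(rng.random() < q_ctrl for _ in range(n2))
        if chi_square(a, n1 - a, c, n2 - c) > crit:
            hits += 1
    return hits / n_sim

low = detection_probability(0.005)
high = detection_probability(0.05)
print(low, high)
```

Under these assumptions the estimated probability is much higher for P(X) = 0.05 than for P(X) = 0.005, in line with the qualitative trend described for FIG. 18.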

スルファサラジンによる副作用とＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ２遺伝子のハプロタイプの関係
本実施例は、ブロックを構築し、それに基づいてｈｔＳＮＰを抽出し、それを用いて表現型に関係する遺伝子と多型を探索することが有用であることを示す例である。
Ｎ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ２（以下ＮＡＴ２と略）は抗結核薬イソニアジッドや関節リウマチ治療薬剤、スルファサラジンの代謝において重要な役割を果たす。ＮＡＴ２活性が遺伝的に低下した個体が知られており、そのような個体ではイソニアジッドの不活性化が遅延することが知られている（ＤａｓＫＭ，ＥａｓｔｗｏｏｄＭＡ，ＭｃＭａｎｕｓＪＰ，ＳｉｒｃｕｓＷ．Ａｄｖｅｒｓｅｒｅａｃｔｉｏｎｓｄｕｒｉｎｇｓａｌｉｃｙｌａｚｏｓｕｌｆａｐｙｒｉｄｉｎｅｔｈｅｒａｐｙａｎｄｔｈｅｒｅｌａｔｉｏｎｗｉｔｈｄｒｕｇｍｅｔａｂｏｌｉｓｍａｎｄａｃｅｔｙｌａｔｏｒｐｈｅｎｏｔｙｐｅ．ＮＥｎｇｌＪＭｅｄ１９７３；２８９：４９１−５．）。
ＮＡＴ２遺伝子は８番染色体短腕（８ｐ２２）に存在し、２つのエキソンを持つ遺伝子である。ＮＡＴ２遺伝子のコード領域には多型があることが知られており、その一部は活性を低下させることがわかっている（ＧｒａｎｔＤＭ，ＧｏｏｄｆｅｌｌｏｗＧＨ，ＳｕｇａｍｏｒｉＫ，ＤｕｒｅｔｔｅＫ．ＰｈａｒｍａｃｏｇｅｎｅｔｉｃｓｏｆｔｈｅｈｕｍａｎａｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｓ．Ｐｈａｒｍａｃｏｌｏｇｙ２０００；６１：２０４−１１、ＣａｓｃｏｒｂｉＩ，ＤｒａｋｏｕｌｉｓＮ，ＢｒｏｃｋｍｏｌｌｅｒＪ，Ｍａｕｒｅｒ，ＳｐｅｒｌｉｎｇＫ，ＲｏｏｔｓＩ．ＡｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ（ＮＡＴ２）ｍｕｔａｔｉｏｎｓａｎｄｔｈｅｉｒａｌｌｅｌｉｃｌｉｎｋａｇｅｉｎｕｎｒｅｌａｔｅｄＣａｕｃａｓｉａｎｉｎｄｉｖｉｄｕａｌｓ：ｃｏｒｒｅｌａｔｉｏｎｗｉｔｈｐｈｅｎｏｔｙｐｉｃａｃｔｉｖｉｔｙ．ＡｍＪＨｕｍＧｅｎｅｔ１９９５；５７：５８１−９２．、ＤｅｇｕｃｈｉＴ，ＭａｓｈｉｍｏＭ，ＳｕｚｕｋｉＴ．ＣｏｒｒｅｌａｔｉｏｎｂｅｔｗｅｅｎａｃｅｔｙｌａｔｏｒｐｈｅｎｏｔｙｐｅｓａｎｄｇｅｎｏｔｙｐｅｓｏｆｐｏｌｙｍｏｒｐｈｉｃａｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｉｎｈｕｍａｎｌｉｖｅｒ．ＪＢｉｏｌＣｈｅｍ１９９０；２６５：１２７５７−６０．、ＶａｔｓｉｓＫＰ，ｍａｒｔｅｌｌＫＪ，ＷｅｂｅｒＷＷ．ＤｉｖｅｒｓｅｐｏｉｎｔｍｕｔａｔｉｏｎｓｉｎｔｈｅｈｕｍａｎｇｅｎｅｆｏｒｐｏｌｙｍｏｒｐｈｉｃＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ．ＰｒｏｃＮａｔｌＡｃａｄＳｃｉＵＳＡ１９９１；８８：６３３３−７．、ＨｉｃｋｍａｎＤ，ＳｉｍＥ．Ｎ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｐｏｌｙｍｏｒｐｈｉｓｍ．Ｃｏｍｐａｒｉｓｏｎｏｆｐｈｅｎｏｔｙｐｅｓｉｎｈｕｍａｎｓ．ＢｉｏｃｈｅｍＰｈａｒｍａｃｏｌ１９９１；４２：１００７−１４．、ＤｅｇｕｃｈｉＴ．ＳｅｑｕｅｎｃｅｓａｎｄｅｘｐｒｅｓｓｉｏｎｏｆａｌｌｅｌｅｓｏｆｐｏｌｙｍｏｒｐｈｉｃａｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｏｆｈｕｍａｎｌｉｖｅｒ．ＪＢｉｏｌＣｈｅｍ１９９２；２６７：１８１４０−７．、ＨｉｃｍａｎＤ，ＲｉｓｃｈＡ，ＣａｍｉｌｌｅｒｉＪＰ，ＳｉｍＥ．ＧｅｎｏｔｙｐｉｎｇｈｕｍａｎｐｏｌｙｍｏｒｐｈｉｃａｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ：ｉｄｅｎｔｉｆｉｃａｔｉｏｎｏｆｎｅｗｓｌｏｗａｌｌｏｔｙｐｉｃｖａｒｉａｎｔｓ．Ｐｈａｒｍａｃｏｇｅｎｅｔｉｃｓ１９９２；２：２１７−２６．、ＡｂｅＭ，ＤｅｇｕｃｈｉＴ，ＳｕｚｕｋｉＴ．ＴｈｅｓｔｒｕｃｔｕｒｅａｎｄｃｈａｒａｃｔｅｒｉｓｔｉｃｓｏｆａｆｏｕｒｔｈａｌｌｅｌｅｏｆｐｏｌｙｍｏｒｐｈｉｃＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｇｅｎｅｆｏｕｎｄｉｎｔｈｅＪａｐａｎｅｓｅｐｏｐｕｌａｔｉｏｎ．ＢｉｏｃｈｅｍＢｉｏｐｈｙｓＲｅｓＣｏｍｍｕｎ１９９３；１９１：２６４−９．、ＬｉｎＨＪ，ＨａｎＣＹ，ＬｉｎＢＫ，ＨａｒｄｙＳ．ＳｌｏｗａｃｅｔｙｌａｔｏｒｍｕｔａｔｉｏｎｓｉｎｔｈｅｈｕｍａｎｐｏｌｙｍｏｒｐｈｉｃＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｇｅｎｅｉｎ７８６Ａｓｉａｎｓ，ｂｌａｃｋｓ，Ｈｉｓｐａｎｉｃｓ，ａｎｄｗｈｉｔｅｓ：ａｐｐｌｉｃａｔｉｏｎｔｏｍｅｔａｂｏｌｉｃｅｐｉｄｅｍｉｏｌｏｇｙ．ＡｍＪＨｕｍＧｅｎｅｔ１９９３；５２：８２７−３４．、ＬｉｎＨＪ，ＨａｎＣＹ，ＬｉｎＢＫ，ＨａｒｄｙＳ．ＥｔｈｎｉｃｄｉｓｔｒｉｂｕｔｉｏｎｏｆｓｌｏｗａｃｅｔｙｌａｔｏｒｍｕｔａｔｉｏｎｓｉｎｔｈｅｐｏｌｙｍｏｒｐｈｉｃＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ（ＮＡＴ２）ｇｅｎｅ．Ｐｈａｒｍａｃｏｇｅｎｅｔｉｃｓ１９９４；４：１２５−３４．）。ＮＡＴ２のコード領域の少なくとも４箇所にアミノ酸置換を伴うＳＮＰ（Ｓｉｎｇｌｅｎｕｃｌｅｏｔｉｄｅｐｏｌｙｍｏｒｐｈｉｓｍ）が知られている（図２０）。これらのＳＮＰでは頻度の低いアレル（ｍｉｎｏｒａｌｌｅｌｅ）が酵素活性を低下させることが知られている。これらの活性を低下させるアレル（ｍｉｎｏｒａｌｌｅｌｅ）を一つも持たない染色体（のＮＡＴ２の存在する領域）を野生型（ｗｉｌｄｔｙｐｅ）ということにすると、野生型染色体（のＮＡＴ２の存在する領域）を一つでも持っている個体はＮＡＴ２の活性が十分存在するためｒａｐｉｄａｃｅｔｙｌａｔｏｒ表現型となる。個体の持っている二つの相同染色体のいずれもが野生型で無い場合はどちらの染色体上にも活性を低下させる塩基が少なくとも一つ存在するためｓｌｏｗａｃｅｔｙｌａｔｏｒ表現型となる（ＧｒａｎｔＤＭ，Ｇｏｏｄｆｅｌｌｏ
ｗＧＨ，ＳｕｇａｍｏｒｉＫ，ＤｕｒｅｔｔｅＫ．ＰｈａｒｍａｃｏｇｅｎｅｔｉｃｓｏｆｔｈｅｈｕｍａｎａｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｓ．Ｐｈａｒｍａｃｏｌｏｇｙ２０００；６１：２０４−１１、ＣａｓｃｏｒｂｉＩ，ＤｒａｋｏｕｌｉｓＮ，ＢｒｏｃｋｍｏｌｌｅｒＪ，Ｍａｕｒｅｒ，ＳｐｅｒｌｉｎｇＫ，ＲｏｏｔｓＩ．ＡｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ（ＮＡＴ２）ｍｕｔａｔｉｏｎｓａｎｄｔｈｅｉｒａｌｌｅｌｉｃｌｉｎｋａｇｅｉｎｕｎｒｅｌａｔｅｄＣａｕｃａｓｉａｎｉｎｄｉｖｉｄｕａｌｓ：ｃｏｒｒｅｌａｔｉｏｎｗｉｔｈｐｈｅｎｏｔｙｐｉｃａｃｔｉｖｉｔｙ．ＡｍＪＨｕｍＧｅｎｅｔ１９９５；５７：５８１−９２．、ＤｅｇｕｃｈｉＴ，ＭａｓｈｉｍｏＭ，ＳｕｚｕｋｉＴ．ＣｏｒｒｅｌａｔｉｏｎｂｅｔｗｅｅｎａｃｅｔｙｌａｔｏｒｐｈｅｎｏｔｙｐｅｓａｎｄｇｅｎｏｔｙｐｅｓｏｆｐｏｌｙｍｏｒｐｈｉｃａｒｙｌａｍｉｎｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｉｎｈｕｍａｎｌｉｖｅｒ．ＪＢｉｏｌＣｈｅｍ１９９０；２６５：１２７５７−６０．、ＨｉｃｋｍａｎＤ，ＳｉｍＥ．Ｎ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｐｏｌｙｍｏｒｐｈｉｓｍ．Ｃｏｍｐａｒｉｓｏｎｏｆｐｈｅｎｏｔｙｐｅｓｉｎｈｕｍａｎｓ．ＢｉｏｃｈｅｍＰｈａｒｍａｃｏｌ１９９１；４２：１００７−１４．、ＬｉｎＨＪ，ＨａｎＣＹ，ＬｉｎＢＫ，ＨａｒｄｙＳ．ＳｌｏｗａｃｅｔｙｌａｔｏｒｍｕｔａｔｉｏｎｓｉｎｔｈｅｈｕｍａｎｐｏｌｙｍｏｒｐｈｉｃＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅｇｅｎｅｉｎ７８６Ａｓｉａｎｓ，ｂｌａｃｋｓ，Ｈｉｓｐａｎｉｃｓ，ａｎｄｗｈｉｔｅｓ：ａｐｐｌｉｃａｔｉｏｎｔｏｍｅｔａｂｏｌｉｃｅｐｉｄｅｍｉｏｌｏｇｙ．ＡｍＪＨｕｍＧｅｎｅｔ１９９３；５２：８２７−３４．、ＬｉｎＨＪ，ＨａｎＣＹ，ＬｉｎＢＫ，ＨａｒｄｙＳ．ＥｔｈｎｉｃｄｉｓｔｒｉｂｕｔｉｏｎｏｆｓｌｏｗａｃｅｔｙｌａｔｏｒｍｕｔａｔｉｏｎｓｉｎｔｈｅｐｏｌｙｍｏｒｐｈｉｃＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ（ＮＡＴ２）ｇｅｎｅ．Ｐｈａｒｍａｃｏｇｅｎｅｔｉｃｓ１９９４；４：１２５−３４．）。
一つの染色体（のＮＡＴ２の存在する領域）が野生型であるかどうかはハプロタイプ情報によりわかる。例えば、上記の４つのＳＮＰについてすべてｍａｊｏｒａｌｌｅｌｅを持つハプロタイプは野生型染色体に対応する。これを野生型ハプロタイプと言うことにする。個体がｒａｐｉｄａｃｅｔｙｌａｔｏｒ（薬物代謝が速く、副作用がない、又は少ない個体）かｓｌｏｗａｃｅｔｙｌａｔｏｒ（薬物代謝が遅く、副作用が出る個体。）かは野生型ハプロタイプを少なくとも一つ持っているかにより決まる。一つの個体は両親由来の二つのハプロタイプを持ち、この二つのハプロタイプの組み合わせをディプロタイプ形（ｄｉｐｌｏｔｙｐｅｃｏｎｆｉｇｕｒａｔｉｏｎ）という。従って、個体のディプロタイプ形を構成するハプロタイプの少なくとも一つが野生型ハプロタイプであればｒａｐｉｄａｃｅｔｙｌａｔｏｒになることになる。しかし、問題はディプロタイプ形は容易に観察できないことである。観察できるのは４つのＳＮＰ座位におけるそれぞれの遺伝子型である。遺伝子型は二つのアレル（この場合はＳＮＰ座位における塩基）の組み合わせである。ディプロタイプ形の情報と遺伝子型の情報を比較すると、ディプロタイプ形が完全情報であるのに比較し、遺伝子型（４つの座位の）は不完全情報である。前者から容易に後者が復元できるが、後者から前者を復元することは通常困難である。
本発明者は、不完全情報である遺伝子型情報を用い、完全情報であるディプロタイプ形を推定する手法を発表し、この手法を用いると遺伝子型を用いてディプロタイプ形を推定できることを示した（ＴａｎａｋａＥ，ＴａｎｉｇｕｃｈｉＡ，ＵｒａｎｏＷ，ＮａｋａｊｉｍａＨ，ＭａｔｓｕｄａＹ，ＫｉｔａｍｕｒａＹ，ＳａｉｔｏＭ，ＹａｍａｎａｋａＨ，ＳａｉｔｏＴ，ＫａｍａｔａｎｉＮ．ＡｄｖｅｒｓｅｅｆｆｅｃｔｓｏｆｓｕｌｆａｓａｌａｚｉｎｅｉｎｐａｔｉｅｎｔｓｗｉｔｈｒｈｅｕｍａｔｏｉｄａｒｔｈｒｉｔｉｓａｒｅａｓｓｏｃｉａｔｅｄｗｉｔｈｄｉｐｌｏｔｙｐｅｃｏｎｆｉｇｕｒａｔｉｏｎａｔｔｈｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ２ｇｅｎｅ．ＪＲｈｅｕｍａｔｏｌ２００２；２９：２４９２−９．）。ディプロタイプ形が推定できれば、それに基づいて、例えばスルファサラジンの副作用の発現を予測できると考えられる。本発明では、前述の４つの座位に加え７つのＳＮＰ座位（実際に多型が存在したのはこの内、６座位）の遺伝子型を用いて個人のディプロタイプ形の最尤推定を行うことにより、ディプロタイプ形とスルファサラジンの副作用の発現の関係を解析した。その結果、個人のディプロタイプ形が野生型ハプロタイプを含む場合は含まない場合に比べ、約７．７３倍（９５％信頼区間３．５４−１６．８６）スルファサラジンの副作用を来たしやすいことを示した（ＴａｎａｋａＥ，ＴａｎｉｇｕｃｈｉＡ，ＵｒａｎｏＷ，ＮａｋａｊｉｍａＨ，ＭａｔｓｕｄａＹ，ＫｉｔａｍｕｒａＹ，ＳａｉｔｏＭ，ＹａｍａｎａｋａＨ，ＳａｉｔｏＴ，ＫａｍａｔａｎｉＮ．ＡｄｖｅｒｓｅｅｆｆｅｃｔｓｏｆｓｕｌｆａｓａｌａｚｉｎｅｉｎｐａｔｉｅｎｔｓｗｉｔｈｒｈｅｕｍａｔｏｉｄａｒｔｈｒｉｔｉｓａｒｅａｓｓｏｃｉａｔｅｄｗｉｔｈｄｉｐｌｏｔｙｐｅｃｏｎｆｉｇｕｒａｔｉｏｎａｔｔｈｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ２ｇｅｎｅ．ＪＲｈｅｕｍａｔｏｌ２００２；２９：２４９２−９．）。
本実施例では、このＮＡＴ２遺伝子のディプロタイプ形と副作用の出現との関係が本特許出願明細書に記載された方法により検出できるかどうかを検討した。上記のように、表現型（副作用の出現）に関係した４つのＳＮＰ座位を含んだハプロタイプ推定によりＮＡＴ２遺伝子と表現型の関係を証明できることは確実である。しかし、本発明による方法では、表現型に関係しているＳＮＰではなく、必ずしも表現型に関係していないＳＮＰを選択し、それを用いて表現型と遺伝子の関係を検討する。その手法は、まず多数の個体の遺伝子型情報からハプロタイプブロック構造を決定し、それぞれのハプロタイプブロック内から頻度の高いＳＮＰ（例えば０．１以上のＳＮＰ）を用い、ｈｔＳＮＰ（ｈａｐｌｏｔｙｐｅ−ｔａｇｇｉｎｇＳＮＰ）を選択するというものである。そして、そのようなｈｔＳＮＰのみを用い、ハプロタイプに基づいた表現型との関係の解析を行う。前述のように、ｈｔＳＮＰは、それのみを用いたハプロタイプが大多数のハプロタイプを代表するように選ばれたＳＮＰである。
本実施例では、ＮＡＴ２遺伝子とスルファサラジンによる副作用の関係をｈｔＳＮＰのみにより検出できるかどうかを検討した。問題は、表現型に関係する上記の４つのＳＮＰのほとんどが低頻度のＳＮＰであることである。即ちこれらのＳＮＰのｍｉｎｏｒａｌｌｅｌｅの頻度は０．０１５−０．１９４であり、一つのＳＮＰのみが頻度０．１を越える。従って、主としてｃｏｍｍｏｎＳＮＰを用いる今回の解析方法では、それらの低頻度アレルは入っていない可能性が高い。実際に、今回のＳＮＰｆｉｎｄｉｎｇにより発見されたＳＮＰのデータベースは、上記の４箇所のアミノ酸置換を伴うＳＮＰの一つも含んでいない。従って、このような表現型に関係しているＳＮＰを用いずに、マーカーであるｈｔＳＮＰのみを用いて表現型（この場合はスルファサラジンの副作用）に関係する遺伝子を検出できるかどうかを検討した。
本実施例では、スルファサラジンの副作用を表現型としている。このスルファサラジンの副作用は、具体的にはスルファサラジンの投与患者において観察される、嘔吐、吐き気、腹痛、下痢などの消化器症状、皮疹、発熱、眩暈、頭痛、肝機能障害、白血球減少等の症状の有無を指標に用いて観察した。
今回のデータベースに含まれるＳＮＰと、前述のＮＡＴ２遺伝子のｅｘｏｎ２上に存在する６つのＳＮＰを含むすべてのＳＮＰを表１２に示す。このうち、アミノ酸置換を伴うＳＮＰは表１２のＳＮＰｎｏ７，９，１０，１１であり（表１２「ｎｏｎ」のカラム中「＋」印）、その他はアミノ酸置換を伴わないＳＮＰである。合計２４個のＳＮＰを含む（表１２）。

今回のデータベースに含まれるＮＡＴ２遺伝子とその周辺のＳＮＰ２４個のうち、遺伝子頻度が０．１以上のものを用いてハプロタイプブロックの決定を行った。その結果、ＳＮＰ１−２４の全てが一つのブロックに入ることがわかった。また、ハプロタイプ推定を行った結果をもとにｈｔＳＮＰを選択した。その結果、ＳＮＰ１，６，１９，２３の４つがｈｔＳＮＰとして抽出された（図２０、表１２「ｈｔＳＮＰ」のカラム）。これらのｈｔＳＮＰはいずれもアミノ酸置換を伴わないＳＮＰである。これらのｈｔＳＮＰのみを用いて、副作用群と非副作用群を合わせてハプロタイプ推定を行った結果が表１３である。

表１３中、４つのｈｔＳＮＰのみを用いたハプロタイプを、当該ｈｔＳＮＰをＮＡＴ２遺伝子の５’から３’方向に順に並べ、主要アレルを（０）、マイナーアレルを（１）とする数字の並びで表した（図１３「ｈｔＳＮＰｈａｐｌｏｔｙｐｅ」のカラム）。また、表１３中、「Ａｓｓｉｇｎｅｄｐｈｅｎｏｔｙｐｅ−ａｓｓｏｃｉａｔｅｄＳＮＰ」のカラムは、アミノ酸置換を伴って、表現型に関係するＳＮＰ（表１２のＳＮＰｎｏ７，９，１０または１１）がｈｔＳＮＰのみを用いた当該ハプロタイプにアサインするハプロタイプを示した。
次に、副作用群と非副作用群を別々にハプロタイプ頻度推定を行った結果を表１４に示す。

この表によると、副作用群（図１４中「Ｆｒｅｑｕｅｎｃｙｉｎｇｒｏｕｐｗｉｔｈａｄｖｅｒｓｅｒｅａｃｔｉｏｎｓ」のカラム）で「０−１−０−１」のハプロタイプの頻度が上昇している（０．１３１１から０．３１２５）。
次に、単一のＳＮＰを含め、全てのハプロタイプについて、その頻度が副作用群と非副作用群で差があるかを検討した。対象のハプロタイプはｈｔＳＮＰである４つの座位全ての情報を含んだハプロタイプに加え、４つのうち、一部の座位の情報に関してのハプロタイプについても検討した（表１５）。

表１５中、例えば、「^＊−１−^＊−１」は左から（５’から３’方向へ）第２，４座位がｍｉｎｏｒａｌｌｅｌｅ（ｍａｊｏｒａｌｌｅｌｅは０で、ｍｉｎｏｒａｌｌｅｌｅは１で示す）であり、その他の座位が任意の（０、または１）アレルであるハプロタイプすべてを含む集合を考える。即ち不完全ハプロタイプということになる（ＫａｍａｔａｎｉＮ，ＳｅｋｉｎｅＡ，ＫｉｔａｍｏｔｏＴ，ＩｉｄａＡ，ＳａｉｔｏＳ，ＫｏｇａｍｅＡ，ＩｎｏｕｅＥ，ＫａｗａｍｏｔｏＭ，ＨａｒｉｇａｉＭ，ＮａｋａｍｕｒａＹ．Ｌａｒｇｅ−ＳｃａｌｅＳｉｎｇｌｅ−ＮｕｃｌｅｏｔｉｄｅＰｏｌｙｍｏｒｐｈｉｓｍ（ＳＮＰ）ａｎｄＨａｐｌｏｔｙｐｅＡｎａｌｙｓｅｓ，ＵｓｉｎｇＤｅｎｓｅＳＮＰＭａｐｓ，ｏｆ１９９Ｄｒｕｇ−ＲｅｌａｔｅｄＧｅｎｅｓｉｎ７５２Ｓｕｂｊｅｃｔｓ：ｔｈｅＡｎａｌｙｓｉｓｏｆｔｈｅＡｓｓｏｃｉａｔｉｏｎｂｅｔｗｅｅｎＵｎｃｏｍｍｏｎＳＮＰｓｗｉｔｈｉｎＨａｐｌｏｔｙｐｅＢｌｏｃｋｓａｎｄｔｈｅＨａｐｌｏｔｙｐｅｓＣｏｎｓｔｒｕｃｔｅｄｗｉｔｈＨａｐｌｏｔｙｐｅ−ＴａｇｇｉｎｇＳＮＰｓ．（２００４）ＡｍＪＨｕｍＧｅｎｅｔ７５：１９０−２０３）。この方法では単一座位のアレルのみが確定しているハプロタイプの集合も含まれる。即ち、「^＊−１−^＊−^＊」は第二座位がｍｉｎｏｒａｌｌｅｌｅであるハプロタイプの集合である。この集合は即ち、第二座位のＳＮＰがｍｉｎｏｒａｌｌｅｌｅであるということと同じことである。
このような手法で４つのすべてのＳＮＰの情報を持ったハプロタイプ、一部のみの情報を持った不完全ハプロタイプ、単一ＳＮＰのみの情報を含む集合すべての場合について副作用群と非副作用群で比較することが可能である。この場合、副作用群の人数は１６、非副作用群は１２８なので、それぞれの群の全ハプロタイプ数は３２、２５６である。しかし、上記のさまざまな不完全ハプロタイプを二つの群を合わせた群で数えると、ハプロタイプ（ＳＮＰのｍｉｎｏｒａｌｌｅｌｅのこともある）の数は変化する。この値が小さい場合（例えば５未満）は検定で有意と出る可能性はほとんどない。従って、この値（周辺度数）が５以上の場合のみを検定の対象とした。さらに、例えば「０−０−０−０」ハプロタイプと「０−０−０−^＊」ハプロタイプはほとんど同じ集合である。なぜなら「０−０−０−１」ハプロタイプがほとんど存在しないからである。このような場合は、よりハプロタイプを特異的に示す「０−０−０−０」ハプロタイプを採用した。このような条件で同じハプロタイプを結合して行くと、検定の対象となるハプロタイプは２０個となった。そして、それぞれのハプロタイプについて、それぞれの群の個数を数え、それに含まれないハプロタイプの二群にわけた２ｘ２のｃｏｎｔｉｎｇｅｎｃｙｔａｂｌｅを作成しχ二乗検定を行った。
表１５の「Ｃｈｉｓｑｕａｒｅ」カラムのそれぞれのセルの個数は、ハプロタイプ頻度（表１５の「Ｆｒｅｑｕｅｎｃｙ」カラム、副作用群及び非副作用群の二つの群を合わせた不完全ハプロタイプ頻度）の対応する値に個体数ｘ２を乗じ、四捨五入して求めた。その結果、表１５に示す結果となった。
最も高い有意水準で有意を示したのは、「０−１−０−１」のハプロタイプである。引き続き「０−１−０−^＊」、「^＊−１−^＊−１」のハプロタイプなどが有意を示した。これらはＰ＜０．０１のレベルで有意であった。ここで、Ｐ値は、不完全ハプロタイプの頻度と、副作用の発生との関係から求めた。また、この値はＢｏｎｆｅｒｒｏｎｉの多重比較の修正を行えば有意では無くなる。このことは、後述の通り、「０−１−０−１」には表１２のＳＮＰｎｏ９の表現型に関係するＳＮＰがａｓｓｉｇｎされるため（表１３）、副作用群と非副作用群で頻度に有意の差があると判定されたと考えられる。ＳＮＰの中では「^＊−^＊−^＊−０」が有意を示したが、有意水準はＰ＜０．０５であった。以上の方法では不完全ハプロタイプの頻度を副作用群と非副作用群の両群で比較した。
引き続き、不完全ハプロタイプを有する個体の頻度の差を検定した。即ち、これらの４つのｈｔＳＮＰを用いてＮＡＴ２遺伝子のハプロタイプとスルファサラジンの副作用に関係があるかどうかを検定する一般化尤度比を用いた最尤法による検定を行った。本検定法（プログラムＰｅｎｈａｐｌｏに搭載）は、特定のハプロタイプの保有と質的表現型との間で関係があるかどうかの検定を最尤法を用いて行う方法である（後述）。最尤法にはＥＭ（Ｅｘｐｅｃｔａｔｉｏｎ−ｍａｘｉｍｉｚａｔｉｏｎ）アルゴリズムを用いる。この検定で用いるハプロタイプとしては、すべてのＳＮＰ座位についてアレルの情報があるハプロタイプだけではなく、限られた座位のみにアレルの情報がある不完全ハプロタイプも検定の対象とできる。不完全ハプロタイプは単なる一つのＳＮＰのアレルも含む。従って、本手法ではハプロタイプとＳＮＰの解析の両方が統合的にできる。
このようなプログラムＰｅｎｈａｐｌｏによる検定の結果、「^＊−０−^＊−０」の不完全ハプロタイプを対象とした検定でＰ＝０．００００８４の有意水準で有意の関係が見られた（表１６）。

不完全ハプロタイプや単一ＳＮＰも含めたすべての検定可能な対象の数は２６なので（今回も両群で特定の不完全ハプロタイプの保有者、あるいは非保有者の総数が５未満の場合は除外した）、Ｂｏｎｆｅｒｒｏｎｉの多重検定に対する修正を施した結果でもＰ＝０．００００８４ｘ２６＝０．００２１８となりＰ＝０．０１の水準でも有意である。単一ＳＮＰについては「^＊−^＊−^＊−０」の不完全ハプロタイプ（即ち、ＳＮＰ）でＰ＝０．００２１５の有意水準で有意の関係が見られた（表１６、Ｎｏ２）。ここで選択されたｈｔＳＮＰ４つはＮＡＴ２遺伝子の発現などに関与している可能性は極めて低い。むしろＮＡＴ２の活性に関与し、スルファサラジンの副作用に関係することが既にわかっているＳＮＰ（即ち、ＳＮＰｎｏ７，９，１０，１１）がｈｔＳＮＰと強い連鎖不平衡の関係にあり、そのためｈｔＳＮＰを用いたハプロタイプ解析により有意の結果が得られたと考えられる。
このように「^＊−０−^＊−０」の不完全ハプロタイプを用いた検定で強い有意の結果が得られたことより、ハプロタイプブロックを構築し、その中からｈｔＳＮＰを抽出して、そのｈｔＳＮＰにより不完全ハプロタイプを含めた表現型との関係の解析を行うことにより、表現型と関係した遺伝子を検出できるという実例を示した。このようにｈｔＳＮＰのみを用い、ハプロタイプと表現型との関係の検定を行うことにより、このハプロタイプブロック内に原因のＳＮＰが存在することが示唆される。実際に、ＮＡＴ２とスルファサラジンによる副作用の関係については原因である４つのＳＮＰ（即ちＳＮＰｎｏ７，９，１０，１１）は同じハプロタイプブロック内に存在する。
ここで、何故、不完全ハプロタイプである「^＊−０−^＊−０」が、不完全ハプロタイプを保有する個体の頻度の差の検定で最小のＰを示したかを詳細に検討する。４個のｈｔＳＮＰのみを用い、本実施例では１４４名の個体を用いて集団のハプロタイプ頻度の推定を行った結果が表１３である。表１３に示すように、３個のハプロタイプによりすべてのハプロタイプのうち＞８０％を説明し、５個のハプロタイプにより全体の＞９４％を説明する。
前述のように、表現型に関係するＳＮＰ、ＳＮＰｎｏ７，９，１０，１１はｈｔＳＮＰに入っていない。しかし、本発明の方法によりこれらのＳＮＰのｈｔＳＮＰで構成されるハプロタイプへのａｓｓｉｇｎｍｅｎｔを決めることができた。解析の結果を表１７に示す。

表１７に示すようにＳＮＰｎｏ７，９，１０，１１の表現型に関係するＳＮＰは、ｈｔＳＮＰで構成されたハプロタイプ、「１０００」、「０１０１」、「１０００」、「１１００」にそれぞれａｓｓｉｇｎされる。表現型に関係するＳＮＰ全体のうち、それらのハプロタイプにａｓｓｉｇｎされる割合［Ｐ（Ａｊ｜Ｘ）］は０．６８５〜１．０００の値を示す。表１３に示したｈａｐｌｏｔｙｐｅｎｏによると、表現型に関係するＳＮＰはそれぞれ５，２，５，３のｈａｐｌｏｔｙｐｅｎｏのハプロタイプにａｓｓｉｇｎされる。このように表現型に関係する４つのＳＮＰのそれぞれのｍｉｎｏｒａｌｌｅｌｅの大半がｈｔＳＮＰで構成されるハプロタイプのうち、ｈａｐｌｏｔｙｐｅｎｏ５，２，５，３のハプロタイプにａｓｓｉｇｎされるため、表現型に関係するＳＮＰを用いず、ｈｔＳＮＰのみを用いた解析で、表現型との関係を検定する手法で有意な結果が得られると考えられる。
表１８はｈｔＳＮＰ４個（ＳＮＰｎｏ１，６，１９，２３）、表現型に関係するＳＮＰ４個（ＳＮＰｎｏ７，９，１０，１１）のすべてを用いてハプロタイプの頻度を推定した結果である。

ここでは、表現型に関係するＳＮＰ（ここではＰＳＮＰと表示）４個の両端に２個ずつのｈｔＳＮＰが存在する。表７で示す１２のハプロタイプは、表現型に関係するＳＮＰ４個のうち一つでもｍｉｎｏｒａｌｌｅｌｅ（表７「Ｈａｐｌｐｔｙｐｅ（８ＳＮＰｓ）ｈｔＳＮＰ｜ＰＳＮＰ｜ｈｔＳＮＰ」のカラムのＰＳＮＰにおいて１で示されたもの）を含む場合（表７の「ＰＳＮＰ」のカラムでｙｅｓで示す）と、含まない場合（表７のＰＳＮＰのカラムでｎｏで示す）とにわけられる（表７の「Ｈａｐｌｏｔｙｐｅ（ｈｔＳＮＰ）ｙｅｓ」カラムおよび「Ｈａｐｌｏｔｙｐｅ（ｈｔＳＮＰ）ｎｏ」カラム）。これらのハプロタイプのうち、表現型に関係するＳＮＰのｍｉｎｏｒａｌｌｅｌｅを含まないものが一つでも個体に存在すればｒａｐｉｄａｃｅｔｙｌａｔｏｒ表現型を示すと考えられる。このようなｍｉｎｏｒａｌｌｅｌｅを含まないハプロタイプのｈｔＳＮＰのみにより構成されるハプロタイプを調べると、「００００」、「１０１０」、「００１０」、「１０００」、「０１００」の５つである。この５つのハプロタイプはいずれも４つのｈｔＳＮＰによって構成されており、５’末端から０（ｍａｊｏｒａｌｌｅｌｅ）、または１（ｍｉｎｏｒａｌｌｅｌｅ）の文字４つで記載されている。この４つのｈｔＳＮＰを示す文字のうち、第４番目のｈｔＳＮＰはすべて０となっていることがわかる（表７「Ｈａｐｌｏｔｙｐｅ（ｈｔＳＮＰ）ｎｏ」カラム）。また第２番目のｈｔＳＮＰは極めて頻度の低い（０．００３６）ハプロタイプ「０１００」を除いてはすべて０である。第１，３番目のｈｔＳＮＰについては０と１が混在している。これが表１６のような検定で「^＊−０−^＊−０」の不完全ハプロタイプが最小のＰ（従って、最大の有意性）を示した理由であると考えられる。即ち、表現型に関係するＳＮＰについての野生型ハプロタイプに対応するｈｔＳＮＰにより構成されるハプロタイプに共通するものが「^＊−０−^＊−０」なのである。野生型ハプロタイプを一つでも持てば、副作用を起こしにくいことは前述のとおりである。また、「^＊−^＊−^＊−０」という不完全ハプロタイプがそれに続いて高い有意性を示す理由も同様である。
以上のように、４つのｈｔＳＮＰのすべての情報を用いた場合と、その一部を用いた場合、さらには一つの座位のみを用いた場合について、同一の手法でハプロタイプ頻度の両群間の差と、特定のハプロタイプを有する個体の頻度の差を調べる検定を行った。今回の実施例では、両群間でハプロタイプの頻度の差を検定する方法より、ハプロタイプを保有する個体の頻度の差を比較する検定の方が検出力が高かった。
また、ハプロタイプの頻度を比較する方法では、表現型に関係するＳＮＰがａｓｓｉｇｎされる一つのハプロタイプが最も有意となった。即ち、表１５のように、「０−１−０−１」ハプロタイプが最も有意を示した。この理由は、このハプロタイプに表現型関連ＳＮＰのＳＮＰｎｏ９のｍｉｎｏｒａｌｌｅｌｅがａｓｓｉｇｎされており、しかもそのハプロタイプのすべてがＳＮＰｎｏ９のｍｉｎｏｒａｌｌｅｌｅを持っていることによる。これは表１７のＰ（Ｘ｜Ａｊ）＝１という値によって示される。これに対して、特定のハプロタイプを保有する個体の頻度を比較した検討では「^＊−０−^＊−０」ハプロタイプが最も有意を示した。前述のように、これは表現型に関係するＳＮＰのｍｉｎｏｒａｌｌｅｌｅがａｓｓｉｇｎされないハプロタイプの集合である。その理由は、ＮＡＴ２の副作用が表現型に関係するＳＮＰのｍｉｎｏｒａｌｌｅｌｅを持っているかいないかではなく、むしろそのｍｉｎｏｒａｌｌｅｌｅを有しない染色体を少なくとも一つ有するかどうかによるからと考えられる。それは、表現型に関係するＳＮＰのｍｉｎｏｒａｌｌｅｌｅのａｓｓｉｇｎしないハプロタイプの集合であり、それが「^＊−０−^＊−０」であったということになる。
いずれにせよ、本明細書で述べたように、薬剤に関係する遺伝子の多数のＳＮＰの中から比較的頻度の高いＳＮＰ（例えば＞０．１のＳＮＰ）を選択し、それを用いてハプロタイプブロックを構成し、各ハプロタイプブロックの中でｈｔＳＮＰを選択し、それを用いて表現型とｈｔＳＮＰで構成されるハプロタイプとの関係を検定することは特定の遺伝子と表現型の関係を検定するために有効である。その理由は、表現型に関係したＳＮＰ（多くの場合頻度の比較的低い）の大多数が一つの主要なハプロタイプにａｓｓｉｇｎされることによる。このａｓｓｉｇｎｍｅｎｔがあれば疾患とＳＮＰの関係がＮＡＴ２遺伝子のように比較的複雑でも検定によって有意となる。本発明の手法は、表現型と関係する遺伝子を見つけるために非常にｆｌｅｘｉｂｌｅな方法を提供する。このようにｈｔＳＮＰにより構成されるハプロタイプと表現型との関係が明らかになれば、改めて、そのハプロタイプブロック内、及び近傍のＳＮＰを詳細に検討し、表現型に関係する多型を求めることが可能である。
Relationship between side effects caused by sulfasalazine and haplotype of N-acetyltransferase 2 gene
This example shows that constructing a haplotype block, extracting htSNPs based on that block, and using those htSNPs is useful when searching for genes and polymorphisms related to a phenotype.
N-acetyltransferase 2 (hereinafter abbreviated as NAT2) plays an important role in the metabolism of the antituberculosis drug isoniazid and the rheumatoid arthritis drug sulfasalazine. Individuals with genetically reduced NAT2 activity are known, and inactivation of isoniazid is known to be delayed in such individuals (Das KM, Eastwood MA, McManus JP, Sircus W. Adverse reactions during salicylazosulfapyridine therapy and the relation with drug metabolism and acetylator phenotype. N Engl J Med 1973; 289: 491-5.).
The NAT2 gene is located on the short arm of chromosome 8 (8p22) and has two exons. Polymorphisms are known in the coding region of the NAT2 gene, and some of them are known to reduce the enzyme activity (Grant DM, Goodfellow GH, Sugamori K, Durette K. Pharmacogenetics of the human arylamine N-acetyltransferases. Pharmacology 2000; 61: 204-11.; Cascorbi I, Drakoulis N, Brockmoller J, Maurer, Sperling K, Roots I. Arylamine N-acetyltransferase (NAT2) mutations and their allelic linkage in unrelated Caucasian individuals: correlation with phenotypic activity. Am J Hum Genet 1995; 57: 581-92.; Deguchi T, Mashimo M, Suzuki T. Correlation between acetylator phenotypes and genotypes of polymorphic arylamine N-acetyltransferase in human liver. J Biol Chem 1990; 265: 12757-60.; Vatsis KP, Martell KJ, Weber WW. Diverse point mutations in the human gene for polymorphic N-acetyltransferase. Proc Natl Acad Sci USA 1991; 88: 6333-7.; Hickman D, Sim E. N-acetyltransferase polymorphism. Comparison of phenotypes in humans. Biochem Pharmacol 1991; 42: 1007-14.; Deguchi T. Sequences and expression of alleles of polymorphic arylamine N-acetyltransferase of human liver. J Biol Chem 1992; 267: 18140-7.; Hickman D, Risch A, Camilleri JP, Sim E. Genotyping human polymorphic arylamine N-acetyltransferase: identification of new slow allotypic variants. Pharmacogenetics 1992; 2: 217-26.; Abe M, Deguchi T, Suzuki T. The structure and characteristics of a fourth allele of polymorphic N-acetyltransferase gene found in the Japanese population. Biochem Biophys Res Commun 1993; 191: 264-9.; Lin HJ, Han CY, Lin BK, Hardy S. Slow acetylator mutations in the human polymorphic N-acetyltransferase gene in 786 Asians, blacks, Hispanics, and whites: application to metabolic epidemiology. Am J Hum Genet 1993; 52: 827-34.; Lin HJ, Han CY, Lin BK, Hardy S. Ethnic distribution of slow acetylator mutations in the polymorphic N-acetyltransferase (NAT2) gene. Pharmacogenetics 1994; 4: 125-34.). SNPs (single nucleotide polymorphisms) that cause amino acid substitutions are known at four or more positions in the coding region of NAT2 (FIG. 20).
At these SNPs, the lower-frequency allele (minor allele) is known to reduce enzyme activity. A chromosome (more precisely, the region where NAT2 lies) that carries none of these activity-reducing minor alleles is called wild type here. An individual with at least one wild-type chromosome has sufficient NAT2 activity and therefore shows the rapid acetylator phenotype. If neither of the individual's two homologous chromosomes is wild type, each chromosome carries at least one activity-reducing base, and the individual shows the slow acetylator phenotype (Grant DM, Goodfellow GH, Sugamori K, Durette K. Pharmacogenetics of the human arylamine N-acetyltransferases. Pharmacology 2000; 61: 204-11.; Cascorbi I, Drakoulis N, Brockmoller J, Maurer, Sperling K, Roots I. Arylamine N-acetyltransferase (NAT2) mutations and their allelic linkage in unrelated Caucasian individuals: correlation with phenotypic activity. Am J Hum Genet 1995; 57: 581-92.; Deguchi T, Mashimo M, Suzuki T. Correlation between acetylator phenotypes and genotypes of polymorphic arylamine N-acetyltransferase in human liver. J Biol Chem 1990; 265: 12757-60.; Hickman D, Sim E. N-acetyltransferase polymorphism. Comparison of phenotypes in humans. Biochem Pharmacol 1991; 42: 1007-14.; Lin HJ, Han CY, Lin BK, Hardy S. Slow acetylator mutations in the human polymorphic N-acetyltransferase gene in 786 Asians, blacks, Hispanics, and whites: application to metabolic epidemiology. Am J Hum Genet 1993; 52: 827-34.; Lin HJ, Han CY, Lin BK, Hardy S. Ethnic distribution of slow acetylator mutations in the polymorphic N-acetyltransferase (NAT2) gene. Pharmacogenetics 1994; 4: 125-34.).
Whether a chromosome (the region where NAT2 lies) is wild type can be determined from haplotype information. For example, the haplotype carrying the major allele at all four of the above SNPs corresponds to the wild-type chromosome; this is called the wild-type haplotype. Whether an individual is a rapid acetylator (fast drug metabolism, with no or few side effects) or a slow acetylator (slow drug metabolism, with side effects) depends on whether the individual has at least one wild-type haplotype. Each individual has two haplotypes, one from each parent, and the combination of the two is called the diplotype configuration. Accordingly, if at least one of the haplotypes in an individual's diplotype configuration is the wild-type haplotype, the individual is a rapid acetylator. The problem, however, is that the diplotype configuration cannot easily be observed. What can be observed is the genotype at each of the four SNP loci, where a genotype is the combination of two alleles (here, the bases at the SNP locus). The diplotype configuration is complete information, whereas the genotypes (at the four loci) are incomplete information: the latter can easily be recovered from the former, but recovering the former from the latter is usually difficult.
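The gap between genotype and diplotype configuration can be made concrete with a short Python sketch. It assumes that a genotype is coded as the number of minor alleles (0, 1, or 2) at each of the four SNP loci and that a haplotype is a string of 0 (major allele) and 1 (minor allele). The function names are illustrative only and are not part of the method described in this specification.

```python
from itertools import product

WILD_TYPE = "0000"  # major allele at all four SNP loci (assumed coding)


def compatible_diplotypes(genotype):
    """Return every unordered haplotype pair consistent with a genotype.

    genotype: tuple of minor-allele counts (0, 1 or 2), one per SNP locus.
    """
    per_locus = []
    for count in genotype:
        if count == 0:
            per_locus.append([("0", "0")])
        elif count == 2:
            per_locus.append([("1", "1")])
        elif count == 1:
            # Heterozygous locus: the minor allele may sit on either chromosome.
            per_locus.append([("0", "1"), ("1", "0")])
        else:
            raise ValueError("minor-allele count must be 0, 1 or 2")
    pairs = set()
    for combo in product(*per_locus):
        first = "".join(a for a, _ in combo)
        second = "".join(b for _, b in combo)
        pairs.add(tuple(sorted((first, second))))
    return pairs


def is_rapid_acetylator(diplotype):
    """Rule stated in the text: at least one wild-type haplotype."""
    return WILD_TYPE in diplotype


# Heterozygous at the first two loci, homozygous major at the other two.
for diplotype in sorted(compatible_diplotypes((1, 1, 0, 0))):
    print(diplotype, "rapid" if is_rapid_acetylator(diplotype) else "slow")
```

In this sketch the single genotype (1, 1, 0, 0) is compatible with both 0000/1100, which contains the wild-type haplotype, and 0100/1000, which does not, so the genotype alone does not settle the acetylator status.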
The present inventors published a method for estimating the diplotype configuration, which is complete information, from genotype information, which is incomplete information, and showed that the diplotype configuration can be estimated from genotypes with this method (Tanaka E, Taniguchi A, Urano W, Nakajima H, Matsuda Y, Kitamura Y, Saito M, Yamanaka H, Saito T, Kamatani N. Adverse effects of sulfasalazine in patients with rheumatoid arthritis are associated with diplotype configuration at the N-acetyltransferase 2 gene. J Rheumatol 2002; 29: 2492-9.). If the diplotype configuration can be estimated, the occurrence of side effects of, for example, sulfasalazine can presumably be predicted from it. In the present invention, the diplotype configuration of each individual was estimated by maximum likelihood from the genotypes at seven SNP loci in addition to the four loci described above (of the seven, six were actually polymorphic), and the relationship between diplotype configuration and the occurrence of sulfasalazine side effects was analyzed. The results showed that individuals whose diplotype configuration does not contain the wild-type haplotype were about 7.73 times (95% confidence interval 3.54-16.86) more likely to experience side effects of sulfasalazine than individuals whose diplotype configuration contains it (Tanaka E, Taniguchi A, Urano W, Nakajima H, Matsuda Y, Kitamura Y, Saito M, Yamanaka H, Saito T, Kamatani N. Adverse effects of sulfasalazine in patients with rheumatoid arthritis are associated with diplotype configuration at the N-acetyltransferase 2 gene. J Rheumatol 2002; 29: 2492-9.).
In this example, we examined whether the relationship between the diplotype configuration of the NAT2 gene and the occurrence of side effects can be detected by the method described in this specification. As stated above, haplotype estimation that includes the four SNP loci related to the phenotype (occurrence of side effects) can certainly demonstrate the relationship between the NAT2 gene and the phenotype. In the method of the present invention, however, the SNPs selected are not the ones related to the phenotype but SNPs that are not necessarily related to it, and the relationship between phenotype and gene is examined with those. The procedure first determines the haplotype block structure from the genotype information of many individuals, and then selects htSNPs (haplotype-tagging SNPs) from the high-frequency SNPs (for example, SNPs with a frequency of 0.1 or more) within each haplotype block. The haplotype-based analysis of the relationship with the phenotype is then carried out using only those htSNPs. As described earlier, htSNPs are SNPs chosen so that haplotypes built from them alone represent the great majority of haplotypes.
In this example, we examined whether the relationship between the NAT2 gene and the side effects of sulfasalazine can be detected with htSNPs alone. The difficulty is that most of the four phenotype-related SNPs above are low-frequency SNPs: their minor allele frequencies range from 0.015 to 0.194, and only one of them exceeds 0.1. The present analysis, which relies mainly on common SNPs, is therefore unlikely to include these low-frequency alleles. Indeed, the database of SNPs discovered in the present SNP finding contains none of the four amino-acid-substituting SNPs. We therefore examined whether a gene related to the phenotype (here, the side effects of sulfasalazine) can be detected using only htSNPs as markers, without using the phenotype-related SNPs themselves.
In this example, the phenotype is the occurrence of sulfasalazine side effects. Specifically, side effects were assessed by the presence or absence of symptoms observed in patients given sulfasalazine: gastrointestinal symptoms such as vomiting, nausea, abdominal pain, and diarrhea, as well as rash, fever, dizziness, headache, liver dysfunction, leukopenia, and the like.
Table 12 lists all SNPs, comprising the SNPs contained in the present database and the six SNPs on exon 2 of the NAT2 gene mentioned above. Of these, the SNPs with amino acid substitutions are SNP no 7, 9, 10, and 11 in Table 12 (marked "+" in the "non" column of Table 12); the others do not involve amino acid substitutions. The table contains 24 SNPs in total (Table 12).

Of the 24 SNPs in and around the NAT2 gene contained in the present database, those with an allele frequency of 0.1 or more were used to determine the haplotype blocks. As a result, all of SNPs 1-24 were found to fall within a single block. htSNPs were then selected based on the results of haplotype estimation, and four SNPs, SNP 1, 6, 19, and 23, were extracted as htSNPs (FIG. 20; Table 12, "htSNP" column). None of these htSNPs involves an amino acid substitution. Table 13 shows the result of haplotype estimation using only these htSNPs, with the side effect group and the non-side effect group combined.

In Table 13, each haplotype built from the four htSNPs alone is written as a string of digits, with the htSNPs ordered from 5' to 3' along the NAT2 gene, the major allele written as 0 and the minor allele as 1 ("htSNP haplotype" column of Table 13). The "Assigned phenotype-associated SNP" column of Table 13 shows which amino-acid-substituting, phenotype-related SNP (SNP no 7, 9, 10, or 11 in Table 12) is assigned to that htSNP-only haplotype.
Next, Table 14 shows the results of haplotype frequency estimation performed separately for the side effect group and the non-side effect group.

According to this table, the frequency of the "0-1-0-1" haplotype is elevated in the side effect group ("Frequency in group with adverse reactions" column of Table 14), rising from 0.1311 to 0.3125.
Next, we examined whether the frequency of each haplotype, single SNPs included, differs between the side effect group and the non-side effect group. The haplotypes examined comprised those carrying information on all four htSNP loci and also those carrying information on only some of the four loci (Table 15).

In Table 15, for example, "*-1-*-1" denotes the set of all haplotypes in which, counting from the left (5' to 3'), the second and fourth loci carry the minor allele (the major allele is written 0 and the minor allele 1) and the other loci carry any allele (0 or 1). This is what is called an incomplete haplotype (Kamatani N, Sekine A, Kitamoto T, Iida A, Saito S, Kogame A, Inoue E, Kawamoto M, Harigai M, Nakamura Y. Large-Scale Single-Nucleotide Polymorphism (SNP) and Haplotype Analyses, Using Dense SNP Maps, of 199 Drug-Related Genes in 752 Subjects: the Analysis of the Association between Uncommon SNPs within Haplotype Blocks and the Haplotypes Constructed with Haplotype-Tagging SNPs. (2004) Am J Hum Genet 75: 190-203). The approach also covers sets of haplotypes in which the allele is fixed at a single locus only. For instance, "*-1-*-*" is the set of haplotypes whose second locus carries the minor allele, which is the same as saying that the SNP at the second locus is the minor allele.
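The incomplete-haplotype notation can be sketched in Python as a simple wildcard match. The sketch writes the "any allele" symbol as `*` and haplotypes as hyphen-separated strings; the helper names and the frequencies in the usage lines are hypothetical and are not taken from Table 15.

```python
from itertools import product


def matches(pattern, haplotype):
    """True if a full haplotype (e.g. "0-1-0-1") belongs to the set
    described by an incomplete haplotype (e.g. "*-1-*-1")."""
    p_alleles = pattern.split("-")
    h_alleles = haplotype.split("-")
    if len(p_alleles) != len(h_alleles):
        raise ValueError("pattern and haplotype cover different numbers of loci")
    return all(p == "*" or p == h for p, h in zip(p_alleles, h_alleles))


def expand(pattern):
    """List every full haplotype contained in an incomplete haplotype."""
    choices = [("0", "1") if p == "*" else (p,) for p in pattern.split("-")]
    return ["-".join(alleles) for alleles in product(*choices)]


def pattern_frequency(pattern, haplotype_freqs):
    """Frequency of an incomplete haplotype = sum over its member haplotypes."""
    return sum(f for h, f in haplotype_freqs.items() if matches(pattern, h))


print(expand("*-1-*-1"))

# Hypothetical haplotype frequencies, for illustration only.
freqs = {"0-0-0-0": 0.5, "0-1-0-1": 0.25, "1-1-0-1": 0.125, "1-0-1-0": 0.125}
print(pattern_frequency("*-1-*-1", freqs))
print(pattern_frequency("*-1-*-*", freqs))  # single-locus case: second SNP is minor
```

Treating a single SNP as a pattern with one fixed position is what lets haplotypes and single SNPs go through the same test.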
In this way, the side effect group and the non-side effect group can be compared for every case: haplotypes with information on all four SNPs, incomplete haplotypes with information on only some of them, and sets with information on a single SNP only. Here the side effect group has 16 individuals and the non-side effect group 128, so the total numbers of haplotypes in the two groups are 32 and 256, respectively. However, when each of the various incomplete haplotypes is counted in the two groups combined, the number of matching haplotypes (which for a single SNP is the number of minor alleles) varies. When this number is small (for example, less than 5), the test has almost no chance of reaching significance. Therefore only cases where this value (the marginal frequency) was 5 or more were tested. Furthermore, the "0-0-0-0" haplotype and the "0-0-0-*" haplotype, for example, are almost the same set, because the "0-0-0-1" haplotype hardly exists. In such cases the more specific haplotype, "0-0-0-0", was adopted. Merging equivalent haplotypes under these conditions left 20 haplotypes to be tested. For each of them, a 2x2 contingency table was built by counting, in each group, the haplotypes that belong to it and those that do not, and a chi-square test was performed.
The count in each cell of the "Chi square" column of Table 15 was obtained by multiplying the corresponding haplotype frequency ("Frequency" column of Table 15; the incomplete haplotype frequency with the side effect group and the non-side effect group combined) by twice the number of individuals and rounding to the nearest integer. The results are shown in Table 15.
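The counting and testing steps can be sketched as follows. This is a minimal Python illustration, not the computation behind Table 15: it assumes a Pearson chi-square test without continuity correction and rounding half up, and it treats the quoted frequencies 0.3125 and 0.1311 of the "0-1-0-1" haplotype as the side effect group and non-side effect group frequencies, which is an assumption made only to have numbers to work with.

```python
import math


def cell_count(frequency, n_individuals):
    """Haplotype count = frequency x (2 x individuals), rounded half up."""
    return int(math.floor(frequency * 2 * n_individuals + 0.5))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Illustrative counts only (see the assumptions stated above).
case_with = cell_count(0.3125, 16)        # haplotypes matching, side effect group
control_with = cell_count(0.1311, 128)    # haplotypes matching, other group
case_without = 2 * 16 - case_with
control_without = 2 * 128 - control_with

# Marginal-frequency filter from the text: skip patterns with fewer than 5 copies.
if case_with + control_with >= 5:
    chi2 = chi_square_2x2(case_with, case_without, control_with, control_without)
    print(round(chi2, 3), p_value_1df(chi2))
```

The 1-degree-of-freedom P value uses the identity P(X > x) = erfc(sqrt(x/2)), which avoids any dependency outside the standard library.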
The haplotype that was significant at the most stringent level was "0-1-0-1", followed by haplotypes such as "0-1-0-*" and "*-1-*-1". These were significant at the P < 0.01 level. Here, the P values were obtained from the relationship between the frequency of each incomplete haplotype and the occurrence of side effects. These values are no longer significant after Bonferroni correction for multiple comparisons. As described later, the phenotype-related SNP no 9 of Table 12 is assigned to "0-1-0-1" (Table 13), which is thought to be why its frequency was judged to differ significantly between the side effect group and the non-side effect group. Among single SNPs, "*-*-*-0" was significant, but only at the P < 0.05 level. The method described so far compares the frequencies of incomplete haplotypes between the side effect group and the non-side effect group.
Next, we tested the difference in the frequency of individuals carrying each incomplete haplotype. That is, a maximum likelihood test based on the generalized likelihood ratio was performed to test whether haplotypes of the NAT2 gene built from these four htSNPs are related to the side effects of sulfasalazine. This test (implemented in the program Penhaplo) uses the maximum likelihood method to test whether carrying a specific haplotype is related to a qualitative phenotype (described later). The maximum likelihood estimation uses the EM (expectation-maximization) algorithm. The haplotypes that can be tested include not only haplotypes with allele information at every SNP locus but also incomplete haplotypes with allele information at only some loci. An incomplete haplotype can also be the allele of a single SNP, so this approach handles haplotype analysis and SNP analysis in a unified way.
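To give a feel for the EM step, the following Python sketch estimates haplotype frequencies from unphased genotypes. It is a generic textbook EM under Hardy-Weinberg proportions, written for illustration; it is not the Penhaplo program, and it omits the penetrance parameters and the likelihood ratio test that Penhaplo is described as handling. Genotypes are assumed to be coded as minor-allele counts per locus, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """Unordered haplotype pairs consistent with a genotype of minor-allele counts."""
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([("0", "0")])
        elif g == 2:
            per_locus.append([("1", "1")])
        else:  # heterozygous locus
            per_locus.append([("0", "1"), ("1", "0")])
    pairs = set()
    for combo in product(*per_locus):
        h1 = "".join(a for a, _ in combo)
        h2 = "".join(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Maximum likelihood haplotype frequencies by EM, starting from uniform."""
    n_loci = len(genotypes[0])
    haplotypes = ["".join(bits) for bits in product("01", repeat=n_loci)]
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        expected = defaultdict(float)
        for g in genotypes:
            pairs = phasings(g)
            # E-step: posterior weight of each phasing given current frequencies.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: expected[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq


# Two loci: two 00/00 individuals, one 11/11 individual, one double heterozygote.
estimates = em_haplotype_frequencies([(0, 0), (0, 0), (2, 2), (1, 1)])
for hap, f in sorted(estimates.items()):
    print(hap, round(f, 4))
```

In this toy data set the double heterozygote is ambiguous between 00/11 and 01/10, and the iterations resolve it toward the phasing built from haplotypes already seen in the unambiguous individuals.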
In the test with the program Penhaplo, the incomplete haplotype "*-0-*-0" showed a significant relationship at P = 0.000084 (Table 16).

The total number of testable targets, including incomplete haplotypes and single SNPs, was 26 (again, a target was excluded when the total number of carriers, or of non-carriers, of the incomplete haplotype across both groups was less than 5). Even after Bonferroni correction for multiple testing, P = 0.000084 x 26 = 0.00218, which is significant even at the P = 0.01 level. Among single SNPs, the incomplete haplotype "*-*-*-0" (that is, a SNP) showed a significant relationship at P = 0.00215 (Table 16, No 2). The four htSNPs selected here are very unlikely to be involved in, for example, expression of the NAT2 gene. Rather, the SNPs already known to affect NAT2 activity and to be related to the side effects of sulfasalazine (SNP no 7, 9, 10, and 11) are in strong linkage disequilibrium with the htSNPs, which is thought to be why haplotype analysis with the htSNPs gave a significant result.
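The Bonferroni arithmetic quoted above can be checked with a few lines of Python. The function name is illustrative; the two inputs (P = 0.000084 and 26 tests) are the values given in the text.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P value: raw P times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


adjusted = bonferroni(0.000084, 26)
print(f"{adjusted:.5f}", adjusted < 0.01)  # prints "0.00218 True"
```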
The strongly significant result obtained with the incomplete haplotype "*-0-*-0" thus provides a worked example: by constructing a haplotype block, extracting htSNPs from it, and using those htSNPs to analyze the relationship with the phenotype, incomplete haplotypes included, a gene related to the phenotype can be detected. A significant association between the phenotype and haplotypes built from htSNPs alone suggests that the causal SNPs lie within this haplotype block. Indeed, for NAT2 and the side effects of sulfasalazine, the four causal SNPs (SNP no 7, 9, 10, and 11) lie within the same haplotype block.
We now examine in detail why the incomplete haplotype "*-0-*-0" gave the smallest P value in the test of the difference in the frequency of individuals carrying an incomplete haplotype. Table 13 shows the population haplotype frequencies estimated using only the four htSNPs in the 144 individuals of this example. As Table 13 shows, three haplotypes account for > 80% of all haplotypes, and five haplotypes account for > 94%.
As noted above, the phenotype-related SNPs, SNP no 7, 9, 10, and 11, are not among the htSNPs. Nevertheless, the method of the present invention made it possible to determine the assignment of these SNPs to the haplotypes composed of htSNPs. The results of the analysis are shown in Table 17.

As Table 17 shows, the phenotype-related SNPs no 7, 9, 10, and 11 are assigned to the htSNP haplotypes "1000", "0101", "1000", and "1100", respectively. The proportion of each phenotype-related SNP that is assigned to its haplotype, P(Aj|X), ranges from 0.685 to 1.000. In terms of the haplotype no given in Table 13, the phenotype-related SNPs are assigned to haplotype no 5, 2, 5, and 3, respectively. Because most of the minor alleles of each of the four phenotype-related SNPs are thus assigned to haplotype no 5, 2, 5, and 3 among the htSNP haplotypes, a test of the relationship with the phenotype is thought to give a significant result even when only htSNPs are used and the phenotype-related SNPs are not.
Table 18 shows the results of estimating the frequency of haplotypes using all four htSNPs (SNP no1,6,19,23) and four SNPs related to the phenotype (SNP no7,9,10,11).

Here, the four phenotype-related SNPs (denoted PSNP) are flanked by two htSNPs on each side. The 12 haplotypes shown in Table 7 are divided into those that contain a minor allele at one or more of the four phenotype-related SNPs (a 1 in the PSNP part of the "Haplotype (8 SNPs) htSNP|PSNP|htSNP" column of Table 7; marked "yes" in the "PSNP" column) and those that contain none (marked "no" in the "PSNP" column) (see the "Haplotype (htSNP) yes" and "Haplotype (htSNP) no" columns of Table 7). An individual carrying at least one haplotype without any minor allele of a phenotype-related SNP is expected to show the rapid acetylator phenotype. The htSNP-only haplotypes corresponding to the haplotypes without such minor alleles are the five haplotypes "0000", "1010", "0010", "1000", and "0100". Each consists of the four htSNPs and is written as four characters, 0 (major allele) or 1 (minor allele), starting from the 5' end. Among these characters, the fourth htSNP is 0 in every case (Table 7, "Haplotype (htSNP) no" column). The second htSNP is also 0 in every case except the very rare haplotype "0100" (frequency 0.0036). At the first and third htSNPs, 0 and 1 both occur. This is thought to be why the incomplete haplotype "*-0-*-0" gave the smallest P value (and hence the greatest significance) in the test of Table 16. In other words, "*-0-*-0" is what the htSNP haplotypes corresponding to the wild-type haplotype of the phenotype-related SNPs have in common. As stated earlier, an individual with at least one wild-type haplotype is unlikely to develop side effects. The same reasoning explains why the incomplete haplotype "*-*-*-0" shows the next highest significance.
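The reasoning in this paragraph amounts to taking a position-by-position consensus over the five htSNP haplotypes that carry no phenotype-related minor allele. A small Python sketch, with a hypothetical helper name and the wildcard written as `*`, reproduces both patterns from the haplotypes listed in the text:

```python
def common_pattern(haplotypes, ignore=()):
    """Position-wise consensus of haplotype strings.

    A position keeps its allele if all retained haplotypes agree there,
    otherwise it becomes the wildcard "*". Haplotypes listed in `ignore`
    (for example, very rare ones) are left out of the comparison.
    """
    kept = [h for h in haplotypes if h not in ignore]
    if not kept:
        raise ValueError("no haplotypes left to compare")
    symbols = []
    for alleles in zip(*kept):
        symbols.append(alleles[0] if len(set(alleles)) == 1 else "*")
    return "-".join(symbols)


wild_type_htsnp = ["0000", "1010", "0010", "1000", "0100"]
print(common_pattern(wild_type_htsnp))                   # *-*-*-0
print(common_pattern(wild_type_htsnp, ignore={"0100"}))  # *-0-*-0
```

With all five haplotypes only the fourth position is fixed, giving "*-*-*-0"; setting aside the rare "0100" haplotype (frequency 0.0036) also fixes the second position, giving "*-0-*-0".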
As described above, tests of the difference in haplotype frequency between the two groups and of the difference in the frequency of individuals carrying a specific haplotype were carried out by the same procedure for three cases: using the information of all four htSNPs, using part of it, and using a single locus only. In this example, the test comparing the frequency of individuals carrying a haplotype had higher power than the test comparing haplotype frequencies between the two groups.
In the method comparing haplotype frequencies, a single haplotype to which a phenotype-related SNP is assigned was the most significant: as Table 15 shows, the "0-1-0-1" haplotype was the most significant. The reason is that the minor allele of the phenotype-related SNP no 9 is assigned to this haplotype and, moreover, every copy of this haplotype carries the minor allele of SNP no 9, as indicated by the value P(X|Aj) = 1 in Table 17. By contrast, in the analysis comparing the frequency of individuals carrying a specific haplotype, the "*-0-*-0" haplotype was the most significant. As noted above, this is the set of haplotypes to which no minor allele of a phenotype-related SNP is assigned. The likely reason is that the NAT2-related side effects depend not on whether an individual carries a minor allele of a phenotype-related SNP but on whether the individual has at least one chromosome free of those minor alleles. Such chromosomes form the set of haplotypes to which no minor allele of a phenotype-related SNP is assigned, and that set turned out to be "*-0-*-0".
In any case, as described in this specification, it is effective for testing the relationship between a specific gene and a phenotype to select relatively high-frequency SNPs (for example, SNPs with a frequency > 0.1) from among the many SNPs of a drug-related gene, construct haplotype blocks from them, select htSNPs within each haplotype block, and test the relationship between the phenotype and the haplotypes composed of those htSNPs. The reason is that the great majority of copies of a phenotype-related SNP (which in many cases has a relatively low frequency) are assigned to a single major haplotype. Given such an assignment, the test reaches significance even when the relationship between the disease and the SNPs is relatively complex, as it is for the NAT2 gene. The technique of the present invention provides a very flexible way of finding genes related to a phenotype. Once the relationship between the phenotype and haplotypes composed of htSNPs has been established in this way, the SNPs within and near that haplotype block can be examined in detail to identify the polymorphisms related to the phenotype.

プログラム「Ｐｅｎｈａｐｌｏ」
実施例５で用いる、一般化尤度比を用いた最尤法による検定方法を以下に述べる。当該方法は、所定の集団において観察された遺伝子型データ及び表現型データを用いた浸透率推定方法であり、当該方法により得られた浸透率（推定値）を用いて、新たな特定の個体が表現型を発現する確率を、その個体の遺伝子型データを用いて推定する方法である。言い換えると、特定のハプロタイプの保有と質的表現型との間で関係があるかどうかの検定を最尤法を用いて行う方法である。
本発明に係る浸透率推定方法及び表現形発現確率推定方法は、以下に説明するアルゴリズム（以下、「本アルゴリズム」と称する）によって実現される。本アルゴリズムは、コホート研究又は臨床治療試験の結果として得られた所定の集団において観察された遺伝子型データ及び表現型データ、若しくは、ケースコントロール研究の結果として得られた所定の集団において観察された遺伝子型データ及び表現型データを用いることができる。
１．コホート研究又は臨床治療試験への適用
以下の説明においては、先ず、コホート研究又は臨床治療試験から得られた個人の遺伝子型データと表現型データとを用いて、表現型とハプロタイプの存在の関連を検定し、ハプロタイプを基礎とした浸透率を推定する方法について説明する。本アルゴリズムは、ＥＭ（ｅｘｐｅｃｔａｔｉｏｎ−ｍａｘｉｍｉｚａｔｉｏｎ）アルゴリズムに基づいて作成されたものである。
本アルゴリズムは、集団のハプロタイプ頻度に加え、ハプロタイプを保有している個体と保有していない個体の間で異なる浸透率を推定することができる。従って、本アルゴリズムによれば、相対危険も最尤推定することができる。具体的に、本アルゴリズムにおいては、先ず、表現型とハプロタイプの間に関連が無いと言う仮定の下での最大尤度（Ｌ_０ｍａｘ）（帰無仮説；即ち浸透率は１つ）および、関連があるという仮説の下での最大尤度（Ｌ_ｍａｘ）（対立仮説；即ち浸透率は２つ）を計算する。次に、本アルゴリズムでは、統計量、例えば−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）（以下単に統計量と称する）を算出し、この統計量に基づいて表現型とハプロタイプとの関連の検定を行う。
本アルゴリズムは、検体の遺伝子型情報から、表現型を発現する確率を推定する方法に適用することができる。すなわち、本アルゴリズムによれば、検体の遺伝子型情報に基づいて、当該検体において所定の表現型を発現する確率を推定することができる。これにより、本アルゴリズムは特に、遺伝的要因と薬剤に対する個人の反応の関連の解析に有用である。本アルゴリズムは、コンピュータソフトウェアＰＥＮＨＡＰＬＯに搭載することで、コンピュータ上で実行することができる。ここでコンピュータとは、動作を全て制御するＣＰＵ（制御手段）と、プログラムの実行指示等を入力できるキーボード及びマウスと、記憶媒体（データベース）等に含まれる各種データを取得する入力手段と、ディスプレイ装置等の表示手段と、一時的な情報及びプログラム等が記録されるメモリーと、各種データやプログラム等が格納されたハードディスク等の記憶手段とを備えるものである。また、コンピュータは、インターネット等の通信回線網を介して外部のデータベースや他のコンピュータ等と接続されたものであっても良い。
本アルゴリズムをコンピュータ上で実行する場合、コンピュータソフトウェアＰＥＮＨＡＰＬＯをインストールする。これにより、コンピュータは、コンピュータソフトウェアＰＥＮＨＡＰＬＯに従って、ＣＰＵの制御のもと本アルゴリズムを実行することができる。なお、遺伝子型データ及び表現型データは、例えばインターネット等の通信回線網を介して入力手段により取得してもよいし、或いは遺伝子型データ及び／又は表現型データを格納した記憶媒体から取得することもできる。
すなわち、本アルゴリズムは、コンピュータを、遺伝子型データ及び表現型データを取得する入力手段と、入力手段で取得した遺伝子型データ及び表現型データを用いてハプロタイプ頻度並びに浸透率の最尤推定値と最大尤度（Ｌ_０ｍａｘ）とを求める制御手段（演算手段）として機能させるものである。また、本アルゴリズムは、当該制御手段によって、上記最尤推定値から浸透率を求めるようにコンピュータを機能させることができる。さらに、本アルゴリズムは、当該制御手段によって、上記最大尤度（Ｌ_０ｍａｘ）及び最大尤度（Ｌ_ｍａｘ）から尤度比を求め、所定のディプロタイプ形と所定の表現型とに関連性があるという仮説をχ^２分布により検定するようにコンピュータを機能させることができる。さらにまた、本アルゴリズムは、当該制御手段によって、最尤推定値と検査対象の個体の遺伝子型データとを用いて、当該個体が所定の表現型を発現する確率を求めるようにコンピュータを機能させることができる。
ここで、遺伝子型データとは、ある個体に関して、いわゆるＳＮＰタイピング等を実施した結果として得られる多型の位置を意味する情報と、多型の種類を意味する情報とを含むデータである。遺伝子型データは、個体を特定する情報を除くことにより匿名化されていてもよい。
表現型データとは、ある個体が所定の表現型を有するか否かを意味するバイナリーデータである。特定の表現型とは、例えば、臨床検査及び診断などによって検定される薬剤作用或いは副作用の有無、罹患の有無等を挙げることができる。
（本アルゴリズム）
以下、本アルゴリズムを詳細に説明する。
＜標本空間＞
ｌ個の連鎖するＳＮＰ座位があるとする。すべての可能なハプロタイプの数はＬ＝２^ｌである。我々は、無限のハプロタイプコピーの集合を定義する。ここで、ハプロタイプの頻度はΘ＝（θ_１，．．，θ_Ｊ，．．，θ_Ｌ）であり、ここでθ_Ｊはｊのハプロタイプ頻度である。ただし、

である。Ｎ人の個体のそれぞれに、ハプロタイプコピーの集合よりランダムに引き出して、二つのハプロタイプコピーを順番に与える。ａ_１、ａ_２、…、ａ_Ｌ２を可能なディプロタイプ形とする。ｉ番目の個体のディプロタイプ形がａ_ｋである確率はＰ（ｄ_ｉ＝ａ_ｋ｜Θ）＝θ_ｌθ_ｍである。ここにｄ_ｉはｉ番目の個体のディプロタイプ形であり、ｌ及びｍはａ_ｋを構成するハプロタイプの順番である。これはハプロタイプレベルでのハーディーワインバーグ平衡が仮定されていることを意味している。ｉ番目の個体は表現型ψ_＋をｄ_ｉの関数で表される確率のもとに発症する。理論的にはすべてのディプロタイプ形に対して浸透率を仮定することが可能である。しかし、すべてのディプロタイプ形に浸透率を対応付けることは現実的ではない。そこで、本アルゴリズムにおいては２つの浸透率のみを仮定する。即ちＨ_ａｌｌをすべてのハプロタイプの集合とし、Ｈ_＋をＨ_ａｌｌの部分集合で、その存在により他と異なった表現型をきたすハプロタイプの集合とする。典型的な例ではＨ_＋はただ一つのハプロタイプを含むが、複数のハプロタイプを要素として含むこともできる。もし、Ｈ_＋が特定の座位で特定のアレルを含むすべてのハプロタイプの集合と定義すれば、（ハプロタイプではなく）アレルと表現型との関連を検定することと同じになる。
また、Ｈ_ａｌｌの部分集合としてはＨ_＋に限定されず、以下に説明するＨ_ｌを定義してもよい。Ｈ_ｌは、ＥＭアルゴリズムによって推定されたハプロタイプ分布に基づいて、個別のハプロタイプを区別する情報を与える座位及びこれらの座位から組み合わせによって重複する情報をもつ座位を特定し、特定した座位をマスクすることによってＨ_ａｌｌの部分集合として定義される。
ここでマスクとは、ハプロタイプを構成する複数の座位のうち、１以上の特定の座位については全ての多型が当てはまるものとして情報を隠蔽することを意味する。したがって、マスクによって一部が隠蔽された不完全ハプロタイプは、複数のハプロタイプを要素として含む集合Ｈ_ｌとして表現される概念である。その定義から明らかに

となる。さらに、唯一つの座位の情報を用いて他の座位をマスクした不完全ハプロタイプを構築すると、使用した座位のＳＮＰ情報のみが用いられることからＳＮＰと同義となる。これを集合Ｈ_ＳＮＰとして定義すると、Ｈ_ＳＮＰはＨ_ｌの特別な場合であり、

となる。
検定対象としてＨ_＋の代わりに不完全ハプロタイプＨ_ｌを用いる場合の合理性は以下のとおりである。
１）ハプロタイプで遺伝子多型を表現する場合、多型の原因は塩基置換と組み換えによるものである。ある領域のハプロタイプがある表現型と関連する原因座位と関連するときに、複数のハプロタイプが対応付けられるならば、それは原因座位が発生した後に生じた突然変異、あるいは組み換えによるものである。不完全ハプロタイプを構築することにより、突然変異は特定の座位のマスク、組み換えは連続する座位のマスクとして表現することが可能である。
２）Ｌ座位のＳＮＰ情報を用いて不完全ハプロタイプを構築する際に、マスクする座位を０からＬ−１まで変化させることにより、全ての情報を用いたハプロタイプからＳＮＰまでを本アルゴリズムの検定対象に含めることが可能となる。
ここで、Ｌ座位のＳＮＰ情報を用いて全ての不完全ハプロタイプＨ_ｌを構築することを考えると、座位ごとに２つのアレルとマスク操作からなる３通りの情報を適用することとなるため、単純には３^Ｌ−１とおりの組み合わせを考慮する必要が生じる。ハプロタイプ推定における組み合わせ数が２^Ｌとおりであることと比較しても、膨大な組み合わせ数となる。しかしながら、連鎖不平衡が強い領域においては、現実にはハプロタイプは１０にも満たない場合がほとんどであり、単純に全ての組み合わせを構築する必要はない。したがって、不完全ハプロタイプＨ_ｌは以下の１〜３のアルゴリズムに従って構築することができる。
１．ＥＭアルゴリズムによるハプロタイプ推定を行う。
２．推定されたハプロタイプ分布より、個別のハプロタイプを区別する情報を与える座位を抽出し、さらにそれらの座位から組み合わせによって重複する情報を持つものを削除することによって、ハプロタイプタグＳＮＰ（以下、ｈｔＳＮＰとする）を定める。
３．ｈｔＳＮＰの座位に対して、２つのアレルとマスク操作からなる３通りの情報を順次適用して、不完全ハプロタイプＨ_ｌを構築する。さらに、異なるマスク方法でも同じＨ_ｌが構築されるケースがあるため、これを削除する。
以下においては、Ｈ_＋を検定対象とする場合を説明するが、この場合と同様にして検定対象として不完全ハプロタイプＨ_ｌとすることができる。
Ｄ_＋をＨ_＋の要素を含むディプロタイプ形の集合とする。ｑ_＋をｉ番目の個体が

の下で表現型ψ_＋をきたす確率とする。そしてｑ₋をｉ番目の個体が

の条件の下で表現型ψ_＋をきたす確率とする。
即ちψ_ｉをｉ番目の個体の表現型とすると、

となる。ここでΘとｑ_＋，ｑ₋は独立である。
このように本アルゴリズムにおいては、従来のＥＭアルゴリズムと異なり、表現型の発生の過程が含められている。また本アルゴリズムにおいては、Θに加えてｑ_＋やｑ₋などのパラメータが確率空間の定義の上で含まれている。特に、本アルゴリズムにおいて、ψ_ｉはｄ_ｉの条件の下でΘとは独立であることに注意する。
＜尤度関数＞
本アルゴリズムに用いる観察データは、個体の遺伝子型データ及び表現型データである。ここで、遺伝子型データのベクトルをＧ_ｏｂｓ＝（ｇ_１，ｇ_２，．．，ｇ_Ｎ）とし、表現型データのベクトルをΨ_ｏｂｓ＝（ｗ_１，ｗ_２，．．，ｗ_Ｎ）とする。ここでｇ_ｉとｗ_ｉは、それぞれｉ番目の個体の観察される遺伝子型、表現型である。そうすると、尤度関数は次のようになる。

ここでＡ_ｉはｉ番目の個体についてｇ_ｉに合致するａ_ｋの集合である。
また、ｄ_ｉはｑ_＋，ｑ₋と独立であり、ψ_ｉはｄ_ｉの条件下でΘと独立なので、

ここでＡ_ｉはｉ番目の個体についてｇ_ｉに合致するａ_ｋの集合である。
いかなるｉとｋについて、

となる。
調べられた座位に関して、表現型がディプロタイプ形と独立であるという帰無仮説のもとで、尤度関数は

となる。ここでｑ_０はすべてのディプロタイプ形に対応する浸透率であり、Ａ_ｉはｉ番目の個体についてｇ_ｉに合致するａ_ｋの集合である。
いかなるｉとｋに対して、

となる。
＜ＥＭアルゴリズム＞
本アルゴリズムにおいては、上記式（１）をΘ、ｑ_＋とｑ₋の上で最大化し、得られた最大尤度をＬ_ｍａｘとして算出する。また、本アルゴリズムにおいては、上記式（２）をΘとｑ_０の上で最大化し、得られた最大尤度をＬ_０ｍａｘとして算出する。次に、本アルゴリズムでは、尤度比Ｌ_０ｍａｘ／Ｌ_ｍａｘをハプロタイプの存在と表現型との関連の検定に用いる。
Ｌ_ｍａｘの最大化については、推定すべきパラメータはΘ＝（θ_１，θ_２，．．．，θ_Ｌ）、ｑ_＋及びｑ₋である。一方、Ｌ_０ｍａｘの最大化については推定すべきパラメータはΘ＝（θ_１，θ_２，．．．，θ_Ｌ）及びｑ_０である。Ｌ_０ｍａｘの最大化において張られる空間は、Ｌ_ｍａｘの最大化において張られる空間の部分空間である。帰無仮説の下では−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）は自由度１のχ^２分布に従う。
もしｄ_１，ｄ_２，．．．，ｄ_Ｎとψ_１，Ψ_２，．．．，Ψ_Ｎに関して完全データが得られるならば、θ_１，θ_２，．．．，θ_Ｌとｑ_＋，ｑ₋の最大推定量は、それぞれ

（ただしｊ＝１，２，…，Ｌ）のように簡単に得られる。ここで、ｎ_ｊはＮ人の個体の中のｊ番目のハプロタイプのハプロタイプコピーの数である。また、

である。ここで＃｛ｉ；，，｝は、「；」の後の条件を満たす個体の数を表す。
しかしながら、個体のディプロタイプ形に関する完全データは得られず、単に個体の遺伝子型データと表現型データを観察するのみである。従って、我々はｎ_ｊ／（２Ｎ）、Ｎ_＋Ψ＋／Ｎ_＋およびＮ_−Ψ＋／Ｎ₋の期待値を、真の値の代わりに代入する以下のアルゴリズムを作成する。すなわち、本アルゴリズムは以下のステップ（ｉ）〜（ｖｉｉｉ）を含むものである。
（ｉ）ｎ＝０について初期値（例えば、θ_ｊ ^（ｎ）＝１／Ｌ）を、Θ^（ｎ）＝（θ_１ ^（ｎ）、θ_２ ^（ｎ）…、θ_Ｌ ^（ｎ））に与える、ただし、

である。
（ｉｉ）ｎ＝０について初期値をｑ_＋ ^（ｎ）、ｑ₋ ^（ｎ）に与える。ただし、０＜ｑ_＋ ^（ｎ）、ｑ₋ ^（ｎ）＜１である。
（ｉｉｉ）すべてのｉ、そしてｇ_ｉに合致するすべてのａ_ｋについて以下を計算する。

ここで、Ａ_ｉはｇ_ｉと合致するａ_ｍの集合である。ここで、本アルゴリズムにおいては、ｇ_ｉに合致するａ_ｋのみについて調べることに注意する。さらに、ｄ_ｉは、ｑ_＋ ^（ｎ）及びｑ₋ ^（ｎ）と独立であり、ψ_ｉはｄ_ｉの条件下でΘ^（ｎ）と独立なので、上記式（３）は以下のようになる。

（ｉｖ）Ｎ人の個体に保有されているｊ番目のハプロタイプのハプロタイプコピーの数であるｎ_ｊはランダム変数なので、ｊ番目のハプロタイプのハプロタイプコピーの数の期待値を、以下のように定義できる。

ここでｆ_ｊ（ａ_ｋ）はａ_ｋの中のｊ番目のハプロタイプのハプロタイプコピー数であり、Ａ_ｉはｉ番目の個体についてｇ_ｉに合致するａ_ｋの集合である。ここでｆ_ｊ（ａ_ｋ）は０，１，２のいずれかである。この期待値をすべてのｊについて計算する。
（ｖ）ここで、Ｎ_＋Ψ＋／Ｎ_＋とＮ_−Ψ＋／Ｎ₋はランダム変数であるため、以下のように期待値をそれぞれ定義することができる。

また、

ここで、

である。
（ｖｉ）ステップ（ｉｖ）の結果からΘを以下のように、次のステップのために更新する。

ステップ（ｖ）の計算結果より、浸透率を次のステップのために以下のように更新する。

（ｖｉｉ）（ｉｉｉ）から（ｖｉ）までのステップを値が収束するまで繰り返す。
収束した場合の最大推定値を

とする。
（ｖｉｉｉ）極大を避けるために、さまざまなθ_ｊ ^（０）（ｊ＝１，２，…，Ｌ）、ｑ_＋ ^（０）及びｑ₋ ^（０）の初期値をテストする。ここで、

は、対立仮説の下での最大尤度Ｌ_ｍａｘである。
もし、ｑ_０＝ｑ_＋＝ｑ₋の条件を与え、ステップ（ｉｉｉ）から（ｖｉｉ）までを繰り返せば帰無仮説の下での最大尤度Ｌ_０ｍａｘが得られる。
帰無仮説の下では統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）は自由度１のχ^２分布に漸近的に従うと期待される。
＜表現型発現確率推定アルゴリズム＞
上述したＥＭアルゴリズム（本アルゴリズム）によれば、検体の遺伝子型データに基づいて、当該検体において所定の表現型を発現する確率を推定することができる。対象となる検体をＮ＋１とし、当該検体において観測された遺伝子型をｇ_Ｎ＋１とすると、当該検体が表現型ψ_＋を発現する確率

を以下の式に従って推定することができる。

ここでｄ_Ｎ＋１はＮ＋１番目の検体に関するディプロタイプ形を示している。
上記のＥＭアルゴリズム及び表現型発現確率推定アルゴリズム（本アルゴリズム）は、例えばコンピュータソフトウェアに搭載することができる。本アルゴリズムをコンピュータソフトウェアに搭載することによって、全ての計算をコンピュータにおいて行うことができる。
また、本アルゴリズムを搭載したソフトウェアをインストールしたコンピュータは、通信回線網を介して外部ネットワークと接続されていてもよく、例えば、コホート研究又は臨床治療試験から得られた個人の遺伝子型データ及び表現型データを、当該通信回線網を介して取得することができる。また、本アルゴリズムによって推定された浸透率及び確率を、通信回線網を介して外部に出力することもできる。
２．ケースコントロール研究への適用
次に、ケースコントロール研究の結果として得られた個人の遺伝子型データと表現型データとを上述したアルゴリズムに適用して、表現型とハプロタイプの存在の関連を検定し、ハプロタイプを基礎とした浸透率を推定する方法について説明する。
ケースコントロール研究においては、ケース（ａｆｆｅｃｔｅｄ）とコントロール（ｎｏｎ−ａｆｆｅｃｔｅｄ）の数を固定してサンプルする。このため、ケースとコントロールの比を“ｃａｓｅ：ｃｏｎｔｒｏｌ＝ω：１−ω”とし、全集団におけるａｆｆｅｃｔｅｄの割合（罹患率）をλとすると、ケースは標本空間におけるａｆｆｅｃｔｅｄ（λ）の中からω分、コントロールは標本空間におけるｎｏｎ−ａｆｆｅｃｔｅｄ（１−λ）の中から（１−ω）分、抽出されることになる。
したがって、ケースコントロール研究において推定されるパラメータは、ある個体がディプロタイプ形のＤ_＋の要素を持つかどうかということが、ある特定の表現型（例えば特定の疾患）の有無と関連しているとすると、下記表１９で与えられる。

上記表１９においてωは定数であり、

はケースとコントロールを合わせた集団におけるディプロタイプ形の頻度である。また、ｒ_＋及びｒ₋は、上述したコホート研究又は臨床治療試験に適用した場合に推定される浸透率ｑ_＋及びｑ₋とは異なり浸透率を意味するものではない。これら推定値の関係は下記式のように示すことができる。

上記式から判るように、ｒ_＋及びｒ₋は互いに独立の関係にないことがわかり、本アルゴリズムをケースコントロール研究に適用する場合には、上述したコホート研究又は臨床治療試験に適用した場合と比較して、推定されるパラメータが１つ少ないことがわかる。ｒ_＋＝ｒ₋が成立するのは、ｒ_＋＝ｒ₋＝ｗの場合である。よって、本アルゴリズムをケースコントロール研究に適用した場合の帰無仮説及び対立仮説は、

となる。また、浸透率ｑ_＋は推定値ｒ_＋を用いて

と表すことができる。したがって、上述した本アルゴリズムをケースコントロール研究に適用して推定された推定値ｒ_＋を代入することによって、ケースコントロール研究の結果を用いて浸透率の推定値を算出することができる。
なお、罹患率λはケースコントロール研究の結果からは推定できないため、別途に与える必要がある。例えば、罹患率λは、対象とする疾患に関する統計調査や特定の集団における追跡調査などから特定の数値として得ることができる。
また、上記ｒ_＋及びｑ_＋の関係を示す式に基づいて、ｒ_＋及びｑ_＋の関係を図２１及び２２に示す。図２１は、ケースとコントロールが同数のもとでλの値を０．１、０．０１及び０．００１とした場合を示す特性図である。図２２は、λを０．０１とし、ケースとコントロールとの比率（ｃａｓｅ／ｃｏｎｔｒｏｌ）を１、５及び１０とした場合を示す特性図である。
また、ケースコントロール研究の結果を用いた場合であっても、上述したコホート研究又は臨床治験試験の結果を用いる場合と同様に、検体の遺伝子型データに基づいて、当該検体において所定の表現型を発現する確率を推定することができる。
＜シミュレーション＞
なお、以下の説明において、

をそれぞれ“Θハット”、“ｑ_＋ハット”及び“ｑ₋ハット”と呼ぶ。
帰無仮説の下での統計量−２ｌｏｇ（Ｌ _０ｍａｘ／Ｌ _ｍａｘ）の経験的分布：
先ず、統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）の経験的分布を帰無仮説の下でのシミュレーションにより検討した。本例では、ハプロタイプ頻度Θをシミュレーションではなく、現実データから得た。即ち、過去のＳＡＡ（血清アミロイドＡ）遺伝子に関する研究によって得られたΘを用いた［Ｍｏｒｉｇｕｃｈｉｅｔａｌ．（２００１）Ａｎｏｖｅｌｓｉｎｇｌｅ−ｎｕｃｌｅｏｔｉｄｅｐｏｌｙｍｏｒｐｈｉｓｍａｔｔｈｅ５’−ｆｌａｎｋｉｎｇｒｅｇｉｏｎｏｆＳＡＡ１ａｓｓｏｃｉａｔｅｄｗｉｔｈｒｉｓｋｏｆｔｙｐｅＡＡａｍｙｌｏｉｄｏｓｉｓｓｅｃｏｎｄａｒｙｔｏｒｈｅｕｍａｔｏｉｄａｒｔｈｒｉｔｉｓ．ＡｒｔｈｒｉｔｉｓＲｈｅｕｍ４４：１２６６−１２７２］。ｑ_０については０と１の間の様々な値を試した。なお、帰無仮説の下では浸透率はすべてのディプロタイプ形に対して同じとなる。
シミュレーションを始めるため、Θを用いてハプロタイプを引き出すことによって、Ｎ人の個体のそれぞれに対して順番を付けた２つのハプロタイプコピーを与えた。そして、それぞれの個体の表現型をｑ_０に基づいて与えた。即ち、ｑ_０を任意の個体が表現型ψ_＋を発生する確率とした。その後、相に関する情報を除去し、上記のアルゴリズムを適用し、統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）を計算した。シミュレーションは繰り返し行い、その統計量の分布を調べた。
なお、ＳＡＡ遺伝子に関するハプロタイプデータは６個のＳＮＰのデータを含んでいる。表２０に、ＳＡＡ遺伝子に関するハプロタイプと頻度を示す。

なお、上記表２０においてハプロタイプ「ＡＣＣＧＴＣ」は、対立仮説によるシミュレーションにおいて「表現型に関係するハプロタイプ」として対応付けられたハプロタイプである。
この表２０に従って、ハプロタイプコピーをランダムに引き、それぞれの個体に２つの順位付のハプロタイプコピーを与えた。表現型ψ_＋を発生する確率はすべての個体について同じなので、それぞれの個体に固定された確率ｑ_０に基づいて表現型ψ_＋を与えた。ｑ_０には０から１までのさまざまな値を与えた。
図２３は、様々なｑ_０とＮの下での統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）の分布を示すヒストグラムである。なお、図２３において、ａはｑ_０＝０．２、Ｎ＝１，０００と設定した場合の結果を示し、ｂはｑ_０＝０．１、Ｎ＝１，０００と設定した場合の結果を示し、ｃはｑ_０＝０．２、Ｎ＝２００と設定した場合の結果を示し、ｄはｑ_０＝０．２、Ｎ＝１００と設定した場合の結果を示す。図２３において、統計量のヒストグラムは棒グラフで示し、自由度１のχ^２分布の確率密度関数は曲線で示した。
また、ｑ_０とＮに様々な値を与えた時、統計量の分布が自由度１のχ^２分布に従うと仮定して計算した第一種の過誤（α＝０．０５）の確率と、対立仮説の下で推定された“ｑ_＋ハット”と“ｑ₋ハット”の値を、表２１に示す。

なお、上記表２１において、“ｑ_＋ハット”と“ｑ₋ハット”の値は平均±標準偏差で示した。また、タイプＩ過誤率としては統計量の値が３．８４１（自由度１のχ^２分布の累積密度関数が０．９５となる値である）を超えた割合を示した。
図２３及び表２１に示した結果より、帰無仮説の下ではこの統計量が漸近的に自由度１のχ^２分布に従うことが明らかとなった。
対立仮説の下でのシミュレーション：
次に、対立仮説の下でのシミュレーションを行った。ここで、ハプロタイプの一つを表現型ψ_＋に関係したハプロタイプと決め、このハプロタイプを「表現型に関係したハプロタイプ」と呼ぶことにする。このシミュレーションにおいて、Ｄ_＋は、少なくとも１つの「表現型に関係したハプロタイプ」を持つディプロタイプ形の集合と定義される。二つの浸透率ｑ_＋とｑ₋にいずれも０から１までの値を与えた。
このシミュレーションでは、先ず、それぞれの個体に順番を付けた２つのハプロタイプのコピーを、Θを用いて引き出す。そして、それぞれの個体の表現形ψ_＋を発生する確率として、ｑ_＋又はｑ₋を与える。なお個体が「表現型に関係するハプロタイプ」を持っているときの表現形ψ_＋を発生する確率はｑ_＋とし、持っていないときの確率はｑ₋とした。
相の情報を除去した後、遺伝子型データと表現型データを上記のアルゴリズムにかけた。シミュレーションは多数回繰り返し、得られた結果を解析した。そして、様々なｑ_＋とｑ₋の値の下で、この統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）が帰無仮説の下では自由度１のχ^２分布に従うと仮定し、検出力を計算した。
結果として、ｑ_＋の値を固定し、ｑ_＋／ｑ₋（これは即ち、相対危険である）の値を変化させることによって得られた統計量の分布を図２４に示す。なお、図２４において、ａ〜ｆは以下の条件を設定し、ｑ₋の値を変化させて統計量を算出した結果を示している。
（ａ）ｑ_＋＝０．２、Ｎ＝１，０００
（ｂ）ｑ_＋＝０．１、Ｎ＝１，０００
（ｃ）ｑ_＋＝０．２、Ｎ＝２００
（ｄ）ｑ_＋＝０．２，Ｎ＝１００
（ｅ）ｑ_＋＝０．４，Ｎ＝１００
（ｆ）ｑ_＋＝０．５，Ｎ＝１００
図２４のａ〜ｆにおいて、３．８４１の値の水平線は自由度１のχ^２分布に従うとした場合の（Ｐ＝０．０５）での限界値を示している。
また、同じシミュレーションを様々なｑ_＋、ｑ₋およびＮの値の下で１０，０００回計算し、統計量が３．８４１（自由度１のχ^２分布の累積密度関数０．９５を与える値である）を超える試行の割合を経験的な検出力とした。結果を図２５に示す。図２５は検出力がｑ_＋／ｑ₋（ｑ_＋／ｑ₋≧１）（即ち、相対危険）を増加させることにより増加し、Ｎを増加させることにより増加することを示している。
さらに、「表現型に関連したハプロタイプ」の頻度の統計量に対する影響を調べた結果を図２６に示す。図２６からは、「表現型に関連するハプロタイプ」の頻度が０から１の間の中間的な値で検出力がピークを取ることが明らかとなった。
推定された浸透率“ｑ _＋ハット”及び“ｑ ₋ ハット”の分布：
次に、上述した対立仮説の下での推定値“ｑ_＋ハット”及び“ｑ₋ハット”の分布を調べた。具体的には、対立仮説の下で、与えられた浸透率ｑ_＋＝０．２と変化させる浸透率ｑ₋の下で行った。相対危険（ｑ_＋／ｑ₋）は１．０から２．０まで変化させた。標本サイズＮは１，０００に固定した。本アルゴリズムを用いて、対立仮説の下で“ｑ_＋ハット”及び“ｑ₋ハット”を推定し、統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）を計算した。このシミュレーションをそれぞれのパラメータセットについて１０，０００回繰り返した。結果を表２２に示す。

表２２中“ｑ_＋ハット“及び“ｑ₋ハット”は平均±標準偏差で示した。経験的検出率は、統計量の値が３．８４１（自由度１のχ^２分布の累積密度関数が０．９５となる）を超えた試行の割合として示した。表２２に示したように、本アルゴリズムによれば、ｑ_＋およびｑ₋の推定がかなり正確であること、ばらつきは比較的小さいことが明らかとなった。この結果から、同様にｑ_＋／ｑ₋の推定値、“ｑ_＋ハット”及び“ｑ₋ハット”も比較的正確と考えられる。
次に、“Θハット”とディプロタイプ形の事後確率分布（ディプロタイプ分布）が、表現型データを入れた場合と入れない場合で異なるか否かを検討した。表現型データを入れない場合の推定は、いわゆるＬＤＳＵＰＰＯＲＴプログラムによって行った。このＬＤＳＵＰＰＯＲＴプログラムは遺伝子型データのみで個体のディプロタイプ分布を計算するものである。しかし、浸透率をｑ_０＝ｑ_＋＝ｑ₋と置いたとき、即ち、帰無仮説のもとでは“Θハット”とＬＤＳＵＰＰＯＲＴで推定されたΘと個人のディプロタイプ分布は同じであった。データは示さないが、“ｑ_＋ハット”及び“ｑ₋ハット”は個人の表現型を変化させることにより変化した。
表２０に示したＳＡＡ遺伝子により得られたΘを用いて、対立仮説の下でシミュレーションを行った結果を表２３に示す。なお、シミュレーションのパラメータはｑ_＋＝０．２、ｑ₋＝０．１及びＮ＝１，０００である。表２３には典型的な４人のデータを示す。

表２３において、「表現型」の欄はそれぞれの罹患状態を示している。ここでＮは非罹患者を示し、Ａは罹患者を示す。表２３において「ディプロタイプ分布」の欄はそれぞれの個体のディプロタイプ形の事後分布を示している。特に、「遺伝子型のみ」の欄は表現型の情報なしにＬＤＳＵＰＰＯＲＴを用いて解析を行った結果を示す。「＋表現型」の欄は遺伝子型データ及び表現形データを用いて本アルゴリズムにより解析を行った結果を示す。「表現型を変える」の欄は個人の表現型データのみを反転したデータを用いて本アルゴリズムにより解析を行った結果を示す。
表２３に示すように、“Θハット”と推定された個人のディプロタイプ形は表現型データを入れたり個人の表現型を変化させることにより確かに変化した。表２３には示さないが、推定された個体のディプロタイプ分布が一つの事象に集中した場合は、“Θハット”も個人のディプロタイプ分布もほとんど変化しなかった。これと比較して、表２２に示したように、個体のディプロタイプ分布が一つの事象に集中しなかった場合は、“Θハット”も個人のディプロタイプ分布も表現型データを入れたり、個人の表現型を変化させることにより変化した。
次に、表現型を入れて推定した（即ち、本アルゴリズムによる）“Θハット”及びディプロタイプ形の両方或いは片方が、表現型を入れないで推定したもの（即ち、ＬＤＳＵＰＰＯＲＴプログラムによる）と異なる場合に、どちらのアルゴリズムによる推定がより正確かを検証した。
具体的には、対立仮説の下でシミュレーションを行うに際して、ＳＡＡ遺伝子及び人工的に設計した遺伝子からΘを得た。なお、人工的に設計した遺伝子に対しては、互いに弱い連鎖不平衡が存在する６つのＳＮＰ座位に関するデータを作製し、ハプロタイプコピーの集合を作成した。
ＳＡＡ遺伝子のシミュレーションにおいて、パラメータをｑ_＋＝０．５、ｑ₋＝０．１２５及びＮ＝１００とし、シミュレーションを１０，０００回行った。人工的な遺伝子のシミュレーションにおいて、パラメータをｑ_＋＝０．５、Ｎ＝１，０００とし、ｑ₋の値は変化させた。また、このシミュレーションは１，０００回行った。
それぞれのシミュレーションの後で、相の情報を削除し、データを本アルゴリズムにかけた。それぞれの個体について真のディプロタイプ形と推定したディプロタイプ分布とを比較し、真のディプロタイプ形が推定された最も確率の高いディプロタイプ分布と一致している場合を、正確な推定とした。その結果として、ディプロタイプ形の推定が正確であった個体の割合を記録した。結果を表２４に示す。

表２４に示すように、表現形データを追加した場合（本アルゴリズムを用いる場合）には、ディプロタイプ形が正しく推定された個体の割合が増加していることから、ディプロタイプ形の推定がより正確になることが明らかとなった。
ケースコントロール研究；帰無仮説における第１種の過誤に関する検討
本アルゴリズムを用いて、ＳＡＡ遺伝子（上記表２０）のハプロタイプ頻度および人工的に作った連鎖不平衡の弱いハプロタイプ頻度（以後、ＡＲＴと略す）に基づき、ケースコントロール研究の場合の、帰無仮説における第１種の過誤をシミュレーションにより評価した。本シミュレーションに使用したハプロタイプ頻度を表２５に示す。

表２５に示した２種類のハプロタイプ頻度に基づいたシミュレーションの結果、得られた第１種の過誤の値をそれぞれ図２７及び図２８に示す。図２７はＳＡＡ遺伝子のハプロタイプ頻度のデータを用いた結果を示す特性図であり、図２８はＡＲＴのハプロタイプ頻度のデータを用いた結果を示す特性図である。なお、これら図２７及び図２８においては、本アルゴリズムによるχ^２値と完全データに基づく分割表によるχ^２値とを、誤差棒が見やすいようにずらして表示している。
図２７及び図２８より、ＳＡＡ遺伝子のハプロタイプ頻度のデータを用いた場合にはサンプル数４００（ケース＝コントロール＝２００）以上で、ＡＲＴのハプロタイプ頻度のデータを用いた場合にはサンプル数６００（ケース＝コントロール＝３００）以上で第１種の過誤がシミュレーションの統計誤差範囲で有意水準０．０５と一致することが確認された。
ケースコントロール研究の結果として得られた個人の遺伝子型データと表現型データとを上述したアルゴリズムに適用した場合には、個体に対してディプロタイプ形の推定を行なった後に分割表を用いて検定する方法と比較して以下のような優位性を示すことができる。ここでは、個人のディプロタイプ形決定方法として以下の４つの手法と比較する。
１．ケース集団とコントロール集団について別々にディプロタイプ形の推定を行ない、個人に対して複数のディプロタイプ形がありうる場合には、ディプロタイプ形の確率が最大のものをその個人のディプロタイプ形として採用する。（以下、ｓｅｐａｒａｔｅ０と表記）
２．ケース集団とコントロール集団について別々にディプロタイプ形の推定を行ない、個人に対して複数のディプロタイプ形がありうる場合には、ディプロタイプ形の確率に従って個人を分割する。（以下、ｓｅｐａｒａｔｅ１と表記）
３．ケース集団とコントロール集団を合わせた全集団に対してディプロタイプ形の推定を行ない、ディプロタイプ形の確率が最大のものをその個人のディプロタイプ形として採用する。（以下、ｓｅｐａｒａｔｅ２と表記）
４．ケース集団とコントロール集団を合わせた全集団に対してディプロタイプ形の推定を行ない、ディプロタイプ形の確率に従って個人を分割する。（ｓｅｐａｒａｔｅ３と表記）
上記ｓｅｐａｒａｔｅ０〜ｓｅｐａｒａｔｅ３の手法に基づき作成した分割表を元にピアソンの検定統計量を算出する。また、ディプロタイプ形の完全データに基づいて作成した分割表から得られたピアソン検定統計量（ｄｉｐｌｏｔｙｐｅと表記）および本アルゴリズムによる尤度比検定統計量（Ｐｅｎｈａｐｌｏと表記）についても同時に算出し、これらの統計量の相関係数を調べた。表２５に示したＡＲＴのハプロタイプ頻度のデータに基づくシミュレーション結果を表２６に示す。

表２６から判るように、完全データに基づく検定統計量（ｄｉｐｌｏｔｙｐｅ）と最も相関が強いものは、本アルゴリズムによる尤度比検定統計量（Ｐｅｎｈａｐｌｏ）で、次に相関が高いものはｓｅｐａｒａｔｅ３による分割方法であった。また、本アルゴリズムによる帰無仮説の第１の過誤とｓｅｐａｒａｔｅ３手法による帰無仮説の第１の過誤とをシミュレーションにより求めた結果を図２９に示す。
図２９より、本アルゴリズムによる第１の過誤は有意水準０．０５と一致するのに対し、ｓｅｐａｒａｔｅ３手法は第１の過誤を過小評価することが分かる。このことから、本アルゴリズムによる解析は、既存のディプロタイプ形を推定後、分割表により検定する手法より優位であると言える。
＜現実データの分析１＞
現実に収集されたデータに本アルゴリズムを適用した。現実データとして、ＭＴＨＦＲ遺伝子［Ｕｒａｎｏｅｔａｌ．（２００２）Ｐｏｌｙｍｏｒｐｈｉｓｍｓｉｎｔｈｅｍｅｔｈｙｌｅｎｅｔｅｔｒａｈｙｄｒｏｆｏｌａｔｅｒｅｄｕｃｔａｓｅｇｅｎｅｗｅｒｅａｓｓｏｃｉａｔｅｄｗｉｔｈｂｏｔｈｔｈｅｅｆｆｉｃａｃｙａｎｄｔｈｅｔｏｘｉｃｉｔｙｏｆｍｅｔｈｏｔｒｅｘａｔｅｕｓｅｄｆｏｒｔｈｅｔｒｅａｔｍｅｎｔｏｆｒｈｅｕｍａｔｏｉｄａｒｔｈｒｉｔｉｓ，ａｓｅｖｉｄｅｎｃｅｄｂｙｓｉｎｇｌｅｌｏｃｕｓａｎｄｈａｐｌｏｔｙｐｅａｎａｌｙｓｅｓ．Ｐｈａｒｍａｃｏｇｅｎｅｔｉｃｓ．１２：１８３−１９０］とＮＡＴ２遺伝子［Ｔａｎａｋａｅｔａｌ．（２００２）ＡｄｖｅｒｓｅｅｆｆｅｃｔｓｏｆｓｕｌｆａｓａｌａｚｉｎｅｉｎｐａｔｉｅｎｔｓｗｉｔｈｒｈｅｕｍａｔｏｉｄａｒｔｈｒｉｔｉｓａｒｅａｓｓｏｃｉａｔｅｄｗｉｔｈｄｉｐｌｏｔｙｐｅｃｏｎｆｉｇｕｒａｔｉｏｎａｔｔｈｅＮ−ａｃｅｔｙｌｔｒａｎｓｆｅｒａｓｅ２ｇｅｎｅ．ＪＲｈｅｕｍａｔｏｌ２９：２４９２−２４９９］に関するデータを用いた。いずれのデータもコホート研究に由来するものである。
ＭＴＨＦＲ遺伝子に関係するデータのセットは、関節リウマチの患者のコホート研究に由来するものである。メトトレキサートを服用した１０４人の患者について副作用が出たかどうか、ＭＴＨＦＲ遺伝子の２つのＳＮＰがどうなっているかについて調べた。なお、一つのハプロタイプが「表現型に関連するハプロタイプ」と仮定されている。
その結果、統計量−２Ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）は６．８０７４であり、確かに統計的に有意であった（Ｐ＜０．０１）。この場合の“ｑ_＋ハット”及び“ｑ₋ハット”の最尤推定値はそれぞれ０．２５７１及び０．０５８８であった。即ち、相対危険の最尤推定値は４．３７であった。また、それぞれの個体のディプロタイプ分布は一つの事象に集中した。データは示さないが、すべての個体ついて、対立仮説の下で推定されたディプロタイプ形は、帰無仮説の下で推定されたもの、あるいはＬＤＳＵＰＰＯＲＴにより推定されたものとほぼ同じであった。即ち、この例では表現型データを用いる本アルゴリズムであろうと、表現型データを用いないＬＤＳＵＰＰＯＲＴであろうと、推定されたディプロタイプ形はほぼ同じとなった。“Θハット”もまた、本アルゴリズムであろうとＬＤＳＵＰＰＯＲＴであろうと変化はなかった。
一方、ＮＡＴ２遺伝子に関係するデータセットも関節リウマチ患者のコホート研究により得られたものである。スルファサラジンを服用した１４４人の患者について副作用の発生とＮＡＴ２遺伝子における７個のＳＮＰについて検索したものである［上記Ｔａｎａｋａらの文献参照］。野生型ハプロタイプとしてわかっている一つのハプロタイプが「表現型に関連したハプロタイプ」と仮定されていた。統計量−２ｌｏｇ（Ｌ_０ｍａｘ／Ｌ_ｍａｘ）を計算すると、１３．４６２９となるが、これは有意であり（Ｐ＜０．００１）このハプロタイプの存在が副作用と関係していることを示す。また、最尤推定値“ｑ_＋ハット”及び“ｑ₋ハット”はそれぞれ０．０８０９と０．６２４８であった。即ち、相対危険の最尤推定値は０．１２９である。即ち、「表現型に関連したハプロタイプ」存在は副作用の低下と関連していた。
なお、それぞれの個体のディプロタイプ分布は１人を除いて一つの事象に集中した。すべての個体について、対立仮説の下で本アルゴリズムにより推定されたディプロタイプ形は帰無仮説やＬＤＳＵＰＰＯＲＴにより推定されたものと同じであった。
以上、現実データを用いた分析の結果より、本アルゴリズムは、遺伝子型データと表現型データとを用いて、浸透率を算出することができる。これにより、本アルゴリズムによれば、遺伝子型データと表現型データより個体のレベルで表現型とハプロタイプの関連を検定することができる。また、本アルゴリズムによれば、特定のハプロタイプのあるなしにより異なった浸透率が対応するという仮定の下で、浸透率を最尤推定することもでき、更に、それらの浸透率から当然相対危険の最尤推定値も得ることができる。
ところで、上述した不完全ハプロタイプＨ_ｌを検定対象として、スルファサラジンを服用した１４４人の関節リウマチ患者のコホート研究から得られたＮＡＴ２遺伝子に関する７個のＳＮＰからなる遺伝子型データと副作用のデータに対して、解析を実施した。上述したように、このデータからは、野生型ハプロタイプを持っていると持っていない場合と比較して副作用が発生するリスクが０．１２９倍と小さくなることが示された。
具体的に、野生型ハプロタイプは「ＧＣＴＣＧＡＧ」というＳＮＰの並びで表されるが、不完全ハプロタイプを上記に示した方法で構築すると野性型ハプロタイプを指定しなくても「ＧＣ^＊＊＊＊Ｇ」が最も有意であるという解析結果が得られた。ここで「^＊」はマスクされたことを示している。すなわち、本例では、３座位目〜６座位目まではマスクすることができ、１、２及び７座位目のＳＮＰについて構築した不完全ハプロタイプＨ_ｌを検定対象とすることができる。そして、構築した不完全ハプロタイプＨ_ｌを検定対象として、本アルゴリズムを適用した結果、「ＧＣ^＊＊＊＊Ｇ」が最も有意であることが判った。なお、「ＧＣ^＊＊＊＊Ｇ」において、マスクされた座位は他の座位の情報で表現されていることを意味しており、２座位目が「Ｃ」、７座位目が「Ｇ」であることが表現型と関連していることを示している。野生型ハプロタイプ「ＧＣＴＣＧＡＧ」は、正しく「ＧＣ^＊＊＊＊Ｇ」という表現と一致するハプロタイプとなっている。このように、不完全ハプロタイプＨ_ｌを用いることによって、「表現型に関連したハプロタイプ」を探索できることが判る。
さらに、不完全ハプロタイプＨ_ｌを用いた他の適用例として、メソトレキサートを服用した１７５人の関節リウマチ患者のコホート研究から得られたＡＢＣＢ８遺伝子に関する３つのＳＮＰからなる遺伝子型データと、副作用を生じたときに投与される葉酸の投与の有無のデータに対して、検定対象に不完全ハプロタイプを用いた解析を実施した。その結果を表２７に示す。

表２７に示す解析結果において、最も低いＰ値を示しているのはハプロタイプ「ＣＧ^＊」であり、全ての座位の情報を用いたハプロタイプ「ＣＧＡ」よりもより有意であることが示されている。これは、第１座位と第２座位による不完全ハプロタイプ「ＣＧ」が、表現型をとる原因とより強い関連があることを示している。すなわち、原因座位は「ＣＧＡ」及び「ＣＧＧ」の両方のハプロタイプと関連しており、第３座位をマスクすることにより、表現型との関連が強く表われたものと理解することができる。この結果は、不完全ハプロタイプＨ_ｌを用いることにより、複数のハプロタイプと関連する表現型に対しても検出することが可能となることを示すものである。
産業上の利用の可能性
本発明により、ハプロタイプの解析方法が提供される。本発明の方法によれば、ＳＮＰひとつひとつを解析することなく、ハプロタイプを利用して薬物等の感受性に関する個人差を評価し、個別化医療（オーダーメイド医療又はテーラーメイド医療）のためのツールとして利用することができる。

Program “Penhaplo”
The test method used in Example 5, a maximum likelihood method based on the generalized likelihood ratio, is described below. The method estimates penetrance from genotype data and phenotype data observed in a given population, and then uses the penetrance estimates so obtained to estimate, from the genotype data of a new individual, the probability that this individual expresses the phenotype. In other words, it uses maximum likelihood to test whether carrying a specific haplotype is related to a qualitative phenotype.
The penetrance estimation method and the phenotype expression probability estimation method according to the present invention are realized by the algorithm described below (hereinafter "the present algorithm"). The present algorithm can use genotype data and phenotype data observed in a given population obtained from a cohort study or clinical treatment trial, or genotype data and phenotype data observed in a given population obtained from a case-control study.
1. Application to cohort studies or clinical treatment trials
The following first describes a method that uses individual genotype data and phenotype data obtained from a cohort study or clinical treatment trial to test the association between a phenotype and the presence of a haplotype, and to estimate haplotype-based penetrance. The present algorithm is based on the EM (expectation-maximization) algorithm.
In addition to the haplotype frequencies of the population, the present algorithm can estimate penetrances that differ between individuals who carry a haplotype and individuals who do not. The algorithm therefore also yields a maximum likelihood estimate of the relative risk. Specifically, the algorithm first computes the maximum likelihood under the assumption that the phenotype and the haplotype are unrelated (L_0max; the null hypothesis, i.e. a single penetrance) and the maximum likelihood under the hypothesis that they are related (L_max; the alternative hypothesis, i.e. two penetrances). It then calculates a statistic, for example -2 log(L_0max / L_max) (hereinafter simply "the statistic"), and tests the association between the phenotype and the haplotype on the basis of this statistic.
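The likelihood ratio test described above can be sketched in a few lines. This is a minimal illustration, not the PENHAPLO implementation: the function names are hypothetical, the inputs are assumed to be the log-likelihoods log L_0max and log L_max, and the P value relies on the fact that a χ^2 variable with one degree of freedom is the square of a standard normal variable.

```python
from math import erfc, sqrt


def lr_statistic(loglik_null, loglik_alt):
    """Return -2 log(L_0max / L_max) from the two maximized log-likelihoods."""
    # Clamp tiny negative values caused by numerical error in the maximization.
    return max(-2.0 * (loglik_null - loglik_alt), 0.0)


def chi2_df1_pvalue(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom.

    If Z is standard normal, P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(stat / 2.0))


if __name__ == "__main__":
    stat = lr_statistic(loglik_null=-10.0, loglik_alt=-8.0795)
    print(stat, chi2_df1_pvalue(stat))
```

A statistic above 3.841 corresponds to a P value below 0.05 under this approximation.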
The present algorithm can be applied to a method for estimating, from the genotype information of a subject, the probability that the subject expresses a phenotype. That is, based on the genotype information of a subject, the algorithm can estimate the probability that a given phenotype is expressed in that subject. The algorithm is therefore particularly useful for analyzing the relationship between genetic factors and individual responses to drugs. The algorithm can be executed on a computer by implementing it in the computer software PENHAPLO. Here, a computer comprises a CPU (control means) that controls all operations; a keyboard and mouse for entering instructions such as program execution; input means for acquiring various data contained in a storage medium (database) or the like; display means such as a display device; memory in which temporary information, programs and the like are recorded; and storage means such as a hard disk in which various data, programs and the like are stored. The computer may also be connected to an external database, other computers and the like via a communication network such as the Internet.
To execute the present algorithm on a computer, the computer software PENHAPLO is installed. The computer can then execute the algorithm under the control of the CPU in accordance with PENHAPLO. The genotype data and phenotype data may be acquired by the input means via a communication network such as the Internet, or from a storage medium storing the genotype data and/or phenotype data.
That is, the present algorithm causes a computer to function as input means for acquiring genotype data and phenotype data, and as control means (calculation means) for obtaining, from the genotype data and phenotype data acquired by the input means, the maximum likelihood estimates of the haplotype frequencies and penetrances and the maximum likelihood (L_0max). The algorithm can also cause the computer to obtain, by the control means, the penetrances from the maximum likelihood estimates. Furthermore, it can cause the computer to obtain, by the control means, the likelihood ratio from the maximum likelihood (L_0max) and the maximum likelihood (L_max), and to test against the χ^2 distribution the hypothesis that a given diplotype form is associated with a given phenotype. Finally, it can cause the computer to obtain, by the control means, the probability that an individual under examination expresses a given phenotype, using the maximum likelihood estimates and the genotype data of that individual.
Here, genotype data are data obtained by performing SNP typing or the like on an individual, and include information indicating the positions of polymorphisms and information indicating the types of polymorphisms. Genotype data may be anonymized by removing information that identifies the individual.
Phenotype data are binary data indicating whether or not an individual has a given phenotype. Examples of specific phenotypes include the presence or absence of a drug effect or side effect, or of a disease, as determined by clinical examination, diagnosis and the like.
(The present algorithm)
The present algorithm is described in detail below.
<Sample space>
Suppose there are l linked SNP loci. The number of all possible haplotypes is L = 2^l. We define an infinite set of haplotype copies in which the haplotype frequencies are Θ = (θ_1, ..., θ_j, ..., θ_L), where θ_j is the frequency of the j-th haplotype and

$$\sum_{j=1}^{L} \theta_j = 1.$$

Each of the N individuals is given two ordered haplotype copies drawn at random from the set of haplotype copies. Let a_1, a_2, ..., a_{L^2} be the possible diplotype forms. The probability that the diplotype form of the i-th individual is a_k is P(d_i = a_k | Θ) = θ_l θ_m, where d_i is the diplotype form of the i-th individual and l and m are the indices of the haplotypes constituting a_k. This means that Hardy-Weinberg equilibrium is assumed at the haplotype level. The i-th individual develops phenotype ψ_+ with a probability expressed as a function of d_i. In theory, a penetrance could be assumed for every diplotype form, but assigning a penetrance to every diplotype form is not realistic. The present algorithm therefore assumes only two penetrances. Let H_all be the set of all haplotypes, and let H_+ be a subset of H_all consisting of the haplotypes whose presence gives rise to a phenotype different from the others. In a typical case H_+ contains a single haplotype, but it may contain several haplotypes as elements. If H_+ is defined as the set of all haplotypes carrying a specific allele at a specific locus, the test is equivalent to testing the association between the allele (rather than a haplotype) and the phenotype.
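The sample space described above can be made concrete with a small sketch. It assumes biallelic loci coded 0 (major) and 1 (minor), represents a haplotype as a tuple of alleles and a diplotype form as an ordered pair of haplotypes, and uses hypothetical function names; the frequencies in the example are invented for illustration only.

```python
from itertools import product


def haplotypes(n_loci):
    """All L = 2**n_loci haplotypes over biallelic loci (0 = major, 1 = minor)."""
    return list(product((0, 1), repeat=n_loci))


def diplotype_probabilities(theta):
    """P(d_i = a_k | Theta) = theta_l * theta_m for every ordered pair (h_l, h_m).

    theta maps each haplotype to its frequency; the result has L**2 entries,
    which is the Hardy-Weinberg assumption at the haplotype level.
    """
    return {(hl, hm): theta[hl] * theta[hm] for hl in theta for hm in theta}


def carries(diplotype, h_plus):
    """True if the diplotype form contains at least one haplotype from H_+."""
    return diplotype[0] in h_plus or diplotype[1] in h_plus


if __name__ == "__main__":
    # Illustrative frequencies for two loci (made up for this example).
    theta = dict(zip(haplotypes(2), (0.5, 0.2, 0.2, 0.1)))
    probs = diplotype_probabilities(theta)
    h_plus = {(1, 1)}
    # Probability that an individual belongs to D_+, i.e. 1 - (1 - 0.1)**2.
    print(sum(p for d, p in probs.items() if carries(d, h_plus)))
```

Using ordered pairs keeps the probability of every diplotype form a plain product θ_l θ_m, at the cost of listing each heterozygous pair twice.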
The subset of H_all is not limited to H_+; the set H_l described below may also be defined. Based on the haplotype distribution estimated by the EM algorithm, the loci that provide information distinguishing individual haplotypes are identified, together with those among them whose information is redundant in combination with other loci, and H_l is defined as a subset of H_all by masking the identified loci.
Here, masking means concealing information by treating one or more specific loci, among the loci constituting the haplotype, as matching every polymorphism. An incomplete haplotype that is partly concealed by a mask is therefore a concept expressed as a set H_l containing multiple haplotypes as elements. From this definition, clearly

holds. Furthermore, when an incomplete haplotype is constructed by using the information of only one locus and masking the others, it is equivalent to a SNP, because only the SNP information of the locus used is employed. If this is defined as the set H_SNP, then H_SNP is a special case of H_l, and

holds.
The rationale for using an incomplete haplotype H_l instead of H_+ as the test object is as follows.
1) When genetic polymorphism is expressed in terms of haplotypes, the polymorphism arises from base substitution and recombination. When the haplotypes of a region are associated with a causal locus for a phenotype and more than one haplotype is associated with it, this results from mutation or recombination that occurred after the causal locus arose. By constructing incomplete haplotypes, a mutation can be expressed as a mask on a specific locus, and a recombination as a mask on consecutive loci.
2) When incomplete haplotypes are constructed from the SNP information of L loci, varying the number of masked loci from 0 to L-1 makes it possible to include everything from haplotypes that use all the information down to single SNPs among the objects tested by the present algorithm.
Consider constructing all incomplete haplotypes H_l from the SNP information of L loci. Since three kinds of information (two alleles and the mask operation) are applied at each locus, 3^L - 1 combinations would in principle have to be considered. This is an enormous number even compared with the 2^L combinations in haplotype estimation. In regions of strong linkage disequilibrium, however, fewer than ten haplotypes actually occur in most cases, so it is not necessary to construct every combination. The incomplete haplotypes H_l can therefore be constructed according to steps 1 to 3 below.
1. Estimate the haplotypes with the EM algorithm.
2. From the estimated haplotype distribution, extract the loci that provide information distinguishing individual haplotypes, and then remove those whose information is redundant in combination with the others, thereby determining the haplotype tag SNPs (hereinafter, htSNPs).
3. Apply the three kinds of information (two alleles and the mask operation) in turn to the htSNP loci to construct the incomplete haplotypes H_l. Because different masks can yield the same H_l, such duplicates are removed.
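Step 3 above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in PENHAPLO: haplotypes are given as strings, the htSNP loci are assumed to have been chosen already (step 2 is not implemented), loci that are not htSNPs are always masked, and the function name and the `*` mask symbol are choices made for this example.

```python
from itertools import product

MASK = "*"


def incomplete_haplotypes(observed, ht_loci):
    """Enumerate incomplete haplotypes over the htSNP loci.

    observed: haplotype strings estimated in step 1 (e.g. "011").
    ht_loci:  0-based indices of the htSNP loci selected in step 2.
    Returns a dict mapping each masked pattern to the frozenset of observed
    haplotypes it matches, keeping one pattern per distinct set.
    """
    observed = list(observed)
    n_loci = len(observed[0])
    # Two alleles plus the mask at each htSNP locus: up to 3**k - 1 patterns.
    options = [sorted({h[i] for h in observed}) + [MASK] for i in ht_loci]
    kept = {}
    for choice in product(*options):
        if all(c == MASK for c in choice):
            continue  # the fully masked pattern is H_all and carries no information
        pattern = [MASK] * n_loci
        for locus, c in zip(ht_loci, choice):
            pattern[locus] = c
        members = frozenset(
            h for h in observed
            if all(p in (MASK, x) for p, x in zip(pattern, h))
        )
        # Skip patterns matching nothing and masks that rebuild a known set.
        if members and members not in kept.values():
            kept["".join(pattern)] = members
    return kept


if __name__ == "__main__":
    for pattern, members in incomplete_haplotypes(["000", "011", "110"], [0, 1]).items():
        print(pattern, sorted(members))
```

Restricting the enumeration to haplotypes that were actually estimated is what keeps the number of candidate sets small in regions of strong linkage disequilibrium.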
The following describes the case where H_+ is the test object; an incomplete haplotype H_l can be used as the test object in the same manner.
Let D_+ be the set of diplotype forms that contain an element of H_+. Let q_+ be the probability that the i-th individual develops phenotype ψ_+ under the condition d_i ∈ D_+, and let q_- be the probability that the i-th individual develops phenotype ψ_+ under the condition d_i ∉ D_+. That is, letting ψ_i be the phenotype of the i-th individual,

$$P(\psi_i = \psi_+ \mid d_i \in D_+) = q_+, \qquad P(\psi_i = \psi_+ \mid d_i \notin D_+) = q_-.$$

Here Θ is independent of q_+ and q_-.
Thus, unlike the conventional EM algorithm, the present algorithm includes the process by which the phenotype arises. In addition to Θ, parameters such as q_+ and q_- are included in the definition of the probability space. Note in particular that, in the present algorithm, ψ_i is independent of Θ conditional on d_i.
<Likelihood function>
The observed data used in the present algorithm are the genotype data and phenotype data of the individuals. Let the vector of genotype data be G_obs = (g_1, g_2, ..., g_N) and the vector of phenotype data be Ψ_obs = (w_1, w_2, ..., w_N), where g_i and w_i are the observed genotype and phenotype of the i-th individual, respectively. The likelihood function is then as follows.

Here A_i is the set of a_k consistent with g_i for the i-th individual.
Since d_i is independent of q_+ and q_-, and ψ_i is independent of Θ conditional on d_i,

Here A_i is the set of a_k consistent with g_i for the i-th individual.
For any i and k,

holds.
For the loci examined, under the null hypothesis that the phenotype is independent of the diplotype form, the likelihood function is

where q_0 is the penetrance common to all diplotype forms and A_i is the set of a_k consistent with g_i for the i-th individual.
For any i and k,

holds.
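The two likelihoods described in this section can be written out as follows. This is a reconstruction from the definitions given in the text (Hardy-Weinberg diplotype probabilities, the two penetrances q_+ and q_-, and the common null penetrance q_0), offered as a sketch rather than as the original typeset formulas; the labels (1) and (2) are assumed to correspond to the equation numbers cited in the EM section.

```latex
% Alternative hypothesis: two penetrances, q_+ for D_+ and q_- otherwise.
L(\Theta, q_+, q_-)
  = \prod_{i=1}^{N} \sum_{a_k \in A_i}
    P(d_i = a_k \mid \Theta)\,
    P(\psi_i = w_i \mid d_i = a_k, q_+, q_-)
  \tag{1}

P(\psi_i = w_i \mid d_i = a_k, q_+, q_-) =
\begin{cases}
  q_+     & a_k \in D_+,\ w_i = \psi_+ \\
  1 - q_+ & a_k \in D_+,\ w_i \neq \psi_+ \\
  q_-     & a_k \notin D_+,\ w_i = \psi_+ \\
  1 - q_- & a_k \notin D_+,\ w_i \neq \psi_+
\end{cases}

% Null hypothesis: a single penetrance q_0 for every diplotype form.
L_0(\Theta, q_0)
  = \prod_{i=1}^{N} \sum_{a_k \in A_i}
    P(d_i = a_k \mid \Theta)\,
    P(\psi_i = w_i \mid d_i = a_k, q_0)
  \tag{2}

P(\psi_i = w_i \mid d_i = a_k, q_0) =
\begin{cases}
  q_0     & w_i = \psi_+ \\
  1 - q_0 & w_i \neq \psi_+
\end{cases}
```

In both cases P(d_i = a_k | Θ) = θ_l θ_m for the haplotypes l and m that make up a_k, so setting q_+ = q_- = q_0 in (1) gives (2).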
<EM algorithm>
In the present algorithm, equation (1) above is maximized over Θ, q_+ and q_-, and the resulting maximum likelihood is taken as L_max. Likewise, equation (2) above is maximized over Θ and q_0, and the resulting maximum likelihood is taken as L_0max. The likelihood ratio L_0max / L_max is then used to test the association between the presence of a haplotype and the phenotype.
For the maximization that gives L_max, the parameters to be estimated are Θ = (θ_1, θ_2, ..., θ_L), q_+ and q_-. For the maximization that gives L_0max, the parameters to be estimated are Θ = (θ_1, θ_2, ..., θ_L) and q_0. The parameter space spanned in the maximization of L_0max is a subspace of the space spanned in the maximization of L_max. Under the null hypothesis, -2 log(L_0max / L_max) follows a χ^2 distribution with one degree of freedom.
  If complete data were available for d₁, d₂, ..., d_N and ψ₁, ψ₂, ..., ψ_N, the maximum likelihood estimators of θ₁, θ₂, ..., θ_L and of q₊ and q₋ would be

(where j = 1, 2, ..., L). Here n_j is the number of copies of the j-th haplotype among the N individuals. In addition,

Here, #{i; ...} denotes the number of individuals i that satisfy the condition following “;”.
However, complete data on the diplotype forms of the individuals are not available; only the genotype data and phenotype data of the individuals are observed. We therefore construct the following algorithm, which substitutes the expected values of n_j/(2N), N_{+ψ+}/N₊ and N_{−ψ+}/N₋ for their true values. The algorithm comprises the following steps (i) to (viii).
(i) For n = 0, give initial values (e.g., θ_j^(n) = 1/L) to Θ^(n) = (θ₁^(n), θ₂^(n), ..., θ_L^(n)), where

holds.
(ii) For n = 0, give initial values to q₊^(n) and q₋^(n), where 0 < q₊^(n), q₋^(n) < 1.
(iii) For every i and every a_k consistent with g_i, calculate the following.

Here A_i is the set of a_m consistent with g_i. Note that this algorithm examines only the a_k that are consistent with g_i. Since d_i is independent of q₊^(n) and q₋^(n), and ψ_i is independent of Θ^(n) conditional on d_i, equation (3) above becomes:

(iv) Since n_j, the number of copies of the j-th haplotype carried by the N individuals, is a random variable, its expected value can be defined as follows:

Here f_j(a_k) is the number of copies of the j-th haplotype in a_k, and A_i is the set of a_k consistent with g_i for the i-th individual. f_j(a_k) takes the value 0, 1 or 2. This expected value is calculated for all j.
(v) Since N_{+ψ+}/N₊ and N_{−ψ+}/N₋ are random variables, their expected values can be defined as follows.

Also,

where

holds.
(vi) Θ is updated for the next step from the result of step (iv) as follows:

From the result of step (v), the penetrances are updated for the next step as follows.

(vii) Steps (iii) to (vi) are repeated until the values converge.
The values at convergence are taken as the maximum likelihood estimates:

(viii) To avoid local maxima, various initial values of θ_j^(0) (j = 1, 2, ..., L), q₊^(0) and q₋^(0) are tested. Here,

is the maximum likelihood L_max under the alternative hypothesis.
  If q₀ = q₊ = q₋ is imposed and steps (iii) to (vii) are repeated, the maximum likelihood L_0max under the null hypothesis is obtained.
  Under the null hypothesis, the statistic −2 log(L_0max/L_max) is expected to follow a χ² distribution with one degree of freedom asymptotically.
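Steps (i) to (vii) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the specification: the loci are taken to be biallelic and coded 0/1, every one of the 2^n possible haplotypes is enumerated, a diplotype form is an unordered pair of haplotypes, the penetrances are updated as ratios of expected counts, and convergence is judged on the log-likelihood. The names `compatible_diplotypes` and `em_penetrance` are hypothetical. Step (viii), restarting from several initial values, is left to the caller.

```python
import math
from itertools import product

def compatible_diplotypes(genotype, haplotypes):
    """Unordered pairs (j, l), j <= l, of haplotype indices whose per-locus
    allele sums reproduce the unphased genotype (the set A_i)."""
    return [(j, l)
            for j in range(len(haplotypes))
            for l in range(j, len(haplotypes))
            if all(a + b == g
                   for a, b, g in zip(haplotypes[j], haplotypes[l], genotype))]

def em_penetrance(genotypes, phenotypes, target, null=False,
                  max_iter=500, tol=1e-10):
    """genotypes: tuples of allele counts (0, 1 or 2) per locus;
    phenotypes: 1 for psi_plus, 0 otherwise;
    target: the assumed phenotype-related haplotype as a tuple of 0/1;
    null: if True, constrain q_plus = q_minus (= q_0).
    Returns (theta, q_plus, q_minus, log_likelihood)."""
    haplotypes = list(product((0, 1), repeat=len(target)))
    t = haplotypes.index(tuple(target))
    n_hap, n_ind = len(haplotypes), len(genotypes)
    candidates = [compatible_diplotypes(g, haplotypes) for g in genotypes]
    theta = [1.0 / n_hap] * n_hap                         # step (i)
    q_plus, q_minus = (0.5, 0.5) if null else (0.6, 0.4)  # step (ii), arbitrary
    previous = None
    for _ in range(max_iter):
        n_exp = [0.0] * n_hap   # expected haplotype copy numbers, step (iv)
        carriers = [0.0, 0.0]   # expected individuals outside / inside D_plus
        affected = [0.0, 0.0]   # of those, expected number showing psi_plus
        loglik = 0.0
        for pairs, w in zip(candidates, phenotypes):
            joint = []          # step (iii): P(d_i = a_k) * P(w_i | a_k)
            for j, l in pairs:
                p_d = theta[j] * theta[l] * (1.0 if j == l else 2.0)
                q = q_plus if t in (j, l) else q_minus
                joint.append(p_d * (q if w else 1.0 - q))
            total = sum(joint)
            loglik += math.log(total)
            for (j, l), p in zip(pairs, joint):
                post = p / total            # posterior of a_k for individual i
                n_exp[j] += post
                n_exp[l] += post
                side = 1 if t in (j, l) else 0
                carriers[side] += post      # step (v)
                affected[side] += post * w
        if previous is not None and loglik - previous < tol:
            break                           # step (vii): converged
        previous = loglik
        theta = [n / (2.0 * n_ind) for n in n_exp]        # step (vi)
        if null:
            q_plus = q_minus = sum(affected) / n_ind
        else:
            if carriers[1] > 0:
                q_plus = affected[1] / carriers[1]
            if carriers[0] > 0:
                q_minus = affected[0] / carriers[0]
    return theta, q_plus, q_minus, loglik
```

Running the function once with `null=False` and once with `null=True` yields the two log-likelihoods from which −2 log(L_0max/L_max) is formed. The returned log-likelihood is the value computed in the last E-step, which matches the returned parameters when the loop stops on convergence.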
<Phenotype expression probability estimation algorithm>
  With the EM algorithm (the present algorithm) described above, the probability that a subject will express a predetermined phenotype can be estimated from the subject's genotype data. Let the target subject be the (N+1)-th individual and let the genotype observed in that subject be g_{N+1}. Then the probability that the subject expresses phenotype ψ₊,

can be estimated according to the following equation:

  Here d_{N+1} denotes the diplotype form of the (N+1)-th subject.
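One plausible reading of this estimate is a posterior-weighted average of the two estimated penetrances over the diplotype forms consistent with g_{N+1}. The sketch below assumes that reading; the function name `phenotype_probability` and the input layout are hypothetical.

```python
def phenotype_probability(candidates, q_plus_hat, q_minus_hat):
    """candidates: list of (prior, in_d_plus) pairs, one per diplotype form
    consistent with g_{N+1}; prior is its probability under the estimated
    haplotype frequencies and in_d_plus says whether it belongs to D_plus.
    Returns the estimated probability of expressing psi_plus."""
    total = sum(prior for prior, _ in candidates)
    return sum((prior / total) * (q_plus_hat if in_d_plus else q_minus_hat)
               for prior, in_d_plus in candidates)

# Two consistent diplotype forms with priors 0.03 (in D_plus) and 0.01 (not).
p = phenotype_probability([(0.03, True), (0.01, False)], 0.2, 0.1)
```

Dividing by `total` turns the priors into the conditional distribution of d_{N+1} given g_{N+1}.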
  The EM algorithm and the phenotype expression probability estimation algorithm (the present algorithm) can be implemented in, for example, computer software. Implementing the algorithm in software allows all of the calculations to be performed on a computer.
  In addition, the computer on which software implementing this algorithm is installed may be connected to an external network via a communication network. In that case, for example, individual genotype data and phenotype data obtained from a cohort study or clinical trial can be acquired via the communication network, and the penetrances and probabilities estimated by the algorithm can be output externally via the communication network.
2. Application to case-control studies
  Next, a method is described in which the genotype data and phenotype data of individuals obtained from a case-control study are applied to the algorithm described above in order to test the association between the phenotype and the presence of a haplotype and to estimate the haplotype-based penetrance.
  In a case-control study, the numbers of cases (affected) and controls (unaffected) are fixed. Therefore, if the case-to-control ratio is “case : control = ω : 1−ω” and the proportion of affected individuals in the whole population (the morbidity) is λ, cases amounting to ω are sampled from the affected portion (λ) of the sample space, and controls amounting to (1−ω) are sampled from the unaffected portion (1−λ).
  Thus, when whether or not an individual has a diplotype form in D₊ is associated with the presence or absence of a specific phenotype (for example, a specific disease), the parameters estimated in a case-control study are as given in Table 19 below.

In Table 19, ω is a constant, and

is the frequency of the diplotype form in the combined population of cases and controls. Unlike the estimated penetrances q₊ and q₋ obtained when the algorithm is applied to the cohort study or clinical trial described above, r₊ and r₋ do not represent penetrances. The relationship between these estimates can be expressed by the following equation.

As can be seen from the above equation, r₊ and r₋ are not independent of each other, so when this algorithm is applied to a case-control study, one fewer parameter is estimated than when it is applied to the cohort study or clinical trial described above. r₊ = r₋ holds when r₊ = r₋ = ω. Therefore, the null and alternative hypotheses when this algorithm is applied to a case-control study are

Further, the penetrance q₊ can be expressed in terms of the estimate r₊ as

Therefore, by substituting the estimate r₊ obtained by applying the algorithm described above to a case-control study, an estimate of the penetrance can be calculated from the results of the case-control study.
  The morbidity λ cannot be estimated from the results of case-control studies, so it must be given separately. For example, the morbidity λ can be obtained as a specific numerical value from a statistical survey on a target disease or a follow-up survey in a specific population.
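The conversion between r₊ and q₊ can be illustrated as follows. The formulas below are derived here from Bayes' theorem under the sampling scheme described above (a fraction ω of the sample drawn from the affected, 1−ω from the unaffected, morbidity λ); they are an assumption about the form of the equation, not a transcription of it. They do reproduce the stated property that r₊ = ω when the penetrance equals λ. The function names are hypothetical.

```python
def r_from_q(q, omega, lam):
    """Proportion of cases among sampled carriers of a diplotype form with
    population penetrance q, for case fraction omega and morbidity lam."""
    case_weight = omega * q / lam
    control_weight = (1.0 - omega) * (1.0 - q) / (1.0 - lam)
    return case_weight / (case_weight + control_weight)

def q_from_r(r, omega, lam):
    """Inverse of r_from_q: recover the penetrance q from the estimate r."""
    numerator = r * (1.0 - omega) * lam
    return numerator / (numerator + (1.0 - r) * omega * (1.0 - lam))

# Example with equal numbers of cases and controls and a morbidity of 0.01.
r_plus = r_from_q(0.02, omega=0.5, lam=0.01)
q_plus = q_from_r(r_plus, omega=0.5, lam=0.01)   # recovers 0.02 up to rounding
```

Because λ enters both formulas, the penetrance recovered from a case-control study is only as reliable as the externally supplied morbidity.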
  The relationship between r₊ and q₊ given by the above equation is shown in FIGS. 21 and 22. FIG. 21 is a characteristic diagram for λ = 0.1, 0.01 and 0.001 with equal numbers of cases and controls. FIG. 22 is a characteristic diagram for λ = 0.01 with case-to-control ratios (case/control) of 1, 5 and 10.
  In addition, even when the results of a case-control study are used, the probability that a subject will express a predetermined phenotype can be estimated from the subject's genotype data, as in the case of using the results of the cohort study or clinical trial described above.
<Simulation>
In the following description,

are referred to as “Θ hat”, “q₊ hat” and “q₋ hat”, respectively.
Empirical distribution of the statistic −2 log(L_0max/L_max) under the null hypothesis:
  First, the distribution of the statistic −2 log(L_0max/L_max) under the null hypothesis was studied by simulation. In this example, the haplotype frequencies Θ were taken from real data rather than simulated. That is, Θ obtained in a previous study of the SAA (serum amyloid A) gene was used [Moriguchi et al. (2001) A novel single-nucleotide polymorphism at the 5'-flanking region of SAA1 associated with risk of type AA amyloidosis secondary to rheumatoid arthritis. Arthritis Rheum 44: 1266-1272]. Various values between 0 and 1 were tried for q₀. Under the null hypothesis, the penetrance is the same for all diplotype forms.
  To begin the simulation, each of the N individuals was given two ordered haplotype copies drawn according to Θ. The phenotype of each individual was then determined from q₀; that is, q₀ is the probability that any individual expresses phenotype ψ₊. The phase information was then removed, the algorithm described above was applied, and the statistic −2 log(L_0max/L_max) was calculated. The simulation was repeated and the distribution of the statistic was examined.
  The haplotype data for the SAA gene comprise six SNPs. Table 20 shows the haplotypes of the SAA gene and their frequencies.

  In Table 20, the haplotype “ACCGTC” is the haplotype used as the “phenotype-related haplotype” in the simulation under the alternative hypothesis.
  According to Table 20, haplotype copies were drawn at random and each individual was given two ordered haplotype copies. Because the probability of expressing phenotype ψ₊ is the same for all individuals, phenotype ψ₊ was assigned to each individual with the fixed probability q₀. Various values from 0 to 1 were given to q₀.
  FIG. 23 shows the distribution of the statistic −2 log(L_0max/L_max) under various values of q₀ and N. In FIG. 23, a shows the result for q₀ = 0.2, N = 1,000; b for q₀ = 0.1, N = 1,000; c for q₀ = 0.2, N = 200; and d for q₀ = 0.2, N = 100. In FIG. 23, the histogram of the statistic is shown as a bar graph, and the probability density function of the χ² distribution with one degree of freedom is shown as a curve.
  Table 21 shows, for various values of q₀ and N, the type I error rate (α = 0.05) calculated on the assumption that the statistic follows a χ² distribution with one degree of freedom, together with the values of “q₊ hat” and “q₋ hat” estimated under the alternative hypothesis.

  In Table 21 above, the values of “q₊ hat” and “q₋ hat” are shown as mean ± standard deviation. The type I error rate is the proportion of trials in which the statistic exceeded 3.841 (the value at which the cumulative distribution function of the χ² distribution with one degree of freedom equals 0.95).
  From the results shown in FIG. 23 and Table 21, it became clear that, under the null hypothesis, this statistic asymptotically follows a χ² distribution with one degree of freedom.
Simulation under the alternative hypothesis:
  Next, a simulation was performed under the alternative hypothesis. Here, one of the haplotypes was assumed to be associated with phenotype ψ₊, and this haplotype is called the “phenotype-related haplotype”. In this simulation, D₊ was defined as the set of diplotype forms containing at least one copy of the “phenotype-related haplotype”. The two penetrances q₊ and q₋ were each given values between 0 and 1.
  In this simulation, two ordered haplotype copies were first drawn for each individual using Θ. Then q₊ or q₋ was given as the probability that each individual expresses phenotype ψ₊: the probability was q₊ if the individual carried the “phenotype-related haplotype” and q₋ if not.
  After the phase information was removed, the genotype data and phenotype data were subjected to the algorithm described above. The simulation was repeated many times and the results obtained were analyzed. For various values of q₊ and q₋, the power was calculated on the assumption that the statistic −2 log(L_0max/L_max) follows a χ² distribution with one degree of freedom under the null hypothesis.
  FIG. 24 shows the distribution of the statistic obtained by fixing the value of q₊ and varying the value of q₊/q₋ (the relative risk). Panels a to f of FIG. 24 show the results of calculating the statistic while varying q₋ under the following conditions.
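The data-generating step of these simulations can be sketched as below. Setting `q_plus == q_minus` gives the null-hypothesis simulation with a common q₀. The haplotype frequencies in the usage example are hypothetical placeholders and are not the values of Table 20; the function name `simulate_cohort` is likewise illustrative.

```python
import random

def simulate_cohort(hap_freqs, n_individuals, q_plus, q_minus, target, seed=0):
    """Draw two ordered haplotype copies per individual according to
    hap_freqs, assign phenotype psi_plus (coded 1) with probability q_plus
    for carriers of `target` and q_minus otherwise, then drop the phase.
    Returns a list of (genotype, phenotype, true_diplotype) records; the
    true diplotype is kept only so that estimates can be checked later."""
    rng = random.Random(seed)
    haplotypes = list(hap_freqs)
    weights = [hap_freqs[h] for h in haplotypes]
    records = []
    for _ in range(n_individuals):
        h1, h2 = rng.choices(haplotypes, weights=weights, k=2)
        # Unphased genotype: the sorted pair of alleles at each locus.
        genotype = tuple("".join(sorted(a + b)) for a, b in zip(h1, h2))
        penetrance = q_plus if target in (h1, h2) else q_minus
        phenotype = 1 if rng.random() < penetrance else 0
        records.append((genotype, phenotype, (h1, h2)))
    return records

# Hypothetical two-SNP haplotype frequencies, for illustration only.
example_freqs = {"AC": 0.5, "AT": 0.3, "GC": 0.2}
cohort = simulate_cohort(example_freqs, 1000, q_plus=0.2, q_minus=0.1, target="AC")
```

Fixing the seed makes a run reproducible; repeating the call with different seeds gives the replicate data sets over which the statistic is summarized.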
(a) q₊ = 0.2, N = 1,000
(b) q₊ = 0.1, N = 1,000
(c) q₊ = 0.2, N = 200
(d) q₊ = 0.2, N = 100
(e) q₊ = 0.4, N = 100
(f) q₊ = 0.5, N = 100
  In a to f of FIG. 24, the horizontal line at 3.841 indicates the critical value (P = 0.05) of the χ² distribution with one degree of freedom.
  The same simulation was performed 10,000 times for various values of q₊, q₋ and N, and the proportion of trials in which the statistic exceeded 3.841 (the value at which the cumulative distribution function of the χ² distribution with one degree of freedom equals 0.95) was taken as the empirical power. The results are shown in FIG. 25. FIG. 25 shows that the power increases as q₊/q₋ (q₊/q₋ ≥ 1), i.e., the relative risk, increases, and also increases as N increases.
  Further, FIG. 26 shows the result of examining the influence of the frequency of the “phenotype-related haplotype” on the statistic. FIG. 26 made clear that the power peaks when the frequency of the “phenotype-related haplotype” takes an intermediate value between 0 and 1.
Distribution of the estimated penetrances “q₊ hat” and “q₋ hat”:
  Next, the distributions of the estimates “q₊ hat” and “q₋ hat” under the alternative hypothesis described above were examined. Specifically, under the alternative hypothesis, the penetrance q₊ was fixed at 0.2 and the penetrance q₋ was varied so that the relative risk (q₊/q₋) ranged from 1.0 to 2.0. The sample size N was fixed at 1,000. Using this algorithm, “q₊ hat” and “q₋ hat” under the alternative hypothesis and the statistic −2 log(L_0max/L_max) were calculated. This simulation was repeated 10,000 times for each parameter set. The results are shown in Table 22.
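The empirical power (and, for data simulated under the null hypothesis, the empirical type I error rate) is simply the fraction of replicate statistics above the critical value. A minimal sketch, with the hypothetical name `exceedance_rate`:

```python
def exceedance_rate(statistics, critical_value=3.841):
    """Fraction of simulated statistics strictly above the critical value."""
    return sum(1 for s in statistics if s > critical_value) / len(statistics)

# Four made-up replicate statistics, two of which exceed 3.841.
rate = exceedance_rate([0.5, 4.0, 3.841, 10.2])
```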

  In Table 22, “q₊ hat” and “q₋ hat” are shown as mean ± standard deviation. The empirical power is the proportion of trials in which the statistic exceeded 3.841 (the value at which the cumulative distribution function of the χ² distribution equals 0.95). As shown in Table 22, the estimates of q₊ and q₋ obtained with this algorithm were found to be fairly accurate, with relatively small variation. From this result, the estimate of q₊/q₋ derived from “q₊ hat” and “q₋ hat” is also considered relatively accurate.
  Next, it was examined whether “Θ hat” and the posterior probability distribution of the diplotype forms (the diplotype distribution) differ depending on whether phenotype data are included. Estimation without phenotype data was performed with the LDSUPPORT program, which calculates the diplotype distribution of each individual using genotype data only. When the penetrances satisfied q₀ = q₊ = q₋, that is, under the null hypothesis, “Θ hat” and the individual diplotype distributions were the same as the Θ and diplotype distributions estimated by LDSUPPORT. Although the data are not shown, “q₊ hat” and “q₋ hat” changed when an individual's phenotype was changed.
  Table 23 shows the results of a simulation under the alternative hypothesis using the Θ obtained for the SAA gene shown in Table 20. The simulation parameters were q₊ = 0.2, q₋ = 0.1 and N = 1,000. Table 23 shows representative data for four individuals.

  In Table 23, the “phenotype” column shows the disease status of each individual: N indicates an unaffected individual and A an affected individual. The “Diplotype distribution” column shows the posterior distribution of the diplotype form of each individual. The “genotype only” column shows the result of analysis with LDSUPPORT without phenotype information. The “+ phenotype” column shows the result of analysis by this algorithm using genotype data and phenotype data. The “Change phenotype” column shows the result of analysis by this algorithm using data in which only that individual's phenotype was inverted.
  As shown in Table 23, “Θ hat” and the estimated individual diplotype forms did change when phenotype data were entered or when an individual's phenotype was changed. Although not shown in Table 23, when the estimated diplotype distribution of an individual was concentrated on a single event, neither “Θ hat” nor the individual diplotype distribution changed much. In contrast, as shown in Table 22, when the individual diplotype distribution was not concentrated on a single event, both “Θ hat” and the individual diplotype distribution changed when phenotype data were entered or the individual's phenotype was changed.
  Next, for cases where “Θ hat” and/or the diplotype forms estimated with the phenotype (i.e., by this algorithm) differed from those estimated without the phenotype (i.e., by the LDSUPPORT program), we verified which algorithm was more accurate.
  Specifically, simulations were performed under the alternative hypothesis with Θ obtained from the SAA gene and from an artificially designed gene. For the artificially designed gene, data on six SNP loci in weak linkage disequilibrium with one another were created, and a set of haplotype copies was generated.
  In the simulation for the SAA gene, the parameters were q₊ = 0.5, q₋ = 0.125 and N = 100, and the simulation was performed 10,000 times. In the simulation for the artificial gene, the parameters were q₊ = 0.5 and N = 1,000, and the value of q₋ was varied; this simulation was performed 1,000 times.
  After each simulation, the phase information was removed and the data were applied to the algorithm. For each individual, the true diplotype form was compared with the estimated diplotype distribution, and the estimate was regarded as correct when the true diplotype form matched the diplotype form with the highest estimated probability. The percentage of individuals whose diplotype form was estimated correctly was recorded. The results are shown in Table 24.
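The accuracy criterion described above can be sketched as follows. The function name `diplotype_accuracy` and the dictionary layout of the posterior distributions are illustrative assumptions.

```python
def diplotype_accuracy(true_diplotypes, posteriors):
    """true_diplotypes: the simulated (true) diplotype form of each individual.
    posteriors: per individual, a dict mapping each candidate diplotype form
    to its estimated posterior probability.
    Returns the fraction of individuals whose most probable estimated
    diplotype form equals the true one."""
    correct = 0
    for truth, posterior in zip(true_diplotypes, posteriors):
        best = max(posterior, key=posterior.get)
        if best == truth:
            correct += 1
    return correct / len(true_diplotypes)

# Two individuals: the first is estimated correctly, the second is not.
accuracy = diplotype_accuracy(
    [("AC", "AT"), ("GC", "GC")],
    [{("AC", "AT"): 0.7, ("AT", "GC"): 0.3},
     {("GC", "GC"): 0.4, ("AC", "GC"): 0.6}],
)
```

Diplotype forms must be written in a consistent order (for example, sorted) so that equal forms compare equal.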

As shown in Table 24, when phenotype data were added (i.e., when this algorithm was used), the proportion of individuals whose diplotype form was correctly estimated increased, showing that the estimation of diplotype forms became more accurate.
Case-control study: examination of type I error under the null hypothesis
Using this algorithm, the type I error was evaluated by simulation based on the haplotype frequencies of the SAA gene (Table 20 above) and on artificially created haplotype frequencies with weak linkage disequilibrium (hereinafter abbreviated as ART). Table 25 shows the haplotype frequencies used in this simulation.

  The type I error values obtained by simulation based on the two sets of haplotype frequencies shown in Table 25 are shown in FIGS. 27 and 28, respectively. FIG. 27 is a characteristic diagram showing the results obtained with the haplotype frequency data of the SAA gene, and FIG. 28 is a characteristic diagram showing the results obtained with the ART haplotype frequency data. In FIGS. 27 and 28, the χ² values obtained by this algorithm and the χ² values from the contingency table based on complete data are plotted with a slight offset so that the error bars are easy to see.
  FIGS. 27 and 28 confirm that the type I error agreed with the significance level of 0.05, within the statistical error of the simulation, when the number of samples was 400 (case = control = 200) or more for the SAA haplotype frequency data and 600 (case = control = 300) or more for the ART haplotype frequency data.
  Applying the individual genotype data and phenotype data obtained from a case-control study to the algorithm described above has the following advantage over methods that first estimate the diplotype form of each individual and then test with a contingency table. Here, the following four methods of determining the diplotype form of an individual are compared.
1. The diplotype forms are estimated separately for the case population and the control population, and when an individual has more than one possible diplotype form, the one with the highest probability is adopted as that individual's diplotype form (hereinafter referred to as separate0).
2. The diplotype forms are estimated separately for the case population and the control population, and when an individual has more than one possible diplotype form, the individual is divided among them in proportion to their probabilities (hereinafter referred to as separate1).
3. The diplotype forms are estimated for the whole population combining the case population and the control population, and the one with the highest probability is adopted as that individual's diplotype form (hereinafter referred to as separate2).
4. The diplotype forms are estimated for the whole population combining the case population and the control population, and the individual is divided among the diplotype forms in proportion to their probabilities (hereinafter referred to as separate3).
  Pearson's test statistic was calculated from the contingency tables created by each of the methods separate0 to separate3. At the same time, the Pearson test statistic obtained from the contingency table based on complete diplotype data (denoted diplotype) and the likelihood ratio test statistic from this algorithm (denoted Penhaplo) were calculated, and the correlation coefficients among the statistics were examined. The simulation results based on the ART haplotype frequency data shown in Table 25 are shown in Table 26.
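The difference between adopting the most probable diplotype form (separate0, separate2) and dividing an individual in proportion to the probabilities (separate1, separate3) can be sketched as follows. Whether the posterior distributions come from separate or pooled estimation is decided before this step and is outside the sketch. The function names and the 2×2 layout (rows: case/control; columns: carrier/non-carrier of the phenotype-related haplotype) are assumptions for illustration.

```python
def build_table(posteriors, is_case, target, fractional):
    """posteriors: per individual, a dict {diplotype form: probability};
    is_case: per individual, True for a case and False for a control;
    target: the phenotype-related haplotype;
    fractional: False adopts the most probable diplotype form, True divides
    the individual among diplotype forms in proportion to the probabilities."""
    table = [[0.0, 0.0], [0.0, 0.0]]
    for posterior, case in zip(posteriors, is_case):
        row = 0 if case else 1
        if fractional:
            shares = list(posterior.items())
        else:
            shares = [(max(posterior, key=posterior.get), 1.0)]
        for diplotype, weight in shares:
            column = 0 if target in diplotype else 1
            table[row][column] += weight
    return table

def pearson_chi2(table):
    """Pearson chi-square statistic of a 2x2 table (all margins must be > 0)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic
```

With hard assignment an ambiguous individual contributes a full count to one cell, whereas fractional assignment spreads that count across cells.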

  As can be seen from Table 26, the statistic most strongly correlated with the test statistic based on complete data (diplotype) was the likelihood ratio test statistic of this algorithm (Penhaplo), followed by the separate3 method. FIG. 29 shows the type I error under the null hypothesis obtained by simulation for this algorithm and for the separate3 method.
  FIG. 29 shows that the type I error of this algorithm agrees with the significance level of 0.05, whereas the separate3 method underestimates the type I error. Hence, analysis by this algorithm can be said to be superior to the existing approach of estimating the diplotype forms first and then testing with a contingency table.
<Analysis of real data 1>
  The algorithm was applied to real data. The data used were for the MTHFR gene [Urano et al. (2002) Polymorphisms in the methylenetetrahydrofolate reductase gene were associated with both the efficacy and the toxicity of methotrexate used for the treatment of rheumatoid arthritis, as evidenced by single locus and haplotype analyses. Pharmacogenetics 12: 183-190] and the NAT2 gene [Tanaka et al. (2002) Adverse effects of sulfasalazine in patients with rheumatoid arthritis are associated with diplotype configuration at the N-acetyltransferase 2 gene. J Rheumatol 29: 2492-2499]. All data are from cohort studies.
  The data set for the MTHFR gene comes from a cohort study of patients with rheumatoid arthritis. The 104 patients taking methotrexate were examined for the occurrence of side effects and for two SNPs of the MTHFR gene. One haplotype was assumed to be the “phenotype-related haplotype”.
  As a result, the statistic −2 log(L_0max/L_max) was 6.8074, which was statistically significant (P < 0.01). The maximum likelihood estimates “q₊ hat” and “q₋ hat” were 0.2571 and 0.0588, respectively; that is, the maximum likelihood estimate of the relative risk was 4.37. The diplotype distribution of each individual was concentrated on a single event. Although the data are not shown, for all individuals the diplotype form estimated under the alternative hypothesis was approximately the same as that estimated under the null hypothesis or by LDSUPPORT. That is, in this example, the estimated diplotype forms were almost the same whether this algorithm, which uses phenotype data, or LDSUPPORT, which does not, was used.
  The data set for the NAT2 gene was likewise obtained from a cohort study of rheumatoid arthritis patients. The 144 patients who took sulfasalazine were examined for the occurrence of side effects and for seven SNPs in the NAT2 gene [see Tanaka et al. (2002)]. The haplotype known as the wild-type haplotype was assumed to be the “phenotype-related haplotype”. The statistic −2 log(L_0max/L_max) was 13.4629, which was significant (P < 0.001), indicating that the presence of this haplotype is associated with side effects. The maximum likelihood estimates “q₊ hat” and “q₋ hat” were 0.0809 and 0.6248, respectively; that is, the maximum likelihood estimate of the relative risk was 0.129. In other words, the presence of the “phenotype-related haplotype” was associated with a reduction in side effects.
  The diplotype distribution of each individual was concentrated on a single event, except for one person. For all individuals, the diplotype form estimated by this algorithm under the alternative hypothesis was the same as that estimated under the null hypothesis or by LDSUPPORT.
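The relative risks quoted for the two data sets follow directly from the ratio of the estimated penetrances, as this short worked calculation shows (the subscripts MTHFR and NAT2 are added here only to label the two data sets):

```latex
\widehat{RR}_{\mathrm{MTHFR}} = \frac{\hat{q}_+}{\hat{q}_-} = \frac{0.2571}{0.0588} \approx 4.37,
\qquad
\widehat{RR}_{\mathrm{NAT2}} = \frac{\hat{q}_+}{\hat{q}_-} = \frac{0.0809}{0.6248} \approx 0.129
```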
  As described above, the analysis of real data shows that this algorithm can calculate penetrances from genotype data and phenotype data. Thus, with this algorithm, the association between a phenotype and a haplotype can be tested at the individual level from genotype data and phenotype data. In addition, maximum likelihood estimates of the penetrances can be obtained under the assumption that different penetrances correspond to the presence or absence of a specific haplotype.
  The incomplete haplotype H_l described above was next used to analyze the genotype data consisting of seven SNPs of the NAT2 gene and the side-effect data obtained from the cohort study of 144 rheumatoid arthritis patients taking sulfasalazine. As described above, these data showed that the risk of side effects in individuals carrying the wild-type haplotype is 0.129 times that in individuals without it.
  Specifically, the wild-type haplotype is represented by the SNP sequence “GCTCGAG”. When incomplete haplotypes were constructed by the method described above, however, the analysis showed that “GC****G” was the most significant. Here “*” indicates a masked locus. That is, in this example, the third to sixth loci were masked, and the incomplete haplotype H_l constructed from the SNPs at loci 1, 2 and 7 was the subject of the analysis. Applying this algorithm to the constructed incomplete haplotypes H_l showed that “GC****G” was the most significant. “GC****G” means that the masked loci are represented by the information of the other loci, and indicates that having “C” at the second locus and “G” at the seventh locus is associated with the phenotype. The wild-type haplotype “GCTCGAG” is indeed a haplotype that matches the expression “GC****G”. Thus, a “phenotype-related haplotype” can be searched for by using the incomplete haplotype H_l.
  As another application of the incomplete haplotype H_l, an analysis using incomplete haplotypes was performed on genotype data consisting of three SNPs of the ABCB8 gene, obtained from a cohort study of 175 rheumatoid arthritis patients taking methotrexate, and on data on whether folic acid, which is given when side effects occur, was administered. The results are shown in Table 27.

In the analysis results shown in Table 27, the lowest P value is obtained for the haplotype “CG*”, which is shown to be more significant than the haplotype “CGA” that uses the information of all loci. This indicates that the incomplete haplotype “CG*” formed by the first and second loci is more strongly associated with the cause of the phenotype. That is, the causal locus can be understood to be associated with both the “CGA” and “CGG” haplotypes, and masking the third locus brings out the association with the phenotype more strongly. This result indicates that using the incomplete haplotype H_l makes it possible to detect a phenotype associated with multiple haplotypes.
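Matching a complete haplotype against an incomplete haplotype with masked loci can be sketched as follows. The function names and the frequencies in the usage example are hypothetical; only the haplotype strings come from the text.

```python
def matches(haplotype, pattern, mask="*"):
    """True if the complete haplotype agrees with the incomplete haplotype
    at every locus that is not masked."""
    return len(haplotype) == len(pattern) and all(
        p == mask or p == h for h, p in zip(haplotype, pattern))

def pooled_frequency(hap_freqs, pattern):
    """Total frequency of all complete haplotypes covered by the pattern."""
    return sum(f for h, f in hap_freqs.items() if matches(h, pattern))

# The wild-type NAT2 haplotype is covered by the incomplete haplotype
# in which loci 3 to 6 are masked.
wild_type_matches = matches("GCTCGAG", "GC****G")

# Made-up frequencies: "CG*" pools the "CGA" and "CGG" haplotypes.
pooled = pooled_frequency({"CGA": 0.3, "CGG": 0.2, "TGA": 0.5}, "CG*")
```

Pooling in this way is what allows a phenotype associated with several complete haplotypes to be tested as a single haplotype set.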
Industrial applicability
According to the present invention, a method for analyzing haplotypes is provided. With the method of the present invention, individual differences in drug susceptibility and the like can be evaluated using haplotypes, without analyzing each SNP individually, and the method can be used as a tool for personalized medicine (custom-made or tailor-made medicine).

[Sequence Listing]

Claims

A haplotype analysis method comprising the following steps:
(a) detecting a genetic polymorphism in at least one of the drug-related genes shown in Table 1, obtained from a test population;
(b) processing the detection information to select Common polymorphisms;
(c) constructing a haplotype block using the selected Common polymorphisms;
(d) identifying the Common polymorphisms within the haplotype block; and
(e) identifying a tag polymorphism from the Common polymorphisms within the haplotype block.

The method according to claim 1, wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.

The method according to claim 1, wherein the tag polymorphism is at least one selected from those shown in the "htSNPs" column in the "Block" section of Table 3.

A haplotype analysis method comprising the following steps:
(a) detecting a genetic polymorphism in at least one of the drug-related genes shown in Table 1, obtained from a test population;
(b) processing the detection information to select Common polymorphisms;
(c) constructing a haplotype block using the selected Common polymorphisms; and
(d) identifying a Common polymorphism outside the haplotype block.

The method according to claim 4, wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.

The method according to claim 4, wherein the Common polymorphism outside the haplotype block is at least one selected from those shown in the “Between” section of Table 3.

A haplotype analysis method comprising the following steps:
(a) detecting a genetic polymorphism in a drug-related gene obtained from a test population;
(b) processing the detection information to select Common polymorphisms;
(c) constructing a haplotype block using the selected Common polymorphisms;
(d) identifying a Rare polymorphism within and/or outside the haplotype block; and
(e) assigning the Rare polymorphism to a major haplotype.

The method according to claim 7, wherein the drug-related gene is at least one selected from those shown in Table 1.

The method according to claim 7, wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.

The method according to claim 7, wherein the Rare polymorphism is at least one selected from those shown in Table 2.

A method for analyzing the relationship between haplotypes and susceptibility to a drug or foreign substance or susceptibility to a disease, comprising the following steps:
(a) identifying a tag polymorphism, by the method according to any one of claims 1 to 3, for a drug-related gene collected from a test population that has been or may be exposed to a drug or foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) using the identified tag polymorphism or a combination thereof, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.

A method for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, comprising the following steps:
(a) identifying a common polymorphism outside the block, by the method according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the identified common polymorphism, comparing the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype.

The method of claim 11, further comprising the following steps:
(a) identifying a common polymorphism outside the block, by the method according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the identified common polymorphism, comparing the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype.

A method for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, comprising the following steps:
(a) assigning a Rare polymorphism to a specific major haplotype, by the method according to any one of claims 7 to 10, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the assigned Rare polymorphism, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.

The method according to claim 11, further comprising the following steps:
(a) assigning a Rare polymorphism to a specific major haplotype, by the method according to any one of claims 7 to 10, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) for the assigned Rare polymorphism, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.

A method for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following steps:
(a) identifying a tag polymorphism, by the method according to any one of claims 1 to 3, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) using the identified tag polymorphism or a combination thereof, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype;
(c) selecting a block involved in the frequency difference;
(d) selecting a polymorphism present in the selected block; and
(e) estimating a polymorphism associated with the frequency difference.

A method for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following steps:
(a) identifying a common polymorphism outside the block, by the method according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) for the identified common polymorphism, comparing the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype; and
(c) estimating a polymorphism associated with the frequency difference.

The method of claim 16, further comprising the following steps:
(a) identifying a common polymorphism outside the block, by the method according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) for the identified common polymorphism, comparing the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype; and
(c) estimating a polymorphism associated with the frequency difference.

A method for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, comprising the following steps:
(a) assigning a Rare polymorphism to a specific major haplotype, by the method according to any one of claims 7 to 10, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) for the assigned Rare polymorphism, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype; and
(c) estimating a polymorphism associated with the frequency difference.

The method according to any one of claims 16 to 18, further comprising the following steps:
(a) assigning a Rare polymorphism to a specific major haplotype, by the method according to any one of claims 7 to 10, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) for the assigned Rare polymorphism, comparing the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype; and
(c) estimating a polymorphism associated with the frequency difference.

A method for analyzing individual differences in phenotype relating to drug or foreign substance susceptibility or disease susceptibility, comprising the step of associating an individual with a phenotype using, as an index, an analysis result obtained by the method according to any one of claims 11 to 15 or an estimation result obtained by the method according to any one of claims 16 to 20.

The method according to any one of claims 11 to 21, wherein the drug susceptibility is susceptibility relating to pharmacokinetics, drug efficacy, or drug side effects.

The method of claim 22, wherein the pharmacokinetics is kinetics relating to absorption, distribution, metabolism or excretion of the drug.

The method of claim 22, wherein the pharmacokinetics is kinetics relating to the blood concentration of the drug.

The method according to any one of claims 11 to 21, wherein the disease susceptibility is the presence or absence of morbidity, or its severity.

The method of any one of the above, wherein the disease is at least one selected from the group consisting of a malignant tumor, an immune system disease, a cardiovascular disease, a metabolic disease, a renal or urological disease, a respiratory disease and a musculoskeletal disease.

A method for predicting drug or foreign substance susceptibility or disease, using as an index the analysis result or estimation result obtained by the method according to claim 11.

A method for selecting a drug for preventing or treating a disease and/or a method for preventing or treating a disease, using as an index the analysis result or estimation result obtained by the method according to any one of claims 11 to 26.

A method for determining an appropriate dose of a drug for preventing or treating a disease, using as an index the analysis result or estimation result obtained by the method according to any one of claims 11 to 26.

A method for analyzing a drug-drug interaction, using as an index the analysis result or estimation result obtained by the method according to any one of claims 11 to 26.

A method for determining an associated polymorphism related to drug or foreign substance susceptibility or disease susceptibility, using as an index the analysis result or estimation result obtained by the method according to any one of claims 11 to 26.

A haplotype analysis program for causing a computer to function as the following means:
(a) means for detecting a genetic polymorphism for at least one of the drug-related genes shown in Table 1 obtained from a test population;
(b) means for processing the detection information and selecting a Common polymorphism;
(c) means for constructing a haplotype block using the selected Common polymorphism;
(d) means for identifying a Common polymorphism in the haplotype block; and
(e) means for identifying a tag polymorphism from the Common polymorphisms in the haplotype block.
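Means (b) and (e) can be illustrated with a minimal sketch. It assumes phased haplotypes given as equal-length strings, a hypothetical minor-allele-frequency cutoff of 0.1 for calling a polymorphism Common, and a brute-force search for the smallest set of sites that still distinguishes every haplotype as the tag criterion. None of these choices is stated in the claim, and block construction (means (c)) is taken as already done.

```python
from collections import Counter
from itertools import combinations


def minor_allele_frequency(alleles):
    """Frequency of the second most frequent allele (0.0 if the site is monomorphic)."""
    counts = sorted(Counter(alleles).values(), reverse=True)
    if len(counts) < 2:
        return 0.0
    return counts[1] / len(alleles)


def select_common(haplotypes, threshold=0.1):
    """Indices of sites whose minor allele frequency is at least `threshold` (assumed cutoff)."""
    n_sites = len(haplotypes[0])
    return [
        i for i in range(n_sites)
        if minor_allele_frequency([h[i] for h in haplotypes]) >= threshold
    ]


def select_tags(haplotypes, sites):
    """Smallest subset of `sites` that distinguishes all distinct haplotypes over `sites`."""
    distinct = {tuple(h[i] for i in sites) for h in haplotypes}
    for k in range(1, len(sites) + 1):
        for subset in combinations(range(len(sites)), k):
            projected = {tuple(h[j] for j in subset) for h in distinct}
            if len(projected) == len(distinct):
                return [sites[j] for j in subset]
    return list(sites)


# Hypothetical block of five sites observed on 20 chromosomes.
haplotypes = ["ACGTC"] * 9 + ["ATGTC"] * 6 + ["GCATC"] * 4 + ["ACGTG"] * 1
common_sites = select_common(haplotypes)
tag_sites = select_tags(haplotypes, common_sites)
```

In this example sites 0 to 2 pass the cutoff, and sites 0 and 1 together already separate the three haplotypes seen over the Common sites, so site 2 need not be typed. The exhaustive search is exponential in the number of sites and is only practical for small blocks.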

The program according to claim 32, wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.

The program according to claim 32, wherein the tag polymorphism is at least one selected from those shown in the "htSNPs" column in the "Block" section of Table 3.

A haplotype analysis program for causing a computer to function as the following means:
(a) means for detecting a genetic polymorphism for at least one of the drug-related genes shown in Table 1 obtained from a test population;
(b) means for processing the detection information and selecting a Common polymorphism;
(c) means for constructing a haplotype block using the selected Common polymorphism; and
(d) means for identifying a Common polymorphism outside the haplotype block.
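Means (d) amounts to a position test once the blocks are known. The sketch below assumes each block is described by inclusive start and end coordinates and each Common polymorphism by a single position; the names and coordinates are hypothetical.

```python
def polymorphisms_outside_blocks(positions, blocks):
    """Return names of polymorphisms whose position lies in none of the blocks.

    positions: dict mapping polymorphism name -> coordinate
    blocks: list of (start, end) tuples, both ends inclusive
    """
    return [
        name for name, pos in positions.items()
        if not any(start <= pos <= end for start, end in blocks)
    ]


# Hypothetical coordinates along one gene.
positions = {"snp1": 100, "snp2": 250, "snp3": 400, "snp4": 900}
blocks = [(50, 300), (800, 1000)]
outside = polymorphisms_outside_blocks(positions, blocks)
```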

The program according to claim 35, wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.

The program according to claim 35, wherein the common polymorphism outside the haplotype block is at least one selected from those shown in the "Between" section of Table 3.

A haplotype analysis program for causing a computer to function as the following means:
(a) means for detecting a genetic polymorphism for a drug-related gene obtained from a test population;
(b) means for processing the detection information and selecting a Common polymorphism;
(c) means for constructing a haplotype block using the selected Common polymorphism;
(d) means for identifying a Rare polymorphism within and/or outside the haplotype block; and
(e) means for assigning the Rare polymorphism to a major haplotype.
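Means (e) can be sketched as follows, assuming phased haplotypes are available. A major haplotype is taken here to be any pattern over the Common sites whose frequency is at least a hypothetical cutoff of 0.05, and the Rare allele is assigned to the major haplotype on which it is most often observed. Both the cutoff and the assignment rule are assumptions made for illustration and are not taken from the claim.

```python
from collections import Counter


def assign_rare_polymorphism(haplotypes, common_sites, rare_site, rare_allele,
                             major_min_freq=0.05):
    """Return the major haplotype (pattern over common_sites) carrying the rare allele.

    Returns None if the rare allele is never seen on a major haplotype.
    """
    def background(h):
        # Pattern of the chromosome over the Common sites only.
        return "".join(h[i] for i in common_sites)

    all_counts = Counter(background(h) for h in haplotypes)
    major = {b for b, n in all_counts.items()
             if n / len(haplotypes) >= major_min_freq}

    # Count the backgrounds of chromosomes that carry the rare allele.
    carriers = Counter(background(h) for h in haplotypes
                       if h[rare_site] == rare_allele)
    candidates = [(n, b) for b, n in carriers.items() if b in major]
    if not candidates:
        return None
    return max(candidates)[1]


# Hypothetical data: sites 0-2 are Common, site 4 carries a Rare allele "G".
haplotypes = ["ACGTC"] * 9 + ["ATGTC"] * 6 + ["GCATC"] * 4 + ["ACGTG"] * 1
assigned = assign_rare_polymorphism(haplotypes, [0, 1, 2], 4, "G")
```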

The program according to claim 38, wherein the drug-related gene is at least one selected from those shown in Table 1.

The program according to claim 38, wherein the gene polymorphism is a single nucleotide polymorphism, a polymorphism caused by deletion, substitution or insertion of a plurality of bases, or a polymorphism caused by VNTR or microsatellite.

The program according to claim 38, wherein the Rare polymorphism is at least one selected from those shown in Table 2.

A program for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, the program causing a computer to function as the following means:
(a) means for identifying a tag polymorphism, by the program according to any one of claims 32 to 34, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) means for comparing, using the identified tag polymorphism or a combination thereof, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.

A program for analyzing the relationship between drug or foreign substance susceptibility or disease susceptibility and a haplotype, the program causing a computer to function as the following means:
(a) means for identifying a common polymorphism outside the block, by the program according to any one of claims 35 to 37, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) means for comparing, for the identified common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype.

The program according to claim 42, further causing the computer to function as the following means:
(a) means for identifying a common polymorphism outside the block, by the program according to any one of claims 35 to 37, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) means for comparing, for the identified common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype.

A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(a) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.

The program according to any one of claims 42 to 44, further causing the computer to function as the following means:
(a) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease; and
(b) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype.

A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(a) means for identifying a tag polymorphism, by the program according to any one of claims 32 to 34, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) means for comparing, using the identified tag polymorphism or a combination thereof, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype;
(c) means for selecting a block involved in the frequency difference;
(d) means for selecting a polymorphism present in the selected block; and
(e) means for estimating a polymorphism associated with the frequency difference.

A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(a) means for identifying a common polymorphism outside the block, by the program according to any one of claims 35 to 37, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) means for comparing, for the identified common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype; and
(c) means for estimating a polymorphism associated with the frequency difference.

The program of claim 47, further causing the computer to function as the following means:
(a) means for identifying a common polymorphism outside the block, by the program according to any one of claims 35 to 37, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) means for comparing, for the identified common polymorphism, the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having one phenotype with the frequency of the polymorphism, or the frequency of individuals having the polymorphism, in a test population having another phenotype; and
(c) means for estimating a polymorphism associated with the frequency difference.

A program for estimating a polymorphism associated with drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as the following means:
(a) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype; and
(c) means for estimating a polymorphism associated with the frequency difference.

The program according to any one of claims 47 to 49, further causing the computer to function as the following means:
(a) means for assigning a Rare polymorphism to a specific major haplotype, by the program according to item 1, for a drug-related gene collected from a test population exposed to or likely to be exposed to a drug or a foreign substance, or from a test population exposed to a risk factor for a disease;
(b) means for comparing, for the assigned Rare polymorphism, the frequency of a haplotype, or the frequency of individuals having the haplotype, in a test population having one phenotype with the frequency of the haplotype, or the frequency of individuals having the haplotype, in a test population having another phenotype; and
(c) means for estimating a polymorphism associated with the frequency difference.

A program for analyzing individual differences in phenotype relating to drug or foreign substance susceptibility or disease susceptibility, the program causing a computer to function as means for associating an individual with a phenotype using, as an index, an analysis result obtained by the program according to any one of claims 42 to 46 or an estimation result obtained by the program according to any one of claims 47 to 51.

The program according to any one of claims 47 to 52, wherein the drug susceptibility is susceptibility relating to pharmacokinetics, drug efficacy, or drug side effects.

The program according to claim 53, wherein the pharmacokinetics is kinetics relating to absorption, distribution, metabolism or excretion of the drug.

The program according to claim 53, wherein the pharmacokinetics is kinetics relating to the blood concentration of the drug.

The program according to any one of claims 47 to 52, wherein the disease susceptibility is the presence or absence of morbidity, or its severity.

The program according to any one of the above, wherein the disease is at least one selected from the group consisting of a malignant tumor, an immune system disease, a cardiovascular disease, a metabolic disease, a renal or urological disease, a respiratory disease and a musculoskeletal disease.

A program for causing a computer to function as means for predicting drug or foreign substance susceptibility or disease, using as an index the analysis result or estimation result obtained by the program according to any one of claims 42 to 57.

A program for causing a computer to function as means for selecting a drug for preventing or treating a disease and/or a method for preventing or treating a disease, using as an index the analysis result or estimation result obtained by the program according to any one of claims 42 to 57.

A program for causing a computer to function as means for determining an appropriate dose of a drug for preventing or treating a disease, using as an index the analysis result or estimation result obtained by the program according to any one of claims 42 to 57.

A program for causing a computer to function as means for analyzing a drug-drug interaction, using as an index the analysis result or estimation result obtained by the program according to any one of claims 42 to 57.

A program for causing a computer to function as means for determining an associated polymorphism related to drug or foreign substance susceptibility or disease susceptibility, using as an index the analysis result or estimation result obtained by the program according to any one of claims 42 to 57.

A computer-readable recording medium on which the program according to any one of claims 32 to 62 is recorded.

A haplotype containing at least one Rare polymorphism shown in Table 2.