JP2010224815A

JP2010224815A - Retrieval algorithm for epistasis effect based on comprehensive genome wide snp information

Info

Publication number: JP2010224815A
Application number: JP2009070753A
Authority: JP
Inventors: Masaaki Matsuura; 正明松浦; Masaru Ushijima; 大牛嶋; Minoru Isomura; 実磯村; Yoshio Miki; 義男三木; Seiji Okuizumi; 盛司奥泉
Original assignee: Japanese Foundation for Cancer Research; NEC Solution Innovators Ltd
Current assignee: Japanese Foundation for Cancer Research; NEC Solution Innovators Ltd
Priority date: 2009-03-23
Filing date: 2009-03-23
Publication date: 2010-10-07
Anticipated expiration: 2029-03-23
Also published as: JP5413952B2

Abstract

PROBLEM TO BE SOLVED: To provide a method of identifying, within a practical computation time, SNP pairs that have a synergistic epistasis effect on a binary phenotype, even when no main effect is confirmed, by using genome-wide SNP data.

SOLUTION: The binary phenotype data of each specimen are input and stored (1). The genotypes of M single nucleotide polymorphisms (SNPs) are input and stored for each specimen (2). For each SNP, genotype-specific counts are calculated for each phenotype (3). For each SNP, whether analysis should continue is determined from those counts and stored (4). For each SNP judged valid for continued analysis, the dominant or recessive type with respect to the phenotype is determined and stored (5). For two SNPs judged valid for continued analysis, the epistasis effect is determined and stored using a contingency table based on the phenotype and the dominant/recessive types, and the SNP pair under analysis is then changed according to the analysis procedure (6). For SNP pairs judged to have an epistasis effect, the effect is verified by logistic regression analysis (7).

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a method and a program for rapidly identifying SNPs that have an epistasis effect, that is, SNPs that do not affect the phenotype individually but affect it synergistically only when two of them are present together. The method uses data in which the genotypes of up to one million single nucleotide polymorphisms (SNPs) across the whole genome have been observed for each subject, where each subject carries a label represented by a binary value as the phenotype.

In various species, genes on the genome are known to be involved in phenotypes, the biological characteristics of an individual. A single gene may act on a phenotype, but in general several genes may act on one phenotype. Epistasis was originally defined as the epistatic effect among the non-additive effects of gene action (see Non-Patent Document 1: Bateson, Mendel's Principles of Heredity. Cambridge University Press, Cambridge 1909). Today it is understood as “interaction between genes” and is an extremely important concept for clarifying the relationship between genes and phenotypes (see Non-Patent Document 2: Cordell. Hum Molecular Genet, Vol.11, No.20, 2463-2468, 2002). The interaction is called synergistic epistasis when its effect is larger than the effect of the individual genes acting independently on the phenotype, and antagonistic epistasis when it is smaller. That is, when a set of a given number of genes shows synergistic epistasis for a phenotype, the effect of the whole gene set is larger than the sum of the individual gene effects. It has been reported that the effect of epistasis becomes more synergistic as the genome becomes more complex, which indicates the importance of considering epistasis when elucidating the relationship between the human genome and phenotypes (see Non-Patent Document 3: Sanjuan and Elena, PNAS Vol.103, No.39, 14402-14405, 2006).

Although the concept of epistasis is old, actually searching for epistasis effects can be difficult. For the analysis of small gene sets, a method using a logistic model has been proposed (see Non-Patent Document 4: Cordell and Clayton. Am J Hum Genet, Vol.70, No.1, 124-141, 2002). For detecting epistasis effects when single genes have a main effect on the phenotype, there are the penalized maximum likelihood estimation method and the variable interval approach of the BSE method (see Non-Patent Document 5: Zhang, Shrinkage Estimation Method for Mapping Multiple Quantitative Trait Loci, Vol.33 No.10, Page.861-869, 2006). A method that analyzes the main effects at loci and the interaction between two loci on the basis of entropy has also been proposed (see Non-Patent Document 6: Dong et al. Eur J Hum Genet. Vol.16, 229-235, 2008).

However, when multiple genes are considered and each gene alone has no effect on the phenotype, so that an effect appears only when the set of genes is present together, it is impossible to find the interaction effect of the gene set by first searching for the effects of the individual genes and building on those results.

At present, technology for comprehensively examining variation in the human genome has advanced, and gene variations are being examined genome-wide on the basis of single nucleotide polymorphisms (SNPs) in order to identify human phenotypes, in particular disease susceptibility and individual differences in side effects. Marchini et al. pointed out in 2005 the difficulty of analyses that consider multiple loci in genome-wide genetic data (see Non-Patent Document 7: Marchini et al. Nat. Genet. Vol.37, No.4, 413-417, 2005). With the latest technology (January 2009), about 900,000 SNPs are examined per patient. The concept of epistasis has so far been “interaction between genes”, but when the effects of SNPs are considered it needs to be extended to “interaction between SNPs”. The smallest SNP set is a pair of SNPs, yet there are about 500 billion ways of choosing 2 SNPs out of 900,000, so searching for epistasis in the absence of main effects, as described above, requires examining about 500 billion combinations.

Non-Patent Document 1: Bateson, Mendel's Principles of Heredity. Cambridge University Press, Cambridge 1909.
Non-Patent Document 2: Cordell. Hum Molecular Genet, Vol.11, No.20, 2463-2468, 2002.
Non-Patent Document 3: Sanjuan and Elena, PNAS Vol.103, No.39, 14402-14405, 2006.
Non-Patent Document 4: Cordell and Clayton. Am J Hum Genet, Vol.70, No.1, 124-141, 2002.
Non-Patent Document 5: Zhang, Shrinkage Estimation Method for Mapping Multiple Quantitative Trait Loci, Vol.33 No.10, Page.861-869, 2006.
Non-Patent Document 6: Dong et al. Eur J Hum Genet. Vol.16, 229-235, 2008.
Non-Patent Document 7: Marchini et al. Nat. Genet. Vol.37, No.4, 413-417, 2005.

The conventional methods have the following problems.

Until recently there was no technology for examining genotypes genome-wide. Analyses were therefore limited to detecting the interaction effect between two specific genes or, even as the number of genes examined grew, to gene sets of at most about 100 genes. As a result, methods for detecting the interaction effect between two genes relied on iterative calculations and complex algorithms to search for a solution, and no constraint was placed on analysis time, so no method exists that can cope with an enormous amount of genetic data. In recent genome-wide analyses the genotypes of about 900,000 SNPs are examined. There are about 500 billion ways of choosing 2 SNPs out of 900,000, and all of them must be examined to search for epistasis between two SNPs. Even if one pair of SNPs could be analyzed in 1 second, the computation would take 15,854 years, which makes the analysis practically infeasible. With conventional methods it is therefore impossible to search exhaustively for epistasis between two SNPs in genome-wide data when there is no main effect.
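The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and a 365-day year is assumed. It shows that the exact number of unordered pairs among 900,000 SNPs is roughly 405 billion, which the text rounds to about 500 billion (the pair count for 1 million SNPs), and that 500 billion one-second analyses correspond to the stated 15,854 years.

```python
from math import comb

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def years_at_one_pair_per_second(n_pairs: int) -> float:
    """Years needed if each SNP pair takes exactly one second to analyze."""
    return n_pairs / SECONDS_PER_YEAR


exact_pairs = comb(900_000, 2)      # exact number of unordered SNP pairs
rounded_pairs = 500_000_000_000     # the "about 500 billion" figure in the text

print(exact_pairs)                                        # 404999550000
print(int(years_at_one_pair_per_second(rounded_pairs)))   # 15854
```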

An object of the present invention is to provide a high-speed identification method and a data analysis system that can complete an exhaustive search for epistasis between two SNPs in the genotype data of about 1 million SNPs obtained by genome-wide analysis, even when there is no main effect.

In the present invention, the following are stored in an internal storage device that can be accessed at high speed and are referenced when needed: the genotype data of a total of M SNPs (M being 500,000 or more) observed from N specimens and input via an input device, the phenotype class data for each specimen, and the dominant/recessive determinations derived from the class-specific, genotype-specific counts calculated from these data. This eliminates wasteful repeated calculations for the same SNP.

Furthermore, as the method for identifying epistasis effects, the present invention uses, for each of the two phenotypes, the four summary counts in a 2x2 contingency table formed for a pair of SNPs by distinguishing each genotype as dominant or recessive. From the four counts of each table, an odds ratio statistic that can be computed with three arithmetic operations is calculated, giving two odds ratios on which the determination is made. By narrowing the information down to what is useful for identifying epistasis and using statistics that require very little computation, a determination scheme that greatly shortens computation time can be constructed. Exhaustive identification of epistasis effects thus becomes possible in a realistic time, and the object of the present invention is achieved.

One embodiment of the present invention is a data analysis system that uses a computer to exhaustively identify, from genome-wide single nucleotide polymorphism (SNP) genotype data covering 500,000 or more sites, pairs of SNPs that have a synergistic interaction (epistasis effect) on a phenotype with binary classes even when no main effect is confirmed, the system comprising:
(1) input means for inputting the genotype data of a total of M SNPs (M being 500,000 or more) observed from N specimens having two classes of phenotype, together with the phenotype class of each specimen;
(2) storage means for storing the phenotype classes of the N specimens and the genotype data of the M SNPs input via the input means (1);
(3) calculation means for performing, for the i-th SNP stored in the storage means (2), statistical processing of the two phenotype classes and the genotype data of the N specimens to calculate class-specific, genotype-specific counts, and for performing screening as a preprocessing step that determines, based on the calculated minor-allele counts, whether analysis of the i-th SNP should continue;
(4) storage means for determining by statistical means, when the calculation means (3) has judged the SNP “suitable for continued analysis”, whether the i-th SNP is of the dominant type or the recessive type with respect to the phenotype based on the calculated counts, and for storing in an internal storage device the determination results on the suitability of continued analysis and on the dominant/recessive type;
(5) calculation means for creating, for each of the two phenotype classes, a 2x2 contingency table for the i-th and j-th SNPs (j ≠ i; initially i = 1, j = 2) based on the dominant/recessive determination results obtained by the statistical means, calculating an index for judging epistasis from the created 2x2 contingency tables, and determining from this index whether an epistasis effect is present;
(6) analysis means for storing, when the calculation means (5) has determined that “an epistasis effect is present”, that determination result for the two SNPs, and, on moving to the analysis of the next SNP, changing the j-th SNP to the (j+1)-th SNP, returning to the step of the storage means (4), determining the dominant/recessive type of the (j+1)-th SNP by statistical means, and repeating the calculation of the step of the calculation means (3), and, when j+1 reaches M, selecting the (i+1)-th SNP in place of the i-th SNP and the (i+2)-th SNP in place of the j-th SNP; and
(7) calculation means for confirming, when the calculation means (5) has determined that “an epistasis effect is present”, the synergistic epistasis effect by multivariate analysis means using logistic regression analysis.

Another embodiment of the present invention is a data analysis method that uses a computer to exhaustively identify, from genome-wide single nucleotide polymorphism (SNP) genotype data covering 500,000 or more sites, pairs of SNPs that have a synergistic interaction (epistasis effect) on a phenotype with binary classes even when no main effect is confirmed, the method comprising:
(1) an input step of inputting the genotype data of a total of M SNPs (M being 500,000 or more) observed from N specimens having two classes of phenotype, together with the phenotype class of each specimen;
(2) a storage step of storing in storage means the phenotype classes of the N specimens and the genotype data of the M SNPs input in the input step (1);
(3) a calculation step of performing, for the i-th SNP stored in the storage means by the storage step (2), statistical processing of the two phenotype classes and the genotype data of the N specimens to calculate class-specific, genotype-specific counts, and of performing screening as a preprocessing step that determines, based on the calculated minor-allele counts, whether analysis of the i-th SNP should continue;
(4) a storage step of determining by statistical means, when the SNP has been judged “suitable for continued analysis” in the calculation step (3), whether the i-th SNP is of the dominant type or the recessive type with respect to the phenotype based on the calculated counts, and of storing in an internal storage device the determination results on the suitability of continued analysis and on the dominant/recessive type;
(5) a calculation step of creating, for each of the two phenotype classes, a 2x2 contingency table for the i-th and j-th SNPs (j ≠ i; initially i = 1, j = 2) based on the dominant/recessive determination results obtained by the statistical means in step (4), calculating an index for judging epistasis from the created 2x2 contingency tables, and determining from this index whether an epistasis effect is present;
(6) an analysis step of storing, when the calculation step (5) has determined that “an epistasis effect is present”, that determination result for the two SNPs, and, on moving to the analysis of the next SNP, changing the j-th SNP to the (j+1)-th SNP, returning to step (4), determining the dominant/recessive type of the (j+1)-th SNP by statistical means, and repeating the calculation of step (3), and, when j+1 reaches M, selecting the (i+1)-th SNP in place of the i-th SNP and the (i+2)-th SNP in place of the j-th SNP; and
(7) a calculation step of confirming, when the calculation step (5) has determined that “an epistasis effect is present”, the synergistic epistasis effect by multivariate analysis means using logistic regression analysis.
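The pair-advancing rule in step (6) amounts to visiting every unordered pair (i, j) with i < j among the SNPs that passed the screening. The following is a minimal sketch of that enumeration, not the patented implementation: it assumes 0-based indices, and the function name `snp_pairs` and the boolean list `keep` (one flag per SNP, true when the SNP was judged suitable for continued analysis) are hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def snp_pairs(keep: Sequence[bool]) -> Iterator[Tuple[int, int]]:
    """Yield 0-based index pairs (i, j), i < j, skipping excluded SNPs."""
    m = len(keep)
    for i in range(m - 1):
        if not keep[i]:
            continue  # the i-th SNP was screened out
        for j in range(i + 1, m):
            if keep[j]:
                yield i, j  # this pair goes on to the epistasis check


# Four SNPs, the second one excluded by the screening.
print(list(snp_pairs([True, False, True, True])))  # [(0, 2), (0, 3), (2, 3)]
```

Using a generator keeps memory use constant, which matters when the number of pairs is in the hundreds of billions.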

Another embodiment of the present invention is a program that causes a computer to execute the data analysis method of the present invention described above. Specifically, it takes the form of program source, stored on a recording medium readable by a computer, that causes the computer to execute the numerical processing of the series of steps constituting the data analysis method.

An advantage of the present invention is that, even for SNPs whose “presence or absence of a main effect” is difficult to determine, the possibility that a combination of two SNPs has a “synergistic effect” can be evaluated efficiently. In particular, candidate “SNP pairs” that may have a “synergistic effect” can be selected effectively even when the size of the “sample group” used for the evaluation (N = n1 + n2) is small.

A flowchart showing the procedure of the data analysis method according to the present invention.
A diagram showing the algorithm, used in step (4) of the data analysis method according to the present invention, for determining the dominant or recessive type of each SNP.
A diagram explaining the structure of the 2x2 contingency tables, one per phenotype class, created in step (4) for two SNPs taking the dominant/recessive type of each SNP into account.
A diagram explaining the “2x2 contingency table by phenotype class” created in step (4) taking the dominant/recessive type of each SNP into account: configuration 1 (dominant type and dominant type).
A diagram explaining the “2x2 contingency table by phenotype class” created in step (4) taking the dominant/recessive type of each SNP into account: configuration 2 (dominant type and recessive type).
A diagram explaining the “2x2 contingency table by phenotype class” created in step (4) taking the dominant/recessive type of each SNP into account: configuration 3 (recessive type and dominant type).
A diagram explaining the “2x2 contingency table by phenotype class” created in step (4) taking the dominant/recessive type of each SNP into account: configuration 4 (recessive type and recessive type).
A diagram explaining an embodiment of the data analysis method according to the present invention, illustrating an example of synergistic epistasis in the absence of main effects of the individual SNPs, together with the algorithm.
A diagram explaining an embodiment of the data analysis method according to the present invention, illustrating the analysis results for examples with high predictive ability among the 211 pairs extracted from about 125 billion SNP combinations.

The data analysis system of the present invention, and the data analysis method that can be carried out using it, are described in detail below.

In the input step (1) of the data analysis method according to the present invention, the phenotype of each of the N specimens, which takes one of two classes, and the genotype data of the total of M SNPs observed from each specimen are input in association with each specimen.

In the storage step (2), the two-class phenotypes of the N specimens and the genotype data of the M SNPs observed from each specimen, as input in the input step (1), are stored in an internal storage device that allows high-speed access.

In the calculation step (3), based on the two-class phenotypes of the N specimens and the genotype data of the M SNPs observed from each specimen, both stored in the storage device, the two phenotype classes and the genotype data of the N specimens are statistically processed for the i-th SNP, and the class-specific, genotype-specific counts are calculated by an arithmetic unit.

Based on the minor-allele counts calculated by the arithmetic unit, screening is performed as a preprocessing step that determines whether analysis of the i-th SNP should continue.

The genotype data of each SNP fall into three types in the population, two homozygotes and one heterozygote, according to the types of the two bases inherited from the mother and the father. Here the two homozygotes are denoted AA and aa, and the heterozygote is denoted Aa.

Further, let a11, a12, and a13 be the counts of genotypes AA, Aa, and aa, respectively, in the phenotype 1 class, and let a21, a22, and a23 be the counts of genotypes AA, Aa, and aa, respectively, in the phenotype 2 class.
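The six counts a11 to a23 for one SNP can be obtained in a single pass over the specimens. The sketch below is an illustration under stated assumptions: phenotype classes are encoded as the integers 1 and 2, genotypes as the strings "AA", "Aa", and "aa", and the function name is hypothetical. None of these encodings is specified in the text.

```python
from collections import Counter

GENOTYPES = ("AA", "Aa", "aa")  # assumed string encoding of the genotypes


def count_by_class_and_genotype(phenotypes, genotypes):
    """Return [[a11, a12, a13], [a21, a22, a23]] for a single SNP.

    phenotypes: per-specimen class labels, 1 or 2 (assumed encoding)
    genotypes:  per-specimen genotype of this SNP, in the same order
    """
    counts = Counter(zip(phenotypes, genotypes))
    # Missing (class, genotype) combinations count as 0.
    return [[counts[(cls, g)] for g in GENOTYPES] for cls in (1, 2)]


phenotypes = [1, 1, 1, 2, 2, 2, 2]
genotypes = ["AA", "Aa", "Aa", "aa", "aa", "AA", "Aa"]
print(count_by_class_and_genotype(phenotypes, genotypes))  # [[1, 2, 0], [1, 1, 2]]
```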

In the screening performed by the arithmetic unit as a preprocessing step, the i-th SNP is judged “not suitable for continued analysis” when a11, a12, a13, a21, a22, and a23 satisfy any one of the following conditions (I) to (IV). SNPs judged “not suitable for continued analysis” are excluded from the subsequent analysis.
(I) a11 + a12 ≤ 1 or a11 + a13 ≤ 1 or a12 + a13 ≤ 1 (Formula 1)
(II) a21 + a22 ≤ 1 or a21 + a23 ≤ 1 or a22 + a23 ≤ 1 (Formula 2)
(III) a11 = 0 and a23 = 0 (Formula 3)
(IV) a13 = 0 and a21 = 0 (Formula 4)
Conditions (I) and (II) specify that, within a phenotype class, the specimen count for two of the three genotypes is 0 or 1. They enumerate the cases in which, when the SNP is combined with the genotypes of another SNP, the specimen count becomes zero for two of the other SNP's three genotypes, so that the epistasis criteria described later clearly cannot be satisfied.

Condition (IV) corresponds to the case in which the index values OR1 and OR2, used in the “determination of dominant or recessive type” described later, cannot be calculated: when a13 = 0 and a21 = 0, their denominators are 0.

Condition (III), on the other hand, corresponds to the case in which a reliable “dominant/recessive determination” cannot be made: when a11 = 0 and a23 = 0, the numerators of the index values OR1 and OR2, used in the “determination of dominant or recessive type” described later, are 0, so that OR1 = 0 and OR2 = 0.
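The four screening conditions translate directly into a small predicate. The sketch below is illustrative: the function name and the nested-list layout of the counts are assumptions, while the comparisons follow Formulas 1 to 4 as written.

```python
def is_excluded(a):
    """True if the SNP is "not suitable for continued analysis".

    a = [[a11, a12, a13], [a21, a22, a23]]: genotype counts (AA, Aa, aa)
    for phenotype class 1 and class 2 (layout is an assumption).
    """
    (a11, a12, a13), (a21, a22, a23) = a
    cond_1 = a11 + a12 <= 1 or a11 + a13 <= 1 or a12 + a13 <= 1  # Formula 1
    cond_2 = a21 + a22 <= 1 or a21 + a23 <= 1 or a22 + a23 <= 1  # Formula 2
    cond_3 = a11 == 0 and a23 == 0                               # Formula 3
    cond_4 = a13 == 0 and a21 == 0                               # Formula 4
    return cond_1 or cond_2 or cond_3 or cond_4


print(is_excluded([[10, 20, 5], [8, 15, 12]]))  # False: no condition holds
print(is_excluded([[0, 1, 30], [5, 5, 5]]))     # True: a11 + a12 <= 1
print(is_excluded([[0, 20, 5], [8, 15, 0]]))    # True: a11 = 0 and a23 = 0
```

Because the predicate depends only on one SNP's six counts, it can be evaluated once per SNP and the result cached before any pairs are formed.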

In the storage step (4), first, when the SNP under analysis has been judged “suitable for continued analysis” in the calculation step (3), whether the i-th SNP is of the dominant type or the recessive type with respect to the phenotype is determined by statistical means based on the calculated phenotype-class-specific, genotype-specific counts.

In this determination, a11, a12, and a13 are the counts of genotypes AA, Aa, and aa, respectively, in the phenotype 1 class, and a21, a22, and a23 are the counts of genotypes AA, Aa, and aa, respectively, in the phenotype 2 class. OR1 and OR2 are defined as follows.
OR1 = (a11 + a12) x a23 / (a13 x (a21 + a22)) (Formula 14)
OR2 = a11 x (a22 + a23) / ((a12 + a13) x a21) (Formula 15)
OR1 and OR2 are compared. When OR1 is greater than or equal to OR2 (OR1 ≥ OR2), the SNP is judged to be of the dominant type (type 1), expressing that specimens carrying genotype AA or Aa tend to show the phenotype of the first class (for example, side effects present). When OR1 is less than OR2 (OR1 < OR2), the SNP is judged to be of the recessive type (type 2). The “suitability of continued analysis” and the dominant/recessive type are calculated and stored for every SNP from the 1st to the M-th.
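The dominant/recessive decision can be sketched as follows. This is an illustration, not the patented implementation: the function name and return format are hypothetical, and the code assumes that both denominators, a13 x (a21 + a22) and (a12 + a13) x a21, are non-zero. The text does not say how a single zero denominator is handled, so a `ZeroDivisionError` is simply allowed to propagate in that case.

```python
def classify_snp(a):
    """Return (type, OR1, OR2) for one SNP.

    a = [[a11, a12, a13], [a21, a22, a23]]: genotype counts (AA, Aa, aa)
    for phenotype class 1 and class 2 (layout is an assumption).
    Assumes non-zero denominators; otherwise ZeroDivisionError is raised.
    """
    (a11, a12, a13), (a21, a22, a23) = a
    or1 = (a11 + a12) * a23 / (a13 * (a21 + a22))   # Formula 14
    or2 = a11 * (a22 + a23) / ((a12 + a13) * a21)   # Formula 15
    snp_type = "dominant" if or1 >= or2 else "recessive"
    return snp_type, or1, or2


print(classify_snp([[20, 5, 5], [5, 10, 15]]))  # ('recessive', 5.0, 10.0)
```

Since the result depends on a single SNP, it can be computed once for each of the M SNPs and looked up when pairs are examined, which is the caching described earlier in the text.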

In calculation step (5), based on the "determination result regarding dominant or recessive type" obtained by the statistical means, a 2 x 2 contingency table classifying the i-th and j-th SNPs (j ≠ i; initial values i = 1, j = 2) by dominant or recessive type is created for each of the two phenotype classes, and the following indices for judging epistasis are calculated from the 2 x 2 tables.
R1 = (x₁₁x₂₂)/(x₁₂x₂₁) ≥ w₁ and R2 = (y₁₁y₂₂)/(y₁₂y₂₁) < 1/w₂ (Formula 5)

That is, to judge the "presence or absence of an epistasis effect" in step (5) for the combination of the i-th SNP and the j-th SNP under examination, the indices R1 = (x₁₁x₂₂)/(x₁₂x₂₁) and R2 = (y₁₁y₂₂)/(y₁₂y₂₁) are calculated according to the procedure below, and when the indices satisfy
R1 = (x₁₁x₂₂)/(x₁₂x₂₁) ≥ w₁ and R2 = (y₁₁y₂₂)/(y₁₂y₂₁) ≤ 1/w₂ (Formula 5)
it is determined that there is an epistasis effect.
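A minimal sketch of the Formula 5 test is given below. The function name `has_epistasis` and the nested-tuple layout of the 2 x 2 tables are assumptions made for illustration. The sketch uses the strict form R2 < 1/w₂ and, like the program listing in this document, rejects any table that contains a zero cell.

```python
def has_epistasis(x, y, w1, w2):
    """Apply Formula 5 to two 2x2 tables.

    x = ((x11, x12), (x21, x22)) for phenotype class 1,
    y = ((y11, y12), (y21, y22)) for phenotype class 2.
    Tables with a zero cell are rejected (assumption taken from the
    zero-cell skip in the program listing).
    """
    (x11, x12), (x21, x22) = x
    (y11, y12), (y21, y22) = y
    if 0 in (x11, x12, x21, x22, y11, y12, y21, y22):
        return False
    r1 = (x11 * x22) / (x12 * x21)  # cross-product ratio in class 1
    r2 = (y11 * y22) / (y12 * y21)  # cross-product ratio in class 2
    return r1 >= w1 and r2 < 1.0 / w2
```

With w₁ = 5 and w₂ = 49, a class 1 table ((5, 1), (1, 1)) gives R1 = 5 and a class 2 table ((1, 10), (10, 1)) gives R2 = 0.01, so this pair would be flagged.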

Here,
x₁₁ is the number of specimens of phenotype class 1 in which the i-th and j-th SNPs are both of the dominant type,
x₁₂ is the number of specimens in which the i-th SNP is of the dominant type and the j-th SNP is of the recessive type,
x₂₁ is the number of specimens in which the i-th SNP is of the recessive type and the j-th SNP is of the dominant type,
x₂₂ is the number of specimens in which the i-th and j-th SNPs are both of the recessive type,
y₁₁ is the number of specimens of phenotype class 2 in which the i-th and j-th SNPs are both of the dominant type,
y₁₂ is the number of specimens in which the i-th SNP is of the dominant type and the j-th SNP is of the recessive type,
y₂₁ is the number of specimens in which the i-th SNP is of the recessive type and the j-th SNP is of the dominant type, and
y₂₂ is the number of specimens in which the i-th and j-th SNPs are both of the recessive type.

In (Formula 5) above, x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, and y₂₁ are calculated according to the following procedure.

In (Formula 5), x₁₁, x₂₂, x₁₂, and x₂₁ are counts determined by the combination of the dominant or recessive type of the i-th SNP and the j-th SNP among specimens whose phenotype is class 1.

Similarly, y₁₁, y₂₂, y₁₂, and y₂₁ are counts determined by the combination of the dominant or recessive type of the i-th SNP and the j-th SNP among specimens whose phenotype is class 2.

The dominant type is a model in which genotypes AA and Aa are associated with phenotype class 1, and is written A1 = (AA, Aa), A2 = (aa). The recessive type is a model in which genotype aa is associated with phenotype class 1, and is written A1 = (AA), A2 = (Aa, aa). For the j-th SNP, the dominant type is the model in which genotypes BB and Bb are associated, written B1 = (BB, Bb), B2 = (bb), and the recessive type is the model in which genotype bb is associated with phenotype class 1, written B1 = (BB), B2 = (Bb, bb) (see FIG. 3).
For specimens whose phenotype is class 1, let c11 be the count of specimens having genotype AA at the i-th SNP and genotype BB at the j-th SNP, c12 the count of specimens having AA and Bb, and c13 the count of specimens having AA and bb. Similarly, let c21 be the count for Aa and BB, c22 for Aa and Bb, c23 for Aa and bb, c31 for aa and BB, c32 for aa and Bb, and c33 for aa and bb. These counts satisfy the following equation.
c11 + c12 + c13 + c21 + c22 + c23 + c31 + c32 + c33 = n1 (Formula 6)
For specimens whose phenotype is class 2, let d11 be the count of specimens having genotype AA at the i-th SNP and genotype BB at the j-th SNP, d12 the count of specimens having AA and Bb, and d13 the count of specimens having AA and bb.

Similarly, let d21 be the count for Aa and BB, d22 for Aa and Bb, d23 for Aa and bb, d31 for aa and BB, d32 for aa and Bb, and d33 for aa and bb. These counts satisfy the following equation.
d11 + d12 + d13 + d21 + d22 + d23 + d31 + d32 + d33 = n2 (Formula 7)
Based on the determination results for the dominant and recessive types, x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, and y₂₁ are given by the following four cases.
(i) The i-th SNP is of the dominant type and the j-th SNP is of the dominant type (see FIG. 4)
x₁₁ = c11 + c12 + c21 + c22, x₁₂ = c13 + c23, x₂₁ = c31 + c32, x₂₂ = c33,
y₁₁ = d11 + d12 + d21 + d22, y₁₂ = d13 + d23, y₂₁ = d31 + d32, y₂₂ = d33 (Formula 8)
(ii) The i-th SNP is of the dominant type and the j-th SNP is of the recessive type (see FIG. 5)
x₁₁ = c11 + c21, x₁₂ = c12 + c13 + c22 + c23, x₂₁ = c31, x₂₂ = c32 + c33,
y₁₁ = d11 + d21, y₁₂ = d12 + d13 + d22 + d23, y₂₁ = d31, y₂₂ = d32 + d33 (Formula 9)
(iii) The i-th SNP is of the recessive type and the j-th SNP is of the dominant type (see FIG. 6)
x₁₁ = c11 + c12, x₁₂ = c13, x₂₁ = c21 + c22 + c31 + c32, x₂₂ = c23 + c33,
y₁₁ = d11 + d12, y₁₂ = d13, y₂₁ = d21 + d22 + d31 + d32, y₂₂ = d23 + d33 (Formula 10)
(iv) The i-th SNP is of the recessive type and the j-th SNP is of the recessive type (see FIG. 7)
x₁₁ = c11, x₁₂ = c12 + c13, x₂₁ = c21 + c31, x₂₂ = c22 + c23 + c32 + c33,
y₁₁ = d11, y₁₂ = d12 + d13, y₂₁ = d21 + d31, y₂₂ = d22 + d23 + d32 + d33 (Formula 11)
From the values of x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, and y₂₁ given by whichever of cases (i) to (iv) applies, the indices (x₁₁x₂₂)/(x₁₂x₂₁) and (y₁₁y₂₂)/(y₁₂y₂₁) are calculated.
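The four cases of Formulas 8 to 11 amount to merging the heterozygote row or column of a 3 x 3 genotype table into one side or the other. The sketch below illustrates this under assumed conventions: the function name `collapse` is hypothetical, the table is a list of three rows (AA, Aa, aa) by three columns (BB, Bb, bb), and type 1 denotes the dominant type while type 2 denotes the recessive type.

```python
def collapse(table, type_i, type_j):
    """Collapse a 3x3 genotype table into a 2x2 table.

    table[r][c]: r indexes AA, Aa, aa of the i-th SNP,
                 c indexes BB, Bb, bb of the j-th SNP.
    type 1 (dominant) groups indices {0, 1} against {2};
    type 2 (recessive) groups index {0} against {1, 2}.
    """
    def groups(snp_type):
        return [(0, 1), (2,)] if snp_type == 1 else [(0,), (1, 2)]

    return [
        [sum(table[r][c] for r in row_group for c in col_group)
         for col_group in groups(type_j)]
        for row_group in groups(type_i)
    ]
```

Applying `collapse` once to the class 1 counts c11 to c33 and once to the class 2 counts d11 to d33 yields the x and y tables used in Formula 5.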

The validity of using the indices (x₁₁x₂₂)/(x₁₂x₂₁) and (y₁₁y₂₂)/(y₁₂y₂₁) rests on the fact that, in the log-linear model for the 2 x 2 x 2 contingency table formed by the phenotype and the two SNPs classified by dominant or recessive type, the maximum likelihood estimates (z₁₁, z₂₂, z₁₂, z₂₁, v₁₁, v₂₂, v₁₂, v₂₁) of (x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, y₂₁) under the hypothesis of no three-factor interaction satisfy the following equation.
log{(z₁₁z₂₂)/(z₁₂z₂₁)} - log{(v₁₁v₂₂)/(v₁₂v₂₁)} = 0 (Formula 16)

In the index (Formula 5) for judging epistasis, the "selectable range" of w₁ and w₂ is given by the following conditions.
n1-3 ≤ w₁ ≤ (n1/2-1)², n2-3 ≤ w₂ ≤ (n2/2-1)² (Formula 12)
For w₁, consider the cross-product ratio (x₁₁x₂₂)/(x₁₂x₂₁) for the n1 specimens of class 1. The denominator (x₁₂x₂₁) takes its minimum value at x₁₂ = 1, x₂₁ = 1. Under this condition, when x₁₁ = 1 or x₂₂ = 1, the cross-product ratio takes its minimum value, (n1-3).

Furthermore, under the condition x₁₂ = 1, x₂₁ = 1 at which (x₁₂x₂₁) is minimal, x₁₁ + x₂₂ = n1-2, so the maximum value that (x₁₁x₂₂) can take is obtained as (n1/2-1)² by considering the maximum of f(x₁₁, x₂₂) = x₁₁x₂₂ = x₁₁(n1-2-x₁₁). The range of w₂ is calculated in the same way.
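The bounds of Formula 12 can be checked by brute force for a given class size. The following is an illustrative sketch only; the function name `ratio_range` is hypothetical, and it assumes x₁₂ = x₂₁ = 1 with x₁₁ and x₂₂ both at least 1, as in the argument above.

```python
def ratio_range(n):
    """Return (min, max) of x11 * x22 when x12 = x21 = 1.

    With x12 = x21 = 1 the cross-product ratio equals x11 * x22,
    and x11 + x22 = n - 2 with x11, x22 >= 1.
    """
    products = [x11 * (n - 2 - x11) for x11 in range(1, n - 2)]
    return min(products), max(products)
```

For n1 = 8 the products are 5, 8, 9, 8, 5, so the range is 5 to 9, which matches n1-3 = 5 and (n1/2-1)² = 9. For odd n1 the maximum over integers falls slightly below (n1/2-1)², so the formula acts as an upper bound.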

Within the "selectable range" of w₁ and w₂ above, w₁ = n1-3 and w₂ = n2-3 correspond to the "most lenient condition", and w₁ = (n1/2-1)² and w₂ = (n2/2-1)² correspond to the "strictest condition". In the present invention, when a comparatively "lenient condition" is chosen, w₁ and w₂ are selected, for example, within the following range.

n1-3 ≤ w₁ ≤ (n1+√n1)-3, n2-3 ≤ w₂ ≤ (n2+√n2)-3 (Formula 17)
The "selectable range" of w₁ and w₂ is given by (Formula 12) above. However, the denominator of the cross-product ratio is not always minimal, that is, x₁₂ = 1 and x₂₁ = 1 do not always hold, so the minimum value of w₁ and the maximum value of 1/w₂ are used. That is, w₁ = n1-3 and w₂ = n2-3.

Here, since w₁ ≥ 0 and w₂ ≥ 0, the condition on the number of samples in each phenotype class is n1 ≥ 3 and n2 ≥ 3, and the theoretical lower limit of the number of individuals in the sample group is (n1 + n2) ≥ 6. When the proportions of class 1 and class 2 in the population are taken into account, their estimates are n1/(n1 + n2) and n2/(n1 + n2), respectively, so the lower limit of the number of individuals (n1 + n2) is given by (n1 + n2) ≥ 3[1 + {n1/(n1 + n2)}/{n2/(n1 + n2)}] when n1 > n2, and by (n1 + n2) ≥ 3[1 + {n2/(n1 + n2)}/{n1/(n1 + n2)}] when n1 < n2.
The way in which a possible "main effect" is judged is as follows.

For SNP1, the risks estimated for "class 1" are:
the overall average risk r₁ = (a11+a12+a13)/(a11+a12+a13+a21+a22+a23),
the risk for AA, r_1AA = a11/(a11+a21),
the risk for Aa, r_1Aa = a12/(a12+a22),
the risk for aa, r_1aa = a13/(a13+a23).
In that case, if, for example, r_1aa > r_1Aa > r₁ > r_1AA, that is, [a13/(a13+a23)] > [a12/(a12+a22)] > [a11/(a11+a21)], it is judged that there may be a "main effect".

Furthermore, if (r_1aa/r_1Aa) > (r_1Aa/r_1AA), that is, [a13/(a13+a23)]/[a12/(a12+a22)] > [a12/(a12+a22)]/[a11/(a11+a21)], it is judged highly likely that there is a clear "main effect".

In analysis step (6), when calculation step (5) has determined that there is an epistasis effect, the result for the two SNPs is stored. To move on to the next SNP, the j-th SNP is replaced by the (j+1)-th SNP, the process returns to step (4), the dominant or recessive type of the (j+1)-th SNP is determined by the statistical means, and the calculation of step (3) is repeated. When j+1 reaches M, the i-th SNP is replaced by the (i+1)-th SNP, the (i+2)-th SNP is selected as the j-th SNP, and the analysis continues.
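The traversal order described for step (6) visits every unordered pair (i, j) with i < j exactly once. A minimal sketch under assumed names follows: `scan_pairs` and the callback `test_pair` are hypothetical, and `valid_snps` stands for the indices of SNPs judged suitable for continued analysis.

```python
from itertools import combinations


def scan_pairs(valid_snps, test_pair):
    """Visit all pairs (i, j), i before j, and collect the hits.

    valid_snps: iterable of SNP indices kept after screening.
    test_pair:  callable (i, j) -> bool, e.g. the Formula 5 test.
    The order matches the text: j advances first, then i is
    incremented and j restarts at the SNP following i.
    """
    hits = []
    for i, j in combinations(valid_snps, 2):
        if test_pair(i, j):
            hits.append((i, j))
    return hits
```

As a toy usage, `scan_pairs([1, 2, 3, 4], lambda i, j: i + j == 5)` returns the pairs (1, 4) and (2, 3).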

In calculation step (7), when calculation step (5) has determined that there is an epistasis effect, the synergistic epistasis effect is confirmed by multivariate analysis means using logistic regression analysis.

A method for rapidly identifying epistasis effects in the absence of main effects, based on comprehensive genome-wide SNP information, according to an embodiment of the present invention is described below with reference to the drawings. The description takes as an example the case in which 60 breast cancer patients received preoperative chemotherapy with Taxol alone and two SNPs showing a synergistic epistasis effect related to the occurrence of the side effect of peripheral neuropathy are identified.

FIG. 1 illustrates, in flowchart form, the series of data analyses realized as described above through the cooperation of computer hardware and a computer program.

In the following example, the present inventors applied the data analysis method of the present invention to a search for genes having a synergistic epistasis effect related to side effects after preoperative chemotherapy in breast cancer patients who had given informed consent, and confirmed that the method is effective. Statistical verification of the identified SNP data by logistic regression analysis confirmed an epistasis effect. As a pioneering invention, the present invention makes it possible to identify, within a practical time, SNP pairs showing a synergistic epistasis effect even when there is no main effect, even for SNP data covering as many as one million loci.

Next, an example of the present invention is described concretely with reference to the figures showing the results. The example corresponds to one embodiment of the present invention, and the technical scope of the present invention is not limited to the specific forms illustrated in the example.

This example considers the problem of identifying a group of SNPs having an epistasis effect related to side effects. The aim is to identify, from data on cancer patients who gave informed consent and underwent preoperative chemotherapy with an anticancer drug, SNPs having an epistasis effect on the side effects of the drug, and to search for genes in order to examine the mechanism by which the side effects arise. In recent years, human single nucleotide polymorphisms have been typed in large numbers. For example, the Affymetrix SNP 6.0 array (registered trademark) uses allele-specific hybridization on a DNA chip as the typing method and can type 906,600 SNPs covering the entire genome in a total process time of 5 days per sample.

1. Materials and Methods
The specimens used in the analysis were 60 breast cancer cases that, with informed consent, underwent preoperative chemotherapy with Taxol alone (Taxol 80 mg/m²/q1w) at the Department of Breast Surgery of the Japanese Foundation for Cancer Research.

The side effect examined was peripheral neuropathy (numbness). Eight patients had peripheral neuropathy (CTC grade 2 or higher) and 52 did not (CTC grade 0 or 1).

The average call rate, the proportion of SNPs successfully typed, was 99.5%, and the concordance rate of typing results in duplicated samples was 99.98%. The data for 909,622 SNPs were divided into groups with and without peripheral neuropathy, and the difference in allele frequency between the two groups was tested by Fisher's exact test on a 2 x 2 contingency table. Thirty-three SNPs in 17 genes showed a correlation (p ≤ 0.0001) with peripheral neuropathy of CTC grade 2 or higher. These results are single-SNP analyses and can easily be calculated by conventional methods.

Next, combinations of two SNPs among the first 500,000 SNPs were analyzed using the data analysis method and data analysis system of the present invention. There are 124,999,750,000 ways of selecting two SNPs from 500,000 SNPs.
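The number of pairs quoted above is the binomial coefficient C(500000, 2) = 500000 x 499999 / 2. A short check, assuming Python 3.8 or later for `math.comb`; the variable names are illustrative:

```python
import math

# Number of unordered pairs that can be drawn from 500,000 SNPs.
n_snps = 500_000
n_pairs = math.comb(n_snps, 2)
print(n_pairs)  # 124999750000
```

The quadratic growth of this count is why screening out unsuitable SNPs before the pairwise scan matters for run time.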

Of the 909,622 SNPs, 409,169 satisfied at least one of conditions (i) to (iv) and were judged "unsuitable for continued analysis". This allowed a substantial reduction in computational load.

Here n1 = 8 and n2 = 52, and the values w₁ = n1-3 and w₂ = n2-3, which correspond to the "most lenient condition" within the "selectable range" of w₁ and w₂ described above, were used.

With w₁ = (8-3) = 5 and w₂ = (52-3) = 49, all combinations of two SNPs selected from the 500,000 SNPs were examined, and 211 SNP pairs were selected.

Let H1 denote the case judged as possibly "having a main effect":
H1: [a13/(a13+a23)] > [a12/(a12+a22)] > [a11/(a11+a21)]
and let H2 denote the case judged as highly likely to "have a clear main effect":
H2: [a13/(a13+a23)]/[a12/(a12+a22)] > [a12/(a12+a22)]/[a11/(a11+a21)]
The two SNPs of each of the 211 SNP pairs are denoted SNP-A and SNP-B, with
AH1 = 1 when H1 holds for SNP-A and AH1 = 0 otherwise, and
BH1 = 1 when H1 holds for SNP-B and BH1 = 0 otherwise.
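The two criteria H1 and H2 can be evaluated per SNP from the six genotype counts. The sketch below is illustrative only: the function name `main_effect_flags` is hypothetical, and it assumes that every genotype occurs in at least one specimen and that the risks for AA and Aa are nonzero, so that no division by zero occurs.

```python
def main_effect_flags(a11, a12, a13, a21, a22, a23):
    """Return (h1, h2) for one SNP.

    a1k: counts of AA, Aa, aa in class 1; a2k: the same in class 2.
    h1: risk(aa) > risk(Aa) > risk(AA)                   (H1)
    h2: risk(aa)/risk(Aa) > risk(Aa)/risk(AA)            (H2)
    Assumption: all denominators are nonzero.
    """
    r_AA = a11 / (a11 + a21)
    r_Aa = a12 / (a12 + a22)
    r_aa = a13 / (a13 + a23)
    h1 = r_aa > r_Aa > r_AA
    h2 = (r_aa / r_Aa) > (r_Aa / r_AA)
    return h1, h2
```

Evaluating this for SNP-A and SNP-B of each selected pair gives the 0/1 indicators that are cross-tabulated in the 2 x 2 summaries.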

Summarizing the combined results of AH1 and BH1 for the 211 SNP pairs in a 2 x 2 table gives Table 1.

As shown in Table 1, 3 pairs, only 1.42% of the total, were judged as possibly "having a main effect" for both SNPs. 84 pairs (39.8%) were judged as having no possibility of a "main effect" for either SNP. 124 pairs (58.7%) were judged as possibly "having a main effect" for only one SNP.

Therefore, of the 211 pairs obtained in this analysis, 98.6% were pairs in which at least one SNP was judged to have no possibility of a "main effect".
With conventional analysis methods it is difficult to identify the 84 pairs (about 40%) in which neither SNP was judged to possibly have a "main effect", which demonstrates the usefulness of the present method.

Similarly, summarizing the combined results of AH1 and BH1 for the 211 SNP pairs in a 2 x 2 table gives Table 2.

As shown in Table 2, 17 pairs (3.31% of the total) were judged as possibly having a clear "main effect" for both SNPs. 57 pairs (27.0%) were judged as having no possibility of a clear "main effect" for either SNP. 137 pairs (64.9%) were judged as possibly "having a main effect" for only one SNP.

Therefore, of the 211 pairs obtained in this analysis, 96.7% were pairs in which at least one SNP was judged to have no possibility of a "main effect".

With conventional analysis methods it is difficult to identify the 57 pairs (27%) in which neither SNP was judged to possibly have a "main effect", which demonstrates the usefulness of the present method.

Among these 211 SNP pairs, FIG. 9 shows an SNP pair with high predictive ability for which no main effect was observed but a synergistic epistasis effect was observed. In FIG. 9, SNP-A is the 69146th of the 909,622 SNPs and SNP-B is the 97440th. The risk is not high for the combinations SNP-A1 with SNP-B1, SNP-A1 with SNP-B2, and SNP-A2 with SNP-B1. Only when SNP-A2 and SNP-B2 occur together does the risk reach 0.8 or higher. Calculated from allele frequencies, this high-risk group corresponds to about 10% of the Japanese population. Because the probability of side effects is high when SNP-A2 and SNP-B2 occur together, use of this drug should be considered carefully in such cases, and the result can also contribute to future progress in personalized medicine.

The 211 SNP pairs selected by examining all combinations of two SNPs from the 500,000 SNPs are shown in Tables 3-1 to 3-211. Each of Tables 3-1 to 3-211 lists the selected SNP numbers, the values of R1 and R2 for which (Formula 5) indicated an epistasis effect, the two 3 x 3 tables by genotype for the groups with and without the side effect, and the two 2 x 2 tables obtained by collapsing them according to dominant or recessive type for the groups with and without the side effect.

Example of verification by logistic regression analysis
The data used are the 69146th SNP and the 97440th SNP, which appear in Table 3.

In the logistic regression analysis for verification, the interaction term SNP12 = SNP69146 · SNP97440 of the 69146th SNP (SNP69146) and the 97440th SNP (SNP97440) among the 909,622 SNPs was created as a model variable, and a model with the intercept and SNP12 as variables was fitted by the maximum likelihood method. The estimated intercept was -2.833 (standard error 0.59) and the regression coefficient of the interaction term SNP12 was 4.442 (standard error 1.246; 95% confidence interval 1.948 to 6.937), a statistically significant result with p = 0.00074. The fit of the logistic regression model was also statistically significant, with p = 0.00002.

In a further logistic regression analysis for verification, the variable for the 69146th SNP (SNP69146), the variable for the 97440th SNP (SNP97440), and the interaction term SNP12 were used as model variables, and their regression coefficients were estimated simultaneously by the maximum likelihood method. The estimated intercept was -2.079 (standard error 1.06). The regression coefficient representing the main effect of SNP69146 was -1.252 (standard error 1.470; 95% confidence interval -4.197 to 6.937), with p = 0.398, which is not statistically significant. The regression coefficient representing the main effect of SNP97440 was -0.629 (standard error 1.480; 95% confidence interval -3.594 to 2.337), with p = 0.672, which is also not statistically significant. The regression coefficient of the interaction term SNP12 was 5.570 (standard error 2.104; 95% confidence interval 1.355 to 9.786), a statistically significant result with p = 0.011. The fit of the logistic regression model was also statistically significant, with p = 0.00024.

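Furthermore, an example of the "analysis program" of the present invention is shown below. The part shown is an example of the main portion of the program of the present invention and corresponds to the process of steps (4) to (6), excluding the step of judging "suitability for continued analysis". The program below makes it possible to search for SNP pairs having a synergistic epistasis effect even when the SNPs have no main effect on the side effect.

Example of the "analysis program":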
program SNP
integer IG*4
dimension IDAT(1000000,60),Adata(1000000)
dimension IT(3,3,2) ,IS(60)
character Adata*20
OPEN (UNIT=1, FILE='D:\ptxPNP#Aold.txt')
OPEN (UNIT=2, FILE='D:\ptxPNP-out.txt')
OPEN (UNIT=3, FILE='D:\ptxPNP-out-begin.txt')

write(3,300)
300 FORMAT(1H ,'Start ')
CLOSE (UNIT = 3)

NN=60
DO 500 I=1,52
IS(I)=0
500 CONTINUE
DO 505 I=53,60
IS(I)=1
505 CONTINUE

c**************************************************
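c Initialize the SNP index and the hit counter before reading.
IG=1
NC=0
c Skip the two header lines, then read the class sizes n1 and n2.
read(1,*) Adata(IG)
read(1,*) Adata(IG)
read(1,*) n1,n2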
5 continue
read(1,*,end=99) Adata(IG),(IDAT(IG,K),K=1,NN)
200 FORMAT(1H ,F5.3,44I4,A20)

IG=IG+1
c IF(IG.GT.500000) GOTO 99
GOTO 5
99 CONTINUE
IGEND=IG-1

DO 10 IG=1,IGEND-1
DO 20 JG=IG+1,IGEND

DO 40 kk=1,2
DO 40 J=1,3
DO 40 I=1,3
IT(I,J,kk)=0
40 CONTINUE

DO 30 K=1,NN
IF(IDAT(IG,K).EQ.-10.OR.IDAT(JG,K).EQ.-10) GOTO 30
IF(IS(K).EQ.0) GOTO 33
c AE(+):IS(K)=1
IF(IDAT(IG,K).EQ.0.and.IDAT(JG,K).EQ.0) IT(1,1,1)=IT(1,1,1)+1
IF(IDAT(IG,K).EQ.0.and.IDAT(JG,K).EQ.1) IT(1,2,1)=IT(1,2,1)+1
IF(IDAT(IG,K).EQ.0.and.IDAT(JG,K).EQ.2) IT(1,3,1)=IT(1,3,1)+1
IF(IDAT(IG,K).EQ.1.and.IDAT(JG,K).EQ.0) IT(2,1,1)=IT(2,1,1)+1
IF(IDAT(IG,K).EQ.1.and.IDAT(JG,K).EQ.1) IT(2,2,1)=IT(2,2,1)+1
IF(IDAT(IG,K).EQ.1.and.IDAT(JG,K).EQ.2) IT(2,3,1)=IT(2,3,1)+1
IF(IDAT(IG,K).EQ.2.and.IDAT(JG,K).EQ.0) IT(3,1,1)=IT(3,1,1)+1
IF(IDAT(IG,K).EQ.2.and.IDAT(JG,K).EQ.1) IT(3,2,1)=IT(3,2,1)+1
IF(IDAT(IG,K).EQ.2.and.IDAT(JG,K).EQ.2) IT(3,3,1)=IT(3,3,1)+1
GOTO 30
c AE(-):IS(K)=0
33 CONTINUE
IF(IDAT(IG,K).EQ.0.and.IDAT(JG,K).EQ.0) IT(1,1,2)=IT(1,1,2)+1
IF(IDAT(IG,K).EQ.0.and.IDAT(JG,K).EQ.1) IT(1,2,2)=IT(1,2,2)+1
IF(IDAT(IG,K).EQ.0.and.IDAT(JG,K).EQ.2) IT(1,3,2)=IT(1,3,2)+1
IF(IDAT(IG,K).EQ.1.and.IDAT(JG,K).EQ.0) IT(2,1,2)=IT(2,1,2)+1
IF(IDAT(IG,K).EQ.1.and.IDAT(JG,K).EQ.1) IT(2,2,2)=IT(2,2,2)+1
IF(IDAT(IG,K).EQ.1.and.IDAT(JG,K).EQ.2) IT(2,3,2)=IT(2,3,2)+1
IF(IDAT(IG,K).EQ.2.and.IDAT(JG,K).EQ.0) IT(3,1,2)=IT(3,1,2)+1
IF(IDAT(IG,K).EQ.2.and.IDAT(JG,K).EQ.1) IT(3,2,2)=IT(3,2,2)+1
IF(IDAT(IG,K).EQ.2.and.IDAT(JG,K).EQ.2) IT(3,3,2)=IT(3,3,2)+1
30 CONTINUE

ISNP1A=IT(1,1,1)+IT(1,2,1)+IT(1,3,1)
ISNP2A=IT(2,1,1)+IT(2,2,1)+IT(2,3,1)
ISNP3A=IT(3,1,1)+IT(3,2,1)+IT(3,3,1)

ISNP1B=IT(1,1,2)+IT(1,2,2)+IT(1,3,2)
ISNP2B=IT(2,1,2)+IT(2,2,2)+IT(2,3,2)
ISNP3B=IT(3,1,2)+IT(3,2,2)+IT(3,3,2)

JSNP1A=IT(1,1,1)+IT(2,1,1)+IT(3,1,1)
JSNP2A=IT(1,2,1)+IT(2,2,1)+IT(3,2,1)
JSNP3A=IT(1,3,1)+IT(2,3,1)+IT(3,3,1)
JSNP1B=IT(1,1,2)+IT(2,1,2)+IT(3,1,2)
JSNP2B=IT(1,2,2)+IT(2,2,2)+IT(3,2,2)
JSNP3B=IT(1,3,2)+IT(2,3,2)+IT(3,3,2)

C For SNP1
OR1=FLOAT( (ISNP1A+ISNP2A)*ISNP3B ) /
+ ( (FLOAT(ISNP3A)+0.1)*(FLOAT(ISNP1B+ISNP2B)+0.1) )
OR2=FLOAT( ISNP1A*(ISNP2B+ISNP3B) ) /
+ ( (FLOAT(ISNP2A+ISNP3A)+0.1)*(FLOAT(ISNP1B)+0.1) )

IF(OR1.GE.OR2) THEN
Itype=1
ELSE
Itype=2
ENDIF
35 CONTINUE
OR3=FLOAT( (JSNP1A+JSNP2A)*JSNP3B ) /
+ ( (FLOAT(JSNP3A)+0.1)*(FLOAT(JSNP1B+JSNP2B)+0.1) )
OR4=FLOAT( JSNP1A*(JSNP2B+JSNP3B) ) /
+ ( (FLOAT(JSNP2A+JSNP3A)+0.1)*(FLOAT(JSNP1B)+0.1) )

IF(OR3.GE.OR4) THEN
Jtype=1
ELSE
Jtype=2
ENDIF

37 CONTINUE
IF(Itype.EQ.1.and.Jtype.EQ.1) THEN
X11=FLOAT(IT(1,1,1)+IT(1,2,1)+IT(2,1,1)+IT(2,2,1) )
X12=FLOAT(IT(1,3,1)+IT(2,3,1))
X21=FLOAT(IT(3,1,1)+IT(3,2,1))
X22=FLOAT(IT(3,3,1))
Y11=FLOAT(IT(1,1,2)+IT(1,2,2)+IT(2,1,2)+IT(2,2,2) )
Y12=FLOAT(IT(1,3,2)+IT(2,3,2))
Y21=FLOAT(IT(3,1,2)+IT(3,2,2))
Y22=FLOAT(IT(3,3,2))
GOTO 77
ENDIF
IF(Itype.EQ.1.and.Jtype.EQ.2) THEN
X11=FLOAT(IT(1,1,1)+IT(2,1,1))
X12=FLOAT(IT(1,2,1)+IT(1,3,1)+IT(2,2,1)+IT(2,3,1) )
X21=FLOAT(IT(3,1,1))
X22=FLOAT(IT(3,2,1)+IT(3,3,1))
Y11=FLOAT(IT(1,1,2)+IT(2,1,2))
Y12=FLOAT(IT(1,2,2)+IT(1,3,2)+IT(2,2,2)+IT(2,3,2) )
Y21=FLOAT(IT(3,1,2))
Y22=FLOAT(IT(3,2,2)+IT(3,3,2))
GOTO 77
ENDIF

IF(Itype.EQ.2.and.Jtype.EQ.1) THEN
X11=FLOAT(IT(1,1,1)+IT(1,2,1))
X12=FLOAT(IT(1,3,1))
X21=FLOAT(IT(2,1,1)+IT(2,2,1)+IT(3,1,1)+IT(3,2,1) )
X22=FLOAT(IT(2,3,1)+IT(3,3,1))
Y11=FLOAT(IT(1,1,2)+IT(1,2,2))
Y12=FLOAT(IT(1,3,2))
Y21=FLOAT(IT(2,1,2)+IT(2,2,2)+IT(3,1,2)+IT(3,2,2) )
Y22=FLOAT(IT(2,3,2)+IT(3,3,2))
GOTO 77
ENDIF

IF(Itype.EQ.2.and.Jtype.EQ.2) THEN
X11=FLOAT(IT(1,1,1))
X12=FLOAT(IT(1,2,1)+IT(1,3,1))
X21=FLOAT(IT(2,1,1)+IT(3,1,1))
X22=FLOAT(IT(2,2,1)+IT(2,3,1)+IT(3,2,1)+IT(3,3,1) )
Y11=FLOAT(IT(1,1,2))
Y12=FLOAT(IT(1,2,2)+IT(1,3,2))
Y21=FLOAT(IT(2,1,2)+IT(3,1,2))
Y22=FLOAT(IT(2,2,2)+IT(2,3,2)+IT(3,2,2)+IT(3,3,2) )
ELSE
write(2,*) "Itype error exist"
ENDIF
77 CONTINUE

ICHECK=0
RU=50.0
RD=1.0/50.0
X=FLOAT(ISNP1A+ISNP3A)
Y=FLOAT(ISNP1B+ISNP3B)

IF(X11*X12*X21*X22.EQ.0.0) GO TO 20
IF(Y11*Y12*Y21*Y22.EQ.0.0) GO TO 20

w1=n1-3
w2=n2-3
IF( X11*X22/(X12*X21).GE.w1
+ .and.Y11*Y22/( Y12*Y21 ).LT.w2)
+ ICHECK=1

IF(ICHECK.EQ.1) NC=NC+1
c
IF(ICHECK.EQ.1) THEN
WRITE(2,699) IG,JG
699 FORMAT(1H , 2I8)

WRITE(2,333)(X11+1.0)*(X22+1.0)/( (X12+1.0)*(X21+1.0) ),
+ (Y11+1.0)*(Y22+1.0)/( (Y12+1.0)*(Y21+1.0) )
333 FORMAT(1H ,"Check", 2F8.3)

write(2,1000) OR1,OR2,Itype
write(2,1010) OR3,OR4,Jtype
1000 FORMAT(1H , 'OR, Itype',2F6.1,2x,I3)
1010 FORMAT(1H , 'OR, Jtype',2F6.1,2x,I3)

DO 70 I=1,3
WRITE(2,700) (IT(I,J,1), J=1,3), (IT(I,J,2), J=1,3)
700 FORMAT(1H ,3I3, 3x,3I3)
WRITE(2,698)
698 FORMAT(1H ,' ' )
70 CONTINUE

WRITE(2,702) X11, X12, Y11, Y12
WRITE(2,702) X21, X22, Y21, Y22
702 FORMAT(1H ,2F3.0, 3x,2F3.0)
WRITE(2,698)
END IF

20 CONTINUE
10 CONTINUE

write(2,15) IG
write(6,15) IG
15 FORMAT(1H ,'READ DATA = ',I10)

write(2,25) NC
write(6,25) NC
25 FORMAT(1H ,'WRITE DATA = ',I15)

CLOSE (UNIT = 1)
CLOSE (UNIT = 2)

stop
end
Furthermore, the listing above is an example of the “analysis program” of the present invention. As an example of the main part of the program, it corresponds to the processing of steps (4) to (6), excluding the step that determines the “suitability of analysis continuation”. This program makes it possible to search for SNP pairs that have a synergistic epistasis effect on the occurrence of side effects even when the individual SNPs show no main effect.
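As a reading aid, the dominant/recessive decision made by the listing (OR1 against OR2 giving Itype, and likewise OR3 against OR4 giving Jtype) can be sketched in Python. The function name `odds_type` and its tuple arguments are hypothetical; only the two ratios, the 0.1 offset in each denominator factor, and the greater-or-equal comparison are taken from the listing. This is an illustrative sketch, not the program of the invention.

```python
def odds_type(case_counts, control_counts, eps=0.1):
    """Return (or1, or2, snp_type) for one SNP.

    case_counts / control_counts: counts of genotypes (0, 1, 2) among the
    specimens with IS = 1 and IS = 0 respectively (hypothetical layout).
    """
    a1, a2, a3 = case_counts
    b1, b2, b3 = control_counts
    # OR1 groups genotypes {0, 1} against {2}.
    or1 = ((a1 + a2) * b3) / ((a3 + eps) * (b1 + b2 + eps))
    # OR2 groups genotype {0} against {1, 2}.
    or2 = (a1 * (b2 + b3)) / ((a2 + a3 + eps) * (b1 + eps))
    # Same rule as the listing: type 1 when OR1 >= OR2, otherwise type 2.
    snp_type = 1 if or1 >= or2 else 2
    return or1, or2, snp_type
```

Type 1 groups genotypes 0 and 1 against genotype 2, and type 2 groups genotype 0 against genotypes 1 and 2, which matches the way the listing later collapses the 3x3 table into a 2x2 table.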

As shown in the above examples, taking the occurrence of drug side effects as an example phenotype, the data analysis method of the present invention can identify SNP combinations that carry a high probability of side effects and that cannot be detected from any single SNP. Caution can therefore be exercised where such SNP combinations are present, and the method can contribute to the future development of personalized medicine. It is also likely to provide knowledge for investigating the mechanism by which side effects occur, and so has high applicability in that it can contribute to the progress of genomic science.

Claims

A data analysis method that uses a computer to comprehensively identify, from genome-wide single nucleotide polymorphism (SNP) genotype data covering more than 500,000 sites, SNP pairs having a synergistic interaction (epistasis effect) on a phenotype with binary classes even when no main effect is confirmed, the method comprising:
(1) an input step of inputting genotype data for a total of M SNPs (M being more than 500,000) observed in N specimens belonging to one of two phenotype classes, together with the phenotype class corresponding to each specimen;
(2) a storage step of storing in storage means the phenotype classes of the N specimens and the genotype data for the M SNPs inputted through the input step (1);
(3) a calculation step of statistically processing, for the i-th SNP stored in the storage means by the storage step (2), the two phenotype classes and the genotype data of the N specimens to calculate the genotype counts for each class, and of performing screening, as a preprocessing step, that determines the suitability of continuing analysis of the i-th SNP based on the calculated minor allele counts;
(4) a storage step of, when the calculation step (3) determines that the SNP under analysis is “suitable for analysis continuation”, determining by statistical means, based on the calculated counts, whether the i-th SNP is of the dominant type or the recessive type with respect to the phenotype, and storing in an internal storage device the determination results on the suitability of analysis continuation and on the dominant/recessive type;
(5) a calculation step of creating, for each of the two phenotype classes and based on the dominant/recessive determination results obtained by the statistical means in step (4), a 2x2 contingency table for the i-th and j-th (j ≠ i) SNPs (initial values i = 1, j = 2) whose dominant or recessive types have been determined, calculating an index for determining epistasis from the created 2x2 contingency tables, and determining the presence or absence of an epistasis effect based on this index;
(6) an analysis step of, when the calculation step (5) determines that “an epistasis effect is present”, storing that determination result for the two SNPs, changing the j-th SNP to the (j+1)-th SNP, returning to step (4) to determine the dominant/recessive type of the (j+1)-th SNP by statistical means, and repeating the calculation of step (3), and, when j+1 reaches M, selecting the (i+1)-th SNP as the i-th SNP and the (i+2)-th SNP as the j-th SNP; and
(7) a calculation step of, when the calculation step (5) determines that “an epistasis effect is present”, confirming the synergistic epistasis effect by multivariate analysis means using logistic regression analysis.
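The pair traversal of step (6) visits each unordered pair (i, j) with i < j once, as the nested DO loops over IG and JG do in the program listing. A minimal Python sketch follows, in which `pair_count`, `scan_pairs` and the `has_epistasis` callback are hypothetical names; the callback merely stands in for steps (4) and (5).

```python
from itertools import combinations


def pair_count(m):
    """Number of unordered SNP pairs (i, j) with i < j among m SNPs."""
    return m * (m - 1) // 2


def scan_pairs(snp_ids, has_epistasis):
    """Visit every pair once and keep the pairs flagged by the callback.

    has_epistasis(i, j) is a hypothetical stand-in for the dominant/recessive
    determination and the 2x2 contingency-table test of steps (4) and (5).
    """
    return [(i, j) for i, j in combinations(snp_ids, 2) if has_epistasis(i, j)]
```

For M = 500,000 SNPs, `pair_count` gives 124,999,750,000 pairs, which indicates how inexpensive the per-pair test has to be for the search to finish in a practical time.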

The data analysis method according to claim 1, wherein, in the screening operation performed as the preprocessing step of step (3) to determine the suitability of continuing analysis of the i-th SNP, the suitability is determined according to the following procedure:
The genotype data of each SNP are classified into three types in the population, two homozygotes and one heterozygote, according to the two bases inherited from the mother and the father. The two homozygotes are denoted AA and aa, and the heterozygote is denoted Aa;
a11, a12, and a13 are the counts of genotypes AA, Aa, and aa in phenotype class 1, whose total number of specimens is n1, and a21, a22, and a23 are the counts of genotypes AA, Aa, and aa in phenotype class 2, whose total number of specimens is n2;
In the screening executed by the arithmetic device as the preprocessing step, an SNP is determined to be “unsuitable for analysis continuation” when a11, a12, a13, a21, a22, a23 satisfy any one of the following conditions (I) to (IV), and an SNP so determined is excluded from the analysis in step (4) and later:
(I) a11 + a12 ≤ 1 or a11 + a13 ≤ 1 or a12 + a13 ≤ 1 (Formula 1)
(II) a21 + a22 ≤ 1 or a21 + a23 ≤ 1 or a22 + a23 ≤ 1 (Formula 2)
(III) a11 = 0 and a23 = 0 (Formula 3)
(IV) a13 = 0 and a21 = 0 (Formula 4)
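Conditions (I) to (IV) can be checked directly from the six counts. A minimal Python sketch, assuming the counts are passed as two tuples (a11, a12, a13) and (a21, a22, a23); the function name `unsuitable` is hypothetical and the code is only an illustration of the stated conditions.

```python
def unsuitable(class1, class2):
    """Return True when an SNP meets any of conditions (I) to (IV).

    class1 = (a11, a12, a13): counts of AA, Aa, aa in phenotype class 1.
    class2 = (a21, a22, a23): counts of AA, Aa, aa in phenotype class 2.
    """
    a11, a12, a13 = class1
    a21, a22, a23 = class2
    cond_1 = a11 + a12 <= 1 or a11 + a13 <= 1 or a12 + a13 <= 1  # (Formula 1)
    cond_2 = a21 + a22 <= 1 or a21 + a23 <= 1 or a22 + a23 <= 1  # (Formula 2)
    cond_3 = a11 == 0 and a23 == 0                               # (Formula 3)
    cond_4 = a13 == 0 and a21 == 0                               # (Formula 4)
    return cond_1 or cond_2 or cond_3 or cond_4
```

An SNP for which `unsuitable` returns True would be dropped before the pairwise search, so it never enters a contingency table.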

The data analysis method according to claim 2, wherein, in determining the presence or absence of an epistasis effect in step (5) of claim 1, the following indices are calculated for the combination of the i-th SNP and the j-th SNP under determination:
Index: R1 = (x₁₁x₂₂)/(x₁₂x₂₁) and R2 = (y₁₁y₂₂)/(y₁₂y₂₁)
and it is determined that “an epistasis effect is present” when the calculated indices satisfy (Formula 5):
R1 = (x₁₁x₂₂)/(x₁₂x₂₁) ≥ w₁ and R2 = (y₁₁y₂₂)/(y₁₂y₂₁) ≤ 1/w₂ (Formula 5)
In (Formula 5), x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, y₂₁ are calculated according to the following procedure.
x₁₁, x₂₂, x₁₂, and x₂₁ are the counts determined by the combination of the dominant or recessive types of the i-th SNP and the j-th SNP in phenotype class 1.
Similarly, y₁₁, y₂₂, y₁₂, and y₂₁ are the counts determined by the combination of the dominant or recessive types of the i-th SNP and the j-th SNP in phenotype class 2.
The dominant type of the i-th SNP is a model in which the AA and Aa genotypes are related to phenotype class 1, and is written A1 = (AA, Aa), A2 = (aa). The recessive type of the i-th SNP is a model in which the aa genotype is related to phenotype class 1, and is written A1 = (AA), A2 = (Aa, aa). Likewise, the dominant type of the j-th SNP is a model in which the BB and Bb genotypes are related to phenotype class 1, and is written B1 = (BB, Bb), B2 = (bb). The recessive type of the j-th SNP is a model in which the bb genotype is related to phenotype class 1, and is written B1 = (BB), B2 = (Bb, bb).
For specimens of phenotype class 1, c11 is the count of specimens having genotype AA at the i-th SNP and genotype BB at the j-th SNP (that is, having AA and BB), c12 is the count of specimens having AA and Bb, and c13 is the count of specimens having AA and bb. Similarly, c21 is the count for Aa and BB, c22 for Aa and Bb, c23 for Aa and bb, c31 for aa and BB, c32 for aa and Bb, and c33 for aa and bb. These counts satisfy the following formula:
c11 + c12 + c13 + c21 + c22 + c23 + c31 + c32 + c33 = n1 (Formula 6)
For specimens of phenotype class 2, d11 is the count of specimens having genotype AA at the i-th SNP and genotype BB at the j-th SNP (that is, having AA and BB), d12 is the count of specimens having AA and Bb, and d13 is the count of specimens having AA and bb.
Similarly, d21 is the count for Aa and BB, d22 for Aa and Bb, d23 for Aa and bb, d31 for aa and BB, d32 for aa and Bb, and d33 for aa and bb. These counts satisfy the following formula:
d11 + d12 + d13 + d21 + d22 + d23 + d31 + d32 + d33 = n2 (Formula 7)
Based on the determination results for the dominant and recessive types, x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, y₂₁ are given by one of the following cases.
(i) The i-th SNP is the dominant type and the j-th SNP is the dominant type:
x₁₁ = c11 + c12 + c21 + c22, x₁₂ = c13 + c23, x₂₁ = c31 + c32, x₂₂ = c33,
y₁₁ = d11 + d12 + d21 + d22, y₁₂ = d13 + d23, y₂₁ = d31 + d32, y₂₂ = d33 (Formula 8)
(ii) The i-th SNP is the dominant type and the j-th SNP is the recessive type:
x₁₁ = c11 + c21, x₁₂ = c12 + c13 + c22 + c23, x₂₁ = c31, x₂₂ = c32 + c33,
y₁₁ = d11 + d21, y₁₂ = d12 + d13 + d22 + d23, y₂₁ = d31, y₂₂ = d32 + d33 (Formula 9)
(iii) The i-th SNP is the recessive type and the j-th SNP is the dominant type:
x₁₁ = c11 + c12, x₁₂ = c13, x₂₁ = c21 + c22 + c31 + c32, x₂₂ = c23 + c33,
y₁₁ = d11 + d12, y₁₂ = d13, y₂₁ = d21 + d22 + d31 + d32, y₂₂ = d23 + d33 (Formula 10)
(iv) The i-th SNP is the recessive type and the j-th SNP is the recessive type:
x₁₁ = c11, x₁₂ = c12 + c13, x₂₁ = c21 + c31, x₂₂ = c22 + c23 + c32 + c33,
y₁₁ = d11, y₁₂ = d12 + d13, y₂₁ = d21 + d31, y₂₂ = d22 + d23 + d32 + d33 (Formula 11)

From the x₁₁, x₂₂, x₁₂, x₂₁, y₁₁, y₂₂, y₁₂, y₂₁ given by whichever of cases (i) to (iv) applies according to the dominant/recessive determination results, the indices (x₁₁x₂₂)/(x₁₂x₂₁) and (y₁₁y₂₂)/(y₁₂y₂₁) are calculated.
Note that w₁ and w₂ in (Formula 5) are specified within the following range:
n1-3 ≤ w₁ ≤ (n1/2-1)², n2-3 ≤ w₂ ≤ (n2/2-1)² (Formula 12)
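Cases (i) to (iv) and the test of (Formula 5) can be illustrated with a short Python sketch. The names `collapse`, `odds_ratio` and `has_epistasis`, and the representation of the c and d counts as 3x3 nested lists with rows AA, Aa, aa and columns BB, Bb, bb, are assumptions made for illustration. Returning False when any collapsed cell is zero follows the program listing, which skips such pairs; the claim itself does not say how empty cells are handled.

```python
def collapse(table, i_type, j_type):
    """Collapse a 3x3 genotype table into (x11, x12, x21, x22).

    table[r][s]: count for row genotype r (AA, Aa, aa) of the i-th SNP and
    column genotype s (BB, Bb, bb) of the j-th SNP.
    i_type / j_type: "dominant" or "recessive".
    """
    first_rows = (0, 1) if i_type == "dominant" else (0,)  # genotypes in A1
    first_cols = (0, 1) if j_type == "dominant" else (0,)  # genotypes in B1
    cells = [0, 0, 0, 0]  # order: x11, x12, x21, x22
    for r in range(3):
        for s in range(3):
            k = (0 if r in first_rows else 2) + (0 if s in first_cols else 1)
            cells[k] += table[r][s]
    return tuple(cells)


def odds_ratio(cells):
    """Index (x11 * x22) / (x12 * x21) for a collapsed 2x2 table."""
    x11, x12, x21, x22 = cells
    return (x11 * x22) / (x12 * x21)


def has_epistasis(c, d, i_type, j_type, w1, w2):
    """Apply (Formula 5): R1 >= w1 in class 1 and R2 <= 1/w2 in class 2."""
    x = collapse(c, i_type, j_type)
    y = collapse(d, i_type, j_type)
    if 0 in x or 0 in y:  # assumption taken from the listing: skip empty cells
        return False
    return odds_ratio(x) >= w1 and odds_ratio(y) <= 1.0 / w2
```

Collapsing first and computing one ratio per class keeps the per-pair work to a handful of additions and two divisions.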

The data analysis method according to claim 3, wherein w₁ and w₂ in (Formula 5) of claim 3, which are specified within the range given in (Formula 12), are set to the most lenient condition shown in (Formula 13):
w₁ = n1-3, w₂ = n2-3 (Formula 13)
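The limits of (Formula 12) are simple functions of the class size. The Python sketch below evaluates them; `threshold_range` and `odds_ratio` are hypothetical names. One way to read the limits, offered here as an interpretation rather than something the claim states, is that when every cell of a 2x2 table holds at least one specimen, the table (n-3, 1, 1, 1) has odds ratio n-3 and the table (n/2-1, 1, 1, n/2-1) has odds ratio (n/2-1)².

```python
def threshold_range(n):
    """Return the (lower, upper) limits of (Formula 12) for a class of n specimens."""
    return n - 3, (n / 2 - 1) ** 2


def odds_ratio(x11, x12, x21, x22):
    """Index (x11 * x22) / (x12 * x21) of a 2x2 table."""
    return (x11 * x22) / (x12 * x21)
```

As an example only, class sizes of 8 and 52 (the split of the 60 specimens flagged in the program listing) give ranges of 5 to 9 and 49 to 625, so (Formula 13) selects w₁ = 5 and w₂ = 49.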

A data analysis system that uses a computer to comprehensively identify, from genome-wide single nucleotide polymorphism (SNP) genotype data covering more than 500,000 sites, SNP pairs having a synergistic interaction (epistasis effect) on a phenotype with binary classes even when no main effect is confirmed, the system comprising:
(1) input means for inputting genotype data for a total of M SNPs (M being more than 500,000) observed in N specimens belonging to one of two phenotype classes, together with the phenotype class corresponding to each specimen;
(2) storage means for storing the phenotype classes of the N specimens and the genotype data for the M SNPs inputted via the input means (1);
(3) calculation means for statistically processing, for the i-th SNP stored in the storage means (2), the two phenotype classes and the genotype data of the N specimens to calculate the genotype counts for each class, and for performing screening, as a preprocessing step, that determines the suitability of continuing analysis of the i-th SNP based on the calculated minor allele counts;
(4) storage means for, when the calculation means (3) determines that the SNP under analysis is “suitable for analysis continuation”, determining by statistical means, based on the calculated counts, whether the i-th SNP is of the dominant type or the recessive type with respect to the phenotype, and storing in an internal storage device the determination results on the suitability of analysis continuation and on the dominant/recessive type;
(5) calculation means for creating, for each of the two phenotype classes and based on the dominant/recessive determination results obtained by the statistical means, a 2x2 contingency table for the i-th and j-th (j ≠ i; initial values i = 1, j = 2) SNPs whose dominant or recessive types have been determined, calculating an index for determining epistasis from the created 2x2 contingency tables, and determining the presence or absence of an epistasis effect based on this index;
(6) analysis means for, when the calculation means (5) determines that “an epistasis effect is present”, storing that determination result for the two SNPs, changing the j-th SNP to the (j+1)-th SNP, returning to the processing of the storage means (4) to determine the dominant/recessive type of the (j+1)-th SNP by statistical means, and repeating the calculation of the calculation means (3), and, when j+1 reaches M, selecting the (i+1)-th SNP as the i-th SNP and the (i+2)-th SNP as the j-th SNP; and
(7) calculation means for, when the calculation means (5) determines that “an epistasis effect is present”, confirming the synergistic epistasis effect by multivariate analysis means using logistic regression analysis.

A program that causes a computer to execute the method according to any one of claims 1 to 4.