JP4069208B2

JP4069208B2 - Gene interaction estimation method, gene interaction estimation program, gene interaction estimation device, binding site estimation method, binding site estimation program, and binding site estimation device

Info

Publication number: JP4069208B2
Application number: JP2006198485A
Authority: JP
Inventors: 邦彦平石; 洋文土居
Original assignee: Japan Advanced Institute of Science and Technology
Current assignee: Japan Advanced Institute of Science and Technology
Priority date: 2006-07-20
Filing date: 2006-07-20
Publication date: 2008-04-02
Anticipated expiration: 2026-07-20
Also published as: JP2008027151A

Description

The present invention relates to a gene interaction estimation method, a gene interaction estimation program, and a gene interaction estimation device that, when a gene is controlled by a factor, estimate its effect on other genes controlled by that factor. It also relates to a binding site estimation method, a binding site estimation program, and a binding site estimation device suitable for use with that method, program, and device.

With recent progress in genome science, DNA microarray technology, which quantitatively measures the expression intensity of many genes in a given environment, has made it possible to acquire expression data for many genes simultaneously. Research on gene function is likewise moving from the analysis of individual genes toward understanding how intracellular activity arises from interactions among genes.

As shown in FIG. 27, for example, a gene has a control region in its upstream portion, and the binding of a regulatory factor to that region promotes or suppresses expression of the gene. When expression of a gene produces a factor (a protein), that factor in turn affects the expression of other genes. The problem of estimating what effect each gene has on other genes, including itself, is called the gene interaction estimation problem.

Approaches based on model estimation using only expression data have been proposed for gene interaction estimation. That is, they use only the expression data obtained from DNA microarrays and estimate the model that best explains that data.

First, a Boolean network based on Boolean algebra has been proposed as one such approach (see, for example, Non-Patent Document 1). A Boolean network is a model resembling a synchronously operating logic circuit. In the model shown in FIG. 28, for example, each node (vertex) corresponds to one gene, and each vertex takes the state "0" (not expressed) or "1" (expressed). The states of all vertices change synchronously at each unit time, and the control rules for gene expression are expressed as Boolean functions. In this approach, time-series expression data is given, and one solves the problem of estimating the Boolean network that minimizes the mismatch between the data and the model. However, this approach cannot handle quantitative expression intensities.
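As a minimal sketch of the synchronous update and data/model mismatch described above, assuming a hypothetical three-gene network (the gene names and rules below are invented for illustration and are not taken from Non-Patent Document 1):

```python
from typing import Callable, Dict, List

State = Dict[str, int]

# Hypothetical control rules (illustrative only): the next state of each gene
# is a Boolean function of the current states of the genes.
RULES: Dict[str, Callable[[State], int]] = {
    "g1": lambda s: s["g3"],                  # g1 is activated by g3
    "g2": lambda s: s["g1"] & (1 - s["g3"]),  # g2 needs g1 and the absence of g3
    "g3": lambda s: 1 - s["g2"],              # g3 is repressed by g2
}

def step(state: State) -> State:
    """Update every vertex synchronously from the same previous state."""
    return {gene: rule(state) for gene, rule in RULES.items()}

def mismatches(series: List[State]) -> int:
    """Count gene/time pairs where the model disagrees with the observed series."""
    return sum(
        step(prev)[gene] != nxt[gene]
        for prev, nxt in zip(series, series[1:])
        for gene in RULES
    )

# Invented binarized time series (0 = not expressed, 1 = expressed).
observed = [
    {"g1": 1, "g2": 0, "g3": 0},
    {"g1": 0, "g2": 1, "g3": 1},
    {"g1": 1, "g2": 0, "g3": 0},
]
print(mismatches(observed))
```

Estimating a network then amounts to searching over candidate rule sets for the one with the fewest mismatches. Because every state is 0 or 1, the sketch also makes the stated limitation visible: graded expression intensities have no representation.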

An approach using a Bayesian network has also been proposed (see, for example, Non-Patent Document 2). A probabilistic model with a graph structure, in which random variables are represented by vertices and links are drawn between variables that have a dependency such as causation or correlation, is called a graphical model. Among these, as shown in FIG. 29 for example, a model in which the direction of each link is the direction of causation and no path following the links forms a cycle is called a Bayesian network. A Bayesian network represents the qualitative dependencies between random variables by the graph structure and the quantitative dependencies by the conditional probabilities defined between those variables. In this approach, the expression intensity of each gene is treated as a random variable, and the dependencies among them are estimated as a Bayesian network. However, multiple models that can explain the same data may exist.

An approach using graphical Gaussian modeling has also been proposed (see, for example, Non-Patent Document 3). The problem with clustering by correlation coefficients is that it cannot distinguish among correlation arising from direct interaction, correlation arising from interaction mediated by the expression of several other genes, and spurious correlation between two genes that are both affected by the expression of a common gene. If the expression pattern of n genes in a given environment is assumed to be a sample from an n-dimensional normal distribution, spurious correlation can be excluded according to whether the partial correlation coefficient is "0". This is the approach using graphical Gaussian modeling. However, when genes with very similar expression behavior exist, the correlation coefficient matrix becomes singular and the partial correlation coefficients cannot be calculated. Clustering is therefore required as preprocessing, and dependencies can be estimated only at the level of gene groups.
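The role of the partial correlation coefficient can be illustrated with a small sketch. It is a simplification under stated assumptions: it uses the closed-form first-order partial correlation for three variables rather than the n-dimensional formulation of Non-Patent Document 3, and the expression values are invented so that genes x and y are each driven by a common gene z.

```python
import math
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom < 1e-12:
        # A gene that is (nearly) identical to z makes the problem degenerate.
        raise ValueError("degenerate correlation structure")
    return (r_xy - r_xz * r_yz) / denom

# Invented data: x and y are z plus mutually independent perturbations.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 2, 5, 6, 5, 6, 9]
y = [2, 1, 4, 3, 4, 7, 6, 9]

print(round(pearson(x, y), 3))               # high raw correlation between x and y
print(round(abs(partial_corr(x, y, z)), 3))  # essentially zero once z is accounted for
```

Calling `partial_corr(x, z, z)` raises `ValueError`, a three-variable analogue of the singular correlation matrix that arises when two genes behave almost identically.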

Furthermore, an approach using a dynamic system model has been proposed (see, for example, Non-Patent Document 4). This approach assumes a system model based on differential equations or the like, formulates parameter estimation as an optimization problem whose objective function is the degree of agreement with the expression data, and solves it with a genetic algorithm. However, its computational cost is high, so a network of several thousand genes cannot be solved at once and processing such as partitioning or hierarchization is required. The approach may also output a model that explains even the noise, which raises the further problem of how to exclude noise.

Thus, although DNA microarrays are a powerful tool for analyzing gene interactions, all of the approaches described above use only expression data. They therefore suffer from the large amount of noise contained in DNA microarray expression data, from the fact that, for reasons of experimental cost, only a few hundred experimental data sets are available against several thousand genes, and from the inability to obtain the relationship between transcription factors and transcription factor binding sites. As a result, their estimation accuracy is not high.

To solve this problem, approaches have been proposed that derive the relationship between a transcription factor and the group of genes it controls by combining DNA microarray expression data with analysis of the patterns contained in the control regions upstream of genes (see, for example, Non-Patent Documents 5 and 6 and Patent Document 1).

Specifically, Non-Patent Document 5 discloses a method that uses series expression data and examines the correlation between the expression pattern of a transcription factor and a composite of the expression patterns of genes that share a common motif in their control regions, thereby excluding the effects of combinations of multiple transcription factors and estimating the relationship between transcription factors and binding sites. Non-Patent Document 6 discloses a method that clusters genes by the correlation between the expression pattern of a transcription factor and the expression pattern of each gene, and then finds motifs. Patent Document 1 discloses a method in which a computer estimates regulatory relationships between genes based on gene expression level data and gene sequence data. In that method, an early-expressed gene and a late-expressed gene are arbitrarily selected from time-series data of gene expression levels, the cross-correlation function of the two genes is obtained, a test of no correlation is performed, and the two genes are recorded if the null hypothesis is rejected. This procedure is repeated until all selections of early-expressed and late-expressed genes have been exhausted. Genes whose upstream sequences contain a large number of consensus regulatory sequences of the transcription factor are then collated against the recorded early-expressed and late-expressed genes; when they match, the two genes are recorded and are presumed to have a regulatory relationship.

Non-Patent Document 1: Tatsuya Akutsu, "Gene Network Estimation Algorithm", Mathematical Sciences (Suri Kagaku), June 1999 issue, Saiensu-sha Co., Ltd., 1999, No. 432, pp. 40-46
Non-Patent Document 2: N. Friedman, M. Linial, I. Nachman and D. Pe'er, "Using Bayesian networks to analyze expression data", Journal of Computational Biology, vol. 7, no. 3/4, 2000, pp. 601-620
Non-Patent Document 3: H. Toh, K. Horimoto, "Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling", Bioinformatics, vol. 18, no. 2, 2002, pp. 287-297
Non-Patent Document 4: Masahiro Okamoto, "Estimation of gene interaction by S-system", Genome Information Biology, Nakayama Shoten Co., Ltd., 2000, pp. 165-188
Non-Patent Document 5: K. Birnbaum, P. N. Benfey and D. E. Shasha, "cis Element/Transcription Factor Analysis (cis/TF): A Method for Discovering Transcription Factor/cis Element Relationships", Genome Res., vol. 11, no. 9, September 2001, pp. 1567-1573
Non-Patent Document 6: Z. Zhu, Y. Pilpel and G. M. Church, "Computational Identification of Transcription Factor Binding Sites via a Transcription-factor-centric Clustering (TFCC) Algorithm", J. Mol. Biol., vol. 318, 2002, pp. 71-81
Patent Document 1: JP 2003-141123 A

The expression level of each gene changes for various reasons, but the change in the expression intensity of a transcription factor is extremely small compared with background variation such as noise. The conventional methods described in Non-Patent Documents 5 and 6 and Patent Document 1 are all fundamentally based on correlations between expression data, and are therefore strongly affected by noise.

In addition, many transcription factors are known to be expressed constantly and to be controlled by subsequent actions, and the conventional methods cannot be applied to such transcription factors.

Furthermore, JP 2000-342257 A discloses a technique that can clarify the transcriptional control of a gene by exploring the internal structure of an enhancer or promoter. To estimate the binding-site structure of the regulatory factors of a gene, this technique first selects a gene for which the regulatory structure of the binding sites is to be estimated; these binding sites lie within the enhancer or promoter region, located upstream or downstream of the coding region of the gene, to which transcription factor proteins bind. For the binding sites of regulatory factors involved with the gene in that enhancer or promoter region, or of regulatory factors introduced hypothetically, it constructs a computational model whose parameters are the locus and other factors in gene expression. The model is used to calculate the transcription level of the gene, and a parameter search algorithm searches for model parameters that reproduce the experimentally observed expression of the selected gene, thereby estimating the microstructure of the enhancer or promoter. However, this technique does not estimate the regulatory relationships between genes themselves.

The present invention has been made in view of these circumstances. Its object is to provide a gene interaction estimation method, a gene interaction estimation program, and a gene interaction estimation device that can estimate gene interactions with high accuracy while still using DNA microarrays, together with a binding site estimation method, a binding site estimation program, and a binding site estimation device suitable for use with them.

As a result of intensive research on the gene interaction estimation problem, that is, the problem of estimating the effect that a gene controlled by a factor has on other genes controlled by that factor, the applicant arrived at the idea of fusing three independent estimation methods: a first filtering process that uses the correlation of expression data, a second filtering process that is an entirely new technique for estimating binding sites, and a third filtering process that uses gene expression data obtained under an experimental manipulation. The applicant further devised a highly novel technique as the specific processing to be performed in the second filtering process, and thereby arrived at the present invention.

That is, the gene interaction estimation method according to the present invention, which achieves the object described above, estimates, when a gene a is controlled by a factor F, the effect on other genes b controlled by the factor F. The method comprises an input step, a gene group selection step, a window similarity calculation step, a region search step, a gene group extraction step, and a second gene group extraction step.

In the input step, DNA microarray disruption-strain data, obtained by performing gene disruption as a specific experimental manipulation, and DNA base sequence data, which represents base sequences, are input to a computer. In the gene group selection step, a processor of the computer reads the DNA microarray disruption-strain data input in the input step and selects a gene group B consisting of a predetermined number of other genes b, taken in descending order of the absolute value of their correlation coefficient with the designated gene a in the DNA microarray disruption-strain data.

In the window similarity calculation step, the processor processes each pair consisting of the data of gene a and the data of a gene b of the gene group B selected in the gene group selection step, based on the DNA base sequence data input in the input step. The control region of gene a is taken as a character string, the location of the i-th character from its head is called position i, and the contiguous region of predetermined length l starting at each position i is cut out as a window W_a[i]. Likewise, the control region of each gene b of gene group B is taken as a character string, the location of the j-th character from its head is called position j, and the contiguous region of predetermined length l starting at each position j is cut out as a window W_b[j]. For the character string data contained in the windows W_a[i] and W_b[j], the processor calculates, at each position within the windows, all similarities between character strings of a predetermined length within a range shifted forward and backward by a predetermined number of positions, and defines the maximum of these as the similarity at that position. The window similarity is the sum of these similarities over the positions in the window. Based on window similarity data consisting of the window similarities calculated for each gene b of gene group B, the processor then calculates, for each position i in the control region of gene a, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of gene group B.

In the region search step, the processor reads the data of each gene b of the gene group B selected in the gene group selection step, together with the maximum value data consisting of Max_aB[i] and the average value data consisting of Avg_aB[i] calculated in the window similarity calculation step. Within the control region of gene a, it obtains a first region in which Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i. Based on the window similarity data of each gene b of gene group B, it also considers the maximum window similarity Max_ab[i] of each gene b as the position i in the control region of gene a varies, takes the positions that give local maxima as peak positions, and obtains a second region that contains such a peak position. It then searches for regions R that are both the first region and the second region.

In the gene group extraction step, the processor reads the data of each gene b of gene group B, the window similarity data of each gene b of gene group B, and region data consisting of the regions R obtained in the region search step. For each region R, it obtains and extracts a gene group B_R consisting of those genes b of gene group B that have a peak position in the region R and whose similarity is higher than a predetermined value. In the second gene group extraction step, the processor uses the ratio of change in expression intensity between the wild strain, to which the experimental manipulation has not been applied, and the disrupted strain, to which it has been applied, based on the DNA microarray disruption-strain data input in the input step. For each region R, it extracts as a gene group B_R^F only those genes of the gene group B_R extracted in the gene group extraction step whose expression intensity changes greatly under the experimental manipulation.
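The window similarity calculation can be sketched as follows. This is only an illustration under assumptions that the passage does not fix: the similarity between two short strings is taken to be the number of matching characters, the sub-string length `k` and the shift range `shift` are arbitrary defaults, Max_aB[i] and Avg_aB[i] are read as the maximum and mean of Max_ab[i] over the genes b of group B, and every control region is assumed to be at least `length` characters long. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

def kmer_similarity(s: str, t: str) -> int:
    """Assumed similarity: number of positions where two strings agree."""
    return sum(c == d for c, d in zip(s, t))

def window_similarity(wa: str, wb: str, k: int = 3, shift: int = 1) -> int:
    """Sum over positions of the best k-mer match within +/- shift positions."""
    total = 0
    for p in range(len(wa) - k + 1):
        best = 0
        lo, hi = max(0, p - shift), min(len(wb) - k, p + shift)
        for q in range(lo, hi + 1):
            best = max(best, kmer_similarity(wa[p:p + k], wb[q:q + k]))
        total += best  # similarity of position p is the maximum over shifts
    return total

def max_ab(seq_a: str, seq_b: str, length: int, k: int = 3, shift: int = 1) -> List[int]:
    """Max_ab[i]: best window similarity of W_a[i] against all windows W_b[j]."""
    wins_b = [seq_b[j:j + length] for j in range(len(seq_b) - length + 1)]
    return [
        max(window_similarity(seq_a[i:i + length], wb, k, shift) for wb in wins_b)
        for i in range(len(seq_a) - length + 1)
    ]

def max_and_avg(seq_a: str, group_b: Dict[str, str], length: int) -> Tuple[List[int], List[float]]:
    """Max_aB[i] and Avg_aB[i], taken across the genes b of group B (assumed reading)."""
    per_gene = [max_ab(seq_a, seq_b, length) for seq_b in group_b.values()]
    max_aB = [max(col) for col in zip(*per_gene)]
    avg_aB = [sum(col) / len(col) for col in zip(*per_gene)]
    return max_aB, avg_aB

# Invented control-region strings for gene a and two genes of group B.
group = {"b1": "TTACGT", "b2": "TTTTTT"}
print(max_and_avg("ACGTAC", group, length=4))
```

Positions i where Max_aB[i] stands well above Avg_aB[i] would be natural candidates for the first region; how "statistically exceptionally large" is decided is not specified in this passage and is left out of the sketch.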

Likewise, the gene interaction estimation program according to the present invention, which achieves the object described above, is a computer-executable program that estimates, when a gene a is controlled by a factor F, the effect on other genes b controlled by the factor F. The program causes the computer to function as input means, gene group selection means, window similarity calculation means, region search means, gene group extraction means, and second gene group extraction means.

The input means inputs DNA microarray disruption-strain data, obtained by performing gene disruption as a specific experimental manipulation, and DNA base sequence data, which represents base sequences. The gene group selection means reads the DNA microarray disruption-strain data input by the input means and selects a gene group B consisting of a predetermined number of other genes b, taken in descending order of the absolute value of their correlation coefficient with the designated gene a in the DNA microarray disruption-strain data.

The window similarity calculation means processes each pair consisting of the data of gene a and the data of a gene b of the gene group B selected by the gene group selection means, based on the DNA base sequence data input by the input means. The control region of gene a is taken as a character string, the location of the i-th character from its head is called position i, and the contiguous region of predetermined length l starting at each position i is cut out as a window W_a[i]. Likewise, the control region of each gene b of gene group B is taken as a character string, the location of the j-th character from its head is called position j, and the contiguous region of predetermined length l starting at each position j is cut out as a window W_b[j]. For the character string data contained in the windows W_a[i] and W_b[j], the means calculates, at each position within the windows, all similarities between character strings of a predetermined length within a range shifted forward and backward by a predetermined number of positions, and defines the maximum of these as the similarity at that position. The window similarity is the sum of these similarities over the positions in the window. Based on window similarity data consisting of the window similarities calculated for each gene b of gene group B, the means then calculates, for each position i in the control region of gene a, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of gene group B.

The region search means reads the data of each gene b of the gene group B selected by the gene group selection means, together with the maximum value data consisting of Max_aB[i] and the average value data consisting of Avg_aB[i] calculated by the window similarity calculation means. Within the control region of gene a, it obtains a first region in which Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i. Based on the window similarity data of each gene b of gene group B, it also considers the maximum window similarity Max_ab[i] of each gene b as the position i in the control region of gene a varies, takes the positions that give local maxima as peak positions, and obtains a second region that contains such a peak position. It then searches for regions R that are both the first region and the second region.

The gene group extraction means reads the data of each gene b of gene group B, the window similarity data of each gene b of gene group B, and region data consisting of the regions R obtained by the region search means. For each region R, it obtains and extracts a gene group B_R consisting of those genes b of gene group B that have a peak position in the region R and whose similarity is higher than a predetermined value. The second gene group extraction means uses the ratio of change in expression intensity between the wild strain, to which the experimental manipulation has not been applied, and the disrupted strain, to which it has been applied, based on the DNA microarray disruption-strain data input by the input means. For each region R, it extracts as a gene group B_R^F only those genes of the gene group B_R extracted by the gene group extraction means whose expression intensity changes greatly under the experimental manipulation.
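The gene group selection and the second gene group extraction are the two filters that operate on expression data, and they can be sketched together. The sketch rests on assumptions not stated in the text: the correlation is taken to be the Pearson coefficient, a "large" change is taken to be at least a two-fold change (`min_log2_ratio=1.0`), and all function names and expression values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_gene_group(target: str, expr: Dict[str, Sequence[float]], top_n: int) -> List[str]:
    """Gene group B: the top_n genes by |correlation| with the target gene a."""
    scores = {g: abs(pearson(expr[target], v)) for g, v in expr.items() if g != target}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def filter_by_change_ratio(genes: List[str], wild: Dict[str, float],
                           disrupted: Dict[str, float],
                           min_log2_ratio: float = 1.0) -> List[str]:
    """Keep genes whose wild/disrupted expression ratio changes strongly (assumed threshold)."""
    return [g for g in genes
            if abs(math.log2(disrupted[g] / wild[g])) >= min_log2_ratio]

# Invented expression profiles across four disruption-strain experiments.
expr = {
    "a":  [1, 2, 3, 4],
    "b1": [2, 4, 6, 8],   # perfectly correlated with a
    "b2": [4, 3, 2, 1],   # perfectly anti-correlated with a
    "b3": [1, 3, 2, 4],   # moderately correlated with a
    "b4": [2, 1, 1, 2],   # uncorrelated with a
}
group_b = select_gene_group("a", expr, top_n=3)

# Invented wild-strain and disrupted-strain intensities for the selected genes.
wild = {"b1": 100.0, "b2": 80.0, "b3": 50.0}
disrupted = {"b1": 25.0, "b2": 90.0, "b3": 200.0}
print(group_b, filter_by_change_ratio(sorted(group_b), wild, disrupted))
```

Using the absolute value of the correlation keeps both activated and repressed candidates in group B, and the log ratio treats a four-fold decrease and a four-fold increase symmetrically.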

Furthermore, a gene interaction estimation apparatus according to the present invention that achieves the above object is a gene interaction estimation apparatus that estimates, when a gene a is controlled by a factor F, the influence on another gene b controlled by the factor F, the apparatus comprising: input means for inputting DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences; gene group selection means that reads the DNA microarray disrupted strain data input by the input means and selects a gene group B consisting of a predetermined number of the other genes b that rank highest in the absolute value of the correlation coefficient of the DNA microarray disrupted strain data with the designated gene a; window similarity calculation means that, for each pair consisting of the data of the gene a and the data of each gene b of the gene group B selected by the gene group selection means, and based on the DNA base sequence data input by the input means, cuts out as a window W_a[i] the contiguous region of predetermined length l starting at each position i in the control region of the gene a, where position i denotes the location of the i-th character from the head of that control region extracted as a character string, and cuts out as a window W_b[j] the contiguous region of predetermined length l starting at each position j in the control region of each gene b of the gene group B, where position j denotes the location of the j-th character from the head of that control region extracted as a character string, that calculates, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted forward and backward by a predetermined number of positions at each position in the windows, defines the maximum of the calculated similarities as the similarity of that position, and calculates a window similarity as the sum of the similarities over the positions in the window, and that calculates, based on window similarity data consisting of the calculated window similarities of each gene b of the gene group B, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of the gene a; region search means that reads the data of each gene b of the gene group B selected by the gene group selection means together with maximum value data consisting of the maximum values Max_aB[i] and average value data consisting of the average values Avg_aB[i] calculated by the window similarity calculation means, that finds, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i, that finds, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, a second region containing a peak position, a peak position being a position at which the maximum window similarity Max_ab[i] of a gene b of the gene group B, viewed as a function of the position i in the control region of the gene a, attains a local maximum, and that searches for regions R belonging to both the first region and the second region; gene group extraction means that reads the data of each gene b of the gene group B selected by the gene group selection means, the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, and region data consisting of the regions R obtained by the region search means, and that, for each region R, finds and extracts from the genes b of the gene group B a gene group B_R consisting of the genes b that have a peak position in that region R and a similarity higher than a predetermined value; and second gene group extraction means that, using the change ratio of expression intensity between the wild strain not subjected to the experimental operation and the disrupted strain subjected to the experimental operation, based on the DNA microarray disrupted strain data input by the input means, extracts for each region R, as a gene group B_R^F, only those genes of the gene group B_R extracted by the gene group extraction means whose expression intensity changes greatly under the experimental operation.

In the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention, introducing the concept of window similarity makes it possible to find local similarity rather than similarity over the entire control regions of genes a and b.
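The window similarity described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the similarity between two fixed-length strings is assumed here to be the number of matching characters, and the names `kmer_similarity`, `window_similarity`, `k`, and `shift` are hypothetical.

```python
def kmer_similarity(s: str, t: str) -> int:
    """Assumed measure: number of positions at which two equal-length strings agree."""
    return sum(1 for x, y in zip(s, t) if x == y)


def window_similarity(wa: str, wb: str, k: int = 4, shift: int = 1) -> int:
    """Sum, over positions p in window wa, of the best k-mer match found in wb
    within +/- shift positions of p (the per-position similarity)."""
    n = min(len(wa), len(wb))
    total = 0
    for p in range(n - k + 1):
        a_kmer = wa[p:p + k]
        best = 0
        # Compare against every k-mer of wb whose start lies within the shift range.
        for q in range(max(0, p - shift), min(n - k, p + shift) + 1):
            best = max(best, kmer_similarity(a_kmer, wb[q:q + k]))
        total += best
    return total


def windows(control_region: str, length: int) -> list[str]:
    """All windows W[i] of the given length, one per start position i."""
    return [control_region[i:i + length]
            for i in range(len(control_region) - length + 1)]
```

Allowing a small shift lets a motif that is displaced by a base or two between the two windows still contribute to the score, which is the point of taking the maximum at each position.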

Here, in the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention, when the window similarity is calculated, the similarity to any character string whose frequency of occurrence is statistically exceptionally high is set to 0. The applicant analyzed the frequency of occurrence of oligonucleotides in control regions and newly found that oligonucleotides whose frequency of occurrence is statistically exceptionally high are unlikely to be binding strings. By reflecting this finding in the calculation of region similarity, the method, program, and apparatus according to the present invention can reduce the likelihood that similar regions are found outside binding sites.
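A minimal sketch of this masking step is given below. The text does not state the statistical criterion at this point, so a simple z-score threshold on oligonucleotide counts is assumed; `overrepresented_kmers`, `masked_similarity`, and the parameter `z` are hypothetical names introduced for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def overrepresented_kmers(regions: list[str], k: int, z: float = 2.0) -> set[str]:
    """Return k-mers whose count across all control regions exceeds the mean
    count by more than z population standard deviations (assumed criterion)."""
    counts = Counter(r[p:p + k] for r in regions for p in range(len(r) - k + 1))
    if not counts:
        return set()
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {kmer for kmer, c in counts.items() if (c - mu) / sigma > z}


def masked_similarity(s: str, t: str, masked: set[str]) -> int:
    """Character-match similarity, forced to 0 when either string is masked."""
    if s in masked or t in masked:
        return 0
    return sum(x == y for x, y in zip(s, t))
```

In use, the masked set would be computed once over all control regions and then passed to the per-position similarity, so that very common oligonucleotides never raise a window's score.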

In addition, the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention calculate, based on window similarity data consisting of the window similarities of each gene b of the gene group B, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of the gene a. They then read the data of each gene b of the gene group B together with maximum value data consisting of the maximum values Max_aB[i] and average value data consisting of the average values Avg_aB[i]; find, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i; find, based on the window similarity data of each gene b of the gene group B, a second region containing a peak position, a peak position being a position at which the maximum window similarity Max_ab[i] of a gene b of the gene group B, viewed as a function of the position i in the control region of the gene a, attains a local maximum; and search for regions R belonging to both the first region and the second region. If genes a and b share a common binding site, that site is likely to lie within the window starting at such a peak position, so searching for regions R that contain these peak positions narrows down the binding site candidates.
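The region search can be sketched as below, assuming the per-gene profiles Max_ab[i] have already been computed. The outlier test (a z-score of Max_aB[i] against the values of all genes b at position i) and the `margin` around each peak are assumptions made for illustration; `peak_positions` and `search_regions` are hypothetical names.

```python
from statistics import mean, pstdev


def peak_positions(profile: list[float]) -> list[int]:
    """Indices at which the profile has a strict local maximum."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] < profile[i] > profile[i + 1]]


def search_regions(max_ab: dict[str, list[float]], z: float = 1.5, margin: int = 0):
    """max_ab maps each gene b to its profile Max_ab[i] over positions i of gene a.
    Returns (Max_aB, Avg_aB, positions belonging to both the first and second region)."""
    genes = list(max_ab)
    n = len(max_ab[genes[0]])
    columns = [[max_ab[g][i] for g in genes] for i in range(n)]
    max_aB = [max(col) for col in columns]
    avg_aB = [mean(col) for col in columns]

    # First region: Max_aB[i] stands out from the distribution at position i.
    first = set()
    for i, col in enumerate(columns):
        sd = pstdev(col)
        if sd > 0 and (max_aB[i] - avg_aB[i]) / sd > z:
            first.add(i)

    # Second region: positions within `margin` of a peak of some gene's profile.
    second = set()
    for g in genes:
        for p in peak_positions(max_ab[g]):
            second.update(range(max(0, p - margin), min(n, p + margin + 1)))

    return max_aB, avg_aB, sorted(first & second)
```

Note that with a population standard deviation the z-score of the maximum of m values cannot exceed sqrt(m - 1), so the threshold `z` has to be chosen with the size of gene group B in mind.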

Furthermore, the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention read the data of each gene b of the gene group B, the window similarity data of each gene b of the gene group B, and region data consisting of the regions R, and, for each region R, find and extract from the genes b of the gene group B a gene group B_R consisting of the genes b that have a peak position in that region R and a similarity higher than a predetermined value. This makes it possible to extract, from the gene group B, the gene group B_R whose members have a similarity of at least the predetermined value within the control region.
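As a rough sketch of this extraction, assume each region R is given as an inclusive pair of positions and each gene b has a profile Max_ab[i]. The names `has_strong_peak` and `extract_gene_group`, and the choice to compare the peak value itself with the threshold, are assumptions made for illustration.

```python
def has_strong_peak(profile: list[float], start: int, end: int, threshold: float) -> bool:
    """True if the profile has a strict local maximum inside [start, end]
    whose value is higher than the threshold."""
    lo, hi = max(start, 1), min(end, len(profile) - 2)
    return any(
        profile[i - 1] < profile[i] > profile[i + 1] and profile[i] > threshold
        for i in range(lo, hi + 1)
    )


def extract_gene_group(max_ab: dict[str, list[float]],
                       region: tuple[int, int],
                       threshold: float) -> list[str]:
    """Genes b of gene group B that qualify for B_R in the given region R."""
    start, end = region
    return [gene for gene, profile in max_ab.items()
            if has_strong_peak(profile, start, end, threshold)]
```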

In this way, the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention perform these processes as a second filtering process, exhaustively searching for genes whose control regions resemble that of the designated gene a without assuming any specific pattern. A region of high similarity found in this way can be presumed to be a site to which a common factor binds.

Moreover, prior to calculating the window similarity, the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention read the DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and select a gene group B consisting of a predetermined number of the other genes b that rank highest in the absolute value of the correlation coefficient of the DNA microarray disrupted strain data with the designated gene a. In other words, the correlation coefficient of the DNA microarray disrupted strain data is used only to filter the genes that are under common control. By performing this correlation-based processing of the expression data as a first filtering process, ahead of the computationally expensive analysis of the control regions (the second filtering process), the computation time of the second filtering process can be reduced substantially.
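The first filtering process can be illustrated with the sketch below. It assumes that the correlation coefficient is the Pearson coefficient computed between expression profiles across the disrupted strains; the data layout (`expr` mapping each gene to a list of values) and the names `pearson` and `select_gene_group` are hypothetical.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_gene_group(expr: dict[str, list[float]], gene_a: str, top_n: int) -> list[str]:
    """Gene group B: the top_n genes ranked by |correlation| with gene a."""
    scores = {g: abs(pearson(expr[gene_a], v))
              for g, v in expr.items() if g != gene_a}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Ranking by the absolute value keeps both positively and negatively correlated genes, which matches the fact that a shared factor may act as either an activator or a repressor.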

In addition, the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention use the change ratio of expression intensity between the wild strain not subjected to the experimental operation and the disrupted strain subjected to the experimental operation, based on the DNA microarray disrupted strain data, to extract for each region R, as a gene group B_R^F, only those genes of the gene group B_R whose expression intensity changes greatly under the experimental operation. Specifically, they find, based on the DNA microarray disrupted strain data, the group of factors that strongly affected the gene a and, for each factor F found, extract as the gene group B_R^F those genes of the gene group B_R that were strongly affected by the factor F, based on the change ratio of expression intensity in the disrupted strain of that factor F. Thus, as a third filtering process, only those genes of the gene group B_R extracted by the second filtering process whose expression intensity changes greatly under a specific experimental operation are selected, so that a factor generated by the experimental operation, or an unknown factor strongly affected by it, can be estimated as the transcription factor.
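A minimal sketch of this third filtering process follows. It assumes the change ratio is the expression intensity in the disrupted strain divided by that in the wild strain, and that "changes greatly" means the absolute log2 ratio reaches a threshold; neither the criterion nor the names `strongly_changed` and `extract_b_r_f` come from the text.

```python
from math import log2


def strongly_changed(ratio: float, threshold: float = 1.0) -> bool:
    """Assumed criterion: at least a 2**threshold-fold change in either direction."""
    return abs(log2(ratio)) >= threshold


def extract_b_r_f(change_ratio: dict[str, dict[str, float]],
                  gene_a: str,
                  b_r: list[str],
                  threshold: float = 1.0) -> dict[str, list[str]]:
    """change_ratio[F][g] is the disrupted/wild expression ratio of gene g when
    factor F is disrupted. Returns, for each factor F that strongly affects
    gene a, the genes of B_R that are also strongly affected (B_R^F)."""
    result = {}
    for factor, ratios in change_ratio.items():
        if not strongly_changed(ratios[gene_a], threshold):
            continue  # F did not strongly affect gene a
        result[factor] = [g for g in b_r if strongly_changed(ratios[g], threshold)]
    return result
```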

Furthermore, the gene interaction estimation method, gene interaction estimation program, and gene interaction estimation apparatus according to the present invention, based on the data of each gene of the gene group B_R^F for each region R and each factor F and on DNA base sequence data, which is data representing base sequences, take out, for the gene a and each gene of the gene group B_R^F, the character string data contained in the window starting at the peak position within each region R, and perform multiple alignment with respect to that peak position. When the existence of a common pattern is confirmed in this way, the estimation can be judged to have succeeded.

Furthermore, a binding site estimation method according to the present invention that achieves the above object is a binding site estimation method for estimating a binding site to which a common factor controlling a gene a and any other gene b binds, the method comprising: an input step of inputting into a computer DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences; a window similarity calculation step in which a processor of the computer, for each pair consisting of the data of the designated gene a and the data of each gene b of a gene group B selected based on the DNA microarray disrupted strain data input in the input step, and based on the DNA base sequence data input in the input step, cuts out as a window W_a[i] the contiguous region of predetermined length l starting at each position i in the control region of the gene a, where position i denotes the location of the i-th character from the head of that control region extracted as a character string, and cuts out as a window W_b[j] the contiguous region of predetermined length l starting at each position j in the control region of each gene b of the gene group B, where position j denotes the location of the j-th character from the head of that control region extracted as a character string, calculates, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted forward and backward by a predetermined number of positions at each position in the windows, defines the maximum of the calculated similarities as the similarity of that position, and calculates a window similarity as the sum of the similarities over the positions in the window, and calculates, based on window similarity data consisting of the calculated window similarities of each gene b of the gene group B, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of the gene a; a region search step in which the processor reads the data of each gene b of the selected gene group B together with maximum value data consisting of the maximum values Max_aB[i] and average value data consisting of the average values Avg_aB[i] calculated in the window similarity calculation step, finds, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i, finds, based on the window similarity data of each gene b of the gene group B calculated in the window similarity calculation step, a second region containing a peak position, a peak position being a position at which the maximum window similarity Max_ab[i] of a gene b of the gene group B, viewed as a function of the position i in the control region of the gene a, attains a local maximum, and searches for regions R belonging to both the first region and the second region; and an output step in which the processor outputs information indicating the binding site to which a factor common to the genes a and b binds, based on region data consisting of the regions R obtained in the region search step.

In addition, a binding site estimation program according to the present invention that achieves the above object is a computer-executable binding site estimation program for estimating a binding site to which a common factor controlling a gene a and any other gene b binds, the program causing the computer to function as: input means for inputting DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences; window similarity calculation means that, for each pair consisting of the data of the designated gene a and the data of each gene b of a gene group B selected based on the DNA microarray disrupted strain data input by the input means, and based on the DNA base sequence data input by the input means, cuts out as a window W_a[i] the contiguous region of predetermined length l starting at each position i in the control region of the gene a, where position i denotes the location of the i-th character from the head of that control region extracted as a character string, and cuts out as a window W_b[j] the contiguous region of predetermined length l starting at each position j in the control region of each gene b of the gene group B, where position j denotes the location of the j-th character from the head of that control region extracted as a character string, that calculates, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted forward and backward by a predetermined number of positions at each position in the windows, defines the maximum of the calculated similarities as the similarity of that position, and calculates a window similarity as the sum of the similarities over the positions in the window, and that calculates, based on window similarity data consisting of the calculated window similarities of each gene b of the gene group B, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of the gene a; region search means that reads the data of each gene b of the selected gene group B together with maximum value data consisting of the maximum values Max_aB[i] and average value data consisting of the average values Avg_aB[i] calculated by the window similarity calculation means, that finds, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i, that finds, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, a second region containing a peak position, a peak position being a position at which the maximum window similarity Max_ab[i] of a gene b of the gene group B, viewed as a function of the position i in the control region of the gene a, attains a local maximum, and that searches for regions R belonging to both the first region and the second region; and output means for outputting information indicating the binding site to which a factor common to the genes a and b binds, based on region data consisting of the regions R obtained by the region search means.

Furthermore, a binding site estimation apparatus according to the present invention that achieves the above object is a binding site estimation apparatus for estimating a binding site to which a common factor controlling a gene a and any other gene b binds, the apparatus comprising: input means for inputting DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences; window similarity calculation means that, for each pair consisting of the data of the designated gene a and the data of each gene b of a gene group B selected based on the DNA microarray disrupted strain data input by the input means, and based on the DNA base sequence data input by the input means, cuts out as a window W_a[i] the contiguous region of predetermined length l starting at each position i in the control region of the gene a, where position i denotes the location of the i-th character from the head of that control region extracted as a character string, and cuts out as a window W_b[j] the contiguous region of predetermined length l starting at each position j in the control region of each gene b of the gene group B, where position j denotes the location of the j-th character from the head of that control region extracted as a character string, that calculates, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted forward and backward by a predetermined number of positions at each position in the windows, defines the maximum of the calculated similarities as the similarity of that position, and calculates a window similarity as the sum of the similarities over the positions in the window, and that calculates, based on window similarity data consisting of the calculated window similarities of each gene b of the gene group B, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of the gene a; region search means that reads the data of each gene b of the selected gene group B together with maximum value data consisting of the maximum values Max_aB[i] and average value data consisting of the average values Avg_aB[i] calculated by the window similarity calculation means, that finds, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i, that finds, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, a second region containing a peak position, a peak position being a position at which the maximum window similarity Max_ab[i] of a gene b of the gene group B, viewed as a function of the position i in the control region of the gene a, attains a local maximum, and that searches for regions R belonging to both the first region and the second region; and output means for outputting information indicating the binding site to which a factor common to the genes a and b binds, based on region data consisting of the regions R obtained by the region search means.

In the binding site estimation method, binding site estimation program, and binding site estimation apparatus according to the present invention, introducing the concept of window similarity makes it possible to find local similarity rather than similarity over the entire control regions of genes a and b. Based on window similarity data consisting of the window similarities of each gene b of the gene group B, they calculate the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of the gene a; read the data of each gene b of the gene group B together with maximum value data consisting of the maximum values Max_aB[i] and average value data consisting of the average values Avg_aB[i]; find, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically exceptionally large in the distribution of the window similarities at position i; find, based on the window similarity data of each gene b of the gene group B, a second region containing a peak position, a peak position being a position at which the maximum window similarity Max_ab[i] of a gene b of the gene group B, viewed as a function of the position i in the control region of the gene a, attains a local maximum; and search for regions R belonging to both the first region and the second region. If genes a and b share a common binding site, that site is likely to lie within the window starting at such a peak position, so searching for regions R that contain these peak positions narrows down the binding site candidates.

According to the present invention, rather than examining how commonly individual character strings occur in the control regions, a partial region (window) of a certain length is treated as a collection of character strings, and whether windows are similar as a whole is also computed exhaustively. This makes it possible to handle expression control by combinations of character strings. Furthermore, the present invention reduces the influence of the DNA microarray data and of the ambiguity contained in binding patterns, so that gene interactions and binding sites can be estimated with high accuracy even though DNA microarrays are used.

Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings.

This embodiment is a gene interaction estimation method that, when a gene a is controlled (positively or negatively) by a factor F, estimates what influence is exerted on other genes b controlled by the factor F. In particular, by applying three independent filtering processes to the input data, this method can estimate gene interactions with high accuracy even though it uses DNA microarrays.

In the following, the description refers where appropriate to actual data, namely gene expression data of Bacillus subtilis, as the gene expression data obtained by applying experimental operations to the organism under estimation.

A gene interaction estimation apparatus that executes this gene interaction estimation method is generally realized as a computer whose processor executes a predetermined gene interaction estimation program. FIG. 1 shows the series of processes in the gene interaction estimation method to which the present invention is applied.

First, in this gene interaction estimation method, DNA microarray disrupted strain data, DNA base sequence data, and data on the gene a whose interactions are to be estimated are input to the gene interaction estimation apparatus. DNA microarray data are expression data of gene a obtained by DNA microarray technology, and DNA microarray disrupted strain data are one kind of DNA microarray data in which the ratio of expression levels between a disrupted strain and the wild strain is observed. A disrupted strain is a strain in which the target gene a has been disrupted by experimental techniques, and the wild strain is a strain in which gene a has not been disrupted. A strain is a separate individual of an organism with the same genetic traits. DNA is a substance in which a base sequence composed of four base characters (a, t, g, c) and its complementary sequence are bound to each other to form a double helix; DNA base sequence data are data representing the base sequence of one strand of that structure.

In this gene interaction estimation method, a first filtering process that uses the correlation of expression data is applied to this input in order to reduce the influence of background noise in the DNA microarray expression data. Specifically, in step S1, the gene group selection means of the gene interaction estimation apparatus reads the input DNA microarray disrupted strain data and selects a gene group B consisting of a predetermined number of other genes b whose DNA microarray disrupted strain data have the largest absolute correlation coefficients with the designated gene a. In this method, the top 300 genes by absolute value of the correlation coefficient with gene a are selected as the gene group B. For example, when gene a is "ahpC", the top 300 genes are selected in descending order of the absolute value of the correlation coefficient, such as "ahpC", "ahpF", "yurV", "yurW", "yurU", and so on, as shown in FIG. 2, and these form the gene group B.
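
The selection in step S1 can be sketched as follows. This is only an illustrative sketch, not the apparatus described here: the function names `pearson` and `select_gene_group`, the dictionary layout of `profiles` (gene name mapped to a list of expression ratios across disrupted strains), and the treatment of constant profiles are all assumptions. Pearson correlation is assumed because the text speaks only of a "correlation coefficient".

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / (sx * sy)


def select_gene_group(profiles, target, top_n=300):
    """Return the top_n gene names ranked by |correlation| with the target gene.

    The target itself is kept in the ranking, as in the "ahpC" example,
    where "ahpC" heads its own list.
    """
    ref = profiles[target]
    ranked = sorted(profiles,
                    key=lambda g: abs(pearson(ref, profiles[g])),
                    reverse=True)
    return ranked[:top_n]
```

Only the rank by absolute value is used here, which mirrors the idea that the correlation coefficient serves as a coarse filter rather than as a score in its own right.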

In other words, this method uses the correlation coefficient of the DNA microarray disrupted strain data only to filter genes that are subject to common control. Put differently, in order to narrow down the candidate genes subject to common control, the method does not rely on the exact value of the correlation coefficient; every gene whose correlation coefficient is at or above a certain value is retained as a candidate gene subject to common control.

Next, a second filtering process is applied to the gene group B extracted by the first filtering process, and binding sites are estimated by searching for similar partial regions within the control regions.

First, as part of the second filtering process, in step S2 of FIG. 1, the window similarity calculation means of the gene interaction estimation apparatus calculates the window similarity of the control regions for each pair consisting of gene a and a gene b of the gene group B, based on the DNA base sequence data.

Here, window similarity is a concept newly devised, as illustrated in FIG. 3, to find local similarity while tolerating small offsets between similar parts, rather than similarity over the entire control region of a gene. The method builds on the applicant's finding that a character string whose frequency of occurrence is specifically higher than its expected value does not become a binding site, and that a common factor does not necessarily bind even if such a string occurs in the control regions of both genes being compared. Accordingly, only the similarity of character strings likely to be binding sites is considered, and similarity between strings unrelated to gene regulation is ignored.

Specifically, as shown in FIG. 4, the window similarity is calculated as follows. A contiguous region of length l (for example, about 30) is cut out as a window from the control region of each of the two genes a and b being compared. For character strings (oligonucleotides) of length k (for example, about 6) with nearby start positions inside these windows, the maximum of their pairwise similarities is defined as the similarity at that position, and the window similarity is the sum of these similarities over all positions. The similarity between two oligonucleotides is defined as a monotonically increasing function of the number of matching characters. All similarities within a range shifted forward or backward by a few positions (for example, about 3) are calculated, and their maximum is defined as the similarity at that position. The method therefore performs an approximate motif alignment and can handle patterns that contain a variable-length arbitrary string in the middle, which are common in binding strings. Oligonucleotides whose frequency of occurrence is statistically specifically high are, based on the finding that they do not become binding sites, assigned a similarity of 0 and excluded from the calculation. Concretely, as shown in FIG. 4, let W_a[i] be the window starting at the i-th character of the control region of gene a. With a character string length of 6, the similarity wsim_{a,b}(i, j) of the window W_b[j] of gene b to the window W_a[i] is calculated as

$$\mathrm{wsim}_{a,b}(i,j) := \sum_{k=0}^{l-6} \mathrm{sim}_{a,b}(i+k,\; j+k)$$

where k in this formula is the summation index, not the oligonucleotide length.
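
The definition above can be written out directly as a small sketch. Several details are assumptions made only for illustration: the scoring function `oligo_score` (the text requires only a monotonically increasing function of the match count, so the squared match count is used here), the interpretation of the shift as moving the start position in gene b by up to ±3, and the `excluded` set standing in for the statistically over-represented oligonucleotides. The names `sim` and `wsim` follow the text; everything else is hypothetical.

```python
def oligo_score(x, y):
    """Similarity of two equal-length oligonucleotides.

    Assumption: the squared number of matching characters, which is
    monotonically increasing in the match count as the text requires.
    """
    matches = sum(1 for p, q in zip(x, y) if p == q)
    return matches * matches


def sim(a, b, i, j, k=6, shift=3, excluded=frozenset()):
    """Positional similarity: best oligo score with b's start shifted by up to +/- shift."""
    if i < 0 or i + k > len(a):
        return 0
    x = a[i:i + k]
    if x in excluded:
        return 0  # over-represented oligonucleotides contribute nothing
    best = 0
    for d in range(-shift, shift + 1):
        s = j + d
        if s < 0 or s + k > len(b):
            continue
        y = b[s:s + k]
        if y in excluded:
            continue
        best = max(best, oligo_score(x, y))
    return best


def wsim(a, b, i, j, l=30, k=6, shift=3, excluded=frozenset()):
    """Window similarity: sum of sim over the l - k + 1 oligo start positions."""
    return sum(sim(a, b, i + t, j + t, k, shift, excluded)
               for t in range(l - k + 1))
```

With k = 6 the sum runs over t = 0, ..., l - 6, matching the formula for wsim. Computing every window pair this way costs a factor of l more than necessary.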

The cost of computing the similarity for one window pair is proportional to the window length l, and the total number of window pairs is at most n^2, where n is the length of the entire control region of genes a and b. Computed naively, the cost over all window pairs would therefore be O(n^2 · l), because there are n^2 window pairs each costing l. In this method, by contrast, sim_{a,b}(i, j) for all positions i, j can be computed in advance in O(n^2), and the similarities of all window pairs can then be computed in O(n^2) using the relation wsim_{a,b}(i+1, j+1) = wsim_{a,b}(i, j) - sim_{a,b}(i, j) + sim_{a,b}(i+l-5, j+l-5). The similarities of all window pairs can thus be computed in O(n^2) independently of the window length l, which keeps the amount of computation very small.
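
The recurrence can be illustrated with a sketch that slides each window one step along its diagonal. The function name `all_window_similarities` and the table layout (a list of lists `S` with `S[i][j]` holding the precomputed sim value for start positions i and j) are assumptions for illustration only.

```python
def all_window_similarities(S, l=30, k=6):
    """Compute wsim for every window pair from a precomputed sim table S.

    Returns W with W[i][j] = sum of S[i+t][j+t] for t = 0 .. l-k.
    """
    span = l - k + 1  # number of terms per window (l - 5 when k = 6)
    rows = len(S)
    cols = len(S[0]) if S else 0
    out_r, out_c = rows - span + 1, cols - span + 1
    if out_r <= 0 or out_c <= 0:
        return []
    W = [[0] * out_c for _ in range(out_r)]
    for i in range(out_r):
        for j in range(out_c):
            if i == 0 or j == 0:
                # Start of a diagonal: sum the terms directly.
                W[i][j] = sum(S[i + t][j + t] for t in range(span))
            else:
                # Slide one step along the diagonal: drop the oldest
                # term and add the newest one.
                W[i][j] = (W[i - 1][j - 1]
                           - S[i - 1][j - 1]
                           + S[i - 1 + span][j - 1 + span])
    return W
```

Only the first row and first column are summed directly, which costs on the order of n · l; every other entry is obtained in constant time, so the total stays within O(n^2) whenever l does not exceed n.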

In this way, by introducing the concept of window similarity, the method finds local similarity rather than similarity over the entire control regions of genes a and b. When calculating the window similarity, the method also calculates, based on the window similarity data consisting of the window similarity of each gene b of the gene group B, the maximum value Max_aB[i] and average value Avg_aB[i] of the similarity of the gene group B for each position i in the control region of gene a, as well as the maximum value Max_ab[i] of the similarity of each gene b of the gene group B. Here, position i means the location of the i-th character from the beginning when the control region is taken as a character string, and the window of length l from position i means the substring from the i-th to the (i+l-1)-th character.
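
One way to read these three quantities is sketched below. The interpretation is an assumption: Max_ab[i] is taken as the best window similarity of gene b over all of its window positions j for a fixed window i of gene a, and Max_aB[i] and Avg_aB[i] are taken as the maximum and mean of Max_ab[i] over the genes b in group B. The function names and the dictionary layout are hypothetical.

```python
def per_gene_max(W):
    """Max_ab[i]: best window similarity of gene b against window i of gene a.

    W[i][j] is the window similarity between window i of gene a and
    window j of gene b.
    """
    return [max(row) for row in W]


def group_max_avg(max_ab_by_gene):
    """Max_aB[i] and Avg_aB[i] across all genes b in group B.

    max_ab_by_gene maps a gene name to its Max_ab list; all lists are
    assumed to have the same length.
    """
    genes = list(max_ab_by_gene)
    n = len(max_ab_by_gene[genes[0]])
    max_aB = [max(max_ab_by_gene[g][i] for g in genes) for i in range(n)]
    avg_aB = [sum(max_ab_by_gene[g][i] for g in genes) / len(genes)
              for i in range(n)]
    return max_aB, avg_aB
```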

Next, as part of the second filtering process, in step S3 of FIG. 1, the region search means of the gene interaction estimation apparatus reads the data of each gene b of the gene group B, the maximum value data consisting of Max_aB[i], and the average value data consisting of Avg_aB[i], and searches the control region of gene a for a region R in which Max_aB[i] is statistically specifically large within the distribution of window similarity at position i and which contains a peak position. More specifically, the method calculates the difference Max_aB[i] - Avg_aB[i] and searches the control region of gene a for a region R in which this difference is larger than a predetermined value and which contains a peak position. A peak position, indicated by the arrows in FIG. 5, is a position that gives a local maximum of Max_ab[i], the maximum window similarity of each gene b of the gene group B, when the position i in the control region of gene a is shifted back and forth, based on the window similarity data of each gene b. To restrict attention to peak positions whose similarity is at or above a predetermined threshold, the method considers only peak positions that lie in a contiguous region where the difference Max_aB[i] - Avg_aB[i] is larger than the predetermined value. If genes a and b share a common binding site, it is likely to lie within the window starting at such a peak position, so the method searches for a region R that contains the peak position.

Specifically, as shown in FIG. 6, the method plots the difference between the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity against the position i in the control region of gene a, and obtains the peak positions shown as filled points in the figure. When the method targets peak positions with a similarity of 18000 or more and searches for regions in which the difference Max_aB[i] - Avg_aB[i] is continuously 15000 or more and which contain a peak position, positions i = 4118056 to 4118096 are obtained as the region R, as shown by the hatched portion in FIG. 7.
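
The region search can be sketched as two small helpers. This is a hedged illustration: the names `find_peaks` and `find_regions`, the handling of plateaus and sequence ends, and the use of "greater than or equal to" for both thresholds (following the "18000 or more" and "15000 or more" wording of the example) are assumptions. Which series the peaks are taken from is left to the caller.

```python
def find_peaks(values, min_value):
    """Indices of local maxima (plateaus included) with value >= min_value."""
    peaks = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i + 1 < len(values) else float("-inf")
        if v >= min_value and v >= left and v >= right:
            peaks.append(i)
    return peaks


def find_regions(max_aB, avg_aB, peaks, min_diff):
    """Maximal runs where Max_aB[i] - Avg_aB[i] >= min_diff that contain a peak.

    Returns a list of (start, end) index pairs, both inclusive.
    """
    peak_set = set(peaks)
    regions, start = [], None
    n = len(max_aB)
    for i in range(n + 1):
        inside = i < n and (max_aB[i] - avg_aB[i]) >= min_diff
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            if any(p in peak_set for p in range(start, i)):
                regions.append((start, i - 1))
            start = None
    return regions
```

Runs that exceed the difference threshold but contain no peak are discarded, which corresponds to requiring that R lie in both the "first region" and the "second region".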

The difference Max_aB[i] - Avg_aB[i], rather than the maximum value Max_aB[i], is used to search for the region R for the following reason. As described above, oligonucleotides whose frequency of occurrence is statistically specifically high are assigned a similarity of 0 and excluded from the calculation, so the total number of character strings used to calculate the similarity differs from window to window. Consequently, a comparable window similarity may be obtained both when a window contains only a few character strings of high similarity and when it contains many character strings of low similarity. Because the purpose of the method is to select strings of high similarity, only the former case should be selected. The average value is higher in the latter case than in the former, so looking at the difference between Max_aB[i] and Avg_aB[i] distinguishes the two cases in a simple way. However, any approach other than the difference Max_aB[i] - Avg_aB[i] may also be applied, provided that it can select positions where Max_aB[i] is judged to be statistically specifically large within the distribution of window similarity at position i.

Once the region R has been obtained in this way, as part of the second filtering process, in step S4 of FIG. 1, the gene group extraction means of the gene interaction estimation apparatus reads the data of each gene b of the gene group B, the window similarity data of each gene b, and the region data consisting of the obtained regions R. For each region R, it obtains a gene group B_R consisting of the genes b of the gene group B that have a peak position within the region R and whose similarity is higher than a predetermined value. By performing step S4, the method thus extracts, from the gene group B extracted by the first filtering process, the gene group B_R whose control regions have a similarity at or above a predetermined value.
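
Step S4 can be sketched as a filter over the per-gene series Max_ab[i]. The function name `genes_in_region`, the dictionary layout, and the exact peak test (a local maximum, plateaus included, whose value is strictly above the threshold) are assumptions made for illustration.

```python
def genes_in_region(max_ab_by_gene, region, min_similarity):
    """Genes whose Max_ab series has a peak above min_similarity inside the region.

    region is an inclusive (start, end) pair of positions in gene a.
    """
    start, end = region
    selected = []
    for gene, series in max_ab_by_gene.items():
        for i in range(start, end + 1):
            left = series[i - 1] if i > 0 else float("-inf")
            right = series[i + 1] if i + 1 < len(series) else float("-inf")
            is_peak = series[i] >= left and series[i] >= right
            if is_peak and series[i] > min_similarity:
                selected.append(gene)
                break  # one qualifying peak is enough for this gene
    return selected
```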

Thus, as the second filtering process, the method exhaustively searches for genes having a control region similar to that of the designated gene a without assuming any specific pattern. When calculating the similarity, it excludes portions unlikely to be contained in a binding site, based on the statistical specificity of the frequency of occurrence of substrings. A region with a high similarity obtained in this way can be presumed to be a site to which a common factor binds.

After the gene group B_R has been obtained in step S4 of FIG. 1, the method estimates transcription factors using gene expression data obtained by applying experimental operations to the organism under estimation. By comparing the change in gene expression level between individuals to which no experimental operation was applied and individuals to which an experimental operation was applied, the direct effect of the experimental operation can be determined. Specifically, as a third filtering process, the second gene group extraction means of the gene interaction estimation apparatus selects, for each region R, only those genes of the gene group B_R extracted by the second filtering process whose expression intensity changes greatly under the experimental operation. Here, the factors that strongly influenced gene a are obtained from the DNA microarray disrupted strain data, which were obtained by performing gene disruption as the experimental operation. For each factor F so obtained, the genes of the gene group B_R that were strongly influenced by the factor F are extracted as the gene group B_R^F. The gene group B_R^F consists of genes for which the change in expression intensity of the disrupted strain relative to the wild strain, for the factor F, is large and also ranks high. The change ratio of expression intensity is expressed as the absolute value of log((expression intensity of the disrupted strain) / (expression intensity of the wild strain)).
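
The third filter can be sketched as follows. The change ratio follows the definition in the text; the natural logarithm, the function names, the input layout (gene name mapped to a pair of disrupted-strain and wild-strain intensities for the disruptant of the gene producing factor F), and the two cut-offs `min_ratio` and `max_rank` are assumptions, since the text states only that the change must be large and rank high.

```python
import math


def change_ratio(disrupted, wild):
    """|log(disrupted / wild)|; the base of the logarithm is an assumption."""
    return abs(math.log(disrupted / wild))


def strongly_affected(expr, candidates, min_ratio, max_rank):
    """Candidates whose change ratio is large and ranks high among all genes.

    expr maps gene -> (disrupted intensity, wild intensity).
    candidates is the gene group B_R; the result plays the role of B_R^F.
    """
    ratios = {g: change_ratio(d, w) for g, (d, w) in expr.items()}
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    rank = {g: r for r, g in enumerate(ranked, start=1)}
    return [g for g in candidates
            if g in ratios and ratios[g] >= min_ratio and rank[g] <= max_rank]
```

Taking the absolute value treats up-regulation and down-regulation symmetrically, so a fourfold decrease scores the same as a fourfold increase.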

For example, suppose that the gene group B_R shown in FIG. 8 has been extracted. In this case, as shown in FIG. 9, the factors produced by the disrupted genes in the disrupted strain data in which the expression intensity of "ahpC" as gene a changed are obtained as the factors that strongly influenced gene a. Then, as shown in FIG. 10, the change ratio of expression intensity in the disrupted strain of the gene "perR", which produces the obtained factor F "PerR", is determined. The numbers in parentheses in the figure indicate the rank of the magnitude of the change in expression intensity among the genes. As a result, "katA", "yfmJ", "mrgA", "hemA", "ponA", "ykvW", and "ydjL" can be extracted, as shown in bold in FIGS. 8 and 10, as the gene group B_R^F of genes in B_R that were strongly influenced by "PerR", one of the factors F that strongly influenced "ahpC" as gene a.

Thus, as the third filtering process, by selecting only those genes of the gene group B_R extracted by the second filtering process whose expression intensity changes greatly under a specific experimental operation, the method can estimate as a transcription factor either the factor generated by the experimental operation or an unknown factor strongly influenced by it.

Then, in step S5 of FIG. 1, the alignment means of the gene interaction estimation apparatus uses the data of each gene of the gene group B_R^F for each region R and each factor F, together with the input DNA base sequence data, to extract, for gene a and each gene of the gene group B_R^F, the character string data contained in the window starting at the matched position (the peak position within each region R), and performs a multiple alignment at the peak positions, as shown for example in FIG. 11. If the existence of a common pattern can be confirmed, the estimation is judged to have succeeded. As shown for example in FIG. 12, the method then outputs information indicating the factor F that controls gene a, the control direction (positive or negative), the gene group B_R^F of genes other than gene a that are controlled by the factor F, and the binding sites of gene a and of each gene of the gene group B_R^F, and the series of processes ends.
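
The window extraction in step S5 can be illustrated with a deliberately simplified sketch. It is not a multiple alignment algorithm: it merely stacks the windows cut out at each gene's peak position without gaps and reports a column-wise consensus, as a rough stand-in for checking whether a common pattern exists. The function names, the `min_fraction` parameter, and the use of "." for columns without agreement are all assumptions.

```python
from collections import Counter


def extract_windows(sequences, peak_positions, l=30):
    """Cut the window of length l starting at each gene's peak position."""
    return {g: sequences[g][peak_positions[g]:peak_positions[g] + l]
            for g in sequences}


def ungapped_consensus(windows, min_fraction=0.75):
    """Column-wise consensus of stacked windows; '.' where agreement is too low."""
    rows = list(windows.values())
    width = min(len(r) for r in rows)
    out = []
    for c in range(width):
        base, count = Counter(r[c] for r in rows).most_common(1)[0]
        out.append(base if count / len(rows) >= min_fraction else ".")
    return "".join(out)
```

A real analysis would pass the extracted windows to a dedicated multiple alignment tool, which can also introduce gaps.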

Thus, the method combines the results obtained from the three filtering processes to estimate, when gene a is controlled (positively or negatively) by the factor F, what influence is exerted on other genes b controlled by the factor F. Data of low reliability, such as the correlation coefficients of the DNA microarray disrupted strain data, are used only for coarse selection, whereas for data of high reliability, such as the experimental operation data, the values themselves are used as selection criteria. This improves the reliability of the results.

The description above used FIG. 2 and FIGS. 7 to 12 to show specific results for the case where "ahpC" as gene a is controlled by "PerR" as factor F. For reference, other experimental results obtained by actually executing this gene interaction estimation method are shown below.

First, specific experimental results for the case where "purA" is used as gene a are described.

In the experiment, as in step S1 of FIG. 1, the top 300 genes by absolute value of the correlation coefficient between gene a and the DNA microarray disrupted strain data were first selected as the gene group B. As shown in FIG. 13, the genes "purA", "yumD", "glyA", "ykbA", "purQ", and so on were selected as the gene group B in descending order of the absolute value of the correlation coefficient.

Next, as in step S2 of FIG. 1, the window similarity of the control regions was calculated for each pair of gene a and a gene b of the gene group B, and, as in step S3, the control region of gene a was searched for a region R in which the difference Max_aB[i] - Avg_aB[i] between the maximum value Max_aB[i] and the average value Avg_aB[i] is larger than a predetermined value and which contains a peak position. The conditions were set to target peak positions with a similarity of 18000 or more and to search for regions in which the difference Max_aB[i] - Avg_aB[i] is continuously 15000 or more and which contain a peak position. As a result, positions i = 4156019 to 4156029 were obtained as the region R, as shown by the hatched portion in FIG. 14.

Next, as in step S4 of FIG. 1, the gene group B_R consisting of genes b that have a peak position within the obtained region R and whose similarity is higher than a predetermined value was extracted. This yielded the gene group B_R shown in FIG. 15. The factors that strongly influenced gene a were then obtained from the DNA microarray disrupted strain data, and for each factor F so obtained, the genes of the gene group B_R that were strongly influenced by the factor F were extracted as the gene group B_R^F. As shown in FIG. 16, the disrupted strains in which the expression intensity of "purA" as gene a changed were obtained as the factors that strongly influenced the gene "purA". Among the genes of the gene group B_R, "ydeQ" and "glyA" were extracted, as shown in bold in FIGS. 15 and 17, as the gene group B_R^F strongly influenced by "PurR", one of the factors F that strongly influenced the gene "purA".

Then, as in step S5 of FIG. 1, for gene a and each gene of the gene group B_R^F, the experiment took out the character string data contained in the window starting at the matched position and performed multiple alignment, obtaining the result shown in FIG. 18.

As a result, the output shown in FIG. 19 was obtained as information indicating the factor F that controls gene a, the direction of control (positive or negative), the gene group B_R^F of genes other than gene a that are controlled by factor F, and the binding sites of gene a and of each gene in the gene group B_R^F.

Next, specific experimental results obtained with "phoD" as gene a are described.

First, as in step S1 of FIG. 1, the experiment selected a gene group B consisting of the top 300 genes ranked by the absolute value of the correlation coefficient with gene a in the DNA microarray disrupted strain data. As shown for example in FIG. 20, the genes "phoD", "yqgG", "tuaD", "tuaB", "yqgK", and so on were selected as gene group B, in descending order of the absolute value of the correlation coefficient.
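The selection of gene group B can be sketched as below, assuming that the correlation coefficient is the Pearson coefficient between expression profiles across the disrupted strains (the text does not name the coefficient) and that the data are held in a dictionary from gene name to profile. The names `pearson` and `select_gene_group` are hypothetical. Gene a is not excluded, which matches the list above beginning with "phoD" itself.

```python
import math
from typing import Dict, List, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant profile."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_gene_group(
    expr: Dict[str, Sequence[float]],  # hypothetical: gene -> expression values per disrupted strain
    gene_a: str,
    top_n: int = 300,                  # 300 in the experiment described above
) -> List[str]:
    """Return the top_n genes by |correlation| with gene_a, largest first.
    Ties are broken by gene name so the result is deterministic."""
    target = expr[gene_a]
    scored = [(abs(pearson(target, profile)), name) for name, profile in expr.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_n]]
```

Using the absolute value means that strongly anti-correlated genes are kept alongside strongly correlated ones, so both positively and negatively controlled candidates survive this first filter.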

Next, as in step S2 of FIG. 1, the experiment calculated the window similarity of the control regions for each pair consisting of the data of gene a and the data of each gene b in gene group B. Then, as in step S3, it searched the control region of gene a for a region R in which the difference Max_aB[i] - Avg_aB[i] between the maximum value Max_aB[i] and the average value Avg_aB[i] exceeds a predetermined value and which contains a peak position. Here, only peak positions with a similarity of 18000 or more were considered, and the search condition was set to find a region in which Max_aB[i] - Avg_aB[i] is continuously 15000 or more and which contains a peak position. As a result, positions i = 283493 to 283502 were obtained as the region R, as indicated by the hatched portion in FIG. 21.

Next, as in step S4 of FIG. 1, the experiment extracted a gene group B_R consisting of the genes b that have a peak position in the obtained region R and whose similarity is higher than a predetermined value. This yielded the gene group B_R shown in FIG. 22. The experiment then determined, from the DNA microarray disrupted strain data, the group of factors that strongly affected gene a and, for each such factor F, extracted from the gene group B_R the genes strongly affected by factor F as the gene group B_R^F. As shown in FIG. 23, the disrupted strains in which the expression intensity of "phoD" as gene a changed were obtained as the group of factors that strongly affected "phoD". As shown in bold in FIG. 22 and FIG. 24, "phoA" and "phoB" were extracted as the gene group B_R^F strongly affected by "PhoP", one of the factors F that strongly affected "phoD".

Then, as in step S5 of FIG. 1, for gene a and each gene of the gene group B_R^F, the experiment took out the character string data contained in the window starting at the matched position and performed multiple alignment, obtaining the result shown in FIG. 25.

As a result, the output shown in FIG. 26 was obtained as information indicating the factor F that controls gene a, the direction of control (positive or negative), the gene group B_R^F of genes other than gene a that are controlled by factor F, and the binding sites of gene a and of each gene in the gene group B_R^F.

As described above, the gene interaction estimation method shown as the embodiment of the present invention fuses three independent estimation methods: a first filtering process that uses the correlation of expression data, a second filtering process that is a completely new technique for estimating binding sites, and a third filtering process that uses gene expression data obtained under an added experimental operation. This reduces the influence of the DNA microarray data and of the ambiguity contained in binding patterns, so that gene interactions can be estimated with high accuracy even though DNA microarrays are used.

In particular, in this gene interaction estimation method, the analysis of control regions, that is, the second filtering process, has a high computational cost. Performing the first filtering process, which uses the correlation of expression data, before the second filtering process greatly reduces the computation time of the second filtering process. The applicant has confirmed that, when a program implementing this gene interaction estimation method is run on a computer with four AMD "Opteron (registered trademark) 2.4 GHz" CPUs in parallel, a configuration about five times as fast as an Intel "Pentium (registered trademark) 4, 3 GHz clock" CPU, the calculation for all 4000 genes takes approximately 10 days. The applicant has also confirmed that, without the first filtering process, the calculation takes at least 10 times as many days.

In the second filtering process of this gene interaction estimation method, rather than examining how commonly individual oligonucleotides are contained in the control regions, as is frequently done in promoter region discovery, a partial region of a certain length (a window) is treated as a set of oligonucleotides, and whether windows are similar as a whole is computed exhaustively. This allows the method to handle expression control by combinations of oligonucleotides. Furthermore, by analyzing the appearance frequency of oligonucleotides in control regions, the applicant newly found that oligonucleotides whose appearance frequency is statistically specifically high are unlikely to be binding strings. Reflecting this finding in the calculation of region similarity reduces the possibility that similar regions are found outside binding sites. In addition, because a binding site can be expected to lie at a peak position, the method can also narrow down binding site candidates.
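A minimal sketch of a window similarity of this kind is given below. It follows the structure stated in the claims (at each position in the window, compare fixed-length substrings within a shift range, take the maximum as the similarity of that position, and sum over positions), but the base score is an assumption: here it is simply the number of matching characters, optionally scaled by a weight that lowers the score of frequent oligonucleotides. The names `kmer_similarity` and `window_similarity`, the parameters `k` and `shift`, and the `weight` table are hypothetical.

```python
from typing import Dict, Optional

def kmer_similarity(s: str, t: str, weight: Optional[Dict[str, float]] = None) -> float:
    """Assumed base score: matching characters between s and t, scaled by
    an optional weight for s (e.g. < 1.0 for very frequent oligonucleotides)."""
    matches = sum(1 for x, y in zip(s, t) if x == y)
    factor = 1.0 if weight is None else weight.get(s, 1.0)
    return matches * factor

def window_similarity(
    wa: str,                 # window W_a[i] from the control region of gene a
    wb: str,                 # window W_b[j] from the control region of gene b
    k: int = 4,              # assumed substring (oligonucleotide) length
    shift: int = 2,          # assumed shift range, back and forth
    weight: Optional[Dict[str, float]] = None,
) -> float:
    """Sum over positions of the best substring similarity within the shift range."""
    total = 0.0
    for p in range(len(wa) - k + 1):
        sub_a = wa[p:p + k]
        best = 0.0
        # Compare against substrings of wb starting within +/- shift of p.
        for q in range(max(0, p - shift), min(len(wb) - k, p + shift) + 1):
            best = max(best, kmer_similarity(sub_a, wb[q:q + k], weight))
        total += best
    return total
```

Allowing a shift lets two windows score highly even when their shared oligonucleotides are slightly offset, which is one way to treat a window as a set of oligonucleotides rather than as a single rigid string.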

In this gene interaction estimation method, the third filtering process depends on only a single set of DNA microarray data and is susceptible to noise, so it is applied at the final stage. Performing the third filtering process nevertheless makes it possible to obtain changes in expression intensity that are sufficiently large relative to the background noise of DNA microarray expression data, and to derive the direct influence of a factor on each gene. Furthermore, by considering not only the values but also their ranks in the third filtering process, the method creates a situation equivalent to performing a relative evaluation.
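One way to combine value and rank in such a filter is sketched below. The concrete criteria are assumptions, since this passage does not state them: the value test uses the absolute log2 of the disrupted-to-wild expression ratio, the rank test keeps only genes among the most strongly changed in the disrupted strain of factor F, and both thresholds are placeholders. The function name and data layout are hypothetical, and all ratios are assumed to be positive.

```python
import math
from typing import Dict, Iterable, List

def strongly_affected(
    ratios: Dict[str, float],    # hypothetical: gene -> expression ratio (disrupted / wild) for factor F
    candidates: Iterable[str],   # genes of B_R to be filtered
    min_abs_log2: float = 1.0,   # placeholder value threshold (2-fold change)
    top_rank: int = 100,         # placeholder rank threshold
) -> List[str]:
    """Return candidates whose change is large both in value and in rank."""
    change = {gene: abs(math.log2(r)) for gene, r in ratios.items()}
    # Rank 1 is the gene with the largest change in this disrupted strain.
    ordered = sorted(change, key=lambda gene: -change[gene])
    rank = {gene: position for position, gene in enumerate(ordered, start=1)}
    return sorted(
        gene
        for gene in candidates
        if gene in change and change[gene] >= min_abs_log2 and rank[gene] <= top_rank
    )
```

The rank condition makes the decision relative to the other genes measured on the same array, so a uniformly noisy array does not pass many genes on the value condition alone.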

The present invention is not limited to the embodiment described above, and it goes without saying that it can be modified as appropriate without departing from its spirit.

FIG. 1 is a flowchart of a series of processes in the gene interaction estimation method shown as an embodiment of the present invention.
FIG. 2 is a diagram illustrating a specific example of the gene group B consisting of a predetermined number of top genes ranked by the absolute value of the correlation coefficient with gene a in the DNA microarray disrupted strain data.
FIG. 3 is a diagram for explaining the window similarity.
FIG. 4 is a diagram for explaining a specific method of calculating the window similarity.
FIG. 5 is a diagram for explaining the peak position.
FIG. 6 is a diagram for explaining how a peak position is specifically obtained.
FIG. 7 is a diagram for explaining how the region R is specifically obtained.
FIG. 8 is a diagram illustrating a specific example of the extracted gene group B_R.
FIG. 9 is a diagram illustrating a specific example of the disrupted strains in which the expression intensity of "ahpC" as gene a changed.
FIG. 10 is a diagram illustrating a specific example of the change ratio of expression intensity in the disrupted strain of the gene "perR" as factor F.
FIG. 11 is a diagram for explaining a specific method of multiple alignment.
FIG. 12 is a diagram illustrating a specific example of the information output through the series of processes in the gene interaction estimation method shown as an embodiment of the present invention.
FIG. 13 is a diagram illustrating a specific example of the gene group B consisting of a predetermined number of top genes ranked by the absolute value of the correlation coefficient with "purA" as gene a in the DNA microarray disrupted strain data.
FIG. 14 is a diagram for explaining the region R obtained based on "purA" as gene a and the gene group B shown in FIG. 13.
FIG. 15 is a diagram illustrating a specific example of the gene group B_R extracted from the region R shown in FIG. 14.
FIG. 16 is a diagram illustrating a specific example of the disrupted strains in which the expression intensity of "purA" as gene a changed.
FIG. 17 is a diagram illustrating a specific example of the change ratio of expression intensity in the disrupted strain of the gene "purR" as factor F.
FIG. 18 is a diagram for explaining the result of taking out, for "purA" as gene a and each gene of the gene group B_R^F, the character string data contained in the window starting at the matched position and performing multiple alignment.
FIG. 19 is a diagram illustrating a specific example of the information output as the experimental result based on the data of FIG. 13 to FIG. 18.
FIG. 20 is a diagram illustrating a specific example of the gene group B consisting of a predetermined number of top genes ranked by the absolute value of the correlation coefficient with "phoD" as gene a in the DNA microarray disrupted strain data.
FIG. 21 is a diagram for explaining the region R obtained based on "phoD" as gene a and the gene group B shown in FIG. 20.
FIG. 22 is a diagram illustrating a specific example of the gene group B_R extracted from the region R shown in FIG. 21.
FIG. 23 is a diagram illustrating a specific example of the disrupted strains in which the expression intensity of "phoD" as gene a changed.
FIG. 24 is a diagram illustrating a specific example of the change ratio of expression intensity in the disrupted strain of the gene "phoP" as factor F.
FIG. 25 is a diagram for explaining the result of taking out, for "phoD" as gene a and each gene of the gene group B_R^F, the character string data contained in the window starting at the matched position and performing multiple alignment.
FIG. 26 is a diagram illustrating a specific example of the information output as the experimental result based on the data of FIG. 20 to FIG. 25.
FIG. 27 is a diagram for explaining the structure of a gene.
FIG. 28 is a diagram for explaining a specific model of a Boolean network.
FIG. 29 is a diagram for explaining a specific model of a Bayesian network.

Explanation of symbols

a, b: gene
Avg_aB[i]: average value of the similarity
B, B_R, B_R^F: gene group
F: factor
Max_aB[i]: maximum value of the similarity
R: region
W_a[i], W_b[j]: window
wsim_a,b(i, j): window similarity

Claims

A gene interaction estimation method for estimating, when a gene a is controlled by a factor F, the influence on another gene b controlled by the factor F, the method comprising:
an input step of inputting, to a computer, DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences;
a gene group selection step in which a processor in the computer reads the DNA microarray disrupted strain data input in the input step and selects a gene group B consisting of a predetermined number of other genes b ranked highest by the absolute value of the correlation coefficient with the designated gene a in the DNA microarray disrupted strain data;
a window similarity calculation step in which the processor, for each pair consisting of the data of the gene a and the data of each gene b of the gene group B selected in the gene group selection step, and based on the DNA base sequence data input in the input step, cuts out, as a window W_a[i], a continuous region of a predetermined length l starting from each position i in the control region of the gene a, where the position i is the position of the i-th character from the beginning when the control region is taken out as a character string, cuts out, as a window W_b[j], a continuous region of a predetermined length l starting from each position j in the control region of each gene b of the gene group B, where the position j is the position of the j-th character from the beginning when the control region is taken out as a character string, calculates, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted back and forth by a predetermined number at each position in the windows, calculates a window similarity that is the sum, over the positions in the window, of the similarities of the positions, the similarity of each position being defined as the maximum of the similarities calculated for it, and calculates, based on window similarity data consisting of the window similarities of the genes b of the gene group B, a maximum value Max_aB[i] and an average value Avg_aB[i] of the similarity over the gene group B for each position i in the control region of the gene a;
a region search step in which the processor reads the data of each gene b of the gene group B selected in the gene group selection step, maximum value data consisting of the maximum values Max_aB[i] calculated in the window similarity calculation step, and average value data consisting of the average values Avg_aB[i], obtains, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at the position i, obtains, based on the window similarity data of each gene b of the gene group B calculated in the window similarity calculation step, a second region that contains a peak position, the peak position being the position that gives the maximum value when the maximum value Max_ab[i] of the window similarity of each gene b of the gene group B is viewed against changes in the position i of the control region of the gene a, and searches for a region R that is both the first region and the second region;
a gene group extraction step in which the processor reads the data of each gene b of the gene group B selected in the gene group selection step, the window similarity data of each gene b of the gene group B calculated in the window similarity calculation step, and region data consisting of the region R obtained in the region search step, and, for each region R, obtains and extracts a gene group B_R consisting of those genes b of the gene group B that have a peak position in the region R and whose similarity is higher than a predetermined value; and
a second gene group extraction step in which the processor, using the change ratio of the expression intensity between a wild strain not subjected to the experimental operation and a disrupted strain subjected to the experimental operation, based on the DNA microarray disrupted strain data input in the input step, extracts, for each region R, as a gene group B_R^F, only those genes of the gene group B_R extracted in the gene group extraction step whose expression intensity changes greatly under the experimental operation.

The gene interaction estimation method according to claim 1, wherein, in the region search step, the processor calculates the difference Max_aB[i] - Avg_aB[i] between the maximum value Max_aB[i] and the average value Avg_aB[i] and, when the calculated difference Max_aB[i] - Avg_aB[i] is larger than a predetermined value, determines that the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at the position i, thereby obtaining the first region.

The gene interaction estimation method according to claim 1, wherein, in the second gene group extraction step, the processor determines, based on the DNA microarray disrupted strain data input in the input step, a group of factors that strongly affected the gene a and, for each determined factor F, extracts as the gene group B_R^F, based on the change ratio of the expression intensity in the disrupted strain of the factor F, the genes of the gene group B_R that were strongly affected by the factor F.

The gene interaction estimation method according to claim 3, further comprising an alignment step in which the processor, based on the data of each gene of the gene group B_R^F for each region R and each factor F extracted in the second gene group extraction step and on the DNA base sequence data input in the input step, takes out, for the gene a and each gene of the gene group B_R^F, the character string data contained in the window starting at the peak position in each region R, and performs multiple alignment with reference to the peak position.

A computer-executable gene interaction estimation program for estimating, when a gene a is controlled by a factor F, the influence on another gene b controlled by the factor F, the program causing a computer to function as:
input means for inputting DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences;
gene group selection means for reading the DNA microarray disrupted strain data input by the input means and selecting a gene group B consisting of a predetermined number of other genes b ranked highest by the absolute value of the correlation coefficient with the designated gene a in the DNA microarray disrupted strain data;
window similarity calculation means for, for each pair consisting of the data of the gene a and the data of each gene b of the gene group B selected by the gene group selection means, and based on the DNA base sequence data input by the input means, cutting out, as a window W_a[i], a continuous region of a predetermined length l starting from each position i in the control region of the gene a, where the position i is the position of the i-th character from the beginning when the control region is taken out as a character string, cutting out, as a window W_b[j], a continuous region of a predetermined length l starting from each position j in the control region of each gene b of the gene group B, where the position j is the position of the j-th character from the beginning when the control region is taken out as a character string, calculating, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted back and forth by a predetermined number at each position in the windows, calculating a window similarity that is the sum, over the positions in the window, of the similarities of the positions, the similarity of each position being defined as the maximum of the similarities calculated for it, and calculating, based on window similarity data consisting of the window similarities of the genes b of the gene group B, a maximum value Max_aB[i] and an average value Avg_aB[i] of the similarity over the gene group B for each position i in the control region of the gene a;
region search means for reading the data of each gene b of the gene group B selected by the gene group selection means, maximum value data consisting of the maximum values Max_aB[i] calculated by the window similarity calculation means, and average value data consisting of the average values Avg_aB[i], obtaining, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at the position i, obtaining, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, a second region that contains a peak position, the peak position being the position that gives the maximum value when the maximum value Max_ab[i] of the window similarity of each gene b of the gene group B is viewed against changes in the position i of the control region of the gene a, and searching for a region R that is both the first region and the second region;
gene group extraction means for reading the data of each gene b of the gene group B selected by the gene group selection means, the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, and region data consisting of the region R obtained by the region search means, and, for each region R, obtaining and extracting a gene group B_R consisting of those genes b of the gene group B that have a peak position in the region R and whose similarity is higher than a predetermined value; and
second gene group extraction means for extracting, for each region R, as a gene group B_R^F, only those genes of the gene group B_R extracted by the gene group extraction means whose expression intensity changes greatly under the experimental operation, using the change ratio of the expression intensity between a wild strain not subjected to the experimental operation and a disrupted strain subjected to the experimental operation, based on the DNA microarray disrupted strain data input by the input means.

A gene interaction estimation device for estimating, when a gene a is controlled by a factor F, the influence on another gene b controlled by the factor F, the device comprising:
input means for inputting DNA microarray disrupted strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data, which is data representing base sequences;
gene group selection means for reading the DNA microarray disrupted strain data input by the input means and selecting a gene group B consisting of a predetermined number of other genes b ranked highest by the absolute value of the correlation coefficient with the designated gene a in the DNA microarray disrupted strain data;
window similarity calculation means for, for each pair consisting of the data of the gene a and the data of each gene b of the gene group B selected by the gene group selection means, and based on the DNA base sequence data input by the input means, cutting out, as a window W_a[i], a continuous region of a predetermined length l starting from each position i in the control region of the gene a, where the position i is the position of the i-th character from the beginning when the control region is taken out as a character string, cutting out, as a window W_b[j], a continuous region of a predetermined length l starting from each position j in the control region of each gene b of the gene group B, where the position j is the position of the j-th character from the beginning when the control region is taken out as a character string, calculating, for the character string data contained in the windows W_a[i] and W_b[j], all similarities between character strings of a predetermined length within a range shifted back and forth by a predetermined number at each position in the windows, calculating a window similarity that is the sum, over the positions in the window, of the similarities of the positions, the similarity of each position being defined as the maximum of the similarities calculated for it, and calculating, based on window similarity data consisting of the window similarities of the genes b of the gene group B, a maximum value Max_aB[i] and an average value Avg_aB[i] of the similarity over the gene group B for each position i in the control region of the gene a;
region search means for reading the data of each gene b of the gene group B selected by the gene group selection means, maximum value data consisting of the maximum values Max_aB[i] calculated by the window similarity calculation means, and average value data consisting of the average values Avg_aB[i], obtaining, within the control region of the gene a, a first region in which the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at the position i, obtaining, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, a second region that contains a peak position, the peak position being the position that gives the maximum value when the maximum value Max_ab[i] of the window similarity of each gene b of the gene group B is viewed against changes in the position i of the control region of the gene a, and searching for a region R that is both the first region and the second region;
gene group extraction means for reading the data of each gene b of the gene group B selected by the gene group selection means, the window similarity data of each gene b of the gene group B calculated by the window similarity calculation means, and region data consisting of the region R obtained by the region search means, and, for each region R, obtaining and extracting a gene group B_R consisting of those genes b of the gene group B that have a peak position in the region R and whose similarity is higher than a predetermined value; and
second gene group extraction means for extracting, for each region R, as a gene group B_R^F, only those genes of the gene group B_R extracted by the gene group extraction means whose expression intensity changes greatly under the experimental operation, using the change ratio of the expression intensity between a wild strain not subjected to the experimental operation and a disrupted strain subjected to the experimental operation, based on the DNA microarray disrupted strain data input by the input means.

In a binding site estimation method for estimating a binding site to which a common factor that controls gene a and any other gene b binds,
An input step for inputting DNA microarray disruption strain data obtained by performing gene disruption as a specific experimental operation and DNA base sequence data, which is data representing the base sequence, to a computer;
For a pair of the data of the gene a selected by the processor in the computer and the data of each gene b of the gene group B selected based on the DNA microarray disrupted strain data input in the input step, Based on the DNA base sequence data input in the input step, when the control region of the gene a is taken out as a character string, the location of the i-th character from the head is set as the position i. A continuous region of a predetermined length l is cut out from each position i as _a window W _a [i], and when the control region of each gene b of the gene group B is extracted as a character string, the location of the jth character from the beginning is determined. When a position j is set, a continuous area of a predetermined length l is cut out from each position j in the control area as a window W _b [j], For character string data included in these windows W _a [i], W _b [j] , all similarities between character strings of a predetermined length in a range shifted by a predetermined number at each position in these windows are obtained. Calculating a window similarity that is a sum of similarities for each position in the window when the calculated maximum similarity is defined as the similarity of the position, and calculating each gene of the gene group B The maximum value Max _aB [i] and the average value Avg _aB [i] of the similarity of the gene group B for each position i in the control region of the gene a based on the window similarity data consisting of the window similarity of b A window similarity calculation step for calculating
A region searching step in which the processor, based on the data of each gene b of the selected gene group B, maximum value data consisting of the maximum value Max_aB[i] calculated in the window similarity calculation step, and average value data consisting of the average value Avg_aB[i], obtains a first region, within the control region of the gene a, in which the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at position i; obtains, based on the window similarity data of each gene b of the gene group B calculated in the window similarity calculation step, a second region that includes the peak position giving the maximum value, a peak position being a position at which the maximum value Max_ab[i] of the window similarity of each gene b of the gene group B peaks when viewed against changes in the position i of the control region of the gene a; and searches for a region R that is both the first region and the second region; and
An output step in which the processor outputs information indicating a binding site to which a factor common to the genes a and b binds, based on region data consisting of the region R obtained in the region searching step.
A binding site estimation method characterized by comprising the above steps.
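
The window similarity calculation recited in this claim can be sketched roughly as follows. This is a minimal sketch under assumptions, not the patented implementation: the similarity between two short character strings is taken here to be the number of matching characters, Max_ab[i] is taken to be the best window similarity over all positions j of gene b, and the names `kmer_similarity`, `window_similarity`, `max_ab`, `max_and_avg` and the parameters `m` (short string length) and `shift` (shift range) are all hypothetical.

```python
def kmer_similarity(x, y):
    """ASSUMED string similarity: count of positions where characters match."""
    return sum(1 for cx, cy in zip(x, y) if cx == cy)


def window_similarity(wa, wb, m=3, shift=1):
    """Sum, over positions k in the window, of the best similarity between the
    length-m string of wa at k and the length-m strings of wb at k-shift..k+shift."""
    total = 0
    for k in range(len(wa) - m + 1):
        best = 0
        for d in range(-shift, shift + 1):
            kb = k + d
            if 0 <= kb <= len(wb) - m:  # keep the shifted string inside wb
                best = max(best, kmer_similarity(wa[k:k + m], wb[kb:kb + m]))
        total += best
    return total


def windows(region, l):
    """All windows of length l; requires len(region) >= l."""
    return [region[i:i + l] for i in range(len(region) - l + 1)]


def max_ab(region_a, region_b, l, m=3, shift=1):
    """For each position i of gene a, the best window similarity against gene b
    (ASSUMED to be the maximum over all positions j of gene b)."""
    wb_all = windows(region_b, l)
    return [max(window_similarity(wa, wb, m, shift) for wb in wb_all)
            for wa in windows(region_a, l)]


def max_and_avg(region_a, regions_b, l, m=3, shift=1):
    """Max_aB[i] and Avg_aB[i] over a non-empty list of control regions of B."""
    per_gene = [max_ab(region_a, rb, l, m, shift) for rb in regions_b]
    n = len(per_gene[0])
    max_aB = [max(pg[i] for pg in per_gene) for i in range(n)]
    avg_aB = [sum(pg[i] for pg in per_gene) / len(per_gene) for i in range(n)]
    return max_aB, avg_aB


if __name__ == "__main__":
    print(max_and_avg("ACGTAC", ["ACGTAC", "TTTTTT"], l=4, m=2, shift=0))
```

Allowing a small shift when comparing the short strings lets two windows score well even when the shared motif is offset by a few bases, which appears to be the purpose of the shifted comparison in the claim.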

In a computer-executable binding site estimation program for estimating a binding site to which a common factor that controls gene a and any other gene b binds,
the program causing the computer to function as:
Input means for inputting DNA microarray disruption strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data which is data representing the base sequence;
Window similarity calculating means which, for each pair consisting of the data of a specified gene a and the data of each gene b of a gene group B selected based on the DNA microarray disrupted strain data input by the input means, and based on the DNA base sequence data input by the input means: extracts the control region of the gene a as a character string, defines the location of the i-th character from the head as position i, and cuts out a continuous region of a predetermined length l from each position i in the control region as a window W_a[i]; extracts the control region of each gene b of the gene group B as a character string, defines the location of the j-th character from the head as position j, and cuts out a continuous region of the predetermined length l from each position j in that control region as a window W_b[j]; for the character string data contained in the windows W_a[i] and W_b[j], obtains, at each position in the windows, all similarities between character strings of a predetermined length within a range shifted back and forth by a predetermined number, defines the maximum of the obtained similarities as the similarity of that position, and calculates a window similarity as the sum of the similarities of the positions in the window; and, based on window similarity data consisting of the window similarity of each gene b of the gene group B, calculates, for each position i in the control region of the gene a, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity over the gene group B;
Region searching means which, based on the data of each gene b of the selected gene group B, maximum value data consisting of the maximum value Max_aB[i] calculated by the window similarity calculating means, and average value data consisting of the average value Avg_aB[i], obtains a first region, within the control region of the gene a, in which the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at position i; obtains, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculating means, a second region that includes the peak position giving the maximum value, a peak position being a position at which the maximum value Max_ab[i] of the window similarity of each gene b of the gene group B peaks when viewed against changes in the position i of the control region of the gene a; and searches for a region R that is both the first region and the second region;
and further functioning as output means for outputting information indicating a binding site to which a factor common to the genes a and b binds, based on region data consisting of the region R obtained by the region searching means.
A binding site estimation program characterized by the above.
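
The region search recited in this claim can be illustrated with the following sketch. It rests on assumptions that the claim does not fix: "statistically specifically large" is approximated by requiring Max_aB[i] to exceed Avg_aB[i] by at least a threshold, and the second region is taken to be a fixed margin around the position where Max_ab[i] is largest. The names `first_region`, `second_region`, `search_region`, `min_excess` and `margin` are hypothetical.

```python
def first_region(max_aB, avg_aB, min_excess):
    """Positions where Max_aB[i] stands out from Avg_aB[i].

    ASSUMED criterion: excess over the average of at least min_excess; the
    claim only requires the maximum to be statistically specifically large.
    """
    return {i for i, (mx, av) in enumerate(zip(max_aB, avg_aB))
            if mx - av >= min_excess}


def second_region(max_ab, margin):
    """Positions within `margin` of the peak of Max_ab[i] that gives the
    largest value (ASSUMED shape of the second region)."""
    peak = max(range(len(max_ab)), key=lambda i: max_ab[i])
    lo = max(0, peak - margin)
    hi = min(len(max_ab) - 1, peak + margin)
    return set(range(lo, hi + 1))


def search_region(max_aB, avg_aB, max_ab, min_excess, margin):
    """Region R: positions that lie in both the first and the second region."""
    return sorted(first_region(max_aB, avg_aB, min_excess)
                  & second_region(max_ab, margin))


if __name__ == "__main__":
    max_aB = [3, 9, 10, 4, 9]
    avg_aB = [2.5, 4, 5, 3.5, 8.5]
    max_ab = [1, 8, 10, 4, 2]  # window similarity of one gene b against gene a
    print(search_region(max_aB, avg_aB, max_ab, min_excess=3, margin=1))
```

Taking the intersection means a position is reported only when it is both unusual across the whole gene group B and close to the best match for the particular gene b being examined.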

In a binding site estimation apparatus for estimating a binding site to which a common factor that controls gene a and any other gene b binds,
An input means for inputting DNA microarray disruption strain data obtained by performing gene disruption as a specific experimental operation, and DNA base sequence data which is data representing the base sequence;
Window similarity calculating means which, for each pair consisting of the data of a specified gene a and the data of each gene b of a gene group B selected based on the DNA microarray disrupted strain data input by the input means, and based on the DNA base sequence data input by the input means: extracts the control region of the gene a as a character string, defines the location of the i-th character from the head as position i, and cuts out a continuous region of a predetermined length l from each position i in the control region as a window W_a[i]; extracts the control region of each gene b of the gene group B as a character string, defines the location of the j-th character from the head as position j, and cuts out a continuous region of the predetermined length l from each position j in that control region as a window W_b[j]; for the character string data contained in the windows W_a[i] and W_b[j], obtains, at each position in the windows, all similarities between character strings of a predetermined length within a range shifted back and forth by a predetermined number, defines the maximum of the obtained similarities as the similarity of that position, and calculates a window similarity as the sum of the similarities of the positions in the window; and, based on window similarity data consisting of the window similarity of each gene b of the gene group B, calculates, for each position i in the control region of the gene a, the maximum value Max_aB[i] and the average value Avg_aB[i] of the similarity over the gene group B;
Region searching means which, based on the data of each gene b of the selected gene group B, maximum value data consisting of the maximum value Max_aB[i] calculated by the window similarity calculating means, and average value data consisting of the average value Avg_aB[i], obtains a first region, within the control region of the gene a, in which the maximum value Max_aB[i] is statistically specifically large in the distribution of the window similarity at position i; obtains, based on the window similarity data of each gene b of the gene group B calculated by the window similarity calculating means, a second region that includes the peak position giving the maximum value, a peak position being a position at which the maximum value Max_ab[i] of the window similarity of each gene b of the gene group B peaks when viewed against changes in the position i of the control region of the gene a; and searches for a region R that is both the first region and the second region; and
Output means for outputting information indicating a binding site to which a factor common to the genes a and b binds, based on region data consisting of the region R obtained by the region searching means.
A binding site estimation apparatus characterized by comprising the above.