JP4805586B2

JP4805586B2 - Gene structure prediction method and gene structure prediction program

Info

Publication number: JP4805586B2
Application number: JP2005046149A
Authority: JP
Inventors: 哲郎豊田
Original assignee: RIKEN Institute of Physical and Chemical Research
Current assignee: RIKEN Institute of Physical and Chemical Research
Priority date: 2005-02-22
Filing date: 2005-02-22
Publication date: 2011-11-02
Anticipated expiration: 2025-02-22
Also published as: WO2006090868A1; JP2006235750A

Description

The present invention relates to a gene structure prediction method and a gene structure prediction program. In particular, it relates to a method and program that use a tiling array, in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes, and that predict the region and structure of an unknown gene in that genomic base sequence based on tiling array data, that is, data on the expression intensity of each probe measured with the tiling array.

Techniques for predicting the structure of a gene have been disclosed in, for example, Non-Patent Documents 1 to 3. With the techniques described in Non-Patent Documents 1 to 3, the structure of a known gene could be predicted from base sequence data alone.

DNA microarrays are very useful as a technique for simultaneously measuring the expression levels of many genes as fluorescence intensities (expression intensities). Tiling arrays are particularly useful because they can simultaneously measure the expression level across the entire genomic base sequence as fluorescence intensity (see, for example, Non-Patent Documents 4 to 8). In a tiling array, many probes are arranged very densely and regularly along the genomic base sequence, regardless of whether a given position lies in a gene region. As a result, the existence of an unknown gene could be discovered from tiling array data alone. In a tiling array, a gene region of average length corresponds to several tens to several hundreds of probes.

Non-Patent Document 1: Lukashin, AV., Borodovsky, M., "GeneMark.hmm: new solutions for gene finding.", Nucleic Acids Res., 26, p. 1107-1115, 1998
Non-Patent Document 2: Uberbacher, EC., Xu, Y., Mural, RJ., "Discovering and understanding genes in human DNA sequence using GRAIL", Methods Enzymol., 266, p. 259-281, 1996
Non-Patent Document 3: Snyder, EE., Stormo, GD., "Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks.", Nucleic Acids Res., 21, p. 607-613, 1993
Non-Patent Document 4: Shoemaker D.D. et al., "Experimental annotation of the human genome using microarray technology.", Nature, 409, p. 922, 2001
Non-Patent Document 5: Kapranov P. et al., "Large-scale transcriptional activity in chromosomes 21 and 22.", Science, 296, p. 916-919, 2002
Non-Patent Document 6: Rinn, J.L. et al., "The transcriptional activity of human chromosome 22.", Genes Dev., 17, p. 529-540, 2003
Non-Patent Document 7: Yamada K. et al., "Empirical Analysis of Transcriptional Activity in the Arabidopsis Genome.", Science, 302, p. 842-845, 2003
Non-Patent Document 8: Stolc V. et al., "A gene expression map for the euchromatic genome of Drosophila melanogaster.", Science, 306, p. 655-660, 2004

However, tiling array data generally contain a great deal of noise, so exons and introns in a genomic base sequence cannot be clearly distinguished from tiling array data alone. Specifically, when a probe is judged to lie in an exon according to whether its measured expression intensity exceeds a threshold, then however the threshold is adjusted, intensities at or below the threshold occur inside exon regions and intensities above the threshold occur outside them. Consequently, although the existence of an unknown gene could be predicted from tiling array data, the region and structure of that gene could not.

In addition, with conventional techniques for predicting gene structure, even a gene expression region judged to be statistically significant may, in the course of prediction, be judged not to be a gene region. Gene structure therefore could not always be predicted with sufficient accuracy.

The present invention has been made in view of the above problems. Its object is to provide a gene structure prediction method and a gene structure prediction program that can accurately predict the region and structure of an unknown gene from a genomic base sequence based on tiling array data.

Among conventional gene structure prediction methods, those using a Markov model have been widely used (see, for example, Asai K., Hayamizu, S. and Handa, K., "Prediction of protein secondary structure by the hidden Markov model.", Computer Applications for Biosciences, 9, p. 141-146, 1993; Krogh, A., Brown, M., Mian, IS., Sjolander, K., Haussler, D., "Hidden Markov models in computational biology. Applications to protein modeling.", J. Mol. Biol., 235, p. 1501-1531, 1994; and Tanaka, H., Ishikawa, M., Asai, K., Konagaya, A., "Hidden Markov models and interactive aligners: Study of their equivalence and possibilities.", International Conference on Intelligent Systems for Molecular Biology, 1, p. 395-401, 1993). However, no conventional gene structure prediction method has applied a Markov model that takes into account, for example, likelihoods calculated from tiling array data. Indeed, no attempt had been made to predict gene structure with tiling array data taken into account at all.

To achieve the above object, the gene structure prediction method according to the present invention is a method for predicting the structure of a gene based on a base sequence, characterized in that it uses a tiling array in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes, and predicts the structure of a gene from the genomic base sequence based on tiling array data on the expression intensity of each probe measured with the tiling array.

The gene structure prediction method according to the present invention is also characterized by including: a data acquisition step of acquiring, using a tiling array in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes, tiling array data on the expression intensity of each probe measured with the tiling array and genomic base sequence data on the genomic base sequence; an expression region estimation step of estimating a gene expression region within the genomic base sequence based on the tiling array data acquired in the data acquisition step; and a gene structure prediction step of predicting the structure of the gene by taking the base corresponding to the expression region estimated in the expression region estimation step as a starting point and determining, for each base extending on either side of the starting point, an attribute relating to gene structure based on the tiling array data and the genomic base sequence data.

In the gene structure prediction method described above, the expression region estimation step may further include: an expression region candidate determination step of determining, based on the tiling array data, an expression region candidate, that is, a candidate for a gene expression region; an expression region candidate judgment step of judging whether the expression region candidate determined in the expression region candidate determination step is statistically significant; and an expression region determination step of determining the expression region candidate to be the expression region when it is judged significant in the expression region candidate judgment step.
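The judgment step above can be illustrated with a small resampling test. This is only a sketch under stated assumptions: the text does not name a specific statistical test at this point, so the Monte Carlo comparison of medians, the function names `candidate_p_value` and `is_significant`, and the significance level `alpha` are all hypothetical choices made for illustration.

```python
import random
from statistics import median


def candidate_p_value(region_values, all_values, n_draws=2000, seed=0):
    """Estimate how often noise alone yields a median as large as the candidate's.

    region_values -- expression intensities of the probes in the candidate region
    all_values    -- expression intensities of all probes (used as the noise model,
                     which is an assumption of this sketch)
    """
    rng = random.Random(seed)
    observed = median(region_values)
    k = len(region_values)
    # Count random probe sets of the same size whose median reaches the observed one.
    hits = sum(
        1
        for _ in range(n_draws)
        if median(rng.sample(all_values, k)) >= observed
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_draws + 1)


def is_significant(region_values, all_values, alpha=0.01):
    """Reject the 'measured by chance due to noise' hypothesis when p < alpha."""
    return candidate_p_value(region_values, all_values) < alpha
```

A candidate whose probes all sit at background level would give a p-value near 1 and be discarded, while a candidate with consistently high intensities would pass.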

In the gene structure prediction method described above, the expression region candidate determination step may further include: a median calculation step of calculating, for a region of predetermined length in the genomic base sequence, the median of the expression intensities of the probes contained in that region; a region moving step of moving the region used in the median calculation step along the genomic base sequence; and an expression region candidate selection step of selecting the largest of the medians accumulated by repeatedly executing the median calculation step and the region moving step, and selecting the region from which that largest median was calculated as the expression region candidate.
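The three steps above (calculate a median, move the region, pick the maximum) amount to a sliding-window search. A minimal sketch follows, assuming probes are given as start coordinates with one intensity each; the function name, the parameter names, and the half-open window convention are illustrative assumptions, not details taken from the text.

```python
from statistics import median


def select_candidate_region(positions, intensities, window_len, step):
    """Return (start, end, median) of the window with the largest median intensity.

    positions   -- genomic start coordinate of each probe (assumed layout)
    intensities -- expression intensity of each probe, in the same order
    window_len  -- the predetermined region length, in bases
    step        -- distance the region is moved per iteration, in bases
    """
    best = None
    start = min(positions)
    last = max(positions)
    while start <= last:
        end = start + window_len
        # Median calculation step: probes whose start lies in [start, end).
        values = [v for p, v in zip(positions, intensities) if start <= p < end]
        if values:
            m = median(values)
            # Candidate selection: keep the first window with the largest median.
            if best is None or m > best[2]:
                best = (start, end, m)
        # Region moving step.
        start += step
    return best


positions = [0, 50, 100, 150, 200, 250, 300, 350]
intensities = [1.0, 1.2, 0.9, 9.0, 8.0, 9.5, 1.1, 0.8]
candidate = select_candidate_region(positions, intensities, window_len=150, step=50)
```

Using the median rather than the mean means a single outlying probe cannot by itself make a window look expressed, which fits the stated concern about noisy intensities.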

In the gene structure prediction method described above, the gene structure prediction step may further include: an expression region base probability determination step of determining, using a Markov model, a probability relating to the attribute of an expression region base, that is, a base corresponding to the expression region estimated in the expression region estimation step; an adjacent base probability determination step of determining, using a Markov model, a state transition probability relating to the attribute of an adjacent base, that is, a base adjacent to the expression region base handled in the expression region base probability determination step; a likelihood determination step of determining, using the maximum likelihood method, a likelihood relating to the attribute of each base after the adjacent base probability determination step has been repeated up to the terminal base of the genomic base sequence; and an attribute fixing step of fixing the combination of attributes of the bases based on the probability, the state transition probability, and the likelihood.
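One common way to combine state transition probabilities with per-position likelihoods is Viterbi decoding of a hidden Markov model. The sketch below uses that technique as a stand-in and is not claimed to be the procedure of the invention: it labels each probe (not each base) as exon or intron, runs left to right over the whole sequence instead of outward from a seed base, and assumes Gaussian emission likelihoods. The state names, the parameter values, and the function names are all hypothetical.

```python
import math

STATES = ("exon", "intron")


def log_gauss(x, mean, sd):
    """Log density of a normal distribution, used as the emission likelihood."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def viterbi(intensities, start_p, trans_p, emit_params):
    """Return the most likely exon/intron label for each intensity."""
    # score[s] = best log-probability of any label path ending in state s.
    score = {
        s: math.log(start_p[s]) + log_gauss(intensities[0], *emit_params[s])
        for s in STATES
    }
    back = []
    for x in intensities[1:]:
        prev = score
        score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state, weighing path score and transition probability.
            best_prev = max(STATES, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = (
                prev[best_prev]
                + math.log(trans_p[best_prev][s])
                + log_gauss(x, *emit_params[s])
            )
            pointers[s] = best_prev
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
emit_params = {"exon": (9.0, 3.0), "intron": (1.0, 3.0)}  # (mean, sd), assumed
labels = viterbi([9.0, 8.5, 3.0, 9.2, 1.0, 0.8, 1.2, 8.8, 9.1],
                 start_p, trans_p, emit_params)
```

In this example the isolated low reading (3.0) inside the first exon stays labelled as exon, because switching states twice costs more than one poorly fitting emission. A plain threshold would have split the exon at that probe.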

The gene structure prediction method described above may further include an expression value calculation step of calculating the expression value of the gene based on the expression intensities of the probes corresponding to the exon regions contained in the gene structure predicted in the gene structure prediction step.
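As a small illustration of the expression value calculation step, the sketch below summarizes the intensities of probes that fall inside predicted exons. The choice of the median as the summary statistic is an assumption of this sketch (the text only says the value is based on exon probe intensities), as are the function name and the half-open interval convention.

```python
from statistics import median


def gene_expression_value(exons, probes):
    """Summarize the intensities of probes lying in the predicted exons.

    exons  -- list of (start, end) half-open intervals for one predicted gene
    probes -- list of (position, intensity) pairs
    Returns None when no probe falls inside any exon.
    """
    values = [
        intensity
        for pos, intensity in probes
        if any(start <= pos < end for start, end in exons)
    ]
    return median(values) if values else None
```

Probes in introns or outside the gene are ignored, so their background-level readings do not dilute the value reported for the gene.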

The gene structure prediction method described above may further include a prediction target region determination step of determining, as a prediction target region, an unpredicted region, that is, the genomic base sequence excluding the regions of genes whose structure was predicted in the gene structure prediction step. The expression region estimation step and the gene structure prediction step are then executed again on the prediction target region determined in the prediction target region determination step.

In the gene structure prediction method described above, the prediction target region determination step may further include: an unpredicted region extraction step of extracting the unpredicted region from the genomic base sequence; a base length judgment step of calculating the base length of the unpredicted region extracted in the unpredicted region extraction step and judging whether the calculated base length is equal to or less than a predetermined base length; and a prediction target region fixing step of fixing the unpredicted region extracted in the unpredicted region extraction step as the prediction target region when the base length judgment step judges that its length is not equal to or less than the predetermined base length.
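The extraction and base length judgment steps can be sketched as interval arithmetic over the genome. The function name, the half-open coordinate convention, and the default threshold of 1000 bases are illustrative assumptions; the threshold value is the example figure the document gives for the predetermined base length.

```python
def prediction_target_regions(genome_len, predicted, min_len=1000):
    """Return unpredicted regions whose length exceeds min_len.

    genome_len -- length of the genomic base sequence, in bases
    predicted  -- (start, end) half-open intervals of already predicted genes
    min_len    -- the predetermined base length; regions at or below it are dropped
    """
    targets = []
    cursor = 0
    for start, end in sorted(predicted):
        # Gap between the previous gene (or sequence start) and this gene.
        if start - cursor > min_len:
            targets.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing gap after the last predicted gene.
    if genome_len - cursor > min_len:
        targets.append((cursor, genome_len))
    return targets
```

Each returned region would then be passed back through expression region estimation and gene structure prediction, and the loop ends once no region longer than the threshold remains or no significant expression region is found.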

The present invention also relates to a gene structure prediction program. The gene structure prediction program, which causes a computer to execute the gene structure prediction method according to the present invention, is characterized by including: a data acquisition step of acquiring, using a tiling array in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes, tiling array data on the expression intensity of each probe measured with the tiling array and genomic base sequence data on the genomic base sequence; an expression region estimation step of estimating a gene expression region within the genomic base sequence based on the tiling array data acquired in the data acquisition step; and a gene structure prediction step of predicting the structure of the gene by taking the base corresponding to the expression region estimated in the expression region estimation step as a starting point and determining, for each base extending on either side of the starting point, an attribute relating to gene structure based on the tiling array data and the genomic base sequence data.

In the gene structure prediction program described above, the expression region estimation step may further include: an expression region candidate determination step of determining, based on the tiling array data, an expression region candidate, that is, a candidate for a gene expression region; an expression region candidate judgment step of judging whether the expression region candidate determined in the expression region candidate determination step is statistically significant; and an expression region determination step of determining the expression region candidate to be the expression region when it is judged significant in the expression region candidate judgment step.

In the gene structure prediction program described above, the expression region candidate determination step may further include: a median calculation step of calculating, for a region of predetermined length in the genomic base sequence, the median of the expression intensities of the probes contained in that region; a region moving step of moving the region used in the median calculation step along the genomic base sequence; and an expression region candidate selection step of selecting the largest of the medians accumulated by repeatedly executing the median calculation step and the region moving step, and selecting the region from which that largest median was calculated as the expression region candidate.

In the gene structure prediction program described above, the gene structure prediction step may further include: an expression region base probability determination step of determining, using a Markov model, a probability relating to the attribute of an expression region base, that is, a base corresponding to the expression region estimated in the expression region estimation step; an adjacent base probability determination step of determining, using a Markov model, a state transition probability relating to the attribute of an adjacent base, that is, a base adjacent to the expression region base handled in the expression region base probability determination step; a likelihood determination step of determining, using the maximum likelihood method, a likelihood relating to the attribute of each base after the adjacent base probability determination step has been repeated up to the terminal base of the genomic base sequence; and an attribute fixing step of fixing the combination of attributes of the bases based on the probability, the state transition probability, and the likelihood.

The gene structure prediction program described above may further include an expression value calculation step of calculating the expression value of the gene based on the expression intensities of the probes corresponding to the exon regions contained in the gene structure predicted in the gene structure prediction step.

The gene structure prediction program described above may further include a prediction target region determination step of determining, as a prediction target region, an unpredicted region, that is, the genomic base sequence excluding the regions of genes whose structure was predicted in the gene structure prediction step. The expression region estimation step and the gene structure prediction step are then executed again on the prediction target region determined in the prediction target region determination step.

In the gene structure prediction program described above, the prediction target region determination step may further include: an unpredicted region extraction step of extracting the unpredicted region from the genomic base sequence; a base length judgment step of calculating the base length of the unpredicted region extracted in the unpredicted region extraction step and judging whether the calculated base length is equal to or less than a predetermined base length; and a prediction target region fixing step of fixing the unpredicted region extracted in the unpredicted region extraction step as the prediction target region when the base length judgment step judges that its length is not equal to or less than the predetermined base length.

According to the gene structure prediction method of the present invention, a tiling array in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes is used, and the structure of a gene is predicted from the genomic base sequence based on tiling array data on the expression intensity of each probe measured with the tiling array. The region and structure of an unknown gene can therefore be predicted accurately from the genomic base sequence based on tiling array data.

According to the gene structure prediction method and the gene structure prediction program of the present invention, tiling array data on the expression intensity of each probe and genomic base sequence data on the genomic base sequence are acquired using a tiling array in which partial base sequences regularly extracted from the genomic base sequence are arranged as probes. A gene expression region is estimated within the genomic base sequence based on the acquired tiling array data. Taking the base corresponding to the estimated expression region as a starting point, an attribute relating to gene structure is then determined for each base extending on either side of the starting point, based on the tiling array data and the genomic base sequence data, and the structure of the gene is thereby predicted. The region and structure of an unknown gene can therefore be predicted accurately from the genomic base sequence based on tiling array data.

According to the gene structure prediction method and the gene structure prediction program of the present invention, an expression region candidate is determined based on the tiling array data, the candidate is judged for statistical significance, and, when judged significant, it is determined to be the expression region. The expression region can therefore be estimated easily and accurately using existing statistical methods.

According to the gene structure prediction method and the gene structure prediction program of the present invention, the median of the expression intensities of the probes contained in a region of predetermined length in the genomic base sequence is calculated, the region is moved along the genomic base sequence, and these operations are repeated. The largest of the accumulated medians is selected, and the region from which it was calculated is selected as the expression region candidate. An appropriate expression region candidate can therefore be selected while allowing for the variation in the expression intensity of each probe measured with the tiling array.

According to the gene structure prediction method and the gene structure prediction program of the present invention, a probability relating to the attribute of an expression region base, that is, a base corresponding to the estimated expression region, is determined using a Markov model. A state transition probability relating to the attribute of each base adjacent to that expression region base is then determined using a Markov model, and this determination is repeated up to the terminal base of the genomic base sequence. After that, a likelihood relating to the attribute of each base is determined using the maximum likelihood method, and the combination of attributes of the bases is fixed based on the determined probability, state transition probability, and likelihood. The region and structure of a gene can therefore be predicted even more accurately.

According to the gene structure prediction method and the gene structure prediction program of the present invention, the expression value of a gene is calculated based on the expression intensities of the probes corresponding to the exon regions contained in the predicted gene structure. The expression level of an unknown gene can therefore be obtained.

According to the gene structure prediction method and the gene structure prediction program of the present invention, an unpredicted region, that is, the genomic base sequence excluding the regions of genes whose structure has been predicted, is determined as a prediction target region, and the expression region estimation and gene structure prediction are executed again on that prediction target region. The regions and structures of multiple genes can therefore be predicted efficiently from the genomic base sequence.

According to the gene structure prediction method and the gene structure prediction program of the present invention, an unpredicted region is extracted from the genomic base sequence, its base length is calculated, and the calculated length is judged against a predetermined base length (for example, 1000 bases). When the length is judged not to be equal to or less than the predetermined base length, the extracted unpredicted region is fixed as the prediction target region. Regions too short to contain a gene structure can thus be excluded, so that a prediction target region suitable for structure prediction can be fixed.

Embodiments of the gene structure prediction method and the gene structure prediction program according to the present invention are described in detail below with reference to the drawings. The present invention is not limited by these embodiments.

まず、本明細書で用いる用語を定義する。
「遺伝子」とは、成熟したｍＲＮＡ（メッセンジャーＲＮＡ）または当該ｍＲＮＡに対応するゲノム塩基配列の部分領域（エクソン）の集合である。
「遺伝子領域」とは、ゲノム塩基配列上で、１つの遺伝子に属するエクソンの集合を全て含む最短の部分領域である。
「遺伝子構造」とは、１つの遺伝子領域内において、エクソンとイントロンの領域を示す情報である。
「タイリングアレイ」とは、ゲノム塩基配列から一定長（２０〜１００塩基長）の塩基配列を一定間隔（１〜２００塩基）でずらしつつ抜き出し、抜き出した各塩基配列に対応するオリゴヌクレオチドをプローブとするＤＮＡマイクロアレイである。なお、タイリングアレイには、抜き出した塩基配列と完全に一致するオリゴヌクレオチド（パーフェクトマッチするオリゴヌクレオチド）をプローブとして加えるだけでなく、抜き出した塩基配列と完全に一致しないオリゴヌクレオチド（ミスマッチするオリゴヌクレオチド）を比較用のプローブとして加えてもよい。
「発現強度」とは、生物組織から得られたサンプル中に含まれるｍＲＮＡの量に比例する数値データであり、対応するプローブへサンプルがハイブリダイゼイションすることで測定されるものである。
「タイリングアレイデータ」とは、タイリングアレイの全プローブの発現強度に関するデータである。
「有意な発現領域」とは、「ゲノム塩基配列中の一定塩基長（例えば７５０塩基長）の領域に対応するプローブの発現強度が、ノイズに因り偶然に測定されたものである。」という帰無仮説が統計的検定に基づいて有意なレベルで棄却された場合における当該領域である。なお、有意な発現領域は遺伝子領域の一部であるとする。
First, the terms used in this specification are defined.
A “gene” is a mature mRNA (messenger RNA), or the set of partial regions (exons) of the genome base sequence that correspond to that mRNA.
A “gene region” is the shortest partial region of the genome base sequence that contains all of the exons belonging to one gene.
“Gene structure” is information indicating the exon and intron regions within one gene region.
A “tiling array” is a DNA microarray whose probes are oligonucleotides corresponding to base sequences of a fixed length (20 to 100 bases) extracted from the genome base sequence while shifting the extraction position at a fixed interval (1 to 200 bases). In addition to oligonucleotides that perfectly match the extracted base sequences (perfect-match oligonucleotides), oligonucleotides that do not perfectly match them (mismatch oligonucleotides) may be added to the tiling array as comparison probes.
“Expression intensity” is numerical data proportional to the amount of mRNA contained in a sample obtained from biological tissue, and is measured by hybridization of the sample to the corresponding probe.
“Tiling array data” is data on the expression intensities of all probes in the tiling array.
A “significant expression region” is a region of a fixed base length (for example, 750 bases) in the genome base sequence for which the null hypothesis “the expression intensities of the probes corresponding to this region were measured by chance due to noise” is rejected at a significant level by a statistical test. A significant expression region is assumed to be part of a gene region.
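The tiling array definition above can be illustrated with a minimal Python sketch. It is an illustration only: the names `TilingProbe` and `tile_genome` and the default values of 25 bases and a 10-base shift are assumptions chosen from within the stated ranges, not values given in this specification, and only perfect-match probes are generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TilingProbe:
    probe_number: int  # hypothetical identifier, counted from 1
    start: int         # first base number covered (1-based, inclusive)
    end: int           # last base number covered (1-based, inclusive)
    sequence: str      # perfect-match oligonucleotide sequence


def tile_genome(genome: str, probe_length: int = 25, step: int = 10) -> list[TilingProbe]:
    """Cut fixed-length probes out of `genome`, shifting by `step` bases each time."""
    if not 20 <= probe_length <= 100:
        raise ValueError("probe_length must be 20 to 100 bases")
    if not 1 <= step <= 200:
        raise ValueError("step must be 1 to 200 bases")
    probes = []
    last_offset = len(genome) - probe_length  # negative if the genome is too short
    for number, offset in enumerate(range(0, last_offset + 1, step), start=1):
        probes.append(
            TilingProbe(number, offset + 1, offset + probe_length,
                        genome[offset:offset + probe_length])
        )
    return probes
```

With a step smaller than the probe length the probes overlap, so each base is covered by several probes; with a larger step there are gaps between probes.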

以下で、本発明を実現するための装置である遺伝子構造予測装置１００の構成について、図１などを参照して説明する。図１は、遺伝子構造予測装置１００の構成を示すブロック図であり、該構成のうち本発明に関係する部分のみを概念的に示している。 The configuration of the gene structure prediction apparatus 100, an apparatus for implementing the present invention, is described below with reference to FIG. 1 and other figures. FIG. 1 is a block diagram showing the configuration of the gene structure prediction apparatus 100 and conceptually shows only the portions of the configuration that relate to the present invention.

図１に示すように、遺伝子構造予測装置１００は、遺伝子構造予測装置１００を統括的に制御するＣＰＵ等の制御部１０２と、ルータ等の通信装置および専用線等の有線または無線の通信回線を介して遺伝子構造予測装置１００をネットワーク３００に通信可能に接続する通信インターフェース部１０４と、各種のデータベースやテーブルやファイルなどを格納する記憶部１０６と、入力装置１１２や出力装置１１４に接続する入出力インターフェース部１０８と、で構成されており、これら各部は任意の通信路を介して通信可能に接続されている。また、ネットワーク３００は、遺伝子構造予測装置１００と外部システム２００とを相互に通信可能に接続する機能を有し、例えばインターネットやＬＡＮ等である。また、外部システム２００は、ネットワーク３００を介して遺伝子構造予測装置１００と相互に通信可能に接続され、ゲノム塩基配列データやタイリングアレイデータや各種パラメータなどに関する外部データベースや、外部プログラム等を提供する機能など、を有する。外部システム２００の各機能は外部システム２００のハードウェア構成中のＣＰＵやディスク装置やメモリ装置や入力装置や出力装置や通信制御装置等およびそれらを制御するプログラム等で実現される。なお、外部システム２００はＷＥＢサーバやＡＳＰサーバ等として構成してもよく、そのハードウェアは一般に市販されるワークステーションやパーソナルコンピュータ等の情報処理装置およびその付属装置で構成してもよい。 As shown in FIG. 1, the gene structure prediction apparatus 100 comprises: a control unit 102, such as a CPU, that controls the gene structure prediction apparatus 100 as a whole; a communication interface unit 104 that connects the gene structure prediction apparatus 100 to the network 300 via a communication device such as a router and a wired or wireless communication line such as a dedicated line; a storage unit 106 that stores various databases, tables, files, and the like; and an input/output interface unit 108 that connects to the input device 112 and the output device 114. These units are communicably connected via an arbitrary communication path. The network 300 has the function of connecting the gene structure prediction apparatus 100 and the external system 200 so that they can communicate with each other, and is, for example, the Internet or a LAN. The external system 200 is communicably connected to the gene structure prediction apparatus 100 via the network 300 and has functions such as providing external databases of genome base sequence data, tiling array data, various parameters, and the like, as well as external programs. Each function of the external system 200 is realized by the CPU, disk device, memory device, input device, output device, communication control device, and the like in the hardware configuration of the external system 200 and by the programs that control them. The external system 200 may be configured as a WEB server, an ASP server, or the like, and its hardware may consist of a commercially available information processing apparatus, such as a workstation or personal computer, and its accessory devices.

記憶部１０６は、ストレージ手段であり、具体的には、ＲＡＭやＲＯＭ等のメモリ装置、ハードディスクのような固定ディスク装置、フレキシブルディスク、光ディスク等を用いることができる。記憶部１０６は、図示の如く、ゲノム塩基配列データファイル１０６ａとタイリングアレイデータファイル１０６ｂと発現領域データファイル１０６ｃと遺伝子構造データファイル１０６ｄと発現値データファイル１０６ｅを格納する。ゲノム塩基配列データファイル１０６ａはゲノム塩基配列データを格納する。具体的には、図２に示すように、ゲノム塩基配列における各塩基の位置を一意に識別するための塩基番号と、塩基番号に対応する塩基を表す塩基記号（Ａ、Ｔ、Ｇ、Ｃ）と、を相互に関連付けて格納する。再び図１に戻り、タイリングアレイデータファイル１０６ｂはタイリングアレイデータを格納する。具体的には、図３に示すように、タイリングアレイに配置したプローブを一意に識別するためのプローブ番号と、プローブ番号に対応するプローブの発現強度と、プローブ番号に対応するプローブの範囲（ゲノム塩基配列における開始塩基番号と終了塩基番号との組）と、を相互に関連付けて格納する。再び図１に戻り、発現領域データファイル１０６ｃは、後述する発現領域推定部１０２ｂで推定した発現領域に関するデータを格納する。遺伝子構造ファイル１０６ｄは、後述する遺伝子構造予測部１０２ｃで予測した遺伝子の領域や構造に関するデータを格納する。発現値データファイル１０６ｅは、後述する発現値算出部１０２ｄで算出した発現値に関するデータを格納する。 The storage unit 106 is a storage means; specifically, a memory device such as a RAM or a ROM, a fixed disk device such as a hard disk, a flexible disk, an optical disk, or the like can be used. As illustrated, the storage unit 106 stores a genome base sequence data file 106a, a tiling array data file 106b, an expression region data file 106c, a gene structure data file 106d, and an expression value data file 106e. The genome base sequence data file 106a stores genome base sequence data. Specifically, as shown in FIG. 2, it stores, in association with each other, a base number that uniquely identifies the position of each base in the genome base sequence and a base symbol (A, T, G, or C) representing the base at that base number. Returning to FIG. 1, the tiling array data file 106b stores tiling array data. Specifically, as shown in FIG. 3, it stores, in association with each other, a probe number that uniquely identifies a probe arranged on the tiling array, the expression intensity of the probe with that probe number, and the range of that probe (the pair of start base number and end base number in the genome base sequence). Returning to FIG. 1, the expression region data file 106c stores data on the expression region estimated by the expression region estimation unit 102b described later. The gene structure data file 106d stores data on the regions and structures of genes predicted by the gene structure prediction unit 102c described later. The expression value data file 106e stores data on the expression values calculated by the expression value calculation unit 102d described later.
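As a rough illustration of the two input tables described above (base number to base symbol, and probe number to intensity and range), the following Python sketch may help. The names `ProbeRecord`, `load_genome`, and `probes_covering` are hypothetical; the specification does not prescribe any in-memory representation or file format.

```python
from typing import NamedTuple


class ProbeRecord(NamedTuple):
    probe_number: int  # unique identifier of the probe
    intensity: float   # measured expression intensity
    start: int         # start base number in the genome (1-based, inclusive)
    end: int           # end base number in the genome (1-based, inclusive)


def load_genome(sequence: str) -> dict[int, str]:
    """Associate each base number (1-based) with its base symbol A, T, G, or C."""
    table = {}
    for base_number, symbol in enumerate(sequence.upper(), start=1):
        if symbol not in "ATGC":
            raise ValueError(f"unexpected base symbol {symbol!r} at {base_number}")
        table[base_number] = symbol
    return table


def probes_covering(records: list[ProbeRecord], base_number: int) -> list[ProbeRecord]:
    """Return the probes whose range includes the given base number."""
    return [r for r in records if r.start <= base_number <= r.end]
```

A dictionary keyed by base number mirrors the association shown for FIG. 2; for a real genome a plain string or array indexed by position would be far more compact.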

通信インターフェース部１０４は遺伝子構造予測装置１００とネットワーク３００（またはルータ等の通信装置）との間における通信を媒介する。すなわち、通信インターフェース部１０４は他の端末と通信回線を介してデータを通信する機能を有する。 The communication interface unit 104 mediates communication between the gene structure prediction device 100 and the network 300 (or a communication device such as a router). That is, the communication interface unit 104 has a function of communicating data with other terminals via a communication line.

入出力インターフェース部１０８は入力装置１１２や出力装置１１４に接続する。ここで、出力装置１１４には、モニタ（家庭用テレビを含む）の他、スピーカやプリンタを用いることができる（なお、以下で、出力装置１１４をモニタとして記載する場合がある。）。また、入力装置１１２には、キーボードやマウスやマイクの他、マウスと協働してポインティングデバイス機能を実現するモニタを用いることができる。 The input / output interface unit 108 is connected to the input device 112 and the output device 114. Here, in addition to a monitor (including a home television), a speaker or a printer can be used as the output device 114 (the output device 114 may be described as a monitor below). In addition to the keyboard, mouse, and microphone, the input device 112 can be a monitor that realizes a pointing device function in cooperation with the mouse.

制御部１０２は、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）等の制御プログラムや各種の処理手順等を規定したプログラムや所要データを格納するための内部メモリを有し、これらのプログラムに基づいて種々の処理を実行する。そして、制御部１０２は、大別して、データ取得部１０２ａと発現領域推定部１０２ｂと遺伝子構造予測部１０２ｃと発現値算出部１０２ｄと予測対象領域決定部１０２ｅとを備えている。 The control unit 102 has an internal memory for storing control programs such as an OS (Operating System), programs that define various processing procedures, and required data, and executes various processes based on these programs. The control unit 102 broadly comprises a data acquisition unit 102a, an expression region estimation unit 102b, a gene structure prediction unit 102c, an expression value calculation unit 102d, and a prediction target region determination unit 102e.

データ取得部１０２ａは、ゲノム塩基配列から規則的に抜き出された部分塩基配列をプローブとして配置したタイリングアレイを用いて、当該タイリングアレイで測定された各プローブの発現強度に関するタイリングアレイデータおよびゲノム塩基配列に関するゲノム塩基配列データを取得する。 The data acquisition unit 102a acquires tiling array data on the expression intensity of each probe, measured with a tiling array in which partial base sequences regularly extracted from the genome base sequence are arranged as probes, together with genome base sequence data on the genome base sequence.

発現領域推定部１０２ｂは、データ取得部１０２ａで取得したタイリングアレイデータに基づいて、ゲノム塩基配列の中から遺伝子の発現領域を推定する。具体的には、図４に示すように、発現領域推定部１０２ｂは、発現領域候補決定部１０２ｂ１と発現領域候補判定部１０２ｂ２と発現領域決定部１０２ｂ３とをさらに備えている。発現領域候補決定部１０２ｂ１は、タイリングアレイデータに基づいて遺伝子の発現領域の候補である発現領域候補を決定する。具体的には、図５に示すように、発現領域候補決定部１０２ｂ１は、中央値算出部１０２ｂ１１と領域移動部１０２ｂ１２と発現領域候補選出部１０２ｂ１３とをさらに備えている。中央値算出部１０２ｂ１１は、ゲノム塩基配列における所定の長さの領域を対象として、当該領域に含まれるプローブの発現強度の中央値を算出する。領域移動部１０２ｂ１２は、中央値算出部１０２ｂ１１で中央値の算出対象とした領域をゲノム塩基配列に沿って移動する。発現領域候補選出部１０２ｂ１３は、中央値算出部１０２ｂ１１および領域移動部１０２ｂ１２で行われる処理を繰り返し実行することで蓄積した複数の中央値の中から最大のものを選択し、選択した最大の中央値の算出対象であった領域を発現領域候補として選出する。再び図４に戻り、発現領域候補判定部１０２ｂ２は、発現領域候補決定部１０２ｂ１で決定した発現領域候補が統計的に有意であるか否かを判定する。発現領域決定部１０２ｂ３は、発現領域候補判定部１０２ｂ２で有意であると判定した場合、発現領域候補を発現領域として決定する。 The expression region estimation unit 102b estimates the expression region of a gene within the genome base sequence based on the tiling array data acquired by the data acquisition unit 102a. Specifically, as shown in FIG. 4, the expression region estimation unit 102b further includes an expression region candidate determination unit 102b1, an expression region candidate determination unit 102b2, and an expression region determination unit 102b3. The expression region candidate determination unit 102b1 determines an expression region candidate, which is a candidate for the expression region of a gene, based on the tiling array data. Specifically, as shown in FIG. 5, the expression region candidate determination unit 102b1 further includes a median value calculation unit 102b11, a region moving unit 102b12, and an expression region candidate selection unit 102b13. The median value calculation unit 102b11 takes a region of a predetermined length in the genome base sequence and calculates the median of the expression intensities of the probes contained in that region. The region moving unit 102b12 moves the region used by the median value calculation unit 102b11 along the genome base sequence. The expression region candidate selection unit 102b13 selects the largest of the median values accumulated by repeatedly executing the processing of the median value calculation unit 102b11 and the region moving unit 102b12, and selects the region for which that largest median was calculated as the expression region candidate. Returning to FIG. 4, the expression region candidate determination unit 102b2 determines whether the expression region candidate determined by the expression region candidate determination unit 102b1 is statistically significant. When the expression region candidate determination unit 102b2 determines that the candidate is significant, the expression region determination unit 102b3 determines the expression region candidate as the expression region.

再び図１に戻り、遺伝子構造予測部１０２ｃは、発現領域推定ステップで推定した発現領域に対応する塩基を起点として、当該起点の両隣に連なる塩基ごとに、タイリングアレイデータおよびゲノム塩基配列データに基づいて遺伝子の構造に関する属性（エクソン・イントロン・その他）を決定することで、遺伝子の構造を予測する。具体的には、図６に示すように、遺伝子構造予測部１０２ｃは、発現領域塩基確率決定部１０２ｃ１と隣接塩基確率決定部１０２ｃ２と尤度決定部１０２ｃ３と属性確定部１０２ｃ４とをさらに備えている。発現領域塩基確率決定部１０２ｃ１は、発現領域推定部１０２ｂで推定した発現領域に対応する塩基である発現領域塩基を対象として、当該発現領域塩基の属性に関する確率をマルコフモデルを用いて決定する。隣接塩基確率決定部１０２ｃ２は、発現領域確率決定部１０２ｃ１で対象とした発現領域塩基に隣接する隣接塩基を対象として、当該隣接塩基の属性に関する状態遷移確率をマルコフモデルを用いて決定する。尤度決定部１０２ｃ３は、隣接塩基確率決定部１０２ｃ２をゲノム塩基配列の末端の塩基まで繰り返して実行した後、各塩基の属性に関する尤度を最尤法を用いて決定する。属性確定部１０２ｃ４は、発現領域塩基確率決定部１０２ｃ１で決定した確率、隣接塩基確率決定部１０２ｃ２で決定した状態遷移確率および尤度決定部１０２ｃ３で決定した尤度に基づいて各塩基の属性の組み合わせを確定する。 Returning to FIG. 1, the gene structure prediction unit 102c predicts the structure of the gene by taking the base corresponding to the expression region estimated in the expression region estimation step as the starting point and determining, for each base extending on either side of that starting point, an attribute relating to gene structure (exon, intron, or other) based on the tiling array data and the genome base sequence data. Specifically, as shown in FIG. 6, the gene structure prediction unit 102c further includes an expression region base probability determination unit 102c1, an adjacent base probability determination unit 102c2, a likelihood determination unit 102c3, and an attribute determination unit 102c4. The expression region base probability determination unit 102c1 determines, using a Markov model, the probability for the attribute of the expression region base, which is the base corresponding to the expression region estimated by the expression region estimation unit 102b. The adjacent base probability determination unit 102c2 determines, using a Markov model, the state transition probability for the attribute of each adjacent base next to the expression region base targeted by the expression region base probability determination unit 102c1. The likelihood determination unit 102c3 determines the likelihood for the attribute of each base using the maximum likelihood method, after the processing of the adjacent base probability determination unit 102c2 has been repeated up to the terminal bases of the genome base sequence. The attribute determination unit 102c4 fixes the combination of attributes of the bases based on the probability determined by the expression region base probability determination unit 102c1, the state transition probability determined by the adjacent base probability determination unit 102c2, and the likelihood determined by the likelihood determination unit 102c3.

再び図１に戻り、発現値算出部１０２ｄは、遺伝子構造予測部１０２ｃで予測した遺伝子の構造に含まれるエクソン領域に対応するプローブの発現強度に基づいて、当該遺伝子の発現値を算出する。 Returning to FIG. 1 again, the expression value calculation unit 102d calculates the expression value of the gene based on the expression intensity of the probe corresponding to the exon region included in the gene structure predicted by the gene structure prediction unit 102c.
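The text above states only that the expression value is calculated from the intensities of the probes corresponding to the exon regions; it does not say how they are combined. The following Python sketch is therefore an assumption-laden illustration: it takes the median of the probes lying entirely inside an exon, and the name `gene_expression_value` and the tuple layouts are hypothetical.

```python
from statistics import median


def gene_expression_value(exons, probes):
    """Summarize exon probe intensities for one predicted gene.

    exons:  list of (start, end) base-number pairs, 1-based and inclusive
    probes: list of (start, end, intensity) tuples
    """
    values = [
        intensity
        for (p_start, p_end, intensity) in probes
        # Assumption: a probe "corresponds" to an exon if it lies fully inside it.
        if any(e_start <= p_start and p_end <= e_end for (e_start, e_end) in exons)
    ]
    if not values:
        raise ValueError("no probe lies inside any exon")
    # Assumption: the median is used as the summary statistic.
    return median(values)
```

Probes that fall in introns or straddle an exon boundary are ignored in this sketch, so a strongly hybridizing intronic probe does not distort the value.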

予測対象領域決定部１０２ｅは、遺伝子構造予測ステップで構造が予測された遺伝子の領域を除いたゲノム塩基配列である未予測領域を予測対象領域として決定する。具体的には、図７に示すように、予測対象領域決定部１０２ｅは、未予測領域抽出部１０２ｅ１と塩基長判定部１０２ｅ２と予測対象領域確定部１０２ｅ３とをさらに備えている。ここで、未予測領域抽出部１０２ｅ１は、未予測領域をゲノム塩基配列から抽出する。塩基長判定部１０２ｅ２は、未予測領域抽出部１０２ｅ１で抽出した未予測領域の塩基長を算出し、算出した塩基長が所定の塩基長以下であるか否かを判定する。予測対象領域確定部１０２ｅ３は、塩基長判定部１０２ｅ２で所定の塩基長以下でないと判定した場合、未予測領域抽出部１０２ｅ１で抽出した未予測領域を予測対象領域として確定する。 The prediction target region determination unit 102e determines, as the prediction target region, an unpredicted region, which is the genome base sequence excluding the regions of genes whose structures were predicted in the gene structure prediction step. Specifically, as shown in FIG. 7, the prediction target region determination unit 102e further includes an unpredicted region extraction unit 102e1, a base length determination unit 102e2, and a prediction target region determination unit 102e3. The unpredicted region extraction unit 102e1 extracts an unpredicted region from the genome base sequence. The base length determination unit 102e2 calculates the base length of the unpredicted region extracted by the unpredicted region extraction unit 102e1 and determines whether the calculated base length is equal to or less than a predetermined base length. When the base length determination unit 102e2 determines that the base length is not equal to or less than the predetermined base length, the prediction target region determination unit 102e3 confirms the unpredicted region extracted by the unpredicted region extraction unit 102e1 as the prediction target region.
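The extraction and length check described above can be sketched in Python as follows. This is a minimal illustration: the function name `prediction_targets` and the interval representation are assumptions, and the default of 1000 bases is the example threshold mentioned elsewhere in this specification.

```python
def prediction_targets(genome_length, predicted_regions, min_length=1000):
    """Return unpredicted regions longer than `min_length` bases.

    predicted_regions: list of (start, end) gene regions, 1-based and inclusive.
    A region whose length is equal to or less than `min_length` is discarded.
    """
    targets = []
    cursor = 1  # first base number not yet covered by a predicted gene region
    for start, end in sorted(predicted_regions):
        if start > cursor:
            gap_length = start - cursor  # bases cursor .. start-1
            if gap_length > min_length:
                targets.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    # Remaining stretch after the last predicted gene region, if any.
    if genome_length - cursor + 1 > min_length:
        targets.append((cursor, genome_length))
    return targets
```

Each returned interval would then be fed back into expression region estimation and gene structure prediction, which is how the method finds more than one gene in the same genome base sequence.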

以上の構成において、遺伝子構造予測装置１００の制御部１０２で行われるメイン処理を図８から図１１などを参照して説明する。図８はメイン処理の一例を示すフローチャートである。 With the above configuration, the main process performed by the control unit 102 of the gene structure prediction apparatus 100 is described with reference to FIGS. 8 to 11 and other figures. FIG. 8 is a flowchart showing an example of the main process.

まず、データ取得部１０２ａで、タイリングアレイデータおよびゲノム塩基配列データを取得し、取得したゲノム塩基配列データを塩基配列データファイル１０６ａに格納し、取得したタイリングアレイデータをタイリングアレイデータファイル１０６ｂに格納する（ステップＳＡ−１）。 First, the data acquisition unit 102a acquires tiling array data and genome base sequence data, stores the acquired genome base sequence data in the base sequence data file 106a, and stores the acquired tiling array data in the tiling array data file 106b. (Step SA-1).

つぎに、発現領域推定部１０２ｂで、ステップＳＡ−１で取得したタイリングアレイデータに基づいて、ゲノム塩基配列の中から遺伝子の発現領域を推定する（ステップＳＡ−２：発現領域推定処理）。具体的には、発現領域推定部１０２ｂで、ステップＳＡ−１で取得したタイリングアレイデータに基づいて遺伝子の発現領域の候補である発現領域候補を決定し、決定した発現領域候補が統計的に有意であるか否かを判定し、有意であると判定した場合、発現領域候補を発現領域として決定する。 Next, the expression region estimation unit 102b estimates the gene expression region from the genome base sequence based on the tiling array data acquired in step SA-1 (step SA-2: expression region estimation process). Specifically, the expression region estimation unit 102b determines expression region candidates that are gene expression region candidates based on the tiling array data acquired in step SA-1, and the determined expression region candidates are statistically determined. It is determined whether or not it is significant, and when it is determined that it is significant, an expression region candidate is determined as an expression region.

ここで、発現領域推定部１０２ｂで行われる発現領域推定処理を図９を参照して説明する。図９は発現領域推定処理の一例を示すフローチャートである。
まず、発現領域候補決定部１０２ｂ１で、タイリングアレイデータに基づいて遺伝子の発現領域の候補である発現領域候補を決定する（ステップＳＢ−１）。具体的には、以下の手順（１）〜（３）に従って発現領域候補を決定する。
（１）中央値算出部１０２ｂ１１で、ゲノム塩基配列における所定の長さの領域（ウインドウ：例えば７５０塩基長）を対象として、当該領域に含まれるプローブの発現強度の中央値を算出する。
（２）領域移動部１０２ｂ１２で、中央値算出部１０２ｂ１１で中央値の算出対象とした領域をゲノム塩基配列に沿って移動する。
（３）発現領域候補選出部１０２ｂ１３で、中央値算出部１０２ｂ１１および領域移動部１０２ｂ１２を繰り返し実行することで蓄積した複数の中央値の中から最大のものを選択し、選択した最大の中央値の算出対象であった領域を発現領域候補として選出する。
つまり、以上の手順により、数式１に示すプライマリーポイントｋを求める。

数式１において、ｖ_i（上付き線あり）は塩基番号ｉの位置において設定されたウインドウに含まれるプローブの発現強度ｖ_iの中央値である。また、ｋは、プライマリーポイント（ｐｒｉｍａｒｙｐｏｉｎｔ）と呼ばれるものであり、１以上Ｎ（ゲノム塩基配列の長さ）以下の整数値である。なお、ｍａｘｖ_i（上付き線あり）は推定の発現値（ｐｕｔａｔｉｖｅｅｘｐｒｅｓｓｉｏｎｖａｌｕｅ）と呼ばれるものである。
Here, the expression region estimation process performed by the expression region estimation unit 102b is described with reference to FIG. 9. FIG. 9 is a flowchart showing an example of the expression region estimation process.
First, the expression region candidate determination unit 102b1 determines an expression region candidate, which is a candidate for the expression region of a gene, based on the tiling array data (step SB-1). Specifically, the expression region candidate is determined by the following steps (1) to (3).
(1) The median value calculation unit 102b11 takes a region of a predetermined length in the genome base sequence (a window, for example 750 bases long) and calculates the median of the expression intensities of the probes contained in that region.
(2) The region moving unit 102b12 moves the region used by the median value calculation unit 102b11 along the genome base sequence.
(3) The expression region candidate selection unit 102b13 selects the largest of the median values accumulated by repeatedly executing the processing of the median value calculation unit 102b11 and the region moving unit 102b12, and selects the region for which that largest median was calculated as the expression region candidate.
In other words, the above steps yield the primary point $k$ shown in Equation 1.

In Equation 1, $\bar{v}_i$ is the median of the expression intensities $v_i$ of the probes contained in the window set at the position of base number $i$. $k$ is called the primary point and is an integer from 1 to $N$ (the length of the genome base sequence). $\max \bar{v}_i$ is called the putative expression value.
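Steps (1) to (3) amount to a sliding-window search for the window with the largest median intensity. The Python sketch below shows one way this could look. It rests on assumptions the text does not settle: the window is anchored with base number i as its left edge, a probe counts as contained only if it lies entirely inside the window, and ties are resolved in favor of the smallest i. The name `primary_point` is hypothetical, and the scan is deliberately naive rather than efficient.

```python
from statistics import median


def primary_point(probes, genome_length, window=750):
    """Return (k, putative_expression_value) for the best window.

    probes: list of (start, end, intensity) tuples with 1-based inclusive ranges.
    """
    best_k, best_median = None, float("-inf")
    for i in range(1, genome_length - window + 2):
        lo, hi = i, i + window - 1  # window covers base numbers lo..hi
        values = [v for (s, e, v) in probes if s >= lo and e <= hi]
        if not values:
            continue  # no probe falls inside this window
        m = median(values)
        if m > best_median:  # strict comparison keeps the first maximum
            best_k, best_median = i, m
    return best_k, best_median
```

Using the median rather than the mean makes the window score robust against a few probes with unusually high intensity.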

ついで、発現領域候補判定部１０２ｂ２で、発現領域候補決定部１０２ｂ１で決定した発現領域候補が統計的に有意であるか否かを判定する（ステップＳＢ−２）。具体的には、以下の手順（１）〜（３）に従って判定する。
（１）ステップＳＢ−１で決定したプライマリーポイントｋに対応するウインドウに含まれるプローブの発現強度が偶然に観測されたものであるという帰無仮説を設定する。
（２）当該ウインドウに含まれるプローブの総数Ｍ_kおよび発現強度が予め設定した閾値θより大きいプローブの数ｍ_kに関する確率を算出する。ここで、当該確率は下記の数式２に示す２項分布に従う。

数式２において、Ｍ_kはプライマリーポイントｋに対応するウインドウに含まれるプローブの総数（例えば３０）である。また、ｍ_kは、プライマリーポイントｋに対応するウインドウに含まれるプローブのうち、発現強度が予め設定した閾値θより大きいプローブの数である。また、πθは、プローブの発現強度が閾値θより大きい確率であり、下記の数式３で定義される。

数式３において、δ_iは下記の数式４で定義される値である。また、λ_iは、１塩基あたりのプローブ数であり、下記の数式５で定義される。また、Ｉｆ［ｃｏｎｄｉｔｉｏｎ］は、括弧内が真であれば１を、括弧内が偽であれば０を返す関数である。また、Ｎはゲノム塩基配列の長さである。

数式４において、ｖ_iは塩基番号ｉの塩基に対応するプローブの発現強度である。θは予め定めた閾値である。なお、θの初期値には、全プローブの発現強度の中央値を設定してもよい。

数式５において、ｎ_iは塩基番号ｉの塩基に対応するプローブの数である。また、ｌはプローブの長さ（塩基数）である。
（３）数式２で算出した確率が期待値（Ｍ_k×πθ）より有意に大きい場合には、下記の数式６に示す有意水準で帰無仮説を棄却する。つまり、ステップＳＢ−１で決定した発現領域候補が統計的に有意であると判定する。これにより、プライマリーポイントｋに対応するウインドウに含まれるプローブの発現強度が偶然に観測されたものでないことが、統計的検定により示された。

数式６において、Ｍ_iは塩基番号ｉに対応するウインドウに含まれるプローブの総数である。また、ｍ_iは、塩基番号ｉに対応するウインドウに含まれるプローブのうち、発現強度が予め設定した閾値θより大きいプローブの数である。また、πθは、プローブの発現強度が閾値θより大きい確率であり、数式３で定義される。 Next, the expression region candidate determination unit 102b2 determines whether or not the expression region candidate determined by the expression region candidate determination unit 102b1 is statistically significant (step SB-2). Specifically, the determination is made according to the following procedures (1) to (3).
(1) The null hypothesis is set that the expression intensities of the probes contained in the window corresponding to the primary point $k$ determined in step SB-1 were observed by chance.
(2) The probability associated with the total number $M_k$ of probes contained in the window and the number $m_k$ of probes whose expression intensity exceeds a preset threshold $\theta$ is calculated. This probability follows the binomial distribution shown in Equation 2 below.

In Equation 2, $M_k$ is the total number of probes (for example, 30) contained in the window corresponding to the primary point $k$. $m_k$ is the number of probes, among those contained in that window, whose expression intensity exceeds the preset threshold $\theta$. $\pi_\theta$ is the probability that the expression intensity of a probe exceeds the threshold $\theta$, and is defined by Equation 3 below.

In Equation 3, $\delta_i$ is the value defined by Equation 4 below. $\lambda_i$ is the number of probes per base and is defined by Equation 5 below. $\mathrm{If}[\text{condition}]$ is a function that returns 1 if the condition in brackets is true and 0 if it is false. $N$ is the length of the genome base sequence.

In Equation 4, $v_i$ is the expression intensity of the probe corresponding to the base with base number $i$, and $\theta$ is a predetermined threshold. The median of the expression intensities of all probes may be used as the initial value of $\theta$.

In Equation 5, $n_i$ is the number of probes corresponding to the base with base number $i$, and $l$ is the probe length (number of bases).
(3) If the probability calculated by Equation 2 is significantly greater than the expected value ($M_k \times \pi_\theta$), the null hypothesis is rejected at the significance level shown in Equation 6 below. That is, the expression region candidate determined in step SB-1 is judged to be statistically significant. The statistical test thus shows that the expression intensities of the probes contained in the window corresponding to the primary point $k$ were not observed by chance.

In Equation 6, $M_i$ is the total number of probes contained in the window corresponding to base number $i$. $m_i$ is the number of probes, among those contained in that window, whose expression intensity exceeds the preset threshold $\theta$. $\pi_\theta$ is the probability that the expression intensity of a probe exceeds the threshold $\theta$, and is defined by Equation 3.
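Because Equations 2 to 6 are not reproduced in this text, the following Python sketch should be read as a generic one-sided binomial test in the spirit of steps (1) to (3), not as the formulas of this specification. In particular, $\pi_\theta$ is estimated here simply as the fraction of all probes above the threshold (the per-base weighting by $\delta_i$ and $\lambda_i$ is omitted), and the significance level `alpha` is a hypothetical parameter standing in for Equation 6.

```python
from math import comb
from statistics import median


def binomial_upper_tail(m, M, p):
    """P(X >= m) for X ~ Binomial(M, p)."""
    return sum(comb(M, j) * p**j * (1 - p) ** (M - j) for j in range(m, M + 1))


def is_significant(window_values, all_values, theta=None, alpha=0.01):
    """Test whether a window holds more above-threshold probes than chance predicts.

    window_values: intensities of the probes inside the candidate window
    all_values:    intensities of all probes on the array
    """
    if theta is None:
        theta = median(all_values)  # initial threshold suggested in the text
    # Assumption: plain fraction of probes above theta, without per-base weighting.
    pi_theta = sum(v > theta for v in all_values) / len(all_values)
    M = len(window_values)
    m = sum(v > theta for v in window_values)
    p_value = binomial_upper_tail(m, M, pi_theta)
    return p_value < alpha, p_value
```

For a window of 10 probes that all exceed a threshold passed by 30 percent of the array, the tail probability is 0.3 to the tenth power, so the null hypothesis would be rejected at any conventional level.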

ついで、発現領域決定部１０２ｂ３で、ステップＳＢ−２で有意であると判定した場合、ステップＳＢ−１で決定した発現領域候補を発現領域として決定する（ステップＳＢ−３）。 Next, when the expression region determination unit 102b3 determines that the expression region is significant in Step SB-2, the expression region candidate determined in Step SB-1 is determined as an expression region (Step SB-3).

なお、遺伝子構造予測部１０２ｃの処理では、プライマリーポイントｋに対応するエクソン領域が実際に発現するという仮定のもとに、プライマリーポイントｋを起点として、遺伝子の構造（具体的にはエクソン領域およびイントロン領域）を予測する。
以上、発現領域推定処理の説明を終了する。
In the processing of the gene structure prediction unit 102c, the gene structure (specifically, the exon regions and intron regions) is predicted starting from the primary point $k$, on the assumption that the exon region corresponding to the primary point $k$ is actually expressed.
This concludes the description of the expression region estimation process.

再び図８に戻り、遺伝子構造予測部１０２ｃで、ステップＳＡ−２で推定した発現領域に対応する塩基を起点として、当該起点の両隣に連なる塩基ごとに、タイリングアレイデータおよびゲノム塩基配列データに基づいて遺伝子の構造に関する属性（エクソン・イントロン・その他）を決定することで、遺伝子の構造を予測する（ステップＳＡ−３：遺伝子構造予測処理）。具体的には、遺伝子構造予測部１０２ｃで、（１）ステップＳＡ−２で推定した発現領域に対応する塩基である発現領域塩基を対象として、当該発現領域塩基の属性に関する確率をマルコフモデルを用いて決定し、（２）当該発現領域塩基に隣接する隣接塩基を対象として、当該隣接塩基の属性に関する状態遷移確率をマルコフモデルを用いて決定し、（３）（２）の処理をゲノム塩基配列の末端の塩基まで繰り返して実行した後、各塩基の属性に関する尤度を最尤法を用いて決定し、（４）決定した確率、状態遷移確率および尤度に基づいて各塩基の属性の組み合わせを確定する。より具体的には、遺伝子構造予測部１０２ｃで、下記の数式７に示すスコア関数をダイナミックプログラミング（ニュートン法やパウエル法などでもよい）で計算することにより、スコア関数の値を最大にする属性ｓと閾値θとの組み合わせを決定することで、遺伝子の構造を予測する。ここで、属性ｓは、図１３における「ｓｔａｔｅｎｕｍｂｅｒ」に対応するものである。

数式７において、Ｚ（ｓ，θ）は、下記の数式８で定義されるオッズである。また、Ｚ_T（ｓ）は、エクソン領域を尤もらしい長さに保つための式であり、下記の数式１５で定義される。また、Ｚ_U（ｓ）は、イントロン領域を尤もらしい長さに保つための式であり、下記の数式１８で定義される。

数式８において、Ｐ（ｘ，ｓ，δ｜θ）は、遺伝子である確率であり、下記の数式９で定義される、なお、Ｐ（ｘ，ｓ，δ｜θ）は、式の変形を行うことで最終的に下記の数式１０で表すことができる。また、Ｐ（ｘ，ｓ（上付き線あり），δ｜θ）は、遺伝子でない確率であり、式の変形を同様に行うことで最終的に下記の数式１１で表すことができる。

Returning to FIG. 8, the gene structure prediction unit 102c predicts the structure of the gene by taking the base corresponding to the expression region estimated in step SA-2 as the starting point and determining, for each base extending on either side of that starting point, an attribute relating to gene structure (exon, intron, or other) based on the tiling array data and the genome base sequence data (step SA-3: gene structure prediction process). Specifically, the gene structure prediction unit 102c (1) determines, using a Markov model, the probability for the attribute of the expression region base, which is the base corresponding to the expression region estimated in step SA-2; (2) determines, using a Markov model, the state transition probability for the attribute of each adjacent base next to that expression region base; (3) repeats the processing of (2) up to the terminal bases of the genome base sequence and then determines the likelihood for the attribute of each base using the maximum likelihood method; and (4) fixes the combination of attributes of the bases based on the determined probability, state transition probabilities, and likelihoods. More specifically, the gene structure prediction unit 102c evaluates the score function shown in Equation 7 below by dynamic programming (Newton's method, Powell's method, or the like may also be used) and predicts the structure of the gene by determining the combination of attribute $s$ and threshold $\theta$ that maximizes the value of the score function. Here, the attribute $s$ corresponds to the “state number” in FIG. 13.

In Equation 7, $Z(s, \theta)$ is the odds defined by Equation 8 below. $Z_T(s)$ is a term for keeping exon regions at a plausible length and is defined by Equation 15 below. $Z_U(s)$ is a term for keeping intron regions at a plausible length and is defined by Equation 18 below.

In Equation 8, $P(x, s, \delta \mid \theta)$ is the probability of being a gene and is defined by Equation 9 below; by rearranging the expression, it can finally be written as Equation 10 below. $P(x, \bar{s}, \delta \mid \theta)$ is the probability of not being a gene and, by the same rearrangement, can finally be written as Equation 11 below.
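To make the dynamic programming idea concrete, here is a generic Viterbi recursion in Python for a first-order Markov model over per-base attributes. It is a textbook sketch under stated assumptions, not the score function of Equation 7: it scores a single model in log space instead of the odds of Equation 8, it runs left to right rather than outward from the primary point, and it leaves out the length terms $Z_T(s)$ and $Z_U(s)$. All table names are hypothetical.

```python
def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the highest-scoring state path and its log score.

    log_start[s]      log probability of starting in state s
    log_trans[p][s]   log probability of moving from state p to state s
    log_emit[s][o]    log probability of observing o in state s
    """
    score = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in s at this position.
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```

In a usage resembling the text, the observations could be the 0/1 indicators of whether each base's probe intensity exceeds $\theta$, and the states the attributes exon, intron, and other; the search over $\theta$ would wrap around this call.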

ここで、遺伝子構造予測部１０２ｃで行われる遺伝子構造予測処理を図１０を参照して説明する。図１０は遺伝子構造予測処理の一例を示すフローチャートである。
まず、発現領域塩基確率決定部１０２ｃ１で、ステップＳＡ−２で推定した発現領域に対応する塩基である発現領域塩基を対象として、当該発現領域塩基の属性に関する確率をマルコフモデルを用いて決定する（ステップＳＣ−１）。具体的には、プライマリーポイントｋを対象として、数式１０および数式１１に含まれる確率Ｐ（ｘ_k，ｓ_k）およびＰ（ｘ_k，ｓ_k（上付き線あり））について、属性ｓ_kを変えながら、「Ｐ（ｘ_k，ｓ_k）÷Ｐ（ｘ_k，ｓ_k（上付き線あり））」の最大値を決定する。
Here, the gene structure prediction process performed by the gene structure prediction unit 102c is described with reference to FIG. 10. FIG. 10 is a flowchart showing an example of the gene structure prediction process.
First, the expression region base probability determination unit 102c1 determines, using a Markov model, the probability for the attribute of the expression region base, which is the base corresponding to the expression region estimated in step SA-2 (step SC-1). Specifically, for the primary point $k$ and the probabilities $P(x_k, s_k)$ and $P(x_k, \bar{s}_k)$ contained in Equations 10 and 11, the maximum value of $P(x_k, s_k) \div P(x_k, \bar{s}_k)$ is determined while varying the attribute $s_k$.

ついで、隣接塩基確率決定部１０２ｃ２で、ステップＳＣ−１で対象とした発現領域塩基に隣接する隣接塩基を対象として、当該隣接塩基の属性に関する状態遷移確率をマルコフモデルを用いて決定する（ステップＳＣ−２：図１２参照）。具体的には、数式１０および数式１１に含まれる状態遷移確率Ｐ（ｘ_k+1，ｓ_k+1｜ｘ_k，ｓ_k）およびＰ（ｘ_k+1，ｓ_k+1（上付き線あり）｜ｘ_k，ｓ_k（上付き線あり））について、属性ｓ_k+1を変えながら、「Ｐ（ｘ_k+1，ｓ_k+1｜ｘ_k，ｓ_k）÷Ｐ（ｘ_k+1，ｓ_k+1（上付き線あり）｜ｘ_k，ｓ_k（上付き線あり））」の最大値を決定する。また、同様に、数式１０および数式１１に含まれる状態遷移確率Ｐ（ｘ_k-1，ｓ_k-1｜ｘ_k，ｓ_k）およびＰ（ｘ_k-1，ｓ_k-1（上付き線あり）｜ｘ_k，ｓ_k（上付き線あり））について、属性ｓ_k-1を変えながら、「Ｐ（ｘ_k-1，ｓ_k-1｜ｘ_k，ｓ_k）÷Ｐ（ｘ_k-1，ｓ_k-1（上付き線あり）｜ｘ_k，ｓ_k（上付き線あり））」の最大値を決定する。ここで、図１２において、３'ｅｎｄと５'ｅｎｄが状態遷移の終点である。 Next, the adjacent base probability determination unit 102c2 determines, using a Markov model, the state transition probability for the attribute of each adjacent base next to the expression region base targeted in step SC-1 (step SC-2; see FIG. 12). Specifically, for the state transition probabilities $P(x_{k+1}, s_{k+1} \mid x_k, s_k)$ and $P(x_{k+1}, \bar{s}_{k+1} \mid x_k, \bar{s}_k)$ contained in Equations 10 and 11, the maximum value of $P(x_{k+1}, s_{k+1} \mid x_k, s_k) \div P(x_{k+1}, \bar{s}_{k+1} \mid x_k, \bar{s}_k)$ is determined while varying the attribute $s_{k+1}$. Similarly, for the state transition probabilities $P(x_{k-1}, s_{k-1} \mid x_k, s_k)$ and $P(x_{k-1}, \bar{s}_{k-1} \mid x_k, \bar{s}_k)$ contained in Equations 10 and 11, the maximum value of $P(x_{k-1}, s_{k-1} \mid x_k, s_k) \div P(x_{k-1}, \bar{s}_{k-1} \mid x_k, \bar{s}_k)$ is determined while varying the attribute $s_{k-1}$. In FIG. 12, the 3' end and the 5' end are the end points of the state transitions.

そして、ステップＳＣ−２の処理をゲノム塩基配列の末端の塩基まで繰り返して実行する。これにより、数式１０および数式１１に含まれる状態遷移確率Ｐ（ｘ_i，ｓ_i｜ｘ_i+1，ｓ_i+1）の積および状態遷移確率Ｐ（ｘ_i，ｓ_i（上付き線あり）｜ｘ_i+1，ｓ_i+1（上付き線あり））の積について、「Ｐ（ｘ_i，ｓ_i｜ｘ_i+1，ｓ_i+1）の積÷Ｐ（ｘ_i，ｓ_i（上付き線あり）｜ｘ_i+1，ｓ_i+1（上付き線あり））の積」の最大値を決定する。また、同様に、数式１０および数式１１に含まれる状態遷移確率Ｐ（ｘ_i，ｓ_i｜ｘ_i-1，ｓ_i-1）の積および状態遷移確率Ｐ（ｘ_i，ｓ_i（上付き線あり）｜ｘ_i-1，ｓ_i-1（上付き線あり））の積について、「Ｐ（ｘ_i，ｓ_i｜ｘ_i-1，ｓ_i-1）の積÷Ｐ（ｘ_i，ｓ_i（上付き線あり）｜ｘ_i-1，ｓ_i-1（上付き線あり））の積」の最大値を決定する。なお、図１２において、起点から終点への領域の伸長は、例えば、終点の状態（塩基番号０および塩基番号２５の状態）が長さＬ（例えば８００塩基長）の連続した塩基配列にわたって各位置で最大のスコアを保つ場合に停止する。 The processing of step SC-2 is then repeated up to the terminal bases of the genome base sequence. In this way, for the product of the state transition probabilities $P(x_i, s_i \mid x_{i+1}, s_{i+1})$ and the product of the state transition probabilities $P(x_i, \bar{s}_i \mid x_{i+1}, \bar{s}_{i+1})$ contained in Equations 10 and 11, the maximum value of $\prod_i P(x_i, s_i \mid x_{i+1}, s_{i+1}) \div \prod_i P(x_i, \bar{s}_i \mid x_{i+1}, \bar{s}_{i+1})$ is determined. Similarly, for the product of the state transition probabilities $P(x_i, s_i \mid x_{i-1}, s_{i-1})$ and the product of the state transition probabilities $P(x_i, \bar{s}_i \mid x_{i-1}, \bar{s}_{i-1})$ contained in Equations 10 and 11, the maximum value of $\prod_i P(x_i, s_i \mid x_{i-1}, s_{i-1}) \div \prod_i P(x_i, \bar{s}_i \mid x_{i-1}, \bar{s}_{i-1})$ is determined. In FIG. 12, the extension of the region from the starting point toward an end point stops, for example, when the end-point state (the state of base number 0 and base number 25) keeps the maximum score at every position over a continuous base sequence of length $L$ (for example, 800 bases).
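The stopping condition for the extension can be sketched as a simple run-length check. The Python below is illustrative only: `extension_stop` and the list of per-position best states are hypothetical, and the sketch assumes the caller already knows, for each position moving away from the primary point, which state currently holds the maximum score.

```python
def extension_stop(best_states, terminal_state, run_length=800):
    """Return the index at which extension stops, or None if it never does.

    best_states:    best-scoring state at each successive position, ordered
                    outward from the primary point
    terminal_state: label of the end-point state (for example the 3' end)
    run_length:     number of consecutive positions (L) the terminal state
                    must stay on top before extension stops
    """
    run = 0
    for index, state in enumerate(best_states):
        run = run + 1 if state == terminal_state else 0  # reset on any other state
        if run >= run_length:
            return index
    return None
```

The same check would be applied once in the 5' direction and once in the 3' direction, so the two ends of the gene region are found independently.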

Next, the likelihood determination unit 102c3 determines the likelihood for the attribute of each base using the maximum likelihood method (step SC-3). Specifically, for the product of the likelihoods $P(\delta_i \mid s_i, \theta)$ and the product of the likelihoods $P(\delta_i \mid \bar{s}_i, \theta)$ contained in Equations 10 and 11, it determines the value of the ratio $\prod_i P(\delta_i \mid s_i, \theta) / \prod_i P(\delta_i \mid \bar{s}_i, \theta)$. For example, in FIG. 12, the value of $P(\delta_i \mid s_i, \theta)$ is updated for the region from base number 0 to base number 25. Here, $P(\delta_i \mid s_i, \theta)$ and $P(\delta_i \mid \bar{s}_i, \theta)$ are defined by Equations 12 and 14 below, respectively.

Because each gene has a different expression level, the threshold $\theta$ and $P(\delta_i \mid s_i, \theta)$ should be determined locally within each transcription unit (TU). Accordingly, $a$, $b$, $c$, and $d$ in Equation 12 are each defined by Equation 13 below.

In Equation 13, $a$, $b$, $c$, and $d$ are determined within the region between the 5' end and the 3' end.

Through the above processing, the maximum value of $Z(s, \theta)$ contained in Equation 7 is determined for the preset threshold $\theta$.

Then, the value of $Z_T(s)$, the term shown in Equation 15 below that keeps exon regions at a plausible length, and the value of $Z_U(s)$, the term shown in Equation 18 below that keeps intron regions at a plausible length, are determined. This finally yields the value (maximum value) of the score function of Equation 7 at the preset threshold $\theta$. The maximum value of the score function, the threshold $\theta$, and the attributes $s = (s_1, \ldots, s_k, \ldots, s_N)$ at which the maximum was obtained are associated with one another and stored in a predetermined area of the gene structure data file 106d.

In Equation 15, $N_T$ is the number of exons and $l_{T,j}$ is the exon length. $P_T(l)$ is the probability distribution of the exon length $l$ and is defined by Equation 16 below. $E_T$ is the expected value of $P_T(l)$ and is defined by Equation 17 below.

In Equation 16, $\mu_T$ is the mean exon length and $\sigma_T$ is the standard deviation of the exon length. Both $\mu_T$ and $\sigma_T$ are measured values observed in the training set.

In Equation 18, $N_U$ is the number of introns and $l_{U,j}$ is the intron length. $P_U(l)$ is the probability distribution of the intron length $l$ and is defined by Equation 19 below. $E_U$ is the expected value of $P_U(l)$ and is defined by Equation 20 below.

In Equation 19, $\mu_U$ is the mean intron length and $\sigma_U$ is the standard deviation of the intron length. Both $\mu_U$ and $\sigma_U$ are measured values observed in the training set.

Next, the threshold $\theta$ is reset to another value and the above processing is performed again in the same way.

Next, the attribute determination unit 102c4 determines the combination of attributes of the bases on the basis of the probability, the state transition probabilities, and the likelihood determined up to step SC-3 (step SC-4). Specifically, it selects the largest of the score-function maxima accumulated in the gene structure data file 106d for the individual thresholds $\theta$ and fixes the attribute $s$ corresponding to that largest value as the optimum. The combination of threshold $\theta$ and attribute $s$ that maximizes the score function is thereby fixed, and the gene structure is predicted.

This concludes the description of the gene structure prediction process.
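The outer loop over thresholds can be sketched as follows. This is a minimal illustration, not the patented implementation: `maximize_score` is a hypothetical callable standing in for steps SC-1 to SC-3 together with the length terms, and the list `records` stands in for the gene structure data file 106d.

```python
def scan_thresholds(thetas, maximize_score):
    """Try each threshold and keep the one with the largest score maximum.

    thetas: iterable of candidate thresholds.
    maximize_score: hypothetical callable, theta -> (max_score, attributes),
        standing in for the per-threshold maximization of the score function.
    Returns (best_record, records), each record being (theta, score, attributes).
    """
    records = []
    for theta in thetas:
        score, attributes = maximize_score(theta)
        # Store the per-threshold maximum with its threshold and attributes.
        records.append((theta, score, attributes))
    # Step SC-4: pick the record whose stored maximum is the largest.
    best = max(records, key=lambda record: record[1])
    return best, records
```

Keeping every record rather than only the running best mirrors the description, in which each threshold's result is stored before the final selection.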

Returning to FIG. 8, the prediction target region determination unit 102e determines, as the prediction target region, the unpredicted region, that is, the genome base sequence excluding the regions of genes whose structure was predicted in step SA-3 (step SA-4: prediction target region determination process). Specifically, the unit extracts an unpredicted region from the genome base sequence, calculates its base length, determines whether the calculated base length is equal to or less than a predetermined base length, and, if it is not, confirms the extracted unpredicted region as the prediction target region.

The prediction target region determination process performed by the prediction target region determination unit 102e is described here with reference to FIG. 11, a flowchart showing an example of the process.
First, the unpredicted region extraction unit 102e1 extracts an unpredicted region from the genome base sequence (step SD-1).
Next, the base length determination unit 102e2 calculates the base length of the unpredicted region extracted in step SD-1 and determines whether the calculated base length is equal to or less than a predetermined base length (for example, 1000 bases) (step SD-2).
Next, when step SD-2 determines that the base length is not equal to or less than the predetermined base length (step SD-3: Yes), the prediction target region confirmation unit 102e3 confirms the unpredicted region extracted in step SD-1 as the prediction target region (step SD-4). When step SD-2 determines that the base length is equal to or less than the predetermined base length (step SD-3: No), it is judged that no gene structure could fit within a region of that length, the extracted unpredicted region is treated as a region whose structure has already been predicted, and the process returns to step SD-1.
This concludes the description of the prediction target region determination process.
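Steps SD-1 to SD-4 can be sketched as an interval computation. The function name, the half-open `(start, end)` coordinate convention, and the batch-style return value are assumptions made for illustration; the patent processes one unpredicted region at a time.

```python
def prediction_target_regions(genome_length, predicted, min_length=1000):
    """Return unpredicted regions longer than min_length as (start, end) pairs.

    genome_length: total number of bases in the genome base sequence.
    predicted: list of half-open (start, end) intervals already predicted.
    min_length: the predetermined base length (1000 bases in the example).
    """
    targets = []
    cursor = 0
    for start, end in sorted(predicted):
        # SD-1/SD-2: the gap before this predicted gene is an unpredicted region.
        if start - cursor > min_length:
            targets.append((cursor, start))  # SD-4: keep as prediction target
        cursor = max(cursor, end)
    # The tail after the last predicted gene is also an unpredicted region.
    if genome_length - cursor > min_length:
        targets.append((cursor, genome_length))
    return targets
```

A gap of exactly `min_length` bases is dropped, matching the "equal to or less than" test in step SD-2.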

Returning to FIG. 8, when a prediction target region has been determined in step SA-4 (step SA-5: No), the process returns to step SA-2; when no prediction target region has been determined in step SA-4 (step SA-5: Yes), the main process ends.

Here, the expression value calculation unit 102d may calculate the expression value of a gene on the basis of the expression intensities of the probes corresponding to the exon regions included in the gene structure predicted in step SA-3, and may store data on the calculated expression value in a predetermined storage area of the expression value data file 106e.
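The expression value calculation can be sketched as below. The text does not state which summary statistic is used, so the median here is purely an assumption, as are the function name and the `(position, intensity)` probe representation.

```python
from statistics import median


def gene_expression_value(probes, exons):
    """Summarize the intensities of probes that fall inside predicted exons.

    probes: list of (position, intensity) pairs along the genome.
    exons: list of half-open (start, end) exon intervals of one predicted gene.
    The median is an assumed statistic; the source does not specify one.
    """
    values = [
        intensity
        for position, intensity in probes
        if any(start <= position < end for start, end in exons)
    ]
    if not values:
        raise ValueError("no probe falls inside an exon region")
    return median(values)
```

Probes in predicted introns are ignored, which is the point made in the text: only exon-associated probes contribute to the expression value of the gene.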

This concludes the description of the main process.

As described above, the gene structure prediction apparatus 100 predicts gene structures from a genome base sequence on the basis of tiling array data. Specifically, the apparatus acquires tiling array data and genome base sequence data, estimates a gene expression region within the genome base sequence from the acquired tiling array data, and, taking the base corresponding to the estimated expression region as a starting point, determines an attribute relating to gene structure for each base extending on both sides of that starting point on the basis of the tiling array data and the genome base sequence data. The regions and structures of unknown genes can therefore be predicted accurately from the genome base sequence on the basis of tiling array data. The present invention can also be applied to non-coding RNA. As shown in FIG. 14, the exons and introns in the predicted gene structures correlated highly with the true exons and introns based on cDNA information, and the correlation coefficient was higher than that of the conventional method, which predicts gene structure by classifying each position as exon or intron according to whether a threshold is exceeded. Furthermore, as shown in FIG. 15, the predicted gene structure (shown at MA-3-2 in FIG. 15, where orange represents exon regions and light orange represents intron regions) corresponded broadly to the true gene structure (shown at MA-3-1 in FIG. 15, where blue represents exon regions and light blue represents intron regions).

FIG. 15, a display screen that shows the predicted gene structure and the true gene structure so that they can be compared, is briefly described here. In FIG. 15, MA-1 is a display area showing an illustration of the target genome, the predicted gene region (8848876 bp to 8865453 bp: 16557 bp in MA-1), and the like. MA-2 is a set of operation buttons for enlarging and reducing the gene structure displayed in the display area MA-3 described below. MA-3 is a display area for showing gene structures along the genome base sequence. Within it, MA-3-1 is the area that displays the true gene structure used for comparison, and MA-3-2 is the area that displays the predicted gene structure along the base sequence. In MA-3-1, the exon regions of the true gene structure are shown in blue and the intron regions in light blue. In MA-3-2, the exon regions of the displayed gene structure are shown in orange and the intron regions in light orange. Also in MA-3-2, the expression intensity of the probe corresponding to each base is displayed along the base sequence, in green when it is at or below the median and in red when it exceeds the median. MA-3-3 is an operation button used to switch the data displayed.

In the gene structure prediction apparatus 100, the expression region estimation unit 102b determines an expression region candidate, that is, a candidate for a gene expression region, on the basis of the tiling array data, judges whether the determined candidate is statistically significant, and, when it is judged significant, determines the candidate as the expression region. The expression region can therefore be estimated easily and accurately using existing statistical methods.

In the gene structure prediction apparatus 100, the expression region candidate determination unit 102b1 takes a region of predetermined length in the genome base sequence, calculates the median of the expression intensities of the probes contained in that region, and moves the region along the genome base sequence. By repeating these operations it accumulates a plurality of medians, selects the largest of them, and selects the region for which that largest median was calculated as the expression region candidate. An appropriate expression region candidate can therefore be selected while taking into account the variation in the expression intensities measured for the individual probes on the tiling array.
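The moving-window median selection can be sketched as follows. As a simplification, the window here is counted in probes rather than in bases and is moved one probe at a time; both choices, and the function name, are assumptions for illustration.

```python
from statistics import median


def select_expression_region_candidate(intensities, window):
    """Return (start_index, window_median) for the window with the largest median.

    intensities: probe expression intensities ordered along the genome.
    window: number of consecutive probes per window (assumed unit).
    """
    if window <= 0 or window > len(intensities):
        raise ValueError("window must be between 1 and len(intensities)")
    best_start, best_median = 0, median(intensities[:window])
    # Move the window along the sequence and keep the largest median seen.
    for start in range(1, len(intensities) - window + 1):
        current = median(intensities[start:start + window])
        if current > best_median:
            best_start, best_median = start, current
    return best_start, best_median
```

Using the median rather than the mean keeps a single very bright or very dim probe from dominating the window, which is the robustness to probe-level variation that the text refers to.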

In the gene structure prediction apparatus 100, the gene structure prediction unit 102c takes the expression region base, that is, the base corresponding to the estimated expression region, and determines the probability for the attribute of that base using a Markov model; takes the bases adjacent to the expression region base and determines the state transition probabilities for their attributes using the Markov model; repeats the determination of state transition probabilities up to the terminal bases of the genome base sequence; then determines the likelihood for the attribute of each base using the maximum likelihood method; and fixes the combination of attributes of the bases on the basis of the determined probability, state transition probabilities, and likelihood. The regions and structures of genes can therefore be predicted with still higher accuracy.

The gene structure prediction apparatus 100 also calculates the expression value of a gene on the basis of the expression intensities of the probes corresponding to the exon regions included in the predicted gene structure, so the expression level of an unknown gene can be obtained.

The gene structure prediction apparatus 100 also determines, as the prediction target region, the unpredicted region, that is, the genome base sequence excluding the regions of genes whose structure has been predicted, and executes the expression region estimation process and the gene structure prediction process described above again on the determined prediction target region. The regions and structures of a plurality of genes can therefore be predicted efficiently from the genome base sequence.

In the gene structure prediction apparatus 100, the prediction target region determination unit 102e extracts an unpredicted region from the genome base sequence, calculates its base length, determines whether the calculated base length is equal to or less than a predetermined base length, and, if it is not, confirms the extracted unpredicted region as the prediction target region. Short regions that could not contain a gene structure are thereby excluded, so a prediction target region suitable for structure prediction can be determined.

Finally, the procedure for transforming Equation 9 above into Equation 10 is described. In Equation 9, $P_k$ is defined by Equation 21 below, $P_{k \to N}$ by Equation 22 below, and $P_{1 \to k}$ by Equation 23 below.

First, by the law of posterior probability, Equations 21, 22, and 23 can be transformed into Equations 24, 25, and 26 below, respectively.

Next, since it is natural to assume that $\delta_i$ depends on $s_i$ and $\theta$ and is not affected by $x_i$, $P(\delta_k \mid x_k, s_k, \theta)$ in Equations 24, 25, and 26 can be transformed into Equation 27 below.

Further, using the basic assumption of the Markov model that the probabilities of $x_i$ and $s_i$ in Equation 24 depend on the adjacent $x_{i-1}$ and $s_{i-1}$, $P(x_k, s_k \mid \theta)$ in Equation 24 can be transformed into Equation 28 below.

Similarly, in Equations 25 and 26, $P(x_i, s_i \mid x_{i-1}, s_{i-1}, \delta_{i-1}, \theta)$ and $P(x_i, s_i \mid x_{i+1}, s_{i+1}, \delta_{i+1}, \theta)$ can be transformed into Equations 29 and 30 below, respectively.
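The numbered equations are not reproduced in this text. Read literally, the independence assumption for $\delta_i$ and the Markov assumption amount to the conditional-independence statements below. This is a hedged reading consistent with the transition probabilities $P(x_i, s_i \mid x_{i \pm 1}, s_{i \pm 1})$ that appear in Equations 10 and 11, not a verbatim copy of Equations 27, 29, and 30.

```latex
\begin{align}
P(\delta_k \mid x_k, s_k, \theta) &= P(\delta_k \mid s_k, \theta) \\
P(x_i, s_i \mid x_{i-1}, s_{i-1}, \delta_{i-1}, \theta) &= P(x_i, s_i \mid x_{i-1}, s_{i-1}) \\
P(x_i, s_i \mid x_{i+1}, s_{i+1}, \delta_{i+1}, \theta) &= P(x_i, s_i \mid x_{i+1}, s_{i+1})
\end{align}
```

Under this reading, the measured intensities enter the score only through the likelihood terms, while the sequence and attribute dependence is carried entirely by the state transition terms.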

Then, $P_k$ defined by Equation 21 can finally be transformed into Equation 31 below.

Similarly, $P_{k \to N}$ defined by Equation 22 can finally be transformed into Equation 32 below, and $P_{1 \to k}$ defined by Equation 23 can finally be transformed into Equation 33 below.

Equation 10 can be derived by transforming Equation 9 according to the above procedure.

As described above, the gene structure prediction method and gene structure prediction program according to the present invention can accurately predict the regions and structures of unknown genes from a genome base sequence on the basis of tiling array data, and are extremely useful in fields such as medicine and drug discovery.

FIG. 1 is a block diagram showing the configuration of the gene structure prediction apparatus 100.
FIG. 2 is a diagram showing an example of the information stored in the base sequence data file 106a.
FIG. 3 is a diagram showing an example of the information stored in the tiling array data file 106b.
FIG. 4 is a block diagram showing the configuration of the expression region estimation unit 102b.
FIG. 5 is a block diagram showing the configuration of the expression region candidate determination unit 102b1.
FIG. 6 is a block diagram showing the configuration of the gene structure prediction unit 102c.
FIG. 7 is a block diagram showing the configuration of the prediction target region determination unit 102e.
FIG. 8 is a flowchart showing an example of the main process.
FIG. 9 is a flowchart showing an example of the expression region estimation process.
FIG. 10 is a flowchart showing an example of the gene structure prediction process.
FIG. 11 is a flowchart showing an example of the prediction target region determination process.
FIG. 12 is a diagram showing an example of the Markov model for sequence data.
FIG. 13 is a diagram showing the definition of the attribute s.
FIG. 14 is a diagram showing an example of the evaluation results for prediction accuracy.
FIG. 15 is a diagram showing an example of a predicted gene structure.

Explanation of symbols

100 Gene structure prediction apparatus
102 Control unit
102a Data acquisition unit
102b Expression region estimation unit
102b1 Expression region candidate determination unit
102b11 Median calculation unit
102b12 Region moving unit
102b13 Expression region candidate selection unit
102b2 Expression region candidate judgment unit
102b3 Expression region determination unit
102c Gene structure prediction unit
102c1 Expression region base probability determination unit
102c2 Adjacent base probability determination unit
102c3 Likelihood determination unit
102c4 Attribute determination unit
102d Expression value calculation unit
102e Prediction target region determination unit
102e1 Unpredicted region extraction unit
102e2 Base length determination unit
102e3 Prediction target region confirmation unit
104 Communication interface unit
106 Storage unit
106a Genome base sequence data file
106b Tiling array data file
106c Expression region data file
106d Gene structure data file
106e Expression value data file
108 Input/output interface unit
112 Input device
114 Output device
200 External system
300 Network

Claims

A gene structure prediction method comprising:
a data acquisition step of acquiring, using a tiling array in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes, tiling array data relating to the expression intensity of each probe measured with the tiling array, and genomic base sequence data relating to the genomic base sequence;
an expression region estimation step of estimating a gene expression region within the genomic base sequence on the basis of the tiling array data acquired in the data acquisition step; and
a gene structure prediction step of predicting the structure of the gene by determining, starting from the base corresponding to the expression region estimated in the expression region estimation step and for each base extending on both sides of that starting point, an attribute relating to the structure of the gene on the basis of the tiling array data and the genomic base sequence data,
wherein the expression region estimation step further includes:
an expression region candidate determination step of determining an expression region candidate, which is a candidate for a gene expression region, on the basis of the tiling array data;
an expression region candidate judgment step of judging whether the expression region candidate determined in the expression region candidate determination step is statistically significant; and
an expression region determination step of determining the expression region candidate as the expression region when it is judged to be significant in the expression region candidate judgment step,
and wherein the expression region candidate determination step further includes:
a median calculation step of calculating, for a region of predetermined length in the genomic base sequence, the median of the expression intensities of the probes contained in the region;
a region moving step of moving the region for which the median was calculated in the median calculation step along the genomic base sequence; and
an expression region candidate selection step of selecting the largest of the plurality of medians accumulated by repeatedly executing the median calculation step and the region moving step, and selecting the region for which the selected largest median was calculated as the expression region candidate.

The gene structure prediction method according to claim 1, wherein the gene structure prediction step further includes:
an expression region base probability determination step of determining, for an expression region base that is the base corresponding to the expression region estimated in the expression region estimation step, the probability relating to the attribute of the expression region base using a Markov model;
an adjacent base probability determination step of determining, for an adjacent base adjacent to the expression region base handled in the expression region base probability determination step, the state transition probability relating to the attribute of the adjacent base using a Markov model;
a likelihood determination step of determining, after the adjacent base probability determination step has been repeatedly executed up to the terminal base of the genomic base sequence, the likelihood relating to the attribute of each base using a maximum likelihood method; and
an attribute determination step of determining the combination of attributes of the bases on the basis of the probability, the state transition probability, and the likelihood.

The gene structure prediction method according to claim 1 or 2, further comprising an expression value calculation step of calculating the expression value of the gene on the basis of the expression intensities of the probes corresponding to the exon regions included in the gene structure predicted in the gene structure prediction step.

The gene structure prediction method according to any one of claims 1 to 3, further comprising a prediction target region determination step of determining, as a prediction target region, an unpredicted region that is the genomic base sequence excluding the region of the gene whose structure was predicted in the gene structure prediction step,
wherein the expression region estimation step and the gene structure prediction step are executed again on the prediction target region determined in the prediction target region determination step.

The gene structure prediction method according to claim 4, wherein the prediction target region determination step further includes:
an unpredicted region extraction step of extracting the unpredicted region from the genomic base sequence;
a base length determination step of calculating the base length of the unpredicted region extracted in the unpredicted region extraction step and determining whether the calculated base length is equal to or less than a predetermined base length; and
a prediction target region confirmation step of confirming the unpredicted region extracted in the unpredicted region extraction step as the prediction target region when the base length determination step determines that the base length is not equal to or less than the predetermined base length.

A gene structure prediction program that causes a computer to execute a gene structure prediction method comprising:
a data acquisition step of acquiring, using a tiling array in which partial base sequences regularly extracted from a genomic base sequence are arranged as probes, tiling array data relating to the expression intensity of each probe measured with the tiling array, and genomic base sequence data relating to the genomic base sequence;
an expression region estimation step of estimating a gene expression region within the genomic base sequence on the basis of the tiling array data acquired in the data acquisition step; and
a gene structure prediction step of predicting the structure of the gene by determining, starting from the base corresponding to the expression region estimated in the expression region estimation step and for each base extending on both sides of that starting point, an attribute relating to the structure of the gene on the basis of the tiling array data and the genomic base sequence data,
wherein the expression region estimation step further includes:
an expression region candidate determination step of determining an expression region candidate, which is a candidate for a gene expression region, on the basis of the tiling array data;
an expression region candidate judgment step of judging whether the expression region candidate determined in the expression region candidate determination step is statistically significant; and
an expression region determination step of determining the expression region candidate as the expression region when it is judged to be significant in the expression region candidate judgment step,
and wherein the expression region candidate determination step further includes:
a median calculation step of calculating, for a region of predetermined length in the genomic base sequence, the median of the expression intensities of the probes contained in the region;
a region moving step of moving the region for which the median was calculated in the median calculation step along the genomic base sequence; and
an expression region candidate selection step of selecting the largest of the plurality of medians accumulated by repeatedly executing the median calculation step and the region moving step, and selecting the region for which the selected largest median was calculated as the expression region candidate.

The gene structure prediction program according to claim 6, wherein the gene structure prediction step further includes:
an expression region base probability determination step of determining, for an expression region base that is the base corresponding to the expression region estimated in the expression region estimation step, the probability relating to the attribute of the expression region base using a Markov model;
an adjacent base probability determination step of determining, for an adjacent base adjacent to the expression region base handled in the expression region base probability determination step, the state transition probability relating to the attribute of the adjacent base using a Markov model;
a likelihood determination step of determining, after the adjacent base probability determination step has been repeatedly executed up to the terminal base of the genomic base sequence, the likelihood relating to the attribute of each base using a maximum likelihood method; and
an attribute determination step of determining the combination of attributes of the bases on the basis of the probability, the state transition probability, and the likelihood.

The gene structure prediction program according to claim 6 or 7, wherein the method further comprises an expression value calculation step of calculating the expression value of the gene on the basis of the expression intensities of the probes corresponding to the exon regions included in the gene structure predicted in the gene structure prediction step.

The gene structure prediction program according to any one of claims 6 to 8, wherein the method further comprises a prediction target region determination step of determining, as a prediction target region, an unpredicted region that is the genomic base sequence excluding the region of the gene whose structure was predicted in the gene structure prediction step,
and wherein the expression region estimation step and the gene structure prediction step are executed again on the prediction target region determined in the prediction target region determination step.

The gene structure prediction program according to claim 9, wherein the prediction target region determination step further includes:
an unpredicted region extraction step of extracting the unpredicted region from the genomic base sequence;
a base length determination step of calculating the base length of the unpredicted region extracted in the unpredicted region extraction step and determining whether the calculated base length is equal to or less than a predetermined base length; and
a prediction target region confirmation step of confirming the unpredicted region extracted in the unpredicted region extraction step as the prediction target region when the base length determination step determines that the base length is not equal to or less than the predetermined base length.