JP4975334B2

JP4975334B2 - Conserved region detection system considering the evolutionary process

Info

Publication number: JP4975334B2
Application number: JP2006035486A
Authority: JP
Inventors: 康行野崎
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2006-02-13
Filing date: 2006-02-13
Publication date: 2012-07-11
Anticipated expiration: 2026-02-13
Also published as: JP2007209305A

Description

The present invention relates to genome analysis in which genome sequences from a plurality of DNA (or amino acid) sequences are compared to investigate the meaning of the genome sequences, and in particular to a conserved region detection system considering the evolutionary process, which finds and displays conserved regions that have been preserved in the course of evolution.

In the prior art, sequencing projects around the world have advanced the genome analysis of humans, animals, and plants, and the resulting information is readily available through public databases. Comparing genome sequences, selected from the DNA sequences within and between species of various organisms, reveals which portions of a genome are specific to each species and which are common to all species. This information helps in understanding evolutionary processes and in interpreting biological meaning. For example, studies of the mammalian MHC (major histocompatibility complex) region (Hughes AL, Yeager M. Natural selection at major histocompatibility complex loci of vertebrates. Annu Rev Genet. 1998;32:415-35; McConnell TJ, Talbot WS, McIndoe RA, Wakeland EK. The origin of MHC class II gene polymorphism within the genus Mus. Nature. 1988;332(6165):651-4; Lawlor DA, Ward FE, Ennis PD, Jackson AP, Parham P. HLA-A and B polymorphisms predate the divergence of humans and chimpanzees. Nature. 1988;335(6187):268-71; and others) have shown, by comparing genomes, that the MHC has persisted and functioned in genome sequences for millions of years.

Comparing genome sequences means capturing the changes (mutations) that occur in genome sequences during evolution. Specifically, these changes are substitution, deletion, insertion, and inversion. FIGS. 1 to 4 are explanatory diagrams illustrating them. FIG. 1 shows a substitution, in which base A in the genome sequence is replaced by C. FIG. 2 shows an insertion, in which a new base C is inserted between bases A and T. FIG. 3 shows a deletion, in which base T is removed. FIG. 4 shows an inversion, in which ATT in the genome sequence becomes TTA, that is, the order of the bases is reversed.
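The four kinds of change can be pictured as simple string operations. The following is a minimal sketch, not part of the patent: the function names and the example sequences are assumptions chosen for illustration, and the inversion follows the figure description (the block is reversed in order, with no complementation).

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Substitution: replace the single base at index pos."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq: str, pos: int, fragment: str) -> str:
    """Insertion: place fragment immediately before index pos."""
    return seq[:pos] + fragment + seq[pos:]


def delete(seq: str, start: int, length: int = 1) -> str:
    """Deletion: remove length bases beginning at index start."""
    return seq[:start] + seq[start + length:]


def invert(seq: str, start: int, length: int) -> str:
    """Inversion: reverse the order of a block of bases."""
    block = seq[start:start + length]
    return seq[:start] + block[::-1] + seq[start + length:]
```

For example, `invert("GATTC", 1, 3)` turns the block ATT into TTA, and `insert("GATC", 2, "C")` places a C between A and T. Because the last three functions take a position and a length or fragment, they can act on a whole block at once, whereas `substitute` changes one base.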

Of these four types of change, substitution occurs one base at a time, whereas deletion, insertion, and inversion can affect an entire block of several hundred to several tens of thousands of bases at once. As such changes occur in the course of evolution, they accumulate in the genome sequence and alter it as a whole, eventually giving rise to different species.

One important fact revealed by comparative genome analysis is that genomic regions important to an organism (genes, for example) have often not undergone sequence changes. An organism that suffers a change in such a region becomes extinct in most cases, so the organisms that survive to the present are those in which the important regions remained unchanged. Remaining unchanged in these regions is therefore considered necessary for survival to the present day. When the DNA sequences of different species are compared, the DNA sequences themselves may differ between species, yet at the amino acid level they are often unchanged.

Regions such as these important genomic regions that have not undergone sequence changes are called conserved regions. Researchers exploit this property: they compare the genome sequences of different species to find conserved regions and use them as clues for inferring biological meaning.

Another important fact is that genome sequences retain a record of evolutionary history. In general, closely related species (such as human and chimpanzee) share more similar portions of their genome sequences than distantly related species (such as human and yeast). This is because little time has passed since closely related species diverged, whereas a long time has passed since distantly related species diverged. Evolutionary history can also be traced by following the changes in state of a gene containing a particular DNA sequence.

FIG. 5 is an explanatory diagram showing evolutionary history as reflected in such changes in gene state. In FIG. 5, a gene a containing a DNA sequence is duplicated in the common ancestor of species 1 and species 2, producing tandem genes a1 and a2 arranged in a row, after which the ancestor diverges into separate species. Gene a1 of species 1 and gene a1 of species 2 (or gene a2 of species 1 and gene a2 of species 2) share the ancestral tandem gene a1 (or a2); such genes are called orthologs. In contrast, genes a1 and a2 within species 1 or within species 2 arose by duplication of gene a; such genes are called paralogs.

Besides the cases shown in FIG. 5, there is another type of relationship called a xenolog (foreign gene). A xenolog is a gene that evolves without sharing an evolutionary origin with any of the other genes; it is said to be introduced from an unrelated species through symbiosis or a virus, that is, by horizontal transfer.

Furthermore, apart from genes, genome sequences contain special sequences called SINEs (short interspersed repetitive elements) and LINEs (long interspersed repetitive elements). These sequences copy themselves within the genome and insert the copy at another position, and once inserted they are not lost, so they too are used as clues for tracing evolutionary history. According to past reports (Verneau O, Catzeflis F, Furano AV. Determination of the evolutionary relationships in Rattus sensu lato (Rodentia : Muridae) using L1 (LINE-1) amplification events. J Mol Evol. 1997;45(4):424-36; Furano AV, Hayward BE, Chevret P, Catzeflis F, Usdin K. Amplification of the ancient murine Lx family of long interspersed repeated DNA occurred during the murine radiation. J Mol Evol. 1994;38(1):18-27; Murata S, Takasaki N, Saitoh M, Okada N. Determination of the phylogenetic relationships among Pacific salmonids by using short interspersed elements (SINEs) as temporal landmarks of evolution. Proc Natl Acad Sci U S A. 1993;90(15):6995-9; and others), these sequences can be used as markers for studying evolution within 50 million years of a speciation event, and researchers use this marker information to infer events in evolutionary history.

Three methods are mainly used in practice to examine conserved regions and evolutionary history in genome sequences. The first is dot matrix analysis, which is used to find conserved regions that are present in both of two genome sequences without having changed. FIG. 6 is an explanatory diagram showing dot matrix analysis of the conserved regions in sequence 1 (ATGGCA) and sequence 2 (CATTGGCT). A matrix of 6 rows by 8 columns, matching the lengths of the two sequences, is created, with sequence 1 along the vertical axis and sequence 2 along the horizontal axis. Each element on the vertical axis is compared with each element on the horizontal axis, and wherever the two are the same base (or residue), the dot at the corresponding coordinates is marked.

In FIG. 6, marked dots are displayed in a dark (highlighted) color. Where a conserved region exists between sequence 1 and sequence 2, the marked dots line up along a diagonal, so the conserved region can be recognized visually. As shown by the area enclosed by the dotted line in FIG. 6, the highlighted dots line up diagonally, showing that ATGGC in sequence 1 and ATTGGC in sequence 2 are similar and form a conserved region. Conserved regions between the reverse complement of sequence 1 and sequence 2 can also be revealed by converting the bases of sequence 1 to those of the complementary strand, that is, A to T, T to A, G to C, and C to G. In that case the marked dots line up along the opposite diagonal, and the conserved region can again be recognized visually.
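The dot matrix described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the `min_len` parameter are assumptions, and reading diagonals programmatically stands in for the visual inspection the text describes. Sequence 1 runs along the rows and sequence 2 along the columns.

```python
def dot_matrix(seq1: str, seq2: str) -> list[list[bool]]:
    """Mark cell (row, col) when seq1[row] and seq2[col] are the same base."""
    return [[a == b for b in seq2] for a in seq1]


def diagonal_runs(matrix: list[list[bool]], min_len: int = 2) -> list[tuple[int, int, int]]:
    """Return (row, col, length) for each maximal diagonal run of marked cells."""
    rows, cols = len(matrix), len(matrix[0])
    runs = []
    for r in range(rows):
        for c in range(cols):
            if not matrix[r][c]:
                continue
            # A run starts only where the preceding diagonal cell is unmarked.
            if r > 0 and c > 0 and matrix[r - 1][c - 1]:
                continue
            length = 0
            while (r + length < rows and c + length < cols
                   and matrix[r + length][c + length]):
                length += 1
            if length >= min_len:
                runs.append((r, c, length))
    return runs


def reverse_complement(seq: str) -> str:
    """Complement each base (A<->T, G<->C) and reverse the order."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))
```

For ATGGCA against CATTGGCT, the runs found are AT (row 0, column 1), TGGC (row 1, column 3), and CA (row 4, column 0). The first two lie on neighboring diagonals, which is how the similar stretches ATGGC and ATTGGC show up in the matrix. Passing `reverse_complement(seq1)` instead of `seq1` is one way to look for matches on the opposite strand; in that variant the runs appear on ordinary diagonals of the new matrix rather than on the opposite diagonal of the original one.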

The second method is multiple alignment, which arranges a plurality of sequences so that as many identical elements as possible fall in the same column. FIG. 7 is an explanatory diagram showing multiple alignment. It shows the result of aligning 15 amino acid sequences as the genome sequences; gap characters (-) are inserted into the sequences so that identical (or similar) amino acids line up in each column. A multiple alignment can be regarded as a representation of the evolutionary history among the sequences it contains.

If the number of mismatches, that is, positions at which the amino acids differ, is small and a very good multiple alignment is obtained, the amino acid sequences are presumed to have diverged from a common ancestor relatively recently. Conversely, groups with many mismatches, for which no good alignment is obtained, have a more complex and distant evolutionary relationship. If a multiple alignment can be obtained for a set of genome sequences, some of which are similar with few mismatches and others dissimilar with many mismatches, the evolutionary relationships among those sequences can be inferred.

The third method is phylogenetic tree analysis. It analyzes the lineage of a family of mutually related base sequences (or amino acid sequences) and determines the path along which the family was derived during evolution. FIG. 8 is an explanatory diagram showing phylogenetic tree analysis of the genome sequences in a family obtained from eight species. The relationships among the genome sequences are represented as a tree with each sequence at the tip of a branch, and the branching pattern inside the tree reflects how closely the different sequences are related. Branch length corresponds to the degree of relatedness: the shorter the branch, the closer the relationship.

By examining relatedness and its degree, phylogenetic tree analysis makes it possible to study not only the changes that have arisen in the evolution of individual species but also the evolution of families of genome sequences. Sequences occupying adjacent branches of the tree can be identified as the most closely related. When a gene family is found in an organism or group of organisms, examining the phylogenetic relationships among its genes helps predict which genes have the same function, and such predictions can then be confirmed by genetic experiments. Phylogenetic tree analysis is also used to follow the changes occurring in rapidly changing organisms such as viruses. Analysis of the patterns of change within a population reveals information important for applications such as epidemiology, for example whether a particular gene is under natural selection. In addition, a conventional biochip has been proposed on which probes are spotted that are designed, in correspondence with the nodes of a phylogenetic tree, to hybridize specifically to partial sequences common to the base sequences of a plurality of different targets (see, for example, Patent Document 1).

Patent Document 1: JP 2002-330768 A

In the prior art, comparing genome sequences and reading biological meaning from the comparison requires finding the regions conserved among a plurality of genome sequences, as described above, and then determining which species share them, that is, what evolutionary path they have followed.

However, even making full use of the three conventional methods described above, it is difficult, or at least very laborious, to gain a comprehensive understanding of both the conserved regions and their evolutionary relationships. Dot matrix analysis reveals the regions conserved between two genome sequences but not the evolutionary stage from which they have been conserved. Multiple alignment cannot detect a region that is conserved in inverted form. Phylogenetic tree analysis reveals the course of evolution but not which specific genome sequences are conserved among related sequences, nor at which evolutionary level an inversion, insertion, or deletion occurred.

For example, consider using the conventional methods to determine which conserved regions are shared by a family of evolutionarily close species among the genome sequences being compared. The researcher first performs phylogenetic tree analysis to identify an evolutionarily close family and then performs either multiple alignment or dot matrix analysis. Multiple alignment, however, takes a great deal of time in practice when long sequences (several thousand bases or more) are compared. Multiple alignment also assumes that the input sequences are reasonably similar, so it does not work well when the sequences contain many introns or lie outside genes. Furthermore, as noted above, it cannot detect an inversion in the genome sequence. The genome sequences that can be compared are therefore very limited.

Even when the genome sequences are suitable for multiple alignment, the conserved regions that are present in the family's sequences and absent from sequences outside the family must be confirmed by eye. Dot matrix analysis, by its nature, can compare only two genome sequences at a time. To find a conserved region common to the species of a family, the analysis must therefore be repeated for pairs of sequences within the family to find the conserved region, and then again to confirm that the region is not conserved in sequences outside the family. As the number of families and the total number of genome sequences grow, the number of comparisons becomes enormous and unmanageable.

The present invention has been made in view of these problems of the prior art. Its object is to provide a conserved region detection system considering the evolutionary process that detects conserved regions in the genome sequences of the species under genome analysis and clearly displays the relationships among the species and among the conserved regions.

To solve the above problems, the present invention provides a conserved region detection system considering the evolutionary process, which finds, in genome sequences selected for genome analysis from among a plurality of DNA sequences, conserved regions that have not undergone sequence changes and are evolutionarily conserved, the system comprising:
sequence recognition means for referring to a phylogenetic tree obtained from the genome sequences and recognizing the genome sequences belonging to an intermediate node of the phylogenetic tree; and
conservation detection means for detecting conserved regions in the genome sequences, starting from the positions of an identical character string present in the genome sequences belonging to the intermediate node.

In this invention, the sequence recognition means recognizes the genome sequences belonging to an intermediate node of the phylogenetic tree, and the conservation detection means detects conserved regions starting from the positions of an identical character string present in those sequences. Conserved regions that have not undergone sequence changes can therefore be detected accurately in the genome sequences of the species under analysis.
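One way to picture the sequence recognition step is as a walk over the tree that collects, for each intermediate (internal) node, the sequences at the leaves beneath it. The sketch below assumes a hypothetical representation that does not come from the patent: a leaf is a sequence name (a string) and an internal node is a tuple of children.

```python
def leaves(node) -> list[str]:
    """Return the names of all leaf sequences under node."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def intermediate_nodes(node) -> list[list[str]]:
    """List the leaf group of every internal node, children before parents."""
    if isinstance(node, str):
        return []
    groups = []
    for child in node:
        groups.extend(intermediate_nodes(child))
    groups.append(leaves(node))
    return groups
```

For the tree `(("s1", "s2"), ("s3", ("s4", "s5")))`, the groups come out as `["s1", "s2"]`, `["s4", "s5"]`, `["s3", "s4", "s5"]`, and finally all five sequences at the root. Visiting children before parents means the most closely related groups are handled first.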

In the above conserved region detection system considering the evolutionary process, the conservation detection means may start from the positions of an identical character string present in two genome sequences belonging to an intermediate node, and detect as a conserved region the region extending until the number of mismatched characters reaches a predetermined number.

In this invention, the conservation detection means starts from the positions of an identical character string present in two genome sequences belonging to an intermediate node and detects as a conserved region the region extending until the number of mismatched characters reaches a predetermined number. A region that is substantially identical overall, and can thus be regarded as conserved, is therefore appropriately detected as a conserved region.
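The seed-and-extend idea described here can be sketched as follows. The details are assumptions made for illustration and are not stated in this part of the text: seeds are exact shared substrings of a hypothetical length `k`, extension runs only to the right, and trailing mismatches are trimmed from the reported region.

```python
def shared_seeds(seq1: str, seq2: str, k: int) -> list[tuple[int, int]]:
    """Return (pos1, pos2) pairs where the same k-character string occurs."""
    index: dict[str, list[int]] = {}
    for j in range(len(seq2) - k + 1):
        index.setdefault(seq2[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(seq1) - k + 1)
            for j in index.get(seq1[i:i + k], [])]


def extend_from_seed(seq1: str, seq2: str, pos1: int, pos2: int,
                     seed_len: int, max_mismatches: int) -> int:
    """Extend rightward from a seed; return the conserved region length.

    Extension stops when the mismatch count reaches max_mismatches.
    The returned length ends at the last matching position.
    """
    length = seed_len
    conserved_len = seed_len
    mismatches = 0
    while pos1 + length < len(seq1) and pos2 + length < len(seq2):
        if seq1[pos1 + length] == seq2[pos2 + length]:
            conserved_len = length + 1
        else:
            mismatches += 1
            if mismatches >= max_mismatches:
                break
        length += 1
    return conserved_len
```

With `max_mismatches=2`, extending ACGTACGTTT against ACGTTCGTAA from the seed ACGT tolerates the single mismatch at the fifth position and reports a region of length 8. With `max_mismatches=1`, the region stops at the seed itself.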

In the above conserved region detection system considering the evolutionary process, the conservation detection means may, based on a conserved region detected in the genome sequences starting from the positions of an identical character string present in a plurality of genome sequences belonging to an intermediate node, repeatedly detect that same conserved region while changing the intermediate node, thereby detecting the conserved regions in the genome sequences belonging to all intermediate nodes.

In this invention, the conservation detection means repeatedly detects the same conserved region while changing the intermediate node, so the conserved regions in the genome sequences belonging to every intermediate node of the phylogenetic tree can be detected.
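The repetition over intermediate nodes can be illustrated with a deliberately simplified stand-in: for each node, collect the substrings of a fixed length `k` that occur in every sequence belonging to that node. This is an assumption-laden sketch, not the patent's procedure; it uses exact matching only (no mismatch tolerance), and the names `node_members`, `sequences`, and `k` are hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conserved_per_node(node_members: dict[str, list[str]],
                       sequences: dict[str, str],
                       k: int) -> dict[str, set[str]]:
    """For each intermediate node, find k-mers shared by all member sequences."""
    result = {}
    for node, members in node_members.items():
        shared = None
        for name in members:
            found = kmers(sequences[name], k)
            # Intersect with what earlier members of this node had in common.
            shared = found if shared is None else shared & found
        result[node] = shared if shared is not None else set()
    return result
```

With sequences AACGTT, GACGTA, and TTCGTC, a node containing only the first two shares ACG and CGT, while the root containing all three shares only CGT. Regions conserved at a node nearer the root are thus a subset of those conserved at the nodes beneath it, which is what lets each conserved region be associated with the point in the tree from which it has been preserved.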

The above conserved region detection system considering the evolutionary process may further comprise analysis result display means that draws each conserved region detected by the conservation detection means with a line of a different style, draws each branch forming an intermediate node of the phylogenetic tree with a line of the style corresponding to the conserved regions in the genome sequences belonging to that node, and displays the conserved regions and the phylogenetic tree simultaneously.

In this invention, the analysis result display means draws each conserved region with a line of a different style, draws each branch forming an intermediate node of the phylogenetic tree with a line of the style corresponding to the conserved regions in the genome sequences belonging to that node, and displays the conserved regions and the phylogenetic tree simultaneously. The researcher can therefore clearly distinguish the conserved regions from one another and, by referring to the correspondence between each conserved region and the intermediate nodes of the tree, can confirm which regions are evolutionarily conserved and infer the evolutionary process.

In the above conserved region detection system considering the evolutionary process, the analysis result display means may display each conserved region together with information on known genome sequences.

In this invention, the evolutionary process can be inferred by referring to the information on known genome sequences displayed with each conserved region.

In the above conserved region detection system considering the evolutionary process, the analysis result display means may display each conserved region together with the genome sequence that contains it, and display identical conserved regions in different genome sequences in association with one another.

In this invention, evolutionarily conserved regions can be confirmed, and the evolutionary process inferred, by referring to the identical conserved regions displayed in association with one another.

The above conserved region detection system considering the evolutionary process may further comprise sequence search means for searching, based on an arbitrary sequence, the genome sequences belonging to the intermediate nodes of the phylogenetic tree, and specific display means for displaying, in a specific display manner and with reference to the genome sequence information obtained by the sequence search means, information on the genome sequences belonging to the intermediate nodes of the phylogenetic tree.

In this invention, by referring to the genome sequence information displayed in the specific manner, one can confirm how the arbitrary sequence is evolutionarily conserved and infer the evolutionary process.

In the above conserved region detection system considering the evolutionary process, the specific display means may display the genome sequences found by the sequence search means with the portions matching the arbitrary sequence associated with one another.

In this invention, the state of conservation can be confirmed, and the evolutionary process inferred, by referring to the portions of the arbitrary sequence displayed in association with one another.

As described above, according to the present invention, conserved regions can be detected from the genome sequences of the species subject to genome analysis, and the relationships among the species and among the conserved regions can be displayed clearly.

以下、図面を参照して本発明の実施の形態を説明する。 Embodiments of the present invention will be described below with reference to the drawings.

FIG. 14 is an explanatory diagram showing the overall configuration of the conserved region detection system considering the evolutionary process according to the present invention. The conserved region detection system 100 comprises: target sequences 1401, which are the data of the genome sequences to be compared in the genome analysis; configuration information 1413, which is the information for constructing a phylogenetic tree from the target sequences 1401; a display device 1402 for displaying the results of the genome analysis as images or the like; a keyboard 1403 and a mouse 1404, which are input means for entering and selecting information such as numerical values and text in the conserved region detection system 100; a sequence DB 1405 storing known genome sequences, and the information attached to them, that are used to annotate the analysis result data as reference information; a central processing unit 1406 (hereinafter, CPU 1406) that detects conserved regions, constructs the phylogenetic tree data, displays the analysis results, and performs the other processes by executing programs stored in a program memory 1407 described later or in a storage device (not shown); the program memory 1407, which stores the programs required for the processes performed by the CPU 1406; and a data memory 1411, which temporarily stores data such as calculation results needed during processing by the CPU 1406.

As shown in FIG. 14, the program memory 1407 holds a conserved region calculation processing unit 1408 for detecting the regions conserved among the input target sequences 1401, a phylogenetic tree calculation processing unit 1409 for constructing a phylogenetic tree from the target sequences 1401, and an analysis result display processing program 1410 for displaying the results of these analyses and calculations. These programs may be provided on a recording medium such as a CD-ROM, DVD-ROM, MO, or floppy (registered trademark) disk and read from it by the CPU 1406, or may be downloaded from a server via a public network such as the Internet.

The sequence DB 1405 may be stored in a storage device connected to the CPU 1406, or it may be managed by a server computer installed at a remote location, in which case the gene data contained in the sequence DB 1405 is acquired from the database in that server computer via a public network such as the Internet. The data memory 1411 contains input data 1412, which is used as input when the programs are executed.

FIG. 15 is an explanatory diagram showing an example of the target sequences 1401. Here each genome sequence in the target sequences 1401 is shown in FASTA format: a name or other identifier of the genome sequence follows ">", and the genome sequence itself starts on the next line. Genome sequences may also be represented in GenBank format or EMBL format.

FIG. 16 is an explanatory diagram showing an example of the configuration information 1413 for constructing a phylogenetic tree from the target sequences 1401. In the configuration information 1413, the leaves and branch lengths of the phylogenetic tree are associated with the names of the genome sequences in the target sequences 1401, and one pair of parentheses together with a number describes one intermediate node (the number is the length of the branch from that intermediate node to the intermediate node above it). When an intermediate node has further intermediate nodes below it (closer to the leaves of the tree), they are expressed as a nested structure. In BNF notation, the format is as follows.

node ::= (node,node):length of the branch from this intermediate node to its parent intermediate node | sequence name:length of the branch from this leaf to its parent intermediate node

In the configuration information 1413, the two names or intermediate nodes enclosed in one pair of parentheses indicate closely related genome sequences, and the nesting expresses the relative relationships among the intermediate nodes up to the root of the phylogenetic tree. For example, "(species 1:15,species 2:10):20" denotes the intermediate node at portion 902 of the phylogenetic tree shown in FIG. 9, described later: the branch from species 1 (a leaf) to the branch point of species 1 and species 2 has length 15, the branch from species 2 to that branch point has length 10, and the branch from that branch point to the intermediate node above it (the node corresponding to 901) has length 20. The relationships in the phylogenetic tree may also be expressed in PHYLIP format, CLUSTAL format, or distance matrix format.
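The BNF above can be read by a short recursive-descent parser. The following is only a sketch under stated assumptions: the `Node` class, the `parse_tree` function, and the sample names `sp1` and `sp2` are hypothetical and do not appear in the specification, and branch lengths are assumed to be plain decimal numbers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str]   # sequence name for a leaf, None for an intermediate node
    length: float         # branch length up to the parent intermediate node
    children: List["Node"] = field(default_factory=list)


def parse_tree(text: str) -> Node:
    """Parse 'node ::= (node,node):length | name:length' (hypothetical helper)."""
    text = "".join(text.split())  # ignore whitespace
    pos = 0

    def expect(ch: str) -> None:
        nonlocal pos
        if pos >= len(text) or text[pos] != ch:
            raise ValueError(f"expected {ch!r} at offset {pos}")
        pos += 1

    def read_length() -> float:
        nonlocal pos
        expect(":")
        start = pos
        while pos < len(text) and (text[pos].isdigit() or text[pos] == "."):
            pos += 1
        return float(text[start:pos])

    def read_node() -> Node:
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            expect("(")
            left = read_node()
            expect(",")
            right = read_node()
            expect(")")
            return Node(None, read_length(), [left, right])
        start = pos
        while pos < len(text) and text[pos] != ":":
            pos += 1
        name = text[start:pos]
        return Node(name, read_length())

    return read_node()
```

For instance, `parse_tree("(sp1:15,sp2:10):20")` yields an intermediate node of branch length 20 whose two children are the leaves `sp1` (length 15) and `sp2` (length 10), mirroring the example for portion 902.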

FIG. 17 is a configuration diagram showing the data structure used to create index information for every DNA (or amino acid) sequence in the target sequences 1401. The array KtupleArrayD in this index information has p^k elements, where p is the number of distinct symbols that make up a sequence (4 for a DNA sequence, 20 for an amino acid sequence) and k is the length of a tuple (character string). Each element of KtupleArrayD is assigned one tuple. For example, when the target sequence 1401 is a DNA sequence and k is 2, KtupleArrayD has 16 elements, to which the 16 tuples AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG, and CC are assigned.

Each element of KtupleArrayD holds the position of the last (rearmost) occurrence in the target sequence 1401 of the tuple assigned to that element. If the tuple does not occur in the sequence, the element holds 0.
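The assignment of tuples to the p^k elements of KtupleArrayD can be sketched as a base-p number computed from the tuple. This is an illustration only: the function name `tuple_index` is hypothetical, and the symbol order A, T, G, C is inferred from the order in which the 16 tuples are listed above.

```python
def tuple_index(tup: str, alphabet: str = "ATGC") -> int:
    """Return the 1-based element number of KtupleArrayD assigned to `tup`.

    The tuple is read as a base-p number (p = len(alphabet)); 1 is added so
    that the first tuple (AA for k = 2) maps to element 1.
    """
    p = len(alphabet)
    value = 0
    for ch in tup:
        value = value * p + alphabet.index(ch)
    return value + 1
```

With this ordering, `tuple_index("AA")` is 1, `tuple_index("TC")` is 8, and `tuple_index("CC")` is 16.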

The array IdxArrayD has the same length as the target sequence 1401, with one element for each position of the target sequence 1401. If the tuple starting at a given position also occurs earlier in the sequence, the corresponding element of IdxArrayD holds the position of the nearest earlier occurrence. If the tuple does not occur earlier, the element holds 0.

FIG. 28 is an explanatory diagram showing the arrays KtupleArrayD and IdxArrayD created for the sequence GTCTCACGACACTC as the target sequence 1401. In this sequence the tuple TC occurs at positions 2, 4, and 13, so the element of KtupleArrayD corresponding to TC (KtupleArrayD[8]) holds 13, the position where TC last occurs in the target sequence 1401. IdxArrayD[13] holds 4, the position of the occurrence of TC immediately before position 13, and IdxArrayD[4] holds 2, the position of the occurrence immediately before position 4. As this shows, the two arrays KtupleArrayD and IdxArrayD make it possible to find quickly where a particular tuple occurs in a sequence. The arrays KtupleArrayR and IdxArrayR are created in the same way as KtupleArrayD and IdxArrayD, but for the reverse complement of the DNA sequence (or amino acid sequence) serving as the target sequence 1401.
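The example of FIG. 28 can be reproduced with a small sketch. The helper names `build_index` and `occurrences` are hypothetical, positions are 1-based with element 0 left unused, and the loop is assumed to visit every tuple start position up to the last one, which is what is needed to obtain the values quoted for the tuple TC.

```python
def build_index(seq: str, k: int, alphabet: str = "ATGC"):
    """Build (KtupleArrayD, IdxArrayD) for `seq`, both 1-based (slot 0 unused)."""
    p = len(alphabet)
    ktuple_array = [0] * (p ** k + 1)
    idx_array = [0] * (len(seq) + 1)
    for j in range(1, len(seq) - k + 2):       # every tuple start position
        value = 0
        for ch in seq[j - 1:j - 1 + k]:
            value = value * p + alphabet.index(ch)
        i = value + 1                          # element assigned to this tuple
        idx_array[j] = ktuple_array[i]         # previous occurrence, or 0
        ktuple_array[i] = j                    # latest occurrence so far
    return ktuple_array, idx_array


def occurrences(i: int, ktuple_array, idx_array):
    """List all positions of the tuple assigned to element i, last to first."""
    found = []
    pos = ktuple_array[i]
    while pos != 0:
        found.append(pos)
        pos = idx_array[pos]
    return found
```

Starting from the KtupleArrayD entry and following IdxArrayD until 0 is reached visits each occurrence exactly once, so the lookup cost is proportional to the number of occurrences rather than to the sequence length.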

FIG. 18 is an explanatory diagram showing the data structure for recording the conserved regions of a target sequence 1401. The structure array ConservedReg is created for each target sequence 1401 in which a conserved region exists, and each entry consists of the position 1800 of the conserved region, the length 1801 of the conserved region, and the orientation 1802 of the conserved region (forward or reverse).

FIG. 19 is an explanatory diagram showing the data structure ListOfConservedReg for recording the relationships among conserved regions across the target sequences 1401. This data structure is created for each conserved region at each intermediate node in the configuration information 1413. Each entry consists of a sequence name 1900 identifying a target sequence 1401 and an index 1901 identifying which entry of the structure array ConservedReg created for that sequence name 1900 corresponds to the conserved region.

FIG. 20 is an explanatory diagram showing the data structure of the array AllOfConservedReg, which collects the ListOfConservedReg arrays described with FIG. 19, each representing a set of related conserved regions. Each element of AllOfConservedReg is a pointer to a ListOfConservedReg, and one AllOfConservedReg array is created for each intermediate node in the configuration information 1413. Each distinct conserved region found in the genome sequences belonging to the corresponding intermediate node is thus represented by one element linked to its ListOfConservedReg. The conserved region detection system considering the evolutionary process in this embodiment can also be implemented on a personal computer, a commonly used general-purpose information processing apparatus.
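The three data structures of FIGS. 18 to 20 might be modeled as follows. This is a sketch, not the layout used in the embodiment: the class `RegRef`, the type alias `ConservedRegTable`, and the function `resolve` are hypothetical names, and the index 1901 is taken to be 0-based here purely because Python lists are.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConservedReg:
    position: int   # start position of the conserved region (item 1800)
    length: int     # length of the conserved region (item 1801)
    forward: bool   # orientation (item 1802): True = forward, False = reverse


@dataclass
class RegRef:
    seq_name: str   # sequence name (item 1900)
    index: int      # which ConservedReg entry of that sequence (item 1901)


# Hypothetical containers mirroring FIGS. 18 to 20.
ConservedRegTable = Dict[str, List[ConservedReg]]  # one array per target sequence
ListOfConservedReg = List[RegRef]                  # one region across sequences
AllOfConservedReg = List[ListOfConservedReg]       # all regions of one intermediate node


def resolve(entry: ListOfConservedReg, table: ConservedRegTable) -> Dict[str, ConservedReg]:
    """Return, per sequence name, the concrete region that `entry` refers to."""
    return {ref.seq_name: table[ref.seq_name][ref.index] for ref in entry}
```

Keeping only a name and an index in each `RegRef` means the same `ConservedReg` entry can be shared by the lists of several intermediate nodes without being copied.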

Next, the operation of the conserved region detection system considering the evolutionary process of this embodiment, configured as described above, is described in detail with reference to the flowcharts in FIGS. 21 to 27. In these flowcharts, the CPU 1406 obtains the ConservedReg for each leaf of the phylogenetic tree 1413, the array ListOfConservedReg holding the associations among conserved regions, and the array AllOfConservedReg, which are needed to display the images shown in FIGS. 9, 10, 11, 12, and 13. The operation described below is the algorithm for obtaining these three arrays for each intermediate node of the phylogenetic tree 1413 from the target sequences 1401 and the phylogenetic tree 1413.

The overall processing flow of the conserved region detection system 100 considering the evolutionary process in this embodiment is described with reference to the flowchart in FIG. 21. First, the CPU 1406 of the conserved region detection system 100 reads the data of the target sequences 1401 and the phylogenetic tree configuration information 1413, which a researcher supplies on an external recording medium such as a floppy (registered trademark) disk or CD-ROM, and stores them in the data memory 1411 as input data 1412 (step 2100). Instead of reading the phylogenetic tree itself, it is also possible to input only the parameters needed to construct the tree and have the CPU 1406 create the configuration information 1413 from those parameters.

Next, the CPU 1406 reads the parameters k, w, and m for detecting conserved regions and stores them in the input data 1412 of the data memory 1411 (step 2101). Here k is the length of a tuple, w is the window length, and m is the number of mismatches allowed within a window, that is, the maximum number of positions at which the two genome sequences being compared may differ. When a conserved region is detected, every run of w consecutive characters (bases or residues) in the candidate region may contain at most m mismatches between the two sequences. For example, FIG. 29 shows the location of the conserved region between two genome sequences, sequence A and sequence B, when the window size w is 5 and the number of allowed mismatches m is 1. In this case every window of five consecutive aligned character pairs contains at most one mismatch, so region 2601 is a conserved region between the two sequences.
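The window criterion defined by w and m can be checked with a few lines of code. The function name `windows_within_limit` is hypothetical, the two inputs are assumed to be already aligned segments of equal length, and treating a segment shorter than w as not conserved is an assumption of this sketch rather than something the specification states.

```python
def windows_within_limit(a: str, b: str, w: int, m: int) -> bool:
    """Return True if every window of w aligned characters has at most m mismatches."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    if len(a) < w:
        return False  # assumption: too short to contain a full window
    mismatch = [x != y for x, y in zip(a, b)]
    return all(sum(mismatch[s:s + w]) <= m for s in range(len(a) - w + 1))
```

With w = 5 and m = 1, two segments that differ at both ends of a ten-character stretch still qualify, because no single window of five characters contains both differences, whereas two differences two positions apart do not.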

Next, if the target sequences 1401 are DNA sequences and are to be compared as amino acid sequences, the CPU 1406 converts all the DNA sequences into amino acid sequences (step 2102).
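A conversion of this kind could look like the sketch below. The specification does not say which genetic code or reading frame is used, so the standard genetic code, a caller-supplied frame, and the names `CODON_TABLE` and `translate` are all assumptions made for illustration; codons containing characters other than T, C, A, and G are rendered as X.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA to one-letter amino acids; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")   # 'X' for codons with unknown bases
        for i in range(frame, len(dna) - 2, 3)
    )
```

For example, `translate("ATGGCCTGA")` gives `"MA*"`. When sequences are compared as amino acids, the alphabet size p used for the index arrays becomes 20 instead of 4.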

Next, the CPU 1406 creates, for each target sequence 1401, the index information shown in FIGS. 17 to 20 as described above, namely KtupleArrayD[ ], IdxArrayD[ ], KtupleArrayR[ ], and IdxArrayR[ ] (step 2103). Details of this process are described later.

Next, the CPU 1406 determines whether all conserved regions have been detected from the genome sequences belonging to each of the intermediate nodes of the phylogenetic tree defined by the configuration information 1413 (step 2104).

If not all conserved regions have been detected (NO in step 2104), the CPU 1406 selects, from the genome sequences belonging to the intermediate nodes of the phylogenetic tree defined by the configuration information 1413, those for which conserved regions have not yet been detected (step 2105).

Next, the CPU 1406 detects the regions conserved among the selected genome sequences (step 2106). This detection process is described in detail later. When it finishes, the process returns to step 2104.

If all conserved regions have been detected (YES in step 2104), the CPU 1406 searches the sequence DB 1405 using the conserved regions and, if it finds related information such as genome sequences of species having the same conserved region, attaches that information to the results of the genome analysis. It then displays the results of the genome analysis on the display device 1402 (step 2107) and ends the overall process.
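The loop over intermediate nodes in steps 2104 to 2106 can be outlined as follows. This is a sketch under assumptions: the tree is represented as nested two-element tuples with sequence names at the leaves, the nodes are visited from the leaves toward the root (the specification does not fix an order), and `detect` is a hypothetical callback standing in for the detection process of step 2106.

```python
def leaves(tree):
    """Return the sequence names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def intermediate_nodes(tree):
    """Yield every intermediate node, children before parents."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield from intermediate_nodes(child)
    yield tree


def analyse(tree, detect):
    """Run `detect` on the sequences belonging to each intermediate node."""
    results = []
    for node in intermediate_nodes(tree):
        members = leaves(node)                 # sequences belonging to this node
        results.append((members, detect(members)))
    return results
```

For the tree `(("sp1", "sp2"), "sp3")`, `analyse` first processes the node containing `sp1` and `sp2` and then the root containing all three sequences.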

Next, the process of step 2103 above, which creates the index information KtupleArrayD[ ], IdxArrayD[ ], KtupleArrayR[ ], and IdxArrayR[ ] for each target sequence 1401, is described in detail with reference to the flowchart in FIG. 22. First, the CPU 1406 initializes to 0 all elements of KtupleArrayD[ ], IdxArrayD[ ], KtupleArrayR[ ], and IdxArrayR[ ] corresponding to each target sequence 1401 (step 2200).

Next, the CPU 1406 sets the variable j = 1 (step 2201).

Next, the CPU 1406 determines whether the variable j has reached the value indicating the position of the last tuple at the end of the target sequence 1401, that is, whether j = sequence length − k (step 2202). If j = sequence length − k (YES in step 2202), the process proceeds to step 2207.

If j is not equal to sequence length − k (NO in step 2202), the CPU 1406 sets K to the tuple of k characters starting at position j of the target sequence 1401, sets i to the element index of KtupleArrayD[ ] assigned to K (that is, the element number giving the position of that element within KtupleArrayD[ ]), and proceeds to fill in the elements of KtupleArrayD[ ] (step 2203). For example, in the array KtupleArrayD[ ] shown in FIG. 28, when the tuple K is TC, i is 8, the index of the eighth element.

Next, the CPU 1406 assigns the value of KtupleArrayD[i] to IdxArrayD[j] and then assigns j to KtupleArrayD[i] (steps 2204 and 2205). Because KtupleArrayD[i] must always hold the position of the last occurrence of the tuple in the target sequence 1401, these two steps (2204 and 2205) are repeated as j is incremented (step 2206): each time another occurrence of the tuple K appears further along the sequence, KtupleArrayD[i] is updated and its previous value is recorded in IdxArrayD[j].

Executing steps 2202 to 2206 for every value of j creates the index information in the arrays IdxArrayD[j] and KtupleArrayD[i] shown in FIG. 28.

When steps 2202 to 2206 have been executed for every value of j and j = sequence length − k (YES in step 2202), the CPU 1406 takes the reverse complement of the target sequence 1401 as the new target sequence (step 2207).

Next, the CPU 1406 creates the index arrays IdxArrayR[j] and KtupleArrayR[i] for this reverse complement sequence (steps 2208 to 2213), by the same process as steps 2202 to 2206 applied to the target sequence 1401 above. Steps 2208 to 2213 are executed for every value of j, and when j = sequence length − k (YES in step 2209), the process ends.
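The reverse complement used as the input for KtupleArrayR and IdxArrayR can be computed as in this minimal sketch. The function name `reverse_complement` is hypothetical, and only the four unambiguous DNA bases are handled; any other character is passed through unchanged.

```python
_COMPLEMENT = str.maketrans("ATGCatgc", "TACGtacg")


def reverse_complement(dna: str) -> str:
    """Complement each base (A<->T, G<->C) and reverse the result."""
    return dna.translate(_COMPLEMENT)[::-1]
```

For the sequence of FIG. 28, `reverse_complement("GTCTCACGACACTC")` returns `"GAGTGTCGTGAGAC"`; applying the function twice returns the original sequence.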

Next, the process of step 2106 above, which detects the regions conserved among the selected genome sequences, is described in detail with reference to the flowchart that spans FIGS. 23 to 26. First, the CPU 1406 assigns the identifiers seq1, seq2, …, seqM to the target sequences 1401 selected in step 2105 as the sequences in which conserved regions are to be detected (step 2300).

Next, the CPU 1406 sets the variable i = 1 (step 2301).

Next, the CPU 1406 determines whether the variable i has passed the end of the array KtupleArrayD, that is, whether i > p^k (step 2302). If i > p^k (YES in step 2302), the process ends.

Next, the CPU 1406 assigns to the variable c1 the value of KtupleArrayD[i] of the sequence seq1 (step 2303), that is, the value of KtupleArrayD[i] created for seq1 in step 2103 above.

Next, the CPU 1406 determines whether c1 is 0 (step 2304). If c1 is 0 (YES in step 2304), the process proceeds to step 2328.

If c1 is not 0 (NO in step 2304), the CPU 1406 assigns to the variable c2 the value of KtupleArrayD[i] of the sequence seq2 (step 2305), that is, the value of KtupleArrayD[i] created for seq2 in step 2103 above.

Next, the CPU 1406 determines whether c2 is 0 (step 2306). If c2 is 0 (YES in step 2306), the process proceeds to step 2317.

If c2 is not 0 (NO in step 2306), the tuple assigned to KtupleArrayD[i] exists in both seq1 and seq2, so the CPU 1406 detects a conserved region starting from this common tuple, located at position c1 of seq1 and position c2 of seq2 (step 2307). Starting from these positions, the alignment of the two genome sequences seq1 and seq2 is extended, and each run of w consecutive aligned characters (bases or residues) is examined. The conserved region is enlarged as long as each such run contains no more than m mismatches between seq1 and seq2; as soon as a run contains more than m mismatches, that position becomes the boundary of the conserved region. If a conserved region is found in this way, it is designated C.
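One possible reading of the extension in step 2307 is sketched below. It is a simplification, not the method of the embodiment: the function `extend_seed` is a hypothetical name, the alignment is extended only to the right and without gaps, trailing mismatches are trimmed from the result, and a region shorter than w is reported as not found (length 0). Positions c1 and c2 are 1-based, as in the text.

```python
def extend_seed(seq1: str, seq2: str, c1: int, c2: int, w: int, m: int) -> int:
    """Return the length of the conserved region starting at c1/c2, or 0 if none."""
    a = seq1[c1 - 1:]
    b = seq2[c2 - 1:]
    mismatch = []                      # one flag per aligned position so far
    length = 0
    for t in range(min(len(a), len(b))):
        mismatch.append(a[t] != b[t])
        if sum(mismatch[-w:]) > m:     # last window now exceeds the limit
            break
        length = t + 1
    while length > 0 and mismatch[length - 1]:
        length -= 1                    # do not end the region on a mismatch
    return length if length >= w else 0
```

With w = 5 and m = 1, aligning `ACGTACGTTT` against `ACGTTCGTGG` from position 1 of both sequences gives a region of length 8: the single mismatch at the fifth character is tolerated, and the extension stops where a second mismatch would fall inside the same window.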

次に、ＣＰＵ１４０６は、ステップ２３０７において保存領域が存在したか否かを判定する処理を行う（ステップ２３０８）。保存領域が存在しなかった場合には（ステップ２３０８のＮＯ）、ステップ２３１６の処理を実行し、ＣＰＵ１４０６は、c2にseq2のKtupleArrayD[c2]を代入する処理を行い、ステップ２３０６以降の処理を実行する（ステップ２３１６）。 Next, the CPU 1406 performs processing for determining whether or not a storage area exists in step 2307 (step 2308). If the storage area does not exist (NO in step 2308), the process of step 2316 is executed, and the CPU 1406 performs the process of substituting KtupleArrayD [c2] of seq2 into c2, and executes the processes after step 2306 (Step 2316).

次に、ＣＰＵ１４０６は、保存領域が存在した場合には（ステップ２３０８のＹＥＳ）、残りの検出対象となる対象配列１４０１、即ちseq3,…,seqMのゲノム配列中で同一の保存領域Cを検出する処理を行う（ステップ２３０９）。変数jに対してｊ＝３と設定する処理を行う。 Next, when a storage region exists (YES in step 2308), the CPU 1406 detects the same storage region C in the remaining target sequences 1401, ie, seq3,..., SeqM genome sequences. Processing is performed (step 2309). A process of setting j = 3 for the variable j is performed.

次に、ＣＰＵ１４０６は、変数ｊが対象配列１４０１の最後のゲノム配列を示す数値、すなわちｊ＞Ｍとなっているか否かを判定する（ステップ２３１０）。 Next, the CPU 1406 determines whether or not the variable j is a numerical value indicating the last genome sequence of the target sequence 1401, that is, j> M (step 2310).

次に、ＣＰＵ１４０６は、ｊ＞Ｍとなっていない場合には（ステップ２３１０のＮＯ）、ゲノム配列seq jの配列中に存在する保存領域Cを検出する処理を行う（ステップ２３１１）。このseq jの配列中で保存領域Cを検出する処理については、後で詳しく説明する。 Next, when j> M is not satisfied (NO in step 2310), the CPU 1406 performs processing for detecting the storage region C existing in the genome sequence seq j (step 2311). Processing for detecting the conserved region C in the sequence of seq j will be described in detail later.

次に、ＣＰＵ１４０６は、ステップ２３１１において保存領域Ｃが存在したか否かを判定する処理を行う（ステップ２３１２）。保存領域Ｃが存在しなかった場合には（ステップ２３１２のＮＯ）、ステップ２３１６の処理を実行し、ＣＰＵ１４０６は、c2にseq2のKtupleArrayD[c2]を代入する処理を行い、ステップ２３０６以降の処理を実行する（ステップ２３１６）。 Next, the CPU 1406 performs processing for determining whether or not the storage area C exists in step 2311 (step 2312). When the storage area C does not exist (NO in step 2312), the process of step 2316 is executed, and the CPU 1406 performs a process of substituting KtupleArrayD [c2] of seq2 into c2, and performs the processes after step 2306. Execute (step 2316).

If the conserved region C was found (YES in step 2312), the CPU 1406 increments the variable j by one and repeats the processing from step 2310 onward (step 2313). When j > M holds in step 2310 (YES in step 2310), the CPU 1406 executes step 2314: it registers the conserved region C detected in each of the target sequences 1401, together with information such as the positions at which C appears in seq1, …, seqM, in ConservedReg[ ], ListOfConservedReg[ ], and AllOfConservedReg[ ], and then executes step 2316.
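The j = 3, …, M loop of steps 2309 to 2314 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the exact-substring matcher `exact_find` are hypothetical stand-ins for the per-sequence detection described later in the text.

```python
from typing import Callable, List, Optional, Sequence


def confirm_in_remaining(
    region: str,
    sequences: Sequence[str],
    find_region: Callable[[str, str], Optional[str]],
) -> Optional[List[str]]:
    """Check seq3..seqM for a region already found in seq1 and seq2.

    sequences[0] and sequences[1] play the roles of seq1 and seq2.
    Returns the per-sequence hits, or None as soon as one sequence lacks it.
    """
    hits: List[str] = []
    j = 3                       # j is set to 3 once seq1 and seq2 agree
    m = len(sequences)          # M in the description
    while j <= m:               # the loop ends when j > M
        hit = find_region(region, sequences[j - 1])
        if hit is None:         # region absent from seq j: drop this candidate
            return None
        hits.append(hit)
        j += 1                  # corresponds to incrementing j by one
    return hits                 # the caller would now register the region


def exact_find(region: str, seq: str) -> Optional[str]:
    """Hypothetical stand-in for the detection step: exact substring match."""
    return region if region in seq else None
```

A candidate survives only if every remaining sequence contains it, which is why the loop returns early on the first miss.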

If c2 is 0 in step 2306 (YES in step 2306), the CPU 1406 assigns KtupleArrayR[i] of seq2 to c2 (step 2317). That is, the value of KtupleArrayR[i] created for seq2 in step 2103 described above is assigned to c2.

Next, the CPU 1406 determines whether c2 is 0 (step 2318). If c2 is 0 (YES in step 2318), the CPU 1406 executes step 2315, in which it assigns IdxArrayD[c1] of seq1 to c1, and then repeats the processing from step 2304 onward.

If c2 is not 0 (NO in step 2318), the tuple assigned to KtupleArrayD[i] of seq1 is present both in seq1 and in the reverse complement of seq2. Starting from this shared tuple, located at position c1 in seq1 and at position c2 in the reverse complement of seq2, the CPU 1406 detects a conserved region (step 2319). This processing is the same as that of step 2307 described above, so its description is omitted.

Next, the CPU 1406 determines whether a conserved region was found in step 2319 (step 2320). If none was found (NO in step 2320), the CPU 1406 executes step 2327, in which it assigns IdxArrayR[c2] of seq2 to c2, and then repeats the processing from step 2318 onward.
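The repeated pattern "read KtupleArray[i], test for 0, then follow IdxArray[c]" reads like a linked list of tuple occurrences. The sketch below assumes exactly that: the first array holds the first 1-based position of each tuple, the second links each position to the next occurrence of the same tuple, and 0 terminates the chain. The layout is inferred from the flow, and keying by the tuple string rather than an integer index i is a simplification made here.

```python
from typing import Dict, Iterator, List, Tuple


def build_index(seq: str, k: int) -> Tuple[Dict[str, int], List[int]]:
    """Build a first-occurrence table and a next-occurrence chain for k-tuples.

    first[t] is the first 1-based position of tuple t (missing means 0).
    nxt[c] is the next position of the tuple found at position c, or 0.
    """
    first: Dict[str, int] = {}
    nxt: List[int] = [0] * (len(seq) + 1)   # slot 0 is unused
    last: Dict[str, int] = {}
    for pos in range(1, len(seq) - k + 2):
        t = seq[pos - 1:pos - 1 + k]
        if t in last:
            nxt[last[t]] = pos              # link previous occurrence to this one
        else:
            first[t] = pos                  # first time this tuple is seen
        last[t] = pos
    return first, nxt


def occurrences(first: Dict[str, int], nxt: List[int], t: str) -> Iterator[int]:
    """Walk the chain: c = first[t]; while c != 0: use c; c = nxt[c]."""
    c = first.get(t, 0)
    while c != 0:
        yield c
        c = nxt[c]
```

With this layout, every occurrence of a tuple is visited without rescanning the sequence, which matches the way the flowchart advances c1 and c2.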

Next, if a conserved region was found (YES in step 2320), the CPU 1406 proceeds to detect the same conserved region C in the remaining target sequences 1401, that is, in the genome sequences seq3, …, seqM (step 2321). To do so, it sets the variable j to 3.

Next, the CPU 1406 determines whether the variable j has passed the last genome sequence of the target sequences 1401, that is, whether j > M holds (step 2322).

If j > M does not hold (NO in step 2322), the CPU 1406 detects the conserved region C in the genome sequence seq j (step 2323). The process of detecting the conserved region C in seq j is described in detail later.

Next, the CPU 1406 determines whether the conserved region C was found in step 2323 (step 2324). If it was not found (NO in step 2324), the CPU 1406 executes step 2327, in which it assigns IdxArrayR[c2] of seq2 to c2, and then repeats the processing from step 2318 onward.

If the conserved region C was found (YES in step 2324), the CPU 1406 increments the variable j by one and repeats the processing from step 2322 onward (step 2325). When j > M holds in step 2322 (YES in step 2322), the CPU 1406 executes step 2326: it registers the conserved region C detected in each of the target sequences 1401, together with information such as the positions at which C appears in seq1, …, seqM, in ConservedReg[ ], ListOfConservedReg[ ], and AllOfConservedReg[ ], and then executes step 2327.

If c1 is 0 in step 2304 (YES in step 2304), the CPU 1406 assigns KtupleArrayR[i] of seq1 to c1 (step 2328). That is, the value of KtupleArrayR[i] created for seq1 in step 2103 described above is assigned to c1.

Next, the CPU 1406 determines whether c1 is 0 (step 2329). If c1 is 0 (YES in step 2329), the CPU 1406 executes step 2342, in which it increments the variable i by one, and then repeats the processing from step 2302 onward.

If c1 is not 0 (NO in step 2329), the CPU 1406 assigns KtupleArrayD[i] of seq2 to c2 (step 2330). That is, the value of KtupleArrayD[i] created for seq2 in step 2103 described above is assigned to c2.

Next, the CPU 1406 determines whether c2 is 0 (step 2331). If c2 is 0 (YES in step 2331), the processing from step 2343 onward is executed.

If c2 is not 0 (NO in step 2331), the tuple assigned to KtupleArrayR[i] of the reverse complement of seq1 is present both in the reverse complement of seq1 and in seq2. Starting from this shared tuple, located at position c1 in the reverse complement of seq1 and at position c2 in seq2, the CPU 1406 detects a conserved region (step 2332). This processing is the same as that of step 2307 described above, so its description is omitted.

Next, the CPU 1406 determines whether a conserved region was found in step 2332 (step 2333). If none was found (NO in step 2333), the CPU 1406 executes step 2341, in which it assigns IdxArrayD[c2] of seq2 to c2, and then repeats the processing from step 2331 onward.

Next, if a conserved region was found (YES in step 2333), the CPU 1406 proceeds to detect the same conserved region C in the remaining target sequences 1401, that is, in the genome sequences seq3, …, seqM. To do so, it sets the variable j to 3 (step 2334).

Next, the CPU 1406 determines whether the variable j has passed the last genome sequence of the target sequences 1401, that is, whether j > M holds (step 2335).

If j > M does not hold (NO in step 2335), the CPU 1406 detects the conserved region C in the genome sequence seq j (step 2336). The process of detecting the conserved region C in seq j is described in detail later.

Next, the CPU 1406 determines whether the conserved region C was found in step 2336 (step 2337). If it was not found (NO in step 2337), the processing from step 2341 onward is executed.

If the conserved region C was found (YES in step 2337), the CPU 1406 increments the variable j by one and repeats the processing from step 2335 onward (step 2338). When j > M holds in step 2335 (YES in step 2335), the CPU 1406 executes step 2339: it registers the conserved region C detected in each of the target sequences 1401, together with information such as the positions at which C appears in seq1, …, seqM, in ConservedReg[ ], ListOfConservedReg[ ], and AllOfConservedReg[ ], and then executes step 2341.

If c2 is 0 in step 2331 (YES in step 2331), the CPU 1406 assigns KtupleArrayR[i] of seq2 to c2 (step 2343). That is, the value of KtupleArrayR[i] created for seq2 in step 2103 described above is assigned to c2.

Next, the CPU 1406 determines whether c2 is 0 (step 2344). If c2 is 0 (YES in step 2344), the CPU 1406 executes step 2340, in which it assigns IdxArrayR[c1] of seq1 to c1, and then repeats the processing from step 2329 onward.

If c2 is not 0 (NO in step 2344), the tuple assigned to KtupleArrayR[i] of the reverse complement of seq1 is present both in the reverse complement of seq1 and in the reverse complement of seq2. Starting from this shared tuple, located at position c1 in the reverse complement of seq1 and at position c2 in the reverse complement of seq2, the CPU 1406 detects a conserved region (step 2345). This processing is the same as that of step 2307 described above, so its description is omitted.

Next, the CPU 1406 determines whether a conserved region was found in step 2345 (step 2346). If none was found (NO in step 2346), the CPU 1406 executes step 2353, in which it assigns IdxArrayR[c2] of seq2 to c2, and then repeats the processing from step 2344 onward.

Next, if a conserved region was found (YES in step 2346), the CPU 1406 proceeds to detect the same conserved region C in the remaining target sequences 1401, that is, in the genome sequences seq3, …, seqM (step 2347). To do so, it sets the variable j to 3.

Next, the CPU 1406 determines whether the variable j has passed the last genome sequence of the target sequences 1401, that is, whether j > M holds (step 2348).

If j > M does not hold (NO in step 2348), the CPU 1406 detects the conserved region C in the genome sequence seq j (step 2349). The process of detecting the conserved region C in seq j is described in detail later.

Next, the CPU 1406 determines whether the conserved region C was found in step 2349 (step 2350). If it was not found (NO in step 2350), the processing from step 2353 onward is executed.

If the conserved region C was found (YES in step 2350), the CPU 1406 increments the variable j by one and repeats the processing from step 2348 onward (step 2351). When j > M holds in step 2348 (YES in step 2348), the CPU 1406 executes step 2352: it registers the conserved region C detected in each of the target sequences 1401, together with information such as the positions at which C appears in seq1, …, seqM, in ConservedReg[ ], ListOfConservedReg[ ], and AllOfConservedReg[ ], and then executes step 2353. In this way, the conserved regions shared among the selected genome sequences are detected.
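Taken together, the flow visits four strand combinations for each tuple: seq1 direct or reverse complement, paired with seq2 direct or reverse complement. The sketch below enumerates the seed pairs in that order. It is an illustration under assumptions: it uses a plain string search instead of the index arrays, handles only the unambiguous bases A, C, G, and T, and all function names are hypothetical.

```python
from typing import Iterator, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def positions(seq: str, t: str) -> List[int]:
    """All 1-based start positions of t in seq, overlaps included."""
    out: List[int] = []
    start = seq.find(t)
    while start != -1:
        out.append(start + 1)
        start = seq.find(t, start + 1)
    return out


def seed_pairs(seq1: str, seq2: str, t: str) -> Iterator[Tuple[str, int, str, int]]:
    """Yield (strand1, c1, strand2, c2) for every shared occurrence of tuple t.

    "D" marks the direct strand and "R" the reverse complement. For each c1,
    seq2 is tried on the direct strand first and then on the reverse complement.
    """
    strands1 = (("D", seq1), ("R", reverse_complement(seq1)))
    strands2 = (("D", seq2), ("R", reverse_complement(seq2)))
    for name1, s1 in strands1:
        for c1 in positions(s1, t):
            for name2, s2 in strands2:
                for c2 in positions(s2, t):
                    yield name1, c1, name2, c2
```

Each yielded pair is a starting point from which a conserved region could be extended.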

Next, the process of detecting the conserved region C in the genome sequence seq j, performed in steps 2311, 2323, 2336, and 2349 described above, is explained in detail with reference to the flowchart shown in FIG. 27. First, the CPU 1406 sets the variable i to the index corresponding to the leading tuple, that is, the tuple located at the very front of the conserved region C (step 2400).

Next, the CPU 1406 sets c1 to the value of KtupleArrayD[i] of the genome sequence seq j (step 2401).

Next, the CPU 1406 determines whether c1 is 0 (step 2402). If c1 is 0 (YES in step 2402), the CPU 1406 executes step 2406, in which it assigns KtupleArrayR[i] of seq j to c1, and then executes the processing from step 2407 onward.

If c1 is not 0 (NO in step 2402), the leading tuple of the conserved region C is present in seq j. Starting from position c1 of the genome sequence seq j, the CPU 1406 detects a conserved region based on the data of the conserved region C detected in step 2307, 2319, 2332, or 2345 (steps 2403 and 2404). It compares characters starting from this position in seq j and from the head of the conserved region C, examining the alignment in units of w consecutive characters (bases or residues), and extends the alignment from position c1 of seq j so as to increase the number of characters that match the conserved region C. The region is extended as long as the number of mismatches, that is, characters that differ between seq j and the conserved region C, is m or fewer. The point at which a stretch with more than m mismatches appears becomes the boundary of the conserved region within seq j. When a conserved region is found in this way, it is temporarily stored in the data memory 1411. Even if the detected region is shorter than the conserved region C, it is newly set as the conserved region C for the genome sequence seq j and stored in the data memory 1411.
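One possible reading of this extension rule is sketched below: the alignment grows one window of w characters at a time, and a window is accepted only if it contains at most m mismatches. The text does not specify whether mismatches are counted per window or cumulatively, so the per-window interpretation, the function name, and the handling of a trailing partial window (it is ignored) are all assumptions.

```python
def extend_region(a: str, b: str, start_a: int, start_b: int, w: int, m: int) -> int:
    """Return the length of the conserved stretch from 1-based start positions.

    Windows of w characters are compared in turn. Extension stops at the
    first window with more than m mismatches, which marks the boundary.
    """
    ia, ib = start_a - 1, start_b - 1       # convert to 0-based offsets
    length = 0
    while ia + length + w <= len(a) and ib + length + w <= len(b):
        window_a = a[ia + length:ia + length + w]
        window_b = b[ib + length:ib + length + w]
        mismatches = sum(x != y for x, y in zip(window_a, window_b))
        if mismatches > m:                  # boundary of the conserved region
            break
        length += w                         # accept this window and continue
    return length
```

For example, with w = 4 and m = 1, `extend_region("ACGTACGTAAAA", "ACGTACCTTTTT", 1, 1, 4, 1)` accepts the first two windows and stops at the third, giving a length of 8.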

Next, the CPU 1406 assigns IdxArrayD[c1] of seq j to c1 and repeats the processing from step 2402 onward (step 2405).

If c1 is 0 in step 2402 (YES in step 2402), the CPU 1406 executes step 2406, in which it assigns KtupleArrayR[i] of seq j to c1.

Next, the CPU 1406 determines whether c1 is 0 (step 2407). If c1 is 0 (YES in step 2407), the process ends.

If c1 is not 0 (NO in step 2407), the leading tuple of the conserved region C is present in the reverse complement of seq j. Starting from position c1 of the reverse complement of seq j, the CPU 1406 detects a conserved region based on the data of the conserved region C detected in step 2307, 2319, 2332, or 2345 (steps 2408 and 2409). This processing is the same as that of steps 2403 and 2404 described above, so its description is omitted.

Next, the CPU 1406 assigns IdxArrayR[c1] of seq j to c1 and repeats the processing from step 2407 onward (step 2410).
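The per-sequence search of steps 2400 to 2410 can be sketched as a scan of the direct strand followed by a scan of the reverse complement, each seeded by the leading tuple of C. The version below is a simplification under assumptions: it uses a string search in place of the index arrays, accepts only full-length matches (the text also allows a shorter region to replace C), and its names and the `max_mismatch` parameter are hypothetical.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_region_in_seq(region: str, seq: str, k: int, max_mismatch: int) -> List[Tuple[str, int]]:
    """Return (strand, 1-based position) for each place the region matches.

    The first k characters of the region act as the leading tuple. Every
    occurrence of that tuple is a candidate start; the direct strand ("D")
    is scanned first, then the reverse complement ("R").
    """
    head = region[:k]
    hits: List[Tuple[str, int]] = []
    for strand, s in (("D", seq), ("R", reverse_complement(seq))):
        c1 = s.find(head)
        while c1 != -1:                       # no more occurrences ends the scan
            candidate = s[c1:c1 + len(region)]
            if len(candidate) == len(region):
                mismatches = sum(x != y for x, y in zip(region, candidate))
                if mismatches <= max_mismatch:
                    hits.append((strand, c1 + 1))
            c1 = s.find(head, c1 + 1)         # move to the next occurrence
    return hits
```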

Through the operation of the conserved region detection system 100 described above, which takes the evolutionary process into account, the CPU 1406 detects the conserved regions shared among the target sequences 1401 of the leaves belonging to each intermediate node of the phylogenetic tree defined by the configuration information 1413. It thereby obtains the structure array ConservedReg, the array ListOfConservedReg, which holds the relationships among conserved regions, and the array AllOfConservedReg. The CPU 1406 then creates the image data shown in FIGS. 9, 10, 11, 12, and 13, as described below, and displays it on the display device 1402.

FIG. 9 is an explanatory diagram showing a phylogenetic tree created from the data of the structure array ConservedReg, the array ListOfConservedReg, and the array AllOfConservedReg. The left half of the display screen shows the phylogenetic tree built from the names of the target sequences 1401 (for example, species 1 to species 6), and the right half shows the conserved regions on the genome sequence corresponding to each target sequence 1401. The branches of the tree are drawn in different colors or in different line styles, such as solid, dotted, and dash-dot lines, so that sequence families can be told apart. For example, line 901 represents the family of species 1, 2, 3, and 4, and line 902 represents the family of species 1 and 2. FIG. 9 distinguishes the branches by color and line style, but other representations may be used in practice, such as displaying a tag, number, or name near each line.

On the right side of the phylogenetic tree in FIG. 9, the position of each conserved region and the level of the tree at which it is conserved are shown schematically for each target sequence 1401. The level corresponds to the color and line style of the branches in the tree on the left. For example, a region conserved only in species 1 and species 2 appears as portion 905, drawn with the same color and style as line 902, which forms the intermediate node to which species 1 and 2 belong. Likewise, a region conserved only in species 1, 2, 3, and 4 appears as portion 903, drawn with the same color and style as line 901, and a region conserved in all target sequences 1401 (species 1 to species 6) appears as portion 904, drawn with the same color and style as the line above line 901. A region conserved among distantly related species should also be conserved among closely related ones. Accordingly, in FIG. 9 a conserved region drawn with a line near the root is shown to be present in all of the target sequences 1401 at the leaves.
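The mapping from "the set of species that share a region" to "the tree level used to draw it" can be illustrated as follows. The nested-tuple tree is a hypothetical topology chosen only to be consistent with the families named in the text (species 1 and 2; species 1 to 4; all six species); the actual tree in FIG. 9 may differ, and the function names are illustrative.

```python
from typing import FrozenSet, Optional, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def leaves(node: Tree) -> FrozenSet[str]:
    """Collect the species names under a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def conservation_node(tree: Tree, species: FrozenSet[str]) -> Optional[Tree]:
    """Return the smallest subtree whose leaves include every given species."""
    if not species <= leaves(tree):
        return None
    if not isinstance(tree, str):
        for child in tree:
            found = conservation_node(child, species)
            if found is not None:
                return found                # a deeper node already covers them
    return tree                             # no child covers them: this is the level


# Hypothetical topology, not taken from the figure.
TREE: Tree = ((("species1", "species2"), "species3", "species4"),
              ("species5", "species6"))
```

A region found only in species 1 and 2 resolves to the innermost node, while a region shared by species 1 and 6 resolves to the root, mirroring how regions drawn near the root appear in every leaf.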

FIG. 10 is an explanatory diagram in which the conserved regions of species 1, one of the target sequences 1401 shown in FIG. 9, are displayed as the actual base sequence (or amino acid sequence). FIG. 10 lets the researcher see the DNA (or amino acid) sequence of each conserved region of the target sequence 1401. The arrow regions in FIG. 10 are obtained by searching a database located on a public network such as the Internet, or on a local network, with the DNA (or amino acid) sequence of species 1 and mapping the results onto the sequence (the direction of each arrow indicates the direction of the search sequence). By referring to this result, the researcher can also see how the DNA (or amino acid) sequence of a conserved region corresponds to known information. The figure shows a case in which known results were found for a conserved region, which allows the researcher to learn the biological meaning of that region.

FIG. 11 is an explanatory diagram showing the relationships among the conserved regions in the target sequences 1401 shown in FIG. 9. The colors and styles of the lines correspond to the lines that represent conserved regions in the right half of FIG. 9. To choose the target sequences 1401 to display (species 1, 2, 3, and 4 in the case of FIG. 11), the user may, for example, select a line such as 901 on the display screen of FIG. 9 with the mouse 1404 to bring up a menu asking whether to show this view. Alternatively, an arbitrary set of sequences may be selected from an input menu and displayed. In FIG. 11, the conserved region 1101 toward the left of the screen has the same orientation in species 1, 2, and 3 but a different orientation in species 4. From this, the researcher can infer that at some point in evolution either an inversion occurred in species 4 alone or species 1, 2, and 3 all changed orientation through inversion, which provides a clue to the evolutionary process.
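A simple way to flag candidate inversions from such a display is to compare each species' orientation with the majority orientation. This is only a heuristic sketch: as the text notes, the majority could equally be the lineage that changed, so the result is a hint rather than a conclusion. The function name and the "+"/"-" encoding are assumptions.

```python
from collections import Counter
from typing import Dict, List


def minority_orientation(orientations: Dict[str, str]) -> List[str]:
    """Return the species whose orientation differs from the most common one.

    orientations maps a species name to "+" or "-" for one conserved region.
    An empty list means every species agrees.
    """
    counts = Counter(orientations.values())
    if len(counts) < 2:
        return []
    majority, _ = counts.most_common(1)[0]
    return sorted(sp for sp, o in orientations.items() if o != majority)
```

For the situation described for region 1101, `minority_orientation({"species1": "+", "species2": "+", "species3": "+", "species4": "-"})` flags species 4.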

FIG. 12, like FIG. 11, is an explanatory diagram showing the relationships among the conserved regions in the target sequences 1401 shown in FIG. 9. FIG. 12 displays species 5 and species 6 from the phylogenetic tree of FIG. 9. From this screen, the researcher can see that region 1201 occurs twice in species 5 and once in species 6, and can therefore infer, for example, that species 5 duplicated region 1201 at some point in the past. Similarly, for region 1202 the researcher can infer that the region was duplicated in species 6 in the past.
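The copy-number comparison behind this inference can be sketched in a few lines. The input format (species name mapped to the list of positions where the region was found), the function name, and the positions in the usage note are hypothetical, and treating the smallest copy count as the baseline is an assumption made for illustration.

```python
from typing import Dict, List


def duplicated_in(copy_positions: Dict[str, List[int]]) -> Dict[str, int]:
    """Return species holding more copies of a region than the minimum observed.

    The value is the copy count for each such species.
    """
    if not copy_positions:
        return {}
    baseline = min(len(p) for p in copy_positions.values())
    return {sp: len(p) for sp, p in copy_positions.items() if len(p) > baseline}
```

With two copies in species 5 and one in species 6, `duplicated_in({"species5": [120, 480], "species6": [130]})` reports species 5 as the candidate for a past duplication.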

FIG. 13 is an explanatory diagram showing a case in which a conserved region of one species in the phylogenetic tree of FIG. 9 is also present in other species. In FIG. 13, the other species were searched for a sequence similar to region 1301 of species 5, and such a sequence was found in species 3 and species 6. The names of the species in which the sequence was found are highlighted. From this screen, the researcher can consider possible reasons why region 1301 was found in species 3: one is that the phylogenetic tree itself is wrong, and another is that a retrotransposon sequence such as a SINE or LINE sequence was inserted. Conversely, the same search can be repeated for each conserved region found in order to check whether the phylogenetic tree is correct.

As described above, in the conserved region detection system 100 of the present embodiment, the CPU 1406 creates the index information KtupleArrayD[ ], IdxArrayD[ ], KtupleArrayR[ ], and IdxArrayR[ ] for the data of the target sequences 1401 subject to genome analysis. It then detects the conserved regions shared among the target sequences 1401 of the leaves belonging to each intermediate node of the phylogenetic tree defined by the configuration information 1413, and obtains the structure array ConservedReg, the array ListOfConservedReg, which holds the relationships among conserved regions, and the array AllOfConservedReg.

Using the data of the structure array ConservedReg, the array ListOfConservedReg, and the array AllOfConservedReg, the CPU 1406 then creates display data showing the phylogenetic tree, the actual genome sequence of each target sequence 1401 in the tree, and the relationships among the conserved regions in the target sequences 1401. The conserved regions shared among genome sequences are associated with the evolutionary process and displayed on the display screen 1402 together with the phylogenetic tree. By referring to this display, the researcher can infer how the conserved regions of each species evolved and use that evolutionary process as a clue, which is expected to lead to a more fundamental understanding of biology.

(Other embodiments)
In the processing performed with the flowcharts shown in FIG. 21 and FIG. 22, the genome sequences to be searched for conserved regions are selected using the data of the target sequences 1401 and the configuration information 1413. The processing is not limited to this, however, and can also be executed on any set of genome sequences other than the target sequences 1401. In that case, step 2104 of FIG. 21 is skipped, and in step 2105 an arbitrary set of sequences other than the target sequences 1401 is selected as the genome sequences to be searched for conserved regions.

To obtain an analysis result like that of FIG. 13, the process of detecting the conserved region C in the sequence seq j, described with the flowchart shown in FIG. 24, is executed for all of the target sequences 1401. If this process detects the conserved region C in the genome sequence of another target sequence 1401, the position of C in that genome sequence is recorded, and the display result shown in FIG. 13 is obtained.

The present invention relates to genome analysis that compares genome sequences among a plurality of DNA (or amino acid) sequences to investigate their meaning, and can be used in particular in a conserved region detection system considering the evolutionary process, which finds and displays conserved regions preserved in the course of evolution.

An explanatory diagram illustrating substitution, one of the changes a genome sequence undergoes in evolution.
An explanatory diagram illustrating insertion, one of the changes a genome sequence undergoes in evolution.
An explanatory diagram illustrating deletion, one of the changes a genome sequence undergoes in evolution.
An explanatory diagram illustrating inversion, one of the changes a genome sequence undergoes in evolution.
An explanatory diagram illustrating orthologs and paralogs, which are types of ancestral sequence.
An explanatory diagram showing an example of dot matrix analysis.
An explanatory diagram showing an example of multiple alignment analysis.
An explanatory diagram showing an example of phylogenetic tree analysis.
An explanatory diagram showing a screen that displays the phylogenetic tree combined with the relationships among conserved regions.
An explanatory diagram showing a screen that displays conserved regions combined with known information.
An explanatory diagram showing a screen that displays conserved regions in association across a plurality of sequences.
An explanatory diagram showing a screen that displays conserved regions in association across a plurality of sequences.
A display example of the present invention in which a conserved region found in one sequence is searched for in other sequences and the search result is displayed.
A functional block diagram schematically showing the system configuration of the conserved region detection system 100 considering the evolutionary process in the present embodiment.
An explanatory diagram showing the data structure of the target sequence 1401 in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing the data structure of the configuration information 1413 in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing the data structure for displaying the index information of the target sequence 1401 in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing the data structure for recording the conserved regions of the target sequence 1401 in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing the data structure for recording the correspondence between conserved regions of different target sequences 1401 in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing the data structure for recording different kinds of conserved regions in the conserved region detection system 100 of the present embodiment.
A flowchart schematically showing the overall processing flow in the conserved region detection system 100 of the present embodiment.
A flowchart showing in detail the processing that creates index information for each target sequence 1401 in the conserved region detection system 100 of the present embodiment.
A flowchart showing in detail the processing that detects conserved regions in the conserved region detection system 100 of the present embodiment.
A flowchart showing in detail the processing that detects conserved regions in the conserved region detection system 100 of the present embodiment.
A flowchart showing in detail the processing that calculates conserved regions in the conserved region detection system considering the evolutionary process of the present invention.
A flowchart showing in detail the processing that detects conserved regions in the conserved region detection system 100 of the present embodiment.
A flowchart showing in detail the flow of the processing that detects conserved regions within a genome sequence in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing an example of the array data KtupleArrayD and IdxArrayD in the conserved region detection system 100 of the present embodiment.
An explanatory diagram showing a state in which a conserved region between two sequences has been detected in the conserved region detection system 100 of the present embodiment.

Explanation of symbols

100 Conserved region detection system
1401 Target sequence
1402 Display device
1403 Keyboard
1404 Mouse
1405 Sequence DB
1406 Central processing unit
1407 Program memory
1408 Conserved region calculation processing unit
1409 Phylogenetic tree calculation processing unit
1410 Analysis result display processing unit
1411 Data memory
1412 Input data
1413 Phylogenetic tree
2601 Conserved region

Claims

A conserved region detection system considering the evolutionary process, which finds conserved regions that have been evolutionarily conserved without undergoing sequence changes in the genome sequences subject to genome analysis among a plurality of DNA sequences, the system comprising:
sequence recognition means for referring to a phylogenetic tree obtained on the basis of the genome sequences and recognizing the genome sequences belonging to an intermediate node constituting the phylogenetic tree; and
conserved region detection means for detecting the conserved regions present in the genome sequences belonging to the intermediate node,
wherein the conserved region detection means,
starting from the position of an identical character string present in two genome sequences belonging to the intermediate node, examines the region within the genome sequences one fixed-length character string at a time, extends the conserved region over the range in which the number of mismatched characters is less than a predetermined number, and detects the conserved region by setting the position at which the number of mismatched characters exceeds the predetermined number as the boundary of the conserved region.
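The block-wise extension recited above can be illustrated with a minimal Python sketch. It is only one possible reading of the claim language, not the patented implementation: the names `extend_seed` and `count_mismatches`, the default block size, and the choice to keep extending when the mismatch count equals the limit are assumptions.

```python
from typing import Tuple


def count_mismatches(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def extend_seed(seq_a: str, seq_b: str, pos_a: int, pos_b: int,
                seed_len: int, block: int = 4,
                max_mismatches: int = 1) -> Tuple[int, int, int, int]:
    """Extend an identical seed shared by two sequences, block by block.

    Returns (start_a, end_a, start_b, end_b) as half-open intervals.
    Assumption: a block is accepted while its mismatch count does not
    exceed `max_mismatches`; the first rejected block marks the boundary.
    """
    if block <= 0:
        raise ValueError("block must be positive")
    start_a, start_b = pos_a, pos_b
    end_a, end_b = pos_a + seed_len, pos_b + seed_len

    # Extend to the right one fixed-length block at a time.
    while end_a + block <= len(seq_a) and end_b + block <= len(seq_b):
        if count_mismatches(seq_a[end_a:end_a + block],
                            seq_b[end_b:end_b + block]) > max_mismatches:
            break
        end_a += block
        end_b += block

    # Extend to the left in the same way.
    while start_a - block >= 0 and start_b - block >= 0:
        if count_mismatches(seq_a[start_a - block:start_a],
                            seq_b[start_b - block:start_b]) > max_mismatches:
            break
        start_a -= block
        start_b -= block

    return start_a, end_a, start_b, end_b


if __name__ == "__main__":
    # Toy sequences sharing the seed "ACGT" at position 4.
    a = "TTTTACGTACGTAAAA"
    b = "GGGGACGTACCTCCCC"
    print(extend_seed(a, b, 4, 4, seed_len=4))
```

In this toy input the block following the seed differs in one character and is absorbed into the region, while the next block differs in every character and therefore marks the boundary.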

The conserved region detection system considering the evolutionary process according to claim 1,
wherein the conserved region detection means,
starting from the position of an identical character string present in a plurality of genome sequences belonging to the intermediate node, repeatedly detects the same conserved region while changing the intermediate node, on the basis of a conserved region already detected in the genome sequences, and thereby detects the conserved regions in the genome sequences belonging to all intermediate nodes.

The conserved region detection system considering the evolutionary process according to claim 2,
further comprising analysis result display means that represents each conserved region detected in the genome sequences by the conserved region detection means with a line of a different style, composes the branches forming the intermediate node on the phylogenetic tree from the lines of the conserved regions in the genome sequences belonging to that intermediate node, and displays the conserved regions and the phylogenetic tree simultaneously.

The conserved region detection system considering the evolutionary process according to claim 3,
wherein the analysis result display means
displays each conserved region simultaneously in combination with information on known genome sequences.

The conserved region detection system considering the evolutionary process according to claim 4,
wherein the analysis result display means
combines each conserved region with the genome sequence containing it and displays identical conserved regions contained in different genome sequences in association with one another.

The conserved region detection system considering the evolutionary process according to claim 5, further comprising:
sequence search means for searching, on the basis of an arbitrary sequence, the genome sequences belonging to the intermediate nodes constituting the phylogenetic tree; and
specific display means for referring to information on the genome sequence obtained as a result of the search by the sequence search means and displaying, by a specific display method, information on the genome sequences belonging to the intermediate nodes constituting the phylogenetic tree.

The conserved region detection system considering the evolutionary process according to claim 6,
wherein the specific display means
displays the genome sequence obtained as a result of the search by the sequence search means in association with the corresponding portion of the arbitrary sequence.