JP2005316605A - Method for analyzing splicing patterns of biopolymer sequences

Info

Publication number: JP2005316605A
Application number: JP2004131822A
Authority: JP
Inventors: Tomohiro Yasuda; 知弘安田; Koichi Kimura; 宏一木村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-04-27
Filing date: 2004-04-27
Publication date: 2005-11-10

Abstract

PROBLEM TO BE SOLVED: To efficiently extract the partial sequences corresponding to individual exons from a large set of biopolymer sequences that includes sequences related to one another as splicing variants, while allowing for the case in which several exons with the same sequence appear in the same input sequence.

SOLUTION: A model of an exon sequence that allows for several exons with the same sequence appearing in the same input sequence is constructed from the conditions an exon sequence must satisfy. The exon sequences defined by this model are extracted from the biopolymer sequences given as input. In this method, a suffix tree is built from the given biopolymer sequences, and the processing associated with the depth-first search of the suffix tree and with each character position in the sequences is performed a number of times that does not depend on the number of characters or the number of sequences, so that the extraction completes in time linear in the total length of the input biopolymer sequences.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a string analysis method for extracting substrings that are common to a plurality of strings, in particular biopolymer sequences, and more specifically to a method well suited to analyzing the splicing patterns of biopolymer sequences.

In June 2000, an international consortium and a US venture company declared that sequencing of the human genome was complete. While analysis of genome sequences proceeds, mRNA is being analyzed in order to study which genes are expressed. mRNA is an RNA molecule produced from genomic DNA when a gene is expressed and is indispensable to the expression of gene function. Because mRNA molecules degrade easily, they are often converted by reverse transcription into cDNA, which is more stable, before analysis. The cDNA sequences obtained by sequencing cDNA are useful in many ways, one of which is revealing the exons that make up a gene.

mRNAs derived from the same gene that differ in the regions subjected to splicing are called splicing variants of one another. In humans, more than 100,000 kinds of protein are said to exist in the body, whereas the number of genes is said to be only 30,000 to 40,000; splicing variants are thought to account for this gap. Analysis of splicing variants is therefore unavoidable when analyzing the expression of gene function, and is indispensable for elucidating biological phenomena and for genome-based drug discovery. Collecting and analyzing enough cDNA sequences to cover every splicing variant derived from the same gene is a powerful means of revealing the exon structure of mRNA. To analyze splicing variants using only transcript-derived sequences such as cDNA sequences, the problem is how to extract, from multiple sequences, the partial sequences corresponding to individual exons. However, if one exhaustively enumerates every partial sequence of the given sequences and checks whether each is an exon sequence, each input sequence has a number of partial sequences on the order of the square of its length, and comparing them across the input sequences requires computation time at least on the order of that quantity multiplied again by the number of sequences. Consequently, the processing time grows rapidly as the amount of data increases, and building a practical system is difficult. Meanwhile, the number of sequences called ESTs, obtained by sequencing part of a cDNA and accumulated in the database of a US public institution, is growing rapidly with advances in sequencing technology and exceeds 4.5 million for human alone. A fast computational method is needed to analyze such vast amounts of sequence data.
Conventional techniques for comparative analysis of sequences include the following.

The homology search method of Non-Patent Document 1 is also used to analyze cDNA sequences. If genomic sequences are available in addition to cDNA sequences, a substring of the genomic sequence that also occurs in a cDNA sequence can be predicted with high accuracy to be an exon. When analyzing cDNA sequences with no corresponding genomic sequence, however, the transcript-derived sequences must be compared with one another. With homology search, an all-against-all comparison requires computation time on the order of the square of the number of sequences in the worst case, so analyzing splicing patterns becomes computationally infeasible as the number of sequences grows.

The method of Delcher et al. in Non-Patent Document 2 is known as a technique for efficiently extracting partial sequences common to multiple sequences. It is aimed mainly at comparing the genomes of two closely related species. The method first searches for partial sequences called MUMs (Maximal Unique Matches), which satisfy the following conditions (D1)-(D4).
(D1) A MUM is a partial sequence whose length is at least a user-specified parameter.
(D2) A MUM is a partial sequence common to the two sequences.
(D3) A MUM occurs exactly once in each of the two sequences.
(D4) A MUM is not a partial sequence of another MUM. That is, only sequences obtained by extending a sequence satisfying (D1)-(D3) as far as possible are MUMs.
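Conditions (D1)-(D4) can be illustrated with a small brute-force sketch in Python. This is not the suffix-tree-based procedure of Delcher et al.; it enumerates substrings directly and is only practical for toy inputs. The names `count_occurrences` and `find_mums` and the example sequences are assumptions made for illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    return sum(1 for j in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, j))


def find_mums(a, b, min_len):
    """Return the strings satisfying (D1)-(D4) for the two sequences a and b."""
    candidates = set()
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):       # (D1): length >= min_len
            s = a[i:j]
            # (D2) and (D3): s occurs exactly once in each sequence
            if count_occurrences(a, s) == 1 and count_occurrences(b, s) == 1:
                candidates.add(s)
    # (D4): keep only candidates that are not contained in another candidate
    return sorted(s for s in candidates
                  if not any(s != t and s in t for t in candidates))


print(find_mums("GGACGTCC", "TTACGTAA", 3))
```

In this toy input the only shared stretch of length 3 or more is "ACGT", so its shorter pieces "ACG" and "CGT" are discarded by (D4).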

If the method of Delcher et al. is applied to cDNA sequences to identify the parts aligned as MUMs and the positions of large insertions or deletions, the MUMs can be regarded as corresponding to exon sequences and the large insertions or deletions as corresponding to alternative exons, so the method can serve as a tool for identifying exon structure. However, the method handles only two sequences and cannot be applied to three or more. Moreover, there is in general no guarantee that exons with identical sequences never appear within the same input sequence. In Non-Patent Document 3, Delcher et al. made the suffix tree implementation more compact and developed a method that needs to store only one of the two sequences in the suffix tree, but the inability to compare three or more sequences simultaneously remains unresolved.

A linear-time solution to the longest common substring problem, described in chapter 9 of Non-Patent Document 4, is known as a method for extracting common partial sequences from three or more strings. For any number of strings, this method extracts, in time linear in the total number of characters, the longest substring common to at least a specified number of the strings. However, it extracts only the longest of the common substrings, and it may extract strings that overlap one another, so the sequences it extracts cannot be regarded as corresponding to exons.

Hohl et al. developed a method, intended mainly for aligning genome sequences, that extracts partial sequences common to three or more sequences, orders the extracted partial sequences, and uses the result to generate a multiple sequence alignment quickly (Non-Patent Document 5). However, their method considers only sequences common to all of the input sequences and cannot extract sequences common to only some of them. When applied to splicing variant sequences, it therefore cannot extract exons that undergo alternative splicing.

Non-Patent Document 1: Altschul, S.F. et al., Nucleic Acids Research, 25:3389-3402, 1997
Non-Patent Document 2: Delcher et al., Nucleic Acids Research, 1999, 27(11):2369-2376
Non-Patent Document 3: Delcher et al., Nucleic Acids Research, 2002, 30(11):2478-2483
Non-Patent Document 4: Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York
Non-Patent Document 5: Hohl, M., Kurtz, S., and Ohlebusch, E., Efficient multiple genome alignment, Bioinformatics 18 Suppl.1, S312-S320, 2002

An object of the present invention is to extract, from a plurality of transcript-derived sequences, the exon sequences contained in those sequences.

Ideally, the sequence of each individual exon could be identified completely from transcript-derived sequences alone, but in practice this is not always possible. Suppose a gene has four exons A, B, C and D, arranged in this order on the genome. If, as in FIG. 14, B and C always appear together in the mRNA after splicing, then from the cDNA sequences alone it is impossible to know that B and C are two separate exons, or where on the cDNA sequence the boundary between them lies. As another case, suppose, as in FIG. 15, that exon A is always immediately followed by either B or C, and that B and C are always preceded by A. If B and C begin with the same sequence, the existence of three exons A, B and C can be detected, but the boundaries between A, B and C cannot be determined. The present invention therefore aims to identify the longest sequences that are partial sequences of a concatenation of exons that always occur together and adjacent to one another, and that are certain not to contain bases derived from any other exon. Such a partial sequence is hereinafter called an Exon Group (EG). Note that extremely short partial sequences, for example a few characters long, appear throughout the input sequences and may be common to several of them, but they should not be regarded as exon sequences. An EG should therefore be a partial sequence of at least a certain length.

To describe the method of extracting EGs, the EG explained intuitively above must be defined rigorously. Three concepts are defined in preparation. The following description applies unchanged to other kinds of strings, such as amino acid sequences, and not only to base sequences. To make it clear that this broader class of problems is covered, the term "string" is used below in place of "sequence" and "character" in place of "base".

The first definition gives the precise meaning, in this specification, of strings "overlapping". As shown in FIG. 13, strings s and s' are said to overlap if and only if there exist strings t, t' and t'' satisfying the following four conditions.
(1) t, t' and t'' each have length 1 or more.
(2) s = tt', that is, s is the concatenation of t and t'.
(3) s' = t't'', that is, s' is the concatenation of t' and t''.
(4) Some input string contains tt't'' as a substring.
When (1)-(4) are satisfied, s is said to overlap s' on the right, and s' is said to overlap s on the left.
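Conditions (1)-(4) translate fairly directly into a check over the possible lengths of the shared part t'. The following Python sketch is only an illustration of the definition; the function name `overlaps_on_right` and the sample strings are assumptions, not part of the specification.

```python
def overlaps_on_right(s, s2, inputs):
    """True if s overlaps s2 on the right in the sense of conditions (1)-(4)."""
    # k is the length of t'. Requiring k < len(s) and k < len(s2) keeps
    # t = s[:-k] and t'' = s2[k:] non-empty, as condition (1) demands.
    for k in range(1, min(len(s), len(s2))):
        if s[-k:] == s2[:k]:                 # conditions (2) and (3) with t' = s[-k:]
            merged = s + s2[k:]              # the string t t' t''
            if any(merged in p for p in inputs):   # condition (4)
                return True
    return False


inputs = ["ATATG", "TTAGTA"]
print(overlaps_on_right("ATA", "TATG", inputs))   # t = "A", t' = "TA", t'' = "TG"
print(overlaps_on_right("TATG", "ATA", inputs))
```

Note that under this definition a string that merely contains another does not overlap it; the shared part must be a proper suffix of one string and a proper prefix of the other.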

The second definition is the set Q(s) of positions at which a string s occurs in the input strings. Q(s) is defined by
Q(s) = {(i, j) | s is a string starting at the j-th base of the i-th given string}
For Q(s) and an integer k, the following operations are defined.
Q(s) + k = {(i, j + k) | (i, j) ∈ Q(s)},  Q(s) − k = Q(s) + (−k)
An element (i, j) of Q(s) is called an appearance of s in this specification.
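As a concrete reading of Q(s) and of the shift operation Q(s) + k, the sketch below computes both by direct scanning. The names `appearances` and `shift` are hypothetical, and positions are counted from 0, following the indexing convention used elsewhere in this specification.

```python
def appearances(s, inputs):
    """Q(s): the set of (string number, 0-based start position) pairs of s."""
    return {(i, j)
            for i, p in enumerate(inputs)
            for j in range(len(p) - len(s) + 1)
            if p.startswith(s, j)}


def shift(q, k):
    """Q + k: move every appearance k positions to the right (k may be negative)."""
    return {(i, j + k) for (i, j) in q}


inputs = ["ATATG", "TTAGTA"]
print(sorted(appearances("TA", inputs)))
print(sorted(shift(appearances("ATA", inputs), 1)))
```

Here "TA" appears once in the first string and twice in the second, and shifting the single appearance of "ATA" by 1 lands on the appearance of "TA" that it contains.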

The third definition is that of a maximal match (MM), a string that may be a concatenation of several exon sequences. A string m satisfying the following conditions (M1)-(M4) is called an MM in this specification.
(M1) m is a substring of one or more input strings.
(M2) m has length u or more, where u is the length of the shortest exon to be extracted and is an integer supplied as a parameter by the user of the method.
(M3) For every character a occurring in the input strings, Q(m) ≠ Q(ma), where ma is m with the character a appended.
(M4) For every character a occurring in the input strings, Q(m) ≠ Q(am) + 1, where am is m with the character a prepended.
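The conditions (M1)-(M4) can be checked literally on small inputs. The following Python sketch enumerates every substring and tests it, which takes far more than linear time and is meant only to make the definition concrete; the function names are assumptions.

```python
def appearances(s, inputs):
    """Q(s): set of (string number, 0-based start position) pairs of s."""
    return {(i, j) for i, p in enumerate(inputs)
            for j in range(len(p) - len(s) + 1) if p.startswith(s, j)}


def maximal_matches(inputs, u):
    """Return all strings satisfying (M1)-(M4), by exhaustive enumeration."""
    alphabet = set("".join(inputs))
    # (M1) and (M2): substrings of some input with length >= u
    candidates = {p[i:j] for p in inputs
                  for i in range(len(p)) for j in range(i + u, len(p) + 1)}
    result = []
    for m in candidates:
        q = appearances(m, inputs)
        # (M3): appending any character changes the set of appearances
        right_ok = all(appearances(m + a, inputs) != q for a in alphabet)
        # (M4): prepending any character changes it too (after shifting by 1)
        left_ok = all({(i, j + 1) for i, j in appearances(a + m, inputs)} != q
                      for a in alphabet)
        if right_ok and left_ok:
            result.append(m)
    return sorted(result)


print(maximal_matches(["ATATG", "TTAGTA"], 2))
```

For the two strings of FIG. 4 and u = 2, the strings that pass are "AT" and "TA", which recur in different contexts, and the two complete input strings, which cannot be extended at all.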

FIG. 3 shows an example of an MM.
An EG 305 should be a string that neither contains an MM 303 entirely as a substring nor overlaps one. An EG is therefore defined as a string satisfying the following conditions.
(E1) It is a substring of at least one input string.
(E2) Its length is u or more.
(E3) It does not overlap any MM.
(E4) It does not have an MM as a proper substring. A proper substring of a string is any substring other than the string itself.
(E5) It is not a proper substring of any string satisfying (E1)-(E4).
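To see how (E1)-(E5) single out exon-like pieces, the definition can be evaluated exhaustively on a toy input. The sketch below is a brute-force reading of the conditions, not the linear-time method of the invention. All function names are hypothetical, and the input uses distinct lowercase letters as stand-ins for three "exons" abc, def and ghi so that the result is easy to verify by hand.

```python
def appearances(s, inputs):
    """Q(s): set of (string number, 0-based start position) pairs of s."""
    return {(i, j) for i, p in enumerate(inputs)
            for j in range(len(p) - len(s) + 1) if p.startswith(s, j)}


def is_mm(m, inputs, alphabet):
    """Conditions (M3) and (M4) for a substring m of length >= u."""
    q = appearances(m, inputs)
    return (all(appearances(m + a, inputs) != q for a in alphabet) and
            all({(i, j + 1) for i, j in appearances(a + m, inputs)} != q
                for a in alphabet))


def overlaps_on_right(s, s2, inputs):
    """s overlaps s2 on the right, per the overlap definition (1)-(4)."""
    return any(s[-k:] == s2[:k] and any(s + s2[k:] in p for p in inputs)
               for k in range(1, min(len(s), len(s2))))


def exon_groups(inputs, u):
    """Return all strings satisfying (E1)-(E5), by exhaustive enumeration."""
    alphabet = set("".join(inputs))
    subs = {p[i:j] for p in inputs                       # (E1), (E2)
            for i in range(len(p)) for j in range(i + u, len(p) + 1)}
    mms = [m for m in subs if is_mm(m, inputs, alphabet)]

    def clashes(e, m):
        return ((m != e and m in e)                      # violates (E4)
                or overlaps_on_right(e, m, inputs)       # violates (E3)
                or overlaps_on_right(m, e, inputs))

    ok = [e for e in subs if not any(clashes(e, m) for m in mms)]
    # (E5): keep only strings not properly contained in another survivor
    return sorted(e for e in ok if not any(e != t and e in t for t in ok))


print(exon_groups(["abcdefghi", "abcghi"], 2))
```

The second input skips the middle block def, so the boundaries on both sides of it become visible and the three blocks are reported separately.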

To process huge data sets efficiently, processing time should be kept as low as possible. Reading the input strings already takes time linear in the total number of characters, so a method that runs in linear time achieves the theoretically optimal order of computation time. Accordingly, the problem addressed by the present invention is to provide a method that, given strings corresponding to a plurality of biopolymer sequences, extracts the strings satisfying (E1)-(E5), that is, the EGs, in time linear in the total number of characters of the input strings.

In the present invention, when a plurality of strings 301 are given as input, right MMs 302, left MMs or MMs 303, and right EG-holders 304 (all described later) are extracted in turn and then used to extract the EGs 305. Only one of the left MM extraction (step S503) and the MM extraction (step S504) needs to be performed.

The features of the present invention are described below.
The method of the present invention extracts EGs from a plurality of input strings and completes in time linear in the total number of characters of the input strings.

To extract EGs, the method of the present invention comprises a step of extracting right MMs, a step of extracting either MMs or the left MMs mentioned above, and a step of extracting right EG-holders, and is characterized in that each step uses a suffix tree storing the input strings.

The method is further characterized in that, in the step of extracting MMs, the right MMs are extracted and those substrings among them that satisfy the MM conditions are selected and extracted.

The method is further characterized by having a step of extracting right EG-holders.

The method is further characterized in that, in the step of extracting right EG-holders, the positions of MMs on the input strings are analyzed by computing only the positions of right MMs that do not have another right MM as a proper suffix, or only the positions of right MMs that do not have another right MM as a proper prefix.

The method is further characterized in that, when finding the right MMs that do not have another right MM as a proper suffix, if the suffix tree node corresponding to a right MM is a leaf node, the suffix link of that node's parent is used.

The method is further characterized in that only prefixes of right EG-holders are regarded as EG candidates.

According to the present invention, EGs can be extracted from biopolymer sequences such as cDNA sequences in time linear in the total length of the given sequences.

Embodiments of the present invention are described below. First, the symbols, terms and concepts used in this specification are defined. The method of the present invention is applicable not only to base sequences but also to other kinds of strings such as amino acid sequences. To make clear that this more general problem is covered, the term "string" is used below in place of "sequence" and "character" in place of "base".

First, the symbols and precise meanings used in this specification are defined for concepts that are already well known.
● Empty string
A string of length 0. The empty string is written ε in this specification.
● Concatenation of strings
The string obtained by concatenating strings s and t is written st.

● |s|
The length of the string s.
● s[j] (0 ≤ j < |s|)
The j-th character of the string s. In this specification, the first character of a string is counted as the 0th.

● s[i..j]
The substring of s from the i-th character to the j-th character, including both. When i > j, s[i..j] is defined to be ε.
● Prefix
For a string s, s[0..j] (j ≤ |s|) is called a prefix of s.
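Because s[i..j] includes both end positions, it differs from the half-open slices used in many programming languages. A minimal Python sketch of the notation follows; the helper names `substring` and `prefixes` are assumptions, and `prefixes` lists only the non-empty prefixes with j taken over valid character positions.

```python
def substring(s, i, j):
    """s[i..j]: characters i through j inclusive, 0-based; empty when i > j."""
    return s[i:j + 1] if i <= j else ""


def prefixes(s):
    """The non-empty prefixes s[0..j] for j = 0, ..., |s| - 1."""
    return [substring(s, 0, j) for j in range(len(s))]


print(substring("ATATG", 1, 3))
print(prefixes("ATG"))
```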

● Suffix
For a string s, s[j..|s|−1] (j ≥ 0) is called a suffix of s.
● |A|
The number of elements in the set A.

● ⊆, ⊇, ⊂, ⊃
In this specification, for two sets A and B, "A ⊆ B" means that every element of A is an element of B, and "A ⊇ B" means B ⊆ A. "A ⊂ B" means A ⊆ B and A ≠ B, and "A ⊃ B" means B ⊂ A.

● O(f(n))
For a function g(n), g(n) = O(f(n)) means that there exists a constant C such that g(n) ≤ Cf(n) holds for all sufficiently large n. A quantity is said to be "O(f(n))" when it is at most some function g(n) of n with g(n) = O(f(n)).

● Ω(f(n))
g(n) = Ω(f(n)) means that there exists a constant C such that g(n) ≥ Cf(n) holds for all sufficiently large n. A quantity is said to be "Ω(f(n))" when it is at least some function g(n) of n with g(n) = Ω(f(n)).

● Suffix tree
The suffix tree of strings P1, P2, ..., Pn is a tree-shaped data structure that stores every suffix of P1$, ..., Pn$, where $ is a character not occurring in P1, ..., Pn that is appended to the end of each string. It has the following properties. FIG. 4 shows a suffix tree 401 storing the two strings "ATATG" and "TTAGTA".
(S1) It is a directed tree with a root node 402.
(S2) It has as many leaves as there are distinct suffixes of P1$, ..., Pn$. In the present invention, each leaf is associated with a set {(i, j)} of pairs of a string number i and a position j within that string, and it is assumed that the data structure storing this set can be accessed in O(1), for example through a pointer to it.
(S3) Each edge 404 carries, as its label 405, a substring of one of the strings P1$, ..., Pn$.
(S4) No two edges leaving the same node 407 have labels 405 that begin with the same character.
(S5) Concatenating the labels of the edges on the path from the root node 402 to a leaf 403 assigned string number i and position j, in the order they are encountered, yields the suffix of input string i that starts at its j-th character, with $ appended.
(S6) The number of edges leaving any node v is never 1.
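Properties (S1)-(S6) can be made concrete with a naive construction that groups suffixes by their first character and uses the longest common prefix of each group as the edge label. This sketch takes quadratic time or worse and is not the linear-time construction that the method relies on; the dictionary layout and the names `build_node`, `suffix_tree` and `count_nodes` are assumptions chosen for illustration.

```python
from os.path import commonprefix  # character-wise longest common prefix of a list


def build_node(items):
    """Build the subtree for a list of (remaining text, position) pairs."""
    node = {"positions": set(), "children": {}}
    groups = {}
    for text, pos in items:
        if text:
            groups.setdefault(text[0], []).append((text, pos))   # (S4)
        elif pos is not None:
            node["positions"].add(pos)      # a suffix ends here: this is a leaf (S2)
    for group in groups.values():
        label = commonprefix([text for text, _ in group])        # edge label (S3)
        node["children"][label] = build_node(
            [(text[len(label):], pos) for text, pos in group])
    return node


def suffix_tree(inputs):
    """Naive generalized suffix tree of the inputs, each terminated by '$'."""
    items = []
    for i, p in enumerate(inputs):
        items += [(p[j:] + "$", (i, j)) for j in range(len(p))]
        items.append(("$", None))           # the bare "$" suffix carries no position
    return build_node(items)


def count_nodes(node):
    """Return (number of leaves, number of internal nodes) below and including node."""
    if not node["children"]:
        return 1, 0
    leaves, internal = 0, 1
    for child in node["children"].values():
        sub_leaves, sub_internal = count_nodes(child)
        leaves, internal = leaves + sub_leaves, internal + sub_internal
    return leaves, internal


tree = suffix_tree(["ATATG", "TTAGTA"])
print(sorted(tree["children"]))
print(count_nodes(tree))
```

For the two strings of FIG. 4, the root has edges beginning with $, A, G and T, and the leaf reached through T, A, $ holds the position (1, 4) of the final "TA" in "TTAGTA".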

Let n be the total number of characters of the strings stored in the suffix tree. Since the number of leaves 403 is at most n, the number of non-leaf nodes is at most n−1. The suffix tree therefore has at most 2n−1 nodes in total, which is O(n). The leaf reached from the root node by following a single edge labeled '$' is associated with the empty set φ.

● Path label
The path label of a node 407 in a suffix tree is the string obtained by concatenating the labels of the edges on the path 408 from the root node 402 to that node 407.
● Suffix link
Let v be a non-leaf node of the suffix tree. When the path label of v can be written as the concatenation as of a character a and a string s, the pointer from v to the node whose path label is s is called a suffix link. In this specification the suffix link of v is written suffixlink(v). It is known that if the path label of v is as, the node v(s) to which suffixlink(v) should point always exists (Non-Patent Document 4, chapter 6). Item 409 in FIG. 4 is an example of a suffix link. Where no confusion can arise, the notation "suffixlink(v)" is also used for the node that suffixlink(v) points to.
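The existence of the suffix link target can be observed on a small example without building the tree. By (S4) and (S6), a string is the path label of a non-leaf node when it is followed by at least two different characters in the $-terminated inputs; the sketch below uses that characterization. It is a brute-force illustration under that assumption, with hypothetical names, and treats the root as the node with the empty path label.

```python
def internal_path_labels(inputs):
    """Path labels of non-leaf nodes: strings followed by >= 2 distinct characters."""
    followers = {}
    for p in inputs:
        t = p + "$"
        for i in range(len(t)):
            for j in range(i, len(t)):
                # t[i:j] is a substring without '$'; t[j] is the character after it
                followers.setdefault(t[i:j], set()).add(t[j])
    return {s for s, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(inputs):
    """Map each non-root internal path label 'as' to the label 's' it links to."""
    return {label: label[1:] for label in internal_path_labels(inputs) if label}


labels = internal_path_labels(["ATATG", "TTAGTA"])
links = suffix_links(["ATATG", "TTAGTA"])
print(sorted(labels))
print(sorted(links.items()))
print(all(target in labels for target in links.values()))
```

Every link target is itself the path label of a non-leaf node, which is the property cited from Non-Patent Document 4.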

Next, the symbols, terms and concepts specific to this specification are defined.
● Input strings (see 301 in FIG. 3)
The biopolymer sequences, such as base sequences or amino acid sequences, given as input to the method of the present invention.
● N
The total number of characters of the input strings to which the method is applied.
● k
The number of input strings to which the method is applied.
● u
A parameter given to the method: an integer specifying the length of the shortest string to be regarded as an exon sequence.

● p(v)
The path label of node v.
● p'(v)
If the path label p(v) of node v ends with $, the string obtained by removing that $; otherwise, the same string as p(v).

● v(s)
The node whose path label is the string s.
● v'(s)
v(s$) if, on every input string Si in which s occurs, s occurs only as a suffix of Si; otherwise v(s).

● Σ
The set of all distinct characters occurring in the input strings.
● Si
The i-th input string, where 0 ≤ i < k.

● MM (see 303 in FIG. 3)
A string satisfying conditions (M1)-(M4) above.
● EG (see 305 in FIG. 3)
A string satisfying conditions (E1)-(E5) above.
● Right MM (see 302 in FIG. 3)

A string satisfying the following conditions (R1)-(R3) is called a right MM.
(R1) It is a substring of one or more input strings.
(R2) Its length is u or more.
(R3) A right MM r satisfies Q(r) ≠ Q(ra) for every character a. That is, a right MM cannot be extended to the right.
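As with MMs, the right MM conditions can be tested exhaustively on a small input. The Python sketch below is a brute-force illustration of (R1)-(R3) only, with assumed function names; it does not use the suffix tree on which the actual extraction step is based.

```python
def appearances(s, inputs):
    """Q(s): set of (string number, 0-based start position) pairs of s."""
    return {(i, j) for i, p in enumerate(inputs)
            for j in range(len(p) - len(s) + 1) if p.startswith(s, j)}


def right_maximal_matches(inputs, u):
    """Return all strings satisfying (R1)-(R3), by exhaustive enumeration."""
    alphabet = set("".join(inputs))
    subs = {p[i:j] for p in inputs                       # (R1), (R2)
            for i in range(len(p)) for j in range(i + u, len(p) + 1)}
    return sorted(r for r in subs                        # (R3)
                  if all(appearances(r + a, inputs) != appearances(r, inputs)
                         for a in alphabet))


print(right_maximal_matches(["ATATG", "TTAGTA"], 2))
```

For the strings of FIG. 4 with u = 2, the result consists of the branching strings "AT" and "TA" together with every suffix of length 2 or more, since a string ending at the end of an input cannot be extended to the right.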

For any right MM r, one can take the longest string s, possibly empty, satisfying Q(sr) + |s| = Q(r); then sr satisfies (M1)-(M4) and is therefore an MM. Hence the following property (R4) holds.
(R4) For any right MM r, there exists an MM m that has r as a suffix and satisfies Q(m) + |m| − |r| = Q(r).

Next, suppose that a right MM r is a prefix of some MM m. For any character a, Q(am)+1 ⊂ Q(m), so there is an appearance (i,j) of m for which (i,j-1) is not an appearance of am. Hence Si[j-1] ≠ a, and so (i,j-1) is not an appearance of ar either. However, since (i,j) is an appearance of m, it is also an appearance of its prefix r. Therefore Q(ar)+1 ≠ Q(r). Thus r satisfies (M4) and is an MM. In other words, the following property (R5) holds for right MMs.
(R5) A right MM that is a prefix of an MM is itself an MM.

●Left MM
A string that satisfies the following conditions (L1)-(L3) is called a left MM.
(L1) It is a substring of at least one input string.
(L2) Its length is at least u.
(L3) A left MM l satisfies Q(l) ≠ Q(al)+1 for every character a. That is, a left MM cannot be extended to the left.

●Right EG-holder (see 304 in FIG. 3)
A string that satisfies the following conditions (H1)-(H3) is called a right EG-holder.
(H1) It is a right MM.
(H2) It satisfies at least one of the following conditions (H2a) and (H2b).
(H2a) It is an MM.
(H2b) It is a string h for which there exists an MM m such that mh is a substring of at least one input string.
(H3) It does not have a right MM as a proper prefix.

As described above, the present invention extracts the right MMs, then the left MMs or the MMs, and then the right EG-holders in turn, and uses these results to extract the EGs. FIG. 5 is a flowchart showing the overall processing according to the present invention. The methods for extracting right MMs, left MMs, MMs, right EG-holders, and EGs are described below with reference to FIG. 5.

■Construction of the suffix tree T (step S501)
The method of the present invention first builds a suffix tree T from the input strings. Using Ukkonen's algorithm (Non-Patent Document 4, chapter 6), T can be built in O(N) time, including the step of setting all suffix links. The method then repeatedly performs depth-first searches on the suffix tree T.

■Extraction of right MMs (step S502)
The suffix tree T is traversed depth-first. When a node v is visited, v is marked "right MM" if the length of the string p'(v) is at least u and the label of the edge from the parent of v to v is not "$". The path label length |p(v)| can be computed in constant time per node by maintaining, during the depth-first search, the sum of the edge lengths on the path from the root node to v. If the label of the last edge traversed ends with $, then |p'(v)| = |p(v)|-1; otherwise |p'(v)| = |p(v)|.
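The marking rule of step S502 can be sketched as follows. This is only an illustration under stated assumptions: the tree is built by naive suffix insertion in quadratic time rather than by Ukkonen's algorithm, full path labels are carried during the traversal for readability instead of only their lengths, and the names `Node`, `build_naive_tree`, and `mark_right_mm` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}     # first edge character -> (edge label, child)
        self.apps = []         # appearances (i, j); filled only at leaves
        self.right_mm = False  # the "right MM" mark of step S502


def build_naive_tree(strings):
    """Naive generalized suffix tree of the strings Si + '$' (not Ukkonen)."""
    root = Node()
    for i, s in enumerate(strings):
        t = s + "$"
        for j in range(len(t)):
            node, rest = root, t[j:]
            while rest:
                c = rest[0]
                if c not in node.children:        # new leaf for this suffix
                    leaf = Node()
                    node.children[c] = (rest, leaf)
                    node, rest = leaf, ""
                    break
                label, child = node.children[c]
                n = 0                             # length of the common prefix
                while n < len(label) and n < len(rest) and label[n] == rest[n]:
                    n += 1
                if n < len(label):                # split the edge at offset n
                    mid = Node()
                    mid.children[label[n]] = (label[n:], child)
                    node.children[c] = (label[:n], mid)
                    child = mid
                node, rest = child, rest[n:]
            node.apps.append((i, j))
    return root


def mark_right_mm(root, u):
    """Mark v when |p'(v)| >= u and the edge into v is not '$'."""
    marked = []
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for label, child in node.children.values():
            p = path + label
            p_stripped = p[:-1] if p.endswith("$") else p   # this is p'(v)
            if len(p_stripped) >= u and label != "$":
                child.right_mm = True
                marked.append(p_stripped)
            stack.append((child, p))
    return marked
```

For the inputs ATATG and TTAGTA with u = 2, the marked strings are AT, TA, TG, ATG, GTA, TATG, AGTA, ATATG, TAGTA, and TTAGTA, which are exactly the substrings of length at least 2 that either end an input string or are followed by two different characters.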

When this processing is complete, p'(v) is a right MM for every node v marked "right MM", and these are the only right MMs. The reason is as follows.
Regarding (R1):
The path label p(v) of any node of the suffix tree T is a prefix of a suffix of Si$, the string obtained by appending $ to some input string Si. Hence the string p'(v), obtained by removing a trailing $ from p(v) if there is one, is a prefix of a suffix of Si, and so it is a substring of at least one input string. Therefore p'(v) satisfies (R1) for every node v marked "right MM".
Regarding (R2):
The method marks a node v "right MM" only if |p'(v)| ≥ u. Therefore p'(v) satisfies (R2) for every such node v.
Regarding (R3):
For a node v marked "right MM", whichever character a ∈ Σ is chosen, there is an appearance (i,j) of p'(v) that is not an appearance of p'(v)a. If no such appearance existed, the labels of all edges leaving v would begin with a, contradicting condition (S4) of the suffix tree. Therefore p'(v) satisfies (R3).

We now show that any string other than p'(v) of a node v marked "right MM" by the method fails at least one of conditions (R1)-(R3).

First, a string that does not occur as a prefix of the path label of any node of the suffix tree T is not a substring of any input string, and therefore violates (R1).

Next, let r be a string that is a $-free prefix of the path label of some node, and suppose that the node v'(r) does not exist. If r occurred only as a suffix of input strings, v'(r) = v(r$) would exist, contradicting the assumption; hence r occurs in some input string Si at a position where it is not a suffix. That is, j+|r| < |Si| for some appearance (i,j) ∈ Q(r). Let a = Si[j+|r|]; since a is a character of Si, a ∈ Σ. If r were also a suffix of some input string, there would be both a path label with prefix ra and a path label with prefix r$, so v'(r) = v(r) would exist, contradicting the assumption. Hence r is not a suffix of any input string. If, for some character b other than a, there were an input string containing rb as a substring, there would be both a path label with prefix ra and a path label with prefix rb, so v'(r) = v(r) would exist, again contradicting the assumption; hence no input string contains rb as a substring. Consequently r is always followed by the character a ∈ Σ in the input strings, so Q(r) = Q(ra), and r does not satisfy (R3).

Furthermore, the method does not mark any node v with |p'(v)| < u, and for such a node v the string p'(v) does not satisfy (R2).

No strings remain other than those examined above and the strings p'(v) of the nodes v marked "right MM". In other words, for every right MM r, the method marks v'(r) as "right MM", with no omissions and no spurious marks.

■Extraction of left MMs (step S503)
Left MMs are extracted in the same way as the suffix tree T was built in step S501 and the right MMs were extracted in step S502. If MMs are extracted in step S504, step S503 for extracting left MMs may be omitted.

In step S503, to extract the left MMs, a suffix tree T' is built for the strings obtained by reversing every input string. For example, if the input strings are ATATG and TTAGTA, the suffix tree T' is built for their reversals GTATA and ATGATT. The right MM extraction method of the present invention is then executed, reading "suffix tree T" as "suffix tree T'" and "right MM" as "left MM". That this method extracts the left MMs with no omissions and no spurious results is clear from the fact that step S502 extracts the right MMs in the same way.
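The reversal idea of step S503 can be illustrated without a suffix tree. The brute-force Python sketch below assumes that Q(s) denotes the set of appearances (i, j) of s; it tests a candidate left MM by reversing both the inputs and the candidate and applying the right MM conditions (R1)-(R3). The function names are hypothetical.

```python
def appearances(strings, s):
    """Assumed meaning of Q(s): all (i, j) with strings[i][j:j+len(s)] == s."""
    return {(i, j)
            for i, t in enumerate(strings)
            for j in range(len(t) - len(s) + 1)
            if t[j:j + len(s)] == s}


def is_right_mm(strings, r, u):
    """Brute-force check of (R1)-(R3)."""
    q = appearances(strings, r)
    if not q or len(r) < u:
        return False
    alphabet = set("".join(strings))
    return all(appearances(strings, r + a) != q for a in alphabet)


def is_left_mm(strings, l, u):
    """A string is a left MM iff its reversal is a right MM of the reversed inputs."""
    reversed_strings = [s[::-1] for s in strings]
    return is_right_mm(reversed_strings, l[::-1], u)
```

With the inputs ATATG and TTAGTA and u = 2, TA is a left MM because it is preceded by A, T, and G in different places, whereas TG is not because it is always preceded by A.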

■Extraction of MMs (step S504)
The method for extracting MMs is described next. If left MMs have been extracted in step S503, step S504 for extracting MMs may be omitted.

Every MM satisfies (R1)-(R3) by (M1)-(M3), and is therefore a right MM. For every right MM r, v'(r) has already been marked "right MM" by the method above, so it suffices to mark v'(r) "MM" when r also satisfies (M4). The nodes corresponding to right MMs that satisfy (M4) can be identified by applying the "left diverse node" identification method described on pp. 144-145 of Non-Patent Document 4, modified as follows.
(1) The definition of a "left diverse" node v in Non-Patent Document 4 is changed to "a node v such that p'(v) satisfies (M4)".
(2) For a leaf node v, the method does not decide, as Non-Patent Document 4 does, that "a leaf is never left diverse". Instead, if there is a character a ∈ Σ such that j ≥ 1 and Si[j-1] = a hold for every appearance (i,j) of p'(v), the "left character" of v is set to a; if no such a exists, v is judged to be "left diverse".
(3) For a non-leaf node v, if none of the child nodes of v is "left diverse" and the "left characters" of all child nodes are one and the same character a ∈ Σ, then v itself is not "left diverse" and its "left character" is a; otherwise, v is "left diverse".
(4) If a node judged to be "left diverse" is marked "right MM", it is marked "MM".
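Rules (2) and (3) above can be sketched in isolation as two small helper functions. The sketch assumes that each leaf stores its appearances (i, j) and that the values of all children are available when a non-leaf node is processed, as in a bottom-up traversal; the names `LEFT_DIVERSE`, `leaf_left_character`, and `internal_left_character` are hypothetical and do not come from Non-Patent Document 4.

```python
LEFT_DIVERSE = "<left diverse>"  # sentinel; never equal to a single character


def leaf_left_character(strings, apps):
    """Rule (2): the common left character of a leaf, or LEFT_DIVERSE."""
    left_chars = set()
    for i, j in apps:
        if j == 0:                    # nothing to the left of this appearance
            return LEFT_DIVERSE
        left_chars.add(strings[i][j - 1])
    return left_chars.pop() if len(left_chars) == 1 else LEFT_DIVERSE


def internal_left_character(child_values):
    """Rule (3): combine the values already computed for all child nodes."""
    if LEFT_DIVERSE in child_values:
        return LEFT_DIVERSE
    distinct = set(child_values)
    return distinct.pop() if len(distinct) == 1 else LEFT_DIVERSE
```

For the inputs ATATG and TTAGTA, the three leaves below the node with path label TA have the appearances (0,1), (1,1), and (1,4), whose left characters are A, T, and G. Rule (3) therefore makes the node for TA left diverse, and rule (4) would mark it "MM" if it is also marked "right MM".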

The running time of this method is on the order of the total number of appearances associated with the leaves plus the number of nodes of the suffix tree T. The former equals the total input length N, and the latter is at most 2N-1, so the total processing time is O(N).

■Extraction of MM end positions (step S505)
In step S505, the two-dimensional arrays Head and Tail are set as follows, using the right MM information extracted in step S502 together with either the left MM information extracted in step S503 or the MM information extracted in step S504. Head[i,j] and Tail[i,j] are defined for (i,j) with 0 ≤ i < k and 0 ≤ j < |Si|.
Head[i,j]:
1 if there is an MM that has (i,j) as an appearance; 0 otherwise (see FIG. 7).
Tail[i,j]:
1 if there is an MM m that has (i,j-|m|+1) as an appearance; 0 otherwise (see FIG. 7).
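The definitions of Head and Tail can be made concrete by direct enumeration. The following Python sketch takes the set of MMs as a given input (here a hypothetical list `mms` supplied by the caller) and fills both arrays by scanning every appearance; it illustrates the definitions only and is not the linear-time procedure of the present method. The function name `head_tail` is hypothetical.

```python
def head_tail(strings, mms):
    """Fill Head and Tail from a given list of MM strings by enumeration."""
    head = [[0] * len(s) for s in strings]
    tail = [[0] * len(s) for s in strings]
    for m in mms:
        for i, s in enumerate(strings):
            j = s.find(m)
            while j != -1:
                head[i][j] = 1               # m starts at (i, j)
                tail[i][j + len(m) - 1] = 1  # m ends at (i, j + |m| - 1)
                j = s.find(m, j + 1)         # also finds overlapping appearances
    return head, tail
```

For the single input ATATG and the single MM candidate AT, the appearances are (0,0) and (0,2), so Head is set at positions 0 and 2 and Tail at positions 1 and 3.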

First, two methods for setting the values of Tail are described. The first method is to execute the following steps (t1)-(t3).
(t1) Initialize all elements of the two-dimensional array Tail 702 to 0.
(t2) Initialize the list minimal_right_MM to the empty list.
(t3) Traverse the suffix tree T depth-first and perform (t3a)-(t3c) for each node v visited.
(t3a) For the character a ∈ Σ and the string s such that p'(v) = as, identify v'(s). If v is not a leaf, s does not contain $, so v'(s) = v(s) = suffixlink(v). If v is a leaf, proceed as follows. During the depth-first search, always remember the parent of the node currently being processed. Let w be the parent of the leaf v; clearly w is not a leaf. Let w' = suffixlink(w). Moving down the suffix tree from w' along the string that labels the edge between w and v reaches the leaf w'', whose path label with the trailing $ removed is s. If the label of the edge leading directly into w'' is the one-character string "$", then v'(s) is the parent node of w''; otherwise v'(s) = w''.
(t3b) If v'(s) is not marked "right MM", append |p'(v)| to the end of the list minimal_right_MM. The value appended to minimal_right_MM is deleted when the processing of the subtree rooted at v is finished. This deletion is done simply by removing the last element of minimal_right_MM.
(t3c) If v is a leaf, set Tail[i,j+k-1] 702 to 1 for every appearance (i,j) associated with v and every element k of the list minimal_right_MM.

In this method, instead of computing the positions of MMs directly when setting the values of Tail, the end positions are computed for the strings r that are right MMs having no other right MM as a suffix (hereinafter, minimal right MMs). That is, for every minimal right MM r and every appearance (i,j) of r, 1 is written to Tail[i,j+|r|-1] 702. The set of end positions of all minimal right MMs is equal to the set of end positions of all MMs. This is because, by (R4), for every right MM r there is an MM m that has r as a suffix and satisfies Q(r)+|r| = Q(m)+|m|, while every MM has a minimal right MM as a suffix. Therefore, the method (t1)-(t3) sets Tail[i,j] 702 to 1 at the end positions of all MMs.

We now examine the time required by the method (t1)-(t3). The processing time of (t1) is O(N), where N is the total length of the input strings, and the processing time of (t2) is O(1). In (t3), the processing other than (t3a) and (t3c) takes O(N) in total, because T has O(N) nodes and the processing time per node is O(1). Next, suppose that two different minimal right MMs had the same end position. Then the longer one would have the shorter one as a suffix, contradicting the assumption that both are minimal right MMs. Hence there is at most one minimal right MM for each end position, and (t3c) never updates Tail[i,j] more than once for the same (i,j). Since Tail has N elements, the processing time of (t3c) is also O(N).

We now show that the total time required for (t3a) is O(N). When v is not a leaf, v'(s) = suffixlink(v) is identified in O(1), so the non-leaf nodes take O(N) in total. When v is a leaf, identifying v'(s) from w'' takes O(1) per leaf and O(N) in total. It remains to bound the time needed to identify w''.

Identifying the leaf w'' from a leaf v is done once per leaf v, and the total time is no more than the time needed to identify, for every 0 ≤ i < k, the leaves of the string Si$ in order from its first suffix to its |Si|-th suffix. We therefore bound the latter. Let x be a variable that denotes a node of the suffix tree T, and consider the process of moving from the leaf v, whose path label is the j-th suffix of Si followed by $, to the leaf w'', whose path label is the (j+1)-th suffix followed by $. The value of x is updated step by step to the node currently under consideration. The part of this process in which x moves from w' to w'' is called a down-walk below. Updating x from v to w takes O(1), because the parent node is always remembered during the depth-first search, and updating x from w to w' also takes O(1) using the suffix link. Following the path from w' to w'' takes O(1) per node passed, and hence O(n) for n nodes on that path.

We show that the time spent on down-walks, summed over the whole processing of the string Si, is O(|Si|). Define the node depth of x as the number of nodes on the path from the root node to x (including the root node and x), and consider how it changes. The node depth of x increases by 1 each time x passes a node on its way from w' to w''. It decreases by 1 when x is updated from the leaf v to its parent w. When x is then updated to w', the node depth decreases by at most 1. This is because, in general, |p(v1)| = |p(v2)|+1 whenever v2 = suffixlink(v1), and no two nodes on the path from the root node to w have the same path label length; hence every node other than the root node on the path from the root node to w has a suffix link to a distinct node on the path from the root node to w', so the node depth of w' is at least (node depth of w)-1. Therefore, while the leaves corresponding to the suffixes of Si are visited in order, the node depth of x decreases by at most 2×|Si| in total. On the other hand, the longest suffix of the string obtained by appending $ to Si has length |Si|+1, so the node depth of a leaf corresponding to a suffix of Si$ is always at most |Si|+2. Hence, if n is the total number of times x is updated in all down-walks for Si, then n-2×|Si| ≤ |Si|+2, and so n ≤ 3|Si|+2 = O(|Si|). The time needed to visit the leaves corresponding to the suffixes of Si is therefore O(|Si|), and it follows immediately that the time needed to visit all leaves is O(Σ|Si|) = O(N).

The second method, instead of computing the positions of MMs directly when setting the values of Tail, computes the positions of the right MMs r that do not have another right MM as a prefix and, for every such r and every appearance (i,j) of r, writes 1 to Tail[i,j+|r|-1] 702. The set of end positions of such right MMs r (hereinafter, prefix-minimal right MMs) is equal to the set of end positions of all MMs, for the following reason. First, from (R4) and (R5), for every right MM r there is an MM m that has r as a suffix and satisfies Q(r) = Q(m)+|m|-|r|. Equation (1) therefore holds: the set of all end positions of prefix-minimal right MMs is contained in the set of end positions of MMs.

In Equation (1), pmrMM denotes the set of prefix-minimal right MMs. Conversely, every suffix s of an MM m with length at least u is a right MM: s satisfies (R1) because it is a substring of m, satisfies (R2) because its length is at least u, and satisfies (R3) by the following argument. First, if m is a suffix of some input string, then s is also a suffix of that string, and s clearly satisfies (R3). If m is not a suffix of any input string, then there are two distinct characters a and b and appearances (i,j) and (i',j') such that (i,j) ∈ Q(m), Si[j+|m|] = a, (i',j') ∈ Q(m), and Si'[j'+|m|] = b. That is, m is followed on the right by a in one place and by b in another. Since s is the suffix of m that starts at its (|m|-|s|)-th character, (i,j+|m|-|s|) ∈ Q(s), Si[(j+|m|-|s|)+|s|] = Si[j+|m|] = a, (i',j'+|m|-|s|) ∈ Q(s), and Si'[(j'+|m|-|s|)+|s|] = Si'[j'+|m|] = b. Thus s is also followed by a in one place and by b in another, so Q(s) ≠ Q(sa) for every character a ∈ Σ, and s satisfies (R3).

In addition, an MM m is itself a right MM of length at least u, so m always has a suffix of length at least u. In particular, its suffix r of length exactly u is a right MM by the argument above, and since every proper prefix of r has length at most u-1, r has no right MM as a proper prefix. That is, r is a prefix-minimal right MM, and every MM has a prefix-minimal right MM as a suffix. Equation (2) therefore holds: the set of all end positions of prefix-minimal right MMs contains the set of end positions of MMs.

From Equations (1) and (2), Equation (3) follows: the set of all end positions of prefix-minimal right MMs is equal to the set of end positions of MMs.

Based on the above, Tail[i,j] 702 corresponding to the ends of all right MMs can be set to 1 by the following method (t'1)-(t'3).
(t'1) Initialize all elements of the two-dimensional array Tail 702 to 0.
(t'2) Initialize the variable k to -1.
(t'3) Traverse the suffix tree T depth-first and perform (t'3a) and (t'3b) for each node v visited.
(t'3a) If v is marked "right MM" and k < 0, assign |p'(v)| to k. When the depth-first search of the subtree rooted at this v (including v itself) is finished, reset k to -1.
(t'3b) If v is a leaf and k > 0, set Tail[i,j+k-1] 702 to 1 for every appearance (i,j) associated with v.
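Steps (t'1)-(t'3) can be sketched as follows. The sketch makes several assumptions that are not in the text: the tree is built by naive suffix insertion rather than by Ukkonen's algorithm, the "right MM" mark is recomputed on the fly from the rule of step S502, u is at least 1, and the value of k is carried on the traversal stack, which has the same effect as resetting it when a subtree is finished. The names `Node`, `build_naive_tree`, and `set_tail` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # first edge character -> (edge label, child)
        self.apps = []      # appearances (i, j); filled only at leaves


def build_naive_tree(strings):
    """Naive generalized suffix tree of the strings Si + '$' (not Ukkonen)."""
    root = Node()
    for i, s in enumerate(strings):
        t = s + "$"
        for j in range(len(t)):
            node, rest = root, t[j:]
            while rest:
                c = rest[0]
                if c not in node.children:        # new leaf for this suffix
                    leaf = Node()
                    node.children[c] = (rest, leaf)
                    node, rest = leaf, ""
                    break
                label, child = node.children[c]
                n = 0                             # length of the common prefix
                while n < len(label) and n < len(rest) and label[n] == rest[n]:
                    n += 1
                if n < len(label):                # split the edge at offset n
                    mid = Node()
                    mid.children[label[n]] = (label[n:], child)
                    node.children[c] = (label[:n], mid)
                    child = mid
                node, rest = child, rest[n:]
            node.apps.append((i, j))
    return root


def set_tail(strings, u):
    root = build_naive_tree(strings)
    tail = [[0] * len(s) for s in strings]          # (t'1)
    stack = [(root, 0, -1)]                         # node, |p(v)|, k; (t'2)
    while stack:                                    # (t'3)
        node, depth, k = stack.pop()
        if not node.children and k > 0:             # (t'3b): v is a leaf
            for i, j in node.apps:
                tail[i][j + k - 1] = 1
        for label, child in node.children.values():
            d = depth + len(label)
            d_stripped = d - 1 if label.endswith("$") else d   # |p'(child)|
            is_right_mm = d_stripped >= u and label != "$"
            # (t'3a): remember the first right MM on the path from the root
            child_k = d_stripped if (k < 0 and is_right_mm) else k
            stack.append((child, d, child_k))
    return tail
```

For the inputs ATATG and TTAGTA with u = 2, the prefix-minimal right MMs are AT, TA, TG, GTA, AGTA, and TTAGTA, and the resulting Tail rows are [0,1,1,1,1] and [0,0,1,0,0,1].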

With this method, 1 is written to Tail[i,j+|r|-1] if and only if there is a prefix-minimal right MM r that begins at the j-th character of input string i. At all other positions, Tail[i,j] never becomes 1. Since the depth-first search never processes the same leaf more than once, each appearance (i,j) causes at most one update of Tail. Therefore (t'1)-(t'3) can be completed in O(N) time.

Note that a method that, instead of computing only the positions of minimal right MMs or prefix-minimal right MMs, sets Tail[i,j+|m|-1] = 1 for every MM m and every appearance (i,j) at which it occurs may not finish in O(N) time. Assume that N is divisible by the number of strings k and that N ≥ 3k(u+k), and consider the set of strings {Si = T^(u+i) A T^(N/k-(u+i+1))} (0 ≤ i < k) (FIG. 6), where "a^n" denotes n repetitions of the character a. In this set, each of the k-1 strings T^u, ..., T^(u+k-2), consisting of u to u+k-2 consecutive Ts, satisfies (R1)-(R3), because Si[0..u+i-1] = T^(u+i), Si[u+i] = A, S(i+1)[0..u+i-1] = T^(u+i), and S(i+1)[u+i] = T; hence they are right MMs. Here S(i+1) denotes the (i+1)-th input string. In Sj (0 ≤ j < k), Sj[u+j] = A is followed by N/k-(u+j+1) Ts, so T^(u+i) occurs in Sj at least N/k-(u+j+1)-(u+i)+1 = N/k-2u-i-j times (from N ≥ 3k(u+k), i ≤ k, and j ≤ k, we have N/k-2u-i-j ≥ k(u+k) ≥ 0). Therefore, by Equation (4), the number of appearances of MMs over the whole set of strings is at least (k-1)N/3, which is Ω(Nk). This shows that a method that enumerates all appearances of all MMs to find their ends cannot always set the values of Tail in O(N) time.
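The string family used in this argument is easy to generate. The Python sketch below builds Si = T^(u+i) A T^(N/k-(u+i+1)) for 0 ≤ i < k and counts appearances of a pattern, overlaps included; the function names `worst_case_strings` and `count_appearances` are hypothetical.

```python
def worst_case_strings(n_total, k, u):
    """Build S_i = T^(u+i) A T^(N/k-(u+i+1)) for 0 <= i < k."""
    if n_total % k != 0 or n_total < 3 * k * (u + k):
        raise ValueError("requires k | N and N >= 3k(u+k)")
    per_string = n_total // k
    return ["T" * (u + i) + "A" + "T" * (per_string - (u + i + 1))
            for i in range(k)]


def count_appearances(strings, s):
    """Number of appearances (i, j) of s, counting overlapping ones."""
    return sum(1
               for t in strings
               for j in range(len(t) - len(s) + 1)
               if t.startswith(s, j))
```

With N = 24, k = 2, and u = 2, the two strings are TTATTTTTTTTT and TTTATTTTTTTT, each of length N/k = 12, and the pattern T^u = TT already has 18 appearances across them.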

Next, the method for setting Head[i,j] 701 is described. If the extraction of left MMs has been completed in step S503, Head[i,j] can be set by applying the above method for setting Tail, reading "suffix tree T" as "suffix tree T'" and "Tail[i,j]" as "Head[i,|Si|-1-j]". That this sets the two-dimensional array Head correctly follows directly from the fact that the above method sets Tail correctly, and the processing completes in O(N) time.

On the other hand, when step S504 has been executed instead of step S503, the suffix tree T is traversed depth-first. When a node v is found that is marked "MM" and has no other node marked "MM" on the path from the root node to v, Head[i,j] 701 is set to 1 for every appearance (i,j) associated with every leaf encountered in the subtree of v. With this method, Head[i,j] 701 is never updated more than once for the same appearance (i,j), so the processing completes in O(N) time. The positions of an MM m' that has another MM m as a proper prefix are not computed by this method; however, since m is a prefix of m' and Q(m) ⊇ Q(m'), if m' is a substring beginning at the j-th character of input string i, then so is m, and Head[i,j] is correctly set to 1. It is also clear that Head[i,j] does not become 1 when no string beginning at the j-th character of input string i is an MM.

■ Extraction of right EG-holders (step S506)
The following method (h1)-(h3) marks every node v for which p'(v) is a right EG-holder as "is a right EG-holder". Steps (h1)-(h2) set the values of a two-dimensional array Tv, and step (h3) applies the "is a right EG-holder" marks.
(h1) Prepare a two-dimensional array Tv and initialize all of its elements to the null pointer. Like Head and Tail, Tv is defined for an input string i and a non-negative integer j smaller than the length of that string. In addition, prepare a variable shortest_right_MM and assign the null pointer to it.
(h2) Traverse suffix tree T depth-first and perform (h2a) and (h2b) for each node v encountered.
(h2a) If node v is marked "is a right MM" and shortest_right_MM is the null pointer, assign a pointer to v to shortest_right_MM. When processing of the subtree rooted at this v is finished, reset shortest_right_MM to the null pointer.
(h2b) If node v is a leaf, assign the value of shortest_right_MM to Tv[i,j] for every appearance (i,j) associated with the path label of this leaf.
(h3) For each (i,j) (0 ≤ i < k, 0 ≤ j < |Si|) that satisfies both of the following conditions (h3a) and (h3b), mark the node pointed to by Tv[i,j] as "is a right EG-holder".
(h3a) Head[i,j] is 1, or j ≥ 1 and Tail[i,j-1] is 1.
(h3b) Tv[i,j] is not the null pointer.
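Steps (h1)-(h3) can be sketched in Python as below. The `Node` class, its field names, and `mark_right_eg_holders` are hypothetical, the tree is built by hand, and passing the current value of shortest_right_MM down the traversal stands in for the explicit reset in (h2a). Treat it as an illustration under those assumptions.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    is_right_mm: bool = False
    is_right_eg_holder: bool = False
    children: list = field(default_factory=list)
    appearances: list = field(default_factory=list)   # (i, j) pairs on leaves


def mark_right_eg_holders(root, head, tail):
    # (h1) Tv has the same shape as Head and Tail; None plays the null pointer.
    tv = [[None] * len(row) for row in head]
    # (h2) Depth-first traversal; `shortest` is the topmost right-MM node on the path.
    stack = [(root, None)]
    while stack:
        node, shortest = stack.pop()
        if shortest is None and node.is_right_mm:      # (h2a)
            shortest = node
        for i, j in node.appearances:                  # (h2b)
            tv[i][j] = shortest
        for child in node.children:
            stack.append((child, shortest))
    # (h3) Mark nodes whose appearance starts at an MM head or right after an MM tail.
    for i, row in enumerate(tv):
        for j, v in enumerate(row):
            at_boundary = head[i][j] == 1 or (j >= 1 and tail[i][j - 1] == 1)
            if at_boundary and v is not None:
                v.is_right_eg_holder = True
    return tv


# Hand-built example with one input string of length 4.
a = Node(is_right_mm=True,
         children=[Node(appearances=[(0, 0)]), Node(appearances=[(0, 3)])])
b = Node(is_right_mm=True, children=[Node(appearances=[(0, 1)])])
root = Node(children=[a, b, Node(appearances=[(0, 2)])])
head = [[1, 0, 0, 0]]
tail = [[0, 0, 0, 0]]
tv = mark_right_eg_holders(root, head, tail)
```

In this example only node `a` is marked, because position 0 is the only position flagged in Head and no position follows a flagged Tail entry.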

It is now shown that the method (h1)-(h3) marks as "is a right EG-holder" exactly those nodes v for which p'(v) is a right EG-holder. First, step (h2) sets Tv[i,j] to the following value.
(1) If one or more right MMs have (i,j) as an appearance, a pointer to v'(r), where r is the right MM among them that has no other right MM as a prefix.
(2) If no right MM has (i,j) as an appearance, the null pointer.

On this basis, it is shown that p'(v) is a right EG-holder for every node v marked "is a right EG-holder" by the processing (h1)-(h3).
Regarding (H1):
Only nodes marked "is a right MM" are ever marked "is a right EG-holder", so p'(v) satisfies (H1).
Regarding (H3):
If node v is marked "is a right EG-holder", then Tv[i,j] = v'(p(v)) for some appearance (i,j) of p'(v). Assuming that a right MM r exists that is a proper prefix of p'(v) contradicts the definition of Tv[i,j]. Hence p'(v) satisfies (H3).
Regarding (H2):
The method of the present invention marks "is a right EG-holder" only on nodes pointed to by a Tv[i,j] for which Head[i,j] is 1, or j ≥ 1 and Tail[i,j-1] is 1. Therefore, for a node v marked "is a right EG-holder", p'(v) is either a right MM that is a prefix of an MM m having (i,j) as an appearance, or a right MM that starts at the character following some MM in input string Si. In the former case, this right MM is itself an MM by (R5). Therefore (H2) is also satisfied.

Next, assume that for some right EG-holder h, v'(h) is not marked as a right EG-holder. First assume that no Tv[i,j] receives a pointer to v'(h). Then either h is not a right MM, or a right MM exists that is a proper prefix of h. By (H1) and (H3) respectively, each case contradicts the fact that h is a right EG-holder, so a pointer to v'(h) must be stored in Tv[i,j]. If h is nevertheless not marked "is a right EG-holder", then for every appearance (i,j) of h, Head[i,j] = 0 and Tail[i,j-1] = 0 (when j ≥ 1). In that case, for every MM m, neither (i,j) nor (i,j-|m|) is an appearance of m. Such an h satisfies neither (H2a) nor (H2b), which again contradicts the fact that h is a right EG-holder. Therefore the assumption that v'(h) is not marked "is a right EG-holder" is false, and the method of the present invention marks exactly the nodes that correspond to right EG-holders.

■ Extraction of EGs (step S507)
EGs can be extracted exactly, with none missed and none spurious, by the following method (e1)-(e3). The reason is given later. To extract EGs, the method of the present invention uses the values of Head, Tail, and Tv computed in the preceding steps.
(e1) Provide a variable EGlength(v) for every node v in suffix tree T. Then perform a depth-first traversal and initialize EGlength(v) = |p'(v)|.
(e2) For every input string Si, perform the following processing (e2a)-(e2d).
(e2a) Initialize variables c and j to c = 1 and j = |Si|-1.
(e2b) If Tv[i,j] is not the null pointer, let v be the node pointed to by Tv[i,j] and replace the value of EGlength(v) with the smaller of the value 703 of c and the current value of EGlength(v).
(e2c) If Head[i,j] is 1, or j ≥ 1 and Tail[i,j-1] is 1, assign 1 to c; otherwise add 1 to c.
(e2d) Subtract 1 from j. If j is still 0 or greater, repeat (e2b)-(e2d). If j is less than 0, processing of input string Si is finished.
(e3) Traverse suffix tree T depth-first and mark node v as "is an EG" when v satisfies both of the following conditions (e3a) and (e3b).
(e3a) v is marked "is a right EG-holder".
(e3b) EGlength(v) ≥ u
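The steps (e1)-(e3) can be sketched in Python as follows. The `Node` class, its field names, and `extract_eg` are hypothetical; the sketch iterates over a plain list of nodes in (e1) and (e3) where the text uses a depth-first traversal, and it takes Tv, Head, and Tail as already computed.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Node:
    path_len: int                     # |p'(v)|, path label length without the terminator
    is_right_eg_holder: bool = False
    is_eg: bool = False
    eg_length: int = 0


def extract_eg(nodes, tv, head, tail, u):
    for v in nodes:                                   # (e1)
        v.eg_length = v.path_len
    for i, row in enumerate(tv):                      # (e2)
        c = 1                                         # (e2a)
        for j in range(len(row) - 1, -1, -1):         # (e2d) scan right to left
            v = row[j]
            if v is not None:                         # (e2b)
                v.eg_length = min(v.eg_length, c)
            if head[i][j] == 1 or (j >= 1 and tail[i][j - 1] == 1):   # (e2c)
                c = 1
            else:
                c += 1
    for v in nodes:                                   # (e3)
        v.is_eg = v.is_right_eg_holder and v.eg_length >= u
    return [v for v in nodes if v.is_eg]


# One input string of length 6 with MM boundaries before positions 0 and 4.
a = Node(path_len=5, is_right_eg_holder=True)
b = Node(path_len=2, is_right_eg_holder=True)
tv = [[a, None, None, None, b, None]]
head = [[1, 0, 0, 0, 1, 0]]
tail = [[0, 0, 0, 1, 0, 1]]
egs = extract_eg([a, b], tv, head, tail, u=3)
```

Here the counter c reaches 4 at position 0, so EGlength for `a` is clipped from 5 to 4, while `b` keeps length 2 and is rejected by the threshold u = 3.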

It is now explained that, for each node v marked "is an EG" by the above method, the prefix of length EGlength(v) is an EG, and that only these strings are EGs. Before that explanation, it is shown that the conditions (E1)-(E5) are equivalent to the following conditions (E'1)-(E'5).
(E'1) It is a prefix of some right EG-holder.
(E'2) Its length is u or greater.
(E'3) It does not overlap any MM, either on the right or on the left.
(E'4) It does not contain any MM as a proper substring.
(E'5) It is not a proper prefix of any string that satisfies (E'1)-(E'4).

As preparation for showing that the conditions (E1)-(E5) are equivalent to (E'1)-(E'5), it is shown that a string e satisfying (E'1) and (E'2) satisfies the following (E'6).
(E'6) There exists exactly one EG-holder h that has e as a prefix and satisfies Q(e) = Q(h).

To show that (E'6) holds, assume that there exist two distinct right EG-holders h and h' having e as a prefix. Let s be the string for which es is the longest common prefix of h and h'. Since h and h' are distinct, at least one of them is longer than |es|; without loss of generality, assume |h| > |es|. Let a be the character that follows the prefix es in h. Because es is the longest prefix common to h and h', esa is not a prefix of h'. Hence Q(esa)∩Q(h') = φ, while Q(es)⊇Q(h') ≠ φ, so Q(esa) ≠ Q(es). On the other hand, for any character b other than a, Q(esb)∩Q(h) = φ, while Q(es)⊇Q(h) ≠ φ, so Q(esb) ≠ Q(es). Therefore es satisfies (R3). Since Q(es)⊇Q(h) ≠ φ, es satisfies (R1), and since |es| ≥ |e| ≥ u, it also satisfies (R2). Therefore es is a right MM. Because |h| > |es|, es is a proper prefix of h, which contradicts the fact that h is a right EG-holder and satisfies (H3). Hence the assumption that two distinct right EG-holders h and h' exist is false, and the right EG-holder having e as a prefix is unique.

Meanwhile, since e satisfies (E'1), e is a prefix of some right EG-holder h, and Q(e)⊇Q(h). Assume Q(e)⊃Q(h), and let s be the longest string such that Q(e) = Q(es)⊃Q(h). Then es satisfies (R1) because Q(es) = Q(e) ≠ φ, satisfies (R2) because |es| ≥ |e| ≥ u, and satisfies (R3) by the definition of s, so es is a right MM. Since Q(es)⊃Q(h), |es| < |h|, so es is a proper prefix of h; this contradicts the fact that h is a right EG-holder and satisfies (H3). Therefore the assumption Q(e)⊃Q(h) is false, and Q(e)⊆Q(h). Since e is a prefix of h, Q(e)⊇Q(h), and hence Q(e) = Q(h). This shows that (E'6) holds.

First, it is shown that a string e satisfying (E'1)-(E'5) satisfies (E1)-(E5).
Since the string e satisfies (E'1)-(E'4), it clearly also satisfies (E1)-(E4).

It is shown that (E5) is also satisfied. Assume that e' = ses' ≠ e satisfies (E1)-(E4), and derive a contradiction. By (E'6), there is a unique right EG-holder h having e as a prefix. Since h is a right MM by (H1), by (R4) there exists an MM m that has h as a suffix and satisfies Q(h)+|h| = Q(m)+|m|.

Here, assume |es'| > |h|. Since Q(es')⊆Q(e) = Q(h) = Q(m)+|m|-|h|, on any input string in which ses' occurs, h occurs as a prefix of es' and an m having that h as a suffix occurs, as shown in FIG. 9. Because |es'| > |h|, e' = ses' either overlaps m on the left, violating (E3), or contains m as a proper substring, violating (E4); either way a contradiction arises. Therefore |es'| ≤ |h|, and since Q(es')⊆Q(e) = Q(h), es' is a prefix of h. Furthermore, Q(e)⊇Q(es')⊇Q(h) = Q(e), so Q(es') = Q(h).

Next, assume |s| ≥ 1. If h is an MM, then since Q(e')+|s|⊆Q(es') = Q(h), the substring sh occurs on some input string Si. As shown in FIG. 10, e' = ses' then overlaps the MM h on the right if |es'| < |h|, and contains h as a proper substring if |es'| = |h|. This contradicts the assumption that e' satisfies (E3) and (E4), so h is not an MM. However, h satisfies (H2), so it satisfies (H2b). Therefore, for some MM m', the string m'h occurs on some input string.

Now let t be the longest string satisfying Q(tes')+|t| = Q(es') = Q(h) (FIG. 11). Assume |t| ≥ |s|. Then Q(tes')+|t|⊆Q(ses')+|s|⊆Q(es') = Q(h) = Q(tes')+|t|, and, as shown in FIG. 11, (i,j)∈Q(m'h)+|m'| holds for some (i,j)∈Q(ses')+|s|. In that case, because es' in e' = ses' is a prefix of h in m'h, e' either overlaps m' on the left, contradicting (E3), or contains m' as a proper substring, contradicting (E4). Hence |t| < |s|. Since t is the longest string satisfying Q(tes')+|t| = Q(es') = Q(h), for every (i,j)∈Q(tes')+|t| it holds that (i,j-|t|)∈Q(t) and (i,j)∈Q(es') = Q(h), so Si[j-|t| .. j+|h|-1] = th and (i,j)∈Q(th)+|t|; hence Q(tes')+|t|⊆Q(th)+|t|. Meanwhile, for every a∈Σ, Q(ates')+|t|+1⊂Q(tes')+|t|⊆Q(th)+|t|, so t is also the longest string satisfying Q(th)+|t| = Q(h). Therefore th satisfies (M4) and is an MM. But because |t| < |s|, e' either overlaps th on the right, contradicting (E3), or contains th as a proper substring, contradicting (E4) (when |es'| = |h|).

Thus the case |s| ≥ 1 always leads to a contradiction, so |s| = 0 and e' = es'. From |es'| ≤ |h| and Q(es') = Q(h), es' is a prefix of the right EG-holder h and satisfies (E'1). Furthermore, since e' = es' was assumed to satisfy (E1)-(E4), it also satisfies (E'2)-(E'4). However, e ≠ e' implies s' ≠ ε, so e is a proper prefix of e'. This contradicts the fact that e satisfies (E'5). Hence the assumption that some e' = ses' ≠ e satisfies (E1)-(E4) is false, and e satisfies (E5).

Next, it is shown that a string e satisfying (E1)-(E5) satisfies (E'1)-(E'5). First consider (E'1). Let s be the longest string such that Q(es) = Q(e). It is shown that this es is an EG-holder. Since e satisfies (E1), Q(es) = Q(e) ≠ φ, so es satisfies (R1). Since |es| ≥ |e| ≥ u, es satisfies (R2). By definition s is the longest string with Q(e) = Q(es), so es satisfies (R3). Hence es is a right MM and satisfies (H1).

Assume that a right MM r exists that is a proper prefix of es. Since r satisfies (R3), letting a be the character that follows the prefix r in es, Q(r)⊃Q(ra)⊇Q(es) = Q(e), and hence |r| < |e|. By (R4), there exists an MM m having r as a suffix with Q(m)+|m| = Q(r)+|r|. Therefore e either overlaps m on the left, contradicting (E3), or contains m as a proper substring, contradicting (E4) (FIG. 12). Hence the assumption that such an r exists is false, and es has no right MM as a proper prefix. That is, es satisfies (H3).

Furthermore, it is shown that es satisfies (H2). If es satisfies (H2b), (H2) is clearly satisfied. Consider the case where es does not satisfy (H2b), and assume that es is not an MM. If some input string had es as a prefix, es would be an MM, contrary to the assumption; so es is not a prefix of any input string. Since es is a right MM, the fact that es is not an MM means that (M4) is not satisfied, and Q(aes)+1 = Q(es) ≠ φ holds for some character a. In this case ae satisfies (E1)-(E4), as shown below.
Regarding (E1):
Since e satisfies (E1), Q(aes)+1 = Q(es) = Q(e) ≠ φ. Hence Q(ae)⊇Q(aes) ≠ φ.
Regarding (E2):
|ae| = |e|+1 > u.
Regarding (E3):
Assume that ae overlaps some MM. Since e satisfies (E3), e does not overlap any MM, so ae either overlaps on the left an MM having a as a suffix, or overlaps on the right an MM having e as a prefix. Because es does not satisfy (H2b), for any MM m there is no input string containing mes as a substring. Since Q(es) = Q(e), there is also no input string containing me as a substring. Hence ae does not overlap any MM on the left. On the other hand, assume that an MM m having e as a prefix exists. Since Q(es) = Q(e), if |m| < |es| this contradicts the fact that es satisfies (H3), and if |m| ≥ |es| then es is an MM by (R5), contradicting the assumption that es is not an MM. A contradiction arises in either case, so ae does not overlap any MM.
Regarding (E4):
Assume that an MM m exists that is a proper substring of ae. Since e satisfies (E4), no proper substring of e is an MM. Moreover, if e were an MM, then for s = ε this would contradict the assumption that es is not an MM, and for s ≠ ε it would contradict the fact that es has no right MM as a proper prefix. Hence m is a proper prefix of ae. Assume m = a. Since Q(aes)+1 = Q(es), es always occurs in the input strings with the MM a immediately to its left, which contradicts the assumption that es does not satisfy (H2b). Therefore m ≠ a. But if m, a proper prefix of ae, is a string of two or more characters, then e overlaps m on the left, contradicting the fact that e satisfies (E3). Thus assuming the existence of an MM m that is a proper substring of ae always leads to a contradiction, so ae satisfies (E4).

From the above, ae satisfies (E1)-(E4); but ae contains e as a proper substring, which contradicts the fact that e satisfies (E5). Therefore the assumption that es is not an MM is false. That is, es satisfies (H2b) or (H2a), hence satisfies (H2), and is a right EG-holder. Therefore e satisfies (E'1).

Furthermore, es satisfies (E'2) because |es| ≥ |e| ≥ u, and satisfies (E'3) and (E'4) because (E3) and (E4) are satisfied. Assume that a string e' exists that has e as a proper prefix and satisfies (E'1)-(E'4). Then e' satisfies (E1)-(E4) by (E'1)-(E'4) respectively, which contradicts the fact that e satisfies (E5). Hence no such e' exists, and e satisfies (E'5).

The above shows that the conditions (E1)-(E5) are equivalent to the conditions (E'1)-(E'5). Therefore, to show that the method of the present invention extracts exactly the strings satisfying (E1)-(E5), that is, the EGs, it suffices to show that the set of strings satisfying (E'1)-(E'5) coincides with the set of prefixes e of length EGlength(v) of the path labels p(v) of the nodes v marked "is an EG".

First, it is shown that e satisfies (E'1)-(E'5).
Regarding (E'1):
EGlength(v) is initialized to the length of p(v) with $ removed and never increases thereafter, so the prefix of p(v) of length EGlength(v) does not contain $. Hence e satisfies (E'1).
Regarding (E'2):
By condition (e3b), EGlength(v) ≥ u for every node v marked "is an EG", so e satisfies (E'2).
Regarding (E'3) and (E'4):
In step (e2b), the value 703 of the variable c (see FIG. 7) is the length of the longest substring that has (i,j) as an appearance and that, apart from the j-th character of input string i, contains neither the first character of any MM nor the character following the end of any MM. Meanwhile, for any appearance (i,j) of any right EG-holder h, let v = Tv[i,j]. Since h is a right EG-holder, v = v'(h), and in step (e2b) EGlength(v) is updated to the smaller of the current EGlength(v) and the value 703 of c. The initial value of EGlength(v) is |p'(v)| = |h|. Therefore, when (e2) is completed, EGlength(v) is the largest integer n satisfying the following conditions (e4a) and (e4b).
(e4a) n ≤ |h|
(e4b) The prefix of h of length n neither overlaps any MM nor contains any MM as a substring.
Hence the string e satisfies (E'3) and (E'4).
Regarding (E'5):
EGlength(v) is the length of the longest prefix, satisfying (E'3) and (E'4), of the right EG-holder that has e as a prefix. Since the length of e is EGlength(v), (E'5) is also satisfied.

Conversely, it is shown that any string other than the prefix of length EGlength(v) of a node v marked "is an EG" by the method of the present invention is not an EG.

First, strings that are not a prefix of a right EG-holder are excluded, and such strings do not satisfy (E'1). Condition (e3b) excludes strings shorter than u, and these do not satisfy (E'2). Furthermore, EGlength(v) is the length of the longest prefix of the right EG-holder p'(v) that satisfies (E'3) and (E'4), so any prefix of the path label of v longer than EGlength(v) violates (E'3) or (E'4). Finally, any such prefix shorter than EGlength(v) violates (E'5).

Thus, any string that the method of the present invention does not judge to be an EG is indeed not an EG.

■ Processing time of the method of the present invention
The behavior of the processing time of the method of the present invention with respect to the total number N of characters in the input strings is now examined. The processing of the present invention is a combination, repeated a fixed number of times, of the following: depth-first traversals of suffix tree T, whose number does not depend on the number of input strings or on the number of characters in them; constant-time processing at each node in each depth-first traversal; constant-time processing, repeated a number of times independent of the number of input strings and their character counts, on each element of the three two-dimensional arrays Head, Tail, and Tv, each of which has N elements in total; and the processing in (t3) that visits each leaf and completes in O(N) time. Therefore the computation time of the entire method of the present invention is O(N).

■ Apparatus for implementing the method of the present invention
The present invention also provides an apparatus for executing the above method. FIG. 8 shows an example of the configuration of the apparatus. The apparatus stores in main memory 806 a program 805 that executes the above method, together with suffix tree T and the input strings. Program 805 is executed by central processing unit 801. The computation results are displayed on display 802, stored in auxiliary storage device 807, or both. User input is provided through keyboard 803 and pointing device 804. The apparatus of the present invention may be communicably connected to other apparatuses via a network such as the Internet or an intranet. The input strings are given, for example, in the form of a file, and are loaded into main memory 806 by reading a file recorded on a recording medium such as a CD-R or a file received via a network. The input strings can of course also be entered from the keyboard. In this specification, the means for loading input strings into main memory 806 of the apparatus are collectively referred to as string input means.

When the EGs obtained by the method of the present invention are displayed on display 802, it is preferable, for ease of viewing and analysis, to show each EG in a visually clear manner, for example by changing the color or by using characters or symbols, at the position 104 corresponding to the EG substring on the input string itself or on a line segment or rectangle 101 that represents the input string symbolically, as in the example of FIG. 1. The position 102 of an EG on the input string may also be displayed, regardless of whether the input string itself or the line segment or rectangle 101 representing it is displayed. When EGs derived from the same input string are displayed, they are preferably listed in ascending order of start position. This ordering can be obtained by, for each input string i, increasing a variable j from 0 to the length of input string i minus 1 and, whenever Tv[i,j] is a pointer to a node v marked "is an EG", displaying the EG that is the prefix of p(v) of length EGlength(v). When EGs obtained from the same input string are stored together in auxiliary storage device 807, it is likewise preferable to obtain the ascending order of EG start positions in the same way and to store the EGs in that order.
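The ascending-order scan can be sketched in Python as below. The `Node` class and `list_egs` are hypothetical names. The sketch also assumes that, because (i,j) is an appearance of the path label of the node stored in Tv[i,j], the prefix of p(v) of length EGlength(v) can be read directly from the input string starting at position j.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Node:
    is_eg: bool = False      # node carries the "is an EG" mark
    eg_length: int = 0       # EGlength(v)


def list_egs(strings, tv):
    """Return (i, j, EG text) triples in ascending start position per string."""
    out = []
    for i, s in enumerate(strings):
        for j in range(len(s)):              # j runs from 0 to |Si|-1
            v = tv[i][j]
            if v is not None and v.is_eg:
                out.append((i, j, s[j:j + v.eg_length]))
    return out


strings = ["ACGTAC"]
a = Node(is_eg=True, eg_length=4)
b = Node(is_eg=False, eg_length=2)
tv = [[a, None, None, None, b, None]]
listing = list_egs(strings, tv)
```

The same loop order can be reused when writing EGs to storage, since it already yields the start positions in ascending order for each input string.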

When EGs are displayed, it is also preferable to use colors, symbols, character strings 103, or the like so that identical EGs occurring in multiple input strings can be identified. Likewise, when EGs are recorded in auxiliary storage device 807, it is preferable to record with each EG a number, or a numerical value or character string unique to that EG, so that multiple EGs can be identified as being the same.

■ Specification of exon boundaries by the user of the method of the present invention
The method of the present invention searches for exon boundaries by comparing cDNA sequences with one another, but an exon boundary that cannot be found in this way may be known by some other means. It is therefore desirable that a user of the method can enter the position of a known exon boundary so that EGs are split at that boundary. To this end, when the user enters "a known exon boundary lies between the (j-1)-th and j-th characters of input character string Si", the following operations are performed immediately after Head and Tail are set in step S505.
(b1) Set Head[i,j] to 1.
(b2) If j ≥ 1, set Tail[i,j-1] to 1.
When step S506 and the subsequent steps are then executed, processing proceeds in the same way as when an MM boundary lies between the (j-1)-th and j-th characters of input character string Si, and the EG is split there. If there are several known exon boundaries, operations (b1) and (b2) are repeated for each of them.
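Operations (b1) and (b2) amount to two array writes per known boundary. The sketch below assumes that Head and Tail are held as lists of 0/1 rows, one row per input character string; the function name `mark_known_boundaries` is hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def mark_known_boundaries(
    Head: List[List[int]],
    Tail: List[List[int]],
    boundaries: Sequence[Tuple[int, int]],
) -> None:
    """Record user-supplied exon boundaries in Head and Tail (modified in place).

    Each (i, j) means: a known boundary lies between characters j-1 and j
    of input string i.
    """
    for i, j in boundaries:
        Head[i][j] = 1          # (b1)
        if j >= 1:
            Tail[i][j - 1] = 1  # (b2)
```

For example, with one input string of length 5 and a known boundary at j = 3, the call sets `Head[0][3]` and `Tail[0][2]` to 1 and leaves every other entry unchanged.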

■ Application of the method of the present invention to sequences other than cDNA sequences
Although the present invention is aimed at splicing pattern analysis of mRNA sequences, it can also be applied to comparative genome analysis. If genome sequences are used as the input character strings instead of mRNA sequences and the resulting EGs are interpreted as conserved regions in the genomes, the method reveals both the positions of the conserved regions in each genome sequence and the correspondence between conserved regions across multiple genome sequences, making it a powerful tool for comparative genome analysis.

More generally, as long as the set of characters used in the character strings is finite, the method does not depend on nucleotide sequences and can process arbitrary character strings. It can therefore be applied as is to sequences other than nucleotide sequences, such as amino acid sequences, and to character strings in general.

A system implementing the method of the present invention was built, and it was demonstrated that the method can analyze the splicing patterns of nucleotide sequences. The results described below were obtained with u set to 15.

First, four nucleotide sequences s1, s2, s3, and s4, each 30 characters long, were generated using random numbers. They were combined into four input character strings of length 90, namely s1s2s3, s1s2s4, s1s3s1, and s2s3s4, and software implementing the method of the present invention was used to test whether these strings could be recognized as combinations of s1, s2, s3, and s4. Note that s1 appears twice in the third string. The parameter u was set to 15. The sequences s1, s2, s3, and s4 are as follows.
s1 = TTCAACAAAGACGGAAGTGTCCTAAATAGG
s2 = GTGCTGACAGTGCTGTTAGAACTACAGGCT
s3 = GAAGAAAGGTAACGCATATAGTGCGACGAA
s4 = GGTGGCATGCCATGGACGCATACTCCGTAA
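The four test inputs can be rebuilt directly from the sequences listed above. The following sketch only constructs the data and locates occurrences of a segment; the helper `find_all` is a hypothetical utility, not part of the described system.

```python
s1 = "TTCAACAAAGACGGAAGTGTCCTAAATAGG"
s2 = "GTGCTGACAGTGCTGTTAGAACTACAGGCT"
s3 = "GAAGAAAGGTAACGCATATAGTGCGACGAA"
s4 = "GGTGGCATGCCATGGACGCATACTCCGTAA"

# The four input character strings: s1s2s3, s1s2s4, s1s3s1, s2s3s4.
inputs = [s1 + s2 + s3, s1 + s2 + s4, s1 + s3 + s1, s2 + s3 + s4]


def find_all(text: str, pattern: str) -> list:
    """Return the 0-based start positions of all occurrences of pattern in text."""
    hits = []
    k = text.find(pattern)
    while k != -1:
        hits.append(k)
        k = text.find(pattern, k + 1)  # step by 1 so overlaps are not missed
    return hits


for i, text in enumerate(inputs):
    print(i, len(text), find_all(text, s1))
```

Each input has length 90, and in the third input s1 is found at 0-based positions 0 and 60, which is the repeated occurrence the experiment is designed to test.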

The processing yielded the four EGs shown in FIG. 1. s1 and s4 were extracted as EG1 and EG4, whereas EG2 and EG3 were each extracted as sequences lacking the first base. The reason is that s2 and s3 both begin with G, so the sequences formed by appending G to the end of s1 and of s2 become MMs, and the sequences obtained by removing the first base from s2 and s3 become right EG-holders. Thus a chance match at the end of a sequence can make an EG shorter than the actual exon. Nevertheless, although the match was not perfect, EG2 and EG3 nearly coincided with s2 and s3, respectively. In other words, the method of the present invention recognized that the input character strings are combinations of four sequences of about 30 bases each. The method also recognized that s1 appears at two places in the third sequence, both as EG1.
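The one-base shift can be checked by counting which characters follow each occurrence of a string in the four inputs. The sketch below is only an illustration of the coincidence described above; it does not implement the MM conditions themselves, and `next_chars` is a hypothetical helper that uses "$" to denote the end of an input string.

```python
from collections import Counter

s1 = "TTCAACAAAGACGGAAGTGTCCTAAATAGG"
s2 = "GTGCTGACAGTGCTGTTAGAACTACAGGCT"
s3 = "GAAGAAAGGTAACGCATATAGTGCGACGAA"
s4 = "GGTGGCATGCCATGGACGCATACTCCGTAA"
inputs = [s1 + s2 + s3, s1 + s2 + s4, s1 + s3 + s1, s2 + s3 + s4]


def next_chars(texts, w):
    """Count the character following each occurrence of w ('$' = end of string)."""
    counts = Counter()
    for t in texts:
        k = t.find(w)
        while k != -1:
            end = k + len(w)
            counts[t[end] if end < len(t) else "$"] += 1
            k = t.find(w, k + 1)
    return counts


print(dict(next_chars(inputs, s2)))        # every occurrence of s2 is followed by G
print(dict(next_chars(inputs, s2 + "G")))  # the contexts diverge only after that G
print(dict(next_chars(inputs, s1 + "G")))  # likewise for s1 followed by G
```

Every occurrence of s2 is followed by G, and the following characters differ only one base later (A when s3 follows, G when s4 follows). This is consistent with the boundary being placed after the shared G rather than at the true end of s1 or s2.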

Using the system implementing the method of the present invention, sequences of the human Wilms tumor gene WT1 were retrieved from among the splicing variant sequences of real genes registered in RefSeq, a database of a US public institution, and analyzed, and the resulting splicing pattern was compared with the description in the database. Three sequences were used, with accession numbers NM_004906, NM_152857, and NM_152858. The database states that the sequence from base 576 onward of NM_152857 and the sequence from base 675 onward of NM_152858 are alternative 3' end sequences, and that NM_152857 has an alternative 5' end sequence. FIG. 2 shows the result of applying the system to these three sequences. The alternative 3' sequences of NM_152857 and NM_152858 described in the database were correctly recognized as EG5, including their start positions. For the alternative 5' end sequence of NM_152857, the database gave no position, but the analysis showed that the sequence from the first base to base 115 is the alternative 5' end sequence and that it replaces the 5' end sequence running from the first base to base 213 in NM_004906 and NM_152858. However, the alternative 3' sequence of NM_004906 was split into two parts, one up to base 1566 and one from base 1592 onward, which became EG4 and EG6, and the 25 bases between them were left as a blank interval. All 25 bases of this blank interval were T. By (M1)-(M4), every run of T that is 15 or more bases long is an MM, and EGs are chosen so that they do not contain these MMs as proper substrings; this is why the alternative 3' sequence was split into EG4 and EG6. This shows that care is needed with repeat sequences when the method of the present invention is applied.
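Because single-base runs of length u or more interrupt EGs in this way, it may help to locate such runs in the inputs before interpreting the result. The following is a minimal sketch of such a pre-check, assuming an uppercase A/C/G/T alphabet; the function `homopolymer_runs` is hypothetical and is not part of the described system.

```python
import re
from typing import List, Tuple


def homopolymer_runs(seq: str, u: int = 15) -> List[Tuple[int, int, str]]:
    """Return (start, end, base) for each maximal single-base run of length >= u.

    Positions are 1-based and inclusive, matching the base numbering in the text.
    """
    # One base followed by at least u-1 repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (u - 1))
    return [
        (m.start() + 1, m.end(), m.group(1))
        for m in pattern.finditer(seq.upper())
    ]


example = "ACG" * 4 + "T" * 25 + "GCA"
print(homopolymer_runs(example))
```

In NM_004906 the blank interval lies between base 1566 and base 1592, that is, bases 1567 to 1591, which is 1591 - 1567 + 1 = 25 bases and agrees with the 25 consecutive T reported above.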

In all three sequences, a blank interval not covered by any EG was also found at the end; these intervals are poly-A sequences, not exon sequences. In these intervals A is repeated 17 or more times, and since every run of 15 or more A is an MM, the ends of MMs are densely packed in the poly-A region and no EG exists there.

The three input sequences used in this experiment totaled 5,529 bp, and the computation took about 0.1 second on a personal computer with a CPU clock frequency of 1.7 GHz.

A diagram showing an example display of the result of analyzing, by the method of the present invention, four input sequences that are combinations of four exons.
A diagram showing an example display of the result of extracting exon sequences from splicing variants of the Wilms tumor gene by the method of the present invention.
An explanatory diagram of input character strings, right MMs, MMs, right EG-holders, and EGs.
An explanatory diagram of a suffix tree.
A flowchart of the overall processing in the method of the present invention.
A diagram showing an example in which the total number of appearances of MMs is Ω(Nk).
An explanatory diagram of the right EG-holder extraction of step S505 and the EG extraction method of step S506 in FIG. 5.
An explanatory diagram of an example of an apparatus that implements the method of the present invention.
A diagram explaining the relationship between the character strings considered in deriving |es'| ≤ |h|, in the explanation that condition (E5) follows from conditions (E'1)-(E'5).
A diagram explaining the relationship between the character strings considered in deriving, for a certain MM m', that the character string m'h exists on some input character string, in the explanation that condition (E5) follows from conditions (E'1)-(E'5).
A diagram explaining the relationship between the character strings considered in deriving |t| < |s| for a certain MM m', in the explanation that condition (E5) follows from conditions (E'1)-(E'5).
A diagram explaining the relationship between the character strings considered in deriving condition (E'1) from conditions (E1)-(E5).
A diagram explaining the definition of character strings s and s' overlapping.
A diagram explaining an example in which it is difficult to recognize, from transcript-derived sequences alone, that multiple exons exist.
A diagram explaining an example in which it is difficult to identify the boundaries of multiple exons from transcript-derived sequences alone.

Explanation of symbols

101: Rectangle symbolically representing an input character string
102: Number indicating the position of an EG on the input character string
103: Example of a character string used to identify EGs occurring in multiple input character strings
104: Position on 101 corresponding to the substring that is an EG
301: Example input character strings
302: List of the right MMs for the set of input character strings 301
303: List of the MMs for the set of input character strings 301
304: List of the right EG-holders for the set of input character strings 301
305: List of the EGs for the set of input character strings 301
401: Suffix tree constructed from the character strings ATATG and TTAGTA
402: Root node of suffix tree 401
403: One of the leaves of suffix tree 401 (a leaf labeled "i, j" corresponds to the suffix of character string i starting at its j-th character)
404: One of the edges of suffix tree 401
405: One of the edge labels of suffix tree 401
407: One of the nodes of suffix tree 401
408: Path from the root node 402 to a node of suffix tree 401
409: One of the suffix links of suffix tree 401
701: Elements of the two-dimensional array Head for input character string 2
702: Elements of the two-dimensional array Tail for input character string 2
703: Value of the variable c used in the EG extraction method, at step (e2b)
801: Central processing unit of the apparatus of the present invention
802: Display of the apparatus of the present invention
803: Keyboard of the apparatus of the present invention
804: Pointing device of the apparatus of the present invention
805: Program for executing the method of the present invention
806: Main storage device of the apparatus of the present invention
807: Auxiliary storage device of the apparatus of the present invention

Claims

1. A character string analysis method in which, when a plurality of character strings are input from character string input means,
a substring that appears at one or more places in at least one of the input character strings, has a length of at least a given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right or left appears differs from the set of positions at which the substring appears before the character is added, being called an MM,
the method is characterized by executing a process of extracting character strings called EGs, an EG being a character string that satisfies the following three conditions (1) to (3) and is not a part of another character string satisfying conditions (1) to (3):
(1) it is a substring that appears at one or more places in at least one of the input character strings;
(2) its length is at least the integer u; and
(3) it shares no character on any input character string with any MM other than an MM that completely contains it.

2. The character string analysis method according to claim 1, wherein the arithmetic unit executes:
a step of extracting substrings called right MMs, a right MM being a substring that appears at one or more places in at least one of the input character strings, has a length of at least the given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right appears differs from the set of positions at which the substring appears before the character is added;
a step of extracting character strings called left MMs, a left MM being a substring that appears at one or more places in at least one of the input character strings, has a length of at least the integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its left appears differs from the set of positions at which the substring appears before the character is added, or a step of extracting the substrings called MMs; and
a step of extracting character strings called right EG-holders, a right EG-holder being a right MM that starts at the character following an MM on an input character string and that does not have another right MM as a proper prefix.

3. A character string analysis method for extracting, when a plurality of character strings are input from character string input means, substrings called MMs, an MM being a substring that appears at one or more places in at least one of the input character strings, has a length of at least a given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right or left appears differs from the set of positions at which the substring appears before the character is added,
wherein an arithmetic unit executes a process of extracting substrings called right MMs, a right MM being a substring that appears at one or more places in at least one of the input character strings, has a length of at least the integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right appears differs from the set of positions at which the substring appears before the character is added.

4. A character string analysis method in which, when a plurality of character strings are input from character string input means, the following are executed:
a step of extracting substrings called right MMs, a right MM being a substring that appears at one or more places in at least one of the input character strings, has a length of at least a given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right appears differs from the set of positions at which the substring appears before the character is added; and
a step of extracting character strings called left MMs, a left MM being a substring that appears at one or more places in at least one of the input character strings, has a length of at least the integer u, and is such that the set of positions at which the new character string obtained by adding one character to its left appears differs from the set of positions at which the substring appears before the character is added,
and in which, a substring that appears at one or more places in at least one of the input character strings, has a length of at least the integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right or left appears differs from the set of positions at which the substring appears before the character is added, being called an MM,
a process is executed of extracting substrings called right EG-holders, a right EG-holder being a right MM that starts at the character following an MM on an input character string and that does not have another right MM as a proper prefix.

5. A character string analysis method in which, when a plurality of character strings are input from character string input means,
a substring that appears at one or more places in at least one of the input character strings, has a length of at least a given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right appears differs from the set of positions at which the substring appears before the character is added, being called a right MM, and
a substring that appears at one or more places in at least one of the input character strings, has a length of at least the integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right or left appears differs from the set of positions at which the substring appears before the character is added, being called an MM,
an arithmetic unit executes a process of extracting substrings called right EG-holders, a right EG-holder being a right MM that starts at the character following an MM on an input character string and that does not have another right MM as a proper prefix,
wherein, in order to calculate the positions of the MMs on the input character strings, a process is performed that calculates the positions of right MMs that do not have another right MM as a proper suffix.

6. A character string analysis method in which, when a plurality of character strings are input from character string input means,
a substring that appears at one or more places in at least one of the input character strings, has a length of at least a given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right appears differs from the set of positions at which the substring appears before the character is added, being called a right MM, and
a substring that appears at one or more places in at least one of the input character strings, has a length of at least the integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right or left appears differs from the set of positions at which the substring appears before the character is added, being called an MM,
an arithmetic unit executes a process of extracting substrings called right EG-holders, a right EG-holder being a right MM that starts at the character following an MM on an input character string and that does not have another right MM as a proper prefix,
wherein, in order to calculate the positions of the MMs on the input character strings, a process is performed that calculates the positions of character strings called prefix-minimal right MMs, a prefix-minimal right MM being a right MM that does not have another right MM as a proper prefix.

7. The character string analysis method according to claim 1, wherein,
a substring that appears at one or more places in at least one of the input character strings, has a length of at least the given integer u, and is such that the set of positions on the input character strings at which the new character string obtained by adding one character to its right appears differs from the set of positions at which the substring appears before the character is added, being called a right MM,
the arithmetic unit uses, as candidates for the substrings EG to be extracted, only prefixes of right EG-holders, a right EG-holder being a right MM that starts at the character following an MM on an input character string and that does not have another right MM as a proper prefix.

8. The character string analysis method according to claim 5, wherein, when checking whether a right MM has another right MM as a proper suffix, reference is made to the suffix link of the parent node of the node on the suffix tree whose path label has the right MM as a prefix.

2. The character string analyzing method according to claim 1, wherein when a position to be a boundary of the EG is given on the input character string, the EG is divided so that the EG does not include the position. Column analysis method.

10. A program for causing a computer to execute the character string analysis method according to any one of claims 1 to 9.