JP2000194713A

JP2000194713A - Method and device for retrieving character string, and storage medium stored with character string retrieval program

Info

Publication number: JP2000194713A
Application number: JP10370933A
Authority: JP
Inventors: Seiichi Konya; 精一紺谷; Masashi Yamamuro; 雅司山室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-12-25
Filing date: 1998-12-25
Publication date: 2000-07-14

Abstract

PROBLEM TO BE SOLVED: To provide a character string retrieval method and device, and a storage medium storing a character string retrieval program, by which approximate retrieval over a large amount of text can be performed at high speed. SOLUTION: Character strings are segmented from the given text (S1), the segmented character strings are stored as a tree structure (S2), the difference between a character string input by a user and the stored character strings is predicted (S3), and the character strings stored as the tree structure are searched based on the predicted difference (S4). The character strings of the given text are stored together with pointers, so that, by obtaining the difference (predicted value) from the selected retrieval key at search time, it can be determined at which position in which text a character string with a small difference appears. By, for example, excluding subtrees with large predicted values from the search range or preferentially searching subtrees with small predicted values, character strings with small differences can be searched for efficiently, and approximate retrieval over a large amount of text can be performed at high speed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a character string search method and apparatus and to a storage medium storing a character string search program, and in particular to a character string search method and apparatus, and a storage medium storing a character string search program, for performing text search, music search based on note information, and matching of DNA base sequences in the course of symbol processing and pattern matching.

[0002]

[Prior art] A conventional character string search method is the "Patricia tree" shown in FIG. 13. The Patricia tree shown in the figure builds an index over the character strings that start at every character of the text. The construction of the Patricia tree is described with reference to FIG. 14. The character strings "ababc", "babc", "abc", "bc", and "c" cut out from the text "ababc" are stored in order. Since no node or leaf has a prefix matching the character string "ababc", a leaf is created and the link to the leaf is labeled "ababc". The character string "babc" is handled in the same way. The character string "abc" shares a prefix with the already stored "ababc", so the leaf "ababc" is deleted and a new node for the prefix "ab" is created. "ababc" is split into "ab + abc" and "abc" into "ab + c"; leaves are created under the shared prefix "ab", and the links to the leaves are labeled "abc" and "c", respectively. The character strings "bc" and "c" are then processed in the same way.

[0003] A character string search using the Patricia tree is described with reference to FIG. 15. The character string "abc" given as the key is compared with the labels of the tree; since the label "ab" matches a prefix of the key "abc", the link labeled "ab" is followed. Next, the remaining part of the key (the "c" of "abc") is compared with the labels at the node, and the link with the matching label "c" is followed to reach a leaf. From the pointer held by the leaf (the position at which the character string appears), it is found that the key string "abc" appears at the third character of the text "ababc".
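
The construction of FIG. 14 and the lookup of FIG. 15 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: the names `Node`, `insert`, and `find` are hypothetical, positions are 1-based to match the text, and `find` only handles keys that end exactly at a node, as the key "abc" does in the example.

```python
class Node:
    """One node of the tree; outgoing edges carry string labels."""

    def __init__(self):
        self.children = {}   # edge label (str) -> Node
        self.positions = []  # 1-based start positions of strings ending here


def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def insert(root, s, pos):
    """Store string s (cut out at position pos), splitting edges on shared prefixes."""
    node = root
    while True:
        for label, child in list(node.children.items()):
            k = common_prefix_len(label, s)
            if k == 0:
                continue
            if k < len(label):
                # Split the edge: keep the shared prefix, push the rest down.
                mid = Node()
                mid.children[label[k:]] = child
                del node.children[label]
                node.children[label[:k]] = mid
                child = mid
            node, s = child, s[k:]
            break
        else:
            # No edge shares a prefix with s: create a new leaf.
            leaf = Node()
            leaf.positions.append(pos)
            node.children[s] = leaf
            return
        if not s:
            node.positions.append(pos)
            return


def find(root, key):
    """Exact lookup: follow edges whose labels are prefixes of the remaining key."""
    node = root
    while key:
        for label, child in node.children.items():
            if key.startswith(label):
                node, key = child, key[len(label):]
                break
        else:
            return []
    return node.positions


root = Node()
text = "ababc"
for i in range(len(text)):
    insert(root, text[i:], i + 1)

print(find(root, "abc"))  # [3]
```

After the five insertions the root has three edges, labeled "ab", "b", and "c", and the lookup of "abc" follows "ab" and then "c" to the leaf holding position 3.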

[0004]

[Problems to be solved by the invention] However, the conventional method described above can retrieve only character strings that exactly match the character string given as the key; it cannot search for misspelled or similar character strings (hereinafter, approximate search). The present invention has been made in view of the above, and its object is to provide a character string search method and apparatus capable of performing approximate search over a large amount of text at high speed, and a storage medium storing a character string search program.

[0005]

[Means for solving the problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention (claim 1) is a character string search method for searching for a character string in given text, musical notes, DNA base sequences, and the like, in which character strings are cut out from the given text (step 1), the cut-out character strings are stored as a tree structure (step 2), the difference between the character string input by the user and the stored character strings is predicted (step 3), and the stored tree structure is searched for character strings on the basis of the predicted difference (step 4).

[0006] In the present invention (claim 2), when the character strings are stored as a tree structure, character strings generated from a plurality of texts are stored in a single tree, and each leaf of the tree has a pointer indicating the position of the character string. In the present invention (claim 3), when the character strings are stored as a tree structure, the character strings are generated while shifting the starting point in the given text.

[0007] In the present invention (claim 4), when the character strings are stored as a tree structure, constraints are imposed on the character strings to limit the number of character strings generated. In the present invention (claim 5), when searching for a character string, the root node of the tree is expanded to generate subtrees, the difference between the character string input by the user and each subtree is predicted, the subtrees are arranged in order of predicted value to generate a search list, the node of the subtree at the head of the search list is expanded, the search list is updated according to the predicted values, and character strings are searched for on the basis of the updated search list.

[0008] The present invention (claim 6) is a character string search method for searching for a character string in given text, musical notes, DNA base sequences, and the like, which predicts the difference between the character string input by the user and the stored character strings, the latter being character strings cut out from the given text and stored as a tree structure. The present invention (claim 7) is a character string search method for searching for a character string in given text, musical notes, DNA base sequences, and the like, in which, on the basis of a search key input by the user, the tree structure storing the character strings cut out from the given text is searched according to the predicted difference.

[0009] FIG. 2 is a diagram showing the principle configuration of the present invention. The present invention (claim 8) is a character string search device for searching for a character string in given text, musical notes, DNA base sequences, and the like, comprising: character string dividing means 10 that cuts out character strings from the given text; storage means 20 that stores the character strings cut out by the character string dividing means 10 as a tree structure; prediction means 50 that predicts the difference between the character string input by the user and the character strings stored in the character string dividing means; and search means 40 that searches the tree structure of the storage means on the basis of the predicted value predicted by the prediction means 50.

[0010] In the present invention (claim 9), the character string storage means 10 stores character strings generated from a plurality of texts as a single tree structure, and each leaf of the tree structure has a pointer indicating the position of the character string. In the present invention (claim 10), the character string storage means 10 includes means for generating character strings while shifting the starting point in the given text.

[0011] In the present invention (claim 11), the character string storage means 10 includes means for imposing constraints on the character strings to limit the number of character strings generated. In the present invention (claim 12), the prediction means 50 includes: subtree generation means that expands the root node of the tree and generates subtrees; search list generation means that predicts the difference between the character string input by the user and each subtree, arranges the subtrees in order of predicted value, and generates a search list; and search list updating means that expands the node of the subtree at the head of the search list and updates the search list according to the predicted values. The search means 40 includes means for searching for character strings on the basis of the search list updated by the search list updating means.

[0012] The present invention (claim 13) is a character string search device for searching for a character string in given text, musical notes, DNA base sequences, and the like, comprising prediction means that predicts the difference between the character string input by the user and the character strings cut out from the given text. In the present invention (claim 14), the prediction means includes: subtree generation means that expands the root node of the tree and generates subtrees; search list generation means that predicts the difference between the character string input by the user and each subtree, arranges the subtrees in order of predicted value, and generates a search list; and search list updating means that expands the node of the subtree at the head of the search list and updates the search list according to the predicted values.

[0013] The present invention (claim 15) is a character string search device for searching for a character string in given text, musical notes, DNA base sequences, and the like, comprising search means that searches the tree structure of the storage means on the basis of the predicted difference between the character string input by the user and the character strings cut out from the given text. The present invention (claim 16) is a storage medium storing a character string search program for searching for a character string in given text, musical notes, DNA base sequences, and the like, the program comprising: a character string dividing process that cuts out character strings from the given text; a storage process that stores the character strings cut out by the character string dividing process as a tree structure; a prediction process that predicts the difference between the character string input by the user and the character strings stored in the character string dividing process; and a search process that searches the tree structure of the storage process on the basis of the predicted value predicted by the prediction process.

[0014] In the present invention (claim 17), the character string storage process stores character strings generated from a plurality of texts as a single tree structure, and each leaf of the tree structure has a pointer indicating the position of the character string. In the present invention (claim 18), the prediction process includes: a subtree generation process that expands the root node of the tree and generates subtrees; a search list generation process that predicts the difference between the character string input by the user and each subtree, arranges the subtrees in order of predicted value, and generates a search list; and a search list updating process that expands the node of the subtree at the head of the search list and updates the search list according to the predicted values. The search process includes a process of searching for character strings on the basis of the search list updated by the search list updating process.

[0015] As described above, when character strings are cut out from the given text, they are generated while shifting the starting point in the text. At this time, depending on the nature of the given text, constraints are imposed on the character strings to limit the number of character strings generated. This makes it possible to search for a character string at any position in the given text. Each character string generated in this way is stored together with a pointer, which is information indicating from which position in which text the character string was obtained. This makes it possible to know at which position in which text a character string with a small difference from the search key chosen at search time appeared.

[0016] Furthermore, the difference between the stored character strings and the set of character strings contained in the character string input by the user is predicted, and when the stored tree structure is traversed and searched, subtrees are searched in ascending order of predicted value. On the basis of the predicted values, subtrees with large predicted values can be excluded from the search range, or subtrees with small predicted values can be searched preferentially, so that character strings with small differences can be retrieved efficiently.

[0017]

[Embodiments of the invention] FIG. 3 shows the configuration of the character string search device of the present invention. The character string search device shown in the figure comprises a character string dividing unit 10, a storage unit 20, a memory 30, a search unit 40, and a prediction unit 50. The character string dividing unit 10 is connected to a text input device 60 and generates character strings while shifting the starting point in the text supplied by the text input device 60. At this time, depending on the nature of the given text, constraints are imposed on the character strings to limit the number of character strings generated. That is, the generation of meaningless character strings, or of character strings unlikely to be searched for, is suppressed. For example, the number of generated character strings is limited by rules such as: (1) character strings containing a delimiter (space, period, comma) are not generated; (2) character strings are at most a certain number of characters long.
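
The two example constraints can be illustrated with a short Python sketch. It is one possible reading rather than the patent's procedure: the function name `cut_strings`, the delimiter set `DELIMITERS`, and the default `max_len` are assumptions introduced here, and the sketch emits one string per starting position, truncated at the first delimiter or at the length limit.

```python
# Assumed delimiter set: ASCII and Japanese spaces, periods, and commas.
DELIMITERS = set(" \t\n.,、。\u3000")


def cut_strings(text, max_len=8):
    """Return (string, 1-based start position) pairs cut out of text.

    The starting point is shifted one character at a time. A string never
    contains a delimiter and is at most max_len characters long.
    """
    out = []
    for start in range(len(text)):
        if text[start] in DELIMITERS:
            continue  # a string may not begin with (or contain) a delimiter
        end = start
        while (end < len(text) and end - start < max_len
               and text[end] not in DELIMITERS):
            end += 1
        out.append((text[start:end], start + 1))
    return out


print(cut_strings("ab cd", max_len=3))
# [('ab', 1), ('b', 2), ('cd', 4), ('d', 5)]
```

Without the constraints, "ab cd" would also yield strings such as "ab cd" and " cd", which cross or start at a delimiter and are unlikely to be searched for.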

[0018] The storage unit 20 stores, as a tree structure, the character strings generated by the character string dividing unit 10 together with pointers (information indicating from which position in which text each character string was obtained). When traversing the tree structure held by the storage unit 20, the search unit 40 searches subtrees in ascending order of the difference predicted by the prediction unit 50 described below.

[0019] The prediction unit 50 predicts the difference between a character string and the set of character strings contained in a subtree (of the tree structure held by the storage unit 20). The operation of the above configuration is described below. First, the operation up to the storing of text is described. FIG. 4 is a flowchart for explaining the operation (text storage phase) of the character string search device of the present invention.

[0020] Step 110) First, text is input to the character string dividing unit 10 from the text input device 60.
Step 120) The character string dividing unit 10 generates character strings while shifting the starting point in the given text one character at a time.
Step 130) It is determined whether a character string has been generated; if not, the process ends.

[0021] Step 140) The storage unit 20 stores the character strings generated by the character string dividing unit 10 in a Patricia tree. Character strings generated from a plurality of texts are stored in a single tree, and each leaf of the tree has a plurality of pointers indicating at which position in which text the character string occurs. Next, the character string search operation is described.
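
The storage phase of steps 110 to 140 can be pictured with the following Python sketch. For brevity it assumes a plain, uncompressed trie built from nested dictionaries instead of a Patricia tree, and the names `store`, `build`, and `pointers` as well as the `"$"` end marker are hypothetical. What it aims to show is that one tree holds strings from several texts and that each string end carries a list of (text number, position) pointers.

```python
def store(root, string, text_id, position):
    """Store one cut-out string and its pointer in the shared tree."""
    node = root
    for ch in string:
        node = node.setdefault(ch, {})
    # "$" is an assumed end marker; it holds every pointer for this string.
    node.setdefault("$", []).append((text_id, position))


def build(texts):
    """Shift the starting point one character at a time over every text."""
    root = {}
    for text_id, text in enumerate(texts, start=1):
        for start in range(len(text)):
            store(root, text[start:], text_id, start + 1)
    return root


def pointers(root, string):
    """Return the (text number, position) pointers stored for string."""
    node = root
    for ch in string:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])


root = build(["ababc", "xabc"])
print(pointers(root, "abc"))  # [(1, 3), (2, 2)]
```

Here "abc" is stored once in the tree but points to two places: the third character of the first text and the second character of the second text.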

[0022] FIG. 5 is a flowchart of the character string search phase of the present invention.
Step 210) A character string is input from the search key input device 80 to the search unit 40 by means of a keyboard or the like.
Step 220) The search unit 40 searches the storage unit 20 on the basis of the input character string (search key) and creates a search list.

[0023] Step 230) Next, the root node is expanded to generate subtrees, and the prediction unit 50 predicts the difference between the search key and each subtree. The edit distance between character strings is used as the prediction of the difference from the search key.
Step 240) It is determined whether an evaluation value has been fixed; if it has, the process proceeds to step 260, and if not, to step 250.

[0024] Step 250) If the evaluation value has not been fixed, the subtrees are arranged in order of predicted value, the search list is updated, and the process proceeds to step 270.
Step 260) If the evaluation value has been fixed, the character strings are entered in the search result list in order of evaluation value.
Step 270) The above processing is repeated until the search list becomes empty. When it is empty, the process proceeds to step 280.

[0025] Step 280) The search results are output.

[0026]

[Examples] An example is described below with reference to the drawings. Before describing the example, the terms used are explained. In the following description, "character", "character string", and "text" denote, respectively, a symbol (a single character), a meaningful sequence of symbols, and a sequence of character strings.

[0027] For example, when the invention is applied to the retrieval of documents containing a specified keyword, characters are letters of the alphabet, digits, kana, kanji, and so on, character strings are words and phrases, and a text corresponds to a document. In music retrieval, characters are notes, character strings are phrases, and a text is a piece of music. In the matching of DNA base sequences, characters are the four bases {C (cytosine), T (thymine), A (adenine), G (guanine)}, character strings are sequences of bases, and a text is DNA (a gene).

[0028] In the configuration shown in FIG. 3, when text is input from the text input device 60 by means of a keyboard or the like, the character string dividing unit 10 generates character strings while shifting the starting point in the text one character at a time; the character strings generated from a plurality of texts are made into a single tree and stored by the storage unit 20 in the Patricia tree in the memory 30. The prediction unit 50 uses the edit distance between character strings as the prediction of the difference from the search key. FIG. 6 is a diagram for explaining the edit distance in one embodiment of the present invention. In the figure, editing operations (insertion, deletion, substitution) are applied to the character string a ($a_0 a_1 a_2 \dots a_n$) to obtain its distance from the character string b ($b_0 b_1 b_2 \dots b_m$). There are multiple combinations of editing operations that convert the character string a into the character string b.

[0029] The unit cost (edit distance) is obtained by

$$\text{unit cost} \equiv \min\{N_I + N_D + N_R\}$$

where $N_I$ is the number of insertions in the sequence of editing operations, $N_D$ is the number of deletions, and $N_R$ is the number of substitutions. The weighted edit distance is obtained by

$$\text{weighted edit distance} \equiv \min\{w_I N_I + w_D N_D + w_R N_R\}$$

where $w_I$ is the weight for insertion, $w_D$ is the weight for deletion, and $w_R$ is the weight for substitution.
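
Both distances can be computed with the standard dynamic-programming recurrence over prefixes of the two strings. The sketch below is a conventional implementation offered for illustration; the function name `edit_distance` and its parameter names are assumptions, and the patent does not state which algorithm is used to take the minimum. With the default weights of 1 it returns the unit cost.

```python
def edit_distance(a, b, w_ins=1, w_del=1, w_rep=1):
    """Minimum weighted cost of editing string a into string b."""
    # prev[j] is the cost of editing the processed part of a into b[:j].
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * w_del]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            cur.append(min(
                prev[j] + w_del,                       # delete a[i-1]
                cur[j - 1] + w_ins,                    # insert b[j-1]
                prev[j - 1] + (0 if same else w_rep),  # keep or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("abcd", "abxd"))          # 1 (one substitution)
print(edit_distance("abc", "abxc", w_ins=2))  # 2 (one insertion of weight 2)
```

Only two rows of the table are kept at a time, so the memory used grows with the length of b rather than with the product of both lengths.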

[0030] Here, the edit distance (unit cost) corresponds to the special case of the weighted edit distance in which $w_I = w_D = w_R = 1$. The prediction unit 50 obtains the predicted value by the method shown in FIG. 7. In the figure, the lower bound of the distance between the character string abcd and abx* (where * is an arbitrary sequence of characters) is taken as the predicted value. With the unit cost (the plain edit distance), the lower bound of the distance is obtained by comparing the leading characters. For example, in the figure, comparing "abc" of abcd with "abx" of abx* gives 0 if x = c and 1 if x ≠ c. With the weighted edit distance, the weight differs for each editing operation, so the distance changes depending on the characters that follow. For example, comparing "cd" of abcd with "*" of abx* gives: x = d → $w_D$; * = cd… → $w_I$; * = d… → $w_R$. In this case, the predicted value is set assuming the case in which the distance is smallest.
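
One way to obtain such a lower bound, offered here as an assumption rather than as the method of FIG. 7, is to compute the edit-distance row for the known prefix against every prefix of the key and take the smallest entry: whatever characters follow the known prefix, the final distance cannot be smaller than that minimum. The function name `prefix_lower_bound` and its parameters are hypothetical.

```python
def prefix_lower_bound(key, prefix, w_ins=1, w_del=1, w_rep=1):
    """Lower bound on the cost of editing key into any string that starts with prefix."""
    # row[j] is the cost of editing key[:j] into the part of prefix read so far.
    row = [j * w_del for j in range(len(key) + 1)]
    for ch in prefix:
        new = [row[0] + w_ins]
        for j in range(1, len(key) + 1):
            same = key[j - 1] == ch
            new.append(min(
                row[j] + w_ins,                       # insert ch
                new[j - 1] + w_del,                   # delete key[j-1]
                row[j - 1] + (0 if same else w_rep),  # keep or substitute
            ))
        row = new
    # The best case is that the rest of the stored string equals the rest of the key.
    return min(row)


print(prefix_lower_bound("abcd", "abc"))  # 0 (the case x = c)
print(prefix_lower_bound("abcd", "abx"))  # 1 (the case x != c)
print(prefix_lower_bound("abcd", "abd", w_ins=5, w_del=2, w_rep=5))  # 2, i.e. w_del
```

The last call reproduces the weighted case x = d from the paragraph above: the cheapest explanation is that c was deleted, so the bound equals the deletion weight.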

[0031] A series of operations is described below. FIG. 8 shows a search tree according to one embodiment of the present invention. Assume that text is stored in the memory 30 according to the tree structure shown in FIG. 8. First, the search unit 40 creates a search list and places the search tree on the search list as shown in FIG. 9. The search list is held by the search unit 40. Next, the root node is expanded to generate the subtrees a, b, and c. The prediction unit 50 predicts the difference between the search key and each of the subtrees a, b, and c by the method described above (let the predicted values be $P_b$, $P_a$, and $P_c$). The subtrees are arranged in order of predicted value and the search list is updated as shown in FIG. 10 (FIG. 10(A)). In the example of the figure, the predicted values satisfy $P_b < P_a < P_c$.

[0032] Next, the node of the subtree at the head of the search list is expanded, and the search list is updated according to the predicted values (FIG. 10(B)). In the example of the figure, the predicted values satisfy $P_k < P_a < P_l < P_c$. When a node is expanded, the difference (evaluation value) between the search key and the stored character string is fixed if the number of characters in the labels on the path to the subtree exceeds the number of characters in the search key, or if the search reaches a leaf of the tree. Character strings whose evaluation values ($C_E < C_F$) have been fixed are entered in the search result list in order of evaluation value (FIG. 10(C)). In the example of the figure, the evaluation values fixed by the search unit 40 satisfy $C_E < C_F$, so the search result list is ordered "E", "F".

[0033] Thereafter, this processing is repeated until the search list becomes empty or another termination condition is satisfied. When a search key and an upper limit on the difference from the search key are given, then, as shown in FIG. 11, if the predicted value satisfies $P_c > \varepsilon$, the subtree c, whose predicted value exceeds the upper limit of the difference ($\varepsilon$), need not be searched, so the search can be executed at high speed.

[0034] Likewise, when the top N results are retrieved in ascending order of difference, then, as shown in FIG. 12, because the search proceeds in ascending order of predicted value, character strings with small differences are retrieved preferentially, and because the lower bound of the difference of the character strings yet to be retrieved is available as the predicted value, the termination condition is clear and the top N results can be obtained without unnecessary searching.
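
The search list of FIGS. 9 to 12 behaves like a priority queue ordered by predicted value. The Python sketch below illustrates that idea under several assumptions that go beyond the text: it uses a plain trie instead of a Patricia tree, unit costs only, fixes an evaluation value only at the end of a stored string, and reserves `"$"` as an end marker. The names `build_trie`, `next_row`, and `approximate_search` are hypothetical.

```python
import heapq
from itertools import count


def build_trie(strings):
    """Plain (uncompressed) trie; the assumed marker "$" ends a stored string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = s
    return root


def next_row(row, key, ch):
    """Extend the unit-cost edit-distance row by one more path character."""
    new = [row[0] + 1]
    for j in range(1, len(key) + 1):
        new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (key[j - 1] != ch)))
    return new


def approximate_search(trie, key, eps=None, top_n=None):
    """Return (string, distance) pairs in ascending order of distance."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    first_row = list(range(len(key) + 1))
    # The search list: entries are ordered by predicted (or fixed) value.
    heap = [(min(first_row), next(tie), "subtree", trie, first_row)]
    results = []
    while heap:
        value, _, kind, node, row = heapq.heappop(heap)
        if eps is not None and value > eps:
            break  # every remaining entry is at least this far from the key
        if kind == "string":
            results.append((node, value))  # evaluation value already fixed
            if top_n is not None and len(results) == top_n:
                break
            continue
        for ch, child in node.items():  # expand the subtree at the head
            if ch == "$":
                heapq.heappush(heap, (row[-1], next(tie), "string", child, None))
            else:
                r = next_row(row, key, ch)
                heapq.heappush(heap, (min(r), next(tie), "subtree", child, r))
    return results


trie = build_trie(["abcd", "abxd", "abd", "xyz", "bcd"])
print(approximate_search(trie, "abcd", top_n=1))  # [('abcd', 0)]
```

Because a subtree's predicted value never exceeds the distance of any string inside it, the first entry popped with a value above `eps` proves that nothing closer remains, and the first N strings popped are the N closest. In this small example, `eps=1` returns "abcd", "abxd", "abd", and "bcd" and never reaches "xyz".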

[0035] As described above, according to this example, approximate search over a large amount of text can be performed at high speed. Although the above example has been described on the basis of the configuration of FIG. 3, the present invention can be realized easily by constructing the components of the character string search device shown in FIG. 3 as a program, storing it on a disk device connected to the computer used as the character string search device or on a portable storage medium such as a floppy disk or CD-ROM, and installing it when the present invention is to be carried out.

[0036] The present invention is not limited to the above example, and various modifications and applications are possible within the scope of the claims.

[0037]

[Effects of the invention] As described above, according to the present invention, the character strings of the given text are stored together with pointers, so that, by obtaining the difference (predicted value) from the selected search key at search time, it is possible to know at which position in which text a character string with a small difference appeared. By, for example, excluding subtrees with large predicted values from the search range or preferentially searching subtrees with small predicted values, character strings with small differences can be searched for efficiently, and approximate search over a large amount of text can be performed at high speed.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the principle configuration of the present invention.

FIG. 3 is a configuration diagram of the character string search device of the present invention.

FIG. 4 is a flowchart for explaining the operation (text accumulation phase) of the character string search device of the present invention.

FIG. 5 is a flowchart of the character string search phase of the present invention.

FIG. 6 is a diagram for explaining the edit distance in one embodiment of the present invention.

FIG. 7 is a diagram for explaining the predicted value calculation method in one embodiment of the present invention.

FIG. 8 is a diagram showing a search tree in one embodiment of the present invention.

FIG. 9 is an example of a search list in one embodiment of the present invention.

FIG. 10 is an example of updating the search list in one embodiment of the present invention.

FIG. 11 is a diagram showing an example of a search for character strings whose difference is ε or less in one embodiment of the present invention.

FIG. 12 is a diagram showing an example of a search for the top N results in one embodiment of the present invention.

FIG. 13 is an example of a Patricia tree.

FIG. 14 is an example of the construction of a Patricia tree.

FIG. 15 is an example of a character string search in a Patricia tree.

[Explanation of symbols]

10: character string dividing means, character string dividing unit
20: storage means, storage unit
30: memory
40: search means, search unit
50: prediction means, prediction unit
60: text input device
70: display device
80: search key input device

Claims

[Claims]

1. A character string search method for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, comprising: cutting out character strings from the given text; storing the cut-out character strings as a tree structure; predicting a difference between a character string input by a user and the stored character strings; and searching the character strings stored as the tree structure based on the predicted difference.

2. The character string search method according to claim 1, wherein, when the character strings are stored as a tree structure, character strings generated from a plurality of texts are stored in one tree, and each leaf of the tree has a pointer indicating the position of the character string.

3. The character string search method according to claim 1, wherein when the character string is stored as a tree structure, the character string is generated while shifting a starting point of the given text.

4. The character string search method according to claim 3, wherein, when the character strings are stored as a tree structure, the character strings are restricted so as to limit the number of character strings generated.

5. The character string search method according to claim 1, wherein, when searching for the character string, a root node of the tree is expanded to generate subtrees, a difference between the character string input by the user and each subtree is predicted, the subtrees are arranged in order of predicted value to generate a search list, a node of the subtree at the head of the search list is expanded, the search list is updated according to the predicted values, and the character string is searched for based on the updated search list.

6. A character string search method for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, comprising predicting a difference between a character string input by a user and character strings that have been cut out from the given text and stored in a tree structure.

7. A character string search method for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, wherein a tree structure storing character strings cut out from the given text is searched, based on a search key input by a user, according to a predicted difference.

8. A character string search device for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, comprising: character string dividing means for cutting out character strings from the given text; storage means for storing the character strings cut out by the character string dividing means in a tree structure; prediction means for predicting a difference between a character string input by a user and the stored character strings; and search means for searching the tree structure of the storage means based on a predicted value predicted by the prediction means.

9. The character string search device according to claim 8, wherein the character string storage means stores character strings generated from a plurality of texts as one tree structure, and each leaf of the tree structure has a pointer indicating the position of the character string.

10. The character string search device according to claim 8, wherein said character string storage means includes means for generating a character string while shifting a starting point of the given text.

11. The character string search device according to claim 10, wherein said character string storage means includes means for restricting the character strings so as to limit the number of character strings generated.

12. The character string search device according to claim 8, wherein the prediction means includes: subtree prediction means for expanding a root node of the tree to generate subtrees and predicting a difference between the character string input by the user and each subtree; search list generating means for arranging the subtrees in order of predicted value to generate a search list; and search list updating means for expanding a node of the subtree at the head of the search list and updating the search list according to the predicted values; and wherein the search means includes means for searching for the character string based on the search list updated by the search list updating means.

13. A character string search device for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, comprising prediction means for predicting a difference between a character string input by a user and character strings cut out from the given text.

14. The character string search device according to claim 13, wherein the prediction means comprises: subtree prediction means for expanding a root node of the tree to generate subtrees and predicting a difference between the character string input by the user and each subtree; search list generating means for arranging the subtrees in order of predicted value to generate a search list; and search list updating means for expanding a node of the subtree at the head of the search list and updating the search list according to the predicted values.

15. A character string search device for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, comprising search means for searching a tree structure of storage means based on a predicted difference between a character string input by a user and character strings cut out from the given text.

16. A storage medium storing a character string search program for searching for a character string in a given text, musical notes, a DNA base sequence, or the like, the program comprising: a character string division process of cutting out character strings from the given text; a storage process of storing the character strings cut out by the character string division process in a tree structure; a prediction process of predicting a difference between a character string input by a user and the stored character strings; and a search process of searching the tree structure of the storage process based on a predicted value predicted by the prediction process.

17. The storage medium storing a character string search program according to claim 16, wherein the character string storage process stores character strings generated from a plurality of texts as one tree structure, and each leaf of the tree structure has a pointer indicating the position of the character string.

18. The storage medium storing a character string search program according to claim 16, wherein the prediction process includes: a subtree prediction process of expanding a root node of the tree to generate subtrees and predicting a difference between the character string input by the user and each subtree; a search list generation process of arranging the subtrees in order of predicted value to generate a search list; and a search list updating process of expanding a node of the subtree at the head of the search list and updating the search list according to the predicted values; and wherein the search process includes a process of searching for the character string based on the search list updated in the search list updating process.