JP2002157252A

JP2002157252A - Device and method for retrieving document and computer readable recording medium recording program for computer to execute the same method

Info

Publication number: JP2002157252A
Application number: JP2000351637A
Authority: JP
Inventors: Hideaki Nakayama (中山 秀明)
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-11-17
Filing date: 2000-11-17
Publication date: 2002-05-31

Abstract

PROBLEM TO BE SOLVED: To improve search efficiency and simplify the program in full-text search by processing regular-expression queries with reference to an index, so that the original documents need not be consulted. SOLUTION: A document retrieval device that performs full-text search is provided with a search string/regular expression input part 101 for inputting a character string or regular expression, a regular expression processing part 102 that creates a finite automaton corresponding to the input regular expression, an index data storage part 104 that stores index data, and a processing part 103 that performs state transitions of the finite automaton while referring to the index data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a document retrieval device and a document retrieval method that are used in personal computers, electronic filing systems, and the like, that provide full-text search over digitized document data, and that in particular perform searches by regular expression efficiently. It also relates to a computer-readable recording medium storing a program that causes a computer to execute the document retrieval method.

[0002]

PRIOR ART

With the recent spread of personal computers, the number of documents created on computers or handled via communication such as the Internet has become enormous. Documents in such volume are filed by titles, keywords, and the like assigned to each document so that later retrieval (identifying and fetching a document) is easy. It is common practice to search for and read out documents later using these titles and keywords as clues.

[0003] One way to search text documents quickly is to use an index that records the positions of every character appearing in the text. In this method, each text is given a number, and an index is created that uses characters as keys and lists, for every character appearing in a text, the text number and the positions at which the character occurs in that text. For example, if the document numbered 314 begins with "テレテキストが", an index holding the occurrence positions for each character key is created, as shown in FIG. 6.
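The index described above can be sketched as follows. This is a minimal illustration, not the layout of FIG. 6 (which is not reproduced here): the function name `build_index` and the dictionary representation are assumptions, and positions are 1-based to agree with the example in the text.

```python
from collections import defaultdict

def build_index(documents):
    """Map each character to a list of (document number, 1-based position) pairs.

    `documents` is a hypothetical {document number: text} mapping.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, ch in enumerate(text, start=1):
            index[ch].append((doc_id, position))
    return dict(index)

index = build_index({314: "テレテキストが"})
print(index["テ"])  # [(314, 1), (314, 3)]
print(index["キ"])  # [(314, 4)]
```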

[0004] Conventionally, creating such an index makes it possible to find quickly at which position in which document a particular character string appears. The conventional search procedure is shown in the flowchart of FIG. 7. First, the length of the given character string is placed in register L, counter N is set to 1 (step S21), and the set that will hold the result is emptied. For example, when "テキスト" is given, register L is set to 4, counter N is set to 1, and the result set S is emptied.

[0005] Next, the index is consulted for the first character of the given character string to find in which document and at which position that character occurs, and these occurrences are added to the set (step S22). In this example the first character is "テ", so the set contains (314, 1) and (314, 3).

[0006] Next, counter N is incremented by one (step S23), and it is determined whether counter N is greater than register L (step S24). If counter N is greater than register L (Yes), the set S at that point is the result: S holds the documents in which the character string exists together with the positions at which it ends (step S26), and the process terminates.

[0007] If, on the other hand, counter N is determined in step S24 to be less than or equal to register L (No), the N-th character of the given character string is taken and the index is consulted to obtain in which document and at which position that character occurs. Each occurrence is compared with the elements of S: if S contains an element with the same document number and a position smaller by one, that element is replaced by the occurrence (step S25), and the elements of S that were not replaced are discarded. This is repeated.

[0008] For example, when N is 2, "キ" is obtained, and consulting the index yields (314, 4). Because (314, 3) has the same document number and a position smaller by one, (314, 3) is replaced by (314, 4), and (314, 1), which was not replaced, is discarded. Repeating this, when N reaches 5 the set S contains (314, 6), which shows that "テキスト" occurs in document 314 at a position ending at the sixth character.
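The conventional procedure of FIG. 7 can be sketched as below. This is an illustrative reading of steps S21 to S26 rather than the patent's own code; the names `build_index` and `find_string` are assumptions, and the index is held as a dictionary of sets.

```python
from collections import defaultdict

def build_index(documents):
    """Hypothetical index: character -> set of (document number, 1-based position)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for position, ch in enumerate(text, start=1):
            index[ch].add((doc_id, position))
    return index

def find_string(index, query):
    """Return the (document number, end position) pairs where `query` occurs."""
    if not query:
        return set()
    length = len(query)                     # register L (step S21)
    result = set(index.get(query[0], ()))   # occurrences of the 1st character (step S22)
    for n in range(2, length + 1):          # counter N = 2 .. L (steps S23, S24)
        occurrences = index.get(query[n - 1], ())
        # Keep only occurrences that directly follow an element of the set;
        # elements that are not extended are dropped (step S25).
        result = {(doc, pos) for (doc, pos) in occurrences
                  if (doc, pos - 1) in result}
    return result                           # step S26

index = build_index({314: "テレテキストが"})
print(find_string(index, "テキスト"))  # {(314, 6)}
```

Only the index is consulted; the document text is used once, when the index is built.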

[0009]

PROBLEM TO BE SOLVED BY THE INVENTION

With the conventional method described above, the position and document in which a particular character string occurs can be found quickly by consulting only the index. However, to find where a string matching a regular expression occurs, the only known method was to scan all of the original text exhaustively. Moreover, because the processing differs depending on whether the search condition is a simple character string or a regular expression, the program becomes complicated. In addition, because a regular-expression search scans every document exhaustively, it takes longer than a search for a simple character string.

[0010] The present invention has been made in view of the above. Its object is to improve search efficiency and simplify the program by processing regular-expression searches in full-text search with reference to the index, so that the original documents need not be consulted.

[0011]

MEANS FOR SOLVING THE PROBLEM

To achieve the above object, the document retrieval device according to claim 1 is a document retrieval device that performs full-text search, comprising: search string/regular expression input means for inputting a character string or regular expression; regular expression processing means for creating a finite automaton corresponding to the regular expression input from the search string/regular expression input means; index data storage means in which index data is stored; and processing means for performing state transitions of the finite automaton while referring to the index data in the index data storage means.

[0012] According to this invention, a character string or regular expression to be searched for is input to the search string/regular expression input means. Because a character string is regarded as a kind of regular expression, both character strings and regular expressions are sent to the regular expression processing means, where a finite automaton is created. The processing means performs state transitions according to this finite automaton while referring to the index data stored in the index data storage means, and thereby searches for the string without referring to the original documents.

[0013] In the document retrieval device according to claim 2, each node representing a state of the finite automaton is provided with two slots, and the processing means executes the state transitions of the finite automaton while moving sets of (document number, character position) pairs between the different slots.

[0014] According to this invention, each node representing a state of the finite automaton is provided with two slots for holding sets of document numbers and character positions. The processing means executes the state transitions of the finite automaton by moving the sets of (document number, character position) pairs between the different slots. When no transition is possible from any node, the sets in the slots of the accepting state are taken to be the numbers of the documents containing a string that matches the regular expression together with the end positions of those strings, and document retrieval is performed on that basis.

[0015] The document retrieval method according to claim 3 is a document retrieval method for performing full-text search in which a finite automaton is created from an input regular expression, and strings matching the regular expression are detected by referring to an index that records the positions of characters.

[0016] According to this invention, when a regular-expression query is given, a finite automaton is created from the regular expression and the search follows the finite automaton while referring to an index that records the positions of characters. Strings matching the regular expression can therefore be detected by referring to the index data alone, without referring to the original text.

[0017] In the document retrieval method according to claim 4, the state transitions of the finite automaton are executed using sets of (document number, character position) pairs.

[0018] According to this invention, document retrieval using a finite automaton is realized by performing the state transitions of the finite automaton with a set of (document number, character position) pairs held for each state.

[0019] In the document retrieval method according to claim 5, each node representing a state is provided with two slots, and when a state transition is performed the sets of (document number, character position) pairs are moved between the different slots.

[0020] According to this invention, the state transitions of the finite automaton are executed by moving the sets of (document number, character position) pairs between the different slots. When no transition is possible from any node, the sets in the slots of the accepting state are taken to be the numbers of the documents containing a string that matches the regular expression together with the end positions of those strings, and the search for the target documents is carried out on that basis.

[0021] The computer-readable recording medium according to claim 6 stores a program that causes a computer to execute the document retrieval method according to any one of claims 3 to 5.

[0022] According to this invention, because a program that causes a computer to execute the document retrieval method according to any one of claims 3 to 5 is recorded, that document retrieval method can be realized by a computer.

[0023]

EMBODIMENTS OF THE INVENTION

Preferred embodiments of the document retrieval device and document retrieval method according to the present invention, and of the computer-readable recording medium storing a program that causes a computer to execute the document retrieval method, are described in detail below with reference to the accompanying drawings. The present invention is not limited to these embodiments.

[0024] In a full-text search system, when a query is given as a regular expression (a recursively defined input sequence; a way of describing character strings using special characters called metacharacters), the present invention finds the matching strings quickly by referring only to the index data, without referring to the original text. In addition, by treating a simple character string as a kind of regular expression, searching for matches of a regular expression and checking for the presence of a simple character string can be handled uniformly, which simplifies the program. The device configuration, method, and other details are described below with concrete examples.

[0025] FIG. 1 is a block diagram showing the functional configuration of a full-text search system according to an embodiment of the present invention. In the figure, reference numeral 101 denotes a search string/regular expression input unit for inputting a character string or regular expression, 102 denotes a regular expression processing unit that creates a finite automaton, 103 denotes a processing unit that performs state transitions of the automaton while referring to index information, and 104 denotes an index data storage unit in which index data is stored.

[0026] The finite automaton is described here. Consider using a finite automaton to check whether a given character string is a sentence of a certain language. When the character string is fed to the model in its initial state, the next state is determined by the combination of each incoming character and the current internal state, and a state transition of the finite automaton takes place. If the finite automaton is in one of the predetermined final states once the whole character string has been fed in, the string is accepted. In this way it is determined whether the character string belongs to the language.
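As a small illustration of acceptance by a finite automaton, the sketch below runs a deterministic transition table over a string. The automaton (states `q0` to `q2`, accepting "a", then any number of "b", then "c") and the function name `accepts` are hypothetical and are not taken from the patent.

```python
def accepts(transitions, initial_state, final_states, text):
    """Feed `text` one character at a time and report whether it is accepted."""
    state = initial_state
    for ch in text:
        # The next state depends on the current state and the incoming character.
        state = transitions.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in final_states   # accepted only if we stop in a final state

# Hypothetical automaton: "a", then zero or more "b", then "c".
transitions = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}

print(accepts(transitions, "q0", {"q2"}, "abbc"))  # True
print(accepts(transitions, "q0", {"q2"}, "abca"))  # False
```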

[0027] In the full-text search system configured as described above, the search string/regular expression input unit 101 first receives the character string or regular expression to be searched for. Because a character string can be regarded as a kind of regular expression, both character strings and regular expressions are sent to the regular expression processing unit 102. The regular expression processing unit 102 creates a finite automaton from the input regular expression and sends it to the processing unit 103. The processing unit 103 receives the finite automaton created by the regular expression processing unit 102 and executes state transitions while referring to the index data stored in the index data storage unit 104.

[0028] FIG. 2 is a block diagram showing the configuration of a document retrieval device according to the embodiment of the present invention, which incorporates and embodies the full-text search system described above. In the figure, the document retrieval device has a CPU 202, a ROM 203, a RAM 204, a keyboard 205, a disk device 206, and a display 207 connected to a system bus 201.

[0029] The keyboard 205 and the display 207 correspond to the search string/regular expression input unit 101 in FIG. 1 and are where the search string or regular expression is entered. The display 207 also shows the search results. The CPU 202 executes the document retrieval program of the present invention: it creates a finite automaton from the input regular expression, performs the state transitions of the finite automaton while referring to the index data, and outputs the result to the display 207. The ROM 203 stores the program executed by the CPU 202, which is loaded into the RAM 204 as needed. Depending on the case, the program may be stored in the disk device 206 instead of the ROM 203.

[0030] The RAM 204 holds the program while the CPU 202 executes it. The RAM 204 also stores the finite automaton corresponding to the regular expression and holds the intermediate and final results of the state transitions. The disk device 206 stores the index data. Depending on the size of the index data and the intended use of the retrieval device, the index data may instead be stored in the RAM 204 or the ROM 203.

[0031] Methods for constructing, from a regular expression, a finite automaton that detects strings matching that regular expression are already known. The present invention therefore creates a finite automaton from the regular expression. A finite automaton normally consists of nodes representing states and directed edges connecting the nodes, and each edge carries the set of characters that trigger its transition. There are two special states, the start state (initial state) and the accepting state. Processing begins at the start state. Having moved to some state, if the given character is in the character set of an edge leaving that state, the automaton moves to the state reached through that edge. When the accepting state is reached, a matching string has been detected.

[0032] In the present invention, each node representing a state of the finite automaton is provided with two slots for holding sets of (document number, character position) pairs (hereinafter, slot 1 and slot 2). FIG. 3 is a flowchart showing the procedure of the document retrieval method according to the embodiment of the present invention. As described above, the finite automaton has a start state and an accepting state. First, empty sets are placed in slot 1 and slot 2 of the accepting-state node, the variable T, which holds the slot number of the transition destination, is initialized to 2, and the number of the currently active slot, 1, is put into S (step S11). Then an empty set is placed in slot T of every state other than the accepting state (step S12).

[0033] Next, for each character that triggers a transition on an edge leaving the start state, the index is consulted and the set of document numbers and positions at which the character occurs is added to the set in slot S of the edge's destination node (step S13). It is then determined whether any state other than the accepting state has a non-empty set in slot S (step S14). If so, the following operation is performed for every state, other than the accepting state, whose slot S holds a non-empty set (step S15).

[0034] In step S15, for each character that triggers a transition on an edge leaving the node, the index is consulted to obtain document numbers and positions. Each occurrence for which the set in slot S contains an element with the same document number and a position smaller by one is added to the set in slot T of the edge's destination node. The values of S and T are then swapped, an empty set is placed in slot T of every state except the accepting state (step S16), and processing returns to step S14 and repeats. If, on the other hand, it is determined in step S14 that no state has a non-empty set in slot S, the sets in slots 1 and 2 of the accepting state give the end positions of the matching strings (step S17).

[0035] In other words, the finite automaton has a start state and an accepting state. First, empty sets are placed in slot 1 and slot 2 of the accepting-state node, 2 is assigned as the initial value of the variable T that holds the destination slot number, and the currently active slot number, 1, is put into S. Next, for the characters that trigger transitions on all edges leaving the start state, the index is consulted and the resulting set of document numbers and character positions is added to the set in slot 1 of the state at the edge's destination. The following operation is then repeated.

[0036] An empty set is placed in the slot designated by T for every node (except that the sets in the slots of the accepting-state node are left as they are). Next, attention is given to every node (other than the accepting state) that has a non-empty set in the slot designated by S. For every edge leaving such a node, the index is consulted for the characters in the edge's character set to obtain the document numbers and positions at which they occur. Of these, only the occurrences that have the same document number as an element of the set in the node's slot and a position larger by one are kept, and they are added to the set in the slot designated by T in the edge's destination node. Finally, the values of S and T are swapped. This operation is repeated, and when no transition is possible from any node, the sets in the slots of the accepting state give the numbers of the documents containing a string that matches the regular expression together with the end positions of those strings.

[0037] Next, the operation described above is explained with a concrete example. Suppose that the regular expression shown in FIG. 4 is entered from the keyboard 205. This regular expression denotes strings that begin with "日本" (Japan), followed by zero or more characters other than "。" and "、", followed by "協会" (association). For example, "日本万年青協会" and "日本放送協会" match this regular expression, whereas "日本の、たくさんの協会" does not.
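The matching behaviour described for this example can be checked with Python's standard `re` module. FIG. 4 is not reproduced here, so the pattern below is an assumed rendering of the described expression in Python syntax, and the variable names are illustrative.

```python
import re

# Assumed Python form of the FIG. 4 expression: "日本", then zero or more
# characters other than "。" and "、", then "協会".
pattern = re.compile(r"日本[^。、]*協会")

candidates = ["日本万年青協会", "日本放送協会", "日本の、たくさんの協会"]
results = {text: pattern.fullmatch(text) is not None for text in candidates}

for text, matched in results.items():
    print(text, matched)
```

The first two strings match and the third does not, because "、" cannot appear between "日本" and "協会".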

[0038] For this regular expression, a finite automaton consisting of nodes and edges is created, as shown in FIG. 5. In the figure, the rectangular blocks are the nodes N1 to N5 representing states, the arrows are the edges 300, and what is written above each arrow is the set of characters that trigger the transition. The leftmost block (node N1) is the start state 301, and the rightmost block (node N5) is the accepting state 302. The two rectangles in the lower part of each of the nodes N1 to N5 are the slots 303, which hold sets of (document number, character position) pairs.

[0039] Once such a finite automaton has been created, empty sets are first placed in both slots S1 and S2 of node N5, the accepting state 302. Taking the upper slot as slot 1 and the lower slot as slot 2, empty sets are then placed in the lower slot of the nodes N1 to N4, that is, all nodes except the accepting state 302. The index data is then consulted to obtain the pairs of document number and position at which "日" occurs, and these are placed in slot 1 of node N2, the second node from the left.

[0040] In addition, 1 is put into the variable S, which holds the number of the currently active slot, and 2 is put into the destination slot number T. The following is then repeated. For every node among N1 to N4 (all nodes other than the accepting state 302) whose slot S holds a non-empty set, the document numbers and positions at which the characters in the character set of each outgoing edge occur are obtained from the index data. Those occurrences that lie at the position immediately after a position held in slot S are selected and added to the set in slot T of the edge's destination node. The values of S and T are then swapped, and empty sets are placed in slot T of the nodes N1 to N4 other than the accepting state 302. Through this processing, the end positions of the strings that match the regular expression shown in FIG. 4 are collected in the two slots of the accepting state 302.
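The worked example can be reproduced with the sketch below. FIG. 5 is not reproduced here, so the edge list is an assumption based on the description: N1 to N2 on "日", N2 to N3 on "本", a loop on N3 for any character other than "。" and "、", N3 to N4 on "協", and N4 to N5 on "会". The three sample documents and all function names are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Hypothetical index: character -> set of (document number, 1-based position)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for pos, ch in enumerate(text, start=1):
            index[ch].add((doc_id, pos))
    return index

def lookup(index, label):
    """All (document, position) pairs for characters accepted by an edge label."""
    return {entry for ch, entries in index.items() if label(ch) for entry in entries}

def is_char(c):
    return lambda ch: ch == c

def not_in(excluded):
    return lambda ch: ch not in excluded

# Assumed shape of the FIG. 5 automaton (N1 = start 301, N5 = accept 302).
EDGES = [
    ("N1", is_char("日"), "N2"),
    ("N2", is_char("本"), "N3"),
    ("N3", not_in("。、"), "N3"),
    ("N3", is_char("協"), "N4"),
    ("N4", is_char("会"), "N5"),
]
NODES = ["N1", "N2", "N3", "N4", "N5"]

def run(index, start="N1", accept="N5"):
    slots = {n: {1: set(), 2: set()} for n in NODES}
    S, T = 1, 2
    for src, label, dst in EDGES:              # seed slot S from the start state
        if src == start:
            slots[dst][S] |= lookup(index, label)
    while any(slots[n][S] for n in NODES if n != accept):
        for src, label, dst in EDGES:
            current = slots[src][S]
            if src == accept or not current:
                continue
            slots[dst][T] |= {(d, p) for d, p in lookup(index, label)
                              if (d, p - 1) in current}
        S, T = T, S                            # swap the active and destination slots
        for n in NODES:
            if n != accept:
                slots[n][T] = set()
    return slots[accept][1] | slots[accept][2]

documents = {
    1: "日本放送協会",
    2: "日本の、たくさんの協会",
    3: "これは日本万年青協会です",
}
matches = run(build_index(documents))
print(sorted(matches))  # [(1, 6), (3, 10)]
```

Each pair gives a document number and the position of the last character of a match. Document 2 yields nothing because "、" is excluded from the loop on N3.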

[0041]

The document search method described so far can also be written as a program, recorded on a computer-readable recording medium, and executed on a computer. Part of the document search method can also reside on a network and be carried out over a communication line.

[0042]

That is, the document search method described in this embodiment is realized by executing a program prepared in advance on a computer (CPU) such as a personal computer or a workstation. The program is recorded, through keyboard operation or the like, on a computer-readable recording medium such as a memory, hard disk, floppy (registered trademark) disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer (CPU).

[0043]

Where necessary, the processing data of this document search can also be exchanged between the communication device and an external device. The program can also be distributed to individual devices such as personal computers via the above recording medium, over a network such as the Internet.

[0044]

[Effects of the Invention] As described above, according to the document search device of the present invention (claim 1), a character string or regular expression to be used for full-text search is input to the search character string / regular expression input means. Because a character string is regarded as a kind of regular expression, both character strings and regular expressions are sent to the regular expression processing means, where a finite automaton is created. The processing means searches for character strings by making state transitions in accordance with this finite automaton while referring to the index data stored in the index data storage means. Matching character strings can therefore be found quickly by referring to the index data alone, without referring to the original text. In addition, because a simple character string is regarded as a kind of regular expression, searching for matches of a regular expression and checking for the presence of a simple character string can be handled uniformly, which simplifies the program.
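The point that a plain character string can be handled as a special case of a regular expression can be illustrated with a small Python sketch. It is a hypothetical simplification: the function name `pattern_to_edges`, the edge-triple output format, and the restriction to literal characters and `[...]` character classes are assumptions made for illustration, and the regular expression processing means of the invention is not limited to this form.

```python
def pattern_to_edges(pattern):
    """Turn a pattern into a chain automaton of (source, charset, destination) edges.

    Supports only literal characters and [...] character classes (an assumption
    made for this sketch). Returns (edges, start_state, accepting_state).
    """
    edges, state, i = [], 0, 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.index("]", i)       # raises ValueError if unclosed
            chars = set(pattern[i + 1:close])   # any one of the listed characters
            i = close + 1
        else:
            chars = {pattern[i]}                # a literal is a one-element class
            i += 1
        edges.append((state, chars, state + 1))
        state += 1
    return edges, 0, state


# A plain string and a pattern with a class yield the same kind of structure.
literal = pattern_to_edges("cat")
regex = pattern_to_edges("c[ao]t")
```

Both calls return the same data shape, so a single index-driven traversal can serve plain-string lookups and regular expression searches alike.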

[0045]

According to the document search device of the present invention (claim 2), each node representing a state of the finite automaton is provided with two slots for holding a set of pairs of document number and character appearance position. The processing means carries out the state transitions of the finite automaton by moving these sets of pairs between the different slots. When no transition is possible from any node, the set held in the slots of the accepting state is taken to be the numbers of the documents containing a character string that matches the regular expression, together with the end positions of those strings. Document search using a finite automaton is thereby realized.

[0046]

According to the document search method of the present invention (claim 3), when a regular expression query is given, a finite automaton is created from the regular expression, and the search is performed in accordance with the finite automaton by referring to an index in which character positions are recorded. Matching character strings can therefore be found quickly by referring to the index data alone, without referring to the original text. In addition, because a simple character string is regarded as a kind of regular expression, searching for matches of a regular expression and checking for the presence of a simple character string can be handled uniformly, which simplifies the program.

[0047]

According to the document search method of the present invention (claim 4), in claim 3, a finite automaton is created from the regular expression, and its state transitions are carried out using, for each state, a set of pairs of document number and character appearance position. This realizes a search method based on state transitions from the start state to the accepting state of the finite automaton.

[0048]

According to the document search method of the present invention (claim 5), in claim 4, the state transitions of the finite automaton are carried out by moving the sets of pairs of document number and character appearance position between the different slots. When no transition is possible from any node, the set held in the slots of the accepting state is taken to be the numbers of the documents containing a character string that matches the regular expression, together with the end positions of those strings. When searching the target documents, it is therefore possible to detect that a character string has been found whenever the accepting state (final state) is reached.

[0049]

According to the computer-readable recording medium of the present invention (claim 6), a program for causing a computer to execute the document search method according to any one of claims 3 to 5 is recorded, so that the document search method according to any one of claims 3 to 5 can be realized by a computer.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the functional configuration of a full-text search system according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a document search device according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the procedure of a document search method according to the embodiment of the present invention.

FIG. 4 is a diagram showing an example of a regular expression character string according to the embodiment of the present invention.

FIG. 5 is an explanatory diagram showing an example of a finite automaton created for a regular expression according to the embodiment of the present invention.

FIG. 6 is an explanatory diagram showing an example of a conventional text search.

FIG. 7 is a flowchart showing an example of a conventional document search.

[Explanation of symbols]

101 search character string / regular expression input unit
102 regular expression processing unit
103 processing unit
104 index data storage unit
202 CPU
203 ROM
205 keyboard
207 display
300 edge
301 start state
302 accepting state
303 slot

Claims

[Claims]

1. A document search device for performing full-text search, comprising: search character string / regular expression input means for inputting a character string or a regular expression; regular expression processing means for creating a finite automaton corresponding to the regular expression character string input from the search character string / regular expression input means; index data storage means for storing index data; and processing means for causing the finite automaton to make state transitions while referring to the index data in the index data storage means.

2. The document search device according to claim 1, wherein the finite automaton has two slots in each node representing a state, and the processing means executes the state transitions of the finite automaton while moving sets of pairs of document number and character appearance position between the different slots.

3. A document search method for performing full-text search, comprising: creating a finite automaton from an input regular expression; and detecting a character string that matches the regular expression by referring to an index in which character positions are recorded.

4. The document search method according to claim 3, wherein the state transitions of the finite automaton are executed using sets of pairs of document number and character appearance position.

5. The document search method according to claim 4, wherein two slots are provided in each node representing a state, and when a state transition is performed, the set of pairs of document number and character appearance position is moved between the different slots.

6. A computer-readable recording medium on which is recorded a program for causing a computer to execute the document search method according to any one of claims 3 to 5.