JPH10307841A - Text retrieving device and its method

Info

Publication number: JPH10307841A
Application number: JP9119870A
Authority: JP
Inventors: Shogo Shibata; 昇吾柴田; Shiro Ito; 史朗伊藤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1997-05-09
Filing date: 1997-05-09
Publication date: 1998-11-17

Abstract

PROBLEM TO BE SOLVED: To improve the efficiency of search processing by allowing the result of the first operation on a duplicate pattern in a search string to be reused when that pattern appears again. SOLUTION: A two-gram index storing part 101 stores a two-gram index that uses each pair of consecutive characters in the searched text as a key and registers the positions of those characters in the searched text. A duplicate pattern detection part 103 extracts, from the search string stored in a search word storing part 102, a duplicate pattern consisting of two or more consecutive characters. A position retrieving part 104 retrieves the positions of the duplicate pattern detected by the detection part 103 and stores the result in a comparison operation result storing part 105. The position retrieving part 104 also retrieves the positions of the characters other than the duplicate pattern, and obtains the positions of the search string from those results and the contents of the storing part 105.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a text search apparatus and method that use an index to retrieve character strings from text data at high speed, and in particular to a text search apparatus and method that speed up search processing in text search based on the N-gram index method.

[0002]

[Description of the Related Art] Text search apparatuses, such as full-text search apparatuses that search all of the text in documents for those containing a given search word, use indexing so that large volumes of text can be searched quickly: an index of the searched text is built in advance, and searches are carried out against that index.

[0003] One such indexing method is called the N-gram index. An N-gram index uses each run of N consecutive characters in the searched text as a key and holds, for each key, a position list enumerating the positions at which that character string occurs in the searched text.

[0004] For example, if N is fixed at 2, the index is built by cutting the character string into two-character units. As an example, FIG. 4 shows the result for the character string "ジェントル・エージェント" ("gentle agent"). A 2-gram index is created by registering, for each character string obtained in this way, its positions of occurrence in the document text. Compared with a method that indexes single characters, the number of distinct index patterns increases, but the number of occurrences per index entry drops sharply, which makes faster search processing possible.
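As a rough illustration of the paragraph above, the following minimal sketch builds a 2-gram index in memory. It assumes that every pair of adjacent characters is registered (a sliding window), which is one common reading of the method; FIG. 4 is not reproduced here, so the exact extraction scheme is an assumption, and the function name `build_bigram_index` is hypothetical.

```python
from collections import defaultdict

def build_bigram_index(text):
    """Map each pair of adjacent characters to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - 1):
        index[text[pos:pos + 2]].append(pos)
    return dict(index)

index = build_bigram_index("ジェントル・エージェント")
print(index["ジェ"])  # [0, 8]
print(index["ント"])  # [2, 10]
```

Positions are 0-based character offsets; a production index would typically also record a document identifier, which is omitted here.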

[0005] Search processing using an N-gram index is described next. Here too N is fixed at 2, but the basic procedure is the same for other values of N.

[0006] First, the search string is cut into two-character units from the beginning, the 2-gram index is queried for each unit, and the resulting occurrence positions are stored. FIG. 5 shows an example of taking two characters at a time from the search string. First, the 2-gram index is searched for the two characters "ジェ" and their position information is retrieved. Next, position information is retrieved in the same way for the following two characters, "ント", and only those positions that are exactly 2 greater than a position obtained for the preceding "ジェ" are retained. All other positions are discarded because the character strings are not contiguous there.

[0007] The positions that survive this check for every two-character unit taken from the search string are the positions at which the search string occurs. If the length of the search string is odd, position information for the final single character is retrieved from a character position index and processed in the same way. The character position index registers, for each character in the searched text, the positions of that character in the searched text. In other words, when an N-gram index is used, a character position index is used alongside it.
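The conventional procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the indexes are plain in-memory dictionaries, the names `build_indexes` and `conventional_search` are hypothetical, and the contiguity check is done with set membership rather than the pairwise comparison of position lists that the text describes.

```python
from collections import defaultdict

def build_indexes(text):
    """Build a 2-gram index and a single-character position index."""
    bigrams, chars = defaultdict(list), defaultdict(list)
    for pos, ch in enumerate(text):
        chars[ch].append(pos)
        if pos + 1 < len(text):
            bigrams[text[pos:pos + 2]].append(pos)
    return bigrams, chars

def conventional_search(query, bigrams, chars):
    """Return the start positions of query, checking 2-character units left to right."""
    if not query:
        return []
    first = query[:2]
    # A one-character query falls back to the character position index.
    candidates = set((bigrams if len(first) == 2 else chars).get(first, []))
    offset = 2
    while offset < len(query) and candidates:
        unit = query[offset:offset + 2]
        # An odd-length query ends in a single character.
        source = bigrams if len(unit) == 2 else chars
        positions = set(source.get(unit, []))
        # Keep only candidates whose next unit starts exactly `offset` later.
        candidates = {p for p in candidates if p + offset in positions}
        offset += 2
    return sorted(candidates)

text = "ジェント、ジェントル・エージェントです"
bigrams, chars = build_indexes(text)
print(conventional_search("ジェントル・エージェント", bigrams, chars))  # [5]
```

Note that the unit "ジェ" and the unit "ント" are each looked up and compared twice for this query, which is the redundancy the invention targets.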

[0008]

[Problem to be Solved by the Invention] In the conventional apparatus described above, however, the operation that combines the position information of the following two characters with the preceding character string is costly. For example, if the preceding string "ジェ" occurs N times and the following string "ント" occurs M times, N × M position comparisons are performed, so as N and M grow the comparison operations account for an increasingly large share of the processing.
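To make the N × M cost concrete, here is a small sketch of the pairwise comparison, with a counter added purely for illustration. The function name `join_naive` and its `gap` parameter are hypothetical and do not come from the source.

```python
def join_naive(front, back, gap=2):
    """Compare every front position with every back position.

    Returns the front positions that are followed by a back position
    exactly `gap` characters later, plus the number of comparisons made.
    """
    comparisons = 0
    kept = []
    for p in front:
        for q in back:
            comparisons += 1
            if q == p + gap:
                kept.append(p)
    return kept, comparisons

# "ジェ" at positions 0 and 8, "ント" at positions 2 and 10: 2 x 2 comparisons.
print(join_naive([0, 8], [2, 10]))  # ([0, 8], 4)
```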

[0009] The present invention has been made in view of the above problem. Its object is to provide a text search apparatus and method that check whether the search string contains a duplicate pattern, that is, an identical substring occurring more than once, and, when it does, allow the search result for that pattern to be reused at its next occurrence, thereby improving the efficiency of search processing.

[0010]

[Means for Solving the Problem] To achieve the above object, a text search apparatus according to the present invention has the following configuration. Namely, it comprises: holding means for holding a first index in which, using characters in the searched text as keys, the positions of those characters in the searched text are registered; extraction means for extracting, from a given search string, a duplicate pattern that consists of two or more consecutive characters and occurs more than once in the search string; first search means for searching for the positions of the duplicate pattern extracted by the extraction means, using the first index held in the holding means; second search means for searching for the positions of the characters in the search string other than the duplicate pattern, using the first index held in the holding means; and acquisition means for acquiring the positions of the search string on the basis of the search results obtained by the first search means and the second search means.

[0011] A text search method according to the present invention that achieves the above object comprises: an extraction step of extracting, from a given search string, a duplicate pattern that consists of two or more consecutive characters and occurs more than once in the search string; a first search step of searching for the positions of the duplicate pattern extracted in the extraction step, using a first index in which, using characters in the searched text as keys, the positions of those characters in the searched text are registered; a second search step of searching for the positions of the characters in the search string other than the duplicate pattern, using the first index; and an acquisition step of acquiring the positions of the search string on the basis of the search results obtained in the first search step and the second search step.

[0012]

[Embodiments of the Invention] A preferred embodiment of the present invention is described below with reference to the accompanying drawings. The embodiment is described with N of the N-gram fixed at 2, but the basic procedure is the same for other values of N.

[0013] FIG. 1 is a block diagram showing the configuration of a document processing apparatus according to the embodiment of the present invention. In the figure, 101 denotes a 2-gram index holding unit, which holds an index that uses characters and character strings in the searched text as keys and records the positions of those characters in the searched text. 102 denotes a search word holding unit, which holds the character string to be searched for. 103 denotes a duplicate pattern detection unit, which detects duplicate patterns of two or more consecutive characters in the search string. 104 denotes a position search unit, which uses the 2-gram index held in the 2-gram index holding unit 101 to find the positions containing the search string, by matching the search string held in the search word holding unit 102 against character strings in the searched text. 105 denotes a comparison operation result holding unit, which holds the search results for the duplicate patterns detected by the duplicate pattern detection unit 103 while the position search unit 104 is operating. 106 denotes a search result holding unit, which holds the search results found by the position search unit 104.

[0014] In the above configuration, any well-known method can be used to generate the 2-gram index, so the part that generates the index is not shown.

[0015] FIG. 2 shows the hardware configuration of the document processing apparatus described above. In the figure, 201 denotes a CPU, which operates according to a program implementing the procedure described later. 202 denotes a RAM, which provides the search word holding unit 102, the comparison operation result holding unit 105, and the search result holding unit 106, together with the storage area needed for the program to run. 203 denotes a ROM, which holds the control program implementing the procedure described later. 204 denotes a disk device, which implements the 2-gram index holding unit 101. 205 denotes an input unit, which includes a keyboard, a mouse, and the like and is used to enter the search string, to instruct the start of a search, and so on. 206 denotes a display, which displays search results and the like. 207 denotes a bus, which connects the above components.

[0016] In this embodiment the control program is stored in the ROM and executed by the CPU 201, but the apparatus may obviously also be configured so that the control program is stored in an external storage device such as the disk device 204, loaded into the RAM 202, and executed by the CPU 201.

[0017] Text search processing in the document processing apparatus of this embodiment, configured as described above, is explained below.

[0018] FIG. 3 is a flowchart showing the processing procedure of the apparatus shown in FIG. 1. FIGS. 5 and 6 show a concrete example for explaining the search processing of this embodiment. The 2-gram index search procedure, made more efficient by exploiting duplicate patterns, is described below with reference to the flowchart of FIG. 3 and the example of FIGS. 5 and 6.

[0019] First, in step S301, patterns that are duplicated within the search word (duplicate patterns) are detected. Here, the word "ジェントル・エージェント" ("gentle agent") is used as an example of a search string (hereinafter, search word). In this search word the character string "ジェント" is used twice, so "ジェント" is detected as a duplicate pattern. Duplicate patterns are not limited to those within a single search word. For example, when two conditions (search words) are combined by AND, OR, or the like, as in "炭素材料 AND 機能素材", a duplicate pattern ("素材" in this example) is also detected.
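The source does not say how step S301 finds duplicate patterns, so the following is only one possible sketch: a brute-force scan that collects every substring of two or more characters occurring at least twice across the search words, and reports the maximal ones. The function name `find_duplicate_patterns` is hypothetical, and overlapping occurrences within one word are not treated specially.

```python
from collections import defaultdict

def find_duplicate_patterns(terms, min_len=2):
    """Return repeated substrings (length >= min_len), longest first,
    dropping any that are contained in a longer repeated substring."""
    occurrences = defaultdict(list)
    for t_idx, term in enumerate(terms):
        for start in range(len(term)):
            for end in range(start + min_len, len(term) + 1):
                occurrences[term[start:end]].append((t_idx, start))
    repeated = [s for s, occ in occurrences.items() if len(occ) >= 2]
    repeated.sort(key=len, reverse=True)
    maximal = []
    for s in repeated:
        if not any(s in longer for longer in maximal):
            maximal.append(s)
    return maximal

print(find_duplicate_patterns(["ジェントル・エージェント"]))  # ['ジェント']
print(find_duplicate_patterns(["炭素材料", "機能素材"]))        # ['素材']
```

The scan is quadratic in the length of each search word, which is acceptable for short queries; a suffix-based structure would be a natural replacement for long ones.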

[0020] If a duplicate is detected in step S301, a search for the duplicate pattern is performed in step S302. In this example, for "ジェント", the position index of the first two characters "ジェ" is read, and then the position index of "ント" is read. The two indexes are compared, and only the contiguous occurrences, that is, those where the position of the latter is exactly 2 greater than that of the former, are written to a buffer. In this embodiment the duplicate pattern happens to start at the beginning of the search word, but a position index is created for the duplicate pattern regardless of whether it is at the beginning.

[0021] If the duplicate pattern is longer, the position index of the next two characters is compared against this result, and so on to the end of the pattern. If its length is odd, the position information for the last character is obtained from the single-character index. In step S303, the position index written to the buffer is set in the comparison operation result holding unit 105. If there are several duplicate patterns, the same processing is performed for every one of them, and the position indexes of all the duplicate patterns are set.

[0022] In step S304, the same search processing is performed on the character strings other than the duplicate pattern. In this example, the search word is "(ジェント)ル・エー(ジェント)", so the remaining string is "ル・エー". First the position index of "ル・" is read, and then that of "エー". The two indexes are compared, and only the contiguous occurrences, that is, those where the position of the latter is exactly 2 greater than that of the former, are written to a buffer.

[0023] In step S305, the position indexes of "ジェント", "ル・エー", and the repeated "ジェント" are combined. That is, the position index of "ジェント" is compared with that of "ル・エー", and only the contiguous occurrences, in this example those where the position of the latter is exactly 4 greater than that of the former, are kept. Finally, the position index of "ジェント" is compared against this buffer once more, and only the entries for which "ジェント" also occurs at a position exactly 8 greater than the first "ジェント" are kept, so that only the positions of the search word remain.
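Steps S302 to S305 can be sketched as below. This is a simplified illustration, not the patented implementation: the helper names are hypothetical, a plain dictionary stands in for the comparison operation result holding unit 105, and the final combination aligns each segment to the start of the search word and intersects the sets, which corresponds to the +4 and +8 offset checks in the text.

```python
from collections import defaultdict

def build_indexes(text):
    """Build a 2-gram index and a single-character position index."""
    bigrams, chars = defaultdict(list), defaultdict(list)
    for pos, ch in enumerate(text):
        chars[ch].append(pos)
        if pos + 1 < len(text):
            bigrams[text[pos:pos + 2]].append(pos)
    return bigrams, chars

def positions_of(piece, bigrams, chars):
    """Positions of piece, found by chaining its 2-character units."""
    first = piece[:2]
    found = set((bigrams if len(first) == 2 else chars).get(first, []))
    for offset in range(2, len(piece), 2):
        unit = piece[offset:offset + 2]
        source = bigrams if len(unit) == 2 else chars
        unit_positions = set(source.get(unit, []))
        found = {p for p in found if p + offset in unit_positions}
    return found

def split_on_pattern(query, pattern):
    """Split query into (offset, piece) pairs, keeping each pattern occurrence whole."""
    if len(pattern) < 2:
        raise ValueError("a duplicate pattern has at least two characters")
    segments, i, rest_start = [], 0, 0
    while i < len(query):
        if query.startswith(pattern, i):
            if rest_start < i:
                segments.append((rest_start, query[rest_start:i]))
            segments.append((i, pattern))
            i += len(pattern)
            rest_start = i
        else:
            i += 1
    if rest_start < len(query):
        segments.append((rest_start, query[rest_start:]))
    return segments

def search_with_reuse(query, pattern, bigrams, chars):
    """Find query positions, computing each distinct segment only once."""
    cache = {}  # stands in for the comparison operation result holding unit 105
    result = None
    for offset, piece in split_on_pattern(query, pattern):
        if piece not in cache:  # the second "ジェント" is served from the cache
            cache[piece] = positions_of(piece, bigrams, chars)
        aligned = {p - offset for p in cache[piece]}
        result = aligned if result is None else result & aligned
    return sorted(result or [])

bigrams, chars = build_indexes("ジェント、ジェントル・エージェントです")
print(search_with_reuse("ジェントル・エージェント", "ジェント", bigrams, chars))  # [5]
```

For this query the segments are "ジェント" at offset 0, "ル・エー" at offset 4, and "ジェント" again at offset 8, so only two distinct segments are resolved against the index.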

[0024] If no duplicate is detected in step S301, on the other hand, the query is processed as an ordinary search in step S306.

[0025] As described above, where the conventional method requires five comparison operations on six two-character substrings, as shown in FIG. 5, this embodiment requires only four comparison operations on four substrings, as shown in FIG. 6. Because both the number of substrings and the number of comparison operations decrease, search processing can be performed efficiently and quickly. In other words, the duplicate patterns in the search word are exploited to reduce the computational cost, allowing faster searches.

[0026] <Other Embodiments> (1) The above embodiment is confined to a single search operation, but the apparatus may also be configured to pick out frequently occurring patterns statistically and keep the comparison operation results for those patterns. In searches where particular patterns are used very frequently, this can reduce the search cost considerably. If the searched documents are updated, however, the held results must be discarded.

[0027] FIG. 7 is a block diagram illustrating a configuration that holds a duplicate pattern index usable across multiple search operations. Components that are the same as in FIG. 1 are given the same reference numerals. 301 denotes a statistical processing unit, which obtains the appearance frequency of the duplicate patterns held in the comparison operation result holding unit 105. 302 denotes a duplicate pattern index holding unit, which holds a duplicate pattern index registering the positions in the searched text of those duplicate patterns whose appearance frequency, as determined by the statistical processing unit 301, exceeds a predetermined value.

[0028] The position search unit 104 checks whether a duplicate pattern detected by the duplicate pattern detection unit 103 is registered in the duplicate pattern index holding unit 302 and, if it is, obtains the position information directly from the duplicate pattern index. The search for the duplicate pattern through the 2-gram index can thus be omitted.
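A minimal sketch of this multi-search variant follows, assuming a simple in-memory cache. The class name `DuplicatePatternCache`, its `threshold` parameter, and the `compute_positions` callback are all hypothetical; the frequency counter loosely plays the role of the statistical processing unit 301 and the dictionary that of the duplicate pattern index holding unit 302.

```python
from collections import Counter

class DuplicatePatternCache:
    """Keep position lists for duplicate patterns that appear often in queries."""

    def __init__(self, threshold):
        self.threshold = threshold   # the "predetermined value"
        self.frequency = Counter()   # how often each duplicate pattern has appeared
        self.index = {}              # pattern -> positions in the searched text

    def lookup(self, pattern, compute_positions):
        """Return positions for pattern, using the cached entry when available."""
        self.frequency[pattern] += 1
        if pattern in self.index:
            return self.index[pattern]          # 2-gram search is skipped
        positions = compute_positions(pattern)  # fall back to the 2-gram index
        if self.frequency[pattern] > self.threshold:
            self.index[pattern] = positions
        return positions

    def invalidate(self):
        """Discard held results, e.g. after the searched documents are updated."""
        self.index.clear()
```

In use, `compute_positions` would be whatever routine resolves a pattern through the 2-gram index, and `invalidate()` would be called whenever the searched text changes.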

[0029] In the above description the index is generated with the entire character string of the duplicate pattern as the key, but this is not a limitation. For example, an index may instead be generated whose keys are the character strings obtained by dividing the duplicate pattern into N-character units. If the duplicate pattern "ジェント" is divided into two-character units, for instance, the two character strings "ジェ" and "ント" result, and the statistical processing unit 301 obtains their appearance frequencies. When an appearance frequency exceeds the predetermined value, position information keyed by that character string may be added.

[0030] (2) In the above embodiment the position index of the duplicate pattern was computed first because the pattern was at the beginning of the search word, but the computation may instead be performed at the point where the duplicate pattern appears. The computations may of course also be performed in parallel.

[0031] (3) The above embodiment does not address nested duplicate patterns, but the processing described in it can clearly be applied when duplicate patterns are nested. For example, in "(天然素材 OR 自然素材) AND 炭素材料", "素材" occurs in three places and "然素材" in two. In such a case, the position index of the duplicate pattern "素材", which is contained in "然素材", is obtained first, and the result is then compared with the position information of the single character "然" to obtain the position index of "然素材".
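The nested case can be illustrated with a short sketch: the positions of the inner pattern "素材" are obtained once, and the positions of "然素材" are derived from them using the single-character index. The sample text and the helper names `build_indexes` and `extend_left` are assumptions made for this example.

```python
from collections import defaultdict

def build_indexes(text):
    """Build a 2-gram index and a single-character position index."""
    bigrams, chars = defaultdict(list), defaultdict(list)
    for pos, ch in enumerate(text):
        chars[ch].append(pos)
        if pos + 1 < len(text):
            bigrams[text[pos:pos + 2]].append(pos)
    return bigrams, chars

def extend_left(inner_positions, char_positions):
    """Positions of (char + inner), given positions of inner and of the single char."""
    char_set = set(char_positions)
    return sorted(p - 1 for p in inner_positions if p - 1 in char_set)

text = "天然素材と自然素材と炭素材料"
bigrams, chars = build_indexes(text)

sozai = bigrams["素材"]                      # inner pattern, looked up once
nen_sozai = extend_left(sozai, chars["然"])  # outer pattern derived from it
print(sozai)      # [2, 7, 11]
print(nen_sozai)  # [1, 6]
```

The same `sozai` list can then be reused for "炭素材料", so "素材" is resolved against the 2-gram index only once for all three search words.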

[0032] (4) The above embodiment uses Japanese character strings as examples, but the same processing is possible for character strings in other languages, and such languages may of course also be targeted.

[0033] (5) The present invention may be applied to a system made up of multiple devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copying machine or a facsimile machine).

[0034] Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a storage medium on which the program code of software implementing the functions of the above embodiment is recorded, and having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium.

[0035] In this case, the program code read from the storage medium itself implements the functions of the above embodiment, and the storage medium storing that program code constitutes the present invention.

[0036] Storage media that can be used to supply the program code include, for example, a floppy disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, and a ROM.

[0037] Needless to say, the invention covers not only the case where the functions of the above embodiment are implemented by the computer executing the program code it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the above embodiment are implemented by that processing.

[0038] Needless to say, the invention further covers the case where the program code read from the storage medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that function expansion board or function expansion unit performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the above embodiment are implemented by that processing.

[0039]

[Effect of the Invention] As described above, according to the present invention, the search string is checked for duplicate patterns, that is, identical substrings occurring more than once, and when a duplicate pattern is found its search result can be reused at its next occurrence, which improves the efficiency of search processing.

[0040]

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a document processing apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing the hardware configuration of the document processing apparatus.

FIG. 3 is a flowchart showing the processing procedure of the apparatus shown in FIG. 1.

FIG. 4 is a diagram illustrating how character strings are extracted when a 2-gram index is generated.

FIG. 5 is a diagram showing an example of taking two characters at a time from a search string.

FIG. 6 is a concrete example for explaining the search processing of the embodiment.

FIG. 7 is a block diagram showing the configuration of a document processing apparatus according to another embodiment of the present invention.

Claims

[Claims]

1. A text search apparatus comprising: holding means for holding a first index in which, using characters in a searched text as keys, the positions of those characters in the searched text are registered; extraction means for extracting, from a given search string, a duplicate pattern that consists of two or more consecutive characters and occurs more than once in the search string; first search means for searching for the positions of the duplicate pattern extracted by the extraction means, using the first index held in the holding means; second search means for searching for the positions of the characters in the search string other than the duplicate pattern, using the first index held in the holding means; and acquisition means for acquiring the positions of the search string on the basis of the search results obtained by the first search means and the second search means.

2. The text search apparatus according to claim 1, wherein, N being an integer of 2 or more, the first index uses character strings of N characters as keys and registers the position of each such character string in the searched text.

3. The text search apparatus according to claim 2, wherein the first search means obtains, by referring to the first index, the positions in the searched text of the character strings obtained by dividing the duplicate pattern into N-character units, and obtains the positions of the duplicate pattern on the basis of those positions.

4. The text search apparatus according to claim 1, further comprising second holding means for holding a second index in which, using the duplicate pattern obtained by the first search means as a key, its positions in the searched text are registered, wherein the first search means searches for the positions in the searched text of the duplicate pattern extracted by the extraction means on the basis of the first index and the second index.

5. The text search apparatus according to claim 4, wherein the second holding means obtains statistics on the duplicate patterns that appear and uses a duplicate pattern whose appearance frequency exceeds a predetermined value as a key of the second index.

6. The text search apparatus according to claim 3, further comprising: statistical means for acquiring the appearance frequency of each character string obtained by dividing the duplicate pattern into N-character units; and second holding means for holding a second index in which, using as a key a character string whose appearance frequency acquired by the statistical means exceeds a predetermined value, its positions in the searched text are registered, wherein the first search means obtains the positions in the searched text of the character strings obtained by dividing the duplicate pattern into N-character units on the basis of the first and second indexes.

7. The text search apparatus according to claim 1, further comprising generation means for generating, from the searched text, the first index held in the holding means.

8. A text search method comprising: an extraction step of extracting, from a given search string, a duplicate pattern that consists of two or more consecutive characters and occurs more than once in the search string; a first search step of searching for the positions of the duplicate pattern extracted in the extraction step, using a first index in which, using characters in a searched text as keys, the positions of those characters in the searched text are registered; a second search step of searching for the positions of the characters in the search string other than the duplicate pattern, using the first index; and an acquisition step of acquiring the positions of the search string on the basis of the search results obtained in the first search step and the second search step.

9. The text search method according to claim 8, further comprising a generation step of generating the first index.

10. The text search method according to claim 8, wherein, N being an integer of 2 or more, the first index uses character strings of N characters as keys and registers the position of each such character string in the searched text.

11. The text search method according to claim 10, wherein the first search step obtains, by referring to the first index, the positions in the searched text of the character strings obtained by dividing the duplicate pattern into N-character units, and obtains the positions of the duplicate pattern on the basis of those positions.

12. The text search method according to claim 8, further comprising a holding step of holding a second index in which, using the duplicate pattern obtained in the first search step as a key, its positions are registered, wherein the first search step searches for the positions of the duplicate pattern extracted in the extraction step on the basis of the second index.

13. The text search method according to claim 12, wherein, in the holding step, statistics on the duplicate patterns that appear are obtained, and a duplicate pattern whose appearance frequency exceeds a predetermined value is used as a key of the second index.

14. The text search method according to claim 11, further comprising: a statistical step of acquiring the appearance frequency of each character string obtained by dividing the duplicate pattern into N-character units; and a holding step of holding a second index in which, using as a key a character string whose appearance frequency acquired in the statistical step exceeds a predetermined value, its positions in the searched text are registered, wherein the first search step obtains the positions in the searched text of the character strings obtained by dividing the duplicate pattern into N-character units by referring to the first and second indexes, and obtains the positions of the duplicate pattern in the searched text on the basis of those positions.

15. A computer-readable memory storing a control program for searching a searched text for a given search string, the control program comprising: code for an extraction step of extracting, from the given search string, a duplicate pattern that consists of two or more consecutive characters and occurs more than once in the search string; code for a first search step of searching for the positions of the duplicate pattern extracted in the extraction step, using a first index in which, using characters in the searched text as keys, the positions of those characters in the searched text are registered; code for a second search step of searching for the positions of the characters in the search string other than the duplicate pattern, using the first index; and code for an acquisition step of acquiring the positions of the search string on the basis of the search results obtained in the first search step and the second search step.