JPH11110402A - Document retrieving device

Info

Publication number: JPH11110402A
Application number: JP9267547A
Authority: JP
Inventors: Kazushige Asada; 一繁浅田; Hiroshi Takegawa; 弘志竹川; Toshio Ito; 俊男伊藤; Hideaki Nakayama; 秀明中山
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-09-30
Filing date: 1997-09-30
Publication date: 1999-04-23

Abstract

PROBLEM TO BE SOLVED: To speed up retrieval by reducing the rate of false drops when retrieving document data. SOLUTION: When the processing part 2 of the document retrieving device retrieves character strings under forward-match (prefix) and backward-match (suffix) conditions, the virtual characters '^' (indicating the head of a character string) and '$' (indicating the tail) are added to the retrieval character strings p1 (= 'high speed') and p2 (= 'printer') and to the character strings to be retrieved t1 (= 'high speed printer'), t2 (= 'metropolitan expressway'), and t3 (= 'printer output'). The strings are thereby converted into p1' (= '^high speed'), p2' (= 'printer$'), t1' (= '^high speed printer$'), t2' (= '^metropolitan expressway$'), and t3' (= '^printer output$'), and signatures are extracted from the converted strings. The rate of false drops can thus be sharply reduced.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document retrieval device, and more particularly to a document retrieval device that uses a signature file to retrieve documents containing a specified character string.

[0002]

[Prior Art] Word processors, and personal computers running word-processing software, have long been used to create and edit documents that mix several character types, such as English and Japanese. Document retrieval devices are used to search for substrings in the document data entered into these machines.

[0003] One document retrieval device of this type is described in Japanese Patent Application Laid-Open No. 7-244671. It retrieves documents using a signature, a binary bit pattern extracted from a character string by a fixed method. The positions of the bits set to "1" in this binary pattern are obtained by converting the characters or words that make up the string into numbers and hashing each number to a value between 0 and the largest bit position. For example, suppose the string is 「コピー」 ("copy") and its characters 「コ」, 「ピ」, and 「ー」 are converted through their character codes into the numbers 5, 7, and 12. These numbers designate bit positions, so the 5th, 7th, and 12th bits are set to "1", giving "000010100001". This pattern of 0s and 1s is called a bitmap (bit pattern), and a signature is made up of such a bitmap.
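The 「コピー」 example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `signature_from_positions` is hypothetical, positions are assumed to be counted from 1 starting at the left, and the mapping from characters to the numbers 5, 7, and 12 is taken as given rather than computed.

```python
def signature_from_positions(positions, width):
    """Return a bit-pattern string with the given 1-based positions set to '1'.

    Assumption: positions count from the left, starting at 1.
    """
    bits = ["0"] * width
    for pos in positions:
        bits[pos - 1] = "1"
    return "".join(bits)


# Bit positions 5, 7, and 12 in a 12-bit signature.
print(signature_from_positions([5, 7, 12], 12))  # 000010100001
```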

[0004] A method of extracting signatures is described in "Access Method of Text" (Christos Faloutsos, Computing Surveys, Vol. 17, No. 1, March 1985, pp. 49-74). According to this reference, a signature called a word signature is created for each word in the document data, and the word signatures are superimposed to form the signature of the document data. Here, superimposing means taking the logical OR of the bits at each position across several signatures and extracting the resulting sequence of OR values as a new signature.
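Superimposition as defined here is a bitwise OR. The sketch below represents signatures as Python integers, which is an assumption made for brevity; the two word signatures are made-up values, not taken from the reference.

```python
from functools import reduce


def superimpose(signatures):
    """OR together signatures (held as ints) into one document signature."""
    return reduce(lambda a, b: a | b, signatures, 0)


# Two hypothetical 12-bit word signatures.
word_sigs = [0b000010100001, 0b010000100000]
doc_sig = superimpose(word_sigs)
print(format(doc_sig, "012b"))  # 010010100001
```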

[0005] To allow retrieval of substrings of words as well, there is also a method that splits the document data into character strings of a fixed length, with overlap between neighboring strings, and superimposes the signature of each string in the same way as the word signatures described above.
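A minimal sketch of splitting text into fixed-length pieces that overlap their neighbors follows. The function name, the block length of 4, and the overlap of 2 are illustrative assumptions; the text does not state concrete values.

```python
def split_with_overlap(text, block_len, overlap):
    """Split text into blocks of block_len characters sharing `overlap` characters."""
    step = block_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than block_len")
    if not text:
        return []
    blocks = []
    start = 0
    while True:
        blocks.append(text[start:start + block_len])
        if start + block_len >= len(text):  # this block reached the end of the text
            break
        start += step
    return blocks


print(split_with_overlap("高速プリンター出力", 4, 2))
```

With these values the final piece comes out shorter than the others, which is the kind of uneven last block that arises with fixed-length splitting.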

[0006] There is also a method that divides longer document data into logical blocks such as sentences or paragraphs and associates the several signatures extracted from those blocks with a single document. A signature extracted from a block is called a block signature.

[0007] When signatures are used to retrieve documents containing a search string, different character strings can yield signatures with the same bit pattern, so a search may return documents that do not contain the search string. Such a document is called a false drop. A document that does contain the search string is called an actual drop.
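The difference between a false drop and an actual drop can be illustrated with a toy signature. The sketch assumes a deliberately crude scheme (one bit per character, chosen by `ord(ch) % width`) so that false drops are easy to provoke; the function names are hypothetical and this is not the hashing used by any of the cited methods.

```python
def char_signature(text, width=16):
    """Toy signature: set one bit per character (illustrative hash only)."""
    sig = 0
    for ch in text:
        sig |= 1 << (ord(ch) % width)
    return sig


def classify(document, query, width=16):
    """Return 'actual drop', 'false drop', or None if the signature test rejects."""
    doc_sig = char_signature(document, width)
    query_sig = char_signature(query, width)
    if (doc_sig & query_sig) != query_sig:
        return None  # some query bit is missing, so the query cannot occur
    return "actual drop" if query in document else "false drop"


print(classify("abc", "bc"))   # signature test passes and the text really matches
print(classify("abc", "ca"))   # signature test passes but the text does not match
print(classify("abc", "xyz"))  # rejected by the signature test alone
```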

[0008] In a conventional document retrieval device, a signature is extracted for each document, and the signatures are stored together in a file called a signature file. Signature files fall into two broad classes according to how the signatures are stored. In the first, the signatures are simply stored one after another; this file organization is called the sequential organization. In the second, each bit of the signatures is stored in a separate bitmap for each bit position; this file organization is called the bit-slice organization. Signature files with the bit-slice organization are described in "Partial-Match Retrieval via Method of Superimposed Codes" (Charles S. Roberts, Proceedings of the IEEE, Vol. 67, No. 12, December 1979, pp. 1624-1979).
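The two organizations can be contrasted in a short sketch. Signatures and slices are held as Python integers, and bit positions are counted from the least significant bit; both are assumptions of this illustration, as are the function names. In the bit-slice form, a query only touches the slices for the bits set in the query signature.

```python
def to_bit_slices(signatures, width):
    """Transpose record signatures: bit i of slices[j] is bit j of signatures[i]."""
    slices = [0] * width
    for i, sig in enumerate(signatures):
        for j in range(width):
            if (sig >> j) & 1:
                slices[j] |= 1 << i
    return slices


def search_slices(slices, query_sig, n_records):
    """AND together only the slices for bits set in the query signature."""
    candidates = (1 << n_records) - 1
    for j, bitmap in enumerate(slices):
        if (query_sig >> j) & 1:
            candidates &= bitmap
    return [i for i in range(n_records) if (candidates >> i) & 1]


sequential = [0b0110, 0b0011, 0b1110]   # sequential organization: one signature per record
slices = to_bit_slices(sequential, 4)   # bit-slice organization: one bitmap per bit position
print(search_slices(slices, 0b0110, len(sequential)))  # [0, 2]
```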

[0009] As a way of compressing the bitmaps of signatures in the sequential organization, there are methods that use run-length coding and the like, as described in "Description and Performance Analysis of Signature File Methods for Office Filing" (Christos Faloutsos, ACM Transactions on Office Information Systems, Vol. 5, No. 3, July 1987, pp. 237-257).

[0010] (1) Consider searching for character strings under prefix-match and suffix-match conditions with the conventional document retrieval device described above. The following example uses search strings p1 and p2, subject to the prefix-match and suffix-match conditions, and target strings t1, t2, and t3.

p1 = "高速" ("high speed")
p2 = "プリンター" ("printer")
t1 = "高速プリンター" ("high-speed printer")
t2 = "首都高速道路" ("Metropolitan Expressway")
t3 = "プリンター出力" ("printer output")

With signatures extracted from the search strings p1 and p2, it is impossible to tell whether a character occurs at the beginning or at the end of a string. Searching the signatures extracted from t1, t2, and t3 therefore returned t1 and t2 for condition p1, and t1 and t3 for condition p2.

[0011] (2) In the conventional document retrieval device, the document data was divided into one or more substrings and a signature was extracted from each. These substrings are called blocks, and the number of characters per block was fixed. As a result, in each document the last block had no more characters than the other blocks.

[0012] (3) Furthermore, the conventional document retrieval device extracted signatures with the N-gram method, which determines the positions of the bits to be set to 1 in the signature from each string of N adjacent characters (N being an integer of 2 or more). With the 3-gram method, for example, the string "リコーの高性能カメラ" ("Ricoh's high-performance camera") was decomposed into the following substrings before the signature was extracted.

リコー / コーの / ーの高 / の高性 / 高性能 / 性能カ / 能カメ / カメラ
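The 3-gram decomposition of "リコーの高性能カメラ" can be reproduced with a sliding window. The helper name `ngrams` is hypothetical; how each substring is then hashed to a bit position is not shown.

```python
def ngrams(text, n):
    """Return every run of n adjacent characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(" / ".join(ngrams("リコーの高性能カメラ", 3)))
```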

[0013] (4) In addition, when searching with a conditional expression containing several search strings joined by the logical operators AND and OR, such as "高速 or プリンター" ("high speed OR printer"), the conventional document retrieval device scanned the signature file once for each search string and then obtained the final result by performing set operations on the individual results.

[0014]

[Problems to Be Solved by the Invention] Such conventional document retrieval devices, however, have the following drawbacks. Regarding (1), with the search strings p1 and p2 and the target strings t1, t2, and t3 described above, the device can determine only whether character strings are identical. Even under a prefix-match or suffix-match condition, it cannot distinguish a match at the beginning of a string from a match at the end. The rate of false drops, that is, documents returned by a search that do not contain the search string, therefore rises, and retrieval takes longer.

[0015] In the conventional document retrieval device of (2), when the document data is divided into substrings of a fixed number of characters and signatures are extracted, the number of characters in the last block of the document data tends to be uneven. False drops therefore increase, and retrieval takes longer.

[0016] In the conventional document retrieval device of (3), when substrings are cut out simply as N adjacent characters, signature bits are also assigned to strings that are rarely specified as search strings, such as "ーの高" in the example above. More bits in the signature are set to 1, so the probability of a false drop rises and retrieval takes longer.

[0017] In the conventional document retrieval device of (4), when searching with a conditional expression containing several search strings, the number of scans grows with the number of search strings in the expression. Retrieval time lengthens accordingly, and search performance degrades.

[0018] The present invention has been made in view of these drawbacks of the prior art. The object of the invention according to claim 1 is to provide a document retrieval device that lowers the false drop rate in document searches under prefix-match and suffix-match conditions and thereby speeds up retrieval.

[0019] The object of the invention according to claim 2 is to provide a document retrieval device that suppresses the increase in false drops arising when block signatures are extracted from document data, and thereby speeds up retrieval.

[0020] The object of the invention according to claim 3 is to provide a document retrieval device that suppresses the increase in false drops when signatures are extracted by the N-gram method from document data containing many character types, and thereby speeds up retrieval.

[0021] The object of the invention according to claim 4 is to provide a document retrieval device that speeds up retrieval when documents are searched with a conditional expression containing several search strings.

[0022]

[Means for Solving the Problems] The invention according to claim 1 is a document retrieval device comprising: an input unit for inputting the document data of a document to be registered and a search string; a processing unit that converts the document data and search string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs document data. The device is characterized in that, before extracting the signatures of the document data and the search string, the processing unit adds a virtual character representing the beginning or end of each character string and then extracts the signature, so that the signature distinguishes whether or not the same substring appears at the beginning or end of a string.

[0023] With this arrangement, before the processing unit extracts the signatures of the document data and the search string, it adds to each a virtual character representing the beginning or end of the string and then extracts the signature. Whether the substring being searched for appears at the beginning or end of a string can therefore be determined from the signatures alone, which lowers the false drop rate and improves search performance and search speed.

[0024] The invention according to claim 2 is a document retrieval device comprising: an input unit for inputting the document data of a document to be registered and a search string; a processing unit that converts the document data and search string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs document data. The device is characterized in that, when dividing the document data into a plurality of blocks and extracting a signature for each block, the processing unit divides the data so that the number of characters in each block is uniform.

[0025] With this arrangement, when the processing unit divides the document data into a plurality of blocks and extracts a block signature for each block, it divides the data so that the number of characters in each block is uniform. Blocks are then less often allocated more bits than necessary or too few bits, so the increase in false drops is suppressed and retrieval is faster.

[0026] The invention according to claim 3 is a document retrieval device comprising: an input unit for inputting the document data of a document to be registered and a search string; a processing unit that converts the document data and search string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs document data. The device is characterized in that, when extracting signatures from the strings of N adjacent characters (N being an integer of 2 or more) in the document data, the processing unit requires each such string of N adjacent characters to consist of a single character type.

[0027] With this arrangement, when the processing unit extracts signatures from the strings of N adjacent characters in the document data, each such string consists of a single character type. The number of distinct N-character strings cut out is therefore smaller than with the conventional method, so the increase in false drops is suppressed and retrieval is faster.
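One way to picture N-grams restricted to a single character type is sketched below. Several points are assumptions of this illustration and are not stated in the text at this point: the character types are taken to be hiragana, katakana, and kanji, classified by Unicode code-point range; the prolonged sound mark 「ー」 is treated as katakana; and windows that mix types are simply skipped. The function names are hypothetical.

```python
def char_type(ch):
    """Classify a character by Unicode block (assumed classification)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # includes the prolonged sound mark U+30FC
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def same_type_ngrams(text, n):
    """Keep only the n-character windows whose characters share one type."""
    result = []
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        if len({char_type(ch) for ch in window}) == 1:
            result.append(window)
    return result


print(same_type_ngrams("リコーの高性能カメラ", 3))
```

Under these assumptions the eight plain 3-grams of the example string shrink to three, and mixed strings such as "ーの高" no longer receive signature bits.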

[0028] The invention according to claim 4 is a document retrieval device comprising: an input unit for inputting the document data of a document to be registered and a search string; a processing unit that converts the document data and search string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs document data. The device is characterized in that, when searching for character strings on the basis of a conditional expression containing a plurality of search strings, the processing unit extracts the signatures of all search strings contained in the conditional expression, prepares cursors that scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in each signature, and performs the search while moving the cursors in parallel according to the content of the conditional expression.

[0029] With this arrangement, when the processing unit searches for character strings on the basis of a conditional expression containing a plurality of search strings, it extracts the signatures of all search strings contained in the expression, prepares cursors that scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in each signature, and searches while moving the cursors in parallel according to the content of the expression. Even as the number of search strings in the expression grows, the result can be obtained with a single scan of the signature file, and retrieval is faster.
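The idea of evaluating a whole conditional expression in one pass can be sketched as follows. This is a simplified model, not the device's actual procedure: the cursors are represented by a single shared record position `pos` that advances over all bit slices in lockstep, slices are Python integers with record i at bit i, and the conditional expression is passed in as a function (`any` for OR, `all` for AND). All names and data values are hypothetical.

```python
def scan_once(slices, n_records, query_sigs, combine):
    """Visit each record position once and evaluate every search string there."""
    hits = []
    for pos in range(n_records):                      # single pass over the file
        matches = []
        for sig in query_sigs:
            # A search string matches at pos if every slice selected by its
            # signature has a 1 at that position.
            ok = all((slices[j] >> pos) & 1
                     for j in range(len(slices)) if (sig >> j) & 1)
            matches.append(ok)
        if combine(matches):                          # apply the conditional expression
            hits.append(pos)
    return hits


# Bit slices for three records with signatures 0b0110, 0b0011, 0b1110.
slices = [0b010, 0b111, 0b101, 0b100]
print(scan_once(slices, 3, [0b0001, 0b1000], any))  # OR of two search strings
print(scan_once(slices, 3, [0b0010, 0b1000], all))  # AND of two search strings
```

The point of the sketch is that adding search strings lengthens the inner loop but does not add further passes over the record positions.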

[0030]

[Embodiments of the Invention] Embodiments 1 to 4 of the present invention are described below with reference to the drawings.

[0031] Before the individual embodiments are described, the general configuration of the document retrieval device common to all of them is explained. FIG. 1 is a block diagram showing the general configuration of the document retrieval device according to the embodiments. In FIG. 1, the document retrieval device comprises an input unit 1, a processing unit 2, a character string input processing unit 3, a record identifier calculation processing unit 4, a document search processing unit 5, a document output processing unit 6, a storage position calculation processing unit 7, a document registration processing unit 8, a signature extraction processing unit 9, an output unit 10, a data unit 11, a signature file 12, a record file 13, and so on.

[0032] The character code of the search strings and of the document data of registered documents may be one in which every character occupies exactly 1 byte, such as ASCII (American Standard Code for Information Interchange), or one in which 1-byte, 2-byte, and 3-byte characters are mixed, such as EUC (Extended Unix Code). The character code used for internal processing in the character string input processing unit need not be the same as the character code being converted (for example, EUC). The description of the present invention assumes that EUC is used.

[0033] As shown in FIG. 1, the search string and the document data of a document to be registered, input from the input unit 1, are converted from the input character code to EUC by the character string input processing unit 3 of the processing unit 2. Document data that is an actual drop at search time is converted from EUC to the output character code by the document output processing unit 6. Within the processing unit 2, therefore, search strings and the document data of registered documents are always handled as EUC strings, and document data is always stored as EUC strings in the record file 13 of the data unit 11.

[0034] As described above, the document retrieval device of the embodiments has functions for registering and retrieving documents and consists of four parts: the input unit 1, the processing unit 2, the data unit 11, and the output unit 10. The processing unit 2 comprises the character string input processing unit 3, the signature extraction processing unit 9, the document search processing unit 5, the record identifier calculation processing unit 4, the document output processing unit 6, the document registration processing unit 8, and the storage position calculation processing unit 7. Although not shown, it further comprises a cursor movement processing unit, a conditional expression clearing unit, and a conditional expression prospect calculation unit, which are described later.

[0035] The data unit 11 comprises the signature file 12 and the record file 13. To register a document, the input unit 1 accepts the document data to be registered, and the character string input processing unit 3 converts it to the prescribed character code and passes it to the document registration processing unit 8. First, the document registration processing unit 8 stores the document data in the record file 13 of the data unit 11. Second, using the storage position calculation processing unit 7, it calculates the storage position of the signature from the identifier of the record in which the document data was stored. Third, it divides the document data into blocks of a fixed number of characters and, using the signature extraction processing unit 9, extracts a block signature from each block. Adjacent blocks overlap by a fixed number of characters. Fourth, it stores the block signatures at the prescribed storage positions in the signature file 12 of the data unit 11.

[0036] For retrieval, the input unit 1 first accepts the search string, and the character string input processing unit 3 converts it to the prescribed character code and passes it to the document search processing unit 5. First, the document search processing unit 5 extracts from the search string a substring no longer than a limited number of characters and, using the signature extraction processing unit 9, extracts a signature from that substring; this signature serves as the signature of the search string. The limited number of characters is the number of characters in the overlap between the blocks that make up registered document data. Second, it finds the bits set to "1" in the search signature, refers to the bit-sliced bitmaps in the signature file of the data unit 11 that correspond to those bits, and obtains the storage positions of the block signatures extracted from blocks judged to contain the search string.

[0037] Third, using the record identifier calculation processing unit 4, it obtains the value of the record identifier from the storage position of the block signature. Fourth, it obtains the document data by referring to the record corresponding to that record identifier in the record file 13 of the data unit 11. Fifth, it passes the obtained document data to the document output processing unit 6. The document output processing unit 6 checks whether the document data it receives really contains the search string, removes false drops, converts the actual drops to the prescribed character code, and passes the document data to the output unit 10. The output unit 10 outputs the document data it receives.

[0038] (Embodiment 1) In Embodiment 1, the processing unit 2 of the document retrieval device searches for character strings under prefix-match and suffix-match conditions, using the following search strings p1 and p2 and target strings t1, t2, and t3.

p1 = "高速" ("high speed")
p2 = "プリンター" ("printer")
t1 = "高速プリンター" ("high-speed printer")
t2 = "首都高速道路" ("Metropolitan Expressway")
t3 = "プリンター出力" ("printer output")

[0039] For these strings, the virtual characters that distinguish the beginning from the end are taken here to be "^" (a character indicating the beginning of a string) and "$" (a character indicating the end of a string). The strings are converted as shown below before signatures are extracted.

p1' = "^高速"
p2' = "プリンター$"
t1' = "^高速プリンター$"
t2' = "^首都高速道路$"
t3' = "^プリンター出力$"

[0040] Because Embodiment 1 extracts signatures that include the virtual characters marking the head or tail of a character string, a search with either p1' or p2' as the condition retrieves only t1', "高速プリンター" (high-speed printer). In the conventional example, by contrast, the condition p1 returns t1 and t2, and the condition p2 returns t1 and t3. Compared with that case, the false drop rate is greatly reduced, and search performance and search speed improve as a result.
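The effect of the head and tail markers can be illustrated with a minimal sketch. It assumes that an exact set of 2-grams stands in for the hashed signature (so hash collisions are ignored), and the names `signature`, `candidates`, and `TARGETS` are hypothetical, not taken from the patent.

```python
# Sketch only: an exact set of 2-grams stands in for the hashed signature.
TARGETS = {
    "t1": "高速プリンター",
    "t2": "首都高速道路",
    "t3": "プリンター出力",
}


def signature(text, head=False, tail=False, n=2):
    """Return the set of n-grams of text, optionally with ^ and/or $ added."""
    marked = ("^" if head else "") + text + ("$" if tail else "")
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}


def candidates(query_sig, mark_targets):
    """Names of targets whose signature covers every n-gram of the query."""
    return [
        name
        for name, text in TARGETS.items()
        if query_sig <= signature(text, head=mark_targets, tail=mark_targets)
    ]


# Conventional signatures: substrings in the middle of a target also match.
print(candidates(signature("高速"), mark_targets=False))        # p1
print(candidates(signature("プリンター"), mark_targets=False))  # p2

# Signatures with virtual characters: only the true prefix/suffix match.
print(candidates(signature("高速", head=True), mark_targets=True))        # p1'
print(candidates(signature("プリンター", tail=True), mark_targets=True))  # p2'
```

Under these assumptions p1 and p2 select t1, t2 and t1, t3 respectively, while p1' and p2' both select only t1, because the 2-grams "^高" and "ー$" occur only in t1'.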

[0041] (Embodiment 2) In Embodiment 2, when the processing unit 2 of the document search device divides document data into one or more partial character strings (blocks) and extracts a signature for each block, the data is divided so that every block has the same number of characters. FIG. 2(a) shows a comparative example in which document data is divided at a fixed number of characters, and FIG. 2(b) shows the example of Embodiment 2, in which the data is divided so that the number of characters in each block is no more than the appropriate number and is uniform across blocks. When the block length is fixed in advance as in FIG. 2(a), the last block of each piece of document data (block 4 in the figure) has no more characters than the other blocks (blocks 1, 2, and 3 in the figure). Let L be the appropriate number of characters per block for the number of bits in the signature, L1 the number of characters in blocks 1, 2, and 3, and L2 the number of characters in block 4. These values may then satisfy the following inequality:
L2 ≤ L ≤ L1

[0042] When this inequality holds, the block signature gives block 4 more bits than it needs while leaving blocks 1, 2, and 3 short of bits, which has led to an increase in false drops.

[0043] In Embodiment 2, by contrast, as shown in FIG. 2(b), the document data is divided with the number of characters equalized across blocks so that each block has at most L characters, and the signatures are then extracted. This eliminates the problem described above and reduces false drops.
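The two ways of dividing document data can be compared with a short sketch. The function names `split_fixed` and `split_even` are hypothetical, and the rule used to equalize the blocks (the smallest block count that keeps every block at or below the limit, with lengths differing by at most one character) is an assumption; the patent only requires uniform blocks of at most L characters.

```python
import math


def split_fixed(text, block_len):
    """Comparative example: cut every block_len characters; the last block may be short."""
    return [text[i:i + block_len] for i in range(0, len(text), block_len)]


def split_even(text, max_len):
    """Divide text into the fewest blocks of at most max_len characters,
    with block lengths differing by at most one."""
    if not text:
        return []
    count = math.ceil(len(text) / max_len)
    base, extra = divmod(len(text), count)
    blocks, start = [], 0
    for i in range(count):
        size = base + (1 if i < extra else 0)  # the first `extra` blocks get one more
        blocks.append(text[start:start + size])
        start += size
    return blocks


text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
print([len(b) for b in split_fixed(text, 8)])
print([len(b) for b in split_even(text, 8)])
```

For 26 characters and a limit of 8, the fixed split yields block lengths 8, 8, 8, 2, while the even split yields 7, 7, 6, 6, so no block is nearly empty and none exceeds the limit.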

[0044] (Embodiment 3) In Embodiment 3, the processing unit 2 of the document search device extracts signatures with the N-gram method described above from document data that mixes many character types, such as the kanji, hiragana, and katakana of Japanese. Instead of cutting out every run of N adjacent characters (N is an integer of 2 or more), it cuts out only runs of N adjacent characters that all have the same character type. With this approach, the earlier example string "リコーの高性能カメラ" (Ricoh's high-performance camera) is cut out as リコー／高性能／カメラ. The earlier cut-out example was リコー／コーの／ーの高／の高性／高性能／性能カ／能カメ／カメラ. Compared with it, no signature bits need to be assigned to the extra cut-outs (コーの／ーの高／の高性) and (性能カ／能カメ), so the probability of a false drop is greatly reduced.
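The same-character-type cut-out can be sketched as follows. The classification by Unicode block (hiragana U+3040 to U+309F, katakana U+30A0 to U+30FF, kanji U+4E00 to U+9FFF) and the function names are assumptions made for illustration; the patent does not say how character types are determined.

```python
def char_type(ch):
    """Classify a character by Unicode block (an assumed, simplified scheme)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long-vowel mark "ー"
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def plain_ngrams(text, n):
    """Every run of n adjacent characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def same_type_ngrams(text, n):
    """Only the n-grams whose characters all share one character type."""
    return [g for g in plain_ngrams(text, n)
            if len({char_type(c) for c in g}) == 1]


sample = "リコーの高性能カメラ"
print(plain_ngrams(sample, 3))      # 8 cut-outs
print(same_type_ngrams(sample, 3))  # 3 cut-outs
```

With N = 3 the plain method yields the eight cut-outs listed in the text, while the same-type method keeps only リコー, 高性能, and カメラ.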

[0045] Thus, in Embodiment 3, when document data is cut out with the N-gram method to extract signatures, only strings of a single character type are cut out, so far fewer distinct strings are produced than before. If the number of distinct strings becomes small enough, the signature length (number of bits) can even be made equal to that number. With such a signature, when the search character string has N characters, only one bit slice of the bitmap needs to be scanned, and no false drops occur at all.

[0046] In the document search device of Embodiment 3, if the search character string is "プリンター" (printer) and the 3-gram method is used, the string is cut out as プリン／リンタ／ンター, so signatures are extracted from strings that are unlikely to be used as search strings. In such a case, the 3-gram method need not be used alone; combining it with other methods such as the 4-gram or 5-gram method yields the same favorable effect as described above.

[0047] (Embodiment 4) Embodiment 4 speeds up the search performed by the processing unit 2 of the document search device even when the search uses a conditional expression containing multiple search character strings joined by the logical operators AND and OR, such as "高速 OR プリンター" (high speed OR printer). First, signatures are extracted from all search character strings (patterns) contained in the conditional expression. Second, multiple cursors are prepared, each scanning the bit-sliced bitmap that corresponds to a bit whose value is 1 in an extracted signature. Third, the cursors are moved in parallel according to the content of the conditional expression. In this way the search result is obtained with a single scan of the signature file 12. Moreover, because the cursors always scan in the forward direction, the method also works with bitmaps compressed by the MH (Modified Huffman) method or similar.

[0048] FIG. 3 shows the relationship among the search-string patterns, the cursors, and the bit-sliced bitmaps described above. Cursors correspond one-to-one to bit-sliced bitmaps. That is, even if several of the signatures extracted from the search character strings in the conditional expression have the value 1 at a given bit position, only one cursor is prepared for that position. As a result, the bit-sliced bitmaps need to be scanned only once even when the number of search character strings in the conditional expression increases.

[0049] Each cursor is associated with the search character strings (patterns). For each pattern, the cursors are arranged in descending order of search efficiency, which here means in ascending order of the number of bits with value 1 in the bit-sliced bitmap that the cursor scans. This improves search efficiency.
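The bit-sliced layout and the cursor ordering for a single pattern can be sketched as follows. Signatures are modelled as small integers, and the names `bit_slices`, `ordered_cursor_bits`, and `matching_blocks` are hypothetical; this is a simplified single-pattern illustration, not the flowchart procedure of FIGS. 4 to 7.

```python
def bit_slices(signatures, width):
    """Transpose block signatures: slice b holds bit b of every signature."""
    return [[(sig >> bit) & 1 for sig in signatures] for bit in range(width)]


def ordered_cursor_bits(query_sig, slices):
    """Bit positions set in the query, sparsest slice (fewest 1 bits) first."""
    needed = [bit for bit in range(len(slices)) if (query_sig >> bit) & 1]
    return sorted(needed, key=lambda bit: sum(slices[bit]))


def matching_blocks(query_sig, slices):
    """Block indexes whose signature has every bit of the query set."""
    order = ordered_cursor_bits(query_sig, slices)
    if not order:
        return []
    primary, rest = order[0], order[1:]
    return [
        i
        for i, value in enumerate(slices[primary])
        if value and all(slices[bit][i] for bit in rest)
    ]


signatures = [0b0110, 0b1110, 0b0100, 0b1010]  # one signature per block
slices = bit_slices(signatures, width=4)
print(ordered_cursor_bits(0b1010, slices))
print(matching_blocks(0b1010, slices))
```

Placing the sparsest slice first means that most positions are rejected by the first cursor, so the remaining slices are consulted only for the few surviving candidates.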

[0050] For each search character string, the first cursor is called the primary cursor. During a search, the primary cursor pointing at the smallest bitmap position is selected, and that cursor is advanced to look for a bit whose value is "1".

[0051] After the primary cursor has moved, it is assumed that a signature has been found at the position each primary cursor points to, and it is determined whether the document data from which those signatures were extracted would satisfy the conditional expression.

[0052] The conditional expression is converted in advance into what the original text calls the "conjunctive standard form" (連言標準形): a logical expression written as a disjunction (sum) of conjunctions (products) of literals, the form commonly known as disjunctive normal form. Each literal corresponds to one search character string. For example, the conditional expression "(高速 OR 高精細) AND プリンター" ((high speed OR high definition) AND printer) is converted into "(高速 AND プリンター) OR (高精細 AND プリンター)". With the conditional expression in this form, the whole expression is true as soon as any one of its conjunctions is true.
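The conversion described above, into an OR of AND-groups (a form usually called disjunctive normal form), can be sketched as follows. The tuple encoding of expressions and the function name `to_dnf` are assumptions made for illustration.

```python
def to_dnf(expr):
    """Convert a nested AND/OR expression into a list of conjunctions.

    expr is either a search string (a literal) or a tuple
    (op, left, right) with op equal to "AND" or "OR".
    The result is a list of frozensets; each frozenset is one conjunction,
    and the list as a whole is their disjunction.
    """
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, left, right = expr
    left_dnf, right_dnf = to_dnf(left), to_dnf(right)
    if op == "OR":
        return left_dnf + right_dnf
    if op == "AND":
        # Distribute AND over OR: pair every left conjunction with every right one.
        return [a | b for a in left_dnf for b in right_dnf]
    raise ValueError(f"unknown operator: {op}")


condition = ("AND", ("OR", "高速", "高精細"), "プリンター")
for conjunction in to_dnf(condition):
    print(" AND ".join(sorted(conjunction)))
```

The example expression yields two conjunctions, one containing 高速 and プリンター and one containing 高精細 and プリンター, matching the converted form given in the text.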

[0053] When the primary cursor for a search character string reaches the end of its bit-sliced bitmap, the conditional expression can be simplified by removing every conjunction that has that search character string as a literal. In other words, once it is known that no signature for the search character string "高精細" (the original text reads "高性能" here) can be found, the search can continue with the conditional expression reduced to "高速 AND プリンター".
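This simplification step can be sketched in a few lines, assuming the conditional expression is held as a list of conjunctions, each a set of search strings. The representation and the name `drop_exhausted` are hypothetical.

```python
def drop_exhausted(conjunctions, exhausted_pattern):
    """Remove every conjunction that contains a pattern whose primary
    cursor has reached the end of its bitmap (it can never match again)."""
    return [c for c in conjunctions if exhausted_pattern not in c]


# (高速 AND プリンター) OR (高精細 AND プリンター)
condition = [{"高速", "プリンター"}, {"高精細", "プリンター"}]

remaining = drop_exhausted(condition, "高精細")
print(remaining)

# If no conjunction is left, the whole search can stop early.
print(drop_exhausted(remaining, "プリンター") == [])
```

An empty list after pruning means no remaining document can satisfy the expression, which is a natural stopping condition under this representation.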

[0054] Assuming that the corresponding signatures have been found at the positions indicated by the primary cursors, if the document data from which those signatures were extracted is judged to satisfy the conditional expression, then for each search character string the cursors after the primary cursor are checked to see whether the value at that bit position is "1", and it is determined whether the conditional expression is actually satisfied.

[0055] The search results are obtained by repeating the cursor movement as described above. The details are shown in the flowcharts of FIGS. 4 to 7, which cover the cursor movement processing unit (FIGS. 4 and 5), the conditional expression settlement unit (FIG. 6), and the conditional expression prospect calculation unit (FIG. 7). The cursor movement processing unit calls the other two. In these figures, the search character strings that make up the conditional expression are called patterns, and each pattern has the attributes position, checked, and match. The initial value of position and checked is "0", which represents a virtual head position of the bit-sliced bitmap. The initial value of match is FALSE.

[0056] The function search in step 45 of FIG. 4 advances the cursor given as its argument to the next bit whose value is "1" and returns that bit position. If no such bit is found, it returns a negative value. Step 46 then branches to step 47 or step 53 depending on whether h is positive or negative.
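The behaviour of search can be sketched as a small class. The class name `Cursor`, the field `pos`, the 1-based bit positions (with 0 as the virtual head position), the return value -1, and leaving the cursor parked at the end when nothing is found are all assumptions; the text only states that a negative value is returned.

```python
class Cursor:
    """Forward-only cursor over one bit-sliced bitmap (a list of 0/1 values)."""

    def __init__(self, bit_slice):
        self.bit_slice = bit_slice
        self.pos = 0  # 0 is the virtual head position; real bits are 1..len

    def search(self):
        """Advance to the next bit whose value is 1 and return its position.

        Returns -1 when no further 1 bit exists.
        """
        for pos in range(self.pos + 1, len(self.bit_slice) + 1):
            if self.bit_slice[pos - 1] == 1:
                self.pos = pos
                return pos
        self.pos = len(self.bit_slice)  # assumed: park at the end of the bitmap
        return -1


cursor = Cursor([0, 1, 0, 0, 1])
h = cursor.search()
while h > 0:  # mirrors the positive/negative branch on h
    print("candidate at position", h)
    h = cursor.search()
```

Because the cursor only ever moves forward, the same interface could sit on top of a run-length decoder for a compressed bitmap, which fits the remark about MH-compressed bitmaps.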

[0057] In these flowcharts, a search result or a candidate document is denoted by t. When t is undefined, its value is set to null (step 49). t has the attributes start and end, which represent the first and last storage positions of the signatures corresponding to t.

[0058] In step 68 of FIG. 5, the set of signatures corresponding to t is referred to as the signatures of the same family.

[0059] As described above, when the processing unit 2 of the document search device of Embodiment 4 searches for character strings based on a conditional expression containing multiple search character strings, it extracts the signatures of all search character strings in the conditional expression, prepares cursors that scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in each signature, and searches while moving the cursors in parallel according to the content of the conditional expression. The search result is therefore obtained with a single scan of the signature file even when the number of search character strings in the conditional expression increases, and the search is sped up.

[0060] Although the document search device of this embodiment has been described for document retrieval, it is not limited to this and can also be applied to more general search devices. For example, if bits that distinguish gender, blood type, and home prefecture are assigned to the signature, a conditional expression such as "性別=男 AND 血液型=AB AND (出身県=埼玉 OR 出身県=山梨)" (gender = male AND blood type = AB AND (home prefecture = Saitama OR home prefecture = Yamanashi)) can be processed at high speed.

【００６１】[0061]

[Effects of the Invention] As described above, according to the invention of claim 1, the false drop rate in document searches conditioned on prefix match or suffix match is lowered, and the search is sped up.

[0062] According to the invention of claim 2, the increase in false drops that occurs when block signatures are extracted from document data is suppressed, and the search is sped up.

[0063] According to the invention of claim 3, the increase in false drops is suppressed when signatures are extracted by the N-gram method from document data with many character types, and the search is sped up.

[0064] According to the invention of claim 4, the search is sped up when documents are searched with a conditional expression containing multiple search character strings.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of the document search device according to the present embodiment.

FIG. 2 is a diagram showing how document data is divided when block signatures are extracted in Embodiment 2.

FIG. 3 is a diagram showing the relationship among the search character string patterns, the cursors, and the bit-sliced bitmaps in Embodiment 4.

FIG. 4 is a flowchart of the cursor movement processing unit in Embodiment 4.

FIG. 5 is a flowchart of the cursor movement processing unit in Embodiment 4.

FIG. 6 is a flowchart of the conditional expression settlement unit in Embodiment 4.

FIG. 7 is a flowchart of the conditional expression prospect calculation unit in Embodiment 4.

[Explanation of symbols]

1 Input unit
2 Processing unit
10 Output unit
11 Data unit

Continued from the front page: (72) Inventor Hideaki Nakayama, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo

Claims

[Claims]

1. A document search device comprising: an input unit for inputting document data of a registered document and a search character string; a processing unit that converts the document data and the search character string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs the document data, wherein, before extracting the signatures of the document data and the search character string, the processing unit adds a virtual character representing the head or tail of each character string and then extracts the signature, so that the signature distinguishes whether the same partial character string appears at the head or at the tail of a character string.

2. A document search device comprising: an input unit for inputting document data of a registered document and a search character string; a processing unit that converts the document data and the search character string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs the document data, wherein the processing unit divides the document data into a plurality of blocks such that the number of characters in each block is uniform, and extracts a signature for each block.

3. A document search device comprising: an input unit for inputting document data of a registered document and a search character string; a processing unit that converts the document data and the search character string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs the document data, wherein, when extracting signatures from character strings of N adjacent characters (N is an integer of 2 or more) of the document data, the processing unit uses only character strings of N adjacent characters that all have the same character type.

4. A document search device comprising: an input unit for inputting document data of a registered document and a search character string; a processing unit that converts the document data and the search character string input to the input unit into character codes and extracts signatures; a data unit having a signature file that stores the signatures extracted by the processing unit and a record file from which document data is obtained by referring to the record corresponding to a record identifier; and an output unit that outputs the document data, wherein, when searching for a character string based on a conditional expression containing a plurality of search character strings, the processing unit extracts the signatures of all search character strings contained in the conditional expression, prepares cursors that scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in each signature, and performs the search while scanning with the cursors in parallel according to the content of the conditional expression.