JP2006179019A

JP2006179019A - Document retrieval device

Info

Publication number: JP2006179019A
Application number: JP2006007929A
Authority: JP
Inventors: Kazushige Asada; 一繁浅田; Hiroshi Takegawa; 弘志竹川; Toshio Ito; 俊男伊藤; Hideaki Nakayama; 秀明中山
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2006-01-16
Filing date: 2006-01-16
Publication date: 2006-07-06

Abstract

PROBLEM TO BE SOLVED: To speed up retrieval when searching for documents with a conditional expression that includes a plurality of search character strings.

SOLUTION: When searching on the basis of a conditional expression that includes a plurality of search character strings, the signatures of all search character strings in the expression are extracted, a cursor is prepared for each bit-sliced bitmap that corresponds to a bit whose value is 1 in those signatures, and the search is executed by moving the cursors over the bitmaps in parallel according to the contents of the conditional expression. A search result can thus be obtained with a single scan of the signature file even when the number of search character strings in the expression increases, which speeds up retrieval.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a document search apparatus, and more particularly to a document search apparatus that uses a signature file to search for documents containing a specified character string.

Conventionally, word processors and personal computers running word-processing software have been used to create and edit documents in multiple scripts such as English and Japanese, and document search apparatuses are used to search for partial character strings in the document data entered into these devices.

One document search apparatus of this type is described in Japanese Patent Application Laid-Open No. 7-244671. It performs document search using a signature, a binary bit pattern extracted from a character string by a fixed method. The positions of the bits set to "1" in the signature are obtained by converting the characters or words that make up the character string into numbers and hashing each number to a value between 0 and the maximum bit position. For example, given the character string "コピー" ("copy"), suppose the characters "コ", "ピ", and "ー" are converted through their character codes into the numbers 5, 7, and 12. These numbers indicate bit positions, so a "1" is set at the 5th, 7th, and 12th positions, giving "000010100001". Such a pattern of 0s and 1s is called a bitmap (bit pattern), and a signature is made up of such a bitmap.
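The "コピー" example above can be reproduced with a minimal Python sketch. The helper name `signature_from_positions` is hypothetical, and the positions 5, 7, and 12 are simply the values the text supposes; no real hash function is implied.

```python
def signature_from_positions(positions, width):
    """Build a signature as a bit string with '1' at each 1-based position."""
    bits = ["0"] * width
    for pos in positions:
        bits[pos - 1] = "1"  # the text counts bit positions from 1
    return "".join(bits)


# 5, 7 and 12 are the values the text supposes for "コ", "ピ" and "ー".
copy_signature = signature_from_positions([5, 7, 12], width=12)
print(copy_signature)
```

The 12-bit width is chosen only so that the output matches the bit pattern quoted in the text.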

The method of extracting signatures is described in Non-Patent Document 1. According to that document, a signature called a word signature is created for each word in the document data, and the result of superimposing these word signatures is used as the signature of the document data. Here, superimposing means taking the logical OR of the bit values at the same position across multiple signatures and using the resulting sequence of values as a new signature.
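Superimposing can be sketched as a bitwise OR over integer-encoded signatures. This is an illustrative sketch only: the three word signatures are made-up values, and `superimpose` and `may_contain` are hypothetical names.

```python
from functools import reduce


def superimpose(signatures):
    """OR together the bits at each position of all given signatures."""
    return reduce(lambda acc, sig: acc | sig, signatures, 0)


def may_contain(document_signature, query_signature):
    """True when every 1 bit of the query is also set in the document."""
    return document_signature & query_signature == query_signature


# Made-up 12-bit word signatures for one document.
word_signatures = [0b000010100001, 0b010000100000, 0b000000000110]
document_signature = superimpose(word_signatures)
print(format(document_signature, "012b"))
```

A query whose 1 bits happen to come from different words (for example `0b010000000001`) also passes `may_contain`, which is how a superimposed signature can report a document that does not actually contain the search string.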

To also allow searching for partial character strings within words, there is a method that divides the document data into overlapping character strings of a fixed number of characters and superimposes the signatures of those strings in the same way as the word signatures described above.

There is also a method that divides longer document data into logical blocks such as sentences or paragraphs and associates the multiple signatures extracted from the blocks with a single document. A signature extracted from a block is called a block signature.

When signatures are used to search for documents containing a search character string, different character strings may yield signatures with the same bit pattern, so a document that does not contain the search character string may be returned as a search result. Such a document is called a false drop. A document that does contain the search character string is called an actual drop.

In a conventional document search apparatus, a signature is extracted for each document, and the signatures are stored together in a file called a signature file. Signature files fall into two broad types according to how the signatures are stored. In the first, the signatures are simply stored one after another; this file organization is called the sequential organization. In the second, each bit of a signature is stored in a separate bitmap, one bitmap per bit position; this file organization is called the bit-slice organization. Signature files with the bit-slice organization are described in Non-Patent Document 2.
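The two organizations hold the same bits in transposed layouts, as the following sketch shows. The signatures are made-up 4-bit strings, and `to_bit_slices` and `candidates` are hypothetical helper names.

```python
def to_bit_slices(signatures, width):
    """Transpose per-document signatures into one bitmap per bit position."""
    return ["".join(sig[i] for sig in signatures) for i in range(width)]


def candidates(slices, query):
    """Documents whose bit is 1 in every slice selected by the query's 1 bits."""
    document_count = len(slices[0])
    needed = [slices[i] for i, bit in enumerate(query) if bit == "1"]
    return [d for d in range(document_count) if all(s[d] == "1" for s in needed)]


# Sequential organization: one signature per document, stored in order.
sequential = ["1010", "0110", "1100"]

# Bit-slice organization: slices[i] holds bit i of every document.
slices = to_bit_slices(sequential, width=4)
print(slices)
print(candidates(slices, "1010"))
```

With bit slices, a query only has to read the bitmaps for the bit positions that are 1 in its own signature, rather than every stored signature in full.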

As described in Non-Patent Document 3, the bitmaps of signatures in the sequential organization can be compressed with methods such as run-length coding.

In a conventional document search apparatus, when searching with a conditional expression that contains multiple search character strings joined by logical operators such as AND and OR, for example "高速 OR プリンター" ("high-speed OR printer"), the signature file is scanned once for each search character string, and the final search result is obtained by applying set operations to the individual results.
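The conventional procedure can be sketched as one full scan per search character string followed by set operations. The signature values are made up, and `scan` is a hypothetical stand-in for a pass over the signature file.

```python
def scan(document_signatures, query_signature):
    """One full pass over the signature file for a single search string."""
    return {
        index
        for index, signature in enumerate(document_signatures)
        if signature & query_signature == query_signature
    }


document_signatures = [0b1010, 0b0110, 0b1100]

hits_a = scan(document_signatures, 0b1000)  # first scan
hits_b = scan(document_signatures, 0b0100)  # second scan

result_or = hits_a | hits_b   # "A OR B"
result_and = hits_a & hits_b  # "A AND B"
print(result_or, result_and)
```

Each additional search character string in the expression adds another call to `scan`, which is the cost the invention aims to avoid.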

Non-Patent Document 1: Christos Faloutsos, "Access Methods for Text", Computing Surveys, Vol. 17, No. 1, March 1985, pp. 49-74.
Non-Patent Document 2: Charles S. Roberts, "Partial-Match Retrieval via the Method of Superimposed Codes", Proceedings of the IEEE, Vol. 67, No. 12, December 1979, pp. 1624-1642.
Non-Patent Document 3: Christos Faloutsos, "Description and Performance Analysis of Signature File Methods for Office Filing", ACM Transactions on Office Information Systems, Vol. 5, No. 3, July 1987, pp. 237-257.

In such a conventional document search apparatus, however, when a search uses a conditional expression containing multiple search character strings, the number of scans grows with the number of search character strings in the expression. The search time grows accordingly, and search performance deteriorates.

The present invention has been made in view of the above, and its object is to provide a document search apparatus that can speed up searching when documents are searched with a conditional expression containing multiple search character strings.

To solve the above problem and achieve the object, the invention according to claim 1 is a document search apparatus that searches for document data containing a search character string by using signatures, which are binary bit patterns. The apparatus comprises: a signature extraction processing unit that extracts signatures; a document registration processing unit that divides the document data into blocks, each a character string of a predetermined number of characters, extracts a signature from the character string of each block by using the signature extraction processing unit, and stores it in a signature file as a block signature; and a document search processing unit that extracts a partial character string of a predetermined number of characters from the search character string, extracts the signature of the search character string from that partial character string by using the signature extraction processing unit, and searches the block signatures stored in the signature file on the basis of that signature. When the search character string is a conditional expression containing a plurality of search character strings joined by logical operators, the document search processing unit extracts a signature from every search character string in the conditional expression, prepares cursors, each associated with the search character strings, that scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in the extracted signatures, and performs the search while moving the cursors in parallel according to the contents of the conditional expression.

The invention according to claim 2 is the document search apparatus according to claim 1, wherein, for each search character string, the cursors are arranged in ascending order of the number of bits whose value is 1 in the bit-sliced bitmap corresponding to each cursor.

The invention according to claim 3 is the document search apparatus according to claim 1, wherein the conditional expression is converted in advance into the conjunction standard form, a form of logical expression in which the literals, each associated with a search character string, make up a disjunction of conjunctions.

The invention according to claim 4 is the document search apparatus according to claim 1, wherein, when the primary cursor of a search character string reaches the end of its bit-sliced bitmap, every conjunction that has that search character string as a literal is removed.

According to the invention of claim 1, when searching on the basis of a conditional expression containing multiple search character strings, the signatures of all search character strings in the expression are extracted, cursors are prepared to scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in those signatures, and the search proceeds while the cursors move in parallel according to the contents of the conditional expression. A search result can therefore be obtained with a single scan of the signature file even when the number of search character strings in the expression increases, which speeds up searching with conditional expressions that contain multiple search character strings.

According to the invention of claim 2, search efficiency can be improved.

According to the invention of claim 3, because the conditional expression is kept in the conjunction standard form, the truth of the whole expression can be determined from the truth of any one of its conjunctions.

According to the invention of claim 4, the conditional expression can be simplified.

Embodiments 1 to 4 of the present invention are described below with reference to the drawings.

Before describing the individual embodiments, the general configuration of the document search apparatus common to all of them is described. FIG. 1 is a block diagram showing the general configuration of the document search apparatus according to the present embodiment. In FIG. 1, the document search apparatus comprises an input unit 1, a processing unit 2, a character string input processing unit 3, a record identifier calculation processing unit 4, a document search processing unit 5, a document output processing unit 6, a storage position calculation processing unit 7, a document registration processing unit 8, a signature extraction processing unit 9, an output unit 10, a data unit 11, a signature file 12, a record file 13, and so on.

The character code of the search character strings and of the document data of registered documents may be one in which every character occupies one byte, such as ASCII (American Standard Code for Information Interchange), or one in which 1-byte, 2-byte, and 3-byte characters are mixed, such as EUC (Extended Unix Code). The character code used for internal processing in the character string input processing unit need not be the same as the character code to which the input is converted (for example, EUC). The description here assumes that EUC is used.

As shown in FIG. 1, search character strings and the document data of registered documents entered through the input unit 1 are converted from the input character code into EUC by the character string input processing unit 3 of the processing unit 2. Document data that turns out to be an actual drop during a search is converted from EUC into the output character code by the document output processing unit 6. Within the processing unit 2, search character strings and document data are therefore always handled as EUC character strings, and document data is always stored as EUC character strings in the record file 13 of the data unit 11.

As described above, the document search apparatus of the present embodiment has functions for registering and searching documents and consists of four parts: the input unit 1, the processing unit 2, the data unit 11, and the output unit 10. The processing unit 2 includes the character string input processing unit 3, the signature extraction processing unit 9, the document search processing unit 5, the record identifier calculation processing unit 4, the document output processing unit 6, the document registration processing unit 8, and the storage position calculation processing unit 7. Although not shown in the figure, it further includes a cursor movement processing unit, a conditional expression clearing unit, and a conditional expression expectation calculation unit, which are described later.

The data unit 11 includes the signature file 12 and the record file 13. When a document is registered, the input unit 1 accepts the document data to be registered, the character string input processing unit 3 converts it into the predetermined character code, and the result is passed to the document registration processing unit 8. The document registration processing unit 8 first stores the document data in the record file 13 of the data unit 11. Second, it uses the storage position calculation processing unit 7 to calculate the storage position of the signature from the identifier of the record in which the document data was stored. Third, it divides the document data into blocks of a fixed number of characters and uses the signature extraction processing unit 9 to extract a block signature from each block; adjacent blocks overlap by a fixed number of characters. Fourth, it stores each block signature at the calculated storage position in the signature file 12 of the data unit 11.
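The third registration step, dividing document data into fixed-size blocks that overlap their neighbours, might look like the following sketch. The function name and the block length of 4 with an overlap of 2 are illustrative assumptions, not values taken from the source.

```python
def split_with_overlap(text, block_len, overlap):
    """Split text into blocks of block_len characters that share
    `overlap` characters with the following block."""
    step = block_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than block_len")
    blocks = []
    start = 0
    while True:
        blocks.append(text[start:start + block_len])
        if start + block_len >= len(text):
            break  # this block already reaches the end of the text
        start += step
    return blocks


blocks = split_with_overlap("abcdefghij", block_len=4, overlap=2)
print(blocks)
```

Because neighbouring blocks share characters, a short partial character string that straddles a block boundary still lies entirely inside at least one block, so its signature bits are present in that block's signature.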

When a document is searched for, the input unit 1 first accepts a search character string, the character string input processing unit 3 converts it into the predetermined character code, and the result is passed to the document search processing unit 5. The document search processing unit 5 first extracts from the search character string a partial character string within a limited number of characters, uses the signature extraction processing unit 9 to extract a signature from that partial character string, and takes it as the signature of the search character string. The limited number of characters is the number of characters by which the blocks of the registered document data overlap. Second, it examines which bits are set to "1" in the search signature, refers to the bit-sliced bitmaps in the signature file of the data unit 11 that correspond to those bits, and obtains the storage positions of the block signatures extracted from blocks judged to contain the search character string.

Third, it uses the record identifier calculation processing unit 4 to obtain the record identifier from the storage position of each block signature. Fourth, it refers to the record corresponding to the record identifier in the record file 13 of the data unit 11 to obtain the document data. Fifth, it passes the obtained document data to the document output processing unit 6. The document output processing unit 6 checks whether the document data it receives really contains the search character string, removes the false drops, converts the actual drops into the predetermined character code, and passes the document data to the output unit 10. The output unit 10 outputs the document data it receives.
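The final check performed by the document output processing unit can be sketched as a plain substring test over the candidate documents. The candidate list and the function name `remove_false_drops` are illustrative assumptions.

```python
def remove_false_drops(candidate_documents, search_string):
    """Keep only candidates that really contain the search string."""
    return [doc for doc in candidate_documents if search_string in doc]


# Candidates as they might come back from the signature stage.
candidate_documents = ["高速プリンター", "首都高速道路", "印刷装置"]
actual_drops = remove_false_drops(candidate_documents, "高速")
print(actual_drops)
```

The signature stage only narrows the set of records to read; this exact comparison is what guarantees that no false drop reaches the output.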

(Embodiment 1)
In Embodiment 1, the processing unit 2 of the document search apparatus searches for character strings under prefix-match and suffix-match conditions. The following search character strings p1 and p2 and search target character strings t1, t2, and t3 serve as the example.
p1 = "高速" ("high-speed")
p2 = "プリンター" ("printer")
t1 = "高速プリンター" ("high-speed printer")
t2 = "首都高速道路" ("Metropolitan Expressway")
t3 = "プリンター出力" ("printer output")

To distinguish the beginning and end of these character strings, virtual characters are introduced: here "^" marks the beginning of a character string and "$" marks its end. Each character string is converted as shown below before its signature is extracted.
p1' = "^高速"
p2' = "プリンター$"
t1' = "^高速プリンター$"
t2' = "^首都高速道路$"
t3' = "^プリンター出力$"

Because Embodiment 1 extracts signatures that include the virtual characters marking the beginning and end of a character string, a search with either p1' or p2' as the condition retrieves only t1' ("高速プリンター", "high-speed printer"). In the conventional example, by contrast, condition p1 returns t1 and t2, and condition p2 returns t1 and t3. The false drop rate is therefore greatly reduced, which improves both search performance and search speed.
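The effect of the virtual characters can be checked with a small sketch. As a simplifying assumption, sets of 2-character pieces stand in for hashed signature bits, so the sketch shows only which targets remain candidates; `bigrams` and `matches` are hypothetical helpers.

```python
def bigrams(text):
    """All 2-character pieces of the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def matches(pattern, target):
    """True if every bigram of the pattern also occurs in the target."""
    return bigrams(pattern) <= bigrams(target)


targets = ["高速プリンター", "首都高速道路", "プリンター出力"]
anchored_targets = ["^" + t + "$" for t in targets]

plain_p1 = [t for t in targets if matches("高速", t)]
anchored_p1 = [t for t in anchored_targets if matches("^高速", t)]
plain_p2 = [t for t in targets if matches("プリンター", t)]
anchored_p2 = [t for t in anchored_targets if matches("プリンター$", t)]
print(plain_p1, anchored_p1, plain_p2, anchored_p2)
```

The piece "^高" occurs only in a string that starts with "高", and "ー$" only in a string that ends with "ー", so the anchored patterns leave t1' as the sole candidate.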

(Embodiment 2)
In Embodiment 2, when the processing unit 2 of the document search apparatus divides document data into one or more partial character strings (blocks) and extracts a signature for each block, it divides the data so that the blocks have a uniform number of characters. FIG. 2(a) shows a comparative example in which the document data is divided by a fixed number of characters, and FIG. 2(b) shows the example of Embodiment 2, in which the data is divided so that each block has no more than the appropriate number of characters and the blocks are uniform in length. When the block length is fixed in advance as in FIG. 2(a), the last block of each piece of document data (block 4 in the figure) has no more characters than the other blocks (blocks 1, 2, and 3 in the figure). In this situation, let L be the appropriate number of characters per block for the number of bits in the signature, L1 the number of characters in blocks 1, 2, and 3, and L2 the number of characters in block 4. Their magnitudes may then satisfy the following inequality.
L2 ≤ L ≤ L1

When this inequality holds, the block signature gives block 4 more bits than it needs while giving blocks 1, 2, and 3 too few, which increases the number of false drops.

In Embodiment 2, by contrast, the document data is divided so that the blocks are uniform in length and no block exceeds L characters, as shown in FIG. 2(b), and signatures are extracted from these blocks. This removes the problem described above and reduces false drops.
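One way to equalize block lengths under an upper limit L is sketched below. It ignores the overlap between adjacent blocks for brevity, and the function name and the rule "use the smallest block count that keeps every block within the limit" are assumptions rather than the source's stated procedure.

```python
import math


def split_evenly(text, max_len):
    """Split text into the fewest blocks of at most max_len characters,
    with block lengths differing by at most one."""
    if not text:
        return []
    block_count = math.ceil(len(text) / max_len)
    base, extra = divmod(len(text), block_count)
    blocks = []
    pos = 0
    for i in range(block_count):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        blocks.append(text[pos:pos + size])
        pos += size
    return blocks


even_blocks = split_evenly("abcdefghij", max_len=4)
print([len(b) for b in even_blocks])
```

For ten characters and a limit of 4, a fixed-size split gives lengths 4, 4, 2, whereas this split gives 4, 3, 3, so no block is left nearly empty while others are full.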

(Embodiment 3)
In Embodiment 3, when the processing unit 2 of the document search apparatus uses the above-described N-gram method to extract signatures from document data that mixes many character types, such as the kanji, hiragana, and katakana of Japanese, it does not cut out every run of N adjacent characters (N is an integer of 2 or more). Instead, it cuts out only runs of N adjacent characters that share the same character type and extracts signatures from those. With this approach, the character string "リコーの高性能カメラ" ("Ricoh's high-performance camera") from the earlier example is cut out as
リコー／高性能／カメラ
Compared with the earlier cut-out
リコー／コーの／ーの高／の高性／高性能／性能カ／能カメ／カメラ
there is no longer any need to assign signature bits to superfluous pieces such as (コーの／ーの高／の高性) and (性能カ／能カメ), so the probability of a false drop is greatly reduced.

According to Embodiment 3, when document data is cut out with the N-gram method to extract signatures, only pieces of a single character type are cut out, so the number of distinct character strings cut out is far smaller than before. If the number of distinct character strings becomes small enough, the signature length (number of bits) can even be made equal to that number. With such a signature, when the search character string is N characters long, only one bit-sliced bitmap has to be scanned, and no false drops occur at all.
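The same-character-type cut-out can be sketched as follows. The Unicode ranges used to classify characters and the helper names are assumptions for illustration; the source processes EUC text and does not specify how character types are detected.

```python
from itertools import groupby


def char_type(ch):
    """Rough character-type classification by Unicode block (an assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"  # includes the long-vowel mark "ー"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def plain_ngrams(text, n):
    """Every run of n adjacent characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def same_type_ngrams(text, n):
    """Runs of n adjacent characters that all share one character type."""
    grams = []
    for _, run in groupby(text, key=char_type):
        run = "".join(run)
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


text = "リコーの高性能カメラ"
print(plain_ngrams(text, 3))
print(same_type_ngrams(text, 3))
```

The single hiragana "の" forms a run shorter than N and therefore contributes no piece, which is why the three same-type pieces replace the eight plain 3-grams.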

Note that in the document search apparatus of Embodiment 3, applying the 3-gram method to the search character string "プリンター" ("printer") yields
プリン／リンタ／ンター
so signatures are extracted from character strings that are unlikely to be used as search character strings. In such a case, using the 3-gram method together with other methods such as the 4-gram or 5-gram method, rather than on its own, gives the same favorable effect as described above.

(Embodiment 4)
Embodiment 4 uses the processing unit 2 of the document search apparatus to speed up searching even when the search uses a conditional expression that contains multiple search character strings joined by logical operators such as AND and OR, for example "高速 OR プリンター" ("high-speed OR printer"). First, signatures are extracted from all search character strings (patterns) in the conditional expression. Second, cursors are prepared to scan the bit-sliced bitmaps corresponding to the bits whose value is 1 in the extracted signatures. Third, the cursors are moved in parallel according to the contents of the conditional expression. In this way the search result is obtained with a single scan of the signature file 12. Moreover, because the cursors always move in the forward direction, the method also works with bitmaps compressed by the MH (Modified Huffman) method or similar.

FIG. 3 shows the relationship among the search character string patterns, the cursors, and the bit-sliced bitmaps. Cursors correspond one-to-one to bit-sliced bitmaps. That is, even when several of the signatures extracted from the search character strings in the conditional expression have the value 1 at the same bit position, only one cursor is prepared for that position. As a result, each bit-sliced bitmap is scanned only once even if the number of search character strings in the conditional expression increases.
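A much simplified sketch of the single-pass idea follows. It assumes uncompressed bit slices held as lists, advances every cursor one position at a time instead of skipping ahead to the next 1 bit, and takes the conditional expression as a list of conjunctions. The names `search_once`, `patterns`, and `expression`, and all bit values, are hypothetical.

```python
def search_once(slices, patterns, expression):
    """Evaluate a disjunction of conjunctions in one forward pass.

    slices:     bit position -> list of 0/1 values, one per block
    patterns:   pattern name -> set of bit positions set in its signature
    expression: list of conjunctions, each a list of pattern names
    """
    needed = sorted({bit for bits in patterns.values() for bit in bits})
    # One cursor per bit slice, shared by every pattern that uses that bit.
    cursors = {bit: iter(slices[bit]) for bit in needed}
    block_count = len(slices[needed[0]])
    hits = []
    for position in range(block_count):
        current = {bit: next(cursor) for bit, cursor in cursors.items()}
        matched = {
            name: all(current[bit] for bit in bits)
            for name, bits in patterns.items()
        }
        if any(all(matched[name] for name in conj) for conj in expression):
            hits.append(position)
    return hits


slices = {0: [1, 0, 1, 0], 1: [1, 1, 0, 0], 2: [0, 1, 1, 0], 3: [0, 0, 1, 1]}
patterns = {"A": {0, 1}, "B": {1, 2}}
print(search_once(slices, patterns, [["A"], ["B"]]))   # A OR B
print(search_once(slices, patterns, [["A", "B"]]))     # A AND B
```

Patterns A and B both use bit 1, yet only one cursor reads that slice, so adding search character strings does not add passes over a bitmap that is already being scanned.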

The cursors are associated with the search character strings (patterns). For each pattern, the cursors are arranged in descending order of search efficiency, which here means ascending order of the number of bits whose value is 1 in the bit-sliced bitmap corresponding to each cursor. This improves search efficiency.
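The ordering rule can be sketched as a sort keyed on the number of 1 bits in each slice. The slice contents and the name `order_cursors` are illustrative assumptions.

```python
def order_cursors(slices, pattern_bits):
    """Order a pattern's bit positions by how few 1 bits their slices hold."""
    return sorted(pattern_bits, key=lambda bit: sum(slices[bit]))


# Made-up slices for the three bit positions of one pattern's signature.
slices = {3: [1, 1, 1, 0], 5: [0, 1, 0, 0], 9: [1, 1, 0, 0]}
ordered = order_cursors(slices, {3, 5, 9})
primary = ordered[0]  # the sparsest slice is checked first
print(ordered, primary)
```

Checking the sparsest slice first rules out most positions early, so the denser slices are consulted only for the few positions that survive.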

In each search character string, the first cursor is called the primary cursor. During a search, the primary cursor that points to the smallest bitmap position is selected, and that cursor is advanced to look for bits whose value is "1".
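The ordering rule and the choice of which primary cursor to advance can be illustrated with a short Python sketch. The slice contents, the pattern names, and the helper names `order_cursors` and `next_primary` are assumptions made for this example only.

```python
def order_cursors(pattern_bits, slices):
    """Sort a pattern's slice indices by how many 1 bits each slice contains."""
    return sorted(pattern_bits, key=lambda b: sum(slices[b]))


def next_primary(primary_positions):
    """Pick the pattern whose primary cursor currently has the smallest position."""
    return min(primary_positions, key=lambda name: primary_positions[name])


# Hypothetical bit slices: one list per signature bit, one entry per stored signature.
slices = [
    [1, 1, 1, 0],  # slice 0: three 1 bits
    [0, 1, 0, 0],  # slice 1: one 1 bit
    [1, 0, 1, 0],  # slice 2: two 1 bits
]
ordered = order_cursors([0, 1, 2], slices)  # sparsest slice first
primary = ordered[0]                        # the first cursor is the primary cursor

# Hypothetical current positions of the primary cursors of two patterns.
chosen = next_primary({"high-speed": 3, "printer": 1})
```

Putting the sparsest slice first means the primary cursor stops at as few positions as possible, so fewer candidate positions have to be checked against the remaining cursors.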

After a primary cursor has moved, it is assumed that a signature is found at the position each primary cursor points to, and it is determined whether the document data from which those signatures were extracted would satisfy the conditional expression.

The conditional expression is converted into the conjunction standard form in advance. The conjunction standard form, as the term is used here, is a form of logical expression that is a disjunction (sum) of conjunctions (products) of literals; this form is more commonly called disjunctive normal form. Each literal is associated with one search character string. For example, the conditional expression "(high-speed OR high-definition) AND printer" is converted into the conjunction standard form "(high-speed AND printer) OR (high-definition AND printer)". With the conditional expression in this form, the truth of the entire expression can be determined from the truth of its conjunctions: the expression is true if any one conjunction is true.
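The conversion can be sketched in Python for expressions that contain only AND and OR, as in the example above. The nested-tuple representation and the function names `to_dnf` and `satisfied` are assumptions made for illustration; the source does not specify how the conversion is implemented.

```python
from itertools import product


def to_dnf(expr):
    """Convert an AND/OR expression tree into a list of conjunctions (sets of literals)."""
    if isinstance(expr, str):
        return [frozenset([expr])]          # a single literal
    op, *args = expr
    parts = [to_dnf(arg) for arg in args]
    if op == "OR":
        # A disjunction simply collects the conjunctions of its operands.
        return [conj for part in parts for conj in part]
    if op == "AND":
        # Distribute AND over OR: take one conjunction from each operand and merge them.
        return [frozenset().union(*combo) for combo in product(*parts)]
    raise ValueError(f"unsupported operator: {op}")


def satisfied(dnf, found):
    """The expression holds if every literal of at least one conjunction is found."""
    return any(all(found.get(lit, False) for lit in conj) for conj in dnf)


expr = ("AND", ("OR", "high-speed", "high-definition"), "printer")
dnf = to_dnf(expr)
```

Here `dnf` contains the two conjunctions {high-speed, printer} and {high-definition, printer}, which matches the converted expression given in the text.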

When the primary cursor for a search character string reaches the end of its bit-sliced bitmap, the conditional expression can be simplified by removing every conjunction that has that search character string as a literal. For example, once it is determined that no signature for the search character string "high performance" can be found, the rest of the search only needs to evaluate the conditional expression "high-speed AND printer".
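A minimal Python sketch of this simplification is shown below, assuming the conditional expression is held as a list of conjunctions, each a set of pattern names. The function name `prune` and the pattern names are hypothetical.

```python
def prune(dnf, exhausted):
    """Drop every conjunction that contains a pattern whose primary cursor hit the end."""
    return [conj for conj in dnf if exhausted not in conj]


# Hypothetical expression: (high-speed AND printer) OR (high-definition AND printer)
dnf = [
    frozenset({"high-speed", "printer"}),
    frozenset({"high-definition", "printer"}),
]
remaining = prune(dnf, "high-definition")  # only the first conjunction survives
nothing_left = prune(dnf, "printer")       # every conjunction needs "printer"
```

One consequence worth noting: if pruning leaves no conjunctions at all, as in the second call, no further position can satisfy the expression, so a search built on this sketch could stop early.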

When, on the assumption that the corresponding signature is found at the position each primary cursor points to, the document data from which those signatures were extracted is judged to satisfy the conditional expression, the cursors after the primary cursor are checked for each search character string to confirm whether the bit at that position is "1", and it is then determined whether the conditional expression is actually satisfied.
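The confirmation step can be illustrated in Python as follows. This sketch uses random access into list-based slices for brevity, whereas the source describes forward-moving cursors; the slice contents and the function name `confirm` are assumptions, and rows are 0-based list indices here.

```python
def confirm(slices, ordered_bits, row):
    """Check the non-primary slices of one pattern at a candidate row."""
    # ordered_bits[0] is the primary slice, already known to hold a 1 at this row.
    return all(slices[b][row] == 1 for b in ordered_bits[1:])


# Hypothetical bit slices for one pattern whose signature has bits 0, 1 and 2 set.
slices = [
    [1, 1, 1, 0],  # slice 0
    [0, 1, 0, 1],  # slice 1 (sparsest, so it is the primary slice)
    [1, 1, 0, 0],  # slice 2
]
ordered_bits = [1, 2, 0]

hit = confirm(slices, ordered_bits, row=1)   # all three slices hold 1 at row 1
miss = confirm(slices, ordered_bits, row=3)  # slice 2 holds 0 at row 3
```

The primary slice proposes rows 1 and 3 as candidates; only row 1 survives the check against the remaining slices.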

As described above, the search result is obtained by repeatedly moving the cursors. The details are shown in the flowcharts of FIGS. 4 to 7, which cover the cursor movement processing unit (FIGS. 4 and 5), the conditional expression settlement unit (FIG. 6), and the conditional expression expectation calculation unit (FIG. 7); the cursor movement processing unit calls the other two units. In these figures, the search character strings that make up the conditional expression are called patterns, and each pattern has the attribute values position, checked, and match. The initial value of position and checked is "0", which represents the virtual head position of the bit-sliced bitmap. The initial value of match is FALSE.

The function search in step 45 of FIG. 4 advances the cursor specified by its argument to the next bit whose value is "1" and returns that bit position. If no such bit is found, it returns a negative value. Step 46 then branches to step 47 or step 53 depending on whether h is positive or negative.
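The described behaviour of search can be sketched in Python as below. The `Cursor` class, the list-based bitmap, and the choice to park the cursor at the end of the bitmap on failure are assumptions; only the contract (advance to the next 1 bit, return its position, or return a negative value) comes from the text. Positions are 1-based so that 0 can serve as the virtual head position.

```python
from dataclasses import dataclass


@dataclass
class Cursor:
    position: int = 0  # 0 is the virtual head; real bits are numbered from 1


def search(bitmap, cursor):
    """Advance the cursor to the next 1 bit and return its position, or -1 if none."""
    for pos in range(cursor.position + 1, len(bitmap) + 1):
        if bitmap[pos - 1] == 1:
            cursor.position = pos
            return pos
    cursor.position = len(bitmap)  # assumed: park the cursor at the end
    return -1


bitmap = [0, 1, 0, 1]
cursor = Cursor()
first = search(bitmap, cursor)   # stops at position 2
second = search(bitmap, cursor)  # stops at position 4
third = search(bitmap, cursor)   # no further 1 bit
```

Because the loop only ever moves forward from the current position, the same contract could be met by a decoder that reads a run-length compressed bitmap sequentially.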

In these flowcharts, a search result or a candidate document is represented by t; when t is undetermined, its value is set to null (step 49). t has the attribute values start and end, which indicate the first and last storage positions of the signatures corresponding to t.

In step 68 of FIG. 5, the set of signatures corresponding to t is referred to as the signatures of the same family.

As described above, when the processing unit 2 of the document search apparatus of the fourth embodiment searches on the basis of a conditional expression that includes a plurality of search character strings, it extracts the signatures of all search character strings in the conditional expression, prepares cursors for scanning the bit-sliced bitmaps that correspond to the bits whose value is 1 in each signature, and performs the search while scanning those cursors in parallel in accordance with the contents of the conditional expression. Therefore, even when the number of search character strings in the conditional expression increases, the search result can be obtained by scanning the signature file only once, and the search can be performed faster.

Although the present embodiment has been described for the case of document search, it is not limited to this and can also be applied to more general search apparatuses. For example, if bits for distinguishing gender, blood type, and home prefecture are assigned to the signature, a conditional expression such as
"Gender = Male AND Blood type = AB AND (Home prefecture = Saitama OR Home prefecture = Yamanashi)"
can be searched at high speed.
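As an illustration of this generalization, the following Python sketch assigns one signature bit to each attribute value and evaluates the example query. The bit layout in `BITS`, the record format, and the function names are all hypothetical, and the query is written out by hand as a disjunction of conjunctions.

```python
# Hypothetical bit assignment: one signature bit per attribute value.
BITS = {
    ("gender", "male"): 0, ("gender", "female"): 1,
    ("blood", "A"): 2, ("blood", "B"): 3, ("blood", "AB"): 4, ("blood", "O"): 5,
    ("pref", "Saitama"): 6, ("pref", "Yamanashi"): 7,
}


def signature(record):
    """Set the bit of every attribute value that has an assigned position."""
    sig = 0
    for item in record.items():
        bit = BITS.get(item)
        if bit is not None:
            sig |= 1 << bit
    return sig


def matches(sig, dnf):
    """True if all literals of at least one conjunction have their bit set."""
    return any(all((sig >> BITS[lit]) & 1 for lit in conj) for conj in dnf)


# male AND AB AND (Saitama OR Yamanashi), expanded into two conjunctions.
query = [
    [("gender", "male"), ("blood", "AB"), ("pref", "Saitama")],
    [("gender", "male"), ("blood", "AB"), ("pref", "Yamanashi")],
]
people = [
    {"gender": "male", "blood": "AB", "pref": "Yamanashi"},
    {"gender": "male", "blood": "A", "pref": "Saitama"},
    {"gender": "female", "blood": "AB", "pref": "Saitama"},
]
results = [matches(signature(p), query) for p in people]
```

Only the first record satisfies the query. With one dedicated bit per value there are no false matches, which differs from hashed text signatures, where a signature hit generally still has to be verified against the document.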

FIG. 1 is a block diagram showing the schematic configuration of the document search apparatus according to the present embodiment.
FIG. 2 is a diagram showing how document data is divided when block signatures are extracted from the divided document data in the second embodiment.
FIG. 3 is a diagram showing the relationship among the search character string patterns, the cursors, and the bit-sliced bitmaps according to the fourth embodiment.
FIG. 4 is a flowchart of the cursor movement processing unit according to the fourth embodiment.
FIG. 5 is a flowchart of the cursor movement processing unit according to the fourth embodiment.
FIG. 6 is a flowchart of the conditional expression settlement unit according to the fourth embodiment.
FIG. 7 is a flowchart of the conditional expression expectation calculation unit according to the fourth embodiment.

Explanation of symbols

5 Document search processing unit
8 Document registration processing unit
9 Signature extraction processing unit
12 Signature file

Claims

A document search apparatus that uses signatures, which are binary bit patterns, to search for document data containing a search character string, the apparatus comprising:
a signature extraction processing unit that extracts a signature;
a document registration processing unit that divides the document data into blocks, each being a character string of a predetermined number of characters, extracts a signature from the character string of each block using the signature extraction processing unit, and stores the signature in a signature file as a block signature; and
a document search processing unit that extracts partial character strings of a predetermined number of characters from the search character string, extracts a signature of the search character string from the partial character strings using the signature extraction processing unit, and searches the block signatures stored in the signature file on the basis of the signature of the search character string,
wherein, when the search character string is a conditional expression including a plurality of search character strings connected by logical operators, the document search processing unit extracts a signature from every search character string included in the conditional expression, prepares, in association with each search character string, cursors for scanning the bit-sliced bitmaps that correspond to the bits whose value is 1 in each extracted signature, and performs the search while scanning the cursors in parallel in accordance with the contents of the conditional expression.

The document search apparatus according to claim 1, wherein, for each search character string, the cursors are arranged in ascending order of the number of bits having a value of 1 in the bit-sliced bitmap corresponding to each cursor.

The document search apparatus according to claim 1, wherein the conditional expression is converted into a conjunction standard form, which is a form of logical expression that is a disjunction of conjunctions of literals, each literal being associated with a search character string.

The document search apparatus according to claim 1, wherein, when the primary cursor corresponding to a search character string reaches the end of the bit-sliced bitmap, each conjunction that has the search character string corresponding to that primary cursor as a literal is removed.