JP3307336B2 - Document search method, document search device, and recording medium storing document search program

Info

Publication number: JP3307336B2
Application number: JP24802498A
Authority: JP
Inventors: 秀樹下村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-09-02
Filing date: 1998-09-02
Publication date: 2002-07-24
Anticipated expiration: 2018-09-02
Also published as: JP2000076292A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a document search method, a document search device, and a recording medium storing a document search program, and in particular to ones that are useful for text containing errors, such as text obtained by character recognition of document images.

[0002]

2. Description of the Related Art

Document image filing systems, which digitize documents printed on paper with an optical scanner, apply character recognition (OCR: Optical Character Reader) to them, and store the result, have been developed and are already on the market. Very large-scale versions, such as electronic library systems, are also under development. In such systems, which store and retrieve digitized document images, the text produced by character recognition is stored in association with the document image and used as the search index. To obtain the desired document images, the user enters a search expression that combines arbitrary character strings with logical-sum (OR) or logical-product (AND) operators and issues a search request. The search process searches the recognized text using existing string-matching techniques and outputs the document images that appear to contain character strings satisfying the request, or the recognized text of those images.

[0003] However, OCR results generally contain errors. The causes include not only the limits of recognition technology but also factors such as image resolution and noise, and such errors cannot be avoided with current technology. Consequently, if the recognized text is searched in full, as ordinary text would be, on the basis of exact matches with the character strings in the search expression, relevant documents are missed (search omissions).

[0004] One conventional technique for addressing search omissions works as follows. Each character of a word in the search expression (hereinafter a "search expression constituent word") is replaced with characters that are easily confused with it in character recognition (hereinafter "similar characters") to create a group of character strings (hereinafter "similar character strings"). These are combined by logical sum into a new search expression (hereinafter a "similar character extended search expression"), which is used for the search in place of the original. Examples among patent applications include JP-A-4-158478, "Information search method and information storage device"; JP-A-6-195387, "Document search device"; JP-A-7-152774, "Document search method and apparatus"; and JP-A-8-180064, "Document search method and document filing device".

[0005]

Problems to Be Solved by the Invention

However, the number of similar character strings produced by expanding a search expression constituent word with similar characters grows exponentially with the length of the word. For a 10-character word in which each character has on average five similar characters (including itself), 5 to the 10th power, that is, 9,765,625 similar character strings are produced, and the similar character extended search expression is the logical sum of all of them. Because search time and memory are generally proportional to the number of words and characters in the search expression, searching with such an expression consumes a great deal of both.

[0006] In this regard, the prior-art JP-A-8-180064, "Document search method and document filing device", describes limiting generation so that no more than a fixed number of similar character strings is produced. However, a scheme that leaves some similar character strings ungenerated still carries the risk of search omissions.

[0007] In view of these problems of the prior art, an object of the present invention is to provide a document search method, a document search device, and a storage medium storing a document search program that keep the number of similar character strings from growing even when a long character string is replaced using many similar characters, thereby suppressing explosive growth in processing time and memory without greatly reducing search accuracy, and that are comfortable for the user.

[0008]

Means for Solving the Problems

To achieve the above object, the present invention provides a "character string shift division means". Before a group of similar character strings is created from a search expression constituent word, this means divides the word according to a predetermined rule, combines the resulting character strings by logical product into a partial search expression, and replaces the original word with that partial search expression.

[0009] In addition, to allow the balance between search speed and accuracy to be adjusted, a "shift division constant determination means" is provided that determines the parameters of the division performed by the character string shift division means. It operates by referring to at least one of the following: the search expression entered through the search request input means, the progress of search processing obtained by the search processing monitoring means, and the characteristic information of the search data.

[0010]

Embodiments of the Invention

Next, embodiments of the present invention will be described with reference to the drawings.

[0011] (First Embodiment) FIG. 1 is a block diagram showing a configuration example of the first embodiment of the present invention.

[0012] The document image information to be searched and the text obtained by character recognition of it are assumed to be already stored in the search data 1a. The data structure of the recognition results, the storage format of the images, and the association between recognition results and images can use conventional techniques in this field as they are. Each image and its recognition result are associated one to one and can be specified by unique information (for example, an identification number). A search returns information (for example, an identification number) that uniquely identifies each document image (which may consist of multiple images) containing character strings that satisfy the search expression.

[0013] The user enters a search expression as a search request through the search request input means 1b. FIG. 2 shows an example: 2a is the whole search expression, and 2b is one of the words that make it up, that is, a search expression constituent word. In this example the three words “コンピュータ” (computer), “計算機” (computing machine), and “システム” (system) are combined using parentheses, “+”, and “*”. In a search expression, “+” denotes logical sum: the search is regarded as successful if either of the expressions on its two sides is satisfied. “*” denotes logical product: the search is regarded as successful only if both of the expressions on its two sides are satisfied at the same time. Parentheses indicate the precedence of operations within the search expression. The example in FIG. 2 requests document images that contain either “コンピュータ” or “計算機” and also contain the word “システム”.
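The semantics of “+” and “*” described above can be sketched in a few lines of Python. This is only an illustration: the nested-tuple representation of the expression and the function name `matches` are assumptions made for the example, not something the patent specifies.

```python
# A word is a plain string; a compound expression is (op, left, right),
# where op is "or" (the "+" operator) or "and" (the "*" operator).
# This representation is hypothetical and chosen only for illustration.

def matches(expr, text):
    """Return True if `text` satisfies the search expression `expr`."""
    if isinstance(expr, str):
        return expr in text                      # exact substring match
    op, left, right = expr
    if op == "or":
        return matches(left, text) or matches(right, text)
    if op == "and":
        return matches(left, text) and matches(right, text)
    raise ValueError(f"unknown operator: {op}")

# (コンピュータ + 計算機) * システム
query = ("and", ("or", "コンピュータ", "計算機"), "システム")

print(matches(query, "計算機システムの概要"))   # True
print(matches(query, "コンピュータの歴史"))     # False
```

The second text fails because it satisfies the logical sum but lacks “システム”, which the logical product requires.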

[0014] The character string shift division means 1c first divides each search expression constituent word in the search expression entered through the search request input means 1b according to a predetermined rule. The divided pieces may overlap; from the standpoint of search accuracy, overlap is in fact preferable. FIG. 3 shows the result of dividing each constituent word of the search expression 2a into character strings with a division length of 3 and a shift width of 1. Here, the division length is the length of each character string produced by the division, and the shift width is the number of characters by which the next division start position is shifted relative to the previously extracted string. In this example, four strings are extracted from “コンピュータ”: “コンピ”, “ンピュ”, “ピュー”, and “ュータ”.
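The shift division just described amounts to taking fixed-length windows at a fixed stride. The following Python sketch shows one way to do it. The function name `shift_split` is hypothetical, and two behaviours are assumptions rather than statements from the text: a word no longer than the division length is returned whole, and trailing characters that do not fill a complete window are dropped.

```python
def shift_split(word, length=3, shift=1):
    """Split `word` into pieces of `length` characters, starting every `shift` characters."""
    if length < 1 or shift < 1:
        raise ValueError("length and shift must be at least 1")
    if len(word) <= length:
        return [word]                            # assumption: short words stay whole
    starts = range(0, len(word) - length + 1, shift)
    return [word[i:i + length] for i in starts]

print(shift_split("コンピュータ", 3, 1))   # ['コンピ', 'ンピュ', 'ピュー', 'ュータ']
print(shift_split("計算機", 3, 1))         # ['計算機']
```

With a shift width larger than 1 the windows skip characters, which is the loss of information the description later attributes to large shift widths.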

[0015] Next, for each constituent word, the character string shift division means 1c combines the strings produced by the division by logical product and replaces the original word with the result. FIG. 4 shows the pieces of “コンピュータ” from FIG. 3 combined by logical product. This partial search expression replaces the original constituent word “コンピュータ” to form a new search expression. The other constituent words, “計算機” and “システム”, are processed in the same way. FIG. 5 shows the final output of the character string shift division means 1c for the examples of FIGS. 3 and 4.
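As a rough sketch of this replacement step, the pieces of each word can be joined with “*” to form the partial search expression. The helper names are hypothetical, and the textual output format (parentheses around a multi-piece product) is an assumption, since the figures are not reproduced here.

```python
def shift_split(word, length=3, shift=1):
    """Fixed-length windows at a fixed stride (short words are kept whole)."""
    if len(word) <= length:
        return [word]
    return [word[i:i + length] for i in range(0, len(word) - length + 1, shift)]

def partial_expression(word, length=3, shift=1):
    """Join the pieces of `word` with '*' (logical product)."""
    parts = shift_split(word, length, shift)
    return parts[0] if len(parts) == 1 else "(" + "*".join(parts) + ")"

# Rebuild (コンピュータ + 計算機) * システム with every word replaced.
a, b, c = (partial_expression(w) for w in ("コンピュータ", "計算機", "システム"))
print(f"({a}+{b})*{c}")
# ((コンピ*ンピュ*ピュー*ュータ)+計算機)*(システ*ステム)
```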

[0016] The similar character expansion means 1d expands each constituent word of the new search expression output by the character string shift division means 1c using characters that OCR tends to confuse, and creates a similar character extended search expression. FIG. 6 is an example of a similar character table storing such characters. It indicates that when OCR attempts to recognize the character on the left, the characters on the right often appear as errors. For example, OCR tends to misread “コ” as “ユ”, “ュ”, “口”, or “ロ”. The table can be built from OCR recognition experiments, for example by listing and registering the entries of a confusion matrix of character error probabilities whose values are at or above a certain threshold.
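Building the table from a confusion matrix could look roughly like the sketch below. Everything numeric here is invented for illustration: the probabilities, the threshold, and the extra candidate “二” are hypothetical, and only the four confusions of “コ” (“ユ”, “ュ”, “口”, “ロ”) come from the description. The function name is also an assumption.

```python
def build_similar_table(confusion, threshold):
    """Keep, for each true character, the misrecognitions whose probability
    is at or above `threshold`, most likely first."""
    table = {}
    for true_char, row in confusion.items():
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
        table[true_char] = [c for c, p in ranked if c != true_char and p >= threshold]
    return table

# Hypothetical confusion-matrix row: P(recognized as column | true character is コ).
confusion = {
    "コ": {"コ": 0.90, "ユ": 0.04, "ュ": 0.03, "口": 0.02, "ロ": 0.009, "二": 0.001},
}

print(build_similar_table(confusion, threshold=0.005))
# {'コ': ['ユ', 'ュ', '口', 'ロ']}
```

Raising the threshold shortens each row of the table, which in turn reduces the number of similar character strings generated later.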

[0017] For each constituent word of the search expression, the similar character expansion means 1d first generates, with reference to the similar character table, every character string that could appear as a result of OCR errors. Specifically, each character in the word is replaced with the similar characters registered in the table to create similar character strings. FIG. 7 shows examples of the similar character strings generated for “コンピ”, “ンピュ”, “ピュー”, and “ュータ”, created with reference to the similar character table in FIG. 6.

[0018] The number of similar character strings generated from one constituent word by similar character replacement is the product of the numbers of similar characters of its characters, where each character counts itself (for example, “あ” is counted among the similar characters of “あ”). In the similar character table of FIG. 6, five similar characters are defined for “コ” including itself, five for “ン”, and six for “ピ”. The number of similar character strings generated for “コンピ” is therefore their product, 150.
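The expansion and the product rule for its size can be checked with a short Python sketch using `itertools.product`. The row for “コ” follows the description; the rows for “ン” and “ピ” are hypothetical stand-ins, chosen only so that the counts (5 and 6 including the character itself) match the text, because the actual contents of FIG. 6 are not reproduced here. Function names are likewise assumptions.

```python
from itertools import product
from math import prod

SIMILAR = {
    "コ": ["ユ", "ュ", "口", "ロ"],          # from the description of FIG. 6
    "ン": ["ソ", "シ", "ツ", "ニ"],          # hypothetical stand-ins (4 entries)
    "ピ": ["ビ", "ヒ", "ぴ", "び", "ひ"],    # hypothetical stand-ins (5 entries)
}

def candidates(ch):
    """The character itself followed by its registered similar characters."""
    return [ch] + SIMILAR.get(ch, [])

def expand(fragment):
    """All strings obtained by substituting each character independently."""
    return ["".join(chars) for chars in product(*(candidates(c) for c in fragment))]

def expansion_size(fragment):
    """Product of the per-character candidate counts, without enumerating."""
    return prod(len(candidates(c)) for c in fragment)

print(len(expand("コンピ")), expansion_size("コンピ"))   # 150 150
print(expand("計算機"))                                  # ['計算機']
```

A character with no table entry contributes a factor of 1, which is why “計算機” expands only to itself in this sketch.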

[0019] After creating the similar character strings, the similar character expansion means 1d combines them by logical sum for each constituent word and replaces the original word with the result. FIG. 8 is an example of the partial search expression obtained by combining by logical sum the similar character strings generated from “コンピ”; it replaces the constituent word “コンピ”. The same processing is applied to the remaining strings produced by shift division of “コンピュータ” in the original search request, namely “ンピュ”, “ピュー”, and “ュータ”: similar character strings are created, combined by logical sum, and substituted for the original string. The same also applies to the strings produced by shift division of the words “計算機” and “システム” in the original request. FIG. 9 shows an example of the search expression finally generated by the similar character expansion means 1d. In this example, “計算機” happens not to be divided by the character string shift division means 1c and has no similar characters found by the similar character expansion means 1d, so it remains identical to the word in the entered search expression.

[0020] The search means 1e searches the recognized text contained in the search data 1a using the search expression newly generated by the similar character expansion means 1d. Conventional exact-match full-text search techniques can be used here as they are. The search result is information that uniquely identifies each retrieved document image. If multiple documents satisfy the search condition, information on all of them is output.
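Putting the steps together, a single word can be searched as a logical product over its pieces, where each piece is a logical sum over its similar character strings, all evaluated with plain exact matching. The sketch below is a simplified illustration under stated assumptions: the documents, their identification numbers, and all function names are made up, and the similar character table contains only the row for “コ” given in the description.

```python
from itertools import product

SIMILAR = {"コ": ["ユ", "ュ", "口", "ロ"]}   # only this row is taken from the text

def shift_split(word, length=3, shift=1):
    if len(word) <= length:
        return [word]
    return [word[i:i + length] for i in range(0, len(word) - length + 1, shift)]

def expand(fragment):
    pools = [[c] + SIMILAR.get(c, []) for c in fragment]
    return ["".join(chars) for chars in product(*pools)]

def word_matches(word, text):
    # AND over the pieces, OR over each piece's similar character strings.
    return all(any(s in text for s in expand(piece)) for piece in shift_split(word))

def search(word, documents):
    """Return the identification numbers of documents whose OCR text matches."""
    return [doc_id for doc_id, text in documents.items() if word_matches(word, text)]

# Hypothetical OCR texts keyed by identification number.
documents = {
    1: "新しいコンピュータを導入した",
    2: "新しいロンピュータを導入した",   # OCR misread コ as ロ
    3: "新しいシステムを導入した",
}

print(search("コンピュータ", documents))   # [1, 2]
```

Document 2 is found even though an exact match on “コンピュータ” would miss it, because the piece “コンピ” expands to include “ロンピ”.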

[0021] The result processing means 1f retrieves the necessary information from the search data 1a based on the result from the search means 1e and outputs the search result. Depending on the user's wishes, the output may be the image shown on a display, the text after OCR, or both.

[0022] The number of character strings and the number of characters in the search expression created by the similar character expansion means 1d depend heavily on the length of the string being expanded and on the number of similar characters used for replacement. In particular, they grow exponentially with the length of the string. Because processing time and memory generally increase in proportion to the number of words and characters in the search expression, this is a serious problem.

[0023] In this embodiment, the character string shift division means 1c uses a division length of 3, so the similar character expansion means 1d processes strings of at most length 3. The division length need not be fixed at 3; it can be set longer or shorter. For example, FIG. 10 shows a case with a division length of 4 and a shift width of 1.

[0024] In general, shortening the division length does reduce the number of strings produced by similar character expansion and speeds up the search itself, but it also causes documents that should not be retrieved to be retrieved. In the extreme case of a division length of 1 and a shift width of 1, the number of similar character strings created for one constituent word is the sum of the numbers of similar characters over all characters of the word. The number of similar character strings therefore does not explode, but the resulting partial search expression amounts to a search for documents that simply contain those characters somewhere. In other words, if the division unit is too short, the information on the relative positions of the characters in the constituent word is lost, and superfluous results that should not be retrieved appear.

[0025] Conversely, a longer division length reduces over-retrieval but increases search time and memory use because more similar character strings are generated. The two effects are thus in a trade-off with respect to the division length, and choosing it appropriately keeps the processing cost down while maintaining high search accuracy.
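The cost side of this trade-off can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every character has the same number of candidates (itself included) and that pieces are taken as full windows only; the function names are hypothetical.

```python
def fragment_count(word_len, w, s):
    """Number of pieces when a word of `word_len` characters is split with
    division length `w` and shift width `s` (full windows only)."""
    return 1 if word_len <= w else (word_len - w) // s + 1

def total_similar_strings(word_len, w, s, k):
    """Pieces times k**w, assuming k candidates per character."""
    return fragment_count(word_len, w, s) * k ** min(w, word_len)

# 10-character word, 5 candidates per character, shift width 1.
for w in (1, 3, 5, 10):
    print(w, total_similar_strings(10, w, 1, 5))
# 1 50
# 3 1000
# 5 18750
# 10 9765625
```

The count rises steeply with the division length, while the accuracy cost of short lengths (lost positional information) is not captured by this number and has to be weighed separately.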

[0026] The discussion above concerned the division length, but the same holds for the shift width. A large shift width drops characters of the original constituent word, or information on their relative positions, so superfluous results increase. On the other hand, the final similar character extended search expression contains fewer words and characters, which speeds up the search. A small shift width has the opposite characteristics, so this too is a trade-off. It is desirable to decide how to divide by taking the various characteristics of the search into account, either in advance or during the search.

[0027] (Second Embodiment) FIG. 11 is a block diagram showing a configuration example of the second embodiment of the present invention. Compared with the configuration example of the first embodiment shown in FIG. 1, a shift division constant determination means 11g is added ahead of the character string shift division means 11c, and a search processing monitoring means 11h is added and connected to the search means. The other components operate as in the first embodiment shown in FIG. 1, so their description is omitted.

[0028] The shift division constant determination means 11g determines the operating parameters of the processing performed by the character string shift division means 11c, based on the content of the search request, the progress of search processing, and the characteristics of the search data.

[0029] FIG. 12 shows an example of the processing flow of the shift division constant determination means 11g. Here the search expression is assumed to be composed of n constituent words, and a division length Wk (1 ≤ k ≤ n) and a shift width Sk (1 ≤ k ≤ n) are determined for each of them.

[0030] First, Wk and Sk are initialized. FIG. 13 is an example flowchart of the initialization process 12b. As k varies from 1 to n, Wk is set to 3 and Sk to 1. That is, the initial state is that every constituent word is shift-divided with a division length of 3 and a shift width of 1. Other values may of course be used.

[0031] In step 12c of FIG. 12, Wk and Sk are partially changed as needed in light of information about the search, such as the search request, the state of search processing, and the characteristics of the search data, so that search speed and search accuracy are balanced appropriately for the situation.

[0032] FIG. 14 is an example flowchart of the Wk and Sk correction process shown in 12c. In this example, the parameters used for shift division are changed with reference to the search request entered by the user. In general, katakana characters are simple shapes and many of them resemble one another, so they are hard to recognize. They therefore tend to have many similar characters, and a long Wk risks an explosive increase in the number of similar character strings. So when a constituent word is a katakana string, Wk is made shorter by 1 and Sk longer by 1 to hold down the number of similar character strings. Kanji strings have the opposite characteristics: even with a longer Wk, the number of similar character strings is unlikely to grow very large. Wk is therefore made longer by 1 and Sk shorter by 1. However, since shift division is not possible with Sk equal to 0, Sk is kept at 1 or more. In all other cases, the initial values of Wk and Sk are left unchanged.
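A minimal Python sketch of this script-based correction follows, starting from the initial values W = 3 and S = 1. The function names are hypothetical, the script tests use the Unicode Katakana block (U+30A0 to U+30FF) and the CJK Unified Ideographs block (U+4E00 to U+9FFF) as an approximation, and the lower bound of 1 on W is an added assumption; the description only states the bound for S.

```python
def is_katakana(word):
    return bool(word) and all("\u30a0" <= c <= "\u30ff" for c in word)

def is_kanji(word):
    return bool(word) and all("\u4e00" <= c <= "\u9fff" for c in word)

def adjust_by_script(word, w=3, s=1):
    """Return the corrected (division length, shift width) for one word."""
    if is_katakana(word):
        return max(1, w - 1), s + 1     # fewer, shorter pieces for katakana
    if is_kanji(word):
        return w + 1, max(1, s - 1)     # S must stay at 1 or more
    return w, s                         # mixed or other scripts: unchanged

for word in ("コンピュータ", "計算機", "OCR"):
    print(word, adjust_by_script(word))
# コンピュータ (2, 2)
# 計算機 (4, 1)
# OCR (3, 1)
```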

[0033] FIG. 15 is another example flowchart of the Wk and Sk correction process shown in 12c. This process also refers to the search request. When words in the search expression are combined by logical sum, it is enough for either one to be satisfied, so the number of results tends to be large. If Wk is then too short, it becomes more likely that one of the short similar character strings is found in a document, which further raises the chance of retrieving unwanted documents. So for a search request combined by logical sum, Wk is lengthened when dividing the constituent words on either side of the operator, which prevents over-retrieval of unwanted documents. Conversely, when words in the search expression are combined by logical product, the number of results tends to be small, so Wk is shortened to reduce the number of shift-divided similar character strings, raising processing speed without much effect on accuracy.
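The operator-based correction can be sketched as a walk over the expression tree. This is an illustration under several assumptions: the nested-tuple representation and function name are hypothetical, the adjustment step of 1 is chosen arbitrarily because the description only says "lengthen" and "shorten", and the second case is read as logical product (AND), which is what the contrast drawn in the paragraph implies.

```python
def adjust_by_operator(expr, w=3, parent_op=None, out=None):
    """Map each word to a division length based on the operator directly above it."""
    if out is None:
        out = {}
    if isinstance(expr, str):
        if parent_op == "or":
            out[expr] = w + 1            # OR: lengthen W to curb over-retrieval
        elif parent_op == "and":
            out[expr] = max(1, w - 1)    # AND: shorten W to speed up the search
        else:
            out[expr] = w                # single-word query: unchanged
        return out
    op, left, right = expr
    adjust_by_operator(left, w, op, out)
    adjust_by_operator(right, w, op, out)
    return out

# (コンピュータ + 計算機) * システム
query = ("and", ("or", "コンピュータ", "計算機"), "システム")
print(adjust_by_operator(query))
# {'コンピュータ': 4, '計算機': 4, 'システム': 2}
```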

[0034] FIG. 16 is another example of the Wk and Sk correction process shown in 12c. It adjusts the division method with reference to the progress of search processing, and is effective when one search system is shared by multiple users over a network or the like. The process receives the progress of search processing from the search processing monitoring means 11h in FIG. 11. If the search process is currently executing another user's search, or if search requests are waiting for the next search, Wk is shortened and Sk lengthened to limit the number of similar character strings generated. This improves the turnaround time of search results returned to the user.
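In code, the load-based correction reduces to a single condition on the monitor's report. The sketch below is hypothetical throughout: the parameter names `busy` and `queued_requests` stand in for whatever the search processing monitoring means actually reports, and the step size of 1 and the lower bound on W are assumptions.

```python
def adjust_by_load(w, s, busy, queued_requests):
    """Shorten W and lengthen S while the search system is under load."""
    if busy or queued_requests > 0:
        return max(1, w - 1), s + 1
    return w, s

print(adjust_by_load(3, 1, busy=True, queued_requests=0))    # (2, 2)
print(adjust_by_load(3, 1, busy=False, queued_requests=0))   # (3, 1)
```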

[0035] FIG. 17 is yet another example of 12c, characterized by referring to the characteristics of the search data. First, the proportion of each character type is obtained as one example of a characteristic of the search data. Then, if the constituent word is a katakana string and the proportion of katakana in the search data is at or above a certain level, Wk is lengthened and Sk shortened to prevent excessive false retrieval of unwanted documents. A high proportion of katakana in the documents being searched means that they contain many katakana strings, and with a short Wk there is a high risk that unwanted documents are retrieved by mistake more often; this measure addresses that problem.
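A possible shape for the data-characteristic correction is sketched below. The threshold value of 0.3, the step size of 1, and the function names are all assumptions made for the example; the description only says "at or above a certain level". Katakana is approximated by the Unicode Katakana block (U+30A0 to U+30FF).

```python
def is_katakana_char(c):
    return "\u30a0" <= c <= "\u30ff"

def katakana_ratio(text):
    """Fraction of characters in `text` that are katakana."""
    if not text:
        return 0.0
    return sum(is_katakana_char(c) for c in text) / len(text)

def adjust_by_corpus(word, corpus_text, w=3, s=1, threshold=0.3):
    """Lengthen W and shorten S for a katakana word searched in katakana-heavy data."""
    word_is_katakana = bool(word) and all(is_katakana_char(c) for c in word)
    if word_is_katakana and katakana_ratio(corpus_text) >= threshold:
        return w + 1, max(1, s - 1)
    return w, s

print(adjust_by_corpus("システム", "コンピュータシステム概要"))   # (4, 1)
print(adjust_by_corpus("システム", "計算機の概要"))               # (3, 1)
```

In a real system the ratio would presumably be computed once over the stored OCR text rather than per query; the sketch recomputes it only to stay self-contained.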

[0036] Above, as examples of the correction processing of Wk and Sk at 12c in FIG. 12, two examples that refer to the search request, one that refers to the progress of the search process, and one that refers to the characteristics of the search data have been shown separately. These processes may also be connected in multiple stages and applied to the processing of the same search formula.
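One way to picture the multi-stage connection is a chain of correction functions, each taking the current (Wk, Sk) and returning an adjusted pair. The two stages below are hypothetical placeholders (a word-length cap and a load-based rule); they are not the specific corrections of the figures.

```python
from typing import Callable, Dict, List, Tuple

Params = Tuple[int, int]  # (Wk, Sk)
Correction = Callable[[Params, Dict], Params]


def by_word_length(p: Params, ctx: Dict) -> Params:
    """Hypothetical rule: Wk cannot exceed the query word length (min 2)."""
    wk, sk = p
    return min(wk, max(2, len(ctx["word"]))), sk


def by_load(p: Params, ctx: Dict) -> Params:
    """Hypothetical rule: under load, shorten Wk and lengthen Sk."""
    wk, sk = p
    if ctx["waiting"] > 0:
        wk = max(2, wk - 1)
        sk = max(1, min(sk + 1, wk - 1))  # assumption: keep substrings overlapping
    return wk, sk


def apply_corrections(initial: Params, ctx: Dict,
                      stages: List[Correction]) -> Params:
    """Run the correction stages in order on the same search formula."""
    params = initial
    for stage in stages:
        params = stage(params, ctx)
    return params
```

Because each stage sees the output of the previous one, the order of the stages can change the final constants.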

[0037] To realize the present invention on a computer, for example in the first embodiment described above, a computer program may be created that causes the computer to implement the functions of the character string shift division means 1c, the similar character expansion means 1d, the search means 1e, and so on. The effects of the present invention are not lost when that computer program is provided recorded on a recording medium such as a CD-ROM, a floppy disk, or a semiconductor memory. The same applies to the second embodiment.

[0038]

[Effects of the Invention] By performing similar character expansion after dividing the constituent words of the search formula, the number of similar character strings created by the similar character expansion means can be greatly reduced. For example, for a character string of length 10 in which each character has 5 similar characters, performing similar character expansion directly yields 5^10 = 9,765,625 similar character strings of length 10, or 97,656,250 characters in total. This demands a large amount of processing time and memory. In contrast, if the string is first divided with a division length of 3 and a shift width of 1 and then expanded with similar characters, there are only 8 × 5^3 = 1,000 similar character strings of length 3, or 3,000 characters in total. As a result, search time is shortened and the memory required for processing is reduced.
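The arithmetic above can be checked with a short sketch. The function names are illustrative, and the handling of a leftover tail when the shift width does not divide evenly is an assumption that the example itself (shift width 1) does not exercise.

```python
from typing import List, Tuple


def shift_divide(word: str, wk: int, sk: int) -> List[str]:
    """Split `word` into overlapping substrings of length wk, stepping by sk."""
    if len(word) <= wk:
        return [word]
    parts = [word[i:i + wk] for i in range(0, len(word) - wk + 1, sk)]
    # Assumption: if stepping by sk misses the end of the word, add the tail.
    if (len(word) - wk) % sk != 0:
        parts.append(word[-wk:])
    return parts


def direct_cost(length: int, k: int) -> Tuple[int, int]:
    """(number of similar strings, total characters) without division."""
    n = k ** length
    return n, n * length


def divided_cost(length: int, k: int, wk: int, sk: int) -> Tuple[int, int]:
    """(number of similar strings, total characters) with shift division."""
    n_parts = len(shift_divide("x" * length, wk, sk))
    n = n_parts * k ** wk
    return n, n * wk


print(direct_cost(10, 5))         # (9765625, 97656250)
print(divided_cost(10, 5, 3, 1))  # (1000, 3000)
```

A word of length 10 divided with length 3 and shift 1 gives 10 - 3 + 1 = 8 substrings, which is where the factor 8 comes from.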

[0039] However, because division in the present invention partially loses the positional relationship between character strings, there is a risk of over-retrieving text that should not be retrieved. In this regard, experiments on Japanese patents confirmed that under certain conditions this is not a significant practical problem; for example, dividing with a division length of 3 or more and a shift width of 1 keeps over-retrieval to 1% or less. Some aspects naturally depend on the document type and OCR performance, but by choosing the division rules and constants with those characteristics in mind, high-speed search with almost no effect on search accuracy can be realized using an existing full-text search engine.
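To make the over-retrieval risk concrete, here is a small sketch in which a text contains every divided substring of a query word but not the word itself. The helper names and the toy ASCII strings are illustrative assumptions; similar character expansion is omitted to isolate the effect of division.

```python
from typing import List


def shift_divide(word: str, wk: int = 3, sk: int = 1) -> List[str]:
    """Overlapping substrings of length wk; the whole word if it is short."""
    return [word[i:i + wk] for i in range(0, len(word) - wk + 1, sk)] or [word]


def matches_divided(text: str, word: str, wk: int = 3, sk: int = 1) -> bool:
    """AND of the substrings: every part must occur somewhere in the text."""
    return all(part in text for part in shift_divide(word, wk, sk))


query = "ABCDE"          # divided into ABC, BCD, CDE
text = "ABCDxCDE"        # contains all three parts, but not "ABCDE"

print(matches_divided(text, query))  # True: retrieved by the divided formula
print(query in text)                 # False: the original word is absent
```

The parts match at unrelated positions, which is exactly the positional information that division discards. Longer division lengths make such coincidences rarer.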

[0040] In addition, in the present invention, the division length and shift width used for shift division are adjusted adaptively during search processing according to information relating to the search, such as the search request, the progress of the search process, and the characteristics of the data to be searched. As a result, the balance of processing time and processing speed is set appropriately and automatically, providing the user with a stress-free search environment.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an example of the configuration of the first embodiment of the present invention.

FIG. 2 is an example of a search formula input via the search request input means.

FIG. 3 is an example in which a search formula constituent word is divided with a division length of 3 and a shift width of 1.

FIG. 4 is an example of a partial search formula generated from the divided character strings.

FIG. 5 is an output example of the character string shift division means 1c.

FIG. 6 is an example of a similar character table.

FIG. 7 is an example of similar character strings created from similar characters.

FIG. 8 is an example of a partial search formula generated from similar character strings.

FIG. 9 is an output example of the similar character expansion means 1d.

FIG. 10 is an example in which a search formula constituent word is divided with a division length of 4 and a shift width of 1.

FIG. 11 is a block diagram showing an example of the configuration of the second embodiment of the present invention.

FIG. 12 is an example of a flowchart of the shift division constant determination means 11g.

FIG. 13 is an example of the initialization processing of the shift division constants.

FIG. 14 is an example of the correction processing of the shift division constants.

FIG. 15 is an example of the correction processing of the shift division constants.

FIG. 16 is an example of the correction processing of the shift division constants.

FIG. 17 is an example of the correction processing of the shift division constants.

[Explanation of symbols]

1a, 11a: Search data
1b, 11b: Search request input means
1c, 11c: Character string shift division means
1d, 11d: Similar character expansion means
1e, 11e: Search means
1f, 11f: Result processing means
11g: Shift division constant determination means
11h: Search process monitoring means
2a: Search formula
2b: Search formula constituent word

Continuation of the front page. (58) Fields searched (Int. Cl.7, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. A document search method for searching text using a search expression described by a logical OR or logical AND of words, comprising: a first step of dividing each word constituting the search expression, with reference to a predetermined rule, into partial character strings of length 2 or more that are allowed to overlap; a second step of creating a partial search expression in which the character strings produced by the first step are joined by logical AND; a third step of creating, for each constituent word in the search expression created in the second step, a group of similar character strings in which each character constituting that word is replaced with similar characters, and creating a search expression in which the similar character string group is joined by logical OR; a fourth step of searching the text using the search expression created in the third step; and a fifth step of determining the rule referred to in the first step based on the content of the search request.

2. A document search method for searching text using a search expression described by a logical OR or logical AND of words, comprising: a first step of dividing each word constituting the search expression, with reference to a predetermined rule, into partial character strings of length 2 or more that are allowed to overlap; a second step of creating a partial search expression in which the character strings produced by the first step are joined by logical AND; a third step of creating, for each constituent word in the search expression created in the second step, a group of similar character strings in which each character constituting that word is replaced with similar characters, and creating a search expression in which the similar character string group is joined by logical OR; a fourth step of searching the text using the search expression created in the third step; and a fifth step of determining the rule referred to in the first step based on the degree of congestion in the search processing of search processes that have already been requested.

3. A document search method for searching text using a search expression described by a logical OR or logical AND of words, comprising: a first step of dividing each word constituting the search expression, with reference to a predetermined rule, into partial character strings of length 2 or more that are allowed to overlap; a second step of creating a partial search expression in which the character strings produced by the first step are joined by logical AND; a third step of creating, for each constituent word in the search expression created in the second step, a group of similar character strings in which each character constituting that word is replaced with similar characters, and creating a search expression in which the similar character string group is joined by logical OR; a fourth step of searching the text using the search expression created in the third step; and a fifth step of determining the rule referred to in the first step based on a characteristic of the text to be searched.

4. A document search method in which, when text containing errors, mainly text obtained by character recognition, is searched using a search expression described by a logical OR or logical AND of words, a group of similar character strings is created for each word in the search expression by replacing each character constituting the word with similar characters that are easily confused in character recognition, a partial search expression is created by joining the similar character strings with logical OR, and the search is performed after replacing each word in the original search expression with that partial search expression, the method characterized in that, before the replacement by similar characters, parameters for the process of dividing words are determined with reference to information relating to the search, each word constituting the search expression is divided in accordance with the parameters into partial character strings of length 2 or more that are allowed to overlap, a partial search expression is created by joining the resulting character strings with logical AND, and each word in the original search expression is replaced in advance with that partial search expression.

5. The document search method according to claim 4, wherein the information referred to when determining the parameter is a characteristic of a search formula.

6. The document search method according to claim 5, wherein the characteristic of the search expression is at least one of the character type or length of each word constituting the search expression, or the type of a logical operator constituting the search expression.

7. The document search method according to claim 4, wherein the information referred to when determining the parameter is a congestion degree in a search process of a search process that has already been requested.

8. The document search method according to claim 4, wherein the information referred to when determining the parameter is a characteristic of data to be searched.

9. The document search method according to claim 8, wherein the characteristic of the data to be searched is a content rate by type of a character included in the data to be searched.

10. A document search device comprising: search request input means for inputting a search expression; character string shift division means for dividing each constituent word included in the search expression, with reference to a predetermined rule, into partial character strings of length 2 or more that are allowed to overlap, and creating a second search expression in which the constituent words in the original search expression are replaced by the divided partial character strings joined by logical AND; similar character expansion means for creating, for each constituent word in the second search expression, a group of similar character strings in which each character is replaced with similar characters that are easily confused, and creating a third search expression in which the similar character strings are joined by logical OR; search means for performing a character string search according to the third search expression; and shift division constant determination means for determining the rule referred to by the character string shift division means using at least one of a characteristic of the search expression, the degree of congestion in the search processing of search processes that have already been requested, and a characteristic of the data to be searched, wherein the character string shift division means divides each constituent word included in the search expression in accordance with the rule determined by the shift division constant determination means.

11. A recording medium on which is recorded a document search program for causing a computer to search text using a search expression described by a logical OR or logical AND of words, the program causing the computer to execute: a first step of dividing each word constituting the search expression, with reference to a predetermined rule, into partial character strings of length 2 or more that are allowed to overlap; a second step of creating a partial search expression in which the character strings produced by the first step are joined by logical AND; a third step of creating, for each constituent word in the search expression created in the second step, a group of similar character strings in which each character constituting that word is replaced with similar characters, and creating a search expression in which the similar character string group is joined by logical OR; a fourth step of searching the text using the search expression created in the third step; and a fifth step of determining the rule referred to in the first step based on the content of the search request.

12. A recording medium on which is recorded a document search program that causes a computer to realize: a character string shift division function for dividing each constituent word included in a search expression entered via input means, with reference to a predetermined rule, into partial character strings of length 2 or more that are allowed to overlap, and creating a second search expression in which the constituent words in the original search expression are replaced by the divided partial character strings joined by logical AND; a similar character expansion function for creating, for each constituent word in the second search expression, a group of similar character strings in which each character is replaced with similar characters that are easily confused, and creating a third search expression in which the similar character strings are joined by logical OR; a search function for performing a character string search according to the third search expression; and a shift division constant determination function for determining the rule referred to by the character string shift division function using at least one of a characteristic of the search expression, the degree of congestion in the search processing of search processes that have already been requested, and a characteristic of the data to be searched, wherein the character string shift division function divides each constituent word included in the search expression, in accordance with the rule determined by the shift division constant determination function, into partial character strings of length 2 or more that are allowed to overlap.