JP2001175661A - Device and method for full-text retrieval

Info

Publication number: JP2001175661A
Application number: JP35477799A
Authority: JP
Inventors: Taizou Kameshiro; 泰三亀代; Takashi Hirano; 敬平野
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-12-14
Filing date: 1999-12-14
Publication date: 2001-06-29
Anticipated expiration: 2019-12-14
Also published as: CN1300026A; JP3803219B2; CN1118034C

Abstract

PROBLEM TO BE SOLVED: When an index is generated from text produced by character recognition using only the first-ranked recognition candidate character, the recognition result is more likely to contain errors, so keywords fail to match the characters in the text and documents are frequently not retrieved correctly. SOLUTION: Documents containing recognition candidate characters that match the keyword are retrieved by referring to the index, and in addition the shape features of character images are compared with the shape features of the characters constituting the keyword, so that documents satisfying the retrieval conditions are retrieved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates, for example, to a full-text search device and a full-text search method that perform a full-text search with an arbitrary keyword over document and drawing data created by recognizing the character images contained in documents and drawings.

[0002]

[Description of the Related Art] There are two methods for storing computer-readable electronic text and searching it with a keyword: (1) directly collating the text content against the keyword one character at a time, and (2) extracting in advance the characters that appear in the text together with their position information to create an index, and at search time using the index to test the positional relationship between the keyword and the characters in the text.

[0003] In method (2), indexes can be broadly classified, by the unit of character string that is indexed, into indexes over N consecutive characters (N is an integer) and indexes over units that include grammatical elements such as words and morphemes. They can be classified further by what the position information records: either only a text number or the like, or the text number plus the position at which each character appears within the text.
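
The N-character index of method (2) can be illustrated with a minimal Python sketch. It is not taken from the cited publications: the function names, the dictionary layout, and the two sample texts are assumptions chosen for illustration. Each run of N consecutive characters is mapped to (text number, character position) pairs, and a keyword matches where its N-character runs occur at consecutive positions.

```python
from collections import defaultdict

def build_ngram_index(texts, n=2):
    """Map each run of n consecutive characters to (text number, position) pairs.

    Positions are 1-based character offsets within the text.
    """
    index = defaultdict(list)
    for text_no, text in texts.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].append((text_no, i + 1))
    return index

def search(index, keyword, n=2):
    """Return the text numbers in which the keyword's n-grams occur consecutively."""
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        return set()  # keyword shorter than n: not handled by this sketch
    # Every posting of the first n-gram is a possible start of the keyword.
    candidates = set(index.get(grams[0], []))
    for offset, gram in enumerate(grams[1:], start=1):
        postings = set(index.get(gram, []))
        # Keep a start position only if the next n-gram follows at the right offset.
        candidates = {(t, p) for (t, p) in candidates if (t, p + offset) in postings}
    return {t for (t, _) in candidates}

# Hypothetical sample texts, keyed by text number.
texts = {1: "全文検索装置", 2: "文字認識装置"}
index = build_ngram_index(texts)
hits = search(index, "検索装置")
```

Recording only the text number instead of the (text number, position) pair gives the coarser variant mentioned above, at the cost of no longer being able to test adjacency.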

[0004] In method (1), the text must be loaded into memory for the text and keyword to be collated at high speed. As the number of stored texts grows, loading them into memory takes longer, so high-speed search becomes impossible. Because no index has to be created in advance, however, this method is convenient when texts are registered and deleted frequently. Method (2) requires an index to be created in advance, so registration and deletion take longer than in method (1), but search processing time is generally shorter than in method (1). Method (2) is therefore suited to handling large volumes of documents where registration and deletion are infrequent.

[0005] FIG. 21 is a block diagram of a conventional full-text search device disclosed in, for example, JP-A-10-149367 (hereinafter, Conventional Example 1), which applies the index creation approach of method (2). In the figure, 201 is text storage means, 202 is main index registration means, 203 is sub-index registration means, 204 is main index storage means, 205 is sub-index storage means, 206 is sub-index management means, 207 is main index search means, 208 is sub-index search means, 209 is keyword search control means, 210 is keyword search result storage means, 211 is search condition input means, 212 is logical condition analysis means, and 213 is search result output means.

[0006] Next, the operation will be described. For the text stored by the text storage means 201, the main index registration means 202 registers an index of N consecutive characters, and the main index storage means 204 stores it.

[0007] At search time, the keyword search control means 209 obtains search results by searching the main index and the sub-index using the search conditions obtained from the search condition input means 211. From those results, the keyword search result storage means 210 activates the sub-index creation means 206 to create a sub-index for keywords that yield a large number of hits (number of text identifiers) or a large ratio of in-text character positions to text identifiers.

[0008] Conventional Example 1 holds a sub-index in addition to the main N-character index. The sub-index is accessed first, and the main index is accessed only if the keyword is not in the sub-index. The main index holds document numbers and character position numbers, whereas the sub-index holds only document numbers. The sub-index is therefore smaller than the main index and requires less index testing. When the keyword's N-character index entry is in the sub-index, the main index need not be accessed, which shortens search processing time. In addition, deleting from the sub-index the entries that are rarely searched, based on the search history, keeps the index small.
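
The sub-index-first lookup can be sketched as follows. This is only an illustration of the access order described above, not the implementation of JP-A-10-149367; the function name `lookup`, the dictionary layouts, and the sample entries are assumptions.

```python
def lookup(gram, sub_index, main_index):
    """Return document numbers for an N-character entry, trying the sub-index first.

    sub_index maps an entry to a set of document numbers only.
    main_index maps an entry to (document number, character position) pairs.
    """
    if gram in sub_index:
        # Hit in the small index: the main index is not touched.
        return set(sub_index[gram])
    # Miss: fall back to the main index and discard the positions.
    return {doc for doc, _pos in main_index.get(gram, [])}

# Hypothetical index contents.
main_index = {"検索": [(1, 3), (1, 9), (2, 5)], "装置": [(1, 5)]}
sub_index = {"検索": {1, 2}}

frequent = lookup("検索", sub_index, main_index)
rare = lookup("装置", sub_index, main_index)
```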

[0009] Next, to search document images that have not been character-coded (for which no electronic text has been created), character recognition is performed to extract the character portions from the document image, and electronic text is created and stored. For example, JP-A-8-7033 discloses a technique that raises the proportion of cases in which the correct character is included by holding multiple recognition candidate characters for each character image as the character recognition result.

[0010] FIG. 22 is a block diagram of the conventional full-text search device disclosed in JP-A-8-7033 (hereinafter, Conventional Example 2). In the figure, 221 is image input means, 222 is output means, 223 is character recognition means, 224 is document search means, 225 is keyword input means, 226 is image data, 227 is text information, and 228 is a search file.

[0011] Next, the operation will be described. In Conventional Example 2, when a document image is input from the image input means 221, the character recognition means 223 performs character recognition and stores the recognition candidate characters in the search file 228. To store multiple candidates, the search file 228 records the number of candidates followed by the candidates themselves, in the form [number of candidates][candidate 1][candidate 2]...

[0012] For example, when multiple recognition candidate characters are stored for the character image "新文書ファイリング" (new document filing), the record is written as [1]新[4]丈文女交[1]書[1]フ[1]ァ[1]イ[1]リ[1]ン[1]グ. At search time, the document search means 224 collates the text in the search file 228 against the keyword and deems the collation successful when every character of the keyword is contained among the recognition candidate characters. For example, searching the text "新文書ファイリング" with the keyword "文書" (document) succeeds because "文" and "書" are present among the candidates [4][丈文女交][1][書], and the text is output as a search result.
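
The record format and collation of Conventional Example 2 can be illustrated with a short Python sketch. It assumes ASCII brackets and digits for the counts, as in the example record above, and the helper names `parse_record` and `matches` are hypothetical; the actual file layout of JP-A-8-7033 may differ.

```python
import re

def parse_record(record):
    """Split '[count]candidates' groups into one candidate string per character image."""
    slots = []
    for count, chars in re.findall(r"\[(\d+)\]([^\[]*)", record):
        if len(chars) != int(count):
            raise ValueError(f"expected {count} candidates, got {chars!r}")
        slots.append(chars)
    return slots

def matches(slots, keyword):
    """True if the keyword can be read at some position, one candidate per character image."""
    for start in range(len(slots) - len(keyword) + 1):
        if all(ch in slots[start + i] for i, ch in enumerate(keyword)):
            return True
    return False

record = "[1]新[4]丈文女交[1]書[1]フ[1]ァ[1]イ[1]リ[1]ン[1]グ"
slots = parse_record(record)
found = matches(slots, "文書")
```

Because any candidate in a slot may be used, a keyword such as "女書" also matches this record; holding more candidates raises recall at the expense of such spurious hits.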

[0013] Combining Conventional Examples 1 and 2 makes it possible to create an index that includes recognition candidate characters and to search with it. For example, with N = 2, the "新文書ファイリング" example of Conventional Example 2 yields index entries built from the candidates, such as "新丈", "新文", "新女", "新交", "丈書", "文書", "女書", and "交書", which makes the approach of Conventional Example 1 applicable.
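
The N = 2 entries listed above are the pairwise combinations of the candidates of adjacent character images. A minimal sketch, with the hypothetical helper name `candidate_bigrams`, reproduces them for the first three character images of the example:

```python
from itertools import product

def candidate_bigrams(slots):
    """All two-character strings formed from candidates of adjacent character images."""
    grams = []
    for left, right in zip(slots, slots[1:]):
        grams.extend(a + b for a, b in product(left, right))
    return grams

# Candidates for the first three character images of the example record.
slots = ["新", "丈文女交", "書"]
grams = candidate_bigrams(slots)
```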

[0014]

[Problems to Be Solved by the Invention] Because conventional full-text search devices are configured as described above, when an index is created from text produced by character recognition using only the first-ranked recognition candidate character, the recognition result is more likely to contain errors. Keywords then fail to match the characters in the text, and documents are frequently not retrieved correctly.

[0015] In a search that actually collates text holding recognition candidate characters, as in Conventional Example 2, the probability that the correct character is included in the text is higher than when only the first-ranked candidate is held. However, the larger the data volume, the longer it takes to load the text file into memory, so the search cannot be made fast.

[0016] Furthermore, when an index is created from recognition candidate characters, the index for the correct character string cannot be created unless every correct character is among the candidates, and the search then fails. For example, if the character image "文字認識" (character recognition) is recognized as "文宇認識", with "字" misrecognized as "宇", the index entries created are "文宇", "宇認", and "認識". The entries "文字" and "字認" that should exist are not created, and as a result a search with the keyword "文字認識" fails.

[0017] In addition, if, for example, three recognition candidate characters are held for each character, there are 3 × 3 = 9 combinations when creating an index of two consecutive characters, nine times as many as when one candidate is held per character. For three consecutive characters there are 3 × 3 × 3 = 27 combinations. The more candidates are held, the more combinations of N consecutive characters there are, and as a result the index becomes very large.
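
The growth described above is simply the product of the candidate counts of the characters in one N-character window. A small sketch (the function name is an assumption) reproduces the 9 and 27 of the text:

```python
from math import prod

def ngram_combinations(candidate_counts):
    """Index entries produced by one window: the product of the per-character candidate counts."""
    return prod(candidate_counts)

two_chars = ngram_combinations([3, 3])       # three candidates for each of 2 characters
three_chars = ngram_combinations([3, 3, 3])  # three candidates for each of 3 characters
```

With k candidates per character, one window of N characters yields k to the power N entries, which is why limiting the number of candidates matters for index size.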

[0018] The present invention has been made to solve the above problems, and its object is to provide a full-text search device and a full-text search method capable of fast and accurate full-text search. A further object of the invention is to provide a full-text search device whose index capacity can be reduced.

[0019]

[Means for Solving the Problems] The full-text search device according to the present invention is provided with search means that refers to the index to retrieve documents containing recognition candidate characters that match the keyword, and that also collates the shape features of the character images extracted by feature extraction means against the shape features of the characters constituting the keyword, to retrieve documents that satisfy the search condition.

[0020] In the full-text search device according to the present invention, concatenated characters formed by combining two or more recognition candidate characters are included among the targets of index creation.

[0021] In the full-text search device according to the present invention, among the recognition candidate characters output by the character recognition means, those whose accuracy is lower than a reference accuracy are excluded from index creation.

[0022] In the full-text search device according to the present invention, even when the accuracy of a recognition candidate character output by the character recognition means is lower than the reference accuracy, if the candidate belongs to a character image that has no candidate whose accuracy exceeds the reference accuracy, the candidate is included in index creation and is given an identification symbol that distinguishes it from other recognition candidate characters.

[0023] The full-text search device according to the present invention stores the shape features of a character image in a database, together with the character codes of the characters that may form a word with each recognition candidate character for that character image.

[0024] The full-text search device according to the present invention determines the characters that may form a word with each recognition candidate character by taking linguistic information or character type into account.

[0025] The full-text search device according to the present invention calculates the distance between the shape feature of a character image extracted by the feature extraction means and the shape feature of a character constituting the keyword, and deems the search condition satisfied when that distance meets a predetermined criterion.

[0026] The full-text search device according to the present invention is provided with setting means for setting whether the search means performs shape feature collation.

[0027] The full-text search device according to the present invention excludes documents containing recognition candidate characters that match the keyword from shape feature collation.

[0028] The full-text search device according to the present invention collates the shape features of the character images extracted by the feature extraction means against the shape features of the characters constituting the keyword only when no recognition candidate character matches the keyword.

[0029] When identifying the targets of shape feature collation for a keyword, the full-text search device according to the present invention treats recognition candidate characters to which the identification code has been added as wildcards.

[0030] The full-text search device according to the present invention decides whether to include a concatenated character, formed by combining two or more recognition candidate characters, in index creation by considering the probability with which that concatenated character appears across the documents as a whole.

[0031] When each recognition candidate character constituting a concatenated character is the sole recognition candidate for its character image, the full-text search device according to the present invention counts up the number of occurrences of that concatenated character and updates the appearance probability.

[0032] The full-text search device according to the present invention counts up the number of occurrences of a concatenated character that matches the keyword and updates the appearance probability.

[0033] When a recognition candidate character output by the character recognition means is corrected, the full-text search device according to the present invention counts up the number of occurrences of the concatenated characters containing the corrected candidate and updates the appearance probability.

[0034] The full-text search method according to the present invention refers to the index to retrieve documents containing recognition candidate characters that match the keyword, and also collates the shape features of character images against the shape features of the characters constituting the keyword, to retrieve documents that satisfy the search condition.

[0035]

[Embodiments of the Invention] An embodiment of the present invention is described below. Embodiment 1. FIG. 1 is a block diagram of a full-text search device according to Embodiment 1 of the present invention. In the figure, 1 is image input means for inputting an image; 2 is character recognition means that recognizes each character image contained in the input image and outputs one or more recognition candidate characters for each character image together with the accuracy (similarity) of each candidate; and 3 is index creation means that creates an index indicating the correspondence between each recognition candidate character output by the character recognition means 2 and its character position.

[0036] 4 is ambiguous text extraction means (feature extraction means). When the input image contains a character image that has no recognition candidate character whose accuracy exceeds the reference accuracy, it extracts the shape feature of that character image, determines, in view of linguistic information or character type, the characters that may form a word (character string) with the recognition candidates for that character image, and extracts that character string as ambiguous text. 5 is search condition input means (input means, setting means) for inputting a keyword as the document search condition. 6 is search means that refers to the index to retrieve the document numbers of the recognition candidate characters matching the keyword, and that also collates the shape features of the character images extracted by the ambiguous text extraction means 4 against the shape features of the characters constituting the keyword, to retrieve the document numbers that satisfy the search condition. 7 is output means for outputting the search results of the search means 6.

[0037] 8 is a character recognition dictionary used by the character recognition means 2 for character recognition; 9 is a shape feature dictionary used by the search means 6 during keyword search; 10 is an ambiguous text database that stores the ambiguous text extracted by the ambiguous text extraction means 4; 11 is an index database that stores the index created by the index creation means 3; and 12 is a recognized character database that stores recognition candidate characters and the like.

[0038] Next, the operation will be described. First, the document registration method is described with reference to FIG. 2. In step ST100, the image input means 1 inputs a document image that a computer can process. The image input means 1 may be a scanner, a digital camera, or the like, or a previously created computer-processable image may be input via a network or similar route. Here, the document image of FIG. 3 is assumed to be input from the image input means 1.

[0039] Next, in step ST110, the character recognition means 2 performs character recognition on the input image received from the image input means 1 and outputs character codes together with similarity values indicating their likelihood. Character recognition can be performed with known techniques, so the details are omitted. For each character image contained in the input image, the character recognition means 2 outputs multiple recognition candidate characters and their similarities.

[0040] FIG. 4 shows part of the recognition result of the character recognition means 2: for each character image in the first and second lines of FIG. 3, the recognition candidate characters ranked first through fifth and their similarities. A "◆" among the recognition candidate characters in FIG. 4 means that no corresponding character code is stored.

[0041] Next, in step ST120, the index creation means 3 narrows down, from the recognition results shown in FIG. 4, the recognition candidate characters to be used for search. One way to narrow them down is to determine in advance, from training data, the relationship between a candidate's similarity value and the probability that the candidate is correct, set a threshold TH1 at which the probability of being correct is high and the narrowing is sufficient, and keep the recognition candidate characters whose similarity is TH1 or more.

[0042] When no recognition candidate character has a similarity of TH1 or more, the correct character is probably not among the candidates, so a "*" symbol, indicating that the correct character is likely missing, is added alongside the recognition candidate characters. "*" is used in this example, but another character code, or a value other than a character code, may be assigned instead. FIG. 5 shows the result of narrowing down the recognition candidate characters. For example, with TH1 = 80, character position numbers 4 and 9 have no recognition candidate character with a similarity of 80 or more (see FIG. 4), so "*" is added to them (see reference numerals 23 and 24 in FIG. 5). The index creation means 3 stores the narrowed-down recognition candidate characters shown in FIG. 5 in the recognized character database 12.
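
The narrowing of step ST120 can be sketched in Python as follows. The similarity values are hypothetical, since FIG. 4 is not reproduced here, and the behavior when no candidate reaches TH1 (keeping every candidate and appending "*") is an assumption based on the description above, not a confirmed detail of the embodiment.

```python
TH1 = 80  # threshold used in the example of the text

def narrow(candidates, threshold=TH1):
    """Keep the candidates whose similarity is at least the threshold.

    candidates is a list of (character, similarity) pairs in rank order.
    """
    kept = [ch for ch, sim in candidates if sim >= threshold]
    if kept:
        return kept
    # Assumption: when nothing reaches the threshold, all candidates are kept
    # and '*' is appended to flag that the correct character is likely missing.
    return [ch for ch, _ in candidates] + ["*"]

# Hypothetical recognition results for two character images.
confident = narrow([("文", 95), ("丈", 82), ("交", 60)])
doubtful = narrow([("宇", 70), ("字", 65)])
```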

[0043] Next, in step ST130, the index creation means 3 creates the index. Here, a one-character index and an index of two consecutive characters are created from the recognition candidate characters shown in FIG. 5. The creation method is as follows. FIG. 9 shows the two-character index that the index creation means 3 creates from the recognition candidate characters of FIG. 5. Starting from the first character of FIG. 5, for each pair of adjacent characters it stores the character codes of the preceding and following characters, the appearance position of the preceding character, and the product of the recognition candidate rank of the preceding character and that of the following character. The appearance position is written "X-Y", meaning the Y-th character from the beginning of document number X. Here, the document image of FIG. 3 has document number 1.

[0044] For example, the index entry "文書" 25 of FIG. 9 is created from "文" 21 and "書" 22 of FIG. 5. Since "文" 21 is the first character from the beginning of document 1, the character position is "1-1", and since "文" 21 and "書" 22 are both ranked first, the recognition candidate rank is 1 × 1 = 1. FIG. 10 is a table that stores the positions and recognition ranks for the one-character index; it holds the character code, the character appearance position, and the recognition candidate rank. For a character judged not to include the correct character code, "*" 31 and the character position 32 are held.
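
The two-character index of step ST130 can be sketched as follows, under stated assumptions: the candidate lists are hypothetical (FIG. 5 is not reproduced here), the function name and the dictionary layout are illustrative, and the handling of "*"-marked positions is left out.

```python
from collections import defaultdict
from itertools import product

def build_bigram_index(doc_no, slots):
    """Build a two-character index from ranked candidates.

    slots[i] lists the candidates of the (i+1)-th character image, rank 1 first.
    Each entry records the position "X-Y" of the preceding character and the
    product of the two candidate ranks.
    """
    index = defaultdict(list)
    for pos, (left, right) in enumerate(zip(slots, slots[1:]), start=1):
        pairs = product(enumerate(left, start=1), enumerate(right, start=1))
        for (rank_a, a), (rank_b, b) in pairs:
            index[a + b].append((f"{doc_no}-{pos}", rank_a * rank_b))
    return index

# Hypothetical candidates for the first three character images of document 1.
slots = [["文", "丈"], ["書"], ["検"]]
index = build_bigram_index(1, slots)
```

Here `index["文書"]` holds the position "1-1" with rank product 1, matching the worked example above, while the lower-ranked reading "丈書" at the same position carries the product 2.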

[0045] Next, in step ST140, the ambiguous text extraction means 4 extracts ambiguous text containing characters whose candidates do not include the correct character code. That is, for each character code marked "*" among the recognition candidate characters of FIG. 5, the ambiguous text extraction means 4 creates a shape feature from the character image and stores it in the ambiguous text database 10 together with several characters before and after it. The surrounding characters can be chosen, for example, by running a known morphological analysis and taking the characters before and after the "*"-marked character code for which morphological analysis failed, by taking the consecutive characters in the same category as the "*"-marked character code (alphabetic, kanji, numeral, hiragana, or katakana), or by fixing the number of characters. Here, the one following character is kept.

[0046] FIG. 8 shows a concrete method of creating the shape feature: the character image is divided into eight regions and the number of black pixels in each region is counted. For example, region 41 yields 13 black pixels (see reference numeral 49) and region 42 yields 10 black pixels (see reference numeral 50). The shape features created in this way are stored together with the recognition candidate characters. FIG. 6 shows an example of holding the shape features extracted from the character images of the fourth and ninth characters. The ambiguous text extracting means 4 also stores, in the recognized character database 12, the position of each character for which a shape feature was created and its feature values (see the lower part of FIG. 5).
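The black-pixel counting can be sketched as below. The text states only that the image is divided into eight regions; the 4 × 2 grid, the region ordering, and the representation of the image as rows of 0/1 values are assumptions made for this illustration, as is the name `shape_feature`.

```python
def shape_feature(image, rows=4, cols=2):
    """Count black pixels (value 1) in each cell of a rows x cols grid.

    image is a list of equal-length rows of 0/1 values. The 4 x 2 layout
    is an assumed way of obtaining eight regions.
    """
    height, width = len(image), len(image[0])
    feature = []
    for r in range(rows):
        for c in range(cols):
            count = sum(
                image[y][x]
                for y in range(r * height // rows, (r + 1) * height // rows)
                for x in range(c * width // cols, (c + 1) * width // cols)
            )
            feature.append(count)
    return feature


# An 8 x 4 image that is entirely black: each of the 8 regions is 2 x 2.
all_black = [[1] * 4 for _ in range(8)]
print(shape_feature(all_black))  # [4, 4, 4, 4, 4, 4, 4, 4]
```

The result is an eight-element vector of counts, which is the form compared during the shape feature matching described later.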

[0047] Next, the document search method is described. Here it is assumed that, as a result of the document registration processing, the index database 11 and the ambiguous text database 10 store only the data for the document with document number 1. FIG. 11 is a flowchart showing the document search method.

[0048] First, in step ST200, the user inputs a keyword using the search condition input means 5. The search condition input means 5 can be implemented with a computer keyboard or mouse, but is not limited to these; voice input using a microphone, a telephone, or the like is also possible. Here, the keyword 「文字」 ("character") is assumed to be input. Next, in step ST210, the search means 6 divides the input keyword, here into single characters and two-character concatenated strings. That is, the keyword is divided into 「文」, 「字」, and 「文字」.
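The keyword division of step ST210 amounts to listing the single characters and the adjacent two-character strings of the keyword. A small sketch, with `split_keyword` as a hypothetical helper name:

```python
def split_keyword(keyword):
    """Return the single characters followed by the adjacent 2-character strings."""
    singles = list(keyword)
    pairs = [keyword[i:i + 2] for i in range(len(keyword) - 1)]
    return singles + pairs


print(split_keyword("文字"))  # ['文', '字', '文字']
```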

[0049] Next, in step ST220, the search means 6 searches for documents using the indexes. FIG. 12 is a flowchart showing the index matching. First, in step ST221, the search means 6 retrieves the index entries for the divided strings 「文字」, 「文」, and 「字」 (see reference numeral 26 in FIG. 9 and reference numerals 27 and 28 in FIG. 10). Specifically, the contents of each index entry are loaded into a memory (not shown).

[0050] Next, in step ST222, the character positions are verified to find the document number. The document number may be found by verifying the character positions of 「文」 and 「字」 individually, or by using the index entry 26 for 「文字」. Here, the index entry 26 for 「文字」 is used. Since the character position of 「文字」 is "1-7", document number 1 is the search result. Finally, in step ST224, the search means 6 outputs the result of the index search.
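Both lookup routes mentioned for step ST222 can be sketched as follows. The dictionary layout, the helper names, and the rank values in the sample data are assumptions for illustration; only the position "1-7" of 「文字」 comes from the text.

```python
def docs_from_bigram(bigram_index, keyword):
    """Document numbers taken directly from the two-character index entry."""
    return {int(pos.split("-")[0]) for pos, _rank in bigram_index.get(keyword, [])}


def docs_from_unigrams(unigram_index, keyword):
    """Document numbers where the two characters occur at adjacent positions."""
    def positions(ch):
        return {tuple(map(int, pos.split("-"))) for pos, _rank in unigram_index.get(ch, [])}

    first, second = positions(keyword[0]), positions(keyword[1])
    return {doc for doc, pos in first if (doc, pos + 1) in second}


# Sample entries; ranks are illustrative assumptions.
bigram_index = {"文字": [("1-7", 2)]}
unigram_index = {"文": [("1-1", 1), ("1-7", 1)], "字": [("1-8", 2)]}

print(docs_from_bigram(bigram_index, "文字"))     # {1}
print(docs_from_unigrams(unigram_index, "文字"))  # {1}
```

The two-character entry gives the answer with a single lookup, whereas the one-character route has to check that the positions are consecutive.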

[0051] Next, in step ST230 of FIG. 11, the search means 6 performs a search using the ambiguous text. FIG. 13 is a flowchart showing the ambiguous text matching. First, in step ST231, the documents to be searched are determined. To avoid redundant processing, documents whose document numbers were already output as candidates by the index matching (step ST220) are excluded from the search targets.

[0052] Specifically, the document numbers containing either 「文」 or 「字」 of the keyword 「文字」 are picked up, and the documents whose numbers were output in step ST220 are removed from them; the remainder are the search target documents. That is, the OR of the document numbers indicated by the index entry 27 for 「文」 and by the index entry 28 for 「字」 in FIG. 10 is taken, and the search result of step ST220 is removed from it. In this case, the OR of the document numbers for 「文」 and 「字」 is "1", and document number 1 was output in step ST220, so removing document number 1 from document number 1 leaves no target document.
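The target selection of step ST231 is a set union followed by a set difference. A minimal sketch, assuming the one-character index is available as a mapping from a character to the set of document numbers containing it (the name `ambiguous_targets` and this layout are hypothetical):

```python
def ambiguous_targets(one_char_docs, keyword, index_hits):
    """OR the document sets of the keyword characters, then drop index hits."""
    docs = set()
    for ch in keyword:
        docs |= one_char_docs.get(ch, set())
    return docs - set(index_hits)


one_char_docs = {"文": {1}, "字": {1}, "題": {1}}

# 文字 was already found in document 1 by the index matching.
print(ambiguous_targets(one_char_docs, "文字", {1}))    # set()
# 課題 produced no index hit, so document 1 remains a target.
print(ambiguous_targets(one_char_docs, "課題", set()))  # {1}
```

The second call corresponds to the 「課題」 case, in which only 「題」 has an index entry and the index matching returns nothing.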

[0053] Next, in step ST232, the target documents are loaded into memory; here nothing is loaded because there is no target document. Then, in step ST233, matching at the character code level would be performed, and in step ST234 matching of the shape features would be performed, but neither is carried out because there is no target document. In step ST235 the process takes the Y branch, and in step ST236 "no result" is output and the ambiguous text matching ends. Finally, in step ST240 of FIG. 11, the combined search result (document number 1) is output and the process ends.

[0054] Next, a search in which the user inputs 「課題」 ("problem") as the keyword is described. In step ST200 of FIG. 11, the user inputs 「課題」 as the keyword from the search condition input means 5. In step ST210, the search means 6 divides the keyword, here into 「課」, 「題」, and 「課題」. Next, in step ST220, the search means 6 executes the search by index matching. In step ST221 of FIG. 12, the index entries are retrieved; here the index entry 30 for 「題」 exists, but no index entries exist for 「課題」 or 「課」. The process proceeds through steps ST222 and ST224 and, because no index entry for 「課題」 exists, ends with no result.

[0055] Next, in step ST230 of FIG. 11, the search means 6 searches the ambiguous text. First, in step ST231 of FIG. 13, the documents to be searched are determined: the OR of the document numbers indicated by the index entry for 「課」 and by the index entry for 「題」 is taken, and the search result of step ST220 is removed from it.

[0056] The document number indicated by the index entry 30 for 「題」 is "1" and step ST220 produced no search result, so the document number of the target document is "1". Next, in step ST232, the ambiguous text of the target document is loaded into memory. Here, the text and shape features of document number 1 shown in FIG. 6 are loaded.

[0057] Next, in step ST233, the search means 6 executes matching at the character code level. If even one character matches the search keyword, the vicinity of the matched character position is stored as the shape feature matching range and the process moves on. Specifically, the vicinity of any part in which either 「課」 or 「題」 of the keyword 「課題」 occurs becomes the shape feature matching range. Here, 「題」 33 in FIG. 6 matches, so it is taken as the shape feature matching range.

[0058] Next, in step ST234, the search means 6 executes matching using the shape features. Here, the shape feature 34 in FIG. 6 is loaded, and the shape feature of 「課」 is loaded from the shape feature dictionary 9. In FIG. 8, regions 41 to 48 are assigned to region 1 to region 8. The shape features are compared by computing the difference between the features of each region, as shown below.

【００５９】[0059]

【数１】 (Equation 1)

[0060] Here, D is the distance between the shape features, X_i is the i-th shape feature of the text in the ambiguous text database 10, and Y_i is the i-th shape feature of the corresponding keyword character (stored in the shape feature dictionary 9).

[0061] When the distance D is equal to or less than a threshold THR, the shape feature matching is regarded as successful and the document is output as a search result. Suppose the feature values of regions 1 to 8 for 「課」 in the shape feature dictionary 9 are 10, 7, 12, 12, 10, 5, 10, and 9, respectively; the distance to the shape feature 34 in FIG. 6 is then D = 30. Since THR ≥ D holds, the matching between these features succeeds and document number 1 is output as the search result. Finally, in step ST240, document number 1 is output as the search result.
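The threshold test on D can be illustrated as below. Equation 1 itself is not reproduced in this text, so the sketch assumes one form consistent with the description of per-region differences, namely the sum of absolute differences D = Σ|X_i − Y_i| over the eight regions; the actual equation may differ. The extracted feature vector and the threshold value are hypothetical placeholders, so the sketch does not reproduce the D = 30 of the text; only the dictionary values for 「課」 are taken from the description.

```python
def shape_distance(x, y):
    """Sum of absolute per-region differences (an assumed form of Equation 1)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def shapes_match(x, y, thr):
    """Matching succeeds when the distance does not exceed the threshold."""
    return shape_distance(x, y) <= thr


dictionary_feature = [10, 7, 12, 12, 10, 5, 10, 9]   # 課, values from the text
extracted_feature = [13, 10, 12, 12, 10, 5, 10, 9]   # hypothetical values
THR = 40                                             # hypothetical threshold

print(shape_distance(extracted_feature, dictionary_feature))     # 6
print(shapes_match(extracted_feature, dictionary_feature, THR))  # True
```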

[0062] In Embodiment 1, one-character and two-character indexes have been described, but the indexes are not limited to these; an index of three consecutive characters, or of more, may be used. Also, in Embodiment 1 the search uses both the indexes and the ambiguous text, but this is not required: as shown in FIG. 20, the search result may be output without performing the ambiguous text matching. Without the ambiguous text, parts where character recognition failed cannot be searched, but the result is output faster. Conversely, using the ambiguous text enables a higher-accuracy search. Therefore, by specifying, when the search condition is input to the search condition input means 5, whether to perform the search using the ambiguous text, the user can freely choose to prioritize search accuracy or search speed.

[0063] The ambiguous text has been described with reference to FIG. 6, but as shown in FIG. 7, tables may instead be created that give the start and end positions of the ambiguous text within each document number and indicate which documents contain each character code of the ambiguous text. The operation in this case is as follows. At registration, as described above, the ambiguous text extracting means 4 determines as ambiguous text a character string consisting of a character whose similarity is TH1 or less plus several characters before and after it, and holds its start character position, end character position, and document number. Taking 「＊」 23 in FIG. 5 as an example, the ambiguous text here is this character and the one character following it. In FIG. 7, start character position 4 (see reference numeral 500), end character position 5 (see reference numeral 501), and document number 1 (see reference numeral 502) are held.

[0064] The ambiguous text extracting means 4 also creates the table, shown in FIG. 7(B), of the characters that appear in the ambiguous text. Document number 1 is held for every recognition candidate character at start character position 4 and end character position 5. In this example, from FIG. 5, document number 1 is held for 「諜」 503, 「訓」 504, 「詰」 505, 「語」 506, 「話」 507, and 「題」 508 in FIG. 7(B).

[0065] The search processing is the same as in Embodiment 1 up to step ST220 of FIG. 11. In step ST230, for the keyword 「課題」, the search means 6 loads the entries for 「課」 and 「題」 from the table of FIG. 7(B) and determines the relevant documents. Here, no document contains 「課」 and the document containing 「題」 has document number "1", so the search using the shape features is executed on document number 1. For the 4th to 5th characters and the 9th to 10th characters of document number 1 in FIG. 7(A), the characters and shape features are loaded from the recognized character database 12 of FIG. 5 and matched. The rest is the same as in Embodiment 1. This avoids holding the same data in both the recognized character database 12 and the ambiguous text database 10, and the larger the data volume, the greater the saving in storage capacity.

[0066] As is clear from the above, according to Embodiment 1, the document numbers of recognition candidate characters that match the keyword are searched for by referring to the indexes, while the shape features of the character images are matched against the shape features of the characters constituting the keyword to find the document numbers that satisfy the search condition. This provides the effect that a fast and highly accurate full-text search can be performed.

[0067] Embodiment 2. In Embodiment 1, when not all of the character codes match, the document number is searched for using the shape features; however, the search may instead be performed with the index file alone, without using the shape features. The document registration method is the same as in Embodiment 1, so only the document search method is described.

[0068] First, in step ST200 of FIG. 11, the keyword 「課題」 is assumed to be input. Next, in step ST210, the keyword is divided, producing 「課」, 「題」, and 「課題」. Next, in step ST220, the search by index matching is performed; the flowchart of FIG. 14 is used for the index matching. In step ST221, the search means 6 retrieves the index entry for each divided keyword string. No index entries exist for 「課題」 or 「課」 and only the entry for 「題」 exists, so the index entry 30 for 「題」 is retrieved from FIG. 10.

[0069] Next, in step ST222, the character positions are matched. Since no index entry for 「課題」 exists, no document matches, and the process proceeds to step ST223. In step ST223, matching using the 「＊」 symbol is performed for character positions that partially mismatch. This search allows a match on the strings 「＊題」 and 「課＊」 even when the text does not completely match the keyword, as with 「課題」. The procedure uses the index entries for 「課」 and 「題」 to detect character positions from either of them. No index entry exists for 「課」, but the index entry 30 exists for 「題」.

[0070] Next, the index entry 31 for the 「＊」 character is loaded, and it is verified whether any position in the 「＊」 entry 31 is adjacent to a position in the 「題」 entry 30. The first character position of 「＊」, "1-4" 32, is one character before position 1-5 of 「題」, so the condition is satisfied. Since 「題」 has no other character positions, the search result (document number 1) is output in step ST224 and the index matching ends. In FIG. 11, the ambiguous text matching of step ST230 is not performed; the process proceeds to step ST240, outputs the search result (document number 1), and ends.

[0071] In Embodiment 2, for a character whose recognition candidates are judged to contain no correct answer, the 「＊」 symbol is added to the recognition candidate characters, and the search treats this symbol as matching any character. However, a match such as 「＊＊」 that contains no correct character at all is not counted as a success. This provides the effect of reducing search omissions caused by misrecognition.
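The 「＊」 matching of Embodiment 2 can be sketched as follows. The index layout (a mapping from a character or "*" to a set of (document number, position) pairs), the function name `wildcard_search`, and the second 「＊」 position in the sample data are assumptions for illustration. Anchoring every attempt on a keyword character that is actually present in the index is one way to enforce the rule that a match made only of 「＊」 symbols does not succeed.

```python
def wildcard_search(index, keyword):
    """Return document numbers where the keyword matches, letting "*" stand
    for any character, provided at least one real keyword character matches."""
    stars = index.get("*", set())
    hits = set()
    for offset, anchor in enumerate(keyword):
        # Anchor on a real occurrence of one keyword character.
        for doc, pos in index.get(anchor, set()):
            start = pos - offset
            if all(
                (doc, start + k) in index.get(ch, set()) or (doc, start + k) in stars
                for k, ch in enumerate(keyword)
            ):
                hits.add(doc)
    return hits


# 題 at position 1-5 and "*" at 1-4 follow the text; 1-9 is illustrative.
index = {"題": {(1, 5)}, "*": {(1, 4), (1, 9)}}
print(wildcard_search(index, "課題"))  # {1}

# Two adjacent "*" positions alone do not count as a match.
print(wildcard_search({"*": {(1, 4), (1, 5)}}, "課題"))  # set()
```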

[0072] Embodiment 3. FIG. 15 is a block diagram showing a full-text search apparatus according to Embodiment 3 of the present invention. In the figure, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted. Reference numeral 13 denotes recognition result correcting means for correcting the recognition result of the character recognition means 2; 14 denotes character chain appearance probability dictionary updating means (appearance probability updating means) for changing the character chain appearance probabilities; 15 denotes a character chain appearance probability dictionary that stores the appearance probabilities of character chains; and 16 denotes index creating means that, when creating an index, refers to the character chain appearance probability dictionary 15 to determine whether a concatenated string combining two or more recognition candidate characters is to be included in the index.

[0073] Next, the operation is described. Here, the method of creating an index using the character chain appearance probability dictionary 15 and the method of updating the character chain appearance probability dictionary 15 are described. In the document registration processing, the steps up to step ST120 of FIG. 2 are performed in the same manner as in Embodiment 1.

[0074] In step ST130 of FIG. 2, the index creating means 16 narrows down the recognition candidate characters as in Embodiment 1 and creates the indexes from the recognition candidate characters shown in FIG. 5. At this point, the character chain appearance probability dictionary 15 is used to decide, for each combination of recognition candidate characters, whether to create an index entry. FIG. 16 shows an example of the character chain appearance probability dictionary 15. Its "probability" column is obtained in advance by counting, over a large number of training documents, the occurrences of each combination of N consecutive characters and computing its appearance probability over the whole set of documents. The total is the number of combinations that actually appear in the training documents. The probabilities of the combinations (concatenated strings) that share the same first character sum to 1. For example, the probabilities of the combinations beginning with 「文」, such as 「文字」, 「文学」, and 「文章」, sum to 1.

[0075] The following expression is defined; E is calculated for each combination of the recognition candidate characters in FIG. 5, and whether to create an index entry is decided by the value of E.

【００７６】[0076]

【数２】 (Equation 2)

$$E = \alpha \left( R_{ij} + R_{(i+1)k} \right) + (1 - \alpha)\, \beta\, P_{ij(i+1)k}$$

[0077] Here, R represents the similarity obtained in character recognition: R_ij is the similarity of the j-th ranked recognition candidate character at the i-th character position from the beginning of the text, and R_(i+1)k is likewise the similarity of the k-th ranked recognition candidate character at the (i+1)-th character position. P_ij(i+1)k is the probability that the j-th ranked recognition candidate character at the i-th character position is immediately followed by the k-th ranked recognition candidate character at the (i+1)-th character position. α and β are constants.

[0078] Specifically, in FIG. 5, for i = 7 for example, E is calculated for the six combinations 「文宇」, 「文字」, 「文学」, 「丈宇」, 「丈字」, and 「丈学」. A combination whose value is at or above a certain threshold is entered in the index, and a combination below the threshold is not kept in the index. With α = 0.5 and β = 300, E(文宇) = 0.5 × (90 + 86) + (1 − 0.5) × 300 × 0.001 = 88.15. Calculated in the same way, E(文字) = 102, E(文学) = 86.5, E(丈宇) = 78.15, E(丈字) = 77.15, and E(丈学) = 75.15. Therefore, when combinations with E > 85 are stored in the index, only 「文字」, 「文宇」, and 「文学」 are registered. In the two-character index of FIG. 9, the ranks are then assigned in descending order of E: 「文字」 is held as 1, 「文宇」 as 2, and 「文学」 as 3.
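The worked example can be reproduced with a short script, assuming the score has the form E = α(R_ij + R_(i+1)k) + (1 − α)βP that the calculation of E(文宇) follows. Only the similarities 90 and 86 and the probability 0.001 for 「文宇」 are stated in the text. The remaining similarities and probabilities below are not given in the description: they are values back-solved so that the six results match the quoted ones, and other assignments would fit equally well.

```python
ALPHA, BETA, THRESHOLD = 0.5, 300, 85

# 文 = 90 and 宇 = 86 are from the text; the rest are back-solved assumptions.
similarity = {"文": 90, "丈": 70, "宇": 86, "字": 84, "学": 80}
# 文宇 = 0.001 is from the text; the rest are back-solved assumptions.
probability = {
    "文宇": 0.001, "文字": 0.1, "文学": 0.01,
    "丈宇": 0.001, "丈字": 0.001, "丈学": 0.001,
}


def score(first, second):
    """E for one pair of adjacent recognition candidate characters."""
    p = probability[first + second]
    return ALPHA * (similarity[first] + similarity[second]) + (1 - ALPHA) * BETA * p


scores = {a + b: score(a, b) for a in "文丈" for b in "宇字学"}

# Keep pairs above the threshold and rank them in descending order of E.
kept = sorted((p for p, e in scores.items() if e > THRESHOLD), key=scores.get, reverse=True)
ranks = {pair: rank for rank, pair in enumerate(kept, start=1)}

for pair in scores:
    print(pair, round(scores[pair], 2))
print(ranks)  # {'文字': 1, '文宇': 2, '文学': 3}
```

Three of the six combinations are dropped, which is how the score keeps the index compact while still retaining the correct reading 「文字」.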

[0079] The document search method is the same as in Embodiment 1. Because the value is calculated from the similarity used in character recognition and the probability that the combination of characters appears consecutively in a document, combinations that are unlikely to be correct as characters, or unlikely to occur as a string in a document, are eliminated. The search index can thus be made compact while keeping erroneous deletion of correct characters low.

[0080] Embodiment 4. Next, a method of changing the character chain appearance probability dictionary 15 is described. In documents with the same or similar content and field, the important words that appear in each document are similar and appear relatively often. Therefore, by learning the character combinations that appear and updating the character chain appearance probability dictionary 15 for the documents of each field, the index can be made compact without greatly reducing search accuracy. Embodiment 4 describes an example in which, from the character recognition result, the occurrences of character combinations considered to be correct are counted and the counts are reflected in the character chain appearance probability dictionary 15.

[0081] FIG. 17 is a flowchart showing the document registration method. The document used for registration is the same as in Embodiment 1. The steps up to step ST120 are performed in the same manner as in Embodiment 1. In step ST135, the indexes are created as in Embodiment 1. After that, the character chain appearance probability dictionary updating means 14 counts, among the recognition candidate characters shown in FIG. 5, the occurrences of combinations of consecutive characters that each have exactly one candidate.

[0082] In FIG. 5, occurrences are counted for the combinations 「文書」, 「識性」, 「性能」, 「能の」, 「の向」, and 「向上」. The character chain appearance probability dictionary updating means 14 holds each combination and its count in a buffer (not shown) and updates the character chain appearance probability dictionary 15 of FIG. 16 at some timing, for example once every several document registrations. Alternatively, the update may be triggered by an update command from the user. Then, in step ST140, the ambiguous text is created as in Embodiment 1, and the process ends.
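The counting rule, in which only adjacent positions that each have a single recognition candidate contribute, can be sketched as follows. The candidate lists are an illustrative fragment assembled from characters mentioned in the description, not the full content of FIG. 5, and `count_confident_bigrams` is a hypothetical name.

```python
from collections import Counter


def count_confident_bigrams(candidates):
    """Count adjacent pairs in which both positions have exactly one candidate."""
    counts = Counter()
    for left, right in zip(candidates, candidates[1:]):
        if len(left) == 1 and len(right) == 1:
            counts[left[0] + right[0]] += 1
    return counts


# Illustrative fragment: positions with several candidates are skipped.
fragment = [["文"], ["書"], ["文", "丈"], ["宇", "字", "学"], ["識"], ["性"], ["能"]]
counts = count_confident_bigrams(fragment)
print(sorted(counts.items()))  # [('性能', 1), ('文書', 1), ('識性', 1)]
```

Counts gathered this way could be buffered and merged into the dictionary periodically, as the text describes.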

[0083] The character chain appearance probability dictionary 15 can also be updated when the user corrects a character recognition error in the recognition candidate characters using the recognition result correcting means 13, by counting the corrected character combinations. FIG. 19 is a flowchart showing this document registration method. In FIG. 19, the steps up to step ST120 are performed in the same manner as in Embodiment 1.

[0084] In step ST125, characters are corrected using the recognition result correcting means 13. For example, the user corrects character positions 8 and 9 in FIG. 5 as shown by reference numerals 60 and 61 in FIG. 18. Next, in step ST133, the index creating means 16 creates the indexes from the recognition candidate characters shown in FIG. 18. Next, in step ST143, the character chain appearance frequencies are counted: the character chain appearance probability dictionary updating means 14 counts the combinations, including those with the characters before and after the corrected characters, in which each position has a single recognition candidate character. Here, the combinations 「字認」 and 「認識」 in FIG. 18 are counted. The character chain appearance probability dictionary 15 is updated at some timing, for example after a certain number of corrections.

[0085] In addition to corrections of misrecognized characters, the character chain appearance frequencies may be counted from the keywords used in searches and reflected in the character chain appearance probability dictionary 15, so that strings that have been used as keywords are retained more reliably at registration.

【００８６】[0086]

[Effect of the Invention] As described above, according to the present invention, search means is provided that refers to the indexes to search for documents containing recognition candidate characters that match the keyword, and that also matches the shape features of the character images extracted by the feature extracting means against the shape features of the characters constituting the keyword to search for documents satisfying the search condition. This provides the effect that a fast and highly accurate full-text search can be performed.

[0087] According to the present invention, connected characters formed by combining two or more recognition candidate characters are included among the index creation targets, so high-speed, high-precision full-text search can be performed.

[0088] According to the present invention, among the recognition candidate characters output by the character recognition means, those whose accuracy is lower than the reference accuracy are excluded from the index creation targets, so the size of the index can be reduced without degrading search accuracy.

[0089] According to the present invention, even when the accuracy of a recognition candidate character output by the character recognition means is lower than the reference accuracy, if that candidate belongs to a character image that has no recognition candidate character whose accuracy exceeds the reference accuracy, the candidate is included among the index creation targets and an identification symbol that distinguishes it from other recognition candidate characters is added to it. As a result, even in a search where the keyword and the character codes do not match, a search using only the index database becomes possible.
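The selection rule described in this paragraph and the preceding one can be sketched as follows. The function name `select_index_entries`, the boolean flag used in place of the identification symbol, and the treatment of a candidate whose accuracy exactly equals the reference accuracy (kept, unflagged) are assumptions for illustration only.

```python
def select_index_entries(candidates, reference_accuracy):
    """Choose which recognition candidates of one character image to index.

    candidates: list of (character, accuracy) pairs for a single character image.
    Returns a list of (character, flagged) pairs, where flagged=True plays the
    role of the identification symbol.
    """
    confident = [(ch, False) for ch, acc in candidates
                 if acc >= reference_accuracy]
    if confident:
        # Candidates below the reference accuracy are simply left out.
        return confident
    # No candidate reaches the reference accuracy: index them all, but flagged.
    return [(ch, True) for ch, _ in candidates]

print(select_index_entries([("大", 0.9), ("犬", 0.3)], 0.5))
print(select_index_entries([("大", 0.4), ("犬", 0.3)], 0.5))
```

In the first call only the confident candidate is indexed; in the second, both low-accuracy candidates are kept and flagged so that a later search can tell them apart from reliable entries.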

[0090] According to the present invention, the shape features of a character image are stored in a database, together with the character codes of characters that may form a word with each recognition candidate character for that character image, so search accuracy can be improved.

[0091] According to the present invention, the characters that may form a word with each recognition candidate character are determined in consideration of linguistic information or character type, so search accuracy is improved.

[0092] According to the present invention, the distance between the shape feature of a character image extracted by the feature extraction means and the shape feature of a character constituting the keyword is calculated, and the search condition is deemed to be met when the distance satisfies a predetermined criterion, so the shape feature dictionary can be customized.

[0093] According to the present invention, a setting means is provided for setting whether or not the search means executes the shape feature matching process, so the priority of the processing types in the search process can be set according to the relative importance of search speed and search accuracy.

[0094] According to the present invention, documents containing recognition candidate characters that match the keyword are excluded from the targets of shape feature matching, so wasted searching during shape feature matching can be reduced.

[0095] According to the present invention, the shape features of the character images extracted by the feature extraction means are compared with the shape features of the characters constituting the keyword only when no recognition candidate character matching the keyword exists, so search speed can be increased.

[0096] According to the present invention, when the targets of shape feature matching for a keyword are identified, recognition candidate characters to which the identification code has been added are treated as wildcards, so a search using only the index database can be performed.
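The wildcard handling can be pictured with a small sketch. It assumes that each character position holds either a set of reliable recognition candidates or a marker standing for candidates that carry the identification code; the marker `WILDCARD`, the `slots` layout, and the function name `candidate_positions` are hypothetical.

```python
WILDCARD = object()  # hypothetical stand-in for entries carrying the identification code

def candidate_positions(keyword, slots):
    """Return the start positions at which keyword could occur.

    slots[i] is either a set of recognition candidate characters for
    character image i, or WILDCARD when only flagged candidates exist.
    """
    starts = []
    for start in range(len(slots) - len(keyword) + 1):
        if all(slots[start + k] is WILDCARD or ch in slots[start + k]
               for k, ch in enumerate(keyword)):
            starts.append(start)
    return starts

# Illustrative text whose third character image was recognized unreliably.
slots = [{"全"}, {"文", "丈"}, WILDCARD, {"索"}]
print(candidate_positions("検索", slots))
```

Because the flagged position accepts any character, the keyword is located from index information alone; whether each such position is then confirmed by shape feature matching is a separate decision.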

[0097] According to the present invention, whether to include a connected character, formed by combining two or more recognition candidate characters, among the index creation targets is determined in consideration of the probability that the connected character appears in the documents as a whole, so the size of the index can be reduced efficiently.
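One way to picture this decision is the sketch below, which forms every two-character combination of the candidates at adjacent positions and keeps only those whose appearance probability reaches a minimum value. The probability table, the threshold, and the name `select_connected` are assumptions for illustration; the passage does not specify how the probability is compared.

```python
def select_connected(candidates_a, candidates_b, chain_probability, min_probability):
    """Return the two-character connected characters worth indexing.

    candidates_a, candidates_b: recognition candidates at adjacent positions.
    chain_probability: mapping from a connected character to its appearance
    probability (unknown combinations count as 0.0).
    """
    return [a + b
            for a in candidates_a
            for b in candidates_b
            if chain_probability.get(a + b, 0.0) >= min_probability]

# Illustrative probabilities, not taken from any real dictionary.
chain_probability = {"検索": 0.02, "検素": 0.0001}
print(select_connected(["検", "桧"], ["索", "素"], chain_probability, 0.001))
```

Of the four possible combinations only one is indexed here, which is how the probability check keeps the index from growing with every candidate pairing.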

[0098] According to the present invention, when each recognition candidate character constituting a connected character is the only recognition candidate character for its character image, the appearance count of that connected character is incremented and the appearance probability is updated, so the probability that an important keyword fails to be retrieved can be reduced.

[0099] According to the present invention, the appearance count of a connected character that matches the keyword is incremented and the appearance probability is updated, so the priority of important characters is raised and the probability that important characters fail to be retrieved can be reduced.

[0100] According to the present invention, when a recognition candidate character output by the character recognition means is corrected, the appearance count of connected characters containing the corrected recognition candidate character is incremented and the appearance probability is updated, so the priority of important characters is raised and the probability that important characters fail to be retrieved can be reduced.

[0101] According to the present invention, the index is referred to in order to search for documents containing recognition candidate characters that match the keyword, and the shape features of the character images are also compared with the shape features of the characters constituting the keyword to search for documents that meet the search condition, so high-speed, high-precision full-text search can be performed.

[Brief description of the drawings]

FIG. 1 is a configuration diagram showing a full-text search device according to Embodiment 1 of the present invention.

FIG. 2 is a flowchart showing a document registration method.

FIG. 3 is an explanatory diagram showing an input image.

FIG. 4 is an explanatory diagram showing a recognition result of the character recognition means.

FIG. 5 is an explanatory diagram showing a result of narrowing down the recognition candidate characters.

FIG. 6 is an explanatory diagram showing an example of how shape features extracted from character images are held.

FIG. 7 is an explanatory diagram showing the start position and the like of a document number having ambiguous text.

FIG. 8 is an explanatory diagram showing a specific method of creating shape features.

FIG. 9 is an explanatory diagram showing an example of a two-character index.

FIG. 10 is an explanatory diagram showing a table that stores the positions and recognition ranks for a one-character index.

FIG. 11 is a flowchart showing a document search method.

FIG. 12 is a flowchart showing index matching.

FIG. 13 is a flowchart showing ambiguous text matching.

FIG. 14 is a flowchart showing index matching.

FIG. 15 is a configuration diagram showing a full-text search device according to Embodiment 3 of the present invention.

FIG. 16 is an explanatory diagram showing a character chain appearance probability dictionary.

FIG. 17 is a flowchart showing a document registration method.

FIG. 18 is an explanatory diagram showing the corrections made to a recognition result.

FIG. 19 is a flowchart showing a document registration method.

FIG. 20 is a flowchart showing a document search method.

FIG. 21 is a configuration diagram showing a conventional full-text search device (conventional example 1).

FIG. 22 is a configuration diagram showing a conventional full-text search device (conventional example 2).

[Explanation of symbols]

1 image input means, 2 character recognition means, 3 index creation means, 4 ambiguous text extraction means (feature extraction means), 5 search condition input means (input means, setting means), 6 search means, 7 output means, 8 character recognition dictionary, 9 shape feature dictionary, 10 ambiguous text database, 11 index database, 12 recognized character database, 13 recognition result correcting means, 14 character chain appearance probability dictionary updating means (appearance probability updating means), 15 character chain appearance probability dictionary, 16 index creation means.

Claims

[Claims]

1. A full-text search device comprising: character recognition means for identifying each character image included in an input image and outputting, for each character image, one or more recognition candidate characters together with the accuracy of each recognition candidate character; index creation means for creating an index indicating the correspondence between each recognition candidate character output by the character recognition means and a document; feature extraction means for extracting, when the character images included in the input image include a character image having no recognition candidate character whose accuracy exceeds a reference accuracy, the shape feature of that character image; input means for inputting a keyword as a document search condition; and search means for referring to the index to search for documents containing recognition candidate characters that match the keyword, and for comparing the shape features of the character images extracted by the feature extraction means with the shape features of the characters constituting the keyword to search for documents that meet the search condition.

2. The full-text search device according to claim 1, wherein the index creation means includes, among the index creation targets, connected characters formed by combining two or more recognition candidate characters.

3. The full-text search device according to claim 1, wherein the index creation means excludes from the index creation targets, among the recognition candidate characters output by the character recognition means, those whose accuracy is lower than the reference accuracy.

4. The full-text search device according to claim 3, wherein, even when the accuracy of a recognition candidate character output by the character recognition means is lower than the reference accuracy, if that recognition candidate character belongs to a character image having no recognition candidate character whose accuracy exceeds the reference accuracy, the index creation means includes the recognition candidate character among the index creation targets and adds to it an identification symbol that distinguishes it from other recognition candidate characters.

5. The full-text search device according to any one of claims 1 to 4, wherein the feature extraction means stores the shape features of a character image in a database and also stores in the database the character codes of characters that may form a word with each recognition candidate character for that character image.

6. The full-text search device according to claim 5, wherein the feature extraction means determines the characters that may form a word with each recognition candidate character in consideration of linguistic information or character type.

7. The full-text search device according to claim 1, wherein the search means calculates the distance between the shape feature of a character image extracted by the feature extraction means and the shape feature of a character constituting the keyword, and deems the search condition to be met when the distance satisfies a predetermined criterion.

8. The full-text search device according to claim 1, further comprising setting means for setting whether or not the search means executes the shape feature matching process.

9. The full-text search device according to claim 1, wherein the search means excludes documents containing recognition candidate characters that match the keyword from the targets of shape feature matching.

10. The full-text search device according to any one of claims 1 to 7, wherein the search means compares the shape features of the character images extracted by the feature extraction means with the shape features of the characters constituting the keyword only when no recognition candidate character matching the keyword exists.

11. The full-text search device according to claim 4, wherein, when identifying the targets of shape feature matching for the keyword, the search means treats recognition candidate characters to which the identification code has been added as wildcards.

12. The full-text search device according to claim 2, wherein the index creation means determines whether to include a connected character, formed by combining two or more recognition candidate characters, among the index creation targets in consideration of the probability that the connected character appears in the documents as a whole.

13. The full-text search device according to claim 12, further comprising appearance probability updating means for incrementing the appearance count of a connected character and updating its appearance probability when each recognition candidate character constituting the connected character is the only recognition candidate character for its character image.

14. The full-text search device according to claim 12, further comprising appearance probability updating means for incrementing the appearance count of a connected character that matches the keyword and updating its appearance probability.

15. The full-text search device according to claim 12, further comprising appearance probability updating means for incrementing, when a recognition candidate character output by the character recognition means is corrected, the appearance count of connected characters containing the corrected recognition candidate character and updating their appearance probability.

16. A full-text search method comprising: identifying each character image included in an input image and outputting, for each character image, one or more recognition candidate characters together with the accuracy of each recognition candidate character; creating an index indicating the correspondence between each recognition candidate character and a document; extracting the shape features of those character images included in the input image that have no recognition candidate character whose accuracy exceeds a reference accuracy; and, when a keyword is input as a document search condition, referring to the index to search for documents containing recognition candidate characters that match the keyword, while comparing the shape features of the character images with the shape features of the characters constituting the keyword to search for documents that meet the search condition.