JP3526821B2

JP3526821B2 - Document search device

Info

Publication number: JP3526821B2
Application number: JP2000254697A
Authority: JP
Inventors: 善彦松川; 太郎今川; 堅司近藤; 強司目片
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1999-08-25
Filing date: 2000-08-24
Publication date: 2004-05-17
Anticipated expiration: 2020-08-24
Also published as: JP2001134617A

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention

The present invention relates to a document search device and a recording medium for searching for a keyword in a recognition result obtained by performing character recognition on a document image.

[0002]

Description of the Related Art

Generally, when a paper document is stored in a document database, the document is read as image data, and the image data is converted by character recognition into a set of character codes (a character recognition result). The document is stored in the document database as this set of character codes. When a keyword is searched for in the document database, it is determined whether the keyword is included in the character recognition result. With commonly used character recognition, characters written in the original document (the paper document) are not always converted into the correct character codes. When such a recognition error occurs, the character represented by a character code can differ from the character written in the original document. Consequently, a search omission can occur when a keyword is searched for in the set of character codes stored in the document database. A search omission means that, although the original document contains a character string that matches the keyword, no matching character string is detected when the keyword is searched for in the character recognition result stored in the document database.

[0003] A known conventional technique for preventing such search omissions is, for example, the technique disclosed in Japanese Laid-Open Patent Publication No. 7-152774.

[0004] According to the conventional technique disclosed in Japanese Laid-Open Patent Publication No. 7-152774, expanded character strings are generated at search time using a list of similar characters. For each character in the keyword that is prone to misrecognition, this list gives multiple candidate characters prepared in advance. A character prone to misrecognition is, for example, a character for which another character of similar shape exists.

[0005] The conventional technique disclosed in Japanese Laid-Open Patent Publication No. 7-152774 is described with reference to FIGS. 24A and 24B.

[0006] FIG. 24A shows an example in which the characters "本" and "口" contained in the original document have been converted, through character recognition errors, into the character codes of the similarly shaped characters "木" and "区", respectively. A character recognition result is a set of character codes, but for the sake of explanation FIG. 24A shows each character code as the character it corresponds to. Although the original document contains the keyword "日本" (Japan), searching the character recognition result for the keyword "日本" results in a search omission.

[0007] FIG. 24B shows an example of the list of similar characters. Row 99-1 indicates that the character "本" is easily misrecognized as "木", "大", "太", or "才". Row 99-2 indicates that the character "口" is easily misrecognized as "□" (the rectangle symbol), "回", "円", or "々".

[0008] With the conventional technique disclosed in Japanese Laid-Open Patent Publication No. 7-152774, when the keyword "日本" is searched for, the expanded character strings "日木", "日大", "日太", and "日才" are generated using the list of similar characters shown in FIG. 24B. When the character recognition result is searched for the keyword "日本", each of the expanded character strings "日木", "日大", "日太", and "日才" is also used as a keyword. As a result, the portion "日木", where "日本" was misrecognized, can be found in the character codes.
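The expansion described in paragraphs [0004] to [0008] can be sketched as follows. This is only an illustrative sketch of the prior-art idea: the names `SIMILAR`, `expand_keyword`, and `search` are hypothetical, the table copies rows 99-1 and 99-2 of FIG. 24B, and the sketch assumes that every combination of substitutions is generated.

```python
from itertools import product

# Similar-character list modeled on rows 99-1 and 99-2 of FIG. 24B.
SIMILAR = {
    "本": ["木", "大", "太", "才"],
    "口": ["□", "回", "円", "々"],
}

def expand_keyword(keyword, similar=SIMILAR):
    """Return the expanded strings for keyword, excluding the keyword itself."""
    # For each keyword character: the character itself plus its look-alikes.
    choices = [[ch] + similar.get(ch, []) for ch in keyword]
    expanded = ["".join(chars) for chars in product(*choices)]
    return [s for s in expanded if s != keyword]

def search(ocr_text, keyword, similar=SIMILAR):
    """True if the keyword or any of its expanded strings occurs in ocr_text."""
    candidates = [keyword] + expand_keyword(keyword, similar)
    return any(c in ocr_text for c in candidates)
```

With this table, `expand_keyword("日本")` yields the four strings named in paragraph [0008], so a recognition result containing "日木" is found when searching for "日本".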

[0009]

Problems to Be Solved by the Invention

With the conventional technique disclosed in Japanese Laid-Open Patent Publication No. 7-152774, the search omission problem cannot be avoided when a character in the document is misrecognized as a character that is not in the list of similar characters. For example, suppose that the keyword "人口" (population) is searched for in the character recognition result shown in FIG. 24A. The character "区", as which the character "口" was misrecognized, is not included in the list of similar characters for "口" shown in row 99-2 of FIG. 24B. Therefore, no search using the keyword "人区" is performed, and a search omission occurs.

[0010] If the number of characters in the list of similar characters is increased to reduce the possibility of such search omissions, the number of expanded character strings grows, and the cost of the search (time and amount of computation) increases.
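The growth in cost can be made concrete with a small count. The sketch below assumes, as an illustration only, that an expanded string is generated for every combination of substitutions; the function name `expansion_count` is hypothetical.

```python
from math import prod

def expansion_count(keyword, similar):
    """Number of strings to search: the keyword plus all of its expansions.

    similar maps a character to its list of look-alike characters.
    Assumes every combination of substitutions is generated.
    """
    return prod(1 + len(similar.get(ch, [])) for ch in keyword)

# Two keyword characters with 4 look-alikes each: (1 + 4) * (1 + 4) strings.
small = {"本": list("木大太才"), "口": list("□回円々")}
# The same characters with 9 look-alikes each (placeholder entries).
large = {"本": ["?"] * 9, "口": ["?"] * 9}

count_small = expansion_count("本口", small)
count_large = expansion_count("本口", large)
```

Under this assumption the count is a product over the keyword characters, so enlarging every list multiplies the number of strings to search rather than adding to it.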

[0011] The present invention has been made in view of these problems. Its object is to provide a document search device that keeps the cost of a search (time and amount of computation) low and reduces search omissions caused by character recognition errors.

[0012]

Means for Solving the Problems

A document search device of the present invention searches for a keyword in a recognition result obtained by performing character recognition on an image of a document. The keyword includes at least one first character, and a character code and a character image are assigned to each of the at least one first character. The recognition result includes at least one second character, and a character code and a partial region of the image of the document are assigned to each of the at least one second character. The document search device comprises: first matching portion identifying means for determining, based on a comparison of the character codes, whether at least one first matching portion that matches the keyword exists in the recognition result and, if so, identifying the at least one first matching portion; first portion identifying means for determining whether at least one first portion satisfying a predetermined first condition exists in the part of the recognition result that remains after excluding the identified at least one first matching portion and, if so, identifying the at least one first portion; and second matching portion identifying means for determining, based on a comparison between a feature amount of the image of the partial region assigned to the second character included in the first portion and a feature amount of the character image of the first character included in the keyword, whether at least one second matching portion that matches the keyword exists in the identified at least one first portion and, if so, identifying the at least one second matching portion. The predetermined first condition is that the first portion is in the vicinity of a specific second character having a width smaller than a predetermined width. The above object is thereby achieved.

Another document search device of the present invention searches for a keyword in a recognition result obtained by performing character recognition on an image of a document. The keyword includes at least one first character, and a character code and a character image are assigned to each of the at least one first character. The recognition result includes at least one second character, and each of the at least one second character is assigned a character code, a partial region of the image of the document, and a reliability indicating the certainty of the character recognition obtained when the character recognition was performed. The document search device comprises: first matching portion identifying means for determining, based on a comparison of the character codes, whether at least one first matching portion that matches the keyword exists in the recognition result and, if so, identifying the at least one first matching portion; first portion identifying means for determining whether at least one first portion satisfying a predetermined first condition exists in the part of the recognition result that remains after excluding the identified at least one first matching portion and, if so, identifying the at least one first portion; and second matching portion identifying means for determining, based on a comparison between a feature amount of the image of the partial region assigned to the second character included in the first portion and a feature amount of the character image of the first character included in the keyword, whether at least one second matching portion that matches the keyword exists in the identified at least one first portion and, if so, identifying the at least one second matching portion. The predetermined first condition is that the first portion is in the vicinity of a specific second character whose assigned reliability is smaller than a predetermined threshold. The above object is thereby achieved.

The document search device may further comprise means for determining the image quality of the image of the document, and means for determining the predetermined threshold based on the determined image quality.

The second matching portion identifying means may comprise: first determining means for determining whether the character code of the second character included in the first portion matches the character code of a specific first character included in the keyword; non-matching character identifying means for identifying as a non-matching character, when the character code of the second character included in the first portion does not match the character code of the specific first character included in the keyword, one second character or two or more consecutive second characters that include at least the second character included in the first portion and that have the width closest to the width of the specific first character; and second determining means for determining that the specific first character matches the non-matching character when the distance between the feature amount of the image of the specific first character and the feature amount of the image of a region including the one or more partial regions assigned to the one second character or the two or more consecutive second characters included in the non-matching character is smaller than a predetermined value.

The document search device may further comprise calculating means for calculating a predetermined determination reference value from the at least one first matching portion, and detecting means for detecting, based on the determination reference value, a second matching portion that satisfies a predetermined second condition among the at least one second matching portion. The calculating means may calculate the determination reference value based on the feature amount of the image of at least one partial region assigned to the at least one second character included in the at least one first matching portion, and the second condition may include a condition that the distance between the feature amount of the image of at least one partial region assigned to the at least one second character included in the at least one second matching portion and the determination reference value is smaller than a predetermined value.
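The first condition can be illustrated with a short sketch. It is not the claimed implementation: the function name `candidate_indices` and its parameters are hypothetical, the two variants of the condition (narrow width, low reliability) are combined with a logical OR for brevity, and "vicinity" is taken to mean a fixed number of neighboring characters, which this passage does not define.

```python
def candidate_indices(widths, confidences, matched, min_width, min_conf, radius=2):
    """Indices of recognized characters worth a second, image-based comparison.

    widths[j], confidences[j]: width and reliability of recognized character j.
    matched: indices already covered by code-based (first) matching portions.
    radius: hypothetical definition of "vicinity", counted in characters.
    """
    n = len(widths)
    # Characters that are suspiciously narrow or were recognized with low reliability.
    suspects = [j for j in range(n)
                if widths[j] < min_width or confidences[j] < min_conf]
    near = set()
    for j in suspects:
        near.update(range(max(0, j - radius), min(n, j + radius + 1)))
    # Portions already matched by character code are excluded.
    return sorted(near - set(matched))
```

Restricting the image-based comparison to these indices is what keeps the extra cost bounded: feature amounts are compared only near characters that look like segmentation fragments or doubtful recognitions.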

[0013]

[0014]

[0015]

[0016]

[0017]

[0018]

[0019]

[0020]

[0021]

[0022]

[0023]

Embodiments of the Invention

First, a document filing system 210 for storing and searching documents is described.

[0024] FIG. 1 shows the configuration of the document filing system 210. The document filing system 210 includes an image input device 201, an OCR (optical character reader) device 202, a document database 203, a document search device 204, and a display device 205.

[0025] The image input device 201 converts an original document (for example, a paper document) into document image data D_i. The image input device 201 is, for example, a scanner or a digital camera.

[0026] The OCR device 202 performs character recognition on the document image data D_i. Known OCR techniques can be used for the OCR device 202. The result of character recognition by the OCR device 202 is output from the OCR device 202 as a character recognition result D_c.

[0027] The document database 203 stores document data D_d. The document data D_d includes the character recognition result D_c and the document image data D_i for the same document.

[0028] The document search device 204 searches for a keyword K_w in the character recognition result D_c included in the document data D_d stored in the document database 203. When searching the character recognition result D_c for the keyword K_w, the document search device 204 of the present invention makes use of the document image data D_i included in the document data D_d.

[0029] When the keyword K_w is detected in the character recognition result D_c, the document search device 204 outputs a search result RD_t to the display device 205.

[0030] The display device 205 displays the search result based on the search result RD_t. For example, the display device 205 shows the document image data D_i stored in the document database on a display and highlights the region corresponding to the keyword K_w within the displayed document image data D_i. The highlighting can be, for example, colored display or inverted display. The region corresponding to the keyword K_w is determined based on the search result RD_t.

[0031] Next, the data structures of the document image data D_i and the character recognition result D_c are described.

[0032] FIG. 2A shows an example of the document image data D_i. The document image data D_i is, for example, image data in bitmap format.

[0033] FIG. 2B shows the data structure of the character recognition result D_c obtained by performing character recognition on the document image data D_i shown in FIG. 2A. The character recognition result D_c is obtained as a set of characters D_c[j] (0 ≤ j ≤ N_d − 1), where N_d is the number of characters included in the character recognition result D_c. In this specification, a number written in square brackets [] is a subscript (index). Each character D_c[j] is assigned a character code C_c[j], character coordinates (x_1[j], y_1[j]) and (x_2[j], y_2[j]), and a reliability C_r[j].
[0034] The character code C_c[j] is the code determined by the OCR device 202. The character code C_c[j] is, for example, a code represented by 2 bytes. In the example shown in FIG. 2B, however, the character corresponding to each character code is shown in place of the character code for the sake of explanation.

[0035] The character coordinates indicate the partial region in the document image data D_i that the OCR device 202 recognized as one character. This partial region is represented by, for example, a rectangle. The character coordinates (x_1[j], y_1[j]) are the coordinates of the upper-left vertex of the rectangle, and the character coordinates (x_2[j], y_2[j]) are the coordinates of the lower-right vertex of the rectangle. Any coordinate system can be used for the character coordinates.

[0036] The reliability C_r[j] can be defined, for example, as the likelihood, accuracy, or certainty with which the OCR device 202 performed the character recognition.

[0037] The reliability C_r[j] indicates how likely it is that the result of character recognition is correct. In the example shown in FIG. 2B, the reliability is expressed as a number between 0 and 1, and the closer the reliability is to 1, the more likely it is that the result of character recognition is correct.
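The per-character record described in paragraphs [0033] to [0037] could be modeled as below. This is a minimal sketch: the class name `RecognizedChar`, its field names, and all coordinate and reliability values are hypothetical, and the character code is stored as the character itself, following the presentation used in FIG. 2B.

```python
from dataclasses import dataclass

@dataclass
class RecognizedChar:
    """One entry D_c[j] of the character recognition result."""
    code: str          # C_c[j], held here as the character it represents
    x1: int            # upper-left vertex of the character rectangle
    y1: int
    x2: int            # lower-right vertex of the character rectangle
    y2: int
    reliability: float # C_r[j], between 0 and 1

    @property
    def width(self) -> int:
        return self.x2 - self.x1

    @property
    def height(self) -> int:
        return self.y2 - self.y1

# Hypothetical values; the coordinates in FIG. 2B are not reproduced here.
D_c = [
    RecognizedChar("琵", 10, 10, 40, 40, 0.95),
    RecognizedChar("琶", 44, 10, 74, 40, 0.91),
]
N_d = len(D_c)
```

Keeping the rectangle with each character is what later allows the search to go back to the document image for the region a character was cut from.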

[0038] The characters recognized by the OCR device 202 do not necessarily correspond one-to-one with the characters written in the original document, because a segmentation error may have occurred during character recognition by the OCR device 202. A segmentation error occurs, for example, when one character in the original document is recognized as multiple characters or, conversely, when multiple characters in the original document are recognized as one character. In the example shown in FIGS. 2A and 2B, the character "湖" in the original document is recognized as the three characters "三", "古", and "月" because of a segmentation error by the OCR device 202. The character "湖" in the original document therefore corresponds to the three characters D_c[2] to D_c[4] in the character recognition result D_c. Thus, a character in the character recognition result D_c may correspond to one character written in the original document, to a fragment of a character, or to a combination of fragments of multiple characters.

[0039] The order of the characters in the character recognition result D_c is the same as the order of the characters in the original document (for example, from the character on the left toward the character on the right).

[0040] In this specification, "character" includes not only characters of a particular language, such as kanji or letters of the alphabet, but also any symbol to which a character code is assigned, such as numerals and signs.

[0041] "Two or more consecutive characters" means two or more characters D_c[j] whose subscripts j are consecutive in the character recognition result D_c.

[0042] The "original document" is not limited to a paper document. The "original document" can be any object on which characters are written.

[0043] The document search device 204 of the present invention searches for the keyword K_w in the character recognition result D_c having the data structure shown in FIG. 2B.

[0044] FIG. 2C shows the data structure of the keyword K_w. In the example shown in FIG. 2C, the keyword K_w is the character string "琵琶湖畔" (the shore of Lake Biwa) and consists of four characters. The number of characters in the keyword K_w is not limited to four, however; the keyword K_w can include any number of characters equal to or greater than one. Each character included in the keyword K_w is called a keyword character. A keyword character is written K_w[i] using the subscript i, where 0 ≤ i ≤ 3 in this example. In general, if the number of characters included in the keyword K_w is N_k, the keyword K_w is represented as a character string consisting of the N_k keyword characters K_w[0] to K_w[N_k − 1].

[0045] A character code is assigned to each of the keyword characters K_w[0] to K_w[N_k − 1]. For example, the character code "0x487c" (JIS code) is assigned to the keyword character K_w[0] (= "琵").

[0046] The result of the search by the document search device 204 is output to the display device 205 as the search result RD_t.

[0047] FIG. 2D shows the data structure of the search result RD_t. The search result RD_t shown in FIG. 2D is the result of searching the character recognition result D_c shown in FIG. 2B for the keyword K_w (= "琵琶湖畔") shown in FIG. 2C. The search result RD_t is a set of N_r pieces of search location data RD_t[t] (0 ≤ t ≤ N_r − 1). Each piece of search location data RD_t[t] indicates a portion (a matching portion) of the searched character recognition result D_c (FIG. 2B) that matched the keyword K_w. N_r is the number of matching portions.

[0048] The search location data RD_t[0] is a list consisting of list elements 2241, 2242, 2243, and 2244. The length of the list is equal to the length of the keyword K_w (the number of keyword characters included in the keyword K_w, which is four in this case). Each of the list elements 2241 to 2244 indicates the one character, or the two or more consecutive characters, in the character recognition result D_c that correspond to one of the four keyword characters K_w[0] to K_w[3] included in the keyword K_w. For example, list element 2241 indicates that the character D_c[0] corresponds to the keyword character K_w[0] ("琵"). List element 2242 indicates that the character D_c[1] corresponds to the keyword character K_w[1] ("琶"). List element 2243 indicates that the three consecutive characters D_c[2] to D_c[4] correspond to the keyword character K_w[2] ("湖"). List element 2244 indicates that the character D_c[5] corresponds to the keyword character K_w[3] ("畔"). The character code of each of the characters D_c[2] to D_c[4] does not match the character code of the keyword character K_w[2] ("湖"), because the characters D_c[2] to D_c[4] were recognized by the OCR device 202 as the characters "三", "古", and "月", respectively. The example shown in FIG. 2D is one in which the document search device 204 combined the three consecutive characters D_c[2] to D_c[4], recognized as "三", "古", and "月", into one group and determined that this group corresponds to the keyword character K_w[2] ("湖").
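The search location data of FIG. 2D can be written down compactly as shown below. This is a sketch under assumptions: encoding each list element as an inclusive `(first, last)` index range is a hypothetical choice, the rectangles in `boxes` are made-up values, and `matched_region` assumes the match lies on a single line of text.

```python
K_w = "琵琶湖畔"
ocr_text = "琵琶三古月畔"   # character codes of D_c[0] to D_c[5], shown as characters

# RD_t[0]: one entry per keyword character (list elements 2241 to 2244),
# each an inclusive (first, last) range of indices into D_c.
RD_t0 = [(0, 0), (1, 1), (2, 4), (5, 5)]

# Recognized characters grouped per keyword character.
groups = [ocr_text[first:last + 1] for first, last in RD_t0]

# Hypothetical character rectangles (x1, y1, x2, y2) for D_c[0] to D_c[5].
boxes = [(0, 0, 30, 30), (32, 0, 62, 30), (64, 0, 72, 30),
         (74, 0, 84, 30), (86, 0, 94, 30), (96, 0, 126, 30)]

def matched_region(location, boxes):
    """Rectangle enclosing every character covered by one search location."""
    first, last = location[0][0], location[-1][1]
    covered = boxes[first:last + 1]
    return (min(b[0] for b in covered), min(b[1] for b in covered),
            max(b[2] for b in covered), max(b[3] for b in covered))

region = matched_region(RD_t0, boxes)
```

Here `groups` pairs "湖" with the fragment run "三古月", and `region` is the kind of rectangle a display device could highlight for the whole match.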

[0049] Each piece of search location data RD_t[t] (1 ≤ t ≤ N_r − 1) is, like RD_t[0] described above, a list of length N_k, where N_k is the length of the keyword K_w.

[0050] The OCR device 202 performs character recognition for each processing unit. A processing unit may be, for example, one page of a document or one paragraph. Such a processing unit is called a "character block".

[0051] Within a character block, the font and the character size are usually uniform. Executing the search with the character block as the unit is therefore preferable for improving search accuracy.

[0052] The characters D_c[j] included in the character recognition result D_c may be grouped by character block. Such a group is called character block data D_t and represents the result of performing character recognition within one character block.

[0053] FIG. 3 shows the structure of the character block data D_t.

[0054] The character block data D_t includes character block coordinates 2201, a character count 2202, direction information 2203, and a set of characters D_c[j] (D_c[0] to D_c[8] in the example shown in FIG. 3).

【００５５】文字ブロック座標２２０１は、文字ブロッ
クに外接する矩形の文書画像データＤ_iにおける座標値
を示す。The character block coordinates 2201 indicate coordinate values in the rectangular document image data D _i circumscribing the character block.

【００５６】文字数２２０２は、文字ブロックデータＤ
_tに含まれる文字の数を示す。The number of characters 2202 is the character block data D
Indicates the number of characters contained in _t .

【００５７】 The direction information 2203 indicates the direction in which the characters in the character block are written (vertical writing or horizontal writing). For example, a value of 1 for the direction information 2203 indicates vertical writing, and a value of 0 indicates horizontal writing.

【００５８】 The character block data D_t may further include information on the font used in the character block.
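The structure of the character block data D_t described above can be pictured as a small record type. The following is a minimal sketch only: the class and field names, the (left, top, right, bottom) rectangle layout, and the per-character fields are assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed rectangle layout: (left, top, right, bottom) in pixels of D_i.
Rect = Tuple[int, int, int, int]


@dataclass
class RecognizedChar:
    """One character D_c[j]; the field set here is an assumption."""
    code: str            # character code assigned by character recognition
    rect: Rect           # character coordinates in the document image
    reliability: float   # recognition reliability


@dataclass
class CharacterBlockData:
    """Sketch of character block data D_t (FIG. 3)."""
    block_rect: Rect                     # character block coordinates 2201
    direction: int                       # direction information 2203
    chars: List[RecognizedChar] = field(default_factory=list)
    font: Optional[str] = None           # optional font information

    @property
    def char_count(self) -> int:
        """Character count 2202, derived from the stored characters."""
        return len(self.chars)

    def is_vertical(self) -> bool:
        """1 means vertical writing, 0 means horizontal writing."""
        return self.direction == 1
```

Deriving the character count from the list, rather than storing it separately, is a design choice of this sketch that keeps the two values from drifting apart.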

【００５９】 Embodiments of the present invention are described below with reference to the drawings.

【００６０】 (Embodiment 1) FIG. 4 shows the configuration of a document search device 1 according to Embodiment 1 of the present invention. The document search device 1 includes text search means 101, character specifying means 102, character shape search means 103, and character width estimation means 104.

【００６１】 The document search device 1 can be used as the document search device 204 shown in FIG. 1. In this case, the document search device 1 searches the document data D_d stored in the document database 203 for a keyword K_w. The document data D_d includes the document image data D_i and the character recognition result D_c.

【００６２】 In the following description, the document is assumed to be written horizontally (that is, the characters in the document are arranged from left to right). The document search procedure described below can also be applied to vertically written documents by interchanging the character width and the character height. Whether a document is written vertically or horizontally can be determined, for example, by referring to the direction information 2203 (FIG. 3) included in the character block data D_t.

【００６３】 The keyword K_w is input to the document search device 1 from outside, together with an instruction to search specific document data D_d for the keyword K_w. The keyword K_w and the search instruction are input by input means (not shown) such as a keyboard.

【００６４】 The text search means 101 compares the character code assigned to a keyword character (first character) included in the keyword K_w with the character code assigned to a character (second character) included in the character recognition result D_c, and determines that the second character matches the first character if the two character codes match. Let N_k be the length of the keyword K_w. When each of N_k consecutive characters in the character recognition result D_c matches the corresponding one of the N_k keyword characters in the keyword K_w, the text search means 101 identifies those N_k consecutive characters as a matching portion (first matching portion) that matches the keyword K_w. Any number (one or more) of such first matching portions may exist in the character recognition result D_c. The text search means 101 outputs the one or more first matching portions as a first search result RD_t1. Let N_r1 be the number of first matching portions included in the first search result RD_t1. The first search result RD_t1 includes detection location data RD_t1[t], where 0 ≤ t ≤ N_r1 − 1. Each item of detection location data RD_t1[t] represents one of the one or more first matching portions identified by the text search means 101, and is a list containing the same number of list elements as there are keyword characters in the keyword K_w. Each list element represents a character whose character code matches the character code of the corresponding keyword character.

【００６５】 In this way, the text search means 101 functions as first matching portion specifying means that determines, based on the comparison of character codes, whether at least one matching portion (first matching portion) that matches the keyword exists in the character recognition result D_c and, if so, identifies the at least one first matching portion.
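The character-code search performed by the text search means 101 can be sketched as a plain sliding-window comparison. This is a minimal sketch under stated assumptions: the function name `text_search` is hypothetical, the character recognition result D_c is represented as a list of one-character strings, and each item of detection location data RD_t1[t] is represented as a list of N_k indices into D_c.

```python
from typing import List


def text_search(recognized_codes: List[str], keyword: str) -> List[List[int]]:
    """Return RD_t1 as a list of index lists, one per first matching portion.

    Each inner list has N_k elements; element i is the index in D_c of the
    character whose code equals that of keyword character K_w[i].
    """
    n_k = len(keyword)
    if n_k == 0:
        return []  # an empty keyword matches nothing in this sketch
    results: List[List[int]] = []
    for start in range(len(recognized_codes) - n_k + 1):
        # All N_k consecutive characters must match the keyword in order.
        if all(recognized_codes[start + i] == keyword[i] for i in range(n_k)):
            results.append(list(range(start, start + n_k)))
    return results
```

With the recognition result "少子化の問題と少子イヒ" and the keyword "少子化", only the first occurrence is found; the second, whose last character was split into "イ" and "ヒ", is missed by this kind of search.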

【００６６】 If the character recognition by the OCR device 202 (FIG. 1) contained no errors, the search by the text search means 101 alone would suffice to retrieve the keyword from the recognition result. However, as already mentioned, when the character recognition in the OCR device 202 (FIG. 1) contains errors, a search by the text search means 101 based only on the comparison of character codes may miss occurrences. That is, the portion of the character recognition result D_c that remains after removing the matching portions (first matching portions) identified by the text search means 101 may still contain a portion that matches the keyword K_w.

【００６７】 The character specifying means 102 determines whether one or more characters satisfying a predetermined condition exist in the portion of the character recognition result D_c that remains after removing the one or more matching portions (first matching portions) identified by the text search means 101. If such characters exist, the character specifying means 102 identifies the one or more characters satisfying the predetermined condition as a candidate portion SD_c (first portion). Here, the "predetermined condition" includes, for example, condition (1) below.

【００６８】 (1) "The character exists, in the character recognition result D_c, in the vicinity of a specific character having a width smaller than a predetermined width."
Alternatively, the "predetermined condition" may include condition (2) below instead of, or in addition to, condition (1).

【００６９】 (2) "The character exists, in the character recognition result D_c, in the vicinity of a specific character having a reliability lower than a predetermined reliability."
In general, the smaller the width of a character included in the character recognition result D_c, the more likely it is that a character segmentation error occurred during character recognition. Likewise, the lower the reliability of a character included in the character recognition result D_c, the more likely it is that a recognition error occurred. Accordingly, a character recognition error is likely to have occurred in the vicinity of these specific characters (that is, characters having a small width or a low reliability). The specific meaning of "vicinity" is described later with reference to FIG. 6.

【００７０】 When no first matching portion exists in the character recognition result D_c, the character specifying means 102 determines whether one or more characters satisfying the predetermined condition exist in the character recognition result D_c as a whole. If such characters exist, the character specifying means 102 identifies the one or more characters satisfying the predetermined condition as a candidate portion SD_c (first portion).

【００７１】 In this way, the character specifying means 102 functions as first portion specifying means that determines whether at least one candidate portion SD_c (first portion) satisfying the predetermined condition exists in the portion of the character recognition result D_c that remains after removing the at least one first matching portion identified by the text search means 101 and, if so, identifies the at least one first portion.

【００７２】 When at least one candidate portion SD_c (first portion) has been identified, the character shape search means 103 determines whether a matching portion (second matching portion) that matches the keyword K_w exists in the candidate portion SD_c. The determination is based on a comparison between the feature amount of the image of the partial region assigned to a character (second character) included in the candidate portion SD_c and the feature amount of the image of a keyword character (first character) included in the keyword K_w. The detailed configuration and operation of the character shape search means 103 are described later with reference to FIG. 7.

【００７３】 Any number (one or more) of such matching portions (second matching portions) may exist in the candidate portion SD_c. The character shape search means 103 outputs the one or more second matching portions as a second search result RD_t2. Let N_r2 be the number of second matching portions included in the second search result RD_t2. The second search result RD_t2 includes detection location data RD_t2[t], where 0 ≤ t ≤ N_r2 − 1. Each item of detection location data RD_t2[t] represents one of the one or more second matching portions identified by the character shape search means 103, and is a list containing the same number of list elements as there are keyword characters in the keyword K_w. Each list element is one character, or two or more consecutive characters, whose image feature amount lies at a distance no greater than a predetermined threshold Thd_1 from the feature amount of the image of the corresponding keyword character.

【００７４】 The character width estimation means 104 estimates the width K_ww[i] (0 ≤ i ≤ N_k − 1) of each keyword character included in the keyword K_w. The keyword character widths K_ww[i] are used by the character shape search means 103. The character width estimation means 104 also estimates the width K_ww of the keyword K_w as a whole. The keyword width K_ww is used by the character specifying means 102.

【００７５】 The search result RD_t1 from the text search means 101 and the search result RD_t2 from the character shape search means 103 are output from the document search device 1 as the result of searching the character recognition result D_c for the keyword K_w.

【００７６】 FIG. 5 shows an example in which the character width estimation means 104 estimates the character width K_ww[i] of each character K_w[i] included in the keyword K_w.

【００７７】 In the example shown in FIG. 5, the keyword K_w is assumed to be "少子化" ("declining birthrate"). The following describes how the character width of the character "化" (= K_w[2]) included in the keyword K_w is estimated.

【００７８】 The character width estimation means 104 finds the character having the greatest height among the characters included in the character recognition result D_c, and takes the height of that character as the character height estimate a. The height of a character included in the character recognition result D_c can be calculated from the character coordinates assigned to the character.

【００７９】 Alternatively, the height of the tallest character among the characters in the character block data D_t may be used as the character height estimate a.

【００８０】 Alternatively, the character height estimate a may be the mode, the mean, or the median of the character heights instead of their maximum. The unit of character width and character height is, for example, the pixel.

【００８１】 Using the standard height b and the standard width c of the character "化", the character width estimation means 104 calculates the width K_ww[2] of the character "化" (= K_w[2]) by Equation 1.
The width K _ww [2] of K _w [2]) is calculated by ( _Equation 1).

【００８２】
[Equation 1] K_ww[2] = a · c / b

The standard height b and the standard width c of the character "化" are obtained by generating an image of the character "化" (character image KC_i[2]) using a font held by the document search device 1 and measuring the height and width of the character image KC_i[2]. Alternatively, the ratio b/c of the standard height b to the standard width c may be determined in advance for every character. When the character block data D_t has information on the font used in the character block, that same font may be used to generate the image of the character "化".
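Equation 1 scales the standard width of a keyword character by the ratio of the estimated character height a to the standard height b. The sketch below applies it to every keyword character; the function name `estimate_char_widths` and the `standard_size` mapping (character to a (b, c) pair, for example measured from a rendered font glyph) are assumptions for illustration, and the maximum is used for a as in the first variant described above.

```python
from typing import Dict, List, Sequence, Tuple


def estimate_char_widths(
    keyword: str,
    recognized_heights: Sequence[float],
    standard_size: Dict[str, Tuple[float, float]],
) -> List[float]:
    """Return K_ww[i] = a * c / b for each keyword character K_w[i].

    recognized_heights: heights (pixels) of the characters in D_c or D_t.
    standard_size: hypothetical table mapping a character to (b, c),
        its standard height and standard width.
    """
    a = max(recognized_heights)  # character height estimate a (maximum)
    widths: List[float] = []
    for ch in keyword:
        b, c = standard_size[ch]
        widths.append(a * c / b)
    return widths
```

For example, with recognized heights of 30, 32 and 28 pixels and a glyph for "化" measured at 20 pixels high by 15 pixels wide, the estimate for "化" is 32 × 15 / 20 = 24 pixels.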

【００８３】 FIG. 6 shows an example of the process of identifying a candidate portion SD_c that satisfies the predetermined condition. This process is executed by the character specifying means 102.

【００８４】 Assume that the keyword K_w is "少子化" and that the character recognition result D_c contains the character string "...少子イヒの問題...". In FIG. 6, for the purpose of explanation, each character in the character recognition result D_c is shown as the character corresponding to its assigned character code. The "predetermined condition" here is assumed to be condition (2) described above.

【００８５】 Further assume that, among the reliabilities assigned to the characters in the character recognition result D_c, the reliabilities of the characters "イ" and "ヒ" are lower than a predetermined threshold Thr, and the reliabilities of all other characters are higher than the threshold Thr.

【００８６】 The character width estimation means 104 obtains the width K_ww of the keyword K_w. The keyword width K_ww is obtained as the sum of the widths of the individual characters included in the keyword K_w.

【００８７】 To identify the candidate portion SD_c, the character specifying means 102 first obtains a range A (the "少子イ" portion), whose end point (right end) is the character "イ" with a reliability below the threshold and whose width equals K_ww, and a range B (the "イヒの問" portion), whose start point (left end) is "イ" and whose width equals K_ww. The "vicinity" of a specific character thus means the ranges of width K_ww set to the left and to the right of the specific character (here, the character "イ").

【００８８】 If range B contains a further character whose reliability is lower than the threshold Thr ("ヒ"), a range C is obtained whose start point is that character "ヒ" and whose width equals K_ww. The characters included in any of ranges A, B, and C are identified as the candidate portion SD_c.
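The range construction of FIG. 6 can be sketched as follows for horizontal writing and condition (2). Several points are assumptions of this sketch rather than statements of the patent: the names `Char` and `find_candidate_indices` are hypothetical, a character counts as "included" in a range only if its horizontal extent lies entirely inside it, and both a left-hand and a right-hand range are built for every low-reliability character (for the example in the text this yields the same characters as ranges A, B, and C). The input is assumed to have had any first matching portions removed already.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    code: str
    left: int          # x coordinate of the left edge (pixels)
    right: int         # x coordinate of the right edge (pixels)
    reliability: float


def find_candidate_indices(chars: List[Char], keyword_width: float,
                           thr: float) -> List[int]:
    """Return sorted indices of the characters forming candidate portion SD_c."""
    selected = set()
    for c in chars:
        if c.reliability >= thr:
            continue  # only low-reliability characters anchor ranges
        ranges = [
            (c.right - keyword_width, c.right),  # ends at this character
            (c.left, c.left + keyword_width),    # starts at this character
        ]
        for lo, hi in ranges:
            for idx, other in enumerate(chars):
                if lo <= other.left and other.right <= hi:
                    selected.add(idx)
    return sorted(selected)
```

Because the ranges are anchored on every low-reliability character, a run of fragments such as "イ" and "ヒ" naturally extends the candidate portion to the right, which is the role range C plays in the description.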

【００８９】 FIG. 7 shows the detailed configuration of the character shape search means 103.

【００９０】 The character shape search means 103 includes character image extraction means 301, a character image table 302, shape matching means 303, and matching control means 304.

【００９１】 The character image extraction means 301 identifies, as "non-matching characters", the one character or the two or more consecutive characters whose width is closest to the width of a specific character in the keyword K_w, and extracts the region of the document image (that is, the character image) corresponding to those non-matching characters.

【００９２】 The character image table 302 stores in advance the images of the characters included in the keyword K_w. These character images are, for example, bitmap fonts.

【００９３】 The shape matching means 303 compares the feature amount of the image of the region extracted by the character image extraction means 301 with the feature amount of the image, stored in the character image table 302, of a character included in the keyword K_w, and determines whether the two images are similar.

【００９４】 The matching control means 304 controls the operation of the character shape search means 103.

【００９５】 The operation of the character shape search means 103 is shown in steps S101 to S103 below. This processing is executed by the matching control means 304. Here, the candidate portion SD_c is a set of characters SD_c[j]. Each character SD_c[j] has the same data structure as the character described with reference to FIG. 2B. In steps S101 to S103, the variable j denotes the index of the character SD_c[j], and the variable i denotes the index of the keyword character K_w[i].

【００９６】 A plurality of candidate portions SD_c may exist in the character recognition result D_c. In that case, the processing of steps S101 to S103 below is performed for each candidate portion SD_c.

【００９７】 Step S101: The variable start_j is assigned to the variable j, and the value 0 is assigned to the variable i. The variable start_j is the index of the character located at the start (left end) of the candidate portion SD_c. Also in step S101, a list whose length equals the length N_k of the keyword K_w is prepared as detection location data.

【００９８】 Step S102: From the candidate portion SD_c, the one character or the two or more consecutive characters to be associated with the keyword character K_w[i] are identified as "non-matching characters", and an image C_i is extracted. This processing is executed by the character image extraction means 301 and is described in detail later with reference to FIG. 8. The image C_i is the image of the partial region assigned to the one or more consecutive characters identified as non-matching characters. In addition, the image KC_i of the keyword character K_w[i] is obtained from the character image table 302.

【００９９】 Step S103: It is determined whether the keyword character K_w[i] matches the non-matching characters identified in step S102. To make this determination, the image C_i is first matched against the image KC_i using the shape matching means 303. The images are matched by comparing their feature amounts. A Euclidean distance between the feature amount of the image C_i and the feature amount of the image KC_i that is smaller than the predetermined threshold Thd_1 indicates that the image C_i and the image KC_i are similar. If the image C_i and the image KC_i are similar, the non-matching characters identified in step S102 are determined to match the keyword character K_w[i].

【０１００】 When the keyword character K_w[i] is determined to match the non-matching characters identified in step S102, the character SD_c[j] is registered at the i-th position of the detection location data, the variable i is incremented by 1, the variable next_j is assigned to the variable j, and the processing returns to step S102. This means that the next character of the keyword is searched for starting from the part immediately to the right of the characters in the candidate portion SD_c that have just been matched. The variable next_j denotes the index of the character immediately to the right of the characters in the candidate portion SD_c that have just been matched. The value of the variable next_j is determined by the character image extraction means 301 in step S102.

【０１０１】 When the keyword character K_w[i] is determined not to match the non-matching characters identified in step S102, the value 0 is assigned to the variable i, the value start_j + 1 is assigned to the variable j, the variable j is assigned to the variable start_j, and the processing returns to step S102. This means that the character of interest in the candidate portion SD_c is shifted one character to the right and the search starts again from the first keyword character K_w[0].

【０１０２】 If, through the processing shown in steps S101 to S103 above, the images KC_i of all the keyword characters included in the keyword K_w are similar to the corresponding images C_i, the characters in the candidate portion SD_c corresponding to the respective keyword characters of the keyword K_w are identified as a matching portion (second matching portion).

【０１０３】 In this way, the character shape search means 103 functions as second matching portion specifying means that determines whether at least one second matching portion that matches the keyword K_w exists in the at least one candidate portion SD_c (first portion) identified by the character specifying means 102 and, if so, identifies the at least one second matching portion. The determination is based on a comparison between the feature amount of the image of the partial region assigned to a character (second character) included in the candidate portion SD_c (first portion) and the feature amount of the image of a keyword character (first character) included in the keyword.
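The control flow of steps S101 to S103 can be sketched as a loop over one candidate portion SD_c. This is a sketch under stated assumptions, not the patent's implementation: the function name `shape_search` is hypothetical; character image extraction (step S102) and shape matching (step S103) are passed in as the callables `extract` and `is_similar`; each list element records the indices of all characters merged for one keyword character; and the handling of the end of the candidate portion, and the restart after a complete match, are choices of this sketch that the description does not spell out.

```python
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")


def shape_search(
    num_chars: int,
    keyword_len: int,
    extract: Callable[[int, int], Tuple[Image, int]],
    is_similar: Callable[[Image, int], bool],
) -> List[List[List[int]]]:
    """Return detection location data for one candidate portion SD_c.

    extract(j, i) -> (image C_i, next_j): step S102; must return next_j > j.
    is_similar(image, i) -> bool: step S103, comparison against KC_i.
    """
    results: List[List[List[int]]] = []
    if keyword_len == 0:
        return results
    # Step S101: start at the left end, first keyword character.
    start_j = 0
    j, i = start_j, 0
    detection: List[List[int]] = [[] for _ in range(keyword_len)]
    while j < num_chars:
        image, next_j = extract(j, i)                 # step S102
        if is_similar(image, i):                      # step S103: match
            detection[i] = list(range(j, next_j))
            i, j = i + 1, next_j
            if i == keyword_len:                      # whole keyword matched
                results.append(detection)
                start_j, i = j, 0                     # assumed: continue after it
                detection = [[] for _ in range(keyword_len)]
        else:                                         # step S103: no match
            start_j += 1                              # shift one character right
            j, i = start_j, 0
            detection = [[] for _ in range(keyword_len)]
    return results
```

Passing the two operations in as callables keeps the control loop independent of how images and feature amounts are represented.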

【０１０４】 FIG. 8 shows an example in which the character image extraction means 301 identifies the characters in the candidate portion SD_c that correspond to a specific character in the keyword K_w.

【０１０５】 Assume that the keyword K_w is "少子化" and that the document image data D_i contains a part in which "...少子化の問題..." is written. Assume also that matching has been completed up to "少子", so that the character "化" (= K_w[2]) in the keyword K_w is matched next. Because of a segmentation error in the character recognition by the OCR device 202, the "化" in the document image has been split into the character fragment "イ" and the character fragment "ヒ". These fragments correspond to the characters SD_c[j] and SD_c[j+1] in the candidate portion SD_c, respectively. Rectangles 1310, 1311, and 1312 are the partial regions in the document image data D_i assigned to the characters SD_c[j], SD_c[j+1], and SD_c[j+2], respectively.

【０１０６】 Let w_1 be the width of rectangle 1310, w_2 the width of the region enclosing rectangles 1310 and 1311 (rectangle 1313), and w_3 the width of the region enclosing rectangles 1310 through 1312.

【０１０７】 When the character "化" in the keyword K_w is matched, the character width estimate K_ww[2] for the character "化" (= K_w[2]) is compared with each of the widths w_1 to w_3. If, among the widths w_1 to w_3, the one closest to the character width estimate K_ww[2] is w_2, the characters SD_c[j] and SD_c[j+1] are identified as the non-matching characters.

【０１０８】 The image within rectangle 1313 is then extracted from the document image data D_i as the character image C_i. Furthermore, the index "j+2" of the character following (immediately to the right of) the two characters identified as non-matching characters (the characters SD_c[j] and SD_c[j+1]) is determined as the value of the variable next_j.
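The width comparison of FIG. 8 can be sketched as choosing how many consecutive characters to merge so that the merged width is closest to the estimated keyword character width. The function name `choose_span` is hypothetical, characters are reduced to their horizontal extents (left, right), and the limit of three merged characters mirrors the widths w_1 to w_3 of the example; the patent does not state a general limit, so `max_merge` is an assumption.

```python
from typing import List, Tuple


def choose_span(rects: List[Tuple[int, int]], j: int, target_width: float,
                max_merge: int = 3) -> Tuple[int, int, int]:
    """Pick the run of characters starting at SD_c[j] whose merged width
    is closest to target_width (the estimate K_ww[i]).

    rects: (left, right) x extents of the characters in SD_c.
    Returns (left, right, next_j): the horizontal extent of the region to
    cut out as image C_i, and the index of the character that follows.
    """
    best = None
    for k in range(1, max_merge + 1):
        if j + k > len(rects):
            break
        left = rects[j][0]
        right = rects[j + k - 1][1]
        diff = abs((right - left) - target_width)
        # Strict '<' keeps the shorter run when two widths are equally close.
        if best is None or diff < best[0]:
            best = (diff, left, right, j + k)
    if best is None:
        raise IndexError("j is outside the candidate portion")
    return best[1], best[2], best[3]
```

With fragments "イ" at 90 to 105, "ヒ" at 105 to 120 and "の" at 120 to 150, and a target width of 30, the merged widths are 15, 30 and 60, so the first two characters are selected and next_j becomes j + 2.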

【０１０９】 FIG. 9 shows an example of a method for obtaining a feature amount (a vector quantity) from the character image C_i.

【０１１０】 The character image C_i of the character "あ" is divided into 16 blocks B[i] (0 ≤ i ≤ 15). The blocks are numbered i in order from upper left to lower right. For i = 0 to 15, the black pixel density of block B[i] (the number of black pixels in the block divided by the area of the block) is calculated, yielding 16 values. The 16-dimensional vector having these 16 values as its components is used as the feature amount. A feature amount obtained in this way represents the features of the character's shape.
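The block-density feature can be sketched in plain Python as follows. The sketch assumes that the 16 blocks form a 4 × 4 grid (the text gives only the count and the numbering order), that the image is a list of rows with 1 for a black pixel and 0 for a white pixel, and that block boundaries are placed by integer division when the image size is not a multiple of 4; the function names are hypothetical. A Euclidean distance helper is included because the shape matching compares feature amounts by that distance.

```python
import math
from typing import List, Sequence


def block_density_features(image: Sequence[Sequence[int]],
                           grid: int = 4) -> List[float]:
    """Return grid*grid black pixel densities, upper left to lower right."""
    h, w = len(image), len(image[0])
    if h < grid or w < grid:
        raise ValueError("image is smaller than the block grid")
    features: List[float] = []
    for gy in range(grid):
        r0, r1 = gy * h // grid, (gy + 1) * h // grid
        for gx in range(grid):
            c0, c1 = gx * w // grid, (gx + 1) * w // grid
            black = sum(image[r][c] for r in range(r0, r1)
                        for c in range(c0, c1))
            features.append(black / ((r1 - r0) * (c1 - c0)))
    return features


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance between two feature amounts of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Two images would then be judged similar when `euclidean_distance` of their features falls below the threshold Thd_1, whose value the description leaves open.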

【０１１１】 As described above, according to the document search device 1 of the present invention, after the search based on character codes by the text search means 101, a further search based on image feature amounts is performed on the portions in which the search by the text search means 101 is likely to have missed occurrences. This reduces search omissions.

【０１１２】 In addition, because the search based on the comparison of image feature amounts is limited to the portions in which search omissions are likely, the cost of the search (time and amount of computation) is kept low.

【０１１３】 In the above description, the character-code-based search in the text search means 101 simply compares the character code of each character in the character recognition result D_c with the character code of each keyword character in the keyword K_w. However, each character in the character recognition result D_c may hold a plurality of candidate characters from the character recognition processing by the OCR device 202, and if the character code of any of these candidate characters matches the character code of a specific keyword character in the keyword K_w, the character in the character recognition result D_c may be determined to match that keyword character. Alternatively, a specific character in the keyword K_w may be expanded into a plurality of similar characters, and if the character code of any of these similar characters matches the character code of a character in the character recognition result D_c, the character in the character recognition result D_c may be determined to match the specific keyword character in the keyword K_w.
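The two relaxed character-code comparisons described above could be sketched in Python as follows. The function names, the use of a list of candidate characters per recognized character, and the dictionary `similar_table` mapping a keyword character to its similar characters are hypothetical and chosen only for illustration.

```python
def matches_any_candidate(candidates, keyword_char):
    """First variant: the recognized character keeps several OCR candidates,
    and a match is declared if any candidate equals the keyword character."""
    return keyword_char in candidates


def matches_similar(ocr_char, keyword_char, similar_table):
    """Second variant: the keyword character is expanded into itself plus
    its similar characters, and the recognized character is looked up there."""
    expanded = {keyword_char, *similar_table.get(keyword_char, ())}
    return ocr_char in expanded
```

For example, with the assumed table `{"0": ["O", "o"]}`, a recognized "O" would be accepted for the keyword character "0" by the second variant.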

【０１１４】 The search by the text search means 101 may be a fuzzy search using a synonym dictionary such as a thesaurus, or any other known search method.

【０１１５】 The document data D_d may be in an index format.

【０１１６】 Although the character image table 302 has been described as being able to use data in which character codes and character images are paired, such as font data, it is also possible to collect character images for each character code from actual document image data and to use the average of the feature amounts for each character code. Furthermore, when a high-confidence search result for the same keyword exists (for example, the search result RD_t1 of the text search means 101), pairs of character codes and character images may be generated from that search result and used as the character image table 302, and font data may be used when no such search result exists.

【０１１７】 The character width estimation means 104 may be omitted. In that case, the same value (for example, the estimated character height a) may be used as the estimated character width regardless of the character.

【０１１８】 After the processing by the character specifying means 102, processing for merging and re-dividing the characters included in the candidate portion SD_c may be performed. To realize this processing, a character re-division means may be added to the document search device 1. The character re-division means merges the characters in the candidate portion SD_c specified by the character specifying means 102 and extracts the image corresponding to the merged characters from the document data D_d. The extracted image is divided as finely as possible. When the document is written horizontally, this division is performed, for example, by obtaining a projection histogram of black pixels in the vertical direction and dividing at the positions where the projection histogram is smaller than a predetermined threshold. The parts obtained by dividing as finely as possible in this way are called subdivision elements. A subdivision element is generally equal to or smaller in size than a character in the candidate portion SD_c. In the character image extraction means 301, the character width of the non-matching characters was varied in units of characters in the candidate portion SD_c, but it may instead be varied in units of subdivision elements. Using subdivision elements in this way makes it possible to specify an appropriate segmentation position based on the character width of the keyword character even when the OCR device 202 segmented the characters at the wrong positions during character recognition.
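The division by a vertical projection histogram could be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `split_by_projection`, the 0/1 row-list image representation (1 = black pixel), and the return value as half-open column ranges are not taken from the disclosure.

```python
def split_by_projection(image, threshold=1):
    """Split a horizontally written text image into subdivision elements.

    Columns whose black-pixel count is below `threshold` are treated as
    gaps; each maximal run of the remaining columns is returned as a
    half-open (start, end) column range.
    """
    width = len(image[0])
    # Projection histogram: number of black pixels in each column.
    histogram = [sum(row[x] for row in image) for x in range(width)]
    elements, start = [], None
    for x, count in enumerate(histogram):
        if count >= threshold and start is None:
            start = x                      # a new element begins
        elif count < threshold and start is not None:
            elements.append((start, x))    # the element ends at a gap
            start = None
    if start is not None:
        elements.append((start, width))
    return elements
```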

【０１１９】 Hereinafter, several variations of the document search device 1 according to Embodiment 1 of the present invention will be described with reference to the drawings.

【０１２０】 FIG. 10 shows the configuration of a character shape search means 103a as a variation of the character shape search means 103.

【０１２１】 The character shape search means 103a can be used in place of the character shape search means 103 shown in FIG. 7. In FIG. 10, components identical to those shown in FIG. 7 are given the same reference numerals, and their description is omitted.

【０１２２】 The character shape search means 103a includes a collation control means 606 and a similar character collating means 602.

【０１２３】 The collation control means 606 controls the overall operation of the character shape search means 103a.

【０１２４】 The similar character collating means 602 determines whether the character code C_c[j] of a character included in the candidate portion SD_c matches the character code of the keyword character K_w[i]. Alternatively, it may determine whether the character code C_c[j] matches the character code of any character included in the similar character list (FIG. 24B) for the keyword character K_w[i]. Such a similar character list is prepared in advance for every character and is a list of characters that are highly likely to be misrecognized.

【０１２５】 The operation of the character shape search means 103a is shown in steps S301 to S303 below. This processing is executed by the collation control means 606. Here, the candidate portion SD_c is assumed to be a set of characters SD_c[j]. Each character SD_c[j] has the same data structure as the character described with reference to FIG. 2B. In S301 to S303, the variable j represents the index of the character SD_c[j], and the variable i represents the index of the keyword character K_w[i].

【０１２６】 A plurality of candidate portions SD_c may exist in the character recognition result D_c. When a plurality of candidate portions SD_c exist in the character recognition result D_c, the processing of steps S301 to S303 below is performed for each candidate portion SD_c.

【０１２７】 Step S301: The variable start_j is assigned to the variable j, and the value "0" is assigned to the variable i. The variable start_j is the index of the character located at the beginning (left end) of the candidate portion SD_c. Also in step S301, a list whose length equals the length N_k of the keyword K_w is prepared as the detection location data.

【０１２８】 Step S302: The similar character collating means 602 determines, based on a comparison of character codes, whether the character SD_c[j] matches the keyword character K_w[i].

【０１２９】 If they match, the character SD_c[j] is registered at the i-th position of the detection location data, the variables i and j are each incremented by 1, and step S302 is repeated. This means that the next character in the candidate portion SD_c is collated with the next keyword character.

【０１３０】 If they do not match, the processing proceeds to step S303.

【０１３１】 In this way, the similar character collating means 602 functions as a first determination means that determines whether the character code of a character (second character) included in the candidate portion SD_c (first portion) matches the character code of the specific keyword character K_w[i] (first character) included in the keyword.

【０１３２】 Step S303: From the candidate portion SD_c, one or more consecutive characters to be associated with the keyword character K_w[i] are specified as "non-matching characters", and their image C_i is extracted. This processing is performed by the character image extraction means 301 and has already been described with reference to FIG. 8. In addition, the image KC_i of the keyword character K_w[i] is obtained from the character image table 302.

【０１３３】 Next, the shape matching means 303 determines whether the keyword character K_w[i] matches the non-matching characters. To make this determination, the shape matching means 303 first collates the image C_i with the image KC_i. The images are collated by comparing their feature amounts. A Euclidean distance between the feature amount of the image C_i and the feature amount of the image KC_i that is smaller than a predetermined threshold Thd₁ indicates that the image C_i and the image KC_i are similar. If the image C_i and the image KC_i are similar, the non-matching characters specified in step S303 are determined to match the keyword character K_w[i].
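The similarity test between two feature amounts could be sketched in Python as follows. The function name `is_similar` and the parameter name `thd1` are hypothetical; the feature amounts are assumed to be equal-length sequences of numbers.

```python
import math


def is_similar(feature_c, feature_kc, thd1):
    """Return True when the Euclidean distance between the two feature
    vectors is strictly smaller than the threshold Thd1."""
    return math.dist(feature_c, feature_kc) < thd1
```

The comparison is strict, following the wording "smaller than" the threshold: a distance exactly equal to the threshold does not count as similar.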

【０１３４】 When the keyword character K_w[i] is determined to match the non-matching characters, the one or more consecutive characters specified as the non-matching characters are registered at the i-th position of the detection location data, the variable i is incremented by 1, the variable next_j is assigned to the variable j, and the processing returns to step S302. This means that the next character in the keyword is searched for starting from the portion adjacent to the right of the already-collated characters in the candidate portion SD_c. The variable next_j indicates the index of the character adjacent to the right of the already-collated characters in the candidate portion SD_c. The value of the variable next_j is determined by the character image extraction means 301 in step S303.

【０１３５】 When the keyword character K_w[i] is determined not to match the non-matching characters, the value "0" is assigned to the variable i, the value start_j + 1 is assigned to the variable j, the variable j is assigned to the variable start_j, and the processing returns to step S302. This means that the character of interest in the candidate portion SD_c is shifted one character to the right and the first keyword character K_w[0] is searched for again.
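The control flow of steps S301 to S303 could be sketched in Python as follows. This is a sketch under stated assumptions rather than the disclosed implementation: `search_candidate`, `code_match` and `shape_match` are hypothetical names; `shape_match(sd_c, j, keyword_char)` stands in for the character image extraction means 301 and the shape matching means 303 together and is assumed to return next_j on a match and `None` otherwise. The stop conditions (success once every keyword character is matched, failure once start_j passes the end of the candidate portion) are also assumptions, since they are not stated in the steps above.

```python
def search_candidate(sd_c, keyword, code_match, shape_match, start_j=0):
    """Return the detection location data (one entry per keyword
    character) for one candidate portion, or None if nothing is found."""
    n_k = len(keyword)
    j, i = start_j, 0              # S301: j <- start_j, i <- 0
    detected = [None] * n_k        # S301: list whose length is N_k
    while start_j < len(sd_c):
        if i == n_k:               # assumed: all keyword characters matched
            return detected
        # S302: collation based on character codes.
        if j < len(sd_c) and code_match(sd_c[j], keyword[i]):
            detected[i] = [sd_c[j]]
            i, j = i + 1, j + 1
            continue
        # S303: collation based on image features of the non-matching
        # characters that start at position j.
        next_j = shape_match(sd_c, j, keyword[i]) if j < len(sd_c) else None
        if next_j is not None:
            detected[i] = sd_c[j:next_j]
            i, j = i + 1, next_j
        else:
            # No match: shift the start one character to the right and
            # look for the first keyword character K_w[0] again.
            start_j += 1
            j, i = start_j, 0
    return None
```

Because step S302 runs before step S303, the comparatively expensive shape comparison is reached only for characters whose character codes do not match.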

【０１３６】 In this way, the character image extraction means 301 functions as a non-matching character specifying means that, when the character code of the character SD_c[j] (second character) included in the candidate portion SD_c (first portion) does not match the character code of the specific keyword character K_w[i] included in the keyword K_w, specifies, as the non-matching characters, one or more consecutive characters that include at least the character SD_c[j] (second character) included in the candidate portion SD_c (first portion) and that have the width closest to the width of the keyword character K_w[i].

【０１３７】 Further, the shape matching means 303 functions as a second determination means that determines that the specific keyword character K_w[i] matches the non-matching characters when the distance between the feature amount of the image of the specific keyword character K_w[i] (first character) and the feature amount of the image of the region including the one or more partial regions assigned to the one or more consecutive characters (second characters) included in the non-matching characters is smaller than the predetermined threshold Thd₁.

【０１３８】 Compared with the processing procedure shown in steps S101 to S103, the processing procedure shown in steps S301 to S303 described above adds a collation process based on a comparison of character codes (step S302) before the collation process based on a comparison of image feature amounts (step S303). If the collation process based on the comparison of character codes determines that the character SD_c[j] in the candidate portion SD_c matches the keyword character K_w[i], the collation process based on the comparison of the image feature amounts of the character SD_c[j] in the candidate portion SD_c and the keyword character K_w[i] is not performed. In general, a collation process based on a comparison of character codes can be performed faster than one based on a comparison of image feature amounts, so performing the processing procedure shown in steps S301 to S303 makes it possible to improve the search processing speed of the document search device 1.

【０１３９】 Further, in step S303, when the reliability C_r[j] of the character SD_c[j] in the candidate portion SD_c is higher than a predetermined threshold, the collation by the shape matching means 303 may be omitted. This is because a reliability C_r[j] higher than the predetermined threshold means that the character recognition in the OCR device 202 was very likely performed correctly, so there is little need to compare the image feature amounts.

【０１４０】 FIG. 11 shows the configuration of a document search device 701 as a variation of the document search device 1. The document search device 701 can be used, for example, in place of the document search device 1 shown in FIG. 4.

【０１４１】 In FIG. 11, components identical to those shown in FIG. 4 are given the same reference numerals, and their description is omitted.

【０１４２】 The document search device 701 includes a search accuracy control means 705 that generates the threshold Thr used in the character specifying means 102.

【０１４３】 The document data D_d input to the document search device 701 is assumed to have quality information. The quality information is a value related to the image quality of the document image data D_i and represents, for example, the resolution of the document image data D_i, the degree to which characters are faint, and the degree to which characters are smudged. The quality information is expressed, for example, as a value between 0 and 1, where a larger value indicates better image quality of the document image data D_i.

【０１４４】 The search accuracy control means 705 outputs a threshold Thr corresponding to the quality information of the document data D_d, based on a table that defines the relationship between the quality information and the threshold Thr in advance. As the predetermined relationship between the quality information and the threshold Thr, for example, a relationship in which the threshold Thr is equal to the quality information may be used.
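A table lookup of this kind could be sketched in Python as follows. The layout of the table is not specified in the description; the sketch assumes a step table, that is, a list of (quality, Thr) pairs sorted by quality, from which the entry with the largest quality not exceeding the input is selected. The function name `threshold_from_quality` is hypothetical.

```python
import bisect


def threshold_from_quality(quality, table):
    """Look up the threshold Thr for a quality value in a step table.

    `table` is a list of (quality, thr) pairs sorted by quality (an
    assumed layout). Values below the first entry use the first entry.
    """
    keys = [q for q, _ in table]
    index = bisect.bisect_right(keys, quality) - 1
    return table[max(index, 0)][1]
```

The example relationship in which the threshold equals the quality information corresponds to a table whose entries have identical quality and threshold values, such as `[(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]`.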

【０１４５】 In general, when the image quality of the document image data D_i is good, the character recognition in the OCR device 202 is expected to be performed correctly. Therefore, by adjusting the threshold Thr according to the image quality of the document image data D_i, the number of candidate portions SD_c specified by the character specifying means 102 can be adjusted. As a result, for document data D_d of good quality, the processing targets of the character shape search means 103 (the number of candidate portions SD_c) can be kept from increasing more than necessary, which makes it possible to shorten the processing time and suppress excessive detections. For document data D_d of poor quality, increasing the processing targets of the character shape search means 103 makes it possible to reduce search omissions caused by segmentation errors and recognition errors during character recognition in the OCR device 202.

【０１４６】 FIG. 12 shows the configuration of a document search device 801 as a variation of the document search device 701. The document search device 801 can be used, for example, in place of the document search device 701 shown in FIG. 11.

【０１４７】 In FIG. 12, components identical to those shown in FIG. 11 are given the same reference numerals, and their description is omitted.

【０１４８】 The document search device 801 includes a quality information extraction means 805 that obtains the quality information of the document data D_d from the reliabilities assigned to the characters in the character recognition result D_c.

【０１４９】 The reliability C_r[j] reflects the probability that the character recognition is correct, and a document with high quality information is considered to have a high probability of correct character recognition. Therefore, the quality information can be obtained from the reliability C_r[j].

【０１５０】 The quality information extraction means 805 can obtain the quality information, for example, as the average of the reliabilities C_r[j] of all characters included in the character recognition result D_c of the document data D_d.
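As a minimal sketch of this averaging, assuming the reliabilities C_r[j] are available as a non-empty list of numbers between 0 and 1 (the function name `quality_from_reliabilities` is hypothetical):

```python
from statistics import fmean


def quality_from_reliabilities(reliabilities):
    """Quality information as the mean of the per-character reliabilities.

    Raises statistics.StatisticsError when the list is empty.
    """
    return fmean(reliabilities)
```

If the reliabilities lie between 0 and 1, the resulting quality value also lies between 0 and 1, which matches the value range given for the quality information.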

【０１５１】 According to the configuration shown in FIG. 12, the quality information can be obtained objectively from the reliabilities of the characters included in the document data D_d.

【０１５２】 FIG. 13 shows the configuration of a document search device 901 as a variation of the document search device 701. The document search device 901 can be used, for example, in place of the document search device 701 shown in FIG. 11.

【０１５３】 In FIG. 13, components identical to those shown in FIG. 11 are given the same reference numerals, and their description is omitted.

【０１５４】 The document search device 901 includes a search accuracy designating means 905 that allows the user to designate the threshold Thr used in the character specifying means 102.

【０１５５】 With the search accuracy designating means 905, the user can designate the threshold Thr according to the purpose. If the user wants to find as many correct detection locations as possible and does not mind an increase in the number of excessive detection locations, the threshold Thr should be increased. If the user considers it sufficient to find one correct detection location and does not want to increase the number of excessive detection locations, the threshold Thr should be decreased. An excessive detection location is a portion that does not match the keyword in the original document but is detected as a matching portion by the document search device.

【０１５６】 In this way, the document search device 901 enables a search that reflects the user's intention.

【０１５７】 FIG. 14 shows the configuration of a document search device 1151 as a variation of the document search device 1. The document search device 1151 can be used in place of the document search device 1 shown in FIG. 4.

【０１５８】 In FIG. 14, components identical to those shown in FIG. 4 are given the same reference numerals, and their description is omitted.

【０１５９】 The document search device 1151 includes a similar character string averaging means 1102 that obtains the average of image feature amounts from the detection location data included in the set RD_t1 of first matching portions, and a character string re-detection means 1103 that refines the detection location data included in the set RD_t2 of second matching portions using the average of the image feature amounts obtained by the similar character string averaging means 1102.

【０１６０】 The number of excessive detection locations included in RD_t1, the search result of the text search means 101, is considered to be smaller than the number of excessive detection locations included in RD_t2, the search result of the character shape search means 103. Therefore, the detection location data included in the search result RD_t1 is highly likely to truly match the keyword.

【０１６１】 The similar character string averaging means 1102 refers to the detection location data included in the search result RD_t1 and extracts the character image C_i of the partial region of the document image data D_i corresponding to each keyword character. This character image C_i shows the shape in which the keyword character is written in the original document. The similar character string averaging means 1102 further calculates the feature amount of each extracted character image C_i and averages the feature amounts for each identical keyword character. The averaged feature amount is used as the determination reference value in the character string re-detection means 1103.
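The averaging of feature amounts per keyword character could be sketched in Python as follows. The function name `average_features` and the input layout (a dictionary from keyword character to the list of feature vectors extracted for that character from RD_t1) are assumptions made for illustration.

```python
def average_features(samples):
    """Average the feature vectors collected for each keyword character.

    `samples` maps a keyword character to a non-empty list of
    equal-length feature vectors. The result maps each keyword character
    to its component-wise mean vector, which serves as the determination
    reference value for that character.
    """
    reference = {}
    for keyword_char, vectors in samples.items():
        count = len(vectors)
        # zip(*vectors) groups the k-th components of all vectors together.
        reference[keyword_char] = [sum(column) / count
                                   for column in zip(*vectors)]
    return reference
```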

【０１６２】 In this way, the similar character string averaging means 1102 functions as a calculation means that calculates a predetermined determination reference value from the search result RD_t1 (at least one first matching portion).

【０１６３】 The detection location data included in RD_t2, the search result of the character shape search means 103, may include excessive detection locations. This is because the character shape search means 103 may have performed the search using character images that differ from the font used in the original document.

【０１６４】 The character string re-detection means 1103 sifts such detection locations using the averaged feature amount (determination reference value) calculated by the similar character string averaging means 1102.

【０１６５】 The character string re-detection means 1103 refers to the detection location data included in the search result RD_t2 and extracts the character image C_i of the partial region of the document image data D_i corresponding to each keyword character. It then detects the detection location data for which the distance between the feature amount of the extracted character image C_i and the determination reference value is smaller than a predetermined threshold Thd₂.
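The sifting of RD_t2 against the determination reference value could be sketched in Python as follows. The names `refine_detections` and `thd2`, the Euclidean distance, and the input layout (each detection as a list of (keyword character, feature vector) pairs) are assumptions. In particular, the description does not state how the per-character distances are combined, so the sketch assumes that every keyword character of a detection must pass the threshold.

```python
import math


def refine_detections(detections, reference, thd2):
    """Keep only the detections whose character images are close to the
    determination reference value.

    `detections` is a list of detections, each a list of
    (keyword_char, feature_vector) pairs; `reference` maps a keyword
    character to its averaged feature vector.
    """
    kept = []
    for detection in detections:
        # Assumed rule: all characters of the detection must be within Thd2.
        if all(math.dist(feature, reference[keyword_char]) < thd2
               for keyword_char, feature in detection):
            kept.append(detection)
    return kept
```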

【０１６６】 In this way, the character string re-detection means 1103 functions as a detection means that detects, based on the determination reference value, the second matching portions that satisfy a predetermined second condition among the search result RD_t2 (at least one second matching portion).

【０１６７】 The detection location data detected by the character string re-detection means 1103 as described above is output as a new set of detection location data.

【０１６８】 Note that searching the document data D_d again using the averaged feature amount (determination reference value) calculated by the similar character string averaging means 1102 makes it possible to further improve the search accuracy.

【０１６９】 In this way, the document search device 1151 can suppress the number of excessive detection locations by detecting, from among the detection location data included in the relatively unreliable search result RD_t2, the detection location data that satisfies a predetermined condition based on a determination reference value.

【０１７０】 In the example described above, the multi-stage search means indicated by reference numeral 1101 in FIG. 14 is the document search device 1 shown in FIG. 4. However, the configuration of the document search device that can be adopted as the multi-stage search means 1101 is not limited to this. Any document search device capable of controlling, in stages, the number of excessive detection locations included in the search results can be adopted as the multi-stage search means 1101.
The configuration of the document search device that can be adopted as 101 is not limited to this. As the multi-stage search means 1101, an arbitrary document search device capable of controlling the number of excessive detection points included in the search result stepwise can be adopted.

【０１７１】 The search results obtained by the multi-stage search means 1101 consist of search result RD_t1, search result RD_t2, ..., search result RD_tn. If the search result containing the fewest excessive detection locations among these is the search result RD_t1, the similar character string averaging means 1102 calculates the determination reference value based on the search result RD_t1.

【０１７２】 Alternatively, instead of the multi-stage search means 1101, any document search device that does not control the number of excessive detection locations in stages may be adopted. Such an example is shown in FIG. 15.

【０１７３】図１５は、文書検索装置１１５１のバリエ
ーションとしての文書検索装置１０５１の構成を示す。
文書検索装置１０５１は、図１４に示される文書検索装
置１１５１の代わりに用いられ得る。FIG. 15 shows the structure of a document search device 1051 as a variation of the document search device 1151.
The document search device 1051 can be used instead of the document search device 1151 shown in FIG.

【０１７４】図１５において、図１４に示される構成要
素と同一の構成要素には同一の参照番号を付し、その説
明を省略する。In FIG. 15, the same components as those shown in FIG. 14 are designated by the same reference numerals, and the description thereof will be omitted.

【０１７５】 The document search device 1051 includes search means 1001 and similar character string averaging means 1002.

【０１７６】 Any document search device can be adopted as the search means 1001. The search result of the search means 1001 is output as RD_t.

【０１７７】 The similar character string averaging means 1002 refers to at least one item of detection location data in the search result RD_t that is similar to the keyword, and extracts the character image C_i of the partial region of the document image data corresponding to each keyword character. The character image C_i shows the shape in which the keyword character is written in the original document. The similar character string averaging means 1002 further calculates the feature quantity of each extracted character image C_i and averages the feature quantities for the same keyword character. The averaged feature quantity is used as the judgment reference value in the character string redetection means 1103.

【０１７８】 The similar character string averaging means 1002 obtains the at least one item of detection location data in the search result RD_t that is similar to the keyword as follows.

【０１７９】 First, each keyword character contained in the keyword K_w is rendered as an image to obtain a character image KC_i. If the character block data D_t has information on the font used in the character block, that same font can be used when generating the image of the keyword character.

【０１８０】 A feature quantity is obtained from the character image KC_i. As the feature quantity, for example, a feature quantity used in character recognition or the feature quantity described with reference to FIG. 9 can be adopted. Next, the character image C_i is extracted from the document image data D_i by referring to the position information, held in the detection location data, of the character in the character recognition result D_c that corresponds to each character of the keyword K_w. A feature quantity is obtained from that character image C_i. The Euclidean distance between the feature quantity obtained from KC_i and the feature quantity obtained from C_i is then calculated. These Euclidean distances are summed over all characters contained in the keyword K_w, and the sum divided by the number of characters in K_w is defined as the distance between the detection location data and the keyword K_w. Detection location data being similar to the keyword K_w means that this distance is small.
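The distance defined in this paragraph can be written compactly as follows. The symbols f (a feature extractor), d (the distance), and the use of N_k for the number of characters in K_w are notational assumptions introduced here for illustration; the specification states the definition only in words.

```latex
d\bigl(RD_t[t],\,K_w\bigr) \;=\; \frac{1}{N_k}\sum_{i=0}^{N_k-1}
  \bigl\lVert f(KC_i) - f(C_i) \bigr\rVert_2
```

Here f(KC_i) is the feature vector of the rendered keyword character and f(C_i) is the feature vector of the corresponding character image cut from the document image.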

【０１８１】 As described above, the distance to the keyword K_w is obtained for each item of detection location data in the search result RD_t, and a predetermined number of items are selected in ascending order of distance. In this way, at least one detection location similar to the keyword is obtained.
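The selection described above might be sketched as follows. This is a minimal illustration, not the patented implementation: feature vectors are assumed to be plain tuples of numbers, and the names `keyword_distance`, `select_similar`, and the `"features"` dictionary key are hypothetical.

```python
import math

def keyword_distance(kw_feats, loc_feats):
    """Mean Euclidean distance between per-character feature vectors.

    kw_feats[i]  : feature vector of the rendered keyword character KC_i
    loc_feats[i] : feature vector of the document character image C_i
    """
    if not kw_feats or len(kw_feats) != len(loc_feats):
        raise ValueError("need one feature vector per keyword character")
    total = sum(math.dist(a, b) for a, b in zip(kw_feats, loc_feats))
    return total / len(kw_feats)

def select_similar(kw_feats, detections, count):
    """Return the `count` detections nearest to the keyword, nearest first."""
    ranked = sorted(
        detections,
        key=lambda d: keyword_distance(kw_feats, d["features"]),
    )
    return ranked[:count]

# Example with made-up 2-D feature vectors for a two-character keyword.
kw = [(0.0, 0.0), (1.0, 1.0)]
detections = [
    {"id": "a", "features": [(0.0, 0.0), (1.0, 1.0)]},
    {"id": "b", "features": [(3.0, 4.0), (1.0, 1.0)]},
    {"id": "c", "features": [(0.0, 1.0), (1.0, 1.0)]},
]
nearest = select_similar(kw, detections, 2)
```

Sorting all detections is the simplest choice for a sketch; when only a few items are needed from a long result list, a partial selection such as `heapq.nsmallest` would do the same job.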

【０１８２】 (Embodiment 2) FIG. 16 shows the configuration of a document search device 451 according to Embodiment 2 of the present invention.

【０１８３】 The document search device 451 includes wildcard search means 401, character shape search means 402, character width estimation means 403, and a character image table 404.

【０１８４】 The document search device 451 can be used as the document search device 204 shown in FIG. 1. In this case, the document search device 451 searches for the keyword K_w in the document data D_d stored in the document database 203. The document data D_d includes the document image data D_i and the character recognition result D_c.

【０１８５】 The wildcard search means 401 compares the character codes assigned to the characters contained in the keyword K_w (first characters) with the character codes assigned to the characters contained in the character recognition result D_c (second characters), identifies portions in which at least one character matches, and outputs them as the search result RD_t1. Let N_r1 be the number of matching portions contained in the search result RD_t1. The search result RD_t1 contains detection location data RD_t[t], where 0 ≤ t ≤ N_r1 − 1. Each item of detection location data RD_t[t] indicates a matching portion in which at least one of the keyword characters matches, in the character code comparison, a character contained in the character recognition result D_c.

【０１８６】 The character shape search means 402 identifies keyword characters that did not match any character contained in the character recognition result D_c in the character code comparison (wildcard characters) and determines, based on a comparison of image feature quantities, whether each wildcard character matches a character contained in the character recognition result D_c.

【０１８７】 The character width estimation means 403 calculates an estimate of the width of each keyword character. This estimate is used by the wildcard search means 401 and the character shape search means 402.

【０１８８】 The character image table 404 is the same as the character image table 302 shown in FIG. 7, and its description is omitted here.

【０１８９】 FIG. 17 shows the processing procedure performed by the wildcard search means 401.

【０１９０】 The processing procedure shown in FIG. 17 is described below step by step.

【０１９１】 Step S1701: The value "0" is assigned to the variable i. The variable i is a subscript indicating the position of a keyword character within the keyword. Assigning "0" to i means that processing proceeds in order from the first character K_w[0] of the keyword.

【０１９２】 Step S1702: It is determined whether the variable i is equal to the variable N_k. The variable N_k represents the number of characters in the keyword K_w. If the determination in step S1702 is "Yes", the process ends, because i being equal to N_k means that processing has been completed through the last character of the keyword K_w. If the determination in step S1702 is "No", the process proceeds to step S1703.

【０１９３】 Step S1703: The value "0" is assigned to the variable j. The variable j is a subscript indicating the position of a character in the character recognition result D_c. Assigning "0" to j means that processing proceeds in order from the beginning of the character recognition result D_c.

【０１９４】 Step S1704: It is determined whether the variable j is equal to the variable N_d. The variable N_d represents the number of characters in the character recognition result D_c. If the determination in step S1704 is "Yes", the process proceeds to step S1708, because j being equal to N_d means that processing has been completed through the last character in the character recognition result D_c. If the determination in step S1704 is "No", the process proceeds to step S1705.

【０１９５】 Step S1705: It is determined whether K_w[i] is equal to C_c[j]. K_w[i] represents the character code of the i-th character of the keyword, and C_c[j] represents the character code of the j-th character in the character recognition result D_c. If the determination in step S1705 is "Yes", the process proceeds to step S1706. If the determination in step S1705 is "No", the process proceeds to step S1707.

【０１９６】 In this way, in step S1705 the wildcard search means 401 functions as first determination means that determines, by comparing character codes, whether at least part of the keyword K_w matches at least part of the recognition result D_c.

【０１９７】 Step S1706: The character D_c[j] is registered at the i-th position of the detection location data. As already described with reference to FIG. 2D, the detection location data is a list of length N_k. The character D_c[j] is registered as the i-th list element of this list. The detailed processing procedure of step S1706 is described later with reference to FIG. 19.

【０１９８】 Step S1707: The variable j is incremented by 1. This means that subsequent processing is performed on the next character in the recognition result D_c.

【０１９９】 Step S1708: The variable i is incremented by 1. This means that subsequent processing is performed on the next keyword character in the keyword K_w.
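Steps S1701 to S1708 amount to a nested scan over keyword characters and recognized characters. The sketch below illustrates that control flow under two assumptions that the text leaves to the flowchart of FIG. 17: that step S1706 continues to step S1707, and that steps S1707 and S1708 loop back to steps S1704 and S1702 respectively. The function name `wildcard_scan` and the `register` callback are hypothetical.

```python
def wildcard_scan(keyword, recognized, register):
    """Call register(i, j) for every code match K_w[i] == C_c[j]."""
    i = 0                                      # S1701
    while i != len(keyword):                   # S1702: i == N_k ends the scan
        j = 0                                  # S1703
        while j != len(recognized):            # S1704: j == N_d ends the pass
            if keyword[i] == recognized[j]:    # S1705: character code comparison
                register(i, j)                 # S1706: record D_c[j] at position i
            j += 1                             # S1707
        i += 1                                 # S1708

# Example: collect the (i, j) pairs instead of building detection lists.
hits = []
wildcard_scan("abc", "xabca", lambda i, j: hits.append((i, j)))
```

Passing the registration step in as a callback keeps the scan separate from the bookkeeping of detection location data, which the specification describes separately with reference to FIG. 19.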

【０２００】 FIG. 18 shows the search result RD_t1 obtained by the processing procedure shown in FIG. 17. In the detection location data RD_t[1], a character in the recognition result D_c with a matching character code exists for every keyword character contained in the keyword K_w (= "琵琶湖", Lake Biwa). In the detection location data RD_t[0], a character in the recognition result D_c with a matching character code exists for only some of the characters of the keyword K_w. The list element 1861 shown as "*" in the detection location data RD_t[0] means that no character in the recognition result D_c has a character code matching the keyword character K_w[2] (= "湖") and that the character in the recognition result D_c corresponding to K_w[2] is undetermined. Such a list element is called a "wildcard".

【０２０１】 FIG. 19 shows the detailed processing procedure of step S1706 shown in FIG. 17.

【０２０２】 The processing procedure shown in FIG. 19 is described below step by step.

【０２０３】 Step S1901: The value "0" is assigned to the variable t. The variable t is the subscript of the detection location data.

【０２０４】 Step S1902: It is determined whether the variable t is equal to the variable N_r. The variable N_r indicates the number of items of detection location data detected so far. If this determination is "Yes", the process proceeds to step S1907. A "Yes" determination means that none of the detection location data detected so far can accept the registration of D_c[j]. If the determination in step S1902 is "No", the process proceeds to step S1903.

【０２０５】 Step S1903: The character D_c[k] that was last registered in the detection location data RD_t[t] is obtained.

【０２０６】 Step S1904: The subscript m of the keyword character K_w[m] corresponding to the character D_c[k] is obtained. The subscript m is obtained by checking at which position in the list of the detection location data RD_t[t] the character D_c[k] is registered.

【０２０７】 Step S1905: It is determined whether the keyword characters K_w[m+1] to K_w[i−1] fit, with neither excess nor shortage, between the character D_c[k] and the character D_c[j]. This determination is made, for example, by determining whether the space between D_c[k] and D_c[j] is within the range of 1 to 1.2 times the width of the keyword characters K_w[m+1] to K_w[i−1]. This range may be varied according to the width of the spacing between characters in the document.

【０２０８】 If the determination in step S1905 is "Yes", the process proceeds to step S1909. A "Yes" determination means that the character D_c[j] can be registered in the detection location data RD_t[t].

【０２０９】 If the determination in step S1905 is "No", the process proceeds to step S1906.

【０２１０】 Step S1906: The value of the variable t is incremented by 1. That is, the character D_c[j] is not registered in the detection location data RD_t[t], and processing moves on to the next item of detection location data.

【０２１１】 Step S1907: A list RD_t[N_r] for new detection location data is allocated, and the character D_c[j] is registered at its i-th position. All list elements of the allocated list RD_t[N_r] other than the i-th position are wildcards.

【０２１２】 Step S1908: The value of the variable N_r is incremented by 1.

【０２１３】 Step S1909: The character D_c[j] is registered at the i-th position of the detection location data RD_t[t].
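The registration procedure of steps S1901 to S1909 might be sketched as below. Several points are assumptions made for illustration and are not taken from the specification: characters are represented by (left x, right x) pairs, "last registered" is read as the highest filled list position, entries already filled at or beyond position i are skipped, and a `slack` allowance stands in for the remark that the range may vary with inter-character spacing. Only the coordinates 318 and 449 and the widths 35, 80, 125 and 253 come from the worked example in the text; the other coordinates are made up to be consistent with them.

```python
WILDCARD = None  # marks a list element whose recognized character is undetermined

def fits(gap, needed, slack=0.0, lo=1.0, hi=1.2):
    """S1905 test: does text of total width `needed` fit into `gap`?"""
    return lo * needed <= gap <= hi * needed + slack

def register(rd, i, j, boxes, kw_widths, slack=0.0):
    """Register recognized character j at keyword position i (S1901-S1909)."""
    for entry in rd:                                    # S1901, S1902, S1906
        m = max(p for p, v in enumerate(entry) if v is not WILDCARD)  # S1904
        if m >= i:
            continue          # assumption: never overwrite an existing position
        k = entry[m]                                    # S1903
        gap = boxes[j][0] - boxes[k][1]    # left edge of D_c[j] - right edge of D_c[k]
        needed = sum(kw_widths[m + 1:i])   # widths of K_w[m+1] .. K_w[i-1]
        if fits(gap, needed, slack):                    # S1905
            entry[i] = j                                # S1909
            return
    new_entry = [WILDCARD] * len(kw_widths)             # S1907
    new_entry[i] = j
    rd.append(new_entry)                                # S1908: N_r grows by one

# Worked example: D_c[1] ends at x=318, D_c[5] starts at x=449, K_w[2] is 125 wide.
boxes = [(60, 185), (193, 318), (321, 356), (359, 401), (404, 446), (449, 574)]
rd = [[0, 1, WILDCARD, WILDCARD]]
register(rd, 3, 5, boxes, [125, 125, 125, 125])
```

With these inputs the gap is 131, which lies between 125 and 150, so D_c[5] joins the existing entry at position 3. A recognized character that does not fit any entry starts a new entry whose other positions are wildcards.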

【０２１４】 An example of searching for the keyword K_w in the character recognition result D_c according to the processing procedure shown in FIG. 19 is described below with reference to FIGS. 20A and 20B. Here, the character recognition result D_c is assumed to have the data structure shown in FIG. 2B, and the keyword K_w is assumed to be "琵琶湖畔" (the shore of Lake Biwa).

【０２１５】 Suppose that, of the keyword K_w "琵琶湖畔", processing of "琵琶湖" has been completed and the keyword character "畔" (K_w[3], i = 3) is processed next.

【０２１６】 FIG. 20A shows the state of the detection location data RD_t[0] at the point when the matching process for "琵琶湖" in the keyword K_w "琵琶湖畔" has been completed. Of the already processed "琵琶湖", the characters "琵" and "琶" have characters in the recognition result D_c with matching character codes (D_c[0] and D_c[1], respectively). For the keyword character "湖", no character in the recognition result D_c has a matching character code, so the correspondence is undetermined. For the keyword character "畔", the correspondence is undetermined because it has not yet been determined whether the recognition result D_c contains a character with a matching character code. This determination is made in step S1705 (FIG. 17).

【０２１７】 Now assume j = 5. In step S1705, the character code C_c[5] of the character D_c[5] is determined to be equal to the character code of the keyword character "畔" (K_w[3]). The process therefore proceeds to step S1706. The detailed processing procedure of step S1706 is shown in FIG. 19.

【０２１８】 Now assume t = 0. In step S1903, the character D_c[k] last registered in the detection location data RD_t[0] is the character D_c[1] indicated by reference numeral 2062 in FIG. 20A. Therefore, k = 1.

【０２１９】 In step S1904, the keyword character corresponding to the character D_c[1] of the detection location data RD_t[0] is the keyword character K_w[1]. Therefore, m = 1.

【０２２０】 In step S1905, it is determined whether the keyword characters K_w[m+1] to K_w[i−1] (in this case, K_w[2]) fit, with neither excess nor shortage, in the space between the character D_c[1 (= k)] and the character D_c[5 (= j)]. The space between D_c[1] and D_c[5] is 131, obtained as the difference between the x-coordinate 318 of the lower-right corner of D_c[1] and the x-coordinate 449 of the upper-left corner of D_c[5] shown in FIG. 2B. The character width of the keyword character K_w[2] is obtained by the character width estimation means 403 by a procedure similar to that described with reference to FIG. 5. Assume that the character width of K_w[2] is 125. Since 125 < 131 < 125 × 1.2 (= 150) holds, it is determined that the keyword character K_w[2] fits, with neither excess nor shortage, in the space between D_c[1] and D_c[5]. The process therefore proceeds to step S1909.

【０２２１】 In step S1909, the character D_c[5 (= j)] is registered at the 3rd (= i) position of RD_t[0].

【０２２２】 FIG. 20B shows the state of the detection location data RD_t[0] at the point when the character D_c[5] has been registered. The detection location data RD_t[0] shown in FIG. 20B contains a wildcard. The wildcard search means 401 thus outputs the search result RD_t1, which contains detection location data RD_t[t] that may include wildcards.

【０２２３】 When detection location data contains a wildcard, the character shape search means 402 performs matching for that wildcard.

【０２２４】 FIG. 21 shows the processing procedure for wildcard matching performed by the character shape search means 402.

【０２２５】 The processing procedure shown in FIG. 21 is described below step by step.

【０２２６】 Step S2101: The value "0" is assigned to the variable t. The variable t is the subscript of the detection location data.

【０２２７】 Step S2102: It is determined whether the variable t is equal to the variable N_r. The variable N_r indicates the number of items of detection location data contained in the search result RD_t1 of the wildcard search means 401. If the determination in step S2102 is "Yes", the process proceeds to step S2107. A "Yes" determination means that processing has been completed for the detection location data contained in the search result RD_t1.

【０２２８】 If the determination in step S2102 is "No", the process proceeds to step S2103.

【０２２９】 Step S2103: It is determined whether the detection location data RD_t[t] contains a wildcard. If the determination in step S2103 is "Yes", the process proceeds to step S2104.

【０２３０】 If the determination in step S2103 is "No", the process proceeds to step S2106.

【０２３１】 Step S2104: The keyword character corresponding to the wildcard and the character in the recognition result D_c corresponding to the wildcard are identified.

【０２３２】 Step S2105: The keyword character and the character in the recognition result D_c identified in step S2104 are shape-matched. If the shape matching finds that the keyword character and the character in the recognition result D_c match, the character in the recognition result D_c is registered in the list element of the detection location data RD_t[t] that is a wildcard. If they do not match, that list element is left as a wildcard.

【０２３３】 Step S2106: The variable t is incremented by 1.

【０２３４】 Step S2107: Detection location data that contains a wildcard is deleted. Of the detection location data contained in the search result RD_t1, the detection location data remaining after the items containing wildcards are deleted in step S2107 is output as the search result RD_t.
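The loop of steps S2101 to S2107 might be sketched as follows. This is a minimal illustration under stated assumptions: detection location data is modeled as a Python list with `None` for a wildcard, and the identification and shape matching of steps S2104 and S2105 are delegated to a hypothetical `shape_match(entry, pos)` callback that returns the matched recognized character(s) or `None`.

```python
WILDCARD = None  # marks a list element whose recognized character is undetermined

def resolve_wildcards(rd_t1, shape_match):
    """Try to fill each wildcard by shape matching, then drop unresolved entries."""
    t = 0                                          # S2101
    while t != len(rd_t1):                         # S2102: t == N_r ends the loop
        entry = rd_t1[t]
        for pos, value in enumerate(entry):        # S2103: look for wildcards
            if value is WILDCARD:
                matched = shape_match(entry, pos)  # S2104 and S2105
                if matched is not None:
                    entry[pos] = matched           # register the matched character(s)
                # otherwise the element stays a wildcard
        t += 1                                     # S2106
    # S2107: entries that still contain a wildcard are deleted.
    return [e for e in rd_t1 if not any(v is WILDCARD for v in e)]

# Example with a stub matcher that resolves only position 2 of the first entry.
def stub_match(entry, pos):
    return [2, 3, 4] if entry[:2] == [0, 1] and pos == 2 else None

rd_t = resolve_wildcards(
    [[0, 1, WILDCARD, 5], [WILDCARD, WILDCARD, WILDCARD, 3]],
    stub_match,
)
```

Keeping the matcher as a parameter makes the loop testable without image data; a real matcher would compare image feature quantities.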

【０２３５】 Now assume that one item of detection location data RD_t[0] contained in the search result RD_t1 input to the character shape search means 402 is in the state shown in FIG. 20B. An example of applying the processing procedure shown in FIG. 21 to this detection location data RD_t[0 (= t)] is described below.

【０２３６】 Since the detection location data RD_t[0] contains a wildcard, the determination in step S2103 is "Yes".

【０２３７】 In step S2104, the keyword character corresponding to the wildcard is identified by checking the position of the wildcard within the detection location data RD_t[0]. The keyword character corresponding to the wildcard is thereby identified as K_w[2].

【０２３８】 In this way, in step S2104 the character shape search means 402 functions as first non-matching character identification means that identifies, among the at least one keyword character (first character) contained in the keyword, a first character that does not match the character recognition result as the first non-matching character (in this case, K_w[2]).

【０２３９】 In step S2104, the character in the recognition result D_c corresponding to the wildcard is identified as follows. The character in the recognition result D_c corresponding to the wildcard is a combination of one or more consecutive characters in the recognition result D_c, and is called the second non-matching character.

【０２４０】 In the detection location data RD_t[0], the character D_c[1] is immediately to the left of the wildcard, so the second non-matching character is considered to include the character D_c[2] as its leftmost character. At this point, however, it is unknown which of D_c[2] to D_c[4] is the rightmost character of the second non-matching character. That is, it is unknown whether the second non-matching character consists of D_c[2] alone, of D_c[2] and D_c[3], or of D_c[2], D_c[3], and D_c[4]. The possibility that the rightmost character of the second non-matching character is D_c[5] need not be considered, because D_c[5] is already registered at the 3rd position of the detection location data RD_t[0].

【０２４１】 In step S2104, the second non-matching character whose width is closest to the width of the first non-matching character (in this case, K_w[2]) is identified. The width of the second non-matching character is defined as the width of the region that encloses the partial regions assigned to the individual characters contained in the second non-matching character. In this way, in step S2104 the character shape search means 402 also functions as second non-matching character identification means that identifies, among the at least one second character contained in the character recognition result, one or more consecutive second characters whose width is closest to the width of the first non-matching character as the second non-matching character.

【０２４２】 FIG. 22 shows the relationship between the characters that are combined and the region width.

【０２４３】 The region width of the character D_c[2] is 35, the region width when D_c[2] and D_c[3] are combined is 80, and the region width when D_c[2], D_c[3], and D_c[4] are combined is 125. These region widths are obtained from the character coordinates shown in FIG. 2B.

【０２４４】 Meanwhile, the character width estimation means 403 obtains the width K_ww[2] of the keyword character K_w[2]. If K_ww[2] = 125, the combination of D_c[2], D_c[3], and D_c[4] is identified as the second non-matching character whose width is closest to the width of the first non-matching character (in this case, K_w[2]). The combination of D_c[2], D_c[3], and D_c[4] is written as the list (D_c[2], D_c[3], D_c[4]).
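The choice of the second non-matching character by width can be illustrated with the short sketch below. It is only an illustration: the function name `closest_run` and the (left x, right x) box representation are hypothetical, and the coordinates are made up so that the combined widths reproduce the values 35, 80, 125 and 253 quoted from FIG. 22.

```python
def closest_run(boxes, start, stop, target_width):
    """Return the indices start..end (end < stop) whose enclosing region
    has the width closest to target_width, or None if the range is empty."""
    best, best_err = None, None
    for end in range(start, stop):
        run = boxes[start:end + 1]
        # Region width: from the leftmost left edge to the rightmost right edge.
        width = max(right for _, right in run) - min(left for left, _ in run)
        err = abs(width - target_width)
        if best_err is None or err < best_err:
            best, best_err = list(range(start, end + 1)), err
    return best

# D_c[0]..D_c[5]; D_c[5] is already registered, so candidates stop before index 5.
boxes = [(60, 185), (193, 318), (321, 356), (359, 401), (404, 446), (449, 574)]
second_mismatch = closest_run(boxes, 2, 5, 125)
```

For the estimated keyword character width of 125 the candidates have widths 35, 80 and 125, so the run D_c[2] to D_c[4] is selected.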

【０２４５】 In step S2105 (FIG. 21), the keyword character K_w[2] is shape-matched against the characters (D_c[2], D_c[3], D_c[4]). This shape matching is performed by comparing the feature quantity of the character image (obtained from the document image D_i) enclosed by the rectangle that encompasses the partial regions assigned to D_c[2] to D_c[4] with the feature quantity of the character image of the keyword character K_w[2] (obtained from the character image table 404). Each feature quantity is a vector quantity, and if the Euclidean distance between the two is smaller than a predetermined threshold Thd_1, the list (D_c[2], D_c[3], D_c[4]) is registered at the 2nd position of the detection location data RD_t[0]. Alternatively, the character indicated by reference numeral 2262 in FIG. 22 may be newly generated, and the character 2262 may be registered at the 2nd position of the detection location data RD_t[0]. The newly generated character 2262 represents the character formed by combining D_c[2], D_c[3], and D_c[4].

【０２４６】このように、文字形状検索手段４０２は、
ステップＳ２１０５において、キーワード文字Ｋ
_w［２］（第１不一致文字）の画像の特徴量と、第２不
一致文字に含まれる文字Ｄ_c［２］、Ｄ_c［３］、Ｄ
_c［４］（１または２以上の連続した第２文字）に割り
当てられた１または２以上の部分領域を含む領域の画像
の特徴量とを比較することにより、前記第１不一致文字
が前記第２不一致文字に一致するか否かを判定する第２
判定手段として機能する。In this way, in step S2105, the character shape search means 402 functions as second determination means that determines whether the first non-matching character matches the second non-matching character by comparing the feature amount of the image of keyword character K_w[2] (the first non-matching character) with the feature amount of the image of a region containing the one or more partial areas assigned to characters D_c[2], D_c[3], and D_c[4] (one or more consecutive second characters) included in the second non-matching character.

【０２４７】また、キーワードが「琵琶湖」である別の
例で、図１８に示される検出箇所データＲＤ_t［０］の
ようにワイルドカードが検出箇所データの端にある場合
でも、上述した処理手順と同様の処理手順により第２不
一致文字を特定することができる。In another example, in which the keyword is "琵琶湖" (Lake Biwa), the second non-matching character can be specified by the same procedure as described above even when the wildcard is at the end of the detection location data, as in detection location data RD_t[0] shown in FIG. 18.

【０２４８】検出箇所データＲＤ_t［０］（図１８）で
はワイルドカードの左隣のリスト要素に文字Ｄ_c［１］
が登録されているので、第２不一致文字の左端の文字は
Ｄ_c［２］であると考えられる。文字Ｄ_c［２］を基準と
して、結合する文字の数を可変とし、キーワード文字Ｋ
_w［２］の幅と領域幅（結合された少なくとも１つの文
字の幅、すなわち第２不一致文字の幅）との比較を繰り
返すことにより、第２不一致文字が特定される。この比
較の際に、キーワード文字Ｋ_w［２］の幅に応じて、第
２不一致文字の幅の許容値を算出し、第２不一致文字の
幅がこの許容幅よりも小さいという条件下で、第２不一
致文字を特定してもよい。例えば、許容幅をキーワード
文字Ｋ_w［２］の１．２倍とすると、結合する文字の数
が４である場合（Ｄ_c［２］＋Ｄ_c［３］＋Ｄ_c［４］＋
Ｄ_c［５］の場合）の領域幅は、図２２の参照符号２２
６１に示されるように、２５３となり、許容幅（１２５
×１．２＝１５０）を超える。従って、結合する文字の
数が４以上の場合は考慮する必要がなくなり、文字Ｄ_c
［２］、Ｄ_c［３］およびＤ_c［４］が、第２不一致文字
として特定される。In detection location data RD_t[0] (FIG. 18), character D_c[1] is registered in the list element immediately to the left of the wildcard, so the leftmost character of the second non-matching character is taken to be D_c[2]. With character D_c[2] as the starting point, the number of characters to be combined is varied, and the width of keyword character K_w[2] is repeatedly compared with the region width (the width of the one or more combined characters, that is, the width of the second non-matching character) to specify the second non-matching character. In this comparison, an allowable width for the second non-matching character may be calculated from the width of keyword character K_w[2], and the second non-matching character may be specified under the condition that its width is smaller than this allowable width. For example, if the allowable width is 1.2 times the width of keyword character K_w[2], then when four characters are combined (D_c[2] + D_c[3] + D_c[4] + D_c[5]) the region width is 253, as indicated by reference numeral 2261 in FIG. 22, which exceeds the allowable width (125 × 1.2 = 150). Cases in which four or more characters are combined therefore need not be considered, and characters D_c[2], D_c[3], and D_c[4] are specified as the second non-matching character.
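The width comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `find_second_mismatch` is hypothetical, the region width is approximated as the sum of the individual character widths (gaps between characters are ignored), and the individual widths in the example are invented so that four combined characters give 253 as in the text. Only the keyword width 125, the factor 1.2, and the total 253 come from the description.

```python
def find_second_mismatch(widths, start, keyword_width, tolerance=1.2):
    """Grow a run of characters rightward from index `start` and return
    (end_exclusive, region_width) for the run whose width is closest to
    keyword_width while staying below the allowable width.
    Returns None if even a single character is too wide."""
    limit = keyword_width * tolerance  # allowable width
    best = None
    region = 0
    for end in range(start, len(widths)):
        region += widths[end]
        if region >= limit:
            # Widths are positive, so longer runs can only be wider.
            break
        diff = abs(region - keyword_width)
        if best is None or diff < best[0]:
            best = (diff, end + 1, region)
    return None if best is None else (best[1], best[2])


# Hypothetical widths for D_c[0] .. D_c[5]; D_c[2..5] sum to 253.
widths = [50, 48, 40, 45, 42, 126]
print(find_second_mismatch(widths, start=2, keyword_width=125))
```

With these assumed widths the run D_c[2] to D_c[4] (width 127) is selected, and the four-character run is rejected because 253 exceeds the allowable width of 150.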

【０２４９】別の状況で、ワイルドカードが連続して存
在する場合、上述した処理手順により、１つずつ第２不
一致文字を特定することが可能である。In another situation, when wildcards occur consecutively, the second non-matching characters can be specified one at a time by the procedure described above.

【０２５０】上述した例では、第２不一致文字を特定す
る際に、第２不一致文字の左端の文字（Ｄ_c［２］）を
固定し、その右側に結合する文字の数を可変としてい
た。これとは逆に、第２不一致文字を特定する際に、第
２不一致文字の右端の文字を固定し、その左側に結合す
る文字の数を可変とする処理も、上述した処理手順と同
様に行われ得る。In the example above, when the second non-matching character is specified, its leftmost character (D_c[2]) is fixed and the number of characters combined to its right is varied. Conversely, the rightmost character of the second non-matching character may be fixed and the number of characters combined to its left varied; this processing can be performed in the same way as the procedure described above.

【０２５１】なお、ワイルドカードが連続している場合
や、検出箇所データの端にワイルドカード文字が存在す
る場合、第２不一致文字として特定される可能性のある
文字を可能な限り小さな要素に分割し、この要素に対し
てステップＳ２１０４およびステップＳ２１０５の処理
を行ってもよい。この分割は、ＯＣＲ装置２０２によっ
て隣接する文字が結合して認識された場合でも検索漏れ
を防ぐために行われる。文書が横書きである場合、この
分割は、例えば、垂直方向に黒画素の射影ヒストグラム
を求め、射影ヒストグラムが予め定められた閾値よりも
小さい部分で行われる。When wildcards occur consecutively, or when a wildcard character is at the end of the detection location data, the characters that may be specified as the second non-matching character may be divided into the smallest possible elements, and the processing of steps S2104 and S2105 may be performed on these elements. This division is performed to prevent search omissions even when the OCR device 202 has recognized adjacent characters as a single combined character. When the document is written horizontally, the division is made, for example, by obtaining a projection histogram of black pixels in the vertical direction and splitting where the histogram falls below a predetermined threshold.
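The projection-histogram division can be sketched in a few lines. The representation of the image as a list of rows of 0/1 values (1 for a black pixel) and the function name `split_by_projection` are assumptions for illustration; the patent does not prescribe a data format or a threshold value.

```python
def split_by_projection(image, threshold=1):
    """Split a binary image (rows of 0/1, 1 = black pixel) into column
    ranges. Black pixels are counted per column, and the image is cut
    wherever the count falls below `threshold`.
    Returns a list of (start_column, end_column_exclusive) pairs."""
    if not image:
        return []
    # Vertical projection: number of black pixels in each column.
    histogram = [sum(column) for column in zip(*image)]
    segments = []
    start = None
    for x, count in enumerate(histogram):
        if count >= threshold and start is None:
            start = x                      # a new element begins
        elif count < threshold and start is not None:
            segments.append((start, x))    # the element ends at a gap
            start = None
    if start is not None:
        segments.append((start, len(histogram)))
    return segments


image = [
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 1],
]
print(split_by_projection(image))
```

Each returned column range is a candidate element to which the width comparison and shape matching of steps S2104 and S2105 could then be applied.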

【０２５２】なお、ワイルドカード検索手段４０１は、
キーワード中の文字が少なくとも１つ以上文字認識結果
に一致する検出箇所データを検索結果ＲＤ_t1として出力
したが、キーワード中の文字が予め指定した数以上文字
認識結果に一致する検出箇所データを検索結果ＲＤ_t1と
して出力してもよい。例えば、ワイルドカード検索手段
４０１は、キーワード中の文字のうち、半分以上が文字
認識結果に一致する検出箇所データを出力してもよい。In the description above, the wildcard search means 401 outputs, as the search result RD_t1, detection location data in which at least one character in the keyword matches the character recognition result. It may instead output, as the search result RD_t1, detection location data in which at least a pre-specified number of characters in the keyword match the character recognition result. For example, the wildcard search means 401 may output detection location data in which half or more of the characters in the keyword match the character recognition result.
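A minimal sketch of the "half or more" criterion follows. It assumes that a detection location is represented as a list aligned with the keyword, holding the matched character at each position or `None` where a wildcard was used; this representation and the name `enough_matches` are hypothetical and are not taken from the patent.

```python
def enough_matches(keyword, candidate, min_ratio=0.5):
    """Return True when at least min_ratio of the keyword characters
    match the aligned candidate entries (None marks a wildcard)."""
    if len(keyword) != len(candidate):
        raise ValueError("candidate must be aligned with the keyword")
    matched = sum(1 for k, c in zip(keyword, candidate) if k == c)
    return matched >= len(keyword) * min_ratio


keyword = "琵琶湖畔"
print(enough_matches(keyword, ["琵", None, "湖", None]))   # 2 of 4 match
print(enough_matches(keyword, ["琵", None, None, None]))   # 1 of 4 match
```

Raising `min_ratio` reduces the number of locations passed on to shape matching, at the risk of discarding locations where recognition errors were frequent.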

【０２５３】また、ワイルドカード検索手段４０１にお
いて、キーワード文字の文字コードと文字認識結果Ｄ_c
中の文字の文字コードが一致するか否かを判定していた
が、キーワード文字の類似文字の文字コードと、文字認
識結果Ｄ_c中の文字の文字コードが一致するか否かを判
定してもよい。類似文字とは、例えば、（カタカナの
「タ」と漢字の「夕」）、（「犬」と「大」と「太」）
などのように、形状の類似した文字を意味する。In the description above, the wildcard search means 401 determines whether the character code of a keyword character matches the character code of a character in the character recognition result D_c. It may also determine whether the character code of a character similar to the keyword character matches the character code of a character in the character recognition result D_c. Similar characters are characters with similar shapes, such as (the katakana "タ" and the kanji "夕") or ("犬", "大", and "太").
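Similar-character matching of this kind can be sketched with a lookup table. The two groups below are the examples given in the text; the table structure and the names `build_similar_map` and `codes_match` are assumptions for illustration only.

```python
# Groups of visually similar characters, taken from the examples above.
SIMILAR_GROUPS = [
    {"タ", "夕"},        # katakana "ta" and the kanji for "evening"
    {"犬", "大", "太"},
]


def build_similar_map(groups):
    """Map each character to the set of other characters in its group."""
    similar = {}
    for group in groups:
        for ch in group:
            similar.setdefault(ch, set()).update(group - {ch})
    return similar


def codes_match(keyword_char, recognized_char, similar):
    """True if the characters are identical, or if the recognized
    character is listed as similar to the keyword character."""
    if keyword_char == recognized_char:
        return True
    return recognized_char in similar.get(keyword_char, set())


similar = build_similar_map(SIMILAR_GROUPS)
print(codes_match("大", "犬", similar))
print(codes_match("大", "夕", similar))
```

In this sketch a keyword character "大" is accepted when the recognition result contains "犬" or "太", but not when it contains an unrelated character.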

【０２５４】以上のように、本発明の実施の形態２の文
書検索装置４５１によれば、文字コードの比較におい
て、キーワードに含まれるキーワード文字のうち１文字
でも文字認識結果Ｄ_c中の文字と一致すれば、文字認識
結果Ｄ_cのうち、その近傍を対象として画像の特徴量の
比較に基づく照合が行われる。文字コードの比較におい
て、キーワードの文字の全てが文字認識結果Ｄ_c中の文
字と一致することは必要ではない。従って、文字認識の
誤りに起因する検索漏れを減らすことができる。また、
画像の特徴量の比較に基づく検索の対象は、文字コード
の比較においてキーワードの文字のうち１文字でも文字
認識結果Ｄ_c中の文字と一致した検出箇所に限定される
ので、検索にかかるコスト（時間および計算量）は低く
て済む。As described above, according to the document search device 451 of Embodiment 2 of the present invention, if even one of the keyword characters included in the keyword matches a character in the character recognition result D_c in the character code comparison, matching based on a comparison of image feature amounts is performed on the neighborhood of that character in the character recognition result D_c. The character code comparison does not require every character of the keyword to match a character in the character recognition result D_c, so search omissions caused by character recognition errors can be reduced. In addition, because the search based on the comparison of image feature amounts is limited to detection locations where at least one keyword character matched a character in the character recognition result D_c in the character code comparison, the cost of the search (time and amount of computation) remains low.

【０２５５】（実施の形態３）図２３は本発明の実施の
形態３の文書検索システム１５６１の構成を示す。文書
検索システム１５６１は、実施の形態１および実施の形
態２で説明された文書検索装置の利用形態の一例であ
る。(Third Embodiment) FIG. 23 shows the configuration of a document search system 1561 according to a third embodiment of the present invention. The document search system 1561 is an example of a usage pattern of the document search device described in the first and second embodiments.

【０２５６】文書検索システム１５６１は、第１の通信
手段１５０１と、センター１５０２と、画像登録サーバ
１５０３と、画像検索サーバ１５０４と、画像データベ
ース１５０５と、第２の通信手段１５０６と、端末１５
０７とを備える。The document search system 1561 includes a first communication means 1501, a center 1502, an image registration server 1503, an image search server 1504, an image database 1505, a second communication means 1506, and a terminal 1507.

【０２５７】第１の通信手段１５０１と第２の通信手段
１５０６とは、通信回線１５０９を介して通信を行う。
通信回線１５０９は、例えば、電話回線（ＰＨＳ、携帯
電話を含む）やインターネット（無線または有線）であ
り得る。The first communication means 1501 and the second communication means 1506 communicate with each other via a communication line 1509. The communication line 1509 can be, for example, a telephone line (including PHS and mobile phone lines) or the Internet (wireless or wired).

【０２５８】画像登録サーバ１５０３は、ＯＣＲによる
文字認識の機能を有する。The image registration server 1503 has a function of character recognition by OCR.

【０２５９】端末１５０７は、例えば、スキャナを備え
ており、オリジナルの文書から文書画像データを得るこ
とができる。あるいは、端末１５０７は、デジタルカメ
ラで撮影した文書画像データを取込むことができる。The terminal 1507 is equipped with a scanner, for example, and can obtain document image data from an original document. Alternatively, the terminal 1507 can capture document image data captured by a digital camera.

【０２６０】画像検索サーバ１５０４は、例えば、実施
の形態１および実施の形態２で説明された文書検索装置
を備える。The image search server 1504 includes, for example, the document search device described in the first and second embodiments.

【０２６１】ユーザは端末１５０７に、スキャナやデジ
タルカメラ等により得られた文書画像データを入力す
る。端末１５０７は、この文書画像データをセンター１
５０２に送信する。センター１５０２は文書画像データ
を受け取り、画像登録サーバ１５０３に送る。画像登録
サーバ１５０３は文書画像データに対してＯＣＲによる
文字認識を行い、文字認識結果と文書画像データとを画
像データベース１５０５に保存する。The user inputs document image data obtained with a scanner, digital camera, or the like into the terminal 1507. The terminal 1507 transmits this document image data to the center 1502. The center 1502 receives the document image data and passes it to the image registration server 1503. The image registration server 1503 performs OCR character recognition on the document image data and stores the character recognition result and the document image data in the image database 1505.

【０２６２】ユーザはセンター１５０２と通信可能な任
意の端末から、画像データベース１５０５に保存された
文書を検索することができる。また、閲覧・印刷・回覧
等のサービスも利用することができる。画像データベー
ス１５０５に保存された文書の閲覧は、画像閲覧ソフト
を介して行われる。画像閲覧ソフトとしては、例えば、
ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａ
ｎｇｕａｇｅ）形式の文書を閲覧するブラウザが使用さ
れ得る。The user can search the documents stored in the image database 1505 from any terminal that can communicate with the center 1502. Services such as viewing, printing, and circulating documents are also available. Documents stored in the image database 1505 are viewed through image viewing software; for example, a browser for viewing documents in HTML (HyperText Markup Language) format may be used.

【０２６３】センター１５０２は、個人同定手段を有し
ており、画像テータベース１５０５をユーザごとに専用
化したり、サービスの利用に対する課金をユーザごとに
行うことが可能である。The center 1502 has personal identification means, which makes it possible to dedicate the image database 1505 to individual users and to charge each user for use of the service.

【０２６４】個人同定手段としては、公知の技術による
指紋照合システムやパスワードが使用され得る。As a personal identification means, a fingerprint collation system or password according to a known technique can be used.

【０２６５】このように、本発明の実施の形態３の文書
検索システム１５６１によれば、ユーザは、ユーザの保
有する文書を、いつでも、どこからでも閲覧・検索する
ことが可能になる。As described above, according to the document search system 1561 of the third embodiment of the present invention, the user can browse and search the document owned by the user anytime, anywhere.

【０２６６】上述した実施の形態１および２で説明した
文書検索処理は、プログラムの形式で記録媒体に記録さ
れ得る。記録媒体としては、フロッピー（登録商標）デ
ィスクやＣＤ−ＲＯＭなどのコンピュータによって読み
取り可能な任意のタイプの記録媒体を使用することがで
きる。記録媒体から読み出された文書検索処理プログラ
ムをコンピュータにインストールすることにより、その
コンピュータを文書検索装置として機能させることが可
能になる。The document search process described in the first and second embodiments can be recorded in the recording medium in the form of a program. As the recording medium, any type of computer-readable recording medium such as a floppy (registered trademark) disk or a CD-ROM can be used. By installing the document search processing program read from the recording medium into the computer, the computer can be made to function as the document search device.

【０２６７】なお、上述した実施の形態１および２で
は、日本語の文書を例にとり説明した。しかし、本発明
の適用は、日本語の文書に限定されない。他の任意の言
語の文書（例えば、中国語の文書、英語の文書、韓国語
の文書）に本発明を適用することも可能である。In the above-described first and second embodiments, a Japanese document has been described as an example. However, the application of the present invention is not limited to Japanese documents. It is also possible to apply the present invention to documents in any other language (for example, Chinese documents, English documents, Korean documents).

【０２６８】また、上述した実施の形態１および２で
は、形状照合手段において、キーワード文字の文字画像
ＫＣ_iの特徴量と文書画像データ中の文字画像Ｃ_iの特徴
量とを比較する際の閾値Ｔｈｄ₁は所定の値であるとし
た。閾値Ｔｈｄ₁は、キーワード文字の文字コードに応
じて変化させてもよい。例えば、予め文字画像テーブル
を用いて、使用されているフォントの文字画像の特徴量
と、文書画像データ中の文字画像の特徴量との距離の確
率分布を求め、任意の確率を設定することにより閾値Ｔ
ｈｄ₁を決めることができる。また、閾値Ｔｈｄ₁を制御
することにより、検索の精度を自由に制御することが可
能になる。In Embodiments 1 and 2 described above, the threshold Thd₁ that the shape matching means uses when comparing the feature amount of the character image KC_i of a keyword character with the feature amount of a character image C_i in the document image data was assumed to be a predetermined value. The threshold Thd₁ may instead be varied according to the character code of the keyword character. For example, the character image table can be used in advance to obtain the probability distribution of the distance between the feature amount of a character image in the font being used and the feature amount of a character image in the document image data, and the threshold Thd₁ can then be determined by setting an arbitrary probability. Controlling the threshold Thd₁ in this way also makes it possible to control the search accuracy freely.
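One way to read the probability-based choice of Thd₁ is as an empirical quantile of sampled distances. The sketch below assumes that a list of distances between font character images and correctly corresponding document character images has already been measured; the function name `threshold_from_distances` and the sample values are hypothetical, and the patent does not state which estimator is used.

```python
import math


def threshold_from_distances(distances, probability):
    """Return the smallest sampled distance d such that at least the
    given fraction of the samples is less than or equal to d."""
    if not distances:
        raise ValueError("at least one distance sample is required")
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    ordered = sorted(distances)
    k = math.ceil(probability * len(ordered))  # number of samples to cover
    return ordered[k - 1]


# Hypothetical distance samples for one keyword character.
samples = [4, 1, 3, 2]
print(threshold_from_distances(samples, 0.5))
print(threshold_from_distances(samples, 1.0))
```

Choosing a higher probability yields a larger threshold, which accepts more candidate matches (fewer omissions, more false detections); a lower probability does the opposite. Keeping one such threshold per character code would correspond to varying Thd₁ with the keyword character.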

【０２６９】[0269]

【発明の効果】本発明によれば、検索漏れを減らすこと
ができる文書検索装置および記録媒体を提供することが
できる。As described above, according to the present invention, it is possible to provide a document retrieval device and a recording medium capable of reducing omission of retrieval.

【０２７０】本発明によれば、文字認識結果からキーワ
ードを検索する際に、まず、文字コードの比較に基づく
検索がなされる。次に、文字コードの比較に基づく検索
によってキーワードと一致しなかった部分のうち、所定
の条件を満たす部分について画像の特徴量の比較に基づ
く検索がなされる。これによって、文字コードの比較に
基づく検索において発生し得る検索漏れは、画像の特徴
量の比較に基づく検索によってカバーされる。従って、
文字認識の誤りに起因する検索漏れを減らすことができ
る。また、画像の特徴量の比較に基づく検索の対象は、
所定の条件を満たす部分に限定されるので、検索にかか
るコスト（時間および計算量）は低くて済む。According to the present invention, when a keyword is searched for in a character recognition result, a search based on a comparison of character codes is performed first. Next, among the portions that did not match the keyword in that search, the portions satisfying a predetermined condition are searched based on a comparison of image feature amounts. Search omissions that can occur in the search based on the comparison of character codes are thereby covered by the search based on the comparison of image feature amounts, so search omissions caused by character recognition errors can be reduced. In addition, because the search based on the comparison of image feature amounts is limited to the portions satisfying the predetermined condition, the cost of the search (time and amount of computation) remains low.

【０２７１】本発明によれば、文字コードの比較におい
て、キーワードの文字のうち１文字でも文字認識結果中
の文字と一致すれば、その近傍を対象として画像の特徴
量の比較に基づく照合が行われる。文字コードの比較に
おいて、キーワードの文字の全てが文字認識結果中の文
字と一致することは必要ではない。従って、文字認識の
誤りに起因する検索漏れを減らすことができる。また、
画像の特徴量の比較に基づく検索の対象は、文字コード
の比較においてキーワードの文字のうち１文字でも文字
認識結果中の文字と一致した検出箇所に限定されるの
で、検索にかかるコスト（時間および計算量）は低くて
済む。According to the present invention, if even one of the characters of the keyword matches a character in the character recognition result in the character code comparison, matching based on a comparison of image feature amounts is performed on the neighborhood of that character. The character code comparison does not require every character of the keyword to match a character in the character recognition result, so search omissions caused by character recognition errors can be reduced. In addition, because the search based on the comparison of image feature amounts is limited to detection locations where at least one character of the keyword matched a character in the character recognition result in the character code comparison, the cost of the search (time and amount of computation) remains low.

[Brief description of drawings]

【図１】文書ファイリングシステム２１０の構成を示す
図FIG. 1 is a diagram showing the configuration of a document filing system 210.

【図２Ａ】文書画像データＤ_iの例を示す図FIG. 2A is a diagram showing an example of document image data D _i .

【図２Ｂ】文書画像データＤ_iについて文字認識を実行
した結果である文字認識結果Ｄ_cのデータ構造を示す図FIG. 2B is a diagram showing a data structure of a character recognition result D _c which is a result of performing character recognition on document image data D _i .

【図２Ｃ】キーワードＫ_wのデータ構造を示す図FIG. 2C is a diagram showing a data structure of a keyword K _w .

【図２Ｄ】検索結果ＲＤ_tのデータ構造を示す図FIG. 2D is a diagram showing a data structure of a search result RD _t .

【図３】文字ブロックデータＤ_tの構造を示す図FIG. 3 is a diagram showing a structure of character block data D _t .

【図４】本発明の実施の形態１の文書検索装置１の構成
を示すブロック図FIG. 4 is a block diagram showing a configuration of a document search device 1 according to the first embodiment of the present invention.

【図５】文字幅推定手段１０４が、キーワードＫ_wに含
まれる各文字Ｋ_w［ｉ］の文字幅を推定する例を示す図FIG. 5 is a diagram showing an example in which a character width estimation unit 104 estimates a character width of each character K _w [i] included in a keyword K _w .

【図６】所定の条件を満たす候補部分ＳＤ_cを特定する
処理の例を示す図FIG. 6 is a diagram showing an example of processing for identifying a candidate portion SD _c that satisfies a predetermined condition.

【図７】文字形状検索手段１０３の詳細な構成を示すブ
ロック図FIG. 7 is a block diagram showing a detailed configuration of a character shape search unit 103.

【図８】文字画像抽出手段３０１においてキーワードＫ
_w中の特定の文字と対応する候補部分ＳＤ_c中の文字が特
定される例を示す図FIG. 8 is a diagram showing an example in which the character image extracting means 301 identifies the character in the candidate portion SD_c that corresponds to a specific character in the keyword K_w.

【図９】文字画像Ｃ_iから特徴量（ベクトル量）を求め
る方法の例を示す図FIG. 9 is a diagram showing an example of a method for obtaining a feature amount (vector amount) from a character image C _i .

【図１０】文字形状検索手段１０３のバリエーションと
しての文字形状検索手段１０３ａの構成を示すブロック
図FIG. 10 is a block diagram showing the configuration of a character shape search unit 103a as a variation of the character shape search unit 103.

【図１１】文書検索装置１のバリエーションとしての文
書検索装置７０１の構成を示すブロック図FIG. 11 is a block diagram showing the configuration of a document search device 701 as a variation of the document search device 1.

【図１２】文書検索装置７０１のバリエーションとして
の文書検索装置８０１の構成を示すブロック図FIG. 12 is a block diagram showing the configuration of a document search device 801 as a variation of the document search device 701.

【図１３】文書検索装置７０１のバリエーションとして
の文書検索装置９０１の構成を示すブロック図FIG. 13 is a block diagram showing the configuration of a document search device 901 as a variation of the document search device 701.

【図１４】文書検索装置１のバリエーションとしての文
書検索装置１１５１の構成を示すブロック図FIG. 14 is a block diagram showing the configuration of a document search device 1151 as a variation of the document search device 1.

【図１５】文書検索装置１１５１のバリエーションとし
ての文書検索装置１０５１の構成を示すブロック図FIG. 15 is a block diagram showing the configuration of a document search device 1051 as a variation of the document search device 1151.

【図１６】本発明の実施の形態２の文書検索装置４５１
の構成を示すブロック図FIG. 16 is a block diagram showing the configuration of a document search device 451 according to the second embodiment of the present invention.

【図１７】ワイルドカード検索手段４０１における処理
手順を示すフローチャートFIG. 17 is a flowchart showing a processing procedure in the wildcard search means 401.

【図１８】図１７に示される処理手順によって検索され
た検索結果ＲＤ_t1を示す図FIG. 18 is a diagram showing the search result RD_t1 obtained by the processing procedure shown in FIG. 17.

【図１９】図１７に示されるステップＳ１７０６の詳細
な処理手順を示すフローチャートFIG. 19 is a flowchart showing a detailed processing procedure of step S1706 shown in FIG.

【図２０Ａ】キーワードＫ_w「琵琶湖畔」のうち、「琵
琶湖」についての照合処理が完了した時点の検出箇所デ
ータＲＤ_t［０］の状態を示す図FIG. 20A is a diagram showing the state of detection location data RD_t[0] at the point when matching has been completed for "琵琶湖" within the keyword K_w "琵琶湖畔".

【図２０Ｂ】文字Ｄ_c［５］を登録した時点における検
出箇所データＲＤ_t［０］の状態を示す図FIG. 20B is a diagram showing a state of detection point data RD _t [0] when the character D _c [5] is registered.

【図２１】文字形状検索手段４０２におけるワイルドカ
ードの照合の処理手順を示すフローチャートFIG. 21 is a flowchart showing a wild card matching process procedure in the character shape searching unit 402.

【図２２】結合される文字と領域幅の関係を示す図FIG. 22 is a diagram showing a relationship between a combined character and a region width.

【図２３】本発明の実施の形態３の文書検索システム１
５６１の構成を示す図FIG. 23 is a diagram showing the configuration of a document search system 1561 according to the third embodiment of the present invention.

【図２４Ａ】オリジナルの文書中に含まれる文字「本」
および「口」が、文字認識における誤りにより、それぞ
れ形状の類似した「木」および「区」という文字に対応
する文字コードに変換されている例を示す図FIG. 24A is a diagram showing an example in which the characters "本" and "口" contained in an original document have been converted, through character recognition errors, into the character codes of the similarly shaped characters "木" and "区", respectively.

【図２４Ｂ】類似文字のリストの例を示す図FIG. 24B is a diagram showing an example of a list of similar characters.

[Explanation of symbols]

１、４５１、７０１、８０１、９０１、１０５１、１１
５１文書検索装置１０１テキスト検索手段１０２文字特定手段１０３、４０２文字形状検索手段１０４、４０３文字幅推定手段２０１画像入力装置２０２ＯＣＲ装置２０３文書データベース２０４文書検索装置２０５表示装置２１０文書ファイリングシステム３０１文字画像抽出手段３０２、４０４文字画像テーブル３０３形状照合手段３０４、６０６照合制御手段４０１ワイルドカード検索手段６０２類似文字照合手段７０５検索精度制御手段８０５品質情報抽出手段９０５検索精度指定手段１００２、１１０２類似文字列平均化手段１１０３文字列再検出手段
1, 451, 701, 801, 901, 1051, 1151 document search device
101 text search means
102 character specifying means
103, 402 character shape search means
104, 403 character width estimating means
201 image input device
202 OCR device
203 document database
204 document search device
205 display device
210 document filing system
301 character image extracting means
302, 404 character image table
303 shape matching means
304, 606 matching control means
401 wildcard search means
602 similar character matching means
705 search accuracy control means
805 quality information extracting means
905 search accuracy specifying means
1002, 1102 similar character string averaging means
1103 character string re-detecting means

フロントページの続き
(51)Int.Cl.⁷ 識別記号 ＦＩ: Ｇ０６Ｋ 9/00, Ｇ０６Ｋ 9/00 Ｓ; 9/62 ６２０, 9/62 ６２０Ｄ
(72)発明者 目片強司 大阪府門真市大字門真1006番地 松下電器産業株式会社内
(56)参考文献 特開2001−337993（ＪＰ，Ａ); 松川善彦、今川太郎、近藤堅司、目方強司，形状特徴検索併用による文書画像検索の性能向上，電子情報通信学会技術研究報告，日本，社団法人電子情報通信学会，1999年９月16日，第99巻、第 305号、ＰＲＭＵ99−74，第77−83頁
(58)調査した分野(Int.Cl.⁷，ＤＢ名) G06F 17/30 310 - 419; G06K 9/00 - 9/72

Continuation of the front page
(51) Int.Cl.⁷ / identification symbol / FI: G06K 9/00, FI G06K 9/00 S; 9/62 620, FI 9/62 620D
(72) Inventor: 目片強司, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture
(56) References: JP 2001-337993 (JP, A); 松川善彦, 今川太郎, 近藤堅司, 目方強司, "Performance improvement of document image retrieval by combined use of shape feature search," IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers (Japan), September 16, 1999, vol. 99, no. 305, PRMU99-74, pp. 77-83
(58) Fields searched (Int.Cl.⁷, DB name): G06F 17/30 310 - 419; G06K 9/00 - 9/72

Claims

(57) [Claims]

1. A document search device for searching for a keyword in a recognition result obtained by performing character recognition on an image of a document, wherein the keyword includes at least one first character, a character code and a character image are assigned to each of the at least one first character, the recognition result includes at least one second character, a character code and a partial area of the image of the document are assigned to each of the at least one second character, and the document search device comprises: first matching portion specifying means for determining, based on a comparison of the character codes, whether at least one first matching portion that matches the keyword exists in the recognition result and, if so, specifying the at least one first matching portion; first portion specifying means for determining whether at least one first portion satisfying a predetermined first condition exists in the portion of the recognition result that excludes the specified at least one first matching portion and, if so, specifying the at least one first portion; and second matching portion specifying means for determining, based on a comparison between the feature amount of the image of the partial area assigned to the second character included in the first portion and the feature amount of the character image of the first character included in the keyword, whether at least one second matching portion that matches the keyword exists in the specified at least one first portion and, if so, specifying the at least one second matching portion, wherein the predetermined first condition includes a condition that the first portion exists in the vicinity of a specific second character having a width smaller than a predetermined width.

2. A document search device for searching for a keyword in a recognition result obtained by performing character recognition on an image of a document, wherein the keyword includes at least one first character, a character code and a character image are assigned to each of the at least one first character, the recognition result includes at least one second character, a character code, a partial area of the image of the document, and a reliability indicating the certainty of the character recognition obtained when the character recognition is performed are assigned to each of the at least one second character, and the document search device comprises: first matching portion specifying means for determining, based on a comparison of the character codes, whether at least one first matching portion that matches the keyword exists in the recognition result and, if so, specifying the at least one first matching portion; first portion specifying means for determining whether at least one first portion satisfying a predetermined first condition exists in the portion of the recognition result that excludes the specified at least one first matching portion and, if so, specifying the at least one first portion; and second matching portion specifying means for determining, based on a comparison between the feature amount of the image of the partial area assigned to the second character included in the first portion and the feature amount of the character image of the first character included in the keyword, whether at least one second matching portion that matches the keyword exists in the specified at least one first portion and, if so, specifying the at least one second matching portion, wherein the predetermined first condition includes a condition that the first portion exists in the vicinity of a specific second character to which a reliability smaller than a predetermined threshold value is assigned.

3. The document search device according to claim 2, further comprising: means for determining the image quality of the image of the document; and means for determining the predetermined threshold value based on the determined image quality of the image.

4. The document search device according to claim 1, wherein the second matching portion specifying means comprises: first determining means for determining whether the character code of the second character included in the first portion matches the character code of a specific first character included in the keyword; non-matching character specifying means for specifying, when the character code of the second character included in the first portion does not match the character code of the specific first character included in the keyword, one or more consecutive second characters that include at least the second character included in the first portion and that have the width closest to the width of the specific first character, as a non-matching character; and second determining means for determining that the specific first character matches the non-matching character when the distance between the feature amount of the image of the specific first character and the feature amount of the image of a region including the one or more partial areas assigned to the one or more consecutive second characters included in the non-matching character is smaller than a predetermined value.

5. The document search device according to claim 1, further comprising: calculating means for calculating a predetermined determination reference value from the at least one first matching portion; and detecting means for detecting, based on the determination reference value, a second matching portion satisfying a predetermined second condition from among the at least one second matching portion.

6. The document search device according to claim 5, wherein the calculating means calculates the determination reference value based on the feature amount of the image of at least one partial area assigned to the at least one second character included in the at least one first matching portion, and the second condition includes a condition that the distance between the feature amount of the image of at least one partial area assigned to the at least one second character included in the at least one second matching portion and the determination reference value is smaller than a predetermined value.