JPH11232296A

JPH11232296A - Document filing system and document filing method

Info

Publication number: JPH11232296A
Application number: JP10035545A
Authority: JP
Inventors: Taizou Kameshiro; 泰三亀代; Yasuhiro Okada; 康裕岡田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1998-02-18
Filing date: 1998-02-18
Publication date: 1999-08-27
Anticipated expiration: 2018-02-18
Also published as: JP3589007B2

Abstract

PROBLEM TO BE SOLVED: To provide a document filing system and a document filing method that eliminate search omissions and erroneous extraction and can retrieve documents with high accuracy when documents are searched using character recognition. SOLUTION: Based on the results of character recognition and post-processing, characters whose character code is uniquely determined are separated from characters whose character code is not uniquely determined, and each is stored together with a flag that distinguishes the two kinds. At search time, a uniquely determined character is regarded as successfully collated when its character code coincides with that of the keyword character. For characters that are not uniquely determined, the degree of matching between the keyword and the text in the document is calculated using the similarity output by a similarity output part 7.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a document filing system and a document filing method for electronically filing images such as documents and drawings, and more particularly to a document filing system and a document filing method for retrieving filed documents using keywords.

[0002]

2. Description of the Related Art

Conventionally, in order to store document images electronically and then retrieve and display them, a method has been used in which keyword information is attached manually to each document image before it is stored. To save the labor of manual keyword entry, another method uses a system with a character recognition function: the characters in the document image are recognized, and the relevant keywords or the full text are stored together with the document image.

[0003] In the latter case, misrecognition occurs because the character recognition performance of the system is imperfect. As a result, a character string different from the keyword entered for the search may be displayed as a search result, which is called "erroneous extraction". Conversely, characters in a document image that are identical to the entered keyword may fail to appear in the search results because they were misrecognized, which is called "search omission".

[0004] Therefore, to improve search accuracy, erroneous extraction and search omission must be kept to a minimum. Methods for reducing them include (1) improving character recognition performance (raising the rate at which the correct character is retained), and (2) tolerating partial mismatches between the input keyword and the search target character string, thereby compensating for imperfect character recognition.

[0005] One example of (1) is to retain multiple character recognition results for each character image, which raises the probability that the correct character is retained. For example, in "Experimental Study on Fusion Technology of Document Recognition and Full Text Search" (Information Processing Society of Japan, SIG Fundamentals of Informatics 39-9, September 1995, Marukawa et al.), the number of candidate characters retained is varied according to the similarity of each recognition result, and multiple candidates are retained, which enables more accurate search than retaining a single character per character image.

[0006] Also, for example, the filing apparatus disclosed in Japanese Patent Application Laid-Open No. H8-272813 retains a fixed number of recognition results, up to the fourth candidate, for each character image, and matches the search keyword against document codes that include the candidate characters.

[0007] As an example of a method according to (2), as described in the above Japanese Patent Application Laid-Open No. H8-272813, the degree of matching m between a keyword and the recognition result is calculated as

m = (number of matched characters / number of characters in the keyword) × 100 (%) …(1)

and a result is output as a search result even if not all of the search characters are contained in the candidate characters obtained by character recognition.

[0008] The operation of the apparatus according to Japanese Patent Application Laid-Open No. H8-272813 is briefly described below.

<Data storage method> FIG. 17 is a block diagram of the filing apparatus according to Japanese Patent Application Laid-Open No. H8-272813. In the figure, a document image read by a scanner 101 is converted into a digital signal by a scanner interface (I/F) circuit 102. When the document is a character image, the CPU 105, having received the signal from the I/F circuit 102, performs character recognition and outputs up to four candidate characters per character of the character image, as recognition results, to an external storage device 110 serving as document storage means.

[0009] The RAM 107 is a work area for expanding character images and for character recognition processing. The external storage device 110 is a device, such as a hard disk, that stores registered data. Here it stores not only the accumulated data but also the dictionary for character recognition.

[0010] When filing an input document image, this filing apparatus registers as text data not just the character code of the first candidate obtained by character recognition but the character codes up to the fourth candidate. In other words, for each character image, the character image and four candidate characters from the recognition result are stored.

[0011] <Search method> FIG. 18 shows text data candidates in the apparatus according to Japanese Patent Application Laid-Open No. H8-272813. In the figure, arrows indicate the portions where the character recognition results match the keyword when the search keyword is "内部処理統合型" (internal-processing-integrated type). The CPU 105, acting as document search means, matches the keyword against all candidate characters up to the fourth rank.

[0012] Suppose a result is treated as a search result candidate when the value of m in equation (1) is at or above a threshold (for example, 60%). In FIG. 18, six characters match out of the seven characters in the keyword, so m = (6/7) × 100 = 85.7 (%), and this result becomes a search result candidate.
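The calculation of equation (1) against multi-candidate recognition results can be sketched as follows. This is a minimal illustration, not the cited apparatus: the function name `match_degree_percent` and the candidate lists are hypothetical (they are not the data of FIG. 18) and are chosen only so that six of the seven keyword characters appear among the candidates.

```python
def match_degree_percent(keyword, candidates):
    """Equation (1): matched characters / keyword length x 100 (%).

    candidates[p] is the list of recognition candidates (up to four)
    stored for the character image at position p.
    """
    matched = sum(1 for ch, cands in zip(keyword, candidates) if ch in cands)
    return matched / len(keyword) * 100


keyword = "内部処理統合型"

# Hypothetical candidate lists, one per character image (assumption).
candidates = [
    ["円", "内", "肉", "丙"],
    ["部", "郎", "都", "邦"],
    ["処", "拠", "外", "匆"],
    ["埋", "狸", "哩", "浬"],  # the correct character "理" is missing here
    ["統", "続", "綻", "絞"],
    ["合", "含", "令", "会"],
    ["型", "刑", "形", "塑"],
]

m = match_degree_percent(keyword, candidates)
print(round(m, 1))   # 85.7
print(m >= 60)       # True: at or above the example threshold of 60%
```

Because the score only counts positions where some candidate equals the keyword character, it carries no information about how close the non-matching position was.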

[0013]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, the conventional method of retaining multiple character recognition results per character image has the following problems related to recognition accuracy. If few candidate characters are stored, the candidates are less likely to contain the correct character, and search omissions occur easily. If many candidates are retained, the correct character is more likely to be included and search omissions decrease, but many characters other than the correct one are also retained, so erroneous extraction occurs frequently. Retaining many candidate characters also increases the memory capacity needed to store documents.

[0014] In the method that varies the number of retained candidates according to similarity, when the density of the document image is inappropriate, for example when characters are blotted or faint, the difference between the standard pattern and the character image in the feature values used for recognition becomes large. The candidates then fail to contain the correct character, the recognition rate drops, and search omissions result.

[0015] Moreover, in this case the similarity of the recognition results becomes small (the likelihood of a correct answer falls), so more candidate characters must be stored to maintain a given correct-character inclusion rate. As a result, erroneous extraction increases at search time.

[0016] On the other hand, in a search method that tolerates some degree of mismatch, such as that disclosed in Japanese Patent Application Laid-Open No. H8-272813, the same degree of matching is computed no matter what character occupies the mismatched position, as long as the matching portion is the same.

[0017] For example, when the search keyword is "日本人" (Japanese person), the character strings "日本入", "日本語", "日本国", "日本の", "日本は" and so on all yield the same degree of matching, m = (2/3) × 100 = 67%, and are output and displayed as search results. Here, "日本入" is a misrecognition in which "人" was read as the visually similar "入", so it is in fact the string "日本人" that the user wants to find. Yet because its degree of matching equals that of "日本語", "日本国", "日本の", "日本は" and the like, when results are displayed in descending order of matching degree, "日本入" is buried among them.
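The tie described in this paragraph is easy to reproduce. The sketch below applies equation (1) to single-candidate strings; the helper name `exact_match_degree` is hypothetical and introduced only for illustration.

```python
def exact_match_degree(keyword, text):
    """Equation (1) with one recognized character per position."""
    matched = sum(1 for k, t in zip(keyword, text) if k == t)
    return matched / len(keyword) * 100


keyword = "日本人"
strings = ["日本入", "日本語", "日本国", "日本の", "日本は"]

scores = {s: round(exact_match_degree(keyword, s)) for s in strings}
for s, score in scores.items():
    print(s, score)  # every string scores 67
```

All five strings receive the same score, so the misrecognized "日本入" cannot be ranked above the unrelated strings. A score that also reflects how similar the mismatched character is to the keyword character would be needed to separate them.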

[0018] The user must then look for the desired result among such erroneous extractions shown on the display 108 serving as display means. The lower the threshold for tolerating mismatches, the more erroneous extractions are output, so the document the user really wants is buried among them and the apparatus becomes difficult to use. Conversely, raising the threshold increases search omissions.

[0019] The present invention has been made in view of the above problems. Its object is to provide a document filing system and a document filing method in which search omissions are unlikely in document search and erroneous extraction does not occur even when the number of candidate characters is increased. Another object of the present invention is to provide a document filing system and a document filing method that can calculate an appropriate degree of matching even when the recognition result contains characters that partially mismatch the keyword, and can thereby perform highly accurate search.

[0020]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above objects, the first invention provides a document filing system that retrieves a given document from document images according to a keyword, comprising: means for recognizing characters in the document image; collation means for collating the keyword with the recognized characters; and output means for outputting information about the document image as a document search result based on the result of the collation. The collation means calculates, character by character, the similarity between a character in the keyword and a recognized character, and performs the collation based on a degree of matching between the keyword characters and the recognized characters that is computed from this similarity. The output means outputs the document corresponding to the keyword as the document search result based on the value of the degree of matching.

[0021] The second invention provides a document filing system according to the first invention, further comprising means for applying a predetermined grammatical analysis to the recognized characters with reference to a prestored word dictionary and, based on the result of the analysis, assigning to each character an evaluation value indicating how well the character matches the word dictionary, wherein the collation means performs the collation based on an overall evaluation value obtained for each character from the similarity and the evaluation value.

[0022] The third invention provides a document filing system according to the second invention, wherein the collation means distinguishes, among the recognized characters, between characters whose character code can be uniquely determined and characters whose character code cannot be uniquely determined, based on whether the overall evaluation value is at or above a fixed value, and stores each uniquely determinable character with its character code associated with a predetermined flag.

[0023] The fourth invention provides a document filing system according to the third invention, wherein, for the uniquely determinable characters, the collation means calculates the similarity between the character and a character in the keyword based on whether their character codes are identical, and calculates the degree of matching using this similarity, and, for the characters that cannot be uniquely determined, it calculates the degree of matching using the similarity between the character in the keyword and the recognized character.

[0024] The fifth invention provides a document filing system according to the third invention, further comprising means for extracting feature values of each of the characters in the keyword and the recognized characters, wherein, for the characters that cannot be uniquely determined, the collation means calculates the similarity of the feature values and calculates the degree of matching using this similarity.

[0025] The sixth invention provides a document filing system according to the third invention, further comprising means for calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, wherein, when the ratio is at or above a fixed value, the output means displays a notice that the given document cannot be searched properly.

[0026] The seventh invention provides a document filing method for retrieving a given document from document images according to a keyword, comprising: a step of recognizing characters in the document image; a step of calculating, character by character, the similarity between a character in the keyword and a recognized character; a step of calculating the degree of matching between the keyword characters and the recognized characters using the similarity; a collation step of collating the keyword characters with the recognized characters based on the degree of matching; and an output step of outputting the document corresponding to the keyword as a document search result based on the result of the collation.

[0027] The eighth invention provides a document filing method according to the seventh invention, further comprising a step of applying a predetermined grammatical analysis to the recognized characters with reference to a prestored word dictionary and, based on the result of the analysis, assigning to each character an evaluation value indicating how well the character matches the word dictionary, wherein the collation step performs the collation based on an overall evaluation value obtained for each character from the similarity and the evaluation value.

[0028] The ninth invention provides a document filing method according to the eighth invention, wherein the collation step distinguishes, among the recognized characters, between characters whose character code can be uniquely determined and characters whose character code cannot be uniquely determined, based on whether the overall evaluation value is at or above a fixed value, and stores each uniquely determinable character with its character code associated with a predetermined flag.

[0029] The tenth invention provides a document filing method according to the ninth invention, wherein, for the uniquely determinable characters, the collation step calculates the similarity between the character and a character in the keyword based on whether their character codes are identical, and calculates the degree of matching using this similarity, and, for the characters that cannot be uniquely determined, it calculates the degree of matching using the similarity between the character in the keyword and the recognized character.

[0030] The eleventh invention provides a document filing method according to the ninth invention, further comprising a step of extracting feature values of each of the characters in the keyword and the recognized characters, wherein, for the characters that cannot be uniquely determined, the collation step calculates the similarity of the feature values and calculates the degree of matching using this similarity.

[0031] The twelfth invention provides a document filing method according to the ninth invention, further comprising a step of calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, wherein, when the ratio is at or above a fixed value, the output step displays a notice that the given document cannot be searched properly.

[0032]

EMBODIMENTS

Embodiments of the present invention are described below with reference to the accompanying drawings.

Embodiment 1. FIG. 1 is a block diagram showing the configuration of a filing system according to Embodiment 1 of the present invention. In the system shown in the figure, a document image is input from an input unit 1, and the input document image and the result recognized by a character recognition unit 3 are stored in a document storage unit 2. The character recognition unit 3 finds character regions in the input document image and recognizes the characters in those regions. A dictionary 8 holds the features of the standard pattern of each character, which the character recognition unit 3 uses to recognize characters.

[0033] A document image database 4 holds the input images electronically, and a character recognition result database 5 holds the character code strings produced by the character recognition unit 3. A document search unit 6 searches the character recognition result database 5 for documents containing a character string that matches a keyword 20 entered by the user. A similarity output unit 7 calculates the similarity between two character codes with reference to the dictionary 8, and a display unit 9 displays the results of the search by the document search unit 6.

[0034] FIG. 2 is a flowchart showing the document storage procedure in the system according to this embodiment. First, the method of storing character recognition results in this embodiment is described using the flowchart of FIG. 2. In step S100 of FIG. 2, the input unit 1 inputs an image and transfers it to the document storage unit 2. The input unit 1 may, for example, photoelectrically convert a document image using a scanner, or may receive an already photoelectrically converted image via a network or the like.

[0035] In step S110, the document storage unit 2 passes the image received from the input unit 1 to the character recognition unit 3 and receives from it the character codes that result from recognizing the characters in the image. The character recognition unit 3 performs known character string extraction, character segmentation, and character recognition, and returns the resulting recognition result to the document storage unit 2. Character string extraction is done, for example, by connecting portions of the document image where black pixels are contiguous and deciding from the width and height of each connected component of black pixels whether it is a character string.

[0036] Character segmentation is done, for example, by scanning the character string image determined by character string extraction in the vertical and horizontal directions, obtaining the marginal distribution of the number of black pixels, and dividing the image into single-character images using the positions with few black pixels as candidate cut points. Character recognition then extracts, for example, 8 × 8-dimensional density features from each single-character image, computes the sum of the per-dimension differences from each standard pattern, and outputs several characters as the recognition result, starting from the standard pattern with the smallest sum of differences.
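The segmentation and recognition steps of this paragraph can be sketched as follows. This is an illustrative toy, not the embodiment's implementation: the function names, the `max_black` cut criterion, the use of absolute differences for the "sum of per-dimension differences", and the 4-dimensional feature vectors (the text uses 8 × 8 = 64 dimensions) are all assumptions.

```python
def segment_columns(image, max_black=0):
    """Split a binary text-line image (rows of 0/1) into character spans.

    Columns whose black-pixel count is <= max_black are treated as cut
    points; each returned (start, end) pair is a half-open column range.
    """
    width = len(image[0])
    profile = [sum(row[x] for row in image) for x in range(width)]
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > max_black and start is None:
            start = x                      # a character begins
        elif count <= max_black and start is not None:
            segments.append((start, x))    # a character ends
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


def rank_templates(features, templates, top=3):
    """Return the `top` standard patterns closest to `features`.

    Distance is the sum of absolute per-dimension differences.
    """
    def distance(name):
        return sum(abs(a - b) for a, b in zip(features, templates[name]))
    return sorted(templates, key=distance)[:top]


line = [[1, 1, 0, 0, 1, 0],
        [1, 0, 0, 1, 1, 0]]
print(segment_columns(line))  # [(0, 2), (3, 5)]

templates = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 2, 3, 5]}
print(rank_templates([1, 2, 3, 4], templates, top=2))  # ['A', 'C']
```

Only the horizontal projection is shown. A vertical pass over each span would trim empty rows in the same way.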

[0037] In the next step S120, the document storage unit 2 stores the character codes of the recognition result in the character recognition result database 5. Here only one recognized character is stored per character image. FIG. 3 shows an example of stored character recognition results in this embodiment. The document storage unit 2 also stores the input binary document image in the document image database 4. This completes the document storage process.

[0038] Next, the keyword search method in this embodiment is described. FIG. 4 is a flowchart showing the search procedure in the system according to this embodiment, that is, the procedure for collating a keyword with the character codes of the recognition results.

[0039] In step S200 of FIG. 4, the keyword 20 entered by the user is input to the document search unit 6. In the following step S210, the pointers into the array containing the recognition results (hereinafter, text) and the array containing the keyword (hereinafter, key) are initialized. The document search unit 6 takes the recognition result of one document from the character recognition result database 5 and sets it in text.

[0040] In the next step S220, the characters in text and key are collated. The character codes of the character text[k] and the character key[j] are input to the similarity output unit 7, which refers to the dictionary 8 and outputs the similarity s between the characters text[k] and key[j]. Here, the value s obtained by

s = 1 − { Σn |Ntext[k]n − Nkey[j]n| / ( Σn Ntext[k]n + Σn Nkey[j]n ) } …(2)

is returned as the similarity. In equation (2), Ntext[k]n is the density value of the n-th dimension of the character text[k] in the dictionary 8, and Nkey[j]n is the density value of the n-th dimension of the character key[j] in the dictionary 8. n takes values from 1 to 64, and | | denotes the absolute value.

[0041] The document search unit 6 then uses the similarity output by the similarity output unit 7 to calculate

degree of matching = (sum of s over the collated characters) / (number of collated characters) …(3)
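A minimal sketch of the per-character similarity and of equation (3) follows. The body of equation (2) is taken here to be s = 1 − Σ|a − b| / (Σa + Σb), which is the form consistent with the worked value s = 1 − {91/(94 + 95)} in paragraph [0047]. That reading is an assumption, as are the function names. The two-dimensional vectors are synthetic, chosen only so that the sums are 94 and 95 and the total absolute difference is 91. They are not the feature values of FIGS. 5 and 6.

```python
def similarity(text_feat, key_feat):
    """Assumed form of equation (2): 1 - sum|a - b| / (sum a + sum b).

    text_feat and key_feat are the dictionary density features of the two
    characters (64 values each in the embodiment). Assumes the features
    are not all zero.
    """
    diff = sum(abs(a - b) for a, b in zip(text_feat, key_feat))
    total = sum(text_feat) + sum(key_feat)
    return 1 - diff / total


def match_degree(similarities):
    """Equation (3): mean similarity over the characters collated so far."""
    return sum(similarities) / len(similarities)


# Synthetic features (assumption): sums 94 and 95, absolute difference 91.
text_feat = [94, 0]
key_feat = [49, 46]

s = similarity(text_feat, key_feat)
print(round(s, 2))                  # 0.52
print(round(match_degree([s]), 2))  # 0.52
```

Identical patterns give s = 1 and patterns with no overlapping density give s = 0, so under this reading the score stays in the range 0 to 1 for non-negative features.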

[0042] In step S230, the document search unit 6 determines whether the degree of matching is at or above a fixed value. If it is, the process proceeds to step S240, where the pointers k and j are each incremented by 1, and then in step S250 it is determined whether the entire keyword has been collated.

[0043] If step S250 determines that the entire keyword has been collated, the output result is saved in step S270 and the process ends. The output result comprises the name of the matching document, the degree of matching, the page number, the coordinates on the image of the matched characters, and so on. In this embodiment these are output to the display unit 9 and displayed.

[0044] On the other hand, if step S230 determines that the degree of matching is below the fixed value, the process proceeds to step S280, which checks whether the end of the document has been reached. If so, the process ends. If not, the process proceeds to step S290, where i is incremented by 1 and assigned to k, the keyword pointer j is set to 0, and the process returns to step S220.

[0045] If it is determined in step S250 that collation against the entire keyword has not yet been completed, it is checked in the subsequent step S260 whether the end of the document has been reached. If it has, the process ends. If it has not, the process returns to step S220. The above processing is repeated in the same way for all documents in the character recognition result database 5.
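The loop through steps S210 to S290 can be sketched in Python as follows. This is only one reading of the flowchart: the function and variable names, the threshold default, the sample text, and the similarity values other than those for identical characters are assumptions made for illustration.

```python
def search(text, key, similarity, threshold=0.6):
    """Slide the keyword over the recognised text (steps S210 to S290).

    Returns (start_index, degree_of_coincidence) for the first hit, or None.
    """
    if not key:
        return None
    i = 0                 # start position of the current attempt
    k, j = i, 0           # S210: text pointer k and keyword pointer j
    scores = []           # similarities collated in the current attempt
    while k < len(text):  # S260 / S280: stop at the end of the document
        scores.append(similarity(text[k], key[j]))  # S220
        degree = sum(scores) / len(scores)          # equation (3)
        if degree >= threshold:                     # S230: "Yes"
            k, j = k + 1, j + 1                     # S240
            if j == len(key):                       # S250: whole keyword collated
                return i, degree                    # S270
        else:                                       # S230: "No"
            i += 1                                  # S290: restart one character later
            k, j, scores = i, 0, []
    return None


# Hypothetical similarity values for two visually similar character pairs.
TABLE = {("都", "部"): 0.6, ("縦", "統"): 0.5}


def toy_similarity(a, b):
    return 1.0 if a == b else TABLE.get((a, b), 0.0)


print(search("で行内都処理縦合型", "内部処理統合型", toy_similarity))
```

In this sketch the accumulated similarities are cleared whenever the attempt restarts, so each candidate position is scored independently.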

[0046] The method of collating the character recognition result with the keyword "内部処理統合型" (internal processing integrated type) in the present embodiment will now be described by way of example. When the search process has advanced to step S220 in FIG. 4 with pointers k = 0 and j = 0, the first character of the character recognition result is collated with the first character of the keyword. That is, text[0] = "で" is collated with key[0] = "内", and their similarity is obtained. FIG. 5 shows the features of the standard pattern "で" in the dictionary 8, and FIG. 6 likewise shows the features of "内". The numerical values in these patterns indicate the respective feature amounts.

[0047] The similarity output unit 7 calculates the similarity s from the patterns of FIGS. 5 and 6 according to equation (2) above, obtains s = 1 − {91 / (94 + 95)} = 0.52, and outputs it to the document search unit 6. In step S230, the document search unit 6 calculates the degree of coincidence from this similarity s and obtains degree of coincidence = 0.52 / 1 = 0.52. If the threshold for the degree of coincidence is 0.6, the determination in step S230 is "No", so the process proceeds to step S280.
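The arithmetic in this example can be checked with a short Python sketch. Equation (2) itself lies outside this passage, so the form used below is inferred from the worked numbers, reading 91 as the summed absolute difference between the two patterns and 94 and 95 as their respective feature sums. That reading, the function name, and the toy three-element vectors are assumptions.

```python
def similarity(text_features, key_features):
    """Assumed form of equation (2): 1 - sum|t_n - k_n| / (sum t_n + sum k_n)."""
    diff = sum(abs(t - k) for t, k in zip(text_features, key_features))
    total = sum(text_features) + sum(key_features)
    if total == 0:
        return 1.0  # assumption: two blank patterns are treated as identical
    return 1 - diff / total


# Worked numbers from the example above.
s = 1 - 91 / (94 + 95)
print(round(s, 2))                           # 0.52

# Toy vectors (not the 64-dimensional patterns of FIGS. 5 and 6).
print(similarity([3, 0, 2], [1, 0, 2]))      # 0.75
```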

[0048] At this stage the end of the document has not been reached, so the determination in step S280 is "No". In the following step S290, i is incremented, which advances k by one, and the process returns to step S220. As a result, in step S220, text[1] = "行" is collated with key[0] = "内".

[0049] The similarity output unit 7 performs the same calculation as above. If, for example, it outputs a similarity of 0.4 to the document search unit 6, the determination in step S230 is again "No", so the process proceeds to step S280. These steps are repeated in the same way, and when the collation shows that "内都処理縦合型" in text and the keyword have a degree of coincidence equal to or greater than the certain value, this is taken as an output candidate.

[0050] As described above, according to the present embodiment, only one character is stored as the recognition result for each character image, so less memory is required to store a document. In addition, the characters to be collated are not limited to the few characters that are recognition result candidates. All characters are subject to collation, and the degree of coincidence of every character is used at collation time. Search omissions are therefore reduced, and an appropriate degree of coincidence can be calculated even when the recognition result contains characters that partially disagree with the keyword. This has the notable effect of sparing the user the effort of checking the search results.

[0051] The character recognition method is not limited to the one described above. For example, the structural analysis method described in "Pattern Recognition" (Noboru Funakubo, Kyoritsu Shuppan) may be used. In the above embodiment, the similarity output unit 7 calculates the similarity from the dictionary used by the character recognition unit 3, but this is not a limitation, and a table of similarities between characters may be held in advance. FIG. 7 is an example of such a table of similarities between characters. Needless to say, the threshold for the degree of coincidence and the method of calculating the degree of coincidence are likewise not limited to the above numerical values and equations.

[0052] Embodiment 2. FIG. 8 is a block diagram showing the configuration of a filing system according to Embodiment 2 of the present invention. In the figure, components identical to those of the system according to Embodiment 1 shown in FIG. 1 are given the same reference numerals, and their description is omitted here. The post-processing unit 10 in FIG. 8 grammatically verifies the character recognition result produced by the character recognition unit 3 and stored in the document storage unit 2, and outputs the result. The word dictionary 11 is the dictionary used by the post-processing unit 10.

[0053] First, the method of registering a document in the system according to the present embodiment will be described. FIG. 9 is a flowchart showing the process of storing character recognition results that is executed by the filing system according to the present embodiment. In step S300 of the figure, the character recognition unit 3 performs character recognition. The character recognition method here is the same as in Embodiment 1, so its description is omitted. Note that multiple recognition results, including candidate characters, are output for each character image.

[0054] Next, the process proceeds to step S310, where post-processing is performed. Here, the post-processing unit 10 grammatically analyzes the recognition results, including the candidate characters, and determines the combination that appears to be correct as a sentence. A post-processing method using known morpheme matching is described below.

[0055] FIG. 10 is a flowchart showing the post-processing procedure according to the present embodiment. Here, this post-processing is applied to the recognition result candidate characters shown in FIG. 18. FIG. 11 shows the correspondence between words in the word dictionary 11 and morpheme numbers. A morpheme number is a number uniquely assigned to each morpheme class (words such as common nouns, proper nouns, and sa-hen verbal nouns). FIG. 12 shows a description of the connection relationships between these morpheme numbers.

[0056] In step S410 of FIG. 10, the post-processing unit 10 collates the character recognition results with the word dictionary 11. When the words shown in FIG. 11 are collated with the recognition results shown in FIG. 18, "処理" (processing), "統合" (integration), and "型" (type) in FIG. 11 are successfully matched against the recognition results. The process then proceeds to step S420, where a morpheme connection test is performed according to the description shown in FIG. 12.

[0057] As shown in FIG. 11, the morpheme number of "処理" is "55", "統合" has two morpheme numbers, "55" and "59", and the morpheme number of "型" is "46". When the morpheme connection test is performed using the description in FIG. 12, the connections between 55 and 55 and between 55 and 46 both hold (circles in the figure). The process then proceeds to step S430, where the number of characters in the matched morpheme is assigned to each candidate character as its evaluation value.

[0058] FIG. 13 shows the collation result output by the post-processing unit 10. As shown in the figure, "処", "理", "統", and "合" are assigned an evaluation value of "2", "型" is assigned an evaluation value of "1", and all other characters are assigned "0" because they do not match the word dictionary.
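The connection test and the evaluation values can be illustrated with a simplified Python sketch. It works on a single character string rather than on the full candidate lattice of FIG. 18, and the connection table holds only the two pairs that the text says are marked in FIG. 12. All names and that reduced table are assumptions for illustration.

```python
# Morpheme numbers quoted from the description of FIG. 11.
MORPHEMES = {"処理": [55], "統合": [55, 59], "型": [46]}

# Only the pairs the text reports as connectable; FIG. 12 may list more.
CONNECTABLE = {(55, 55), (55, 46)}


def connects(left, right):
    """True if any morpheme number of `left` may be followed by one of `right`."""
    return any((a, b) in CONNECTABLE
               for a in MORPHEMES[left] for b in MORPHEMES[right])


def evaluation_values(text, matched_words):
    """Give each character the length of the dictionary word that covers it."""
    values = [0] * len(text)
    for word in matched_words:
        start = text.find(word)
        while start != -1:
            for pos in range(start, start + len(word)):
                values[pos] = max(values[pos], len(word))
            start = text.find(word, start + 1)
    return values


print(connects("処理", "統合"), connects("統合", "型"))
print(evaluation_values("内部処理統合型", ["処理", "統合", "型"]))
```

The second call reproduces the pattern described for FIG. 13: two-character words yield 2, the one-character word yields 1, and unmatched characters stay at 0.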

[0059] When the above post-processing is finished, step S320 of FIG. 9 is executed. That is, the document storage unit 2 calculates a total evaluation value for each character from the character recognition result and the post-processing result. The total evaluation value is calculated by the following equation (4):
total evaluation value = α × (character recognition similarity) + (1 − α) × (post-processing evaluation value) ... (4)
Here α takes a value satisfying 0 ≤ α ≤ 1, and in the present embodiment α = 0.8.

[0060] Suppose that, among the candidate characters shown in FIG. 18, the similarity of "縦" is 0.7 and the similarity of "統" is 0.5. Applying these values and the collation result shown in FIG. 13 to equation (4) gives a total evaluation value of 0.8 × 0.7 + 0.2 × 0 = 0.56 for "縦" and 0.8 × 0.5 + 0.2 × 2 = 0.8 for "統", so "統" has a higher evaluation value than "縦".
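Equation (4) and the two worked values can be reproduced with a few lines of Python. The function name is illustrative; the default α = 0.8 is the value stated for this embodiment.

```python
def total_evaluation(similarity, post_value, alpha=0.8):
    """Equation (4): weighted sum of recognition similarity and post-processing value."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    return alpha * similarity + (1 - alpha) * post_value


print(round(total_evaluation(0.7, 0), 2))   # "縦": 0.56
print(round(total_evaluation(0.5, 2), 2))   # "統": 0.8
```

The results are rounded only for display, since binary floating point does not represent 0.56 or 0.8 exactly.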

[0061] In the following step S330, the document storage unit 2 distinguishes characters that can be uniquely determined from characters that cannot. One way to make this distinction is to treat a character as uniquely determinable when its total evaluation value is equal to or greater than a certain value, and as not determinable otherwise. Then, in step S340, for a character that can be uniquely determined, the document storage unit 2 stores only the character code with the highest total evaluation value, together with the flag "0". For a character judged not to be uniquely determinable, it stores the character code with the highest total evaluation value together with the flag "1". In other words, after character recognition and post-processing, the single character with the highest evaluation value is stored as the recognition result regardless of whether it can be uniquely determined.
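Steps S330 and S340 amount to picking the best candidate and attaching a flag, as the following Python sketch shows. The threshold of 0.7 and the similarity values given for "都" and "部" are hypothetical; the specification does not state them.

```python
def store_character(candidates, threshold=0.7, alpha=0.8):
    """Pick the best candidate and flag it (steps S330 and S340).

    candidates: list of (char, recognition_similarity, post_processing_value).
    Returns (char, flag) where flag 0 = uniquely determined, 1 = not.
    """
    def total(candidate):
        _, similarity, post_value = candidate
        return alpha * similarity + (1 - alpha) * post_value  # equation (4)

    best = max(candidates, key=total)
    flag = 0 if total(best) >= threshold else 1
    return best[0], flag


# Values for "縦" and "統" follow the worked example; the rest are made up.
print(store_character([("縦", 0.7, 0), ("統", 0.5, 2)]))   # ('統', 0)
print(store_character([("都", 0.6, 0), ("部", 0.5, 0)]))   # ('都', 1)
```

Only one character code is kept per character image in either case, so the flag is the only extra storage cost.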

[0062] FIG. 14 shows an example of stored character recognition results in the present embodiment. As the figure shows, in contrast to the case of FIG. 3, the character "統" is stored correctly in the present embodiment.

[0063] The search method in the system according to the present embodiment is described below. This search method is basically the same as the search method of Embodiment 1 shown in FIG. 4, but the processing in step S220 differs. In Embodiment 1, the degree of coincidence with the keyword is calculated for every character using the similarity from the similarity output unit 7. In Embodiment 2, for a character of text whose flag is 1, the value from the similarity output unit 7 is used as the similarity when calculating the degree of coincidence. For a character whose flag is 0, collation is regarded as successful only when the character codes of text and key match exactly, and a similarity of 1 is given. If the flag is 0 and the character codes of text and key do not match, a similarity of 0 is returned.
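The flag-dependent rule for step S220 can be written as a small Python function. The names are illustrative, and the `fuzzy` callable stands in for the similarity output unit 7; the value 0.6 used in the usage lines is only an example.

```python
def char_similarity(text_char, flag, key_char, fuzzy):
    """Similarity used in step S220 of Embodiment 2.

    flag 0: exact character-code comparison, giving 1 or 0.
    flag 1: value supplied by the similarity output unit (here `fuzzy`).
    """
    if flag == 0:
        return 1.0 if text_char == key_char else 0.0
    return fuzzy(text_char, key_char)


def example_fuzzy(a, b):
    return 0.6  # hypothetical similarity for any flagged pair


print(char_similarity("で", 0, "内", example_fuzzy))   # 0.0
print(char_similarity("内", 0, "内", example_fuzzy))   # 1.0
print(char_similarity("都", 1, "部", example_fuzzy))   # 0.6
```

With the last two values, equation (3) gives a degree of coincidence of (1 + 0.6) / 2 = 0.8 after two characters.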

[0064] The search method in the present embodiment will now be described with reference to the flowchart of FIG. 4, taking as an example the collation of the text shown in FIG. 14 with the keyword "内部処理統合型". The document search unit 6 receives the keyword "内部処理統合型" (step S200 in FIG. 4), initializes the text and key pointers (step S210), and in the following step S220 collates text with key. As in Embodiment 1, text[0] = "で" is first collated with key[0] = "内".

[0065] As shown in FIG. 14, the flag of text[0] is "0", so the document search unit 6 compares text[0] and key[0] to see whether they have the same character code. In this case they do not, so the collation is regarded as failed and the document search unit 6 returns a similarity of 0. The process then proceeds to step S230, where the degree of coincidence is calculated. The degree of coincidence here is 0 / 1 = 0, so the process continues to step S280 and onward, and in step S220 text[1] = "行" is next collated with key[0] = "内".

[0066] Processing continues in the same way. When text[5] is collated with key[0], a degree of coincidence equal to or greater than the predetermined value is obtained, so the determination in step S230 is "Yes" and the process proceeds to step S240. In step S240 the text and key pointers are each incremented by one, after which the process returns to step S220 via steps S250 and S260.

[0067] When text[6] = "都" is collated with key[1] = "部", the flag of text[6] is "1", so the document search unit 6 takes the result of the similarity output unit 7 as the similarity. If, for example, the similarity (都, 部) returned by the similarity output unit 7 is 0.6, the degree of coincidence obtained in step S230 is (1 + 0.6) / 2 = 0.8. Since this degree of coincidence exceeds the certain value of 0.6, the process proceeds to step S240. Processing continues in the same way, and if the collation with "内都処理縦合型" in text succeeds, it is output as a search result.

[0068] As described above, according to the present embodiment, a character that can be uniquely determined is regarded as successfully collated only when the character codes are exactly identical. For example, excluding candidate characters ranked second or lower in total evaluation value makes their collation unnecessary, so the probability of erroneous extraction can be further reduced.

[0069] Furthermore, because the degree of coincidence with the keyword is always evaluated using the similarity of each individual character, different characters in a mismatched position do not all produce the same degree of coincidence. Erroneous extractions and correct search results can therefore be separated more clearly on the basis of the degree of coincidence obtained.

[0070] In Embodiment 2 above, the character code and the flag 1 are stored for a character whose character code cannot be uniquely determined, but this is not a limitation. A character code may fail to be uniquely determined when, for example, the density values of the input image are inappropriate and the image is faint or smudged. In such cases, for a character that cannot be uniquely determined, storing the features used for character recognition rather than the character code enables more accurate collation and therefore more accurate retrieval.

[0071] FIG. 15 shows an example of a dictionary for the feature storage described above. As shown in the figure, this dictionary stores feature values for characters whose flag is 1. The feature values (indicated by 21 in the figure) are, for example, 8 × 8-dimensional (64-dimensional) feature values such as those shown in FIGS. 5 and 6, laid out in a single row.

[0072] As the processing at search time, in step S220 of FIG. 4 the similarity output unit 7 may collate text[k] with key[j] when the flag of text[k] is 0. When the flag of text[k] is 1, it may use the feature values shown in FIG. 15 as the features of the character text[k] and the features in the dictionary 8 as the features of the character key[j], and calculate the similarity according to equation (2) above.
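A stored record that carries features only for flagged characters might look like the following Python sketch. The `StoredChar` class, the four-element toy vectors (instead of 64 dimensions), and the form assumed for equation (2), which is inferred from the worked example s = 1 − {91 / (94 + 95)}, are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StoredChar:
    code: str                                   # best character code
    flag: int                                   # 0 = uniquely determined, 1 = not
    features: Optional[Sequence[int]] = None    # kept only when flag == 1


def feature_similarity(a, b):
    """Assumed form of equation (2)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 1.0  # assumption: two blank patterns count as identical
    return 1 - sum(abs(x - y) for x, y in zip(a, b)) / total


def collate(stored, key_char, dictionary):
    """Flag 0: compare character codes. Flag 1: compare stored features with the dictionary."""
    if stored.flag == 0:
        return 1.0 if stored.code == key_char else 0.0
    return feature_similarity(stored.features, dictionary[key_char])


toy_dictionary = {"部": [2, 1, 1, 0]}           # hypothetical standard pattern
print(collate(StoredChar("内", 0), "部", toy_dictionary))
print(collate(StoredChar("都", 1, [2, 0, 1, 1]), "部", toy_dictionary))
```

Keeping the features lets the comparison use the image evidence itself rather than a character code that may already be wrong.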

[0073] On the other hand, when the number of characters that cannot be uniquely determined becomes large relative to the number of characters in the document image, the document tolerates a wide range of keywords, and the probability that it is retrieved as an erroneous extraction increases. In such a case, a message to the effect that "this document may not be retrieved correctly" may be displayed to the user on the display unit 9, together with the image input from the input unit 1.

[0074] This operation will be described with reference to FIGS. 9 and 16. In this case, in step S340 of FIG. 9, the document storage unit 2 executes the document storage procedure according to the flowchart shown in FIG. 16. That is, in step S510 of FIG. 16, the document storage unit 2 calculates the number of characters A output by the character recognition unit 3. Then, in step S520, it calculates the number of characters F that cannot be uniquely determined (the number of characters whose flag is 1), as derived from the results of the character recognition unit 3 and the post-processing unit 10.

[0075] Next, in step S530, it is determined whether F/A is equal to or greater than a certain value. If it is, the process proceeds to step S540, where a warning to the effect that "this document may not be retrieved correctly" is displayed on the display unit 9, and the process ends without saving the document. If F/A is smaller than the certain value, the warning is not displayed, the document is saved, and the process ends.
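Steps S510 to S540 reduce to a ratio test, sketched below in Python. The ratio threshold of 0.3, the function name, and the handling of an empty document are assumptions; the specification only says "a certain value".

```python
WARNING = "This document may not be retrieved correctly."


def register(flags, ratio_threshold=0.3):
    """Decide whether to save a document (steps S510 to S540).

    flags: one 0/1 flag per recognised character.
    Returns (saved, message).
    """
    a = len(flags)                               # S510: number of characters A
    f = sum(1 for flag in flags if flag == 1)    # S520: undetermined characters F
    ratio = f / a if a else 0.0                  # assumption: empty input gives ratio 0
    if ratio >= ratio_threshold:                 # S530
        return False, WARNING                    # S540: warn and do not save
    return True, None


print(register([0, 0, 1, 1]))   # F/A = 0.5: warning, not saved
print(register([0, 0, 0, 1]))   # F/A = 0.25: saved
```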

[0076] In this way, by looking at the message and the image, the user can infer why the document would not be retrieved correctly, which makes this an effective means of ensuring that correct data is stored.

【００７７】[0077]

[Effects of the Invention] As described above, according to the first invention, a document filing system that retrieves a predetermined document from document images according to a keyword comprises means for recognizing characters in the document image, collating means for collating the keyword with the recognized characters, and output means for outputting information on the document image as a document search result on the basis of the collation result. The collating means calculates, for each character, the similarity between a character in the keyword and a recognized character, and performs the collation on the basis of the degree of coincidence between the characters in the keyword and the recognized characters calculated from this similarity. The output means outputs the document corresponding to the keyword as the document search result on the basis of the value of this degree of coincidence. As a result, less memory is required to store documents, search omissions are reduced, and an appropriate degree of coincidence can be calculated even when the recognition result contains characters that partially disagree with the keyword, which in turn spares the user the effort of checking the search results.

[0078] According to the second invention, the system further comprises means for applying a predetermined grammatical analysis to the recognized characters with reference to a word dictionary stored in advance and, on the basis of the result of this analysis, assigning to each character an evaluation value indicating how it matches the word dictionary. The collating means performs the collation on the basis of a total evaluation value obtained for each character from the similarity and the evaluation value, so a more appropriate degree of coincidence can be calculated.

[0079] According to the third invention, the collating means distinguishes, among the recognized characters, characters whose character code can be uniquely determined from characters whose character code cannot, according to whether the total evaluation value is equal to or greater than a certain value, and stores each uniquely determinable character with its character code associated with a predetermined flag. This reduces the probability of erroneous extraction. In addition, because mismatched positions do not all produce the same degree of coincidence, erroneous extractions and correct search results can be separated more clearly on the basis of the degree of coincidence obtained.

[0080] According to the fourth invention, the collating means calculates the similarity between a uniquely determinable character and a character in the keyword according to whether their character codes are identical, and calculates the degree of coincidence using this similarity. For a character that cannot be uniquely determined, it calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character. An appropriate character degree of coincidence for document retrieval can thereby be calculated.

[0081] According to the fifth invention, the system further comprises means for extracting the feature amounts of the characters in the keyword and of the recognized characters. For a character that cannot be uniquely determined, the collating means calculates the similarity of these feature amounts and calculates the degree of coincidence using this similarity, which enables more accurate collation and retrieval than when a character code is stored.

[0082] According to the sixth invention, the system further comprises means for calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters. When this ratio is equal to or greater than a certain value, the output means displays an indication that the predetermined document cannot be retrieved properly, so that the user can see the indication and infer why the document would not be retrieved correctly.

[0083] The seventh invention is a document filing method for retrieving a predetermined document from document images according to a keyword, comprising a step of recognizing characters in the document image, a step of calculating, for each character, the similarity between a character in the keyword and a recognized character, a step of calculating the degree of coincidence between the characters in the keyword and the recognized characters using the similarity, a collating step of collating the characters in the keyword with the recognized characters on the basis of the degree of coincidence, and an output step of outputting the document corresponding to the keyword as a document search result on the basis of the collation result. As a result, less memory is required to store documents, search omissions are reduced, an appropriate degree of coincidence can be calculated even when the recognition result contains characters that partially disagree with the keyword, and the user is spared the effort of checking the search results.

[0084] According to the eighth invention, the method further comprises a step of applying a predetermined grammatical analysis to the recognized characters with reference to a word dictionary stored in advance and, on the basis of the result of this analysis, assigning to each character an evaluation value indicating how it matches the word dictionary. The collating step performs the collation on the basis of a total evaluation value obtained for each character from the similarity and the evaluation value, so a more appropriate degree of coincidence can be calculated.

[0085] According to the ninth invention, the collating step distinguishes, among the recognized characters, characters whose character code can be uniquely determined from characters whose character code cannot, according to whether the total evaluation value is equal to or greater than a certain value, and stores each uniquely determinable character with its character code associated with a predetermined flag. This reduces the probability of erroneous extraction. In addition, because mismatched positions do not all produce the same degree of coincidence, erroneous extractions and correct search results can be clearly separated on the basis of the degree of coincidence obtained.

[0086] According to the tenth invention, the collating step calculates the similarity between a uniquely determinable character and a character in the keyword according to whether their character codes are identical, and calculates the degree of coincidence using this similarity. For a character that cannot be uniquely determined, it calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character. An appropriate character degree of coincidence for document retrieval can thereby be calculated.

[0087] According to the eleventh invention, the method further comprises a step of extracting a feature amount from each of the characters in the keyword and the recognized characters, and the collation step calculates, for a character that cannot be uniquely determined, the similarity between the feature amounts and calculates the degree of coincidence using this similarity. This enables more accurate collation and retrieval than when character codes are stored.
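A minimal sketch of comparing feature amounts directly, as paragraph [0087] describes, is given below. The passage does not name the feature type or the similarity measure, so cosine similarity over plain numeric vectors is used purely as an assumed stand-in, and `feature_similarity` is a hypothetical name.

```python
import math
from typing import Sequence


def feature_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (an assumed measure)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no shape information
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    keyword_feature = [0.2, 0.7, 0.1]     # invented feature of a keyword character
    recognized_feature = [0.25, 0.6, 0.2]  # invented feature of a recognized character
    print(feature_similarity(keyword_feature, recognized_feature))
```

Comparing stored features avoids committing to a possibly wrong character code at filing time, which is the advantage the paragraph claims over storing character codes.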

[0088] According to the twelfth invention, the method further comprises a step of calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, and the output step displays, when the ratio is equal to or greater than a fixed value, an indication that the predetermined document cannot be properly retrieved. By viewing this indication, the user can infer why a document is not retrieved correctly.
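The ratio check in paragraph [0088] amounts to a simple count, sketched below. The function names, the 0.5 limit, and the message text are assumptions made for illustration only.

```python
from typing import List, Optional


def undetermined_ratio(unique_flags: List[bool]) -> float:
    """Share of recognized characters whose character code is not uniquely determined."""
    if not unique_flags:
        return 0.0
    undetermined = sum(1 for unique in unique_flags if not unique)
    return undetermined / len(unique_flags)


def search_warning(unique_flags: List[bool], limit: float = 0.5) -> Optional[str]:
    """Return a warning when too many characters are undetermined, else None.

    The limit value is an assumption made for this example.
    """
    if undetermined_ratio(unique_flags) >= limit:
        return "This document may not be retrievable correctly by keyword search."
    return None


if __name__ == "__main__":
    flags = [True, False, False, True]  # two of four characters are undetermined
    print(undetermined_ratio(flags))
    print(search_warning(flags))
```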

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a filing system according to Embodiment 1 of the present invention.

FIG. 2 is a flowchart showing the document storage procedure in the system according to Embodiment 1 of the present invention.

FIG. 3 is a diagram showing an example of stored character recognition results in Embodiment 1 of the present invention.

FIG. 4 is a flowchart showing the search procedure in the systems according to Embodiments 1 and 2 of the present invention.

FIG. 5 is a diagram showing the features of a standard pattern (the character 「で」) in the dictionary according to Embodiment 1.

FIG. 6 is a diagram showing the features of a standard pattern (the character 「内」) in the dictionary according to Embodiment 1.

FIG. 7 is a diagram showing the similarity between characters in table form.

FIG. 8 is a block diagram showing the configuration of a filing system according to Embodiment 2 of the present invention.

FIG. 9 is a flowchart showing the process of storing character recognition results executed by the filing system according to Embodiment 2.

FIG. 10 is a flowchart showing the post-processing procedure according to Embodiment 2.

FIG. 11 is a diagram showing the correspondence between words in the word dictionary and morpheme numbers according to Embodiment 2.

FIG. 12 is a diagram showing the description of connection rules between morphemes in the word dictionary according to Embodiment 2.

FIG. 13 is a diagram showing the collation results output by the post-processing unit.

FIG. 14 is a diagram showing an example of stored character recognition results in Embodiment 2.

FIG. 15 is a diagram showing an example of a dictionary, used for character recognition, relating to feature storage.

FIG. 16 is a flowchart showing another document storage procedure according to Embodiment 2.

FIG. 17 is a diagram showing the block configuration of a conventional filing apparatus.

FIG. 18 is a diagram showing text data candidates in a conventional apparatus.

[Explanation of symbols]

1: input unit, 2: document storage unit, 3: character recognition unit, 4: document image database, 5: character recognition result database, 6: document search unit, 7: similarity output unit, 8: dictionary, 9: display unit, 10: post-processing unit, 11: word dictionary, 20: keyword

Claims

1. A document filing system for retrieving a predetermined document from document images according to a keyword, comprising: means for recognizing characters in the document image; collation means for collating the keyword with the recognized characters; and output means for outputting information on the document image as a document search result based on a result of the collation, wherein the collation means calculates, for each character, a similarity between a character in the keyword and the recognized character, and performs the collation based on a degree of coincidence between the characters in the keyword and the recognized characters calculated using the similarity, and the output means outputs, based on the value of the degree of coincidence, a document corresponding to the keyword as the document search result.

2. The document filing system according to claim 1, further comprising means for performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and, based on a result of the grammatical analysis, assigning to each character an evaluation value indicating a state of matching between the character and the word dictionary, wherein the collation means performs the collation based on a comprehensive evaluation value for each character obtained from the similarity and the evaluation value.

3. The document filing system according to claim 2, wherein the collation means distinguishes, among the recognized characters, characters whose character code can be uniquely determined from characters whose character code cannot be uniquely determined, based on whether or not the comprehensive evaluation value is equal to or greater than a predetermined value, and stores each uniquely determinable character by associating its character code with a predetermined flag.

4. The document filing system according to claim 3, wherein the collation means calculates a similarity between the uniquely determinable character and a character in the keyword based on whether or not their character codes are identical, and calculates the degree of coincidence using this similarity, and, for the character that cannot be uniquely determined, calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character.

5. The document filing system according to claim 3, further comprising means for extracting a feature amount from each of the characters in the keyword and the recognized characters, wherein the collation means calculates, for the character that cannot be uniquely determined, a similarity between the feature amounts, and calculates the degree of coincidence using this similarity.

6. The document filing system according to claim 3, further comprising means for calculating a ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, wherein, when the ratio is equal to or greater than a predetermined value, the output means displays an indication that the predetermined document cannot be properly retrieved.

7. A document filing method for retrieving a predetermined document from document images according to a keyword, comprising: a step of recognizing characters in the document image; a step of calculating, for each character, a similarity between a character in the keyword and the recognized character; a step of calculating, using the similarity, a degree of coincidence between the characters in the keyword and the recognized characters; a collation step of collating the characters in the keyword with the recognized characters based on the degree of coincidence; and an output step of outputting, based on a result of the collation, a document corresponding to the keyword as a document search result.

8. The document filing method according to claim 7, further comprising a step of performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and, based on a result of the grammatical analysis, assigning to each character an evaluation value indicating a state of matching between the character and the word dictionary, wherein the collation step performs the collation based on a comprehensive evaluation value for each character obtained from the similarity and the evaluation value.

9. The document filing method according to claim 8, wherein the collation step distinguishes, among the recognized characters, characters whose character code can be uniquely determined from characters whose character code cannot be uniquely determined, based on whether or not the comprehensive evaluation value is equal to or greater than a predetermined value, and stores each uniquely determinable character by associating its character code with a predetermined flag.

10. The document filing method according to claim 9, wherein the collation step calculates a similarity between the uniquely determinable character and a character in the keyword based on whether or not their character codes are identical, and calculates the degree of coincidence using this similarity, and, for the character that cannot be uniquely determined, calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character.

11. The document filing method according to claim 9, further comprising a step of extracting a feature amount from each of the characters in the keyword and the recognized characters, wherein the collation step calculates, for the character that cannot be uniquely determined, a similarity between the feature amounts, and calculates the degree of coincidence using this similarity.

12. The document filing method according to claim 9, further comprising a step of calculating a ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, wherein, when the ratio is equal to or greater than a predetermined value, the output step displays an indication that the predetermined document cannot be properly retrieved.