JP4334068B2

JP4334068B2 - Keyword extraction method and apparatus for image document

Info

Publication number: JP4334068B2
Application number: JP19421199A
Authority: JP
Inventors: 淳之後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1999-07-08
Filing date: 1999-07-08
Publication date: 2009-09-16
Anticipated expiration: 2019-07-08
Also published as: JP2001022773A

Description

[0001]
[Technical field of the invention]
The present invention relates to an image document keyword extraction method, and more particularly, to a keyword extraction method for extracting a keyword from an image document recognized by OCR character recognition.
[0002]
[Prior art]
With the recent spread of PCs, the development of infrastructure centered on the Internet, and growing environmental awareness such as the protection of forest resources, the means of transmitting and storing information are shifting from conventional paper-based media to digital ones. Good examples are information exchange by e-mail and the browsing and publishing of information on the web. However, the vast amount of existing paper-based information remains an intellectual asset for both individuals and businesses, and cannot simply be discarded. Information on paper must therefore be converted into digital information so that it can still be used.
[0003]
The conversion into digital information is generally performed as follows. First, a paper document (that is, a document whose information medium is paper) is read with a scanner and converted into an image document. If the goal is merely to digitize the document, this is sufficient. However, to make the contents of the document searchable, the image document must be converted into code information using OCR (Optical Character Recognition), provided that the image document contains character information. When a large number of documents are managed and an unspecified number of people use them, information such as summaries or keywords makes it easier to grasp the outline of a document when searching and browsing.
[0004]
As described above, a paper document can be converted into useful digital information by “scan + OCR + keyword / summary extraction”. However, OCR cannot guarantee a 100% recognition rate even for English text, let alone Japanese text, whose character set includes complex shapes such as kanji, so the code information produced by OCR usually contains erroneous codes. If the quality of the scanned image is poor because of the condition of the paper document or dirt on the scan surface of the scanner, the text converted by OCR contains a high proportion of errors. When keywords are extracted from text containing such errors, it is quite likely that the extracted keywords include erroneous characters.
[0005]
Automatic OCR performs OCR without human intervention. If a person instead corrects the characters misrecognized by OCR one by one before keyword extraction, the keywords will no longer contain misrecognized characters. However, considering the time from scanning a paper document to keyword extraction, manually assisted OCR takes too much processing time and labor to be practical. Automatic OCR outputs an OCR result file as its conversion result. The OCR result file is a binary file that includes the certainty factor and position information of each character.
[0006]
As described above, generally, when keywords are extracted from text generated by subjecting an image document to OCR, a character string composed of misrecognized characters may be included in the keyword list. This problem seems to be solved by taking a misrecognition countermeasure that excludes characters with low confidence in the recognition result of the OCR from the text. However, the following problem actually occurs.
[0007]
(1) If a character string produced by removing misrecognized characters and processing the remainder in some way becomes a keyword, and that character string cannot be found in the original image document, this is a defect from the user's point of view.
(2) To process the text into character strings that the user can confirm, at least the text passed to the keyword extraction unit must be divided into words. That is, it must be possible to check the word length and judge how many misrecognized characters a word can contain without causing a problem. For that purpose, the automatic OCR would need to perform morphological analysis after converting the image into text, which makes the processing system heavy.
[0008]
The present invention has been made in view of the above circumstances. Its object is to provide a keyword extraction method and apparatus for image documents that can guarantee that, even if the text converted by OCR contains errors, the extracted keywords contain only a tolerable number of errors (for example, even if one character of a six-character keyword is wrong, it is still easy to understand what the keyword means).
[0009]
[Means for Solving the Problems]
The invention of claim 1 comprises: a step in which plain text / certainty factor file generation means receives an OCR result file in which, by OCR character recognition of an image document, candidates of character information, each including a character code and certainty factor information for that character code, have been generated for each character, extracts for each character the character code and the certainty factor information included in the first-candidate character information from among the candidates, and generates plain text from the character codes and a certainty factor file from the certainty factor information; a step in which keyword extraction means performs morphological analysis and keyword extraction on the obtained plain text to generate a keyword list; and a step in which keyword verification means judges, character by character and on the basis of a preset threshold, whether each character of the keywords in the generated keyword list has been misidentified, and removes from the keyword list any keyword that contains more misidentified characters than a number calculated on the basis of a predetermined condition.
[0012]
The invention of claim 2 is the invention of claim 1, further comprising a step of referring to the entries of the certainty factor list that correspond to a keyword included in the keyword list, specifying from the referenced entries the position information of the keyword in the image document, and highlighting the keyword in the image document on the basis of the specified position information.
The invention of claim 3 comprises: plain text / certainty factor file generation means that receives an OCR result file in which, by OCR character recognition of an image document, candidates of character information, each including a character code and certainty factor information for that character code, have been generated for each character, extracts for each character the character code and the certainty factor information included in the first-candidate character information from among the candidates, and generates plain text from the character codes and a certainty factor file from the certainty factor information; keyword extraction means that performs morphological analysis and keyword extraction on the obtained plain text to generate a keyword list; and keyword verification means that judges, character by character and on the basis of a preset threshold, whether each character of the keywords in the keyword list generated by the keyword extraction means has been misidentified, and removes from the keyword list any keyword that contains more misidentified characters than a number calculated on the basis of a predetermined condition.
The invention of claim 4 is the invention of claim 3, further comprising means for referring to the entries of the certainty factor list that correspond to a keyword included in the keyword list, specifying from the referenced entries the position information of the keyword in the image document, and highlighting the keyword in the image document on the basis of the specified position information.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Before describing the operation of the present invention, an outline of the structure of the OCR result file and the keyword list will be described. FIG. 1 is a diagram illustrating an example of a TAG structure of an OCR result file.
[0014]
There are three representative TAGs, as follows.
(1) Page information tag
The area pointed to by the start offset stores information about one page, such as image information (resolution, size, number of areas, and so on).
(2) Area information tag
The area pointed to by the start offset stores information such as the position of the area. Because areas may be nested, this tag has a slightly different structure from the page information tag, but since this is not directly related to the present invention, its description is omitted.
(3) Character information tag
As character information within one area, information such as the character code that is the recognition result, the certainty factor of the recognition result, and the recognition coordinate position (position of the rectangle surrounding the character, in pixels) is stored.
[0015]
Assuming that n characters exist in an area, the character information tag and the character information area have the structure shown in FIG. 2, and the recognition result (Ci) of the i-th character has the data structure shown in FIG. 3. Because a plurality of recognition results are produced for one character, each character has a plurality of candidates.
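As a rough illustration of the character information described above, the following Python sketch models one area as a list of characters, each carrying several recognition candidates. The class names, field names, the 0 to 100 certainty scale, and the choice to attach one rectangle per character (rather than per candidate) are all assumptions made for illustration; the actual layout is the one defined in FIG. 2 and FIG. 3.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    code: str        # recognized character code (hypothetical field name)
    certainty: int   # certainty factor of this result (0-100 scale is assumed)


@dataclass
class CharInfo:
    rect: Tuple[int, int, int, int]  # assumed (left, top, right, bottom) in pixels
    candidates: List[Candidate]      # assumed to be ordered, first candidate first

    def first(self) -> Candidate:
        """Return the first (top-ranked) recognition candidate."""
        return self.candidates[0]


# One area containing n = 2 characters, each with two candidates.
area = [
    CharInfo((10, 10, 30, 30), [Candidate("O", 92), Candidate("0", 41)]),
    CharInfo((32, 10, 52, 30), [Candidate("C", 35), Candidate("G", 30)]),
]

first_candidates = "".join(ch.first().code for ch in area)
```

Only the first candidate of each character is read here, which matches the way the preprocessing step described later uses the OCR result file.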
[0016]
(Description of operation)
FIG. 4 is a diagram for explaining the schematic flow of keyword extraction processing to which the present invention is applied. A paper document 1 is scanned (scan 2) by a scanner and converted into an image document 3, which is processed by automatic OCR 4 to output an OCR result file 5. The OCR result file 5 is then processed by the keyword extraction unit 6, which outputs a keyword list 7. FIG. 5 shows in more detail the keyword extraction unit 6 of FIG. 4 that performs the keyword extraction of the present invention. The processing in the keyword extraction unit 6 is described in detail below.
[0017]
[1] Preprocessing
As preprocessing, the plain text / certainty factor file generation unit 6a extracts the character code and certainty factor of the first candidate from the OCR result file 5 obtained by the automatic OCR 4, and generates plain text 6c and a certainty factor file 6b. The plain text 6c is passed to the keyword extraction unit 6e. The certainty factor file 6b has the structure shown in FIG. 6 and holds, for each character, the certainty factor of the first candidate only.
[0018]
If the input image document consists of a single sheet, the certainty factors can be looked up directly in the OCR result file. Otherwise, the OCR result files of all the input images that make up the document would have to be retained until keyword extraction finishes, so certainty factor files are created separately instead. A certainty factor file 6b is created for each page, whereas only one plain text 6c is created per document. Each entry of the certainty factor file 6b must be kept completely synchronized with the corresponding character in the plain text 6c.
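The preprocessing step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the in-memory representation of a page (a list of `(rect, candidates)` pairs), the function name `preprocess`, and the tuple layout of a certainty factor entry are hypothetical. It shows one plain text being built for the whole document while one list of entries is kept per page, with exactly one entry per character so that the two stay synchronized.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]             # assumed (left, top, right, bottom)
OcrChar = Tuple[Rect, List[Tuple[str, int]]] # rect and [(code, certainty), ...]


def preprocess(pages: List[List[OcrChar]]):
    """Build one plain text per document and one certainty list per page.

    Only the first candidate of each character is used. Each character is
    assumed to be a single code point, so that the index of a character in
    the plain text equals the index of its entry across the per-page lists.
    """
    text_chars: List[str] = []
    certainty_files: List[List[Tuple[int, Rect]]] = []
    for page in pages:
        entries = []
        for rect, candidates in page:
            code, certainty = candidates[0]   # first candidate only
            text_chars.append(code)
            entries.append((certainty, rect))
        certainty_files.append(entries)
    return "".join(text_chars), certainty_files


# Two pages: "ab" on the first page, "c" on the second.
pages = [
    [((0, 0, 10, 10), [("a", 90), ("o", 20)]),
     ((10, 0, 20, 10), [("b", 40), ("h", 35)])],
    [((0, 0, 10, 10), [("c", 85)])],
]
plain_text, certainty_files = preprocess(pages)
```

Because the plain text contains nothing but first-candidate characters, it can be handed to a keyword extractor that accepts only plain text, while the per-page lists keep what is needed for later verification.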
[0019]
[2] Keyword extraction unit
The keyword extraction unit 6e performs morphological and syntactic analysis of the text and, focusing on noun phrases, extracts keywords from information such as frequency of occurrence, similarity checks, idiom checks, modification relationships within a sentence, and dependency relationships, and generates a keyword list. The keyword extraction unit 6e accepts only plain text; it does not accept non-text files such as the certainty factor file. FIG. 7 shows an outline of the structure of the generated keyword list. The keyword list output from the keyword extraction unit 6e is a keyword list 6g that may include misidentified characters.
[0020]
[3] Post-processing
As post-processing, the keyword verification unit 6f checks the keywords one character at a time for misidentified characters, starting from the top-ranked keyword in the keyword list. Whether a character has been misidentified is determined by a threshold, which is decided dynamically for each image. Because the keyword extraction unit 6e also exposes the keyword position to the application as part of the keyword information, the start position of each keyword in the plain text input to the keyword extraction unit 6e is known.
[0021]
Since the OCR result file 5 and the plain text 6c are synchronized character by character, the certainty factor of a given character can be found by referring to the OCR result file 5. As shown in FIG. 8, a keyword that includes characters with low certainty is removed from the keyword list, and the next-ranked keyword is moved up. Repeating this operation until the desired number of keywords is reached yields a keyword list 7 free of misidentified characters.
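The lookup that this synchronization makes possible might look like the sketch below. It assumes the per-character data is held as one list of `(certainty, rect)` entries per page, which is a hypothetical layout, and that the keyword extractor reports a start offset and length in the plain text; `keyword_entries` is an illustrative name.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]
Entry = Tuple[int, Rect]  # (certainty, rect), an assumed entry layout


def keyword_entries(certainty_files: List[List[Entry]], start: int, length: int):
    """Return (page_no, certainty, rect) for each character of a keyword.

    `start` is the keyword's offset in the document-wide plain text. The
    per-page lists are concatenated in page order, so the same offset
    indexes the matching entries.
    """
    flat = [
        (page_no, certainty, rect)
        for page_no, page in enumerate(certainty_files)
        for certainty, rect in page
    ]
    return flat[start:start + length]


certainty_files = [
    [(90, (0, 0, 10, 10)), (40, (10, 0, 20, 10))],  # page 0
    [(85, (0, 0, 10, 10))],                          # page 1
]
entries = keyword_entries(certainty_files, start=1, length=2)
```

Returning the page number alongside each entry is a design choice made here so that the same result can serve both the certainty check and any later lookup of character positions on the page.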
[0022]
The following judgment formula is used to decide whether to remove a keyword from the keyword list.
Number of misidentified characters in the keyword > number of characters constituting the keyword × (α / 100)
Here α is a percentage value given by the user. For example, if keywords containing 10% or more misidentified characters are not wanted, α is set to 10. In practice, the user can still make out a four-character keyword that contains about one misidentified character, so the misidentified character is replaced with a symbol such as “?” and presented to the user. (If one of two characters is misidentified, it cannot be replaced with “?”; such cases are handled within a common-sense range.) In this way, legible keywords can be obtained from the image document in the form of text codes.
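The judgment formula and the walk down the ranked keyword list can be sketched as follows. The function names are hypothetical, and the sketch assumes that a character counts as misidentified when its certainty factor is below the threshold; the text only states that a threshold is used, so the direction of the comparison is an assumption. The special handling of very short keywords mentioned above is not modeled.

```python
from typing import List, Sequence, Tuple


def should_remove(certainties: Sequence[int], threshold: float, alpha: float) -> bool:
    """Apply: misidentified count > keyword length * (alpha / 100)."""
    misidentified = sum(1 for c in certainties if c < threshold)  # assumed direction
    return misidentified > len(certainties) * (alpha / 100)


def mask(keyword: str, certainties: Sequence[int], threshold: float) -> str:
    """Replace characters judged misidentified with '?'."""
    return "".join("?" if c < threshold else ch
                   for ch, c in zip(keyword, certainties))


def verify(ranked: List[Tuple[str, Sequence[int]]],
           threshold: float, alpha: float, wanted: int) -> List[str]:
    """Walk the list from the top rank and keep acceptable keywords
    until the desired number is reached."""
    kept: List[str] = []
    for keyword, certainties in ranked:
        if should_remove(certainties, threshold, alpha):
            continue  # dropped; the next-ranked keyword moves up
        kept.append(mask(keyword, certainties, threshold))
        if len(kept) == wanted:
            break
    return kept


ranked = [
    ("scanner", [90, 90, 20, 20, 90, 90, 90]),  # 2 misidentified of 7
    ("keyword", [90, 90, 90, 20, 90, 90, 90]),  # 1 misidentified of 7
    ("image",   [90, 90, 90, 90, 90]),          # none
]
result = verify(ranked, threshold=50, alpha=25, wanted=2)
```

With α = 25 the limit for a seven-character keyword is 1.75, so the first keyword (two misidentified characters) is dropped and the second (one misidentified character) is kept with that character masked.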
[0023]
[4] Highlighting keywords in the image document
Using the character position information (position in the image) in the certainty factor file, the extracted keywords can be displayed on the image document. Since the entries of the certainty factor file and the characters of the plain text correspond one to one, once the keyword list gives the position of a keyword in the plain text, its position in the certainty factor file is also known. From the position in the certainty factor file, the position of the first character of the keyword within the page of the image document is obtained, and the positions of the other characters of the keyword are obtained in the same way. Each character position is expressed as a rectangle, so to emphasize a keyword, an OR mask with the RGB value of a suitable color may be applied to the rectangular region surrounding each character. It is also possible to change only the color of the characters by bit operations.
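A minimal sketch of the OR mask over a character rectangle is shown below. It assumes an image held as rows of 8-bit `(r, g, b)` tuples and a rectangle given as `(left, top, right, bottom)` with the right and bottom edges exclusive; both conventions, and the name `highlight`, are assumptions for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Rect = Tuple[int, int, int, int]  # assumed (left, top, right, bottom), half-open


def highlight(image: List[List[Pixel]], rect: Rect, rgb: Pixel) -> None:
    """OR the given RGB value into every pixel inside the rectangle (in place)."""
    left, top, right, bottom = rect
    for y in range(top, bottom):
        for x in range(left, right):
            r, g, b = image[y][x]
            image[y][x] = (r | rgb[0], g | rgb[1], b | rgb[2])


# A 3x2 image: mostly black pixels with one white pixel at (x=1, y=0).
image = [
    [(0, 0, 0), (255, 255, 255), (0, 0, 0)],
    [(0, 0, 0), (0, 0, 0), (0, 0, 0)],
]
highlight(image, (0, 0, 2, 1), (255, 255, 0))
```

With 8-bit channels, a bitwise OR leaves pure white pixels unchanged and turns pure black pixels into the mask color, so on a page with black text on a white background this particular operation recolors the glyph pixels inside the rectangle.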
[0024]
[5] Image quality of the image document
When the image quality of an image document is poor, a large number of misidentified characters enter the OCR result file. In such a situation, even if keywords are extracted by the methods of [1] to [3] above, they may all belong to the lower ranks of the keyword list, or the keyword list may end up with no keywords at all, making the keyword extraction processing meaningless. To handle this, a statistical value 6d (such as the standard deviation) of the per-character certainty factors is calculated for each certainty factor file, and whether to perform keyword extraction is decided by comparing this statistical value with a “threshold value”.
[0025]
This statistic also determines a “threshold” that is used to determine whether a character in the OCR result file is misidentified.
Character misidentification threshold = Func (statistical value of certainty factor file)
The specific definition of Func (statistical value of certainty factor file) is omitted. The formula indicates that the statistical value of the certainty factor file affects the threshold for judging character misidentification. The “threshold value” is determined dynamically for each image. Because the condition of each image differs, quantifying image quality from the OCR result is a reasonable approach.
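The two uses of the statistical value can be sketched as below. Since the definition of Func is omitted in the text, the sketch takes it as a caller-supplied function rather than defining it; the lambda in the usage lines is purely a placeholder and not the patent's formula. Using the mean as the statistic (the text mentions the standard deviation as one example), treating a low mean as poor image quality, and the names `misidentification_threshold` and `min_quality` are likewise assumptions.

```python
from statistics import mean
from typing import Callable, Optional, Sequence


def misidentification_threshold(
    certainties: Sequence[int],
    min_quality: float,
    func: Callable[[float], float],
) -> Optional[float]:
    """Return the per-image misidentification threshold, or None to skip.

    The statistic here is the mean certainty (an assumption). If it falls
    below `min_quality`, keyword extraction is judged not worthwhile for
    this image; otherwise the caller-supplied `func` maps the statistic
    to the threshold.
    """
    stat = mean(certainties)
    if stat < min_quality:
        return None  # image quality too poor: do not extract keywords
    return func(stat)


# Placeholder Func for illustration only; the real definition is not given.
placeholder_func = lambda stat: stat / 2

good_page = misidentification_threshold([80, 90, 70], 50, placeholder_func)
poor_page = misidentification_threshold([10, 20, 30], 50, placeholder_func)
```

Passing Func in as a parameter keeps the sketch honest about what the text leaves unspecified while still showing that the threshold is derived per image from the certainty statistics.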
[0026]
[Effects of the invention]
According to the present invention, even if the text converted by OCR contains errors, it can be guaranteed that the extracted keywords contain only a tolerable number of errors (for example, even if one character of a six-character keyword is wrong, it is still easy to understand what the keyword means), so that user convenience can be improved without making the processing system heavy.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an example of a TAG structure of an OCR result file.
FIG. 2 is a diagram illustrating an example of a structure of a character information tag and a character information area in an OCR result file.
FIG. 3 is a diagram illustrating an example of the data structure of the i-th character shown in FIG. 2.
FIG. 4 is a flowchart for explaining the outline of the processing operation of the keyword extracting method according to the present invention.
FIG. 5 is a diagram for explaining the keyword extraction unit shown in FIG. 4 in more detail.
FIG. 6 is a diagram illustrating an example of the structure of a certainty factor file.
FIG. 7 is a diagram illustrating an example of the structure of a keyword list.
FIG. 8 is a diagram for explaining processing for removing a keyword from a keyword list.
[Explanation of symbols]
1 ... Paper document, 2 ... Scan, 3 ... Image document, 4 ... Automatic OCR, 5 ... OCR result file, 6 ... Keyword extraction unit, 6a ... Plain text / certainty factor file generation unit (preprocessing), 6b ... Certainty factor file, 6c ... Plain text, 6d ... Statistical value of certainty factor, 6e ... Keyword extraction unit, 6f ... Keyword verification unit (post-processing), 6g ... Keyword list including misidentification, 7 ... Keyword list.

Claims

1. A keyword extraction method for an image document, comprising: a step in which plain text / certainty factor file generation means receives an OCR result file in which, by OCR character recognition of the image document, candidates of character information, each including a character code and certainty factor information for that character code, have been generated for each character, extracts for each character the character code and the certainty factor information included in the first-candidate character information from among the candidates, and generates plain text from the character codes and a certainty factor file from the certainty factor information; a step in which keyword extraction means performs morphological analysis and keyword extraction on the obtained plain text to generate a keyword list; and
a step in which keyword verification means judges, character by character and on the basis of a preset threshold, whether each character of the keywords in the generated keyword list has been misidentified, and removes from the keyword list any keyword that contains more misidentified characters than a number calculated on the basis of a predetermined condition.

2. The keyword extraction method for an image document according to claim 1, further comprising a step of referring to the entries of the certainty factor list that correspond to a keyword included in the keyword list, specifying from the referenced entries the position information of the keyword in the image document, and highlighting the keyword in the image document on the basis of the specified position information.

3. A keyword extraction apparatus for an image document, comprising: plain text / certainty factor file generation means that receives an OCR result file in which, by OCR character recognition of the image document, candidates of character information, each including a character code and certainty factor information for that character code, have been generated for each character, extracts for each character the character code and the certainty factor information included in the first-candidate character information from among the candidates, and generates plain text from the character codes and a certainty factor file from the certainty factor information; keyword extraction means that performs morphological analysis and keyword extraction on the obtained plain text to generate a keyword list; and
keyword verification means that judges, character by character and on the basis of a preset threshold, whether each character of the keywords in the keyword list generated by the keyword extraction means has been misidentified, and removes from the keyword list any keyword that contains more misidentified characters than a number calculated on the basis of a predetermined condition.

4. The keyword extraction apparatus for an image document according to claim 3, further comprising means for referring to the entries of the certainty factor list that correspond to a keyword included in the keyword list, specifying from the referenced entries the position information of the keyword in the image document, and highlighting the keyword in the image document on the basis of the specified position information.