JP5357711B2 - Document processing device

Info

Publication number: JP5357711B2
Application number: JP2009262114A
Authority: JP
Inventors: 光晴大峡
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2009-11-17
Filing date: 2009-11-17
Publication date: 2013-12-04
Anticipated expiration: 2029-11-17
Also published as: JP2011107966A

Description

The present invention relates to a business document processing apparatus, and in particular to a technique for automatically collecting the character misrecognition patterns that occur when, for example, OCR is applied to paper documents.

In recent years, organizations have been scanning the large volumes of paper business documents they have accumulated, recognizing the characters by OCR, and managing the resulting document data in a document management system, with the aim of improving searchability, storing paper documents safely, and sharing knowledge.

The recognition accuracy of OCR (Optical Character Reader) has risen as the technology has improved, but misrecognition still cannot be eliminated completely. Various countermeasures against misrecognition have therefore been devised, one of which is the use of a replacement dictionary. For character strings that are prone to misrecognition, pairs of a correct character string and a misrecognized character string are registered in the dictionary in advance; when a registered misrecognized character string appears in a document subjected to OCR, it is replaced with the correct character string. This method is effective when the misrecognition patterns of the target documents are known and comprehensively registered in the dictionary. Products that adopt a replacement dictionary include, for example, those of Non-Patent Documents 1 to 4, and the replacement dictionary is a common function used in character recognition by OCR.
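As an illustration of the replacement dictionary described above, the following is a minimal sketch in Python. The function name `apply_replacement_dictionary`, the dictionary entries, and the longest-pattern-first ordering are assumptions made for this example; the cited products may apply their dictionaries differently.

```python
def apply_replacement_dictionary(text: str, replacements: dict[str, str]) -> str:
    """Replace each registered misrecognized string with its correct string."""
    # Assumption: longer patterns are applied first so that a short pattern
    # cannot break up a longer one before it is matched.
    for wrong in sorted(replacements, key=len, reverse=True):
        text = text.replace(wrong, replacements[wrong])
    return text


# Hypothetical dictionary entries (misrecognized string -> correct string).
replacements = {"糸内品書": "納品書", "秩式会社": "株式会社"}

print(apply_replacement_dictionary("糸内品書 目△ソフトウア秩式会社殿", replacements))
# 納品書 目△ソフトウア株式会社殿
```

Only strings that are registered get corrected, which is why the text stresses that the dictionary must be comprehensive: "目△ソフトウア" is left untouched here because no entry covers it.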

Non-Patent Document 1: SEIKO EPSON CORPORATION, EPSON SALES JAPAN CORPORATION, 2007, "読んde!!ココ", [online], [retrieved June 30, 2009], Internet <URL: http://ai2you.com/OCR/>
Non-Patent Document 2: Media Drive Corporation, "WinReaderPro", [online], [retrieved June 30, 2009], Internet <URL: http://mediadrive.jp/products/wrp/index.html>
Non-Patent Document 3: Media Drive Corporation, "e.Typist", [online], [retrieved June 30, 2009], Internet <URL: http://mediadrive.jp/products/et/>
Non-Patent Document 4: Panasonic Solution Technologies Co., Ltd., 2009, "読取革命", [online], [retrieved June 30, 2009], Internet <URL: http://panasonic.co.jp/pss/pstc/products/yomikaku/>

However, misrecognition patterns are generally registered in the replacement dictionary by users on the basis of their experience. Users must therefore register the patterns one by one, and the amount of work becomes enormous. In addition, the quality of the registered patterns varies with the user's skill, so frequently occurring misrecognition patterns may be left unregistered, and rarely occurring misrecognition patterns may be over-learned. An unregistered pattern leaves the misrecognized character string in the document after OCR processing, while over-learning causes even correct character strings in the document to be replaced. In either case the recognition accuracy of OCR is degraded.

The present invention has been made in view of this situation, and provides a technique that automatically collects the misrecognition patterns to be registered in the replacement dictionary used for correcting OCR misrecognition, and that further screens the collected misrecognition patterns.

To solve the above problem, a document processing apparatus according to the present invention automatically generates a replacement dictionary for correcting OCR-misrecognized character strings, and comprises: a matching processing unit that determines misrecognition using, as the unit of comparison, a correct character string segmented from sample electronic document data obtained by imaging a business document and a post-OCR character string segmented from post-OCR sample document data obtained by performing OCR on the sample electronic document data; an analysis processing unit that segments the correct character string into predetermined word units and registers, as misrecognition pattern candidates, those segmented words that contain a character determined by the matching processing unit to be misrecognized; and a filtering processing unit that filters the misrecognition pattern candidates by deleting any word that partially or completely matches a word contained in Japanese dictionary data, in which Japanese words are registered, or in business word dictionary data, in which words used in business are registered, both stored in a storage device, and that stores the filtered candidates in the storage device as misrecognition patterns.

Further features of the present invention will become apparent from the following description of the best mode for carrying out the invention and from the accompanying drawings.

According to the present invention, the misrecognition patterns needed to create a replacement dictionary for correcting misrecognition that occurs when OCR is applied to a document can be collected automatically. This greatly reduces the user's workload and prevents misrecognition patterns from being left out of the replacement dictionary.

In addition, filtering the collected misrecognition patterns prevents over-learning, in which even correct character strings are replaced.

As a result, a replacement dictionary of uniform quality can be created regardless of the user's skill.

FIG. 1 is a functional block diagram schematically showing the configuration of a business document processing apparatus according to an embodiment of the present invention.
FIG. 2 shows an example of an image obtained by printing and scanning the sample electronic document data 51 stored in the storage device shown in FIG. 1.
FIG. 3 shows an example of the post-OCR sample document data stored in the storage device shown in FIG. 1.
FIG. 4 shows an example of the Japanese dictionary data stored in the storage device shown in FIG. 1.
FIG. 5 shows an example of the business word dictionary data stored in the storage device shown in FIG. 1.
FIG. 6 is a flowchart for explaining the matching processing unit of the OCR misrecognition correction program.
FIG. 7 shows an example of the character string comparison data stored in the data memory shown in FIG. 1.
FIG. 8 is a flowchart for explaining the analysis processing unit of the OCR misrecognition correction program.
FIG. 9 shows an example of the misrecognition pattern candidate data stored in the data memory shown in FIG. 1.
FIG. 10 is a flowchart for explaining the filtering unit of the OCR misrecognition correction program.
FIG. 11 shows an example of a confirmation screen presenting the output misrecognition correction patterns.

Hereinafter, an embodiment of the misrecognition pattern collection apparatus of the present invention will be described in detail with reference to the accompanying drawings. FIGS. 1 to 11 illustrate the embodiment. In these figures, parts bearing the same reference numerals denote the same items, and their basic configuration and operation are the same. The equipment, methods, and the like used in the embodiment are examples, and the present invention is of course not limited to them.

<Configuration of the misrecognition pattern collection apparatus>
FIG. 1 is a functional block diagram showing the schematic configuration of the misrecognition pattern collection apparatus according to the embodiment of the present invention. The apparatus comprises a storage device 50, an input/output device 30 for inputting and outputting data, a central processing unit 10 that performs the necessary arithmetic and control processing, a program memory 40 that stores the programs required for processing in the central processing unit 10, and a data memory 20 that stores the data required for processing in the central processing unit 10.

The storage device 50 stores: sample electronic document data 51, such as a Windows (registered trademark) Word file, created so as to have the same structure as the business documents to be subjected to OCR; post-OCR sample document data 52, obtained by printing, scanning, and performing OCR on the sample electronic document data 51; misrecognition pattern data 53, which is the final output of the present invention; Japanese dictionary data 54, in which a large number of general Japanese words are registered; business word dictionary data 55, in which a large number of words appearing in documents used in business are registered; and corrected electronic document data 56, obtained by correcting the misrecognitions in the post-OCR sample document data 52 using the misrecognition pattern data 53.

The input/output device 30 has an output unit, consisting of a display device 32 for displaying data, a printer (not shown), and the like, and an input unit, consisting of a keyboard 31 and a pointing device 33 such as a mouse for operations such as selecting menus for the displayed data, a scanner 34 for capturing documents, and the like.

The program memory 40 contains a matching processing unit 41, which compares the characters in the sample electronic document data 51 with the corresponding characters in the post-OCR sample document data 52; an analysis processing unit 42, which outputs misrecognition pattern candidates based on the results output by the matching processing unit; and a filtering processing unit 43, which deletes, from the misrecognition pattern candidates output by the analysis processing unit, unnecessary patterns that would cause erroneous corrections. Each processing unit is stored in the program memory 40 as program code and is realized by the central processing unit 10 executing that code.

The data memory 20 contains character string comparison data 21, obtained from the sample electronic document data 51 and the post-OCR sample document data 52, and misrecognition pattern candidate data 22, derived from the character string comparison data 21.

FIG. 2 shows an example of an image obtained by printing and scanning the sample electronic document data 51 stored in the storage device 50. The document contains three continuous character strings. The purpose is to identify the characters misrecognized in OCR processing (FIG. 7) by performing OCR character recognition on the scanned image of the sample electronic document data 51 (FIG. 2) and comparing the result with the original sample electronic document data 51.

FIG. 3 shows an example of the post-OCR sample document data 52 in the storage device 50, namely the result of applying OCR to the image of FIG. 2. Some of the characters have been misrecognized.

FIG. 4 shows an example of the Japanese dictionary data 54 in the storage device 50. A large number of commonly used Japanese words are registered, together with information such as part of speech, classification as an independent or attached word, base form, reading, and conjugation type. This information is used in the analysis processing described later. Specifically, during analysis, whether to register a misrecognition pattern is decided according to whether the post-OCR character string partially matches any word in the Japanese dictionary data, conjugated forms included. Information such as the conjugation type is registered so that this decision can take into account the case where the post-OCR character string matches one of a word's conjugated forms.

FIG. 5 shows an example of the business word dictionary data 55 in the storage device 50. A large number of words (nouns) used in business are registered. The registered words need not be limited to nouns, and information such as part of speech, classification as an independent or attached word, base form, reading, and conjugation type may also be included.

FIG. 7 shows an example of the character string comparison data 21 in the data memory 20, representing the result of comparing the character strings segmented from the sample electronic document data 51 and from the post-OCR sample document data 52. When the characters making up the strings are compared, the misrecognition flag is set to "0" if the characters are identical and to "1" if they differ.

FIG. 9 shows an example of the misrecognition pattern candidate data 22 in the data memory 20, representing the misrecognition patterns output for the data of FIGS. 2 and 3. The correct character strings in the figure are strings segmented from the sample electronic document data 51. The misrecognized character strings are the strings, segmented from the post-OCR sample document data 52, that correspond to those correct character strings. Misrecognition patterns in the post-OCR sample document data are output as misrecognized character strings in units that carry a coherent meaning, and the strings corresponding to them are output from the sample electronic document data 51 as correct character strings.

<Processing in the misrecognition pattern collection apparatus>
Next, an outline is given of the processing performed by the central processing unit 10 in the misrecognition pattern collection apparatus configured as described above.

First, the matching processing unit 41 reads the sample electronic document data 51 and the post-OCR sample document data 52, identifies the corresponding characters in the two sets of data, and determines whether each character has been misrecognized. The result is stored in the data memory 20 as character string comparison data 21.

Next, the analysis processing unit 42 reads the sample electronic document data 51 and the word dictionary data 54, and, if the sample electronic document data 51 contains words registered in the word dictionary data 54, divides the character strings into units of those words. It then reads the character string comparison data 21, derives a misrecognition pattern for each word unit, and stores the result in the data memory 20 as misrecognition pattern candidate data 22.

Next, the filtering processing unit 43 reads the misrecognition pattern candidate data 22, deletes unnecessary misrecognition patterns, and stores the result in the storage device 50 as misrecognition pattern data 53.
Each process is described in detail below.

<Matching processing>
Here, character strings are segmented from the sample electronic document data 51 and the post-OCR sample document data 52, each pair of corresponding segmented strings is taken as the unit of comparison, and misrecognition is determined for each character making up the strings.

FIG. 6 is a flowchart showing an outline of the matching processing.
First, the matching processing unit 41 reads the sample electronic document data 51 and the post-OCR sample document data 52, and performs the following processing on each pair of corresponding documents (step 601).

The selected pair of sample electronic document data 51 and post-OCR sample document data is first divided into continuous character strings (step 602). In the example of FIGS. 2 and 3, "日△ソフトウェア株式会社殿" corresponds to "目△ソフトウア秩式会社殿", "納品書" to "糸内晶書", and "平成21年6月25日" to "平成21年6月25日". The strings can be associated with each other by using, for example, the coordinate information of the characters.

Next, taking the character strings segmented from the sample electronic document data 51 as correct character strings, the following processing is performed on every character in every correct character string (steps 603 and 604).

The correct character string is compared with the corresponding character string in the post-OCR sample document data (the post-OCR character string) (step 605). The comparison checks, for each character of the correct character string, whether the corresponding character of the post-OCR character string agrees with it. Characters can be associated with each other by a general string matching technique such as DP matching (Dynamic Programming Matching). FIG. 7 shows the comparison result.

If the comparison finds the characters identical, the misrecognition flag is set to "0" (step 607); if they differ, it is set to "1" (step 608). If a character in the correct character string has no counterpart in the post-OCR character string (for example, "ェ" in FIG. 7), the misrecognition flag of that correct character is set to "1". If a single correct character corresponds to several post-OCR characters (for example, "ネ土" in FIG. 7), the misrecognition flag is set to "1" once for that one correct character. This processing is performed for all characters.
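The flag-setting rules above can be sketched with a small edit-distance alignment. This is only one possible reading of "DP matching": the function name `compare_strings`, the tuple layout, and the rule of attaching extra post-OCR characters to a neighbouring correct character are assumptions made for this example, not details given in the text.

```python
def compare_strings(correct: str, ocr: str) -> list[tuple[str, str, int]]:
    """Return one (correct_char, ocr_chars, flag) row per correct character."""
    n, m = len(correct), len(ocr)
    # dist[i][j] = edit distance between correct[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if correct[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # correct char missing in OCR
                             dist[i][j - 1] + 1)         # extra OCR char

    rows = []
    pending = ""  # extra OCR chars waiting to be attached to a correct char
    i, j = n, m
    while i > 0:  # trace the alignment back from the end of both strings
        cost = 0 if j > 0 and correct[i - 1] == ocr[j - 1] else 1
        if j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            ocr_part = ocr[j - 1] + pending
            rows.append((correct[i - 1], ocr_part, 0 if ocr_part == correct[i - 1] else 1))
            pending = ""
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            pending = ocr[j - 1] + pending
            j -= 1
        else:
            rows.append((correct[i - 1], pending, 1))  # no counterpart: flag 1
            pending = ""
            i -= 1
    rows.reverse()
    if j > 0 and rows:  # leftover OCR chars at the start belong to the first row
        rows[0] = (rows[0][0], ocr[:j] + rows[0][1], 1)
    return rows


print(compare_strings("ソフトウェア", "ソフトウア"))  # "ェ" has no counterpart
print(compare_strings("社", "ネ土"))                  # one character became two
```

Each row plays the role of one line of the character string comparison data: the correct character, whatever the OCR produced for it, and the misrecognition flag.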

The data finally obtained is stored in the data memory 20 as character string comparison data 21 (step 609).

<Analysis processing>
Here, the correct character string is segmented into predetermined word units, and those segmented words that contain a character whose misrecognition flag was set to 1 in the matching processing are registered as misrecognition pattern candidates.

FIG. 8 is a flowchart showing an outline of the analysis processing.

The analysis processing unit 42 reads the character string comparison data obtained by the matching processing unit and performs the following processing on all of it (step 801).

First, the correct character string is segmented into predetermined word units (step 802). Specifically, if the correct character string contains words registered in the Japanese dictionary data or the business word dictionary data, it is segmented into those words. If a segmented word itself contains further dictionary words, all of the words are segmented out separately. For example, the correct character string "納品書" in FIG. 2 is segmented as "納品書" and also as "納品".

Next, the following processing is performed on every segmented word (step 803). The character string comparison data is consulted, and if a segmented word contains a character whose misrecognition flag is "1", the corresponding post-OCR string is registered in the data memory 20 as a misrecognition pattern candidate together with that word of the correct character string. For example, in FIG. 7 the words "納品" and "納品書" contained in "納品書" both include a character whose misrecognition flag is "1". Accordingly, the pairs "納品" and "糸内品", and "納品書" and "糸内品書", are each registered as misrecognition pattern candidates. The processing of step 804 is performed in this way for all words. FIG. 9 shows an example of the misrecognition pattern candidates obtained for the data of FIGS. 2 and 3.
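The word segmentation and candidate registration described above can be sketched as follows. The input rows use the same hypothetical (correct character, post-OCR characters, flag) layout as the comparison data, and the function name `collect_candidates` and the plain list standing in for the two dictionaries are assumptions for this example.

```python
def collect_candidates(rows, dictionary_words):
    """Return (correct_word, misrecognized_string) pairs for flagged words."""
    correct = "".join(c for c, _, _ in rows)
    candidates = []
    for word in dictionary_words:
        start = correct.find(word)
        while start != -1:
            # rows holds one entry per correct character, so it can be sliced
            # with the same indices as the correct string.
            span = rows[start:start + len(word)]
            if any(flag == 1 for _, _, flag in span):
                pair = (word, "".join(o for _, o, _ in span))
                if pair not in candidates:
                    candidates.append(pair)
            start = correct.find(word, start + 1)
    return candidates


# Comparison data for "納品書" read by OCR as "糸内品書" (illustrative values).
rows = [("納", "糸内", 1), ("品", "品", 0), ("書", "書", 0)]
words = ["納品書", "納品", "書"]

print(collect_candidates(rows, words))
# [('納品書', '糸内品書'), ('納品', '糸内品')]
```

"書" is not registered because none of its characters carries a flag of 1, which matches the rule that only words containing a misrecognized character become candidates.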

<Filtering processing unit>
Here, any word that partially or completely matches another word registered in the Japanese dictionary data or the business dictionary data is deleted from the misrecognition pattern candidates.

FIG. 10 is a flowchart showing an outline of the filtering processing unit. The filtering processing unit 43 reads the misrecognition pattern candidate data obtained by the analysis processing unit and performs the following processing on all misrecognition pattern candidates (step 1001).

The misrecognized character string of each misrecognition pattern candidate registered by the analysis processing is compared with each word in the Japanese dictionary data and the business dictionary data to determine whether it partially or completely matches any of them (step 1002). For example, the character string "目立" partially matches the Japanese word "目立つ" ("to stand out"). A string that partially or completely matches another word in this way must not be registered as a misrecognition pattern. If it were registered, not only the misrecognized word but also a different, correctly recognized word would be judged to be misrecognized, causing erroneous conversion. For this reason, such strings are deleted from the misrecognition pattern candidates (step 1003). For example, if "目立" were uniformly converted to "日立" (Hitachi) throughout a document, a string such as "目立つ" would be wrongly converted to "日立つ". Deleting unnecessary patterns through filtering prevents this kind of erroneous conversion.
Filtering is performed in this way for all registered misrecognition pattern candidates. In the example of FIG. 9, "目立" partially matches "目立つ", which is registered in the Japanese dictionary data and the business dictionary data, so the pair "目立" and "日立" is deleted from the misrecognition pattern candidates. Note that "日△ソフトウア" does not partially match "日△ソフトウェア". A partial match here means that the character string of the misrecognition pattern candidate is wholly contained in a word in the Japanese dictionary data or the business dictionary data. Because "日△ソフトウア" is not wholly contained in "日△ソフトウェア", it is not deleted from the misrecognition pattern candidates. The same applies to "糸内晶書" and "納品書". Conversely, "目立" is wholly contained in "目立つ", so "目立" is deleted from the misrecognition pattern candidates.
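The containment test that defines a partial match can be sketched as below. The function name `filter_candidates`, the pair layout, and the single word list standing in for both dictionaries are assumptions for this example; conjugated forms are not expanded here.

```python
def filter_candidates(candidates, dictionary_words):
    """Drop candidates whose misrecognized string occurs inside a dictionary word."""
    kept = []
    for correct, misrecognized in candidates:
        # Python's `in` on strings is a substring test, so it covers both a
        # complete match and a partial match as defined in the text.
        if any(misrecognized in word for word in dictionary_words):
            continue
        kept.append((correct, misrecognized))
    return kept


# Illustrative candidates as (correct string, misrecognized string) pairs.
candidates = [
    ("日立", "目立"),
    ("日△ソフトウェア", "目△ソフトウア"),
    ("納品書", "糸内品書"),
]
dictionary_words = ["目立つ", "日△ソフトウェア", "納品書"]

print(filter_candidates(candidates, dictionary_words))
# [('日△ソフトウェア', '目△ソフトウア'), ('納品書', '糸内品書')]
```

Only the "目立" pair is removed, because "目立" is wholly contained in the dictionary word "目立つ"; the other misrecognized strings appear in no dictionary word and are kept.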

Next, a confirmation screen showing the misrecognition pattern candidates that remain after filtering is displayed (step 1004). FIG. 11 shows an example of the confirmation screen. For each misrecognition pattern candidate, it shows the pair of misrecognized and correct character strings, an identification number, and a check box specifying whether to register the pattern. The user selects the misrecognition patterns to be finally registered by ticking or clearing the check boxes. Once all the patterns to be registered have been selected, pressing the "OK" button stores the corresponding misrecognition pattern candidate data in the storage device 50 as misrecognition pattern data 53. If the user does not approve, pressing "Cancel" cancels the processing.

<Summary>
In this embodiment, the character strings contained in the sample electronic document data 51 and the post-OCR sample document data 52 are segmented, and misrecognition is determined using the resulting correct character string and post-OCR character string as the unit of comparison. Next, the correct character string is segmented into predetermined word units, and those segmented words that contain a character whose misrecognition flag was set to 1 in the matching processing are registered as misrecognition pattern candidates. Finally, any word that partially or completely matches another word registered in the Japanese dictionary data or the business dictionary data is deleted from the misrecognition pattern candidates.

Executing this processing makes it possible to automatically collect the misrecognition patterns that arise when OCR is applied. In addition, filtering the collected misrecognition patterns prevents overlearning, in which even correct character strings end up being replaced. As a result, a uniform replacement dictionary can be created regardless of the user's skill, and the cost to the user of creating a replacement dictionary can be greatly reduced.

Also, in this embodiment, the matching processing unit determines misrecognition by comparing the correct character string with the post-OCR character string and setting a misrecognition flag for each character constituting the correct character string.

Furthermore, when a character constituting the correct character string does not exist in the post-OCR character string, the matching processing unit sets the misrecognition flag of that character to “1”. When a plurality of post-OCR characters correspond to a correct character, it sets the misrecognition flag of that one character of the correct character string to “1”.
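The per-character flagging can be illustrated with a short sketch. The alignment here uses Python's standard `difflib.SequenceMatcher` as a stand-in; it is not the matching procedure of the embodiment, and the function name `misrecognition_flags` is hypothetical.

```python
from difflib import SequenceMatcher


def misrecognition_flags(correct: str, ocr: str) -> list[int]:
    """Return one 0/1 flag per character of the correct character string.

    A character is flagged when the alignment reports it as deleted
    (absent from the post-OCR string) or replaced (including the case
    where several post-OCR characters stand in for one correct character).
    Characters that OCR inserted with no correct counterpart have no
    position in the correct string, so this sketch does not flag them.
    """
    flags = [0] * len(correct)
    matcher = SequenceMatcher(None, correct, ocr, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                flags[i] = 1
    return flags


# One character read as a different character
print(misrecognition_flags("日本語", "目本語"))
# One character missing from the OCR output
print(misrecognition_flags("文書処理", "文処理"))
# One character split into two OCR characters
print(misrecognition_flags("株式", "木朱式"))
```

The three calls correspond to a substituted character, a character missing from the post-OCR string, and one correct character recognized as several post-OCR characters.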

Thereafter, the analysis processing unit segments the correct character string into units of words contained in the Japanese dictionary data, in which Japanese words are registered, and the business word dictionary data, in which words used in business are registered, both stored in the storage device, and performs the deletion from the misrecognition pattern candidates.

Executing this processing allows registration in the replacement dictionary in units of words, and also prevents words that could cause overlearning from being registered in the replacement dictionary. Registering in units of characters causes overlearning frequently, so, apart from a few characters, registration in units of characters is in principle not possible. For example, if the misrecognition of “日” as “目” were registered in the replacement dictionary simply because it occurs, “目” would be replaced with “日” in every word that contains it.
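The difference between character-unit and word-unit registration can be seen in a small sketch. The replacement dictionary is modeled here as a plain mapping from misrecognized string to correct string, and `apply_replacements` is a hypothetical helper, not a function of any particular OCR product.

```python
def apply_replacements(text: str, replacements: dict) -> str:
    """Replace every occurrence of each misrecognized string in the text."""
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)
    return text


# Post-OCR text in which "日本" was misread as "目本"; "目的" is correct.
ocr_text = "目本の目的"

# Character-unit registration: also rewrites the correct word "目的".
char_level = apply_replacements(ocr_text, {"目": "日"})

# Word-unit registration: only the misrecognized word is rewritten.
word_level = apply_replacements(ocr_text, {"目本": "日本"})

print(char_level)
print(word_level)
```

The character-unit entry turns the correct word “目的” into “日的”, whereas the word-unit entry corrects “目本” and leaves “目的” untouched.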

In this embodiment, misrecognition patterns are identified in units of words containing a misrecognized character, and those patterns are registered in the replacement dictionary, so overlearning can be suppressed. Moreover, among the identified misrecognition patterns, those that could cause overlearning on another word are deleted and kept out of the replacement dictionary, which suppresses overlearning further.

The present invention can also be implemented by software program code that realizes the functions of the embodiment. In this case, a storage medium on which the program code is recorded is provided to a system or apparatus, and the computer (or CPU or MPU) of that system or apparatus reads the program code stored on the storage medium. The program code read from the storage medium then itself realizes the functions of the embodiment described above, and the program code itself and the storage medium storing it constitute the present invention. Storage media for supplying such program code include, for example, flexible disks, CD-ROMs, DVD-ROMs, hard disks, optical disks, magneto-optical disks, CD-Rs, magnetic tapes, nonvolatile memory cards, and ROMs.

Alternatively, an OS (operating system) or the like running on the computer may perform part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment described above may be realized by that processing. Further, after the program code read from the storage medium has been written to memory on the computer, the CPU or the like of the computer may perform part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment described above may be realized by that processing.

In addition, the software program code that realizes the functions of the embodiment may be distributed over a network and stored in storage means such as a hard disk or memory of the system or apparatus, or on a storage medium such as a CD-RW or CD-R. At the time of use, the computer (or CPU or MPU) of the system or apparatus may then read and execute the program code stored in that storage means or on that storage medium.

DESCRIPTION OF SYMBOLS
10 ... Central processing unit
20 ... Data memory
21 ... Character string comparison data
22 ... Misrecognition pattern candidate data
30 ... Input/output device
31 ... Keyboard
32 ... Display device
33 ... Pointing device
34 ... Scanner
40 ... Misrecognition pattern collection program
41 ... Matching processing unit
42 ... Analysis processing unit
50 ... Storage device
51 ... Sample electronic document data
52 ... Post-OCR sample document data
53 ... Misrecognition pattern data
54 ... Japanese dictionary data
55 ... Business word dictionary data
56 ... Corrected electronic document data

Claims

A document processing apparatus that automatically generates a replacement dictionary for correcting character strings misrecognized by OCR, the apparatus comprising:
a matching processing unit that determines misrecognition using, as the unit of comparison, a correct character string cut out from sample electronic document data obtained by imaging a business document and a post-OCR character string cut out from post-OCR sample document data obtained by performing OCR on the sample electronic document data; and
an analysis processing unit that divides the correct character string into predetermined word units and registers, as misrecognition pattern candidates, those of the divided words that include a character determined by the matching processing unit to be misrecognized,
wherein filtering is performed by deleting, from the misrecognition pattern candidates, any word that partially or completely matches a word included in Japanese dictionary data in which Japanese words are registered or in business word dictionary data in which words used in business are registered, both stored in a storage device, and the misrecognition pattern candidates remaining after the filtering are stored in the storage device as misrecognition patterns.

The document processing apparatus according to claim 1, wherein the matching processing unit determines the misrecognition by comparing the correct character string with the post-OCR character string and setting a misrecognition flag for each character constituting the correct character string.

The document processing apparatus according to claim 2, wherein, when a character constituting the correct character string does not exist in the post-OCR character string, the matching processing unit sets the misrecognition flag of that character to “1”, and, when a plurality of post-OCR characters correspond to a correct character, the matching processing unit sets the misrecognition flag of that one character constituting the correct character string to “1”.

The document processing apparatus according to claim 1, wherein the analysis processing unit divides the correct character string into units of words included in the Japanese dictionary data, in which Japanese words are registered, and the business word dictionary data, in which words used in business are registered, both stored in the storage device.