JPH05298495A

JPH05298495A - Character recognizing device, erroneous recognition character correcting method and occidental document processor

Info

Publication number: JPH05298495A
Application number: JP4106766A
Authority: JP
Inventors: Shigeru Owada; 茂大和田
Original assignee: Hitachi Engineering Co Ltd
Current assignee: Hitachi Engineering Co Ltd
Priority date: 1992-04-24
Filing date: 1992-04-24
Publication date: 1993-11-12

Abstract

PURPOSE: To improve the recognition rate and shorten correction time by automatically correcting an erroneously recognized word to the most likely word by means of a spelling checking means. CONSTITUTION: Character candidates in a character recognition dictionary means 23 that are judged to be similar to the image data of a segmented character are stored in a recognition result storage means 24. A word selecting means 25 then selects every word in the recognition result of the recognition result storage means 24 that is delimited by blank characters, special characters, and the like. The spelling checking means 26 compares each word selected by the word selecting means 25 with a word dictionary means 27 and judges whether the word is correct, for all of the selected words. A word judged to be erroneous by the spelling checking means 26 is collated with an erroneous recognition character database 29 in the subsequent new word generating means 28, the characters in the word are replaced, and new words are generated.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition device for documents in European languages read by OCR or the like, and more particularly to a character recognition device and a misrecognized-character correction method suited to making the correction of misrecognized characters easy.

[0002]

[Prior Art] Conventionally, when an operator corrects characters recognized by a character recognition device, the device holds recognition candidates for each character and displays them on a display device or the like, and the operator designates the correct character.

[0003] FIGS. 7(a) to 7(c) show display examples of the conventional method of correcting recognition results. FIG. 7 shows an example in which the character string "then" was recognized and the recognition result was "ther". When the operator finds that the character "r" in the displayed "ther" is a misrecognition, the operator designates the misrecognized character "r". The character recognition device then displays the other character candidates for the designated character "r", in this case the four characters "t", "n", "j", and "c". If the correct character is among them, the operator designates that character, for example "n". The character recognition device then corrects the character and displays "then".

[0004] A second prior art that improves on this first prior art is described in Japanese Patent Laid-Open No. 64-73483. This second prior art uses a spell check to correct character recognition candidates. When a word in the recognition result is judged erroneous by the spell check, the easily misidentified characters of that word are replaced with other characters using a misidentification matrix (a matrix in which easily confused characters are associated with one another in advance), the result is spell-checked again, and, if exactly one word is judged correct, the recognition candidates are changed to match that word.

[0005]

[Problems to Be Solved by the Invention] In the first prior art described above, even when a single word contains several misrecognized characters, the operator must designate each misrecognized character one at a time with a pointing device such as a mouse, display its candidates, and correct it, which is very tedious. In the second prior art, which uses a spell check function, the recognition candidates are not changed when the spell check judges more than one of the words produced by character replacement to be correct, so the operator ends up changing candidate characters one at a time using the correction method of the first prior art. Furthermore, because more than one word may be judged correct, the technique would have to be improved so that a single word is somehow selected and the candidates changed to it. The selected word cannot be guaranteed to be the true correct answer, however, and if it is wrong the operator again has to display character candidates and correct them by the first prior art. In short, the prior art has the problem that correcting misrecognized characters requires many operations before the correct result is reached, which places a heavy burden on the operator.

[0006] An object of the present invention is to provide a character recognition device, and a recognition method for it, that allow misrecognized characters to be corrected with a small number of operations.

[0007]

[Means for Solving the Problems] The above object is achieved by providing, in a character recognition device that segments and recognizes characters from image data of a foreign-language document input by an optical character reader or the like: (a) word selection means for selecting, from the recognition result, the words delimited by blank characters, special characters, and the like; (b) spell check means for judging whether a selected word contains an error by comparing it with a word dictionary prepared in advance; (c) a misidentified character database that stores data on characters found in advance to be easily misrecognized, together with weighting coefficients set according to their frequency of occurrence; (d) new word generation means for referring to the misidentified character database, replacing the easily misrecognized characters of a word judged erroneous by the spell check means, and generating new words; (e) word candidate storage means for judging all of the words generated by the new word generation means again with the spell check means and storing all of the words judged correct as word candidates; and (f) word candidate changing means for displaying the recognition result on a display device such as a CRT, displaying the other word candidates stored in the word candidate storage means when the operator designates a misrecognized word, and changing the word in the recognition result to the designated word candidate when one of the displayed candidates is designated.

[0008]

[Operation] With the above configuration, the word selection means selects, from the result of segmenting and recognizing the input document image, the words delimited by blank characters, special characters, and the like, and outputs those words to the spell check means.
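As a rough illustration of the word selection step, the following Python sketch splits a recognition result into words at blanks and special characters. The function name `select_words` and the rule that only runs of ASCII letters count as words are assumptions made for this example; the specification does not fix either.

```python
import re

def select_words(recognized_text):
    """Return (offset, word) pairs for each run of letters in the text.

    Blanks, punctuation, and other special characters act as separators.
    """
    return [(m.start(), m.group())
            for m in re.finditer(r"[A-Za-z]+", recognized_text)]

print(select_words("He said: ther, we go."))
```

Keeping the offset of each word is a design choice of this sketch: it makes it possible to write a corrected word back to the right place in the recognition result later.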

[0009] The spell check means receives a word, judges whether it exists in the word dictionary prepared in advance, and outputs the result to the new word generation means. The word dictionary used is one in which every word that can exist in the foreign language concerned is registered.
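The spell check described here amounts to a dictionary membership test. A minimal sketch follows, assuming a toy dictionary and case-insensitive matching; both are illustrative choices, and the names `WORD_DICTIONARY`, `is_correct`, and `misspelled` are hypothetical.

```python
# Toy word dictionary (assumed contents); a real one would register
# every word that can occur in the language.
WORD_DICTIONARY = frozenset({"the", "then", "than", "that", "we", "go"})

def is_correct(word, dictionary=WORD_DICTIONARY):
    """Spell check: a word is correct only if it appears in the dictionary."""
    return word.lower() in dictionary

def misspelled(words, dictionary=WORD_DICTIONARY):
    """Return the words that the spell check judges to be erroneous."""
    return [w for w in words if not is_correct(w, dictionary)]

print(misspelled(["then", "ther", "go"]))
```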

[0010] The misidentified character database stores three items of data for each recognized character: the correct character, the character output as the recognition result when it was misrecognized, and a weighting coefficient based on the number of occurrences counted for that pair. The data in this database include two kinds of misidentification data: data for cases in which character segmentation was correct but recognition was wrong, and data for cases in which character segmentation failed (for example, when touching characters were segmented as a single merged character, or when a faint character was segmented into separate pieces). The correct character and the recognition result character are therefore stored so as to allow for either being a string of several characters.
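One way to picture such a database is as a list of three-field records indexed by the recognized string. The sketch below is only an assumed representation: the `Confusion` type and `index_by_recognized` helper are hypothetical names, and the two sample rows use the values that the description of FIG. 3 gives later in this document. Both fields are strings rather than single characters so that segmentation failures can be recorded.

```python
from collections import defaultdict
from typing import NamedTuple

class Confusion(NamedTuple):
    correct: str     # character(s) actually printed
    recognized: str  # character(s) the recognizer output instead
    weight: float    # based on how often this pair was observed

def index_by_recognized(rows):
    """Group entries by recognized string so they can be looked up quickly."""
    index = defaultdict(list)
    for row in rows:
        index[row.recognized].append(row)
    return dict(index)

rows = [
    Confusion("t", "l", 30),   # segmentation correct, recognition wrong
    Confusion("rt", "n", 10),  # two touching characters read as one
]
index = index_by_recognized(rows)
print(index["n"][0].correct)
```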

[0011] The new word generation means looks up each character of a word that the spell check means has judged erroneous in the misidentified character database, retrieves the correct characters that are easily misrecognized as that character, replaces the character to generate new words, and outputs those words to the word candidate storage means. When a character is replaced, the weighting coefficient of the generated word is calculated from the weighting coefficients of the database entries used for the replacement.

[0012] The word candidate storage means has the words output by the new word generation means judged again by the spell check means and stores each word that is judged correct. The order in which the words are stored is determined by the weighting coefficients output by the new word generation means, and the word that appears most likely according to that value replaces the current recognition result.

[0013] The word candidate changing means displays the recognition result on the display device. When the operator designates a misrecognized character, it determines the word containing that character, retrieves the candidates for that word from the word candidate storage means, and displays them on the display device. When a word candidate is designated, it changes the recognition result to the designated word candidate.

[0014]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is an overall configuration diagram of a character recognition device according to an embodiment of the present invention, and FIG. 2 is a flowchart showing the character recognition processing procedure. Document image data input from the image input device 1 are stored in the document image data storage means 21 of the computer 2. The character recognition means 22 segments characters from the image data in the document image data storage means 21, performs preprocessing such as normalization and dust (noise) removal, and carries out character recognition by a recognition technique such as pattern matching against the character recognition dictionary means 23. It then stores in the recognition result storage means 24 the character candidates in the character recognition dictionary means 23 that are judged to resemble the image data of each segmented character (step 101).

[0015] The word selection means 25 selects all of the words delimited by blank characters, special characters, and the like in the recognition result held in the recognition result storage means 24 (step 102). The spell check means 26 compares each word selected by the word selection means 25 with the word dictionary means 27 and judges whether the word is correct, for all of the selected words (steps 103 and 104).

[0016] For a word judged erroneous by the spell check means 26 (step 105), the new word generation means 28 then replaces the characters of the word by collation with the misidentified character database 29 and generates new words (step 106).

[0017] FIG. 3 is a configuration diagram of the misidentified character database 29. The misidentified character database 29 is created by tallying the data misrecognized by the character recognition means 22 and consists of three items: the correct character, the recognition result character, and the weighting coefficient. The example in FIG. 3 is the tally of misrecognitions from a recognition test on a certain foreign-language document. The data in the first row indicate that the character "t" was misrecognized as the character "l" 30 times. The data in the fourth row indicate that the touching character string "rt" was segmented incorrectly and recognized as the character "n" 10 times. The misidentified character database 29 thus tallies pairs of easily confused characters and gives each pair a weighting coefficient based on its number of occurrences. In this embodiment the number of occurrences is used directly as the weighting coefficient, but it may instead be converted to, for example, a real value between 0 and 1, such as each number of occurrences divided by the maximum number of occurrences.
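The alternative weighting mentioned above, dividing each occurrence count by the largest count, can be sketched as follows. The dictionary layout keyed by (correct, recognized) pairs and the name `normalize_weights` are assumptions for illustration; the two counts are the ones quoted from FIG. 3.

```python
def normalize_weights(counts):
    """Scale raw occurrence counts to 0..1 by dividing by the maximum count."""
    peak = max(counts.values())
    return {pair: n / peak for pair, n in counts.items()}

# (correct, recognized) -> number of times the misrecognition was observed
counts = {("t", "l"): 30, ("rt", "n"): 10}
weights = normalize_weights(counts)
print(weights[("t", "l")])
```

Because every count is divided by the same positive value, the ordering of the entries is unchanged; only the scale differs from the raw counts used in the embodiment.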

[0018] FIG. 4 is a flowchart showing the procedure for generating new words by collation with the misidentified character database 29, and FIG. 5 shows an example of new word generation when, for example, the character string "then" was recognized as "ther". First, the number of characters in the recognized word "ther" is calculated (step 201), and each of its characters "t", "h", "e", and "r" is checked to see whether it is registered as a recognition result character in the misidentified character database 29 (steps 202 and 203). If it is, the corresponding correct character and weighting coefficient are stored as a change candidate (steps 204 and 205). That is, if the misidentified character database 29 is configured as shown in FIG. 5(b), then, as shown under the change candidates in FIG. 5(a), the correct character "f" with weighting coefficient 3 is stored as a change candidate for "t", and two change candidates are stored for "r": the correct character "n" with weighting coefficient 20 and the correct character "t" with weighting coefficient 4.
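The lookup of steps 202 to 205 can be sketched in Python as below. The table holds only the entries quoted in the text (including the "e" to "a" entry with weight 8 that the weight calculation in this embodiment relies on); the full contents of FIG. 5(b) are not reproduced in the text, so the table, like the names `CONFUSIONS` and `change_candidates`, is an assumption.

```python
# Assumed table: recognized character -> [(correct character, weight), ...]
CONFUSIONS = {
    "t": [("f", 3)],
    "e": [("a", 8)],
    "r": [("n", 20), ("t", 4)],
}

def change_candidates(word, confusions=CONFUSIONS):
    """Map each character position to its (correct character, weight) candidates.

    Positions whose character is not registered in the table are omitted.
    """
    return {i: confusions[ch] for i, ch in enumerate(word) if ch in confusions}

print(change_candidates("ther"))
```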

[0019] Once the change candidates have been retrieved, the characters of the recognition result are replaced with the change candidates (step 207) so as to generate every combination of change candidates (step 206); the new words are generated, their weighting coefficients are calculated, and they are stored (step 208). With this processing, in the example of FIG. 5(a), eight new words are generated, including "then" with weighting coefficient 20 and "than" with weighting coefficient 14.
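A hedged sketch of this combination step is given below. It enumerates every combination of "keep the character" or "replace it with a candidate", and gives each new word the sum of its substitution weights divided by the number of changed characters, which is the calculation this embodiment uses. The table contains only the entries quoted in the text, so the number of words it produces need not match the eight of FIG. 5; `CONFUSIONS` and `generate_new_words` are hypothetical names.

```python
from itertools import product

# Assumed table: recognized character -> [(correct character, weight), ...]
CONFUSIONS = {
    "t": [("f", 3)],
    "e": [("a", 8)],
    "r": [("n", 20), ("t", 4)],
}

def generate_new_words(word, confusions=CONFUSIONS):
    """Return {new word: weight} for every combination of substitutions."""
    # Each position may keep its character (weight None) or take a candidate.
    options = [[(ch, None)] + confusions.get(ch, []) for ch in word]
    new_words = {}
    for combo in product(*options):
        weights = [w for _, w in combo if w is not None]
        if not weights:
            continue  # no character changed: this is the original word
        # Word weight = sum of substitution weights / number of changed characters.
        new_words["".join(c for c, _ in combo)] = sum(weights) / len(weights)
    return new_words

new_words = generate_new_words("ther")
print(new_words["then"], new_words["than"], new_words["that"])
```

The number of combinations grows multiplicatively with the number of candidate characters per position, so a practical implementation would probably need to cap or prune the enumeration for long words.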

[0020] The method of calculating the weighting coefficient of a generated word in this embodiment is as follows. The weighting coefficients in the misidentified character database 29 of this embodiment are the occurrence counts themselves, so a larger value means a higher likelihood of occurrence. The weighting coefficient of a generated word must likewise be larger when the word is more likely, so it is calculated by dividing the sum of the weighting coefficients of the changed characters by the number of changed characters. For example, the generated word "than" is produced by changing "e" to the correct character "a" with weighting coefficient 8 and changing "r" to the correct character "n" with weighting coefficient 20. The sum of the weighting coefficients of the changed characters is 28 and the number of changed characters is 2, so the weighting coefficient of the generated word "than" is 28/2 = 14.

[0021] After the new words have been generated, each new word is spell-checked (step 108), and only the words judged correct (step 109) are stored in the word candidate storage means 30 (step 110). The order in which the new words are stored is determined by their weighting coefficients, and the new word most likely to occur replaces the recognition result currently stored in the recognition result storage means 24. In the example of FIG. 5, three of the eight generated new words are judged correct by the spell check: "then" with weighting coefficient 20, "than" with weighting coefficient 14, and "that" with weighting coefficient 6. Of these, the most likely is "then", which has the largest weighting coefficient, so the recognition result "ther" in the recognition result storage means 24 is changed to "then", and the remaining words "than" and "that" are stored in the word candidate storage means 30 as word candidates.
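The filtering and ordering of steps 108 to 110 can be sketched as follows. The function name `rank_candidates`, the toy dictionary, and the two non-words "fhen" and "thet" (with weights computed by the averaging rule of this embodiment from the quoted table entries) are assumptions added for illustration.

```python
def rank_candidates(weighted_words, dictionary):
    """Return (automatic correction, remaining candidates) after spell checking.

    Words not in the dictionary are dropped; the rest are ordered by
    descending weight, and the heaviest one becomes the new recognition result.
    """
    valid = [w for w in weighted_words if w in dictionary]
    valid.sort(key=lambda w: weighted_words[w], reverse=True)
    if not valid:
        return None, []
    return valid[0], valid[1:]

weighted = {"then": 20, "than": 14, "that": 6, "fhen": 11.5, "thet": 4}
dictionary = {"then", "than", "that"}
best, others = rank_candidates(weighted, dictionary)
print(best, others)
```

Returning `None` when nothing passes the spell check mirrors the case in which no word candidates are stored and the operator falls back to character-by-character correction.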

[0022] Next, the method of correcting the recognition result is described. When word candidates have been generated and stored for each misrecognized word, the recognition result display means 31 displays the recognition result held in the recognition result storage means 24 on the display device 3, such as a CRT (step 111). The operator verifies the displayed recognition result, and if it contains a misrecognition, the word candidate changing means 32 has the operator designate that part of the recognition result with the pointing device 4, such as a mouse (step 113). When a designation is made, the word to which the designated recognition result belongs is determined, and if word candidates for that word are stored in the word candidate storage means 30 (step 114), they are displayed on the display device 3 (step 115). When the operator designates one of the displayed word candidates (step 117), the recognition result in the recognition result storage means 24 is changed to the designated word candidate (step 118). If no word candidates are stored for the word of the designated recognition result (step 116), the character candidates for the designated recognition result character are displayed on the display device 3, as in the conventional candidate display, and the recognition result is changed.
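The step of finding the word that contains the designated character and swapping in the chosen candidate might look like the sketch below. Treating the recognition result as a plain string and the pointer position as a character offset are simplifying assumptions, and `word_span_at` and `replace_word` are hypothetical names.

```python
import re

def word_span_at(text, position):
    """Return (start, end) of the word containing the offset, or None."""
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.start() <= position < m.end():
            return m.start(), m.end()
    return None

def replace_word(text, position, chosen):
    """Replace the word at the designated offset with the chosen candidate."""
    span = word_span_at(text, position)
    if span is None:
        return text  # the operator pointed at a separator: nothing to change
    start, end = span
    return text[:start] + chosen + text[end:]

text = "and then we go"
print(replace_word(text, 5, "that"))
```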

[0023] FIG. 6 shows examples of the recognition result display screen. FIG. 6(a) is a display example of the recognition result when the correct word "then" shown in FIG. 5 is recognized. The result recognized by the character recognition means 22 is "ther", but the processing described above has changed "ther" to "then", so "then" is displayed. FIG. 6(b) is a display example of the word candidates when the displayed word "then" is designated. The processing described above has stored "than" and "that" as candidates for this word, so those two words are displayed. FIG. 6(c) is an example of correcting the recognition result by designating a word candidate. In this example the correct word is "then", so no correction is needed. If it were nevertheless to be changed to the word candidate "that", designating the displayed word candidate "that" with the mouse or the like changes the recognition result "then" in the recognition result storage means 24 to "that", and the displayed recognition result also changes to "that".

[0024] When all of the recognition results in the recognition result storage means 24 have been correctly corrected by the above processing (step 112), the recognition result output means 33 outputs the recognition results in the recognition result storage means 24 as character codes for a computer (step 119).

[0025] The embodiment described above is a character recognition device that recognizes European-language character strings read as image data by an image scanner or the like and converts them to character codes, but the present invention can also be applied to spell checking and correcting European-language character strings typed by an operator on a keyboard. A conventional spell check merely checks whether an input character string is in the dictionary and indicates it when it is not. Depending on how the misidentified character database is built, however, it becomes possible not only to spell-check but also to display correct character strings in order of weighting coefficient and so assist the operator's document input. In that case, weights are assigned not for misrecognition due to similar character images, as with "t" and "f", but for keyboard typing errors. For example, the database would be built from mistypings such as "i", pressed with the middle finger of the right hand, for "e", pressed with the middle finger of the left hand, or "t" for "r", both pressed with the index finger of the right hand. Because keyboard typing errors depend heavily on the habits of the individual operator, it is preferable to add a function that builds the database automatically while learning the operator's habits.
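The text does not say how the operator's habits would be learned. One possible approach, shown purely as an assumption, is to count character substitutions each time the operator corrects a typed word, and to use those counts as weights. The class `TypoDatabase` and its methods are hypothetical, and the sketch handles only corrections that keep the word length unchanged.

```python
from collections import Counter

class TypoDatabase:
    """Counts (correct character, typed character) pairs from corrections."""

    def __init__(self):
        self.counts = Counter()

    def record(self, typed, corrected):
        """Learn from one operator correction of equal-length words."""
        if len(typed) != len(corrected):
            return  # insertions and deletions are outside this sketch
        for t, c in zip(typed, corrected):
            if t != c:
                self.counts[(c, t)] += 1

    def candidates(self, typed_char):
        """Return (correct character, count) pairs, most frequent first."""
        pairs = [(c, n) for (c, t), n in self.counts.items() if t == typed_char]
        return sorted(pairs, key=lambda p: p[1], reverse=True)

db = TypoDatabase()
db.record("rhen", "then")
db.record("rhe", "the")
db.record("rat", "eat")
db.record("cat", "cats")  # ignored: lengths differ
print(db.candidates("r"))
```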

[0026]

[Effects of the Invention] According to the present invention, a misrecognized word is automatically corrected by the spell check means to the word that appears most likely, so the recognition rate can be improved and correction time shortened.

[0027] At the same time, only correct words that have already been spell-checked are output as word candidates for a misrecognized word, so misrecognitions can be corrected word by word, which shortens correction time compared with the conventional character-by-character correction.

[Brief description of drawings]

FIG. 1 is an overall configuration diagram of a character recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the overall processing procedure of the present invention.

FIG. 3 is a configuration diagram of the misidentified character database.

FIG. 4 is a flowchart showing the processing procedure for generating new words.

FIG. 5 is a flowchart showing the new word generation processing procedure.

FIG. 6 is a diagram showing a display example of a recognition result and a display example of a word candidate change.

FIG. 7 is a diagram showing a conventional display example of a recognition result and a display example of a character candidate change.

[Explanation of symbols]

1: image input device, 2: computer, 21: document image data storage means, 22: character recognition means, 23: character recognition dictionary means, 24: recognition result storage means, 25: word selection means, 26: spell check means, 27: word dictionary means, 28: new word generation means, 29: misidentified character database, 30: word candidate storage means, 31: recognition result display means, 32: word candidate changing means, 33: recognition result output means, 3: display device, 4: pointing device.

Claims

[Claims]

1. A character recognition device that segments and recognizes characters from image data of a foreign-language document input by an optical character reader or the like, characterized by comprising: (a) word selection means for selecting, from the recognition result, the words delimited by blank characters, special characters, and the like; (b) spell check means for judging whether a selected word contains an error by comparing it with a word dictionary prepared in advance; (c) a misidentified character database that stores data on characters found in advance to be easily misrecognized, together with weighting coefficients set according to their frequency of occurrence; (d) new word generation means for referring to the misidentified character database, replacing the easily misrecognized characters of a word judged erroneous by the spell check means, and generating new words; (e) word candidate storage means for judging all of the words generated by the new word generation means again with the spell check means and storing all of the words judged correct as word candidates; and (f) word candidate changing means for displaying the recognition result on a display device such as a CRT, displaying the other word candidates stored in the word candidate storage means when the operator designates a misrecognized word, and changing the word in the recognition result to the designated word candidate when one of the displayed candidates is designated.

2. The character recognition device according to claim 1, comprising means for generating word candidates for all of the words selected by the word selection means, without judgment by the spell check, instead of generating word candidates only when a selected word is judged erroneous by the spell check.

3. A method for correcting misrecognized characters in a character recognition device that segments and recognizes characters from image data of a Roman character string read by an image reading device, the method comprising: selecting words formed by delimiting the recognition result at blank characters or special characters; comparing each selected word with a word dictionary prepared in advance to determine, by a spell check function, whether the selected word contains an error; referring to a misrecognized character database that stores the set weighting coefficients, and replacing an easily misrecognized character in a word determined to be erroneous by the spell check function with a character in the database to generate new words; judging all the generated words again by the spell check function and treating all words judged to be correct as word candidates; and displaying the recognition result on a display device and, when an operator designates a misrecognized word, selecting another word candidate from the word candidates and displaying it.

4. A character recognition device that segments and recognizes character strings and their constituent characters from image data of a European character string input from an image reading device, judges by a spell check function whether each recognized character string is correct, and displays recognized character strings that are not in the word dictionary so as to distinguish them from the others, the device comprising means for displaying on a screen the correct word candidates for a character string determined to be erroneous and allowing an operator to select the correct word.

5. A character recognition device that segments and recognizes character strings and their constituent characters from image data of a European character string input from an image reading device, judges by a spell check function whether each recognized character string is correct, and displays recognized character strings that are not in the word dictionary so as to distinguish them from the others, the device comprising means for generating, for a word determined to be a character string not in the word dictionary, character strings in which an easily misrecognized constituent character is replaced with another easily misrecognized character, replacing the word with a generated character string that is determined to be correct when checked again by the spell check function, and displaying that character string on the screen as the recognition result.

6. The character recognition device according to claim 5, comprising means for displaying, when a plurality of character strings are determined to be correct by the repeated spell check, the recognition results arranged in order of the frequency of occurrence derived from how easily the easily misrecognized characters are misrecognized.
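The ordering described in claim 6 can be sketched as a sort over weighted substitutions. The weights, the dictionary, and the function name below are hypothetical; the patent does not give numeric values, and taking the largest weight when several substitutions produce the same word is an assumption of this sketch.

```python
# Hypothetical weighted misrecognition table:
# (character as recognized, character actually intended) -> weight.
CONFUSION_WEIGHTS = {
    ("c", "e"): 0.6,
    ("c", "o"): 0.3,
    ("1", "l"): 0.8,
    ("1", "i"): 0.5,
}
WORD_DICTIONARY = {"bet", "bot"}

def weighted_candidates(word):
    """Return dictionary words reachable by one substitution, best weight first."""
    scored = {}
    for i, ch in enumerate(word):
        for (seen, intended), weight in CONFUSION_WEIGHTS.items():
            if seen != ch:
                continue
            candidate = word[:i] + intended + word[i + 1:]
            if candidate in WORD_DICTIONARY:
                # Assumption: keep the largest weight seen for a candidate.
                scored[candidate] = max(weight, scored.get(candidate, 0.0))
    # Sort by descending weight, then alphabetically for a stable order.
    return sorted(scored, key=lambda w: (-scored[w], w))
```

With this table, the misread string `"bct"` produces both `"bet"` and `"bot"`, and the heavier `c` to `e` confusion places `"bet"` first.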

7. In a character recognition device that segments and recognizes character strings and their constituent characters from image data of a European character string input from an image reading device, judges by a spell check function whether each recognized character string is correct, and displays recognized character strings that are not in the word dictionary so as to distinguish them from the others, the correct word candidates for a character string determined to be erroneous are displayed on the screen and an operator is allowed to select the correct word.

8. A character recognition method for a character recognition device that segments and recognizes character strings and their constituent characters from image data of a European character string input from an image reading device, judges by a spell check function whether each recognized character string is correct, and displays recognized character strings that are not in the word dictionary so as to distinguish them from the others, the method comprising: generating, for a word determined to be a character string not in the word dictionary, character strings in which an easily misrecognized constituent character is replaced with another easily misrecognized character; replacing the word with a generated character string that is determined to be correct when checked again by the spell check function; and displaying that character string on the screen as the recognition result.

9. The character recognition method according to claim 8, wherein, when a plurality of character strings are determined to be correct by the repeated spell check, the recognition results are displayed in order of the frequency of occurrence derived from how easily the easily misrecognized characters are misrecognized.

10. A method for correcting misrecognized characters in a character recognition device that segments and recognizes character strings and their constituent characters from image data of a European character string input from an image reading device, judges by a spell check function whether each recognized character string is correct, and displays recognized character strings that are not in the word dictionary so as to distinguish them from the others, wherein the recognition result is held in memory not as individual constituent characters but as words (character strings) in the word dictionary, and, when the recognition result is corrected, word candidates are read out from the memory word by word and the correction or change is made.
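The word-level storage of claim 10 can be pictured as a list of word records, each carrying its own candidates, so that a correction swaps a whole word rather than editing single characters. The `RecognizedWord` class and the helper names are hypothetical, and keeping the replaced reading among the candidates is a design choice of this sketch, not something the claim states.

```python
from dataclasses import dataclass, field

@dataclass
class RecognizedWord:
    """One word of the recognition result plus its stored word candidates."""
    text: str
    candidates: list = field(default_factory=list)

def replace_word(result, index, choice):
    """Swap word `index` for its `choice`-th candidate, word by word."""
    entry = result[index]
    previous = entry.text
    entry.text = entry.candidates[choice]
    # Assumption: keep the old reading so the operator can switch back.
    entry.candidates[choice] = previous
    return result

def render(result):
    """Join the stored words back into displayable text."""
    return " ".join(word.text for word in result)
```

For example, a result holding `"the"` and `"bet"` (with candidate `"bot"`) renders as `the bot` after `replace_word(result, 1, 0)`.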

11. A European document processing apparatus that judges by a spell check function whether a European character string input from a keyboard is correct and displays character strings that are not in a word dictionary so as to distinguish them from the others, the apparatus comprising means for generating, for a word determined to be a character string not in the word dictionary, character strings in which a constituent character that is easily mistyped is replaced with another character that is easily mistyped, replacing the word with a generated character string that is determined to be correct when checked again by the spell check function, and outputting that character string.

12. The European document processing apparatus according to claim 11, comprising means for displaying on the screen, when a plurality of character strings are determined to be correct by the spell check function, those character strings in order of the frequency of occurrence derived from how easily the easily mistyped characters are mistyped, and for selecting the correct character string.