JPH06266906A - Character recognition system

Info

Publication number: JPH06266906A
Application number: JP5051918A
Authority: JP
Inventors: Toshio Niwa; 寿男丹羽; Kazuhiro Kayashima; 一弘萱嶋; 泰治〆木; Taiji Shimeki; Hidetsugu Maekawa; 英嗣前川; Satoru Ito; 哲伊藤; Yoshihiro Kojima; 良宏小島; Koji Yamamoto; 浩司山本
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-03-12
Filing date: 1993-03-12
Publication date: 1994-09-22
Anticipated expiration: 2017-11-25
Also published as: JP3350127B2

Abstract

PURPOSE: To reduce recognition errors and improve the character recognition rate by detecting recognition errors through knowledge-based character correction and reconstructing the recognition dictionary of the character recognition unit. CONSTITUTION: A character recognition unit 1 recognizes a document image 10 and outputs multiple candidate characters for each character. Candidate phrases are derived from the candidate character set 11 using a word dictionary 6 and a grammar dictionary 7. A phrase evaluation value calculation unit 4 calculates the lexical and grammatical correctness of each phrase, a phrase selection unit 5 selects phrases based on their evaluation values, and a corrected character string 14 is output. A candidate character comparison unit 9 compares the candidate character set 11 with the corrected character string 14 and determines the additional learning characters 15. The recognition dictionary 16 of the character recognition unit 1 is reconstructed based on the additional learning characters 15.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition device for reading characters.

[0002]

[Prior Art] In recent years, with the development of databases, demand has grown for character recognition devices that are fast and have a high recognition rate.

[0003] As a conventional character recognition device, the device shown in FIG. 9 has been proposed, for example in Japanese Unexamined Patent Publication No. H2-214990. The character correction unit 8 receives as input N candidate characters per character from the character recognition unit 1. The automatic correction unit 61 compares the candidate characters against the correction rule table 63 and corrects characters according to the correction rules. The output of the automatic correction unit 61 is displayed to the operator, who corrects any misrecognized characters. Based on information about these correction operations, the manual correction control unit 62 creates correction rules and registers them in the correction rule table 63. The rules are then applied to subsequent recognition results to correct recognition errors automatically. This makes it possible to perform character recognition adapted to the font of the document, based on the corrections made by the operator.

[0004]

[Problems to Be Solved by the Invention] However, because the above character recognition device creates the correction rule table from corrections made by the operator, it cannot create correction rules automatically without human intervention.

[0005] The present invention solves this conventional problem. Its object is to automatically reconstruct the recognition dictionary of the character recognition unit based on a character string corrected by knowledge processing, thereby automatically adapting recognition to the font of the document and raising the character recognition rate.

[0006]

[Means for Solving the Problems] To achieve the above object, the present invention compares the characters corrected by the character correction unit with the candidate characters from the character recognition unit, thereby extracting the characters that the character recognition unit tends to misrecognize. The extracted characters are sent to the character recognition unit, and the recognition dictionary in the character recognition unit is reconstructed based on them, so that the processing of the character recognition unit is adapted to the characters of the document being recognized and recognition errors are eliminated.

[0007]

[Operation] With the above configuration, the present invention extracts the characters that the character recognition unit tends to misrecognize from the information about which characters the character correction unit corrected, and reconstructs the recognition dictionary of the character recognition unit based on those characters. This reduces recognition errors in the character recognition unit and improves the character recognition rate.

[0008]

[Embodiments] An embodiment of the first invention is described below. FIG. 1 shows the overall configuration of the character recognition device of this embodiment. The character recognition unit 1 performs character recognition on the document image 10 using the recognition dictionary 16 and outputs a candidate character set in which each character has n candidate characters, from the first candidate character to the n-th candidate character.

[0009] The word search unit 2 searches the word dictionary 6 and, from the combinations of characters in the candidate character set 11, selects the candidate word set 12, that is, the combinations of candidate characters that match words in the word dictionary 6. The phrase search unit 3 refers to the grammar dictionary 7 and selects from the candidate word set 12 the candidate phrase set 13, that is, the combinations of words that can form a phrase (bunsetsu). The phrase evaluation value calculation unit 4 calculates an evaluation value for the lexical and grammatical correctness of each phrase found by the phrase search unit 3, using the length and frequency of the words in the phrase as criteria. The phrase selection unit 5 selects the phrase with the largest evaluation value among the candidate phrases and outputs the corrected character string 14.
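The word search step can be pictured with a short Python sketch. It is an illustration only: the function name `find_candidate_words`, the `max_len` limit, and the sample characters and dictionary are assumptions, not details taken from the patent.

```python
from itertools import product


def find_candidate_words(candidate_sets, word_dictionary, max_len=4):
    """Return (start, word) pairs that can be spelled from the candidates.

    candidate_sets[k] lists the candidate characters for position k,
    ordered from the first candidate to the n-th candidate.
    """
    found = []
    for start in range(len(candidate_sets)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(candidate_sets):
                break
            # Try every combination of one candidate per position.
            for combo in product(*candidate_sets[start:end]):
                word = "".join(combo)
                if word in word_dictionary:
                    found.append((start, word))
    return found


# Hypothetical recognizer output: two candidates per character position.
candidate_sets = [["丈", "文"], ["字", "宇"], ["認", "誌"], ["識", "織"]]
word_dictionary = {"文字", "認識", "文", "字"}
words = find_candidate_words(candidate_sets, word_dictionary)
```

Exhaustive enumeration is used here only because it is easy to follow; the number of combinations grows quickly with n and with word length, so a practical system would more likely look words up through a trie or similar index.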

[0010] The candidate character comparison unit 9 compares the corrected character string 14 with the candidate character set 11, extracts each character for which the corrected character string differs from the first candidate character of the candidate character set, and sends it to the character recognition unit 1 as an additional learning character 15.

[0011] The character recognition device configured as above performs character recognition as follows. First, the character recognition unit 1 recognizes the document image 10 to be recognized using the recognition dictionary 16 and outputs a candidate character set in which each character has n candidate characters, from the first candidate character to the n-th candidate character.

[0012] Next, the word search unit 2 searches the word dictionary 6 and, from the combinations of characters in the candidate character set 11, selects the candidate word set 12, that is, the combinations of candidate characters that match words in the word dictionary 6. The phrase search unit 3 then refers to the grammar dictionary 7 and selects from the candidate word set 12 the candidate phrase set 13, that is, the combinations of words that can form a phrase. A phrase evaluation value is calculated for the lexical and grammatical correctness of each phrase found by the phrase search unit 3, using criteria such as the length and frequency of the words in the phrase. For the candidate phrases whose evaluation values have been obtained, the phrase selection unit 5 selects the correct combination of phrases on the basis of those values and outputs the corrected character string 14.
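As a rough illustration of scoring and selecting phrases, the sketch below ranks candidate phrases by word length and word frequency. The patent names these criteria but gives no formula, so the scoring function, the frequency table, and all names here are assumptions made for the example.

```python
import math


def phrase_score(words, word_frequency):
    """Hypothetical evaluation value: long, frequent words score higher."""
    return sum(len(w) ** 2 * math.log1p(word_frequency.get(w, 0)) for w in words)


def select_phrase(candidate_phrases, word_frequency):
    """Pick the candidate phrase with the largest evaluation value."""
    return max(candidate_phrases, key=lambda p: phrase_score(p, word_frequency))


# Hypothetical word frequencies and two competing segmentations.
word_frequency = {"文字": 120, "認識": 80, "丈": 5, "字": 10}
candidate_phrases = [["文字", "認識"], ["丈", "字", "認識"]]

best = select_phrase(candidate_phrases, word_frequency)
corrected_string = "".join(best)
```

Squaring the word length is one simple way to make a single two-character word outweigh two one-character words; any monotone weighting with that property would serve the same purpose.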

[0013] The candidate character comparison unit 9 compares the corrected character string 14 with the candidate character set 11. For each character position, it compares the character in the corrected character string with the first candidate character of the candidate character set and, if the two differ, outputs the character as an additional learning character 15.
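The comparison described here is a position-by-position check, sketched below in Python. The function name and the sample data are illustrative assumptions.

```python
def additional_learning_characters(corrected_string, candidate_sets):
    """Return (position, character) pairs where the corrected character
    differs from the recognizer's first candidate at that position."""
    learned = []
    for pos, (ch, candidates) in enumerate(zip(corrected_string, candidate_sets)):
        if ch != candidates[0]:
            learned.append((pos, ch))
    return learned


# Hypothetical candidate sets; index 0 is the first candidate.
candidate_sets = [["丈", "文"], ["字", "宇"], ["認", "誌"], ["織", "識"]]
to_learn = additional_learning_characters("文字認識", candidate_sets)
```

Keeping the position alongside the character lets the recognizer pair each additional learning character with the matching character image.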

[0014] The character recognition unit 1 receives the additional learning characters 15 and, from the character image of each additional learning character and the corresponding character in the corrected character string, adds a dictionary entry for the additional learning character 15 to the recognition dictionary 16 so that the character can be recognized.

[0015] As a result, characters that the character recognition unit 1 could not recognize in the initial recognition pass become recognizable, because dictionary entries for the additional characters have been added to the recognition dictionary.

[0016] As an alternative way of adding the additional learning characters 15 to the recognition dictionary 16, the character recognition unit 1 may be configured as a neural network whose weights are changed by additional learning.

[0017] In this embodiment the candidate character comparison unit 9 compares the corrected character string with the first candidate character. Alternatively, each corrected character may be compared with the m-th candidate characters (1 ≤ m ≤ i < n), and the corrected character output as an additional learning character 15 if it is not among those i candidate characters.
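The variant with the top i candidates changes only the membership test. The sketch below assumes the candidate lists are ordered from first to n-th candidate; the function name and data are hypothetical.

```python
def missing_from_top_i(corrected_string, candidate_sets, i):
    """Return (position, character) pairs whose corrected character is not
    among the first i candidates at that position (1 <= i < n)."""
    return [
        (pos, ch)
        for pos, (ch, candidates) in enumerate(zip(corrected_string, candidate_sets))
        if ch not in candidates[:i]
    ]


# Hypothetical candidate sets with n = 3 candidates per character.
candidate_sets = [
    ["丈", "文", "大"],
    ["字", "宇", "学"],
    ["誌", "詔", "認"],
    ["織", "職", "識"],
]
top2_misses = missing_from_top_i("文字認識", candidate_sets, i=2)
top1_misses = missing_from_top_i("文字認識", candidate_sets, i=1)
```

With i = 1 the function reduces to the first-candidate comparison, so a larger i selects fewer characters for additional learning: only those the recognizer ranked poorly.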

[0018] Next, an embodiment of the second invention is described. FIG. 2 shows the overall configuration of the character recognition device of this embodiment.

[0019] The character recognition unit 1, word search unit 2, phrase search unit 3, phrase evaluation value calculation unit 4, phrase selection unit 5, and candidate character comparison unit 9 are the same as in the embodiment of the first invention.

[0020] The same-character extraction unit 21 looks for identical characters among the characters output by the candidate character comparison unit and, when an identical character occurs in different words, outputs that character as an additional learning character 15.

[0021] The character recognition device configured as above performs character recognition as follows. First, the character recognition unit 1 recognizes the document image 10 to be recognized using the recognition dictionary 16 and outputs a candidate character set in which each character has n candidate characters, from the first candidate character to the n-th candidate character.

[0022] Next, the word search unit 2 searches the word dictionary 6 and, from the combinations of characters in the candidate character set 11, selects the candidate word set 12, that is, the combinations of candidate characters that match words in the word dictionary 6. The phrase search unit 3 then refers to the grammar dictionary 7 and selects from the candidate word set 12 the candidate phrase set 13, that is, the combinations of words that can form a phrase. A phrase evaluation value is calculated for the lexical and grammatical correctness of each phrase found by the phrase search unit 3, using criteria such as the length and frequency of the words in the phrase. For the candidate phrases whose evaluation values have been obtained, the phrase selection unit 5 selects the correct combination of phrases on the basis of those values and outputs the corrected character string 14.

[0023] The candidate character comparison unit 9 compares the corrected character string 14 with the candidate character set 11. For each character position, it compares the character in the corrected character string with the first candidate character of the candidate character set and outputs the characters that differ.

[0024] The same-character extraction unit 21 looks for identical characters among the characters output by the candidate character comparison unit 9 and, when an identical character occurs in different words, outputs it as an additional learning character 15. For example, when the character correction unit 8 outputs the corrected character string shown in FIG. 3, the candidate character comparison unit 9 outputs 『文』 from 文章 ("text"), 『認』 from 認識 ("recognition"), 『文』 from 文法 ("grammar"), and 『正』 from 訂正 ("correction"). Of these, the 『文』 of 文章 and the 『文』 of 文法 are the same character and occur in different words, so the same-character extraction unit outputs 『文』 as an additional learning character 15.
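The 『文』 example can be reproduced with a small Python sketch. The input format (pairs of character and containing word) and the function name are assumptions chosen for the illustration.

```python
from collections import defaultdict


def same_character_extraction(mismatches):
    """Keep only characters that were corrected in two or more different words.

    mismatches is a list of (character, containing_word) pairs, as might be
    produced by the candidate character comparison step.
    """
    words_by_char = defaultdict(set)
    for ch, word in mismatches:
        words_by_char[ch].add(word)
    return [ch for ch, words in words_by_char.items() if len(words) >= 2]


# The four corrected characters from the example in the text.
mismatches = [("文", "文章"), ("認", "認識"), ("文", "文法"), ("正", "訂正")]
selected = same_character_extraction(mismatches)
```

Requiring the same character to be corrected in different words is a guard against learning from a single, possibly wrong, correction.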

[0025] The character recognition unit 1 receives the additional learning characters 15 and, from the character image of each additional learning character and the corresponding character in the corrected character string, adds a dictionary entry for the additional learning character 15 to the recognition dictionary 16 so that the character can be recognized.

[0026] As a result, characters that the character recognition unit 1 could not recognize in the initial recognition pass become recognizable, because dictionary entries for the additional characters have been added to the recognition dictionary.

[0027] As an alternative way of adding the additional learning characters 15 to the recognition dictionary 16, the character recognition unit 1 may be configured as a neural network whose weights are changed by additional learning.

[0028] In this embodiment the candidate character comparison unit 9 compares the corrected character string with the first candidate character. Alternatively, each corrected character may be compared with the m-th candidate characters (1 ≤ m ≤ i < n), and the corrected character output as an additional learning character 15 if it is not among those i candidate characters.

[0029] Next, an embodiment of the third invention is described. FIG. 4 shows the overall configuration of the character recognition device of this embodiment.

[0030] FIG. 5 shows the configuration of the character recognition unit 1 of FIG. 4. The character recognition unit 1 recognizes candidate characters using a neural network. The similarity calculation unit 36 calculates the similarity to each character from the character image and the weighting coefficients 37 and outputs candidate characters. The weighting coefficient update unit 38 can also update the weighting coefficients based on the error between the candidate characters and the additional learning characters.
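To make the roles of the similarity calculation and the weight update concrete, here is a minimal Python sketch. The patent does not specify the network, so a single linear layer trained with the delta rule is assumed, and the feature vectors, learning rate, and function names are all hypothetical.

```python
def similarities(weights, image):
    """Similarity of the image to each character: a weighted sum per label."""
    return {label: sum(w * x for w, x in zip(row, image))
            for label, row in weights.items()}


def top_candidates(weights, image, n):
    """Return the n labels with the highest similarity, best first."""
    sims = similarities(weights, image)
    return sorted(sims, key=sims.get, reverse=True)[:n]


def update_weights(weights, image, correct_label, lr=0.5):
    """Delta-rule update pushing the correct label toward 1, others toward 0."""
    sims = similarities(weights, image)
    for label, row in weights.items():
        target = 1.0 if label == correct_label else 0.0
        error = target - sims[label]
        for j, x in enumerate(image):
            row[j] += lr * error * x


# Hypothetical weighting coefficients and a three-element image feature vector.
weights = {"丈": [0.6, 0.4, 0.0], "文": [0.5, 0.3, 0.2]}
image = [1.0, 1.0, 0.0]

before = top_candidates(weights, image, 2)
for _ in range(5):
    update_weights(weights, image, "文")  # "文" is the additional learning character
after = top_candidates(weights, image, 2)
```

In this toy run the image is first ranked as 「丈」 and, after additional learning on the corrected label, is ranked as 「文」.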

[0031] The word search unit 2, phrase search unit 3, phrase evaluation value calculation unit 4, and phrase selection unit 5 are the same as in the embodiment of the first invention.

[0032] The keyword extraction unit 31 of FIG. 4 extracts the keywords of the document being recognized from the corrected character string 14 output by the phrase selection unit 5 and creates the keyword set 35. Keywords are extracted, for example, from the difference between the frequency of a word in the document and its frequency in general documents. The keyword partial match search unit 32 performs a partial match search between the resulting keyword set 35 and the candidate character set 11. For example, if 「認識」 ("recognition") has been extracted as a keyword, occurrences of 「認＊」 or 「＊識」 in the corrected character string 14 are extracted as partial matches, with ＊ standing for an arbitrary character. The candidate word addition unit 33 adds the partially matched keywords to the candidate word set. In this example, the partially matched 「認＊」 and 「＊識」 are added to the candidate word set 12 as 「認識」. This allows characters that were not output by the character recognition unit 1 to be used for character correction.
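The two steps in this paragraph, keyword extraction by frequency difference and partial matching against the candidates, might look as follows in Python. The threshold, the frequency figures, and the rule that a partial match means "some but not all characters found among the candidates" are assumptions made for the sketch.

```python
from collections import Counter


def extract_keywords(document_words, general_frequency, threshold=0.1):
    """Words whose relative frequency in the document exceeds their
    frequency in general text by more than the threshold."""
    counts = Counter(document_words)
    total = sum(counts.values())
    return {w for w, c in counts.items()
            if c / total - general_frequency.get(w, 0.0) > threshold}


def partial_matches(keywords, candidate_sets):
    """Return (start, keyword) where only part of the keyword is found
    among the candidate characters at consecutive positions."""
    hits = []
    for kw in sorted(keywords):
        for start in range(len(candidate_sets) - len(kw) + 1):
            window = candidate_sets[start:start + len(kw)]
            matched = sum(ch in cands for ch, cands in zip(kw, window))
            if 0 < matched < len(kw):
                hits.append((start, kw))
    return hits


# Hypothetical words from a first-pass corrected string and general frequencies.
document_words = ["文字", "認識", "装置", "認識", "辞書", "認識"]
general_frequency = {"文字": 0.15, "認識": 0.01, "装置": 0.12, "辞書": 0.10}
keywords = extract_keywords(document_words, general_frequency)

# The recognizer proposed 認 but missed 識 entirely at the next position.
candidate_sets = [["認", "誌"], ["織", "職"]]
hits = partial_matches(keywords, candidate_sets)
```

Each hit can then be added to the candidate word set, which is how a character absent from every candidate list (here 識) becomes available to the correction step.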

[0033] The non-candidate character detection unit 34 detects, in the corrected character string 14 output by the phrase selection unit 5, the non-candidate characters added by the candidate word addition unit 33 and outputs them as additional learning characters 15.

[0034] The character recognition device configured as above performs character recognition as follows. First, in the character recognition unit 1, the similarity calculation unit 36 processes the document image 10 to be recognized with reference to the weighting coefficients 37 shown in FIG. 5 and outputs a candidate character set in which each character has n candidate characters, from the first candidate character to the n-th candidate character.

[0035] Next, the word search unit 2 searches the word dictionary 6 and, from the combinations of characters in the candidate character set 11, selects the candidate word set 12, that is, the combinations of candidate characters that match words in the word dictionary 6. The phrase search unit 3 then refers to the grammar dictionary 7 and selects from the candidate word set 12 the candidate phrase set 13, that is, the combinations of words that can form a phrase. A phrase evaluation value is calculated for the lexical and grammatical correctness of each phrase found by the phrase search unit 3, using criteria such as the length and frequency of the words in the phrase. For the candidate phrases whose evaluation values have been obtained, the phrase selection unit 5 selects the correct combination of phrases on the basis of those values and outputs the corrected character string 14.

[0036] Next, the keyword extraction unit 31 extracts the keyword set 35 from the corrected character string 14.

[0037] The keyword partial match search unit 32 performs a partial match search between the keyword set 35 and the candidate character set 11. The candidate word addition unit 33 then adds the words output by the keyword partial match search unit 32 to the candidate word set 12.

[0038] The phrase search unit 3 and the phrase evaluation value calculation unit 4 then search again for candidate phrases, this time including the added candidate words, and obtain the phrase evaluation values.

[0039] The phrase selection unit 5 then selects the phrases with large evaluation values among the candidate phrases and outputs the corrected character string 14.

[0040] The non-candidate character detection unit 34 detects, in the corrected character string 14, the characters added by the candidate word addition unit 33 and outputs them as additional learning characters 15.

[0041] The weighting coefficient update unit 38 updates the weighting coefficients 37 based on the error between the additional learning characters and the candidate characters, thereby performing additional learning.

[0042] As a result, characters that the character recognition unit 1 could not recognize in the initial recognition pass become recognizable, because dictionary entries for the additional characters have been added to the recognition dictionary.

[0043] The character recognition unit 1 may also use a character recognition method that does not rely on a neural network. For example, it may hold the mean vector of each character as the recognition dictionary and recognize characters by comparing these vectors with the image. When such a recognition dictionary is used, additional learning is performed by rebuilding the recognition dictionary based on the added characters.
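A mean-vector recognition dictionary and its rebuild can be sketched as a nearest-mean classifier with a running-mean update. Storing a sample count next to each mean and using squared Euclidean distance are assumptions of this example, as are the feature values.

```python
def recognize(dictionary, features):
    """Return the label whose mean vector is closest to the feature vector."""
    def squared_distance(mean):
        return sum((m - f) ** 2 for m, f in zip(mean, features))
    return min(dictionary, key=lambda label: squared_distance(dictionary[label][0]))


def add_sample(dictionary, label, features):
    """Fold one additional learning sample into the mean vector for label."""
    mean, count = dictionary.get(label, ([0.0] * len(features), 0))
    new_count = count + 1
    new_mean = [m + (f - m) / new_count for m, f in zip(mean, features)]
    dictionary[label] = (new_mean, new_count)


# Hypothetical dictionary: label -> (mean feature vector, number of samples).
dictionary = {"丈": ([1.0, 0.0], 1), "文": ([0.0, 1.0], 1)}
sample = [0.6, 0.5]  # features of a character image that is really "文"

before = recognize(dictionary, sample)
add_sample(dictionary, "文", sample)
after = recognize(dictionary, sample)
```

The running mean avoids keeping every training image, at the cost of treating old and new samples equally; a document-specific font might justify weighting the new samples more heavily.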

[0044] Next, an embodiment of the fourth invention is described. FIG. 6 shows the overall configuration of the character recognition device of this embodiment.

[0045] The character recognition unit 1, word search unit 2, phrase search unit 3, phrase evaluation value calculation unit 4, and phrase selection unit 5 are the same as in the embodiment of the first invention.

[0046] The keyword extraction unit 31, keyword partial match search unit 32, and non-candidate character detection unit 34 are the same as in the embodiment of the third invention.

[0047] The word miscorrection degree calculation unit 41 calculates, for each word corrected in the corrected character string 14, the likelihood that the correction is wrong, that is, the word miscorrection degree. The reject character determination unit 42 determines the reject characters based on the word miscorrection degree output by the word miscorrection degree calculation unit 41. The candidate word addition unit 43 adds to the candidate word set those keywords found by the keyword partial match search unit 32 that correspond to reject characters.

[0048] The character recognition device configured as above performs character recognition as follows. First, the character recognition unit 1 recognizes the document image 10 to be recognized using the recognition dictionary 16 and outputs a candidate character set in which each character has n candidate characters, from the first candidate character to the n-th candidate character.

[0049] Next, the word search unit 2 searches the word dictionary 6 and, from the combinations of characters in the candidate character set 11, selects the candidate word set 12, that is, the combinations of candidate characters that match words in the word dictionary 6. The phrase search unit 3 then refers to the grammar dictionary 7 and selects from the candidate word set 12 the candidate phrase set 13, that is, the combinations of words that can form a phrase. A phrase evaluation value is calculated for the lexical and grammatical correctness of each phrase found by the phrase search unit 3, using criteria such as the length and frequency of the words in the phrase. For the candidate phrases whose evaluation values have been obtained, the phrase selection unit 5 selects the correct combination of phrases on the basis of those values and outputs the corrected character string 14.

[0050] Next, the keyword extraction unit 31 extracts the keyword set 35 from the corrected character string 14.

[0051] The keyword partial match search unit 32 performs a partial match search between the keyword set 35 and the candidate character set 11.

[0052] Next, the word miscorrection degree calculation unit 41 calculates the word miscorrection degree from factors such as the length of the corrected word, the evaluation values that the character recognition unit 1 assigned to the characters in the word, the difference between the evaluation values that the character recognition unit 1 assigned to the corrected character and to the first candidate character, the types of the characters making up the word, and the statistical probability that the corrected word is correct. The reject character determination unit 42 determines the reject characters from, among other things, the word miscorrection degrees of the corrected word and of the words before and after it.
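The patent lists the inputs to the word miscorrection degree but not how they are combined. Purely as an illustration, the sketch below uses a linear combination of three of those inputs and a threshold that also looks at the neighbouring words; every weight, threshold, and name is a hypothetical choice, not the patented method.

```python
def miscorrection_degree(word_length, score_gap, prior_correct,
                         w_len=0.2, w_gap=1.0, w_prior=1.0):
    """Hypothetical score: higher means the correction is more likely wrong.

    score_gap     : recognizer score of the first candidate minus that of
                    the corrected character (assumed non-negative).
    prior_correct : assumed statistical probability that the word is right.
    """
    return w_gap * score_gap + w_prior * (1.0 - prior_correct) - w_len * word_length


def reject_flags(degrees, threshold=0.5, neighbour_weight=0.25):
    """Flag a word when its own degree, plus a share of the worst adjacent
    degree, exceeds the threshold."""
    flags = []
    for k, own in enumerate(degrees):
        neighbours = degrees[max(0, k - 1):k] + degrees[k + 1:k + 2]
        context = max(neighbours) if neighbours else 0.0
        flags.append(own + neighbour_weight * max(context, 0.0) > threshold)
    return flags


# Three consecutive corrected words with hypothetical feature values.
degrees = [
    miscorrection_degree(2, 0.1, 0.9),
    miscorrection_degree(1, 0.8, 0.3),
    miscorrection_degree(2, 0.2, 0.8),
]
flags = reject_flags(degrees)
```

Here only the middle word is flagged: it is short, the recognizer strongly preferred a different character, and the corrected word is statistically unlikely.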

[0053] Next, the candidate word addition unit 43 compares the words output by the keyword partial match search unit 32 with the characters output by the reject character determination unit 42 and adds the words for which the two coincide to the candidate word set 12.

[0054] The phrase search unit 3 and the phrase evaluation value calculation unit 4 then search again for candidate phrases, this time including the added candidate words, and obtain the phrase evaluation values.

[0055] The phrase selection unit 5 then selects the phrases with large evaluation values among the candidate phrases and outputs the corrected character string 14.

[0056] The non-candidate character detection unit 34 detects, in the corrected character string 14, the characters added by the candidate word addition unit 43 and outputs them as additional learning characters 15.

[0057] The character recognition unit 1 receives the additional learning characters 15 and, from the character image of each additional learning character and the corresponding character in the corrected character string, adds a dictionary entry for the additional learning character 15 to the recognition dictionary 16 so that the character can be recognized.

[0058] As a result, characters that the character recognition unit 1 could not recognize in the initial recognition pass become recognizable, because dictionary entries for the additional characters have been added to the recognition dictionary.

[0059] As an alternative way of adding the additional learning characters 15 to the recognition dictionary 16, the character recognition unit 1 may be configured as a neural network whose weights are changed by additional learning.

[0060] Next, an embodiment of the fifth invention is described. FIG. 7 shows the overall configuration of the character recognition device of this embodiment.

[0061] The character recognition unit 1, word search unit 2, phrase search unit 3, phrase evaluation value calculation unit 4, and phrase selection unit 5 are the same as in the embodiment of the first invention.

[0062] The character type determination unit 51 refers to the character type dictionary 52 to determine the character type of each character in the corrected character string 14, and outputs the result.

[0063] The character recognition device configured as above performs character recognition as follows. First, the character recognition unit 1 recognizes the document image 10 to be recognized using the recognition dictionary 16, and outputs a candidate character set containing n candidate characters per character, from the first candidate character to the n-th candidate character.

[0064] Next, the word search unit 2 searches the word dictionary 6 and selects, from the combinations of characters in the candidate character set 11, the candidate word set 12, that is, the combinations of candidate characters that match words in the word dictionary 6. The phrase search unit 3 then refers to the grammar dictionary 7 and selects, from the candidate word set 12, the candidate phrase set 13, that is, the combinations of words that can form a phrase. A phrase evaluation value is calculated for each phrase found by the phrase search unit 3; it expresses the lexical and grammatical correctness of the phrase, using criteria such as the length and frequency of the words in the phrase. Based on these phrase evaluation values, the phrase selection unit 5 selects the correct combination of phrases from the candidate phrases and outputs the corrected character string 14.
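The word search step can be sketched as follows. The sketch assumes the candidate character set is a list with one list of candidate characters per position, and that the word dictionary maps each word to a frequency. The scoring formula in `word_score` is purely illustrative: the specification says only that word length and frequency serve as criteria and gives no formula. All names are hypothetical.

```python
def search_candidate_words(candidate_sets, word_freq):
    """Find every (start, word) such that each character of the word appears
    among the candidate characters at the corresponding position."""
    matches = []
    for word in word_freq:
        if not word:
            continue  # ignore empty dictionary entries
        for start in range(len(candidate_sets) - len(word) + 1):
            if all(ch in candidate_sets[start + k] for k, ch in enumerate(word)):
                matches.append((start, word))
    return matches


def word_score(word, word_freq, length_weight=10.0):
    """Illustrative evaluation value: longer and more frequent words score higher."""
    return length_weight * len(word) + word_freq[word]


candidate_sets = [["文", "丈"], ["宇", "字"], ["認", "詔"], ["識", "織"]]
word_freq = {"文字": 5, "認識": 3, "丈": 1}  # toy dictionary with frequencies

matches = search_candidate_words(candidate_sets, word_freq)
# Among the words starting at position 0, pick the one with the highest score.
best_at_0 = max(
    (w for start, w in matches if start == 0),
    key=lambda w: word_score(w, word_freq),
)
```

A full implementation would go on to combine words into phrases using grammatical constraints; that step is omitted here.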

[0065] The character type determination unit 51 determines character types from the corrected character string 14 according to the character type dictionary 52. The character type dictionary 52 holds rules such as the following: a character between a pre-numeral prefix (第, 約, etc.) and a counter (本, 個, etc.) is a numeral; a character between two katakana characters is likely to be katakana; a character between two alphabetic characters is likely to be alphabetic. Character types are determined according to these rules. FIG. 8 shows an example of character types determined from a corrected character string. In FIG. 8, the character string "ニューラねネットワーク" in the corrected character string is determined to be katakana, and the character string "９Ｇ．５" is determined to be numeric. No character type is determined for the remaining parts. The characters whose type has been determined are sent to the character recognition unit 1, which performs character recognition again with the character types restricted according to the output of the character type determination unit.
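The neighbor-based rules can be sketched in a few lines. This is a simplified illustration, not the dictionary format of the specification: it assumes only the "sandwiched between two characters of the same type" rules, classifies characters by Unicode range, and treats the rules as deterministic although the text describes them as probabilistic. The function names are hypothetical.

```python
def char_type(ch):
    """Coarse character type based on Unicode ranges (illustrative only)."""
    if "ァ" <= ch <= "ヶ" or ch == "ー":
        return "katakana"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def constrain_types(text):
    """Map position -> inferred type for characters whose two neighbors share
    a type that the character itself does not have."""
    types = [char_type(c) for c in text]
    constraints = {}
    for i in range(1, len(text) - 1):
        left, right = types[i - 1], types[i + 1]
        if left == right and left != "other" and types[i] != left:
            constraints[i] = left
    return constraints


# The hiragana "ね" sits between katakana, so it is flagged for
# re-recognition with the character type restricted to katakana.
katakana_case = constrain_types("ニューラねネットワーク")
digit_case = constrain_types("1O3")
```

The positions returned are the ones that would be sent back to the recognizer with a restricted character type.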

[0066] The word search unit 2 adds the characters output by the character recognition unit 1 to the candidate word set 12 and searches for candidate words again. The phrase search unit 3 and the phrase evaluation value calculation unit 4 then search for candidate phrases among the added candidate words and calculate their phrase evaluation values.

[0067] The phrase selection unit 5 then selects the phrases with large evaluation values from among the candidate phrases and outputs the corrected character string 14.

[0068] Restricting the character types in this way improves the recognition rate of the character recognition unit 1. As a result, the recognition rate after character correction also improves.

[0069] In this embodiment, the character type determination unit 51 determines the character type and restricts recognition to that type, but the character type determination unit 51 may instead restrict recognition to a fixed set of characters. For example, if it can be determined from the corrected character string 14 that a character string represents a city name, recognition may be restricted to the characters used in city names.
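Restricted re-recognition can be pictured as choosing the best-scoring character from an allowed set only. The sketch below assumes the recognizer exposes a similarity score per character, which the specification does not state in this form; `recognize_restricted` and the sample scores are hypothetical.

```python
def recognize_restricted(similarity, allowed):
    """Return the highest-similarity character among those in `allowed`,
    or None if no allowed character was scored.

    similarity: dict mapping character -> similarity score for one character image
    allowed:    set of characters permitted by the character type (or character list)
    """
    permitted = {ch: s for ch, s in similarity.items() if ch in allowed}
    if not permitted:
        return None
    return max(permitted, key=permitted.get)


# Unrestricted, "O" would win; restricted to digits, "0" is chosen instead.
scores = {"0": 0.4, "O": 0.5, "D": 0.1}
unrestricted = max(scores, key=scores.get)
restricted = recognize_restricted(scores, set("0123456789"))
```

The same function covers the city-name case if `allowed` is set to the characters that occur in city names.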

[0070]

[Effects of the Invention] As is apparent from the above embodiments, the character recognition device of the present invention determines additional learning characters from the correction results of the character correction unit, so that a recognition dictionary suited to the font of the document being recognized can be built automatically. Because character recognition is then matched to the document being recognized, the recognition rate improves, which is of considerable practical value. In addition, restricting the character types of the characters recognized by the character recognition unit, based on the correction results of the character correction unit, improves the recognition rate of the character recognition unit.

[Brief description of drawings]

FIG. 1 is a configuration diagram of an embodiment of the character recognition device of the present invention.

FIG. 2 is a configuration diagram of another embodiment of the character recognition device of the present invention.

FIG. 3 is an output diagram of the same-character extraction unit of the present invention.

FIG. 4 is a configuration diagram of another embodiment of the character recognition device of the present invention.

FIG. 5 is a configuration diagram of the character recognition unit of the present invention.

FIG. 6 is a configuration diagram of another embodiment of the character recognition device of the present invention.

FIG. 7 is a configuration diagram of another embodiment of the character recognition device of the present invention.

FIG. 8 is an output diagram of the character type determination unit of the present invention.

FIG. 9 is a configuration diagram of a conventional character recognition device.

[Explanation of symbols]

1 Character recognition unit
2 Word search unit
3 Phrase search unit
4 Phrase evaluation value calculation unit
5 Phrase selection unit
6 Word dictionary
7 Grammar dictionary
8 Character correction unit
9 Candidate character comparison unit
10 Character image
11 Candidate character set
12 Candidate word set
13 Candidate phrase set
14 Corrected character string
15 Additional learning character
16 Recognition dictionary
21 Same-character extraction unit
31 Keyword extraction unit
32 Keyword partial match search unit
33 Candidate word addition unit
34 Non-candidate character detection unit
35 Keyword set
36 Similarity calculation unit
37 Weighting coefficient
38 Weighting coefficient update unit
41 Word error correction degree calculation unit
42 Reject character determination unit
43 Candidate word addition unit
51 Character type determination unit
52 Character type dictionary
61 Automatic correction unit
62 Manual correction control unit
63 Correction rule table

Continuation of front page
(72) Inventor: Hidetsugu Maekawa, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture
(72) Inventor: Satoru Ito, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture
(72) Inventor: Yoshihiro Kojima, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture
(72) Inventor: Koji Yamamoto, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture

Claims

[Claims]

1. A character recognition device comprising: a character recognition unit that recognizes a document image and outputs N candidate characters per character; a word search unit that obtains a candidate word set from the candidate character set using a word dictionary; a phrase search unit that obtains candidate phrases from the candidate word set using a grammar dictionary; a phrase evaluation value calculation unit that calculates the lexical and grammatical correctness of each phrase; a phrase selection unit that selects phrases based on the phrase evaluation values and outputs a corrected character string; and a candidate character comparison unit that compares the candidate character set with the corrected character string and outputs additional learning characters.

2. A character recognition device comprising: a character recognition unit that recognizes a document image and outputs N candidate characters per character; a word search unit that obtains a candidate word set from the candidate character set using a word dictionary; a phrase search unit that obtains candidate phrases from the candidate word set using a grammar dictionary; a phrase evaluation value calculation unit that calculates the lexical and grammatical correctness of each phrase; a phrase selection unit that selects phrases based on the phrase evaluation values and outputs a corrected character string; a candidate character comparison unit that compares the candidate character set with the corrected character string; and a same-character extraction unit that extracts the same character occurring in different words and outputs additional learning characters.

3. A character recognition device comprising: a character recognition unit that recognizes a document image and outputs N candidate characters per character; a word search unit that obtains a candidate word set from the candidate character set using a word dictionary; a phrase search unit that obtains candidate phrases from the candidate word set using a grammar dictionary; a phrase evaluation value calculation unit that calculates the lexical and grammatical correctness of each phrase; a phrase selection unit that selects phrases based on the phrase evaluation values and outputs a corrected character string; a keyword extraction unit that extracts keywords of the document being recognized from the corrected character string; a keyword partial match search unit that searches the candidate character set for partial matches with the keywords; a candidate word addition unit that adds the partially matched keywords to the candidate word set; and a non-candidate character detection unit that detects, in the corrected character string, the characters added by the candidate word addition unit and outputs them as additional learning characters.

4. A character recognition device comprising: a character recognition unit that recognizes a document image and outputs N candidate characters per character; a word search unit that obtains a candidate word set from the candidate character set using a word dictionary; a phrase search unit that obtains candidate phrases from the candidate word set using a grammar dictionary; a phrase evaluation value calculation unit that calculates the lexical and grammatical correctness of each phrase; a phrase selection unit that selects phrases based on the phrase evaluation values and outputs a corrected character string; a keyword extraction unit that extracts keywords of the document being recognized from the corrected character string; a keyword partial match search unit that searches the candidate character set for partial matches with the keywords; a word error correction degree calculation unit that determines the error correction degree of the words in the corrected character string; a reject character determination unit that determines reject characters from the word error correction degree; a candidate word addition unit that compares the reject characters with the partially matched keywords and adds the matching characters to the candidate word set; and a non-candidate character detection unit that detects, in the corrected character string, the characters added by the candidate word addition unit and outputs them as additional learning characters.

5. The character recognition device according to claim 4, wherein the character recognition unit comprises: a similarity calculation unit that calculates character similarity from a character image and weighting coefficients and outputs candidate characters; and a weighting coefficient update unit that updates the weighting coefficients based on the error between the additional learning character and the candidate characters.

6. A character recognition device comprising: a character recognition unit that recognizes a document image and outputs N candidate characters per character; a word search unit that obtains a candidate word set from the candidate character set using a word dictionary; a phrase search unit that obtains candidate phrases from the candidate word set using a grammar dictionary; a phrase evaluation value calculation unit that calculates the lexical and grammatical correctness of each phrase; a phrase selection unit that selects phrases based on the phrase evaluation values and outputs a corrected character string; and a character type determination unit that determines character types from the corrected character string using a character type dictionary.