JP2004139428A

JP2004139428A - Character recognition device

Info

Publication number: JP2004139428A
Application number: JP2002304616A
Authority: JP
Inventors: Hideo Horiuchi; 堀内　秀雄; Naoki Natori; 名取　直毅; Akihiko Nakao; 中尾　昭彦; Takuma Akagi; 赤木　琢磨; Yasuhiro Aoki; 青木　泰浩; Tomoyuki Hamamura; 浜村　倫行
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-10-18
Filing date: 2002-10-18
Publication date: 2004-05-13

Abstract

PROBLEM TO BE SOLVED: To reduce the frequency of misrecognition when recognizing characters with accent marks in language systems such as German and French.
SOLUTION: Accent marks, which may cause misrecognition, are removed in advance before the characters are recognized, and the words in the recognized character string are collated against a database from which accent marks have likewise been removed in advance.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字の認識を行う文字認識装置に関する。
【０００２】
【従来の技術】
ドイツ語やフランス語等の言語体系においては、英語におけるアルファベット文字の上下にアクセント記号（ウムラウト等）が付いた文字が存在する。これらのアクセント記号は文字認識をする上で重要な情報であり、従来はアクセント記号を付けたまま文字認識を行ってきた。
【０００３】
しかし、解像度の問題や手書き文字の場合には、アクセント記号がつぶれてしまったり、アルファベット文字本体に接触してしまい、誤読を引き起こす原因となっていた。
【０００４】
【発明が解決しようとする課題】
この発明は、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字の認識を行うものにおいて、アクセント記号が存在する文字列であっても、誤読なく文字認識を行うことができる文字認識装置を提供することを目的としている。
【０００５】
【課題を解決するための手段】
この発明の文字認識装置は、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字の認識において、アクセント記号付き文字よりアクセント記号をあらかじめ除去して文字認識を行い、あらかじめアクセント記号付き文字よりアクセント記号を除去した文字を用いて作成された単語列からなるデータベースと照合して単語照合を行うものである。
【０００６】
この発明の文字認識装置は、ドイツ語等の言語体系に見られるアクセント記号の付いた文字を含む単語により作成されている文書、あるいはアクセント記号の付いた文字を代替文字で表記した文字を含む単語により作成されている文書のいずれかから文字を認識するものにおいて、アクセント記号の付いた文字を含む単語を含む単語列からなる第１の単語辞書と、アクセント記号の付いた文字を代替文字で表記した文字を含む単語列からなる第２の単語辞書と、上記文書のアクセント記号の有無を判断する判断手段と、上記文書の文字を認識する認識手段と、この認識手段により認識した文字から単語列を抽出する抽出手段と、上記判断手段により上記文書のアクセント記号の有りを判断した際、上記抽出手段により抽出された単語列と上記第１の単語辞書との比較により一致する文字列を単語として出力し、上記判断手段により上記文書のアクセント記号が無いと判断された際、上記抽出手段により抽出された単語列と上記第２の単語辞書との比較により一致する文字列を単語として出力する出力手段とを有する。
【０００７】
【発明の実施の形態】
以下、図面を参照してこの発明の実施形態の文字認識装置を説明する。
この文字認識装置１０は、図１に示すように、画像入力部１、前処理部２、文字行切り出し部３、文字切り出し部４、文字認識部５、後処理部６、答え出力部７、データベース８により構成されている。
【０００８】
まず、画像入力部１において、必要情報の書き込まれている対象物より、必要情報と思われる部分の画像を入力する。画像入力部１としては、ＣＣＤカメラやスキャナ等を用いる。
次に、前処理部２において、入力された画像を二値画像（画素単位のマトリクス状のビットマップデータ）に変換し、黒画素の連結成分を１つの塊とするようなラベリングを行う。
【０００９】
各ラベル画像ごとの連結成分（領域）の最大のＸ座標、Ｙ座標、最小のＸ座標、Ｙ座標（外接矩形領域の座標）が計測され、外接矩形領域テーブル２ａに登録される。
また、各ラベル画像ごとの連結成分に対するラスタ方向の走査順に求めた全てのランの始点座標、終点座標が、ランの始点、終点座標テーブル２ｂに登録される。
【００１０】
得られた黒画素の塊の外接矩形の両辺の長さが、両辺ともあるしきい値よりも小さなものは「ノイズ」として除去される。この際、アクセント記号が除去される場合もある。また、この際に除去された黒画素の塊の形や大きさや二値画像全体における位置情報をそのまま残したノイズ画像データが、ノイズ画像データテーブル２ｃに登録される。
【００１１】
例えば図２に示すようなドイツ語文書の場合、図３に示すような、点の部分に対するデータが登録される。
【００１２】
次に、文字行切り出し部３において、文字行を抽出する。上記二値画像において、横方向（第１の方向）と縦方向（第２の方向）それぞれに黒画素の射影をとり、より規則的なピークのある方向を文字行の方向と判断する。射影値のピークとピークの間の値の小さな場所を行の境目と判断して画像を分離する。ここで分離されたものが文字行となる。
【００１３】
上記文字行切り出し部３により切出された行画像から、さらに文字切り出し部４において、黒画素の塊ごとに行画像を分離する。前後する黒画素の塊と幅（行方向における黒画素の外接矩形の長さ）を比較し、前後する黒画素よりも幅があきらかに大きいものについては、さらに行方向と垂直な方向に画素を分離する。分離する方法としては、行画像について、行方向と垂直な方向に黒画素の射影をとり、射影値の小さい部分を文字の境目と判断して分離する。もしくは、行画像において、上下両端より行方向と垂直な方向に画像を走査して黒画素と初めてぶつかった位置より文字幅を求め、文字幅の小さい部分を文字の境目と判断して分離する。
【００１４】
上記文字切り出し部４より得られた文字を、文字認識部５によって認識する。認識方式は既存の方式（複合類似度方式等）でかまわない。
その文字認識結果をそのまま用いても良いが、本発明利用者によっては、特定出来る情報（例えば地名）を利用することも出来るため、必要に応じて後処理部６においてデータベース８を利用して単語照合を行う。
最後に、文字認識部５および後処理部６より出力された文字認識結果を、必要に応じて、答え出力部７において出力する。
【００１５】
［第１の実施形態］
この第１の実施形態は、アクセント記号付き文字（アルファベット文字）よりアクセント記号をあらかじめ除去して文字認識を行い、あらかじめアクセント記号付き文字よりアクセント記号を除去した文字を用いて作成された単語辞書としてのデータベース８と照合して単語照合を行うものである。
【００１６】
すなわち、図１の前処理部２において、図２に示すような、入力された画像を、二値画像（ビットマップデータ）に変換し、この二値画像の黒画素の連結成分を１つの塊とするようなラベリングを行う。得られた黒画素の塊の外接矩形の両辺の長さが、あるしきい値よりも小さなものは「ノイズ」として除去される。この際、アクセント記号が除去される。
【００１７】
例えば、図４のような、ウムラウトが記載されている場合、黒画素の塊による外接矩形が、図５のようになり、ウムラウトによる２つの外接矩形と「ａ」による外接矩形が抽出される。この際、ウムラウトによる２つの外接矩形の両辺の長さが、あるしきい値よりも小さいため、上記二値画像からウムラウトによる黒画素がノイズとして除去される。
【００１８】
また、データベース８は、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字からなる単語列（図６の（ａ））からアクセント記号を除いた文字からなる単語列（図６の（ｂ））を記憶するものである（単語辞書）。
【００１９】
このような構成において、認識動作を図７に示すフローチャートを参照しつつ説明する。
すなわち、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字を用いて記載されている書類等を、画像入力部１において、読取走査し、この読取画像を入力画像（図２参照）として前処理部２に出力する（ＳＴ１）。
【００２０】
これにより、前処理部２は、入力された画像を二値画像に変換し、黒画素の連結成分を１つの塊とするようなラベリングを行う（ＳＴ２）。この際、前処理部２は、上記したように、得られた黒画素の塊の外接矩形の両辺の長さが、あるしきい値よりも小さなものを「ノイズ」として除去し、アクセント記号も除去する。
前処理部２は、ラベリングの結果を文字行切り出し部３に出力する。
【００２１】
次に、文字行切り出し部３において、文字行を切り出し（ＳＴ３）、この切り出された行画像から、さらに文字切り出し部４において、文字を切り出し（ＳＴ４）、この切出した文字を文字認識部５によって認識する（ＳＴ５）。
これにより、後処理部６はこの文字認識結果としての複数文字列の認識結果（図６の（ｃ））と、上記データベース８のアクセント記号を除いた文字からなる単語列（図６の（ｂ））との照合を行う（ＳＴ６）。
この照合により一致する照合結果を答え出力部７において出力する（ＳＴ７）。
【００２２】
［第２の実施形態］
この第２の実施形態は、第１の実施形態のように、上記前処理部２においてノイズとともにアクセント記号を除去する際に、解像度が低い画像の場合等では、アクセント記号がアルファベット文字に接触して残ってしまう場合が有り、この対策を行うものである。ノイズとして除去できなかった場合の対策である。
【００２３】
この場合には、文字切り出し部４により切り出された各文字ごとの外接矩形を調べ、前後の文字に対して高さ（行方向と垂直方向の辺）があきらかに大きいもの（所定画素以上高いもの）をアクセント付き文字の可能性があるとする（結果として、大文字も含まれることになる）。この場合には、図８（ａ）（ｂ）（ｃ）に示すように、一文字として分離された文字の外接矩形の上側もしくは下側を１画素行ずつ（１画素分の幅に相当する幅で形成される１行ずつ）削りつつ文字認識を繰り返し、辞書５ａに含まれる基準文字とのマッチング度（類似度等）が高い時に、正解とする。
【００２４】
上記第１の実施形態のフローチャートのステップ５において、切出した文字を文字認識部５によって認識する際に、下記処理が実行される。
すなわち、図９において、文字切り出し部４により切り出された各文字ごとの外接矩形を調べ（ＳＴ１１）、前後の文字に対して高さがあきらかに大きいか否かにより、アクセント付き文字か否かを判断する（ＳＴ１２）。
【００２５】
この判断結果が、アクセント付き文字の場合、文字認識部５は一文字として分離された文字の外接矩形の上側（最上行）から下側の所定の画素行までの１画素行ずつ除去した各読取りパターンを正規化したパターンと辞書５ａに含まれる各基準文字とのマッチング度（類似度等）を演算し（ＳＴ１３）、一番高いマッチング度の文字を認識結果として出力する（ＳＴ１４）。
【００２６】
上記ステップ１２によりアクセント付き文字ではない通常の文字の場合、文字認識部５は一文字として分離された文字の読取りパターンを正規化したパターンと辞書５ａに含まれる各基準文字とのマッチング度（類似度等）を演算し（ＳＴ１５）、一番高いマッチング度の文字を認識結果として出力する（ＳＴ１６）。
【００２７】
［第３の実施形態］
この第３の実施形態は、第１の実施形態のように、上記前処理部２においてノイズとともにアクセント記号を除去できなかった場合の対策である。
この場合には、文字切り出し部４により切り出された各文字ごとの外接矩形を調べ、前後の文字に対して高さ（行方向と垂直方向の辺）があきらかに大きいもの（所定画素以上高いもの）をアクセント付き文字の可能性があるとする（結果として、大文字も含まれることになる）。この場合には、図１０に示すように、一文字として分離された文字の複数箇所の文字幅を推定し、この推定した文字幅の一番小さな部分で文字を切断して、図１１に示すように、上下にそれぞれ切断された文字に対する外接矩形を生成し、大きい方の外接矩形に対する文字パターンを切出した文字として文字認識部５によって認識する。
【００２８】
上記文字幅の推定方法としては、上記アクセント記号付き文字の可能性ありとして抽出された文字の外接矩形の左右両端よりそれぞれ逆側に向かって画素（１画素単位、あるいは複数画素ごと）を走査し（図１０参照）、黒画素とぶつかった位置を求めて文字幅を推定するものである。
【００２９】
図１０の場合、アクセント記号の下側と英字「ａ」の上側とに隙間が開いているため、この間の文字幅が一番小さくなり、この位置で、図１１に示すように、外接矩形を上下２つに分離し、大きい方の外接矩形を判断し、この判断した外接矩形内の文字パターンにより文字認識を行う。
【００３０】
上記第１の実施形態のフローチャートのステップ５において、切出した文字を文字認識部５によって認識する際に、下記処理が実行される。
すなわち、図１２において、文字切り出し部４により切り出された各文字ごとの外接矩形を調べ（ＳＴ２１）、前後の文字に対して高さがあきらかに大きいか否かにより、アクセント付き文字か否かを判断する（ＳＴ２２）。
【００３１】
この判断結果が、アクセント付き文字の場合、文字認識部５は一文字として分離された文字の複数箇所の文字幅を推定し（ＳＴ２３）、この推定した文字幅の一番小さな部分で文字を切断して、図１１に示すように、上下にそれぞれ切断された文字に対する外接矩形を生成し（ＳＴ２４）、大きい方の外接矩形に対する文字パターンを正規化したパターンと辞書５ａに含まれる各基準文字とのマッチング度（類似度等）を演算し（ＳＴ２５）、一番高いマッチング度の文字を認識結果として出力する（ＳＴ２６）。
【００３２】
上記ステップ２２によりアクセント付き文字ではない通常の文字の場合、文字認識部５は一文字として分離された文字の読取りパターンを正規化したパターンと辞書５ａに含まれる各基準文字とのマッチング度（類似度等）を演算し（ＳＴ２７）、一番高いマッチング度の文字を認識結果として出力する（ＳＴ２８）。
【００３３】
［第４の実施形態］
第４の実施形態は、言語によって使用するアクセント記号が違うことを利用して、アクセント記号の識別に基づいて言語を判断し、この判断した言語専用の辞書に基づいて、認識処理を行うものである。
たとえば、ヨーロッパ大陸の国では近接している国が多いため、その文書は場合によっては、複数言語によって記載されている場合が多い。
【００３４】
この場合、例えば図１３に示すドイツ語文書と、図１４に示すフランス語文書を見ると、それぞれの言語に固有のアクセント記号がついていることがわかる。
すなわち、ドイツ語文書の場合には、図１５に示すように、アクセント記号としてウムラウトが記載されている。また、フランス語文書の場合には、図１６に示すように、アクセント記号としてアクサン−テギュ、アクサン−グラーブ、アクサン−シルコンフレクス、トレマ、セディーユが記載されている。
【００３５】
これにより、アクセント記号としてウムラウトだけが記載されている場合には、ドイツ語文書と判断し、アクサン−テギュ、アクサン−グラーブ、アクサン−シルコンフレクス、トレマ、セディーユ等が記載されている場合には、フランス語文書と判断する。
このように、アクセント記号を識別することにより、何の言語で書かれてあるかが推定できる。
【００３６】
また、上記ノイズ画像データや、上記第３の実施形態において切断された文字の小さい方の外接矩形よりアクセント記号を抽出し、この抽出したアクセント記号の識別を行い、この識別結果に基づいて、言語が推定された後、その言語の辞書を選択して文字認識および単語照合を行う。
【００３７】
アクセント記号を識別する方法としては、アクセント記号の画像を抽出し、画像サイズを正規化して、あらかじめアクセント記号用の辞書として用意されているアクセント記号画像とのマッチングをとる方法がある。
この第４の実施形態で用いる文字認識装置１０は、図１７に示すように、画像入力部１、前処理部２、文字行切り出し部３、文字切り出し部４、文字認識部５、後処理部６、答え出力部７、データベース８、アクセント記号識別部１１、言語判別部１２により構成されている。
【００３８】
ただし、文字認識部５には、フランス語用の辞書２１と、ドイツ語用の辞書２２とが設けられている。データベース８には、フランス語用の単語列の記憶部２３と、ドイツ語用の単語列の記憶部２４とが設けられている。
アクセント記号識別部１１には、識別するアクセント記号ごとの基準画像が記憶されているアクセント記号用の辞書２５が設けられている。
【００３９】
アクセント記号識別部１１は、前処理部２のノイズ画像データテーブル２ｃに登録されているノイズ画像データとしての除去された黒画素の塊の形や大きさや二値画像全体における位置情報に基づいて、アクセント記号の画像を抽出し、画像サイズを正規化して、あらかじめアクセント記号用の辞書２５に用意されているアクセント記号画像とのマッチングをとることにより、アクセント記号の識別を行うものである。このアクセント記号識別部１１によるアクセント記号の識別結果は、言語判別部１２に出力される。
【００４０】
言語判別部１２は、入力画像の全体に対する、アクセント記号識別部１１からのアクセント記号の識別結果に基づいて、言語を判別するものである。この言語判別部１２による言語の判別結果は、文字認識部５、後処理部６に出力される。
【００４１】
言語判別部１２は、例えば、アクセント記号としてウムラウトだけが記載されている場合には、ドイツ語文書と判断し、アクサン−テギュ、アクサン−グラーブ、アクサン−シルコンフレクス、セディーユ等の少なくとも１つが記載されている場合には、フランス語文書と判断する。
なお、フランス語文書に、フランス語のアクセント記号としてウムラウトと表記が同じトレマが混在していても良い。
【００４２】
このような構成において、認識動作を図１８、図１９に示すフローチャートを参照しつつ説明する。
すなわち、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字を用いて記載されている書類等を、画像入力部１において、読取走査し、この読取画像を入力画像（図２参照）として前処理部２に出力する（ＳＴ３１）。
【００４３】
これにより、前処理部２は、入力された画像を二値画像に変換し、黒画素の連結成分を１つの塊とするようなラベリングを行う（ＳＴ３２）。この際、前処理部２は、上記したように、得られた黒画素の塊の外接矩形の両辺の長さが、あるしきい値よりも小さなものを「ノイズ」として除去する。
この際に除去された黒画素の塊の形や大きさや二値画像全体における位置情報をそのまま残したノイズ画像データが、ノイズ画像データテーブル２ｃに登録される（ＳＴ３３）。
【００４４】
これにより、アクセント記号識別部１１は、前処理部２のノイズ画像データテーブル２ｃに登録されているノイズ画像データとしての除去された黒画素の塊の形や大きさや二値画像全体における位置情報に基づいて、アクセント記号の画像を抽出し、画像サイズを正規化して、あらかじめアクセント記号用の辞書２５に用意されているアクセント記号画像とのマッチングをとることにより、アクセント記号の識別を行う（ＳＴ３４）。
このアクセント記号識別部１１によるアクセント記号の識別結果は、言語判別部１２に出力される。
【００４５】
これにより、言語判別部１２は、入力画像の全体に対する、アクセント記号識別部１１からのアクセント記号の識別結果としてウムラウトだけが記載されている場合、ドイツ語文書と判断し、アクサン−テギュ、アクサン−グラーブ、アクサン−シルコンフレクス、セディーユ等の少なくとも１つが記載されている場合には、フランス語文書と判断する（ＳＴ３５）。
この判断結果は文字認識部５と後処理部６に出力される。
【００４６】
文字認識部５は、言語判別部１２からの判断結果がドイツ語文書の場合には、ドイツ語用の辞書２２を選択し、フランス語文書の場合には、フランス語用の辞書２１を選択する（ＳＴ３６）。
後処理部６は、言語判別部１２からの判断結果がドイツ語文書の場合には、データベース８のドイツ語用の単語列の記憶部２４を選択し、フランス語文書の場合には、データベース８のフランス語用の単語列の記憶部２３を選択する（ＳＴ３７）。
また、前処理部２は、ラベリングの結果を文字行切り出し部３に出力する。
【００４７】
次に、文字行切り出し部３において、文字行を切り出し（ＳＴ３８）、この切り出された行画像から、さらに文字切り出し部４において、文字を切り出し（ＳＴ３９）、この切出した文字を文字認識部５によって上記選択されているフランス語用の辞書２１あるいはドイツ語用の辞書２２を用いて認識する（ＳＴ４０）。
【００４８】
これにより、後処理部６はこの文字認識結果としての複数文字列の認識結果と、上記選択されているデータベース８のフランス語用の単語辞書２３あるいはドイツ語用の単語辞書２４との照合を行う（ＳＴ４１）。
この照合により一致する照合結果を答え出力部７において出力する（ＳＴ４２）。
【００４９】
［第５の実施形態］
第５の実施形態は、手書き文字に多い、文字が連結されたカーシブ文字列において、アクセント記号が付く場合には、その位置から連結されている文字の位置を推定することが出来る。
上記第１の実施形態の文字切り出し部４において、図２０に示すように、アクセント記号が抽出された場合、そのアクセント記号位置Ｓを基準にして、そこから左右両方向に向けて文字幅を調べ、文字幅が急激に変動している部分を文字と判断して切り出す。
【００５０】
上記文字幅の推定方法としては、上記文字が連結されたカーシブ文字列の可能性ありとして抽出された文字列の外接矩形の上下両端よりそれぞれ逆側に向かって画素（１画素単位、あるいは複数画素ごと）を走査し（図２０参照）、黒画素とぶつかった位置を求めて文字幅を推定するものである。
図２０の場合、アクセント記号の位置を基準として点線部分が文字幅が変動している部分であり、この点線部分で区切ることにより、「ｎ」「ｉ」「ｃ」「ｅ」の手書き文字の文字ずつが検出切り出しされる。
【００５１】
［第６の実施形態］
第６の実施形態は、手書きによるアクセント記号画像を登録し、次回以降のアクセント記号の判定に利用するようにしたものである。これにより、同一人物が記載した手書きによるアクセント記号の認識率を向上させることができる。特に、本装置を利用する人が少ない場合に有効である。これは、特定の人の種々の書き癖を登録可能となっているためである。
【００５２】
すなわち、手書き文字を同一人物が記載した場合には、同一筆記具でもあるために、そのアクセント記号の書き方に、その人や筆記具の癖が現われる。図２１、図２２に示すように、アクセント記号としてのウムラウト、あるいはアクサン−テギュを抽出出来た場合には、そのアクセント記号画像を記憶装置５０に記憶しておき、次回以降アクセント記号が抽出された場合に、そのアクセント記号画像とマッチングをとって同一性を調べることで、正解かどうかの確度が高くなる。
【００５３】
マッチングの方法については、画像サイズを正規化して、パターンマッチングを行う方法や、アクセント記号に直線部が多いものであれば、その輪郭線の長さや輪郭線の角度、他に芯線の長さや角度の値を記憶しておいて、比較する方法もある。
たとえば、アクサン−シルコンフレクスの場合は、図２３に示すように、輪郭線の長さＬと輪郭線の傾きθなどのサイズと形状が記憶される。
【００５４】
［第７の実施形態］
第７の実施形態は、ドイツ語においてはアクセント記号としてウムラウトがあり、このウムラウトの代りに代用規則による表記ができるようになっているのを考慮して、ウムラウトを用いて作成されている文章の認識を行う場合と、ウムラウトの代りに代用規則による表記（代用文字）で作成されている文章の認識を行う場合とで、用いる単語辞書を変更するようにしたものである。
【００５５】
すなわち、ウムラウトを用いて作成されている文章の認識を行う場合には、ウムラウト付き文字が含まれる単語が登録されている単語辞書により単語照合を行い、ウムラウトの代りに代用規則による表記で作成されている文章の認識を行う場合には、ウムラウト付き文字が含まれる単語をウムラウトなしの代用文字が含まれる単語に置き換えて登録されている単語辞書により単語照合を行うものである。
【００５６】
上記２種類の辞書のいずれを用いるかは、上記ノイズ画像データによるアクセント記号の有無の判断結果が利用できる。（第４の実施形態参照）
これにより、単語照合のマッチング精度を高めることができる。
【００５７】
ドイツ語によるウムラウトがない場合の代用規則は、
【数１】

となっており、
ドイツ語によるウムラウト付き文字が含まれる単語は、
【数２】

となっており、
ドイツ語によるウムラウトなしの代用文字が含まれる単語は、
【数３】

となっている。
【００５８】
この第７の実施形態で用いる文字認識装置１０は、図２４に示すように、画像入力部１、前処理部２、文字行切り出し部３、文字切り出し部４、文字認識部５、後処理部６、答え出力部７、データベース８、アクセント記号識別部１１、判別部１３により構成されている。
【００５９】
ただし、文字認識部５には、ウムラウト付きの文字が含まれるドイツ語用の辞書３１、ウムラウト付きの文字を含まないドイツ語用の辞書３２が設けられている。データベース８には、ウムラウト付き文字が含まれる単語列の記憶部（単語辞書）３３と、ウムラウト付きの文字に代る代用文字が含まれる単語列の記憶部（単語辞書）３４とが設けられている。
アクセント記号識別部１１には、識別するアクセント記号ごとの基準画像が記憶されているアクセント記号（ウムラウト）用の辞書２５が設けられている。
【００６０】
アクセント記号識別部１１は、前処理部２のノイズ画像データテーブル２ｃに登録されているノイズ画像データとしての除去された黒画素の塊の形や大きさや二値画像全体における位置情報に基づいて、アクセント記号の画像を抽出し、画像サイズを正規化して、あらかじめアクセント記号用の辞書２５に用意されているアクセント記号画像とのマッチングをとることにより、アクセント記号の識別を行うものである。このアクセント記号識別部１１によるアクセント記号の識別結果は、後処理部６に出力される。
【００６１】
後処理部６は、アクセント記号識別部１１からウムラウトのアクセント記号の有りにより、データベース８のウムラウト付き文字が含まれる単語辞書３２により単語照合を行い、アクセント記号識別部１１からウムラウトのアクセント記号の無しにより、データベース８のウムラウト付き文字が含まれる単語をウムラウトつきの文字の代りに代用文字が含まれる単語に置き換えて登録されている単語辞書３３により単語照合を行うものである。
【００６２】
このような構成において、認識動作を図２５、図２６に示すフローチャートを参照しつつ説明する。
すなわち、ドイツの言語体系に見られるアクセント記号（ウムラウト）の付いた文字を用いて記載されている書類、あるいはアクセント記号の付いた文字の代りに代用文字を用いて記載されている書類等を、画像入力部１において、読取走査し、この読取画像を入力画像（図２参照）として前処理部２に出力する（ＳＴ５１）。
【００６３】
これにより、前処理部２は、入力された画像を二値画像に変換し、黒画素の連結成分を１つの塊とするようなラベリングを行う（ＳＴ５２）。この際、前処理部２は、上記したように、得られた黒画素の塊の外接矩形の両辺の長さが、あるしきい値よりも小さなものを「ノイズ」として除去する。
この際に除去された黒画素の塊の形や大きさや二値画像全体における位置情報をそのまま残したノイズ画像データが、ノイズ画像データテーブル２ｃに登録される（ＳＴ５３）。
【００６４】
これにより、アクセント記号識別部１１は、前処理部２のノイズ画像データテーブル２ｃに登録されているノイズ画像データとしての除去された黒画素の塊の形や大きさや二値画像全体における位置情報に基づいて、アクセント記号の画像を抽出し、画像サイズを正規化して、あらかじめアクセント記号用の辞書２５に用意されているアクセント記号（ウムラウト）画像とのマッチングをとることにより、アクセント記号の識別を行う（ＳＴ５４）。
【００６５】
このアクセント記号識別部１１によるアクセント記号の識別結果は、判別部１３に出力される。
これにより、判別部１３は、入力画像の全体に対する、アクセント記号識別部１１からのアクセント記号の識別結果としてウムラウトが記載されている場合、アクセント記号（ウムラウト）の付いた文字を用いて記載されているドイツ語文書と判断し、ウムラウトが記載されていない場合、アクセント記号（ウムラウト）の付いた文字の代りに代用文字を用いて記載されているドイツ語文書と判断する（ＳＴ５５）。
【００６６】
この判断結果は文字認識部５と後処理部６に出力される。
文字認識部５は、判別部１３からの判断結果がアクセント記号（ウムラウト）の付いた文字を用いて記載されているドイツ語文書の場合には、ウムラウト付きの文字が含まれる辞書３１を選択し、判別部１３からの判断結果がアクセント記号（ウムラウト）の付いた文字の代りに代用文字を用いて記載されているドイツ語文書の場合には、ウムラウト付きの文字を含まない辞書３２を選択する（ＳＴ５６）。
【００６７】
後処理部６は、判別部１３からの判断結果がアクセント記号（ウムラウト）の付いた文字を用いて記載されているドイツ語文書の場合には、ウムラウト付き文字が含まれる単語列の単語辞書３３を選択し、判別部１３からの判断結果がアクセント記号（ウムラウト）の付いた文字の代りに代用文字を用いて記載されているドイツ語文書の場合には、ウムラウト付きの文字の代りの代用文字が含まれる単語列の単語辞書３４を選択する（ＳＴ５７）。
また、前処理部２は、ラベリングの結果を文字行切り出し部３に出力する。
【００６８】
次に、文字行切り出し部３において、文字行を切り出し（ＳＴ５８）、この切り出された行画像から、さらに文字切り出し部４において、文字を切り出し（ＳＴ５９）、この切出した文字を文字認識部５によって上記選択されているドイツ語用の辞書３１あるいはドイツ語用の辞書３２を用いて認識する（ＳＴ６０）。
これにより、後処理部６はこの文字認識結果としての複数文字列の認識結果と、上記選択されているデータベース８のウムラウト付き文字が含まれる単語列の単語辞書３３あるいはウムラウト付き文字の代りの代用文字が含まれる単語列の単語辞書３４との照合を行う（ＳＴ６１）。
【００６９】
この照合により一致する照合結果を答え出力部７において出力する（ＳＴ６２）。
したがって、第１の実施形態では、誤読の原因となっているアクセント記号をあらかじめ除去してから認識し、あらかじめアクセント記号を除去したデータベースを用いて単語照合を行うことで誤読を削減することができる。
【００７０】
また、第４の実施形態では、削除したアクセント記号の種類により言語の判別を行って、用いる辞書を選択することができる。
また、第５の実施形態では、アクセント記号の位置から文字を切り出す場所を特定することができる。
また、第６の実施形態では、アクセント記号の形状を記憶しておいて、以降のアクセント記号識別に利用する。
上記したように、アクセント記号を積極的に利用することで認識率を向上できる。
【００７１】
【発明の効果】
以上詳述したように、この発明によれば、ドイツ語やフランス語等の言語体系に見られるアクセント記号の付いた文字の認識を行うものにおいて、アクセント記号が存在する文字列であっても、誤読なく文字認識を行うことができる文字認識装置を提供できる。
【図面の簡単な説明】
【図１】この発明の実施形態を説明するための文字認識装置の概略構成を示すブロック図。
【図２】入力された画像を説明するための図。
【図３】ノイズ画像データを説明するための図。
【図４】ウムラウトが記載されている文字を説明するための図。
【図５】抽出される外接矩形を説明するための図。
【図６】アクセント記号の付いた文字を含む単語列と、アクセント記号を除いた文字からなる単語列と、比較する単語列とを説明するための図。
【図７】認識動作を説明するためのフローチャート。
【図８】一文字として分離された文字の外接矩形の上側を１画素行ずつ削った際の文字パターンを説明するための図。
【図９】一文字ずつの文字認識処理を説明するためのフローチャート。
【図１０】一文字ずつの文字幅を推定する際の処理を説明するための図。
【図１１】一文字ずつの文字幅が一番小さくなる位置で、外接矩形を上下２つに分離した状態を説明するための図。
【図１２】一文字ずつの文字認識処理を説明するためのフローチャート。
【図１３】ドイツ語文書の一例を示す図。
【図１４】フランス語文書の一例を示す図。
【図１５】ドイツ語文書のアクセント記号を示す図。
【図１６】フランス語文書のアクセント記号を示す図。
【図１７】文字認識装置の概略構成を示すブロック図。
【図１８】認識動作を説明するためのフローチャート。
【図１９】認識動作を説明するためのフローチャート。
【図２０】文字が連結されたカーシブ文字列を示す図。
【図２１】アクセント記号の記憶例を示す図。
【図２２】アクセント記号の記憶例を示す図。
【図２３】アクセント記号の記憶例を示す図。
【図２４】文字認識装置の概略構成を示すブロック図。
【図２５】認識動作を説明するためのフローチャート。
【図２６】認識動作を説明するためのフローチャート。
【符号の説明】
１…画像入力部、２…前処理部、３…文字行切り出し部、４…文字切り出し部、５…文字認識部、５ａ…辞書、６…後処理部、７…答え出力部、８…データベース、１０…文字認識装置、１１…アクセント記号識別部。
[0001]
[Technical field of the invention]
The present invention relates to a character recognition device that recognizes accented characters found in language systems such as German and French.
[0002]
[Prior art]
In language systems such as German and French, there are characters in which an accent mark (such as an umlaut) is attached above or below a letter of the English alphabet. These accent marks are important information for character recognition, and conventionally character recognition has been performed with the accent marks left in place.
[0003]
However, because of limited resolution, or in the case of handwritten characters, an accent mark may be smeared or may touch the body of the letter, which causes misreading.
[0004]
[Problems to be solved by the invention]
An object of the present invention is to provide a character recognition device that recognizes accented characters found in language systems such as German and French and that can recognize characters without misreading even in character strings containing accent marks.
[0005]
[Means for Solving the Problems]
In recognizing accented characters found in language systems such as German and French, the character recognition device of the present invention removes the accent marks from the accented characters in advance and then performs character recognition. It then performs word matching against a database of word strings created in advance from characters whose accent marks have been removed.
[0006]
A character recognition device according to the present invention recognizes characters from either a document written with words containing accented characters found in a language system such as German, or a document written with words in which the accented characters are represented by substitute characters. The device comprises: a first word dictionary consisting of word strings that include words containing accented characters; a second word dictionary consisting of word strings that include characters in which the accented characters are represented by substitute characters; determination means for determining whether accent marks are present in the document; recognition means for recognizing the characters of the document; extraction means for extracting word strings from the characters recognized by the recognition means; and output means which, when the determination means determines that accent marks are present in the document, compares the word string extracted by the extraction means with the first word dictionary and outputs the matching character string as a word, and which, when the determination means determines that accent marks are absent from the document, compares the extracted word string with the second word dictionary and outputs the matching character string as a word.
[0007]
[Embodiments of the invention]
Hereinafter, a character recognition device according to an embodiment of the present invention will be described with reference to the drawings.
As shown in FIG. 1, the character recognition device 10 comprises an image input unit 1, a preprocessing unit 2, a character line cutout unit 3, a character cutout unit 4, a character recognition unit 5, a post-processing unit 6, an answer output unit 7, and a database 8.
[0008]
First, the image input unit 1 inputs an image of a portion considered to be necessary information from an object in which necessary information is written. As the image input unit 1, a CCD camera, a scanner, or the like is used.
Next, in the pre-processing unit 2, the input image is converted into a binary image (bitmap data in the form of a matrix in pixel units), and labeling is performed such that the connected components of black pixels are formed into one block.
[0009]
The maximum X coordinate, Y coordinate, minimum X coordinate, and Y coordinate (coordinates of the circumscribed rectangular area) of the connected component (area) for each label image are measured and registered in the circumscribed rectangular area table 2a.
Also, the start point coordinates and end point coordinates of all the runs obtained in the raster direction scanning order for the connected components of each label image are registered in the run start point and end point coordinate table 2b.
[0010]
Any black pixel block whose circumscribed rectangle has both side lengths smaller than a certain threshold is removed as "noise". Accent marks may also be removed at this stage. In addition, noise image data that preserves the shape, size, and position within the whole binary image of each removed black pixel block is registered in the noise image data table 2c.
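The labeling and noise-removal step described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names, the use of 8-connectivity, and the threshold value are assumptions, since the text does not specify them.

```python
from collections import deque

def label_components(img):
    """Group black pixels (truthy cells) into 8-connected components.

    Returns a list of components, each a list of (row, col) tuples.
    8-connectivity is an assumption; the source does not state it.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    return comps

def bounding_box(pixels):
    """Circumscribed rectangle as (min_x, min_y, max_x, max_y)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(xs), min(ys), max(xs), max(ys)

def split_noise(img, threshold):
    """Separate components into (kept, noise).

    A component is noise when BOTH sides of its bounding box are
    shorter than `threshold` (a hypothetical tuning parameter).
    The noise list plays the role of the noise image data table.
    """
    kept, noise = [], []
    for pixels in label_components(img):
        x0, y0, x1, y1 = bounding_box(pixels)
        if (x1 - x0 + 1) < threshold and (y1 - y0 + 1) < threshold:
            noise.append(pixels)
        else:
            kept.append(pixels)
    return kept, noise
```

Because the removed components keep their pixel coordinates, they remain available for later inspection, which corresponds to the role the text gives to the noise image data table 2c.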
[0011]
For example, in the case of a German document such as that shown in FIG. 2, data for the dot portions is registered as shown in FIG. 3.
[0012]
Next, the character line cutout unit 3 extracts character lines. In the binary image, black pixels are projected in both the horizontal direction (first direction) and the vertical direction (second direction), and the direction whose projection shows the more regular peaks is judged to be the character line direction. Positions between peaks where the projection value is small are judged to be line boundaries, and the image is separated there. Each separated piece is a character line.
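As a rough illustration of the projection-based line separation, the sketch below computes a row projection and splits it at low-valued positions. It assumes the line direction is already known to be horizontal and omits the peak-regularity test; the names `row_projection`, `split_runs`, and `min_value` are hypothetical.

```python
def row_projection(img):
    """Number of black pixels in each row of a binary image."""
    return [sum(row) for row in img]

def split_runs(profile, min_value=1):
    """Return inclusive (start, end) index pairs of consecutive
    entries whose projection value is at least `min_value`.

    Entries below `min_value` are treated as line boundaries.
    """
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= min_value and start is None:
            start = i
        elif value < min_value and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs
```

Each returned pair gives the first and last row of one candidate character line; raising `min_value` would tolerate a few stray black pixels between lines.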
[0013]
From the line image cut out by the character line cutout unit 3, the character cutout unit 4 further separates the line image into individual black pixel blocks. The width of each block (the length of its circumscribed rectangle in the line direction) is compared with that of the preceding and following blocks, and a block that is clearly wider than its neighbors is further split in the direction perpendicular to the line direction. As one method of splitting, black pixels in the line image are projected in the direction perpendicular to the line direction, and a portion with a small projection value is judged to be a character boundary. Alternatively, the line image is scanned from both the upper and lower ends in the direction perpendicular to the line direction, the character width is obtained from the positions where a black pixel is first encountered, and a portion with a small character width is judged to be a character boundary.
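The first splitting method (projection perpendicular to the line direction) might look like the sketch below for a horizontal line. The function names and the `margin` parameter, which keeps the cut away from the block edges, are assumptions added for illustration.

```python
def column_projection(img):
    """Number of black pixels in each column of a binary image."""
    return [sum(col) for col in zip(*img)]

def find_cut_column(img, margin=1):
    """Column with the smallest projection value, ignoring `margin`
    columns at each end so that both resulting pieces are non-empty.

    Intended for a block judged clearly wider than its neighbours,
    i.e. one that probably contains two touching characters.
    """
    proj = column_projection(img)
    interior = range(margin, len(proj) - margin)
    return min(interior, key=lambda x: proj[x])
```

When several columns share the minimum, this sketch returns the leftmost one; a fuller implementation would need some policy for such ties.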
[0014]
The character obtained from the character cutout unit 4 is recognized by the character recognition unit 5. The recognition method may be an existing method (such as a composite similarity method).
The character recognition result may be used as is. However, some users of the invention can make use of identifiable information (for example, place names), so the post-processing unit 6 performs word matching using the database 8 as necessary.
Finally, the result of character recognition output from the character recognition unit 5 and the post-processing unit 6 is output by the answer output unit 7 as necessary.
[0015]
[First Embodiment]
In the first embodiment, accent marks are removed in advance from accented characters (alphabetic characters) before character recognition, and word matching is performed against the database 8, which serves as a word dictionary created in advance from characters whose accent marks have been removed.
[0016]
That is, in the preprocessing unit 2 of FIG. 1, an input image such as that shown in FIG. 2 is converted into a binary image (bitmap data), and labeling is performed so that each connected component of black pixels in the binary image is treated as one block. Any block whose circumscribed rectangle has both side lengths smaller than a certain threshold is removed as "noise". Accent marks are removed at this stage.
[0017]
For example, when a character with an umlaut is written as shown in FIG. 4, the circumscribed rectangles of the black pixel blocks are as shown in FIG. 5: two circumscribed rectangles for the umlaut and one for “a” are extracted. Because both side lengths of the two umlaut rectangles are smaller than the threshold, the black pixels of the umlaut are removed from the binary image as noise.
[0018]
In addition, the database 8 stores word strings ((b) in FIG. 6) consisting of characters obtained by removing the accent marks from word strings ((a) in FIG. 6) consisting of accented characters found in language systems such as German and French. It serves as a word dictionary.
[0019]
With this configuration, the recognition operation will be described with reference to the flowchart shown in FIG. 7.
That is, a document or the like written with accented characters found in a language system such as German or French is read and scanned by the image input unit 1, and the read image is output to the preprocessing unit 2 as the input image (see FIG. 2) (ST1).
[0020]
The preprocessing unit 2 then converts the input image into a binary image and performs labeling so that each connected component of black pixels is treated as one block (ST2). At this time, as described above, the preprocessing unit 2 removes as “noise” any black pixel block whose circumscribed rectangle has both side lengths smaller than a certain threshold, and thereby also removes the accent marks.
The preprocessing unit 2 outputs the result of the labeling to the character line cutout unit 3.
[0021]
Next, the character line cutout unit 3 cuts out character lines (ST3), the character cutout unit 4 cuts out characters from each cut-out line image (ST4), and the character recognition unit 5 recognizes the cut-out characters (ST5).
The post-processing unit 6 then matches the recognition result for the plural character strings ((c) in FIG. 6) against the word strings consisting of characters with the accent marks removed ((b) in FIG. 6) stored in the database 8 (ST6).
The result that matches is output by the answer output unit 7 (ST7).
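The accent-free database and the collation step can be illustrated on text rather than images. The sketch below assumes Unicode strings and uses the standard library's `unicodedata` module to drop combining marks; keeping the original accented spellings alongside each stripped key is an extension added here for illustration and is not stated in the source.

```python
import unicodedata

def strip_accents(word):
    """Remove accent marks by decomposing to NFD and dropping
    combining characters (e.g. 'ü' -> 'u', 'ç' -> 'c')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_database(words):
    """Map each accent-stripped form to the original words it came from.

    Plays the role of a word dictionary whose entries have had their
    accent marks removed in advance.
    """
    db = {}
    for word in words:
        db.setdefault(strip_accents(word), []).append(word)
    return db

def collate(recognized, db):
    """Look up an accent-free recognition result in the database."""
    return db.get(recognized, [])
```

One consequence visible in this sketch is that words differing only by an accent collapse to the same key, so a lookup can return more than one candidate.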
[0022]
[Second embodiment]
The second embodiment addresses the following problem. When the preprocessing unit 2 removes accent marks together with noise as in the first embodiment, an accent mark may touch the alphabetic character and remain, for example in a low-resolution image. This embodiment is a countermeasure for the case where the accent mark could not be removed as noise.
[0023]
In this case, the circumscribed rectangle of each character cut out by the character cutout unit 4 is examined, and a character whose height (the side perpendicular to the line direction) is clearly larger than that of the preceding and following characters (taller by a predetermined number of pixels or more) is regarded as possibly an accented character (as a result, uppercase letters are also included). For such a character, as shown in FIGS. 8(a), 8(b), and 8(c), character recognition is repeated while shaving the upper or lower side of the circumscribed rectangle of the character separated as one character, one pixel row at a time (one row whose width corresponds to one pixel). The result is accepted as correct when the degree of matching (similarity or the like) with a reference character in the dictionary 5a is high.
[0024]
In step 5 of the flowchart of the first embodiment, when the extracted character is recognized by the character recognition unit 5, the following processing is executed.
That is, in FIG. 9, the circumscribed rectangle of each character cut out by the character cutout unit 4 is examined (ST11), and whether the character is an accented character is determined according to whether its height is clearly larger than that of the preceding and following characters (ST12).
[0025]
If the character is determined to be an accented character, the character recognition unit 5 removes pixel rows one at a time, from the top row of the circumscribed rectangle of the character separated as one character down to a predetermined pixel row, and normalizes each resulting read pattern. It calculates the degree of matching (similarity or the like) between each normalized pattern and each reference character in the dictionary 5a (ST13), and outputs the character with the highest degree of matching as the recognition result (ST14).
[0026]
If step 12 determines that the character is a normal character rather than an accented character, the character recognition unit 5 calculates the degree of matching (similarity or the like) between the normalized read pattern of the character separated as one character and each reference character in the dictionary 5a (ST15), and outputs the character with the highest degree of matching as the recognition result (ST16).
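The shave-and-retry loop of the second embodiment can be sketched with a toy recognizer. Everything in the block is an assumption made for illustration: the nearest-neighbour `normalize`, the pixel-agreement `similarity` (a stand-in for the composite similarity method mentioned earlier), and the grid size of 4.

```python
def normalize(img, size=4):
    """Resample a binary pattern to size x size by nearest neighbour."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def similarity(a, b):
    """Fraction of cells on which two equally sized patterns agree."""
    cells = [x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(cells) / len(cells)

def recognize(img, references, size=4):
    """Return (best_score, best_label) over all reference characters."""
    pattern = normalize(img, size)
    return max((similarity(pattern, normalize(ref, size)), name)
               for name, ref in references.items())

def recognize_with_shaving(img, references, max_rows):
    """Re-run recognition after removing 1..max_rows rows from the top
    and keep the best-scoring result.

    At least one row is always left so the pattern is never empty.
    """
    best = recognize(img, references)
    for k in range(1, min(max_rows, len(img) - 1) + 1):
        best = max(best, recognize(img[k:], references))
    return best
```

For a character whose accent touches its body, the score typically rises once enough top rows have been removed for the remaining pattern to resemble a reference character; shaving from the bottom would follow the same pattern with `img[:-k]`.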
[0027]
[Third Embodiment]
The third embodiment is a countermeasure for the case where the preprocessing unit 2 could not remove the accent mark together with the noise as it does in the first embodiment.
In this case, the circumscribed rectangle of each character cut out by the character cutout unit 4 is examined, and a character whose height (the side perpendicular to the line direction) is clearly larger than that of the preceding and following characters (taller by a predetermined number of pixels or more) is regarded as possibly an accented character (as a result, uppercase letters are also included). For such a character, as shown in FIG. 10, the character width is estimated at a plurality of positions of the character separated as one character, and the character is cut at the position where the estimated width is smallest. Then, as shown in FIG. 11, circumscribed rectangles are generated for the upper and lower pieces, and the character recognition unit 5 recognizes the character pattern in the larger circumscribed rectangle as the cut-out character.
[0028]
To estimate the character width, the circumscribed rectangle of the character extracted as possibly an accented character is scanned pixel by pixel (or every several pixels) from both its left and right ends toward the opposite side (see FIG. 10), and the character width is estimated from the positions at which a black pixel is encountered.
[0029]
In the case of FIG. 10, there is a gap between the bottom of the accent mark and the top of the letter “a”, so the character width is smallest there. At this position, as shown in FIG. 11, the circumscribed rectangle is separated into upper and lower rectangles, the larger of the two is identified, and character recognition is performed on the character pattern inside it.
[0030]
In step 5 of the flowchart of the first embodiment, when the extracted character is recognized by the character recognition unit 5, the following processing is executed.
That is, in FIG. 12, the circumscribed rectangle of each character cut out by the character cutout unit 4 is examined (ST21), and whether the character is an accented character is determined according to whether its height is clearly larger than that of the preceding and following characters (ST22).
[0031]
If the character is determined to be an accented character, the character recognition unit 5 estimates the character width at a plurality of positions of the character separated as one character (ST23), cuts the character at the position where the estimated width is smallest, and generates circumscribed rectangles for the upper and lower pieces as shown in FIG. 11 (ST24). It then calculates the degree of matching (similarity or the like) between the normalized character pattern of the larger circumscribed rectangle and each reference character in the dictionary 5a (ST25), and outputs the character with the highest degree of matching as the recognition result (ST26).
[0032]
If step 22 determines that the character is a normal character rather than an accented character, the character recognition unit 5 calculates the degree of matching (similarity or the like) between the normalized read pattern of the character separated as one character and each reference character in the dictionary 5a (ST27), and outputs the character with the highest degree of matching as the recognition result (ST28).
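The width-profile cut of the third embodiment might be sketched as follows. The function names are hypothetical, and two simplifications are assumed: the cut row itself is discarded, and the "larger" piece is chosen by the area of its tight bounding box.

```python
def row_widths(img):
    """For each row, the span between the first black pixel found
    scanning from the left and the first found scanning from the
    right (0 if the row has no black pixel)."""
    widths = []
    for row in img:
        cols = [x for x, black in enumerate(row) if black]
        widths.append(cols[-1] - cols[0] + 1 if cols else 0)
    return widths

def tight_box_area(piece):
    """Area of the circumscribed rectangle of the black pixels."""
    rows = [y for y, row in enumerate(piece) if any(row)]
    cols = [x for row in piece for x, black in enumerate(row) if black]
    if not rows:
        return 0
    return (rows[-1] - rows[0] + 1) * (max(cols) - min(cols) + 1)

def cut_and_keep_larger(img):
    """Cut at the narrowest interior row and return (cut_row, body),
    where body is the piece with the larger circumscribed rectangle.

    Requires at least three rows so that both pieces are non-empty.
    """
    widths = row_widths(img)
    cut = min(range(1, len(img) - 1), key=lambda y: widths[y])
    upper, lower = img[:cut], img[cut + 1:]
    body = max((upper, lower), key=tight_box_area)
    return cut, body
```

The returned body would then be normalized and matched against the reference characters, while the discarded smaller piece is a candidate accent mark.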
[0033]
[Fourth embodiment]
The fourth embodiment exploits the fact that different languages use different accent marks: the language is determined by identifying the accent marks, and recognition processing is performed using a dictionary dedicated to the determined language.
For example, because many countries in continental Europe border one another, documents there are in some cases written in more than one language.
[0034]
In this case, for example, looking at the German document shown in FIG. 13 and the French document shown in FIG. 14, it can be seen that each language has a unique accent mark.
That is, in a German document, the umlaut appears as an accent mark, as shown in FIG. 15. In a French document, the accent aigu, accent grave, accent circonflexe, tréma, and cédille appear as accent marks, as shown in FIG. 16.
[0035]
Thus, if only umlauts appear as accent marks, the document is judged to be a German document, and if the accent aigu, accent grave, accent circonflexe, tréma, cédille, etc. appear, it is judged to be a French document.
In this way, by identifying the accent mark, it is possible to estimate in what language it is written.
[0036]
Accent symbols are extracted from the noise image data and from the smaller circumscribed rectangle of the character cut in the third embodiment, and the extracted accent symbols are identified. The language is then estimated, the dictionary for that language is selected, and character recognition and word matching are performed.
[0037]
As a method of identifying accent symbols, there is a method of extracting an accent symbol image, normalizing the image size, and matching with an accent symbol image prepared in advance as a dictionary for accent symbols.
As shown in FIG. 17, the character recognition device 10 used in the fourth embodiment includes an image input unit 1, a preprocessing unit 2, a character line cutout unit 3, a character cutout unit 4, a character recognition unit 5, a post-processing unit 6, an answer output unit 7, a database 8, an accent symbol identification unit 11, and a language determination unit 12.
[0038]
Here, the character recognition unit 5 is provided with a dictionary 21 for French and a dictionary 22 for German. The database 8 is provided with a storage unit 23 for French word strings and a storage unit 24 for German word strings.
The accent symbol identification unit 11 is provided with an accent symbol dictionary 25 in which a reference image for each accent symbol to be identified is stored.
[0039]
The accent symbol identification unit 11 extracts accent symbol images based on the shape and size of the blocks of black pixels removed as noise, which are registered as noise image data in the noise image data table 2c of the preprocessing unit 2, and on their position information in the entire binary image. It then normalizes the image size and identifies each accent symbol by matching it against the accent symbol images prepared in advance in the accent symbol dictionary 25. The accent symbol identification result is output to the language determination unit 12.
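The normalize-and-match step can be illustrated with a minimal Python sketch. It assumes binary images given as lists of rows, nearest-neighbour resampling to a fixed grid, and the fraction of agreeing pixels as the similarity measure. These choices, the function names, and the reference images are hypothetical, since the patent does not specify them; the sketch also ignores the position information that the device uses.

```python
def normalize(img, size=8):
    """Nearest-neighbour resample of a binary image to size x size."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def similarity(a, b):
    """Fraction of pixels on which two equally sized binary images agree."""
    total = len(a) * len(a[0])
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def identify_accent(blob, references, size=8):
    """Return (best_label, score) for a candidate accent image."""
    norm = normalize(blob, size)
    scores = {name: similarity(norm, normalize(ref, size))
              for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical reference images standing in for accent symbol dictionary 25.
REFERENCES = {
    "umlaut": [[1, 0, 1]],
    "aigu": [[0, 1], [1, 0]],
    "grave": [[1, 0], [0, 1]],
}
```

A real dictionary would hold larger reference images, and a rejection threshold on the score would be needed so that ordinary noise is not labelled as an accent.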
[0040]
The language determination unit 12 determines the language of the entire input image based on the accent symbol identification results from the accent symbol identification unit 11. The language determination result is output to the character recognition unit 5 and the post-processing unit 6.
[0041]
For example, when only umlauts appear as accent marks, the language determination unit 12 determines that the document is a German document; when at least one of the accent aigu, accent grave, accent circonflexe, cédille, etc. appears, it determines that the document is a French document.
Note that a French document may also contain the tréma, a French accent mark written in the same way as the umlaut.
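The determination rule can be sketched in Python as follows. The label strings and the function name `determine_language` are hypothetical. Because the tréma is written like the umlaut, the sketch treats a two-dot mark as compatible with either language and lets any French-only mark decide.

```python
# Marks that occur in French but not in German (hypothetical labels).
FRENCH_ONLY = {"aigu", "grave", "circonflexe", "cedille"}
# Two-dot mark: umlaut in German, tréma in French (same shape).
TWO_DOTS = {"umlaut"}


def determine_language(accents):
    """Estimate the document language from the identified accent labels."""
    found = set(accents)
    if found & FRENCH_ONLY:
        return "French"   # at least one French-only mark was seen
    if found and found <= TWO_DOTS:
        return "German"   # only two-dot marks were seen
    return None           # no accent marks: cannot decide from accents alone
```

For example, `determine_language(["umlaut", "aigu"])` yields `"French"`, since the two-dot mark is then read as a tréma.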
[0042]
The recognition operation in this configuration will be described with reference to the flowcharts shown in FIGS. 18 and 19.
That is, a document or the like written using accented characters found in a language system such as German or French is scanned by the image input unit 1, and the read image is output to the preprocessing unit 2 as the input image (see FIG. 2) (ST31).
[0043]
The preprocessing unit 2 then converts the input image into a binary image and performs labeling, in which each connected component of black pixels is treated as one block (ST32). At this time, as described above, the preprocessing unit 2 removes as "noise" any block of black pixels whose circumscribed rectangle has both sides shorter than a certain threshold.
Noise image data that retains the shape and size of each removed block of black pixels and its position information in the entire binary image is registered in the noise image data table 2c (ST33).
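The labeling and noise registration steps (ST32 to ST33) can be sketched in Python as follows. The sketch assumes 8-connectivity and a single threshold for both sides of the circumscribed rectangle; the patent does not state either detail, and the names `label_components` and `split_noise` are hypothetical.

```python
from collections import deque


def label_components(img):
    """Return the pixel lists of the 8-connected black components of img."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def split_noise(img, threshold):
    """Separate blocks into kept blocks and a noise table (shape + position)."""
    kept, noise_table = [], []
    for pixels in label_components(img):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        box = (min(ys), min(xs), max(ys), max(xs))
        height, width = box[2] - box[0] + 1, box[3] - box[1] + 1
        entry = {"box": box, "pixels": pixels}
        # Noise: both sides of the circumscribed rectangle are below the threshold.
        if height < threshold and width < threshold:
            noise_table.append(entry)
        else:
            kept.append(entry)
    return kept, noise_table
```

Keeping the pixel list and the rectangle for each removed block is what allows accent marks to be recovered from the noise table later.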
[0044]
The accent symbol identification unit 11 then extracts accent symbol images based on the shape and size of the removed blocks of black pixels, registered as noise image data in the noise image data table 2c of the preprocessing unit 2, and on their position information in the entire binary image. It normalizes the image size and identifies each accent symbol by matching it against the accent symbol images prepared in advance in the accent symbol dictionary 25 (ST34).
The accent symbol identification result is output to the language determination unit 12.
[0045]
When the accent symbol identification results from the accent symbol identification unit 11 for the entire input image contain only umlauts, the language determination unit 12 determines that the document is a German document; when they contain at least one of the accent aigu, accent grave, accent circonflexe, cédille, etc., it determines that the document is a French document (ST35).
This determination result is output to the character recognition unit 5 and the post-processing unit 6.
[0046]
The character recognition unit 5 selects the dictionary 22 for German when the determination result from the language determination unit 12 is a German document, and selects the dictionary 21 for French when it is a French document (ST36).
The post-processing unit 6 selects the storage unit 24 for German word strings in the database 8 when the determination result from the language determination unit 12 is a German document, and selects the storage unit 23 for French word strings when it is a French document (ST37).
Further, the preprocessing unit 2 outputs the result of the labeling to the character line cutout unit 3.
[0047]
Next, the character line cutout unit 3 cuts out a character line (ST38), and the character cutout unit 4 cuts out characters from the cut-out line image (ST39). The character recognition unit 5 recognizes each character using the selected dictionary 21 for French or dictionary 22 for German (ST40).
[0048]
The post-processing unit 6 then collates the recognized character strings with the selected storage unit 23 for French word strings or storage unit 24 for German word strings in the database 8 (ST41).
The matching result obtained by this matching is output from the answer output unit 7 (ST42).
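The selection of a language-specific word list followed by word matching (ST37, ST41) can be sketched in Python as follows. The patent does not say how a recognized string is collated with the stored word strings; this sketch substitutes approximate string matching from the standard library's `difflib`, which is an assumption. The word lists and the names `WORD_STRINGS` and `match_word` are hypothetical.

```python
import difflib

# Hypothetical contents standing in for storage units 23 (French) and 24 (German).
WORD_STRINGS = {
    "French": ["école", "garçon", "fenêtre"],
    "German": ["Bahnhof", "Straße", "Mädchen"],
}


def match_word(recognized, word_list, cutoff=0.6):
    """Return the closest word in word_list, or None if nothing is close enough."""
    matches = difflib.get_close_matches(recognized, word_list, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With the German list selected, a recognition result containing one misread character, such as `"Bahnh0f"`, would still be matched to `"Bahnhof"`.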
[0049]
[Fifth Embodiment]
In the fifth embodiment, when an accent mark is attached to a cursive character string in which the characters are connected, as is common in handwriting, the positions of the connected characters are estimated from the position of the accent mark.
In the character cutout unit 4 of the first embodiment, as shown in FIG. 20, when an accent mark is extracted, the character width is checked in both the left and right directions starting from the accent mark position S. The portions where the character width changes abruptly are taken as character boundaries, and the character is cut out there.
[0050]
To estimate the character width, the circumscribed rectangle of a character string extracted as possibly connected is scanned from its upper and lower ends, one pixel or several pixels at a time (see FIG. 20), and the positions at which the scan hits a black pixel are obtained.
In FIG. 20, the dotted lines mark the portions where the character width changes, found by starting from the accent mark position. By dividing the string at the dotted lines, the handwritten characters “n”, “i”, “c”, and “e” are each detected and cut out.
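A minimal Python sketch of this cut is given below. It assumes the accent mark has already been removed from the image and that S is given as a column index. The "character width" is taken as the vertical extent of each column found by scanning from the top and bottom edges, and an "abrupt" change is a difference of at least `jump` pixels between neighbouring columns. The names and the threshold are hypothetical.

```python
def column_extents(img):
    """Vertical extent of each column between the first black pixel met
    from the top edge and the first black pixel met from the bottom edge."""
    h = len(img)
    extents = []
    for x in range(len(img[0])):
        ys = [y for y in range(h) if img[y][x]]
        extents.append(ys[-1] - ys[0] + 1 if ys else 0)
    return extents


def cut_around_accent(img, s, jump=2):
    """Walk left and right from column s until the extent changes abruptly.

    Returns the inclusive column range (left, right) of the accented character.
    """
    ext = column_extents(img)
    left = s
    while left > 0 and abs(ext[left - 1] - ext[left]) < jump:
        left -= 1
    right = s
    while right < len(ext) - 1 and abs(ext[right + 1] - ext[right]) < jump:
        right += 1
    return left, right
```

In a connected string, the thin joining strokes between letters have a much smaller extent than the letter bodies, so the walk stops at them.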
[0051]
[Sixth Embodiment]
In the sixth embodiment, a handwritten accent symbol image is registered and used for identifying accent symbols from the next time onward. This improves the recognition rate for handwritten accent symbols written by the same person. It is particularly effective when the apparatus is used by only a few people, because the various writing habits of each specific person can be registered.
[0052]
That is, when the same person writes by hand, usually with the same writing tool, the habits of the person and the characteristics of the tool appear in the way the accent marks are written. As shown in FIG. 21 and FIG. 22, when an umlaut or accent aigu has been successfully extracted as an accent symbol, the accent symbol image is stored in the storage device 50. When an accent symbol is extracted from the next time onward, its identity is checked by matching it against the stored accent symbol image, which increases the accuracy rate.
[0053]
As matching methods, there is pattern matching after normalizing the image size. For accent symbols consisting mostly of straight lines, there is also a method of storing values such as the length and angle of the contour and the length and angle of the core line, and comparing those values.
For example, in the case of the accent circonflexe, as shown in FIG. 23, the size and shape, such as the length L of the contour and the inclination θ of the contour, are stored.
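Storing and comparing the length L and inclination θ of a straight stroke can be sketched in Python as follows. The sketch assumes each stroke is summarized by its two end points, and the tolerance values are arbitrary; the class `StrokeFeature` and both functions are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class StrokeFeature:
    length: float  # L, in pixels
    angle: float   # theta, in degrees


def stroke_feature(p0, p1):
    """Length and inclination of the straight stroke from p0 to p1 (x, y)."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    return StrokeFeature(math.hypot(dx, dy), math.degrees(math.atan2(dy, dx)))


def matches_stored(stored, observed, length_tol=0.25, angle_tol=15.0):
    """Compare an observed stroke with a stored one.

    length_tol is relative to the stored length; angle_tol is in degrees.
    Angle wrap-around at +/-180 degrees is not handled in this sketch.
    """
    length_ok = abs(stored.length - observed.length) <= length_tol * stored.length
    angle_ok = abs(stored.angle - observed.angle) <= angle_tol
    return length_ok and angle_ok
```

An accent circonflexe would be stored as two such strokes, one for each side of the peak.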
[0054]
[Seventh embodiment]
In the seventh embodiment, considering that umlauts are used as accent marks in German and that they can instead be written according to substitution rules, the word dictionary used is switched between recognizing a sentence written with umlauts and recognizing a sentence written in the substitute notation (substitute characters) instead of umlauts.
[0055]
In other words, when recognizing a sentence written with umlauts, word matching is performed using a word dictionary in which words containing umlauted characters are registered. When recognizing a sentence written in the substitute notation, word matching is performed using a word dictionary in which those words are registered with the umlauted characters replaced by substitute characters without umlauts.
[0056]
Which of the two dictionaries to use can be decided from the result of determining, based on the noise image data, whether or not accent symbols are present (see the fourth embodiment).
As a result, the accuracy of word matching can be improved.
[0057]
The substitution rules for writing German without umlauts are given in [Equation 1].
Examples of German words containing umlauts are given in [Equation 2].
The same words written without umlauts are given in [Equation 3].
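Building the two word dictionaries and choosing between them can be sketched in Python as follows. The patent's own substitution table is not reproduced in this text, so the sketch assumes the conventional German transliteration (ä to ae, ö to oe, ü to ue), which is consistent with the claim's description of two vowels replacing an umlauted vowel. The word list and all function names are hypothetical.

```python
# Conventional German transliteration, assumed here.
SUBSTITUTES = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def to_substitute(word):
    """Rewrite a word with substitute characters in place of umlauted ones."""
    return "".join(SUBSTITUTES.get(ch, ch) for ch in word)


def build_dictionaries(words):
    """Return (dictionary with umlauts, dictionary with substitute characters)."""
    with_umlaut = set(words)
    with_substitute = {to_substitute(w) for w in words}
    return with_umlaut, with_substitute


def select_dictionary(umlaut_found, with_umlaut, with_substitute):
    """Pick the word dictionary according to whether umlauts were detected."""
    return with_umlaut if umlaut_found else with_substitute
```

For instance, `to_substitute("München")` gives `"Muenchen"`, so a document written without umlauts is matched against `"Muenchen"` rather than `"München"`.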
[0058]
As shown in FIG. 24, the character recognition device 10 used in the seventh embodiment includes an image input unit 1, a preprocessing unit 2, a character line cutout unit 3, a character cutout unit 4, a character recognition unit 5, a post-processing unit 6, an answer output unit 7, a database 8, an accent symbol identification unit 11, and a determination unit 13.
[0059]
Here, the character recognition unit 5 is provided with a German dictionary 31 that includes umlauted characters and a German dictionary 32 that does not include umlauted characters. The database 8 is provided with a storage unit (word dictionary) 33 for word strings including umlauted characters and a storage unit (word dictionary) 34 for word strings including substitute characters in place of umlauted characters.
The accent symbol identification unit 11 is provided with an accent symbol (umlaut) dictionary 25 in which a reference image for each accent symbol to be identified is stored.
[0060]
The accent symbol identification unit 11 extracts accent symbol images based on the shape and size of the blocks of black pixels removed as noise, which are registered as noise image data in the noise image data table 2c of the preprocessing unit 2, and on their position information in the entire binary image. It then normalizes the image size and identifies each accent symbol by matching it against the accent symbol images prepared in advance in the accent symbol dictionary 25. The accent symbol identification result is output to the post-processing unit 6.
[0061]
When the accent symbol identification unit 11 reports that an umlaut is present, the post-processing unit 6 performs word matching against the word dictionary 33 in the database 8, which contains words with umlauted characters. When no umlaut is reported, it performs word matching against the word dictionary 34, in which the umlauted characters of those words are replaced with substitute characters.
[0062]
The recognition operation in this configuration will be described with reference to the flowcharts shown in FIGS. 25 and 26.
That is, a document written using accented (umlauted) characters found in the German language system, or a document written using substitute characters instead of accented characters, is scanned by the image input unit 1, and the read image is output to the preprocessing unit 2 as the input image (see FIG. 2) (ST51).
[0063]
The preprocessing unit 2 then converts the input image into a binary image and performs labeling, in which each connected component of black pixels is treated as one block (ST52). At this time, as described above, the preprocessing unit 2 removes as "noise" any block of black pixels whose circumscribed rectangle has both sides shorter than a certain threshold.
Noise image data that retains, unchanged, the shape and size of each removed block of black pixels and its position information in the entire binary image is registered in the noise image data table 2c (ST53).
[0064]
The accent symbol identification unit 11 then extracts accent symbol images based on the shape and size of the removed blocks of black pixels, registered as noise image data in the noise image data table 2c of the preprocessing unit 2, and on their position information in the entire binary image. It normalizes the image size and identifies each accent symbol by matching it against the accent symbol (umlaut) images prepared in advance in the accent symbol dictionary 25 (ST54).
[0065]
The result of the accent symbol identification by the accent symbol identification unit 11 is output to the determination unit 13.
When the accent symbol identification results from the accent symbol identification unit 11 for the entire input image contain umlauts, the determination unit 13 determines that the document is a German document written using accented (umlauted) characters. When they contain no umlauts, it determines that the document is a German document written using substitute characters instead of accented (umlauted) characters (ST55).
[0066]
This determination result is output to the character recognition unit 5 and the post-processing unit 6.
When the determination result from the determination unit 13 is a German document written using accented (umlauted) characters, the character recognition unit 5 selects the dictionary 31 that includes umlauted characters. When the determination result is a German document written using substitute characters instead of accented (umlauted) characters, it selects the dictionary 32 that does not include umlauted characters (ST56).
[0067]
When the determination result from the determination unit 13 is a German document written using accented (umlauted) characters, the post-processing unit 6 selects the word dictionary 33 of word strings including umlauted characters. When the determination result is a German document written using substitute characters instead of accented (umlauted) characters, it selects the word dictionary 34 of word strings including substitute characters in place of umlauted characters (ST57).
Further, the preprocessing unit 2 outputs the result of the labeling to the character line cutout unit 3.
[0068]
Next, the character line cutout unit 3 cuts out a character line (ST58), and the character cutout unit 4 cuts out characters from the cut-out line image (ST59). The character recognition unit 5 recognizes each character using the selected German dictionary 31 or German dictionary 32 (ST60).
The post-processing unit 6 then collates the recognized character strings with the selected word dictionary in the database 8, that is, the word dictionary 33 of word strings including umlauted characters or the word dictionary 34 of word strings including substitute characters in place of umlauted characters (ST61).
[0069]
The matching result obtained by this matching is output from the answer output unit 7 (ST62).
Thus, in the first embodiment, misreading can be reduced by removing in advance the accent marks that cause misreading, performing recognition, and then performing word matching using a database from which accent marks have likewise been removed.
[0070]
Further, in the fourth embodiment, it is possible to determine the language based on the type of the deleted accent mark and select a dictionary to be used.
Further, in the fifth embodiment, it is possible to specify a place where a character is cut out from the position of an accent mark.
In the sixth embodiment, the shape of the accent symbol is stored and used for subsequent accent symbol identification.
As described above, the recognition rate can be improved by actively using accent marks.
[0071]
[Effect of the Invention]
As described in detail above, according to the present invention, it is possible to provide a character recognition device that, in recognizing accented characters found in language systems such as German and French, can perform character recognition without misreading even when the character string contains accent marks.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a character recognition device for describing an embodiment of the present invention.
FIG. 2 is a diagram illustrating an input image.
FIG. 3 is a diagram for explaining noise image data.
FIG. 4 is a diagram for explaining characters in which umlauts are described.
FIG. 5 is a diagram for explaining a circumscribed rectangle to be extracted.
FIG. 6 is a diagram for explaining a word string including a character with an accent mark, a word string including characters without an accent mark, and a word string to be compared.
FIG. 7 is a flowchart illustrating a recognition operation.
FIG. 8 is a view for explaining a character pattern when the upper side of a circumscribed rectangle of a character separated as one character is cut by one pixel row.
FIG. 9 is a flowchart for explaining character recognition processing for each character.
FIG. 10 is a view for explaining processing when estimating the character width of each character.
FIG. 11 is a view for explaining a state where a circumscribed rectangle is divided into upper and lower parts at a position where the character width of each character is the smallest.
FIG. 12 is a flowchart for explaining a character recognition process for each character.
FIG. 13 is a diagram showing an example of a German document.
FIG. 14 is a diagram showing an example of a French document.
FIG. 15 is a diagram showing accent marks in a German document.
FIG. 16 is a diagram showing accent marks in a French document.
FIG. 17 is a block diagram illustrating a schematic configuration of a character recognition device.
FIG. 18 is a flowchart illustrating a recognition operation.
FIG. 19 is a flowchart illustrating a recognition operation.
FIG. 20 is a diagram showing a cursive character string in which characters are connected.
FIG. 21 is a diagram showing a storage example of accent marks.
FIG. 22 is a diagram showing a storage example of accent symbols.
FIG. 23 is a diagram showing a storage example of accent symbols.
FIG. 24 is a block diagram illustrating a schematic configuration of a character recognition device.
FIG. 25 is a flowchart illustrating a recognition operation.
FIG. 26 is a flowchart illustrating a recognition operation.
[Explanation of symbols]
1 ... Image input unit, 2 ... Preprocessing unit, 3 ... Character line cutout unit, 4 ... Character cutout unit, 5 ... Character recognition unit, 5a ... Dictionary, 6 ... Post-processing unit, 7 ... Answer output unit, 8 ... Database, 10 ... Character recognition device, 11 ... Accent symbol identification unit.

Claims

A character recognition device for recognizing accented characters found in language systems such as German and French, wherein
accent marks are removed from the accented characters in advance before character recognition is performed, and
word matching is performed by collating the result against a database of word strings created using characters from which the accent marks have been removed in advance.

A character recognition device for recognizing accented characters found in language systems such as German and French,
characterized in that, when character recognition is performed in a language system having accented characters, character recognition is repeated while a line one pixel wide is cut, one line at a time, from the top or bottom of the circumscribed rectangle of a character separated as one character.

A character recognition device characterized in that, when character recognition is performed in a language system having accented characters, the character width is estimated by scanning pixels inward from the left and right ends of the circumscribed rectangle of a character separated as one character and obtaining the pixel positions regarded as part of the character, the character is cut at a portion where the obtained character width is small, and character recognition is performed on the larger of the cut character elements.

A character recognition device characterized in that accent marks are extracted from accented characters and identified, the language of the sentence in which the accented characters are written is thereby estimated, and a character dictionary of the estimated language is selected to perform character recognition.

A character recognition device characterized in that accent marks are extracted from a cursive character string in which the characters are connected, as used in handwriting, and characters are cut out from the cursive character string based on the positions of the accent marks to perform character recognition.

A character recognition device for handwritten character recognition, characterized in that the shape of the accent mark of a previously detected accented character is stored and used to identify the shapes of accent marks detected thereafter.

A character recognition device for German-language character recognition, characterized in that, when it is determined that there are no umlauts in the document, a word dictionary is used in which each umlauted vowel is converted into the corresponding two vowels.

In a character recognition device for recognizing a word including a character to which an accent mark is added,
Conversion means for converting the supplied image into a binary image;
Character line cutout means for cutting out a character line by projecting black pixels in a first direction, of mutually orthogonal first and second directions, from the binary image converted by the conversion means;
Character separating means for projecting black pixels, from the line image in the first direction cut out by the character line cutout means, in the second direction perpendicular to the first direction, determining portions where the obtained projection value is small to be boundaries between characters, and separating the line image into blocks of black pixels;
Recognizing means for recognizing characters separated by the separating means;
Extracting means for extracting a word string from the character recognized by the recognition means;
Output means for outputting, as a word, a character string that matches when the word string extracted by the extraction means is compared with a database of character strings from which accent marks have been removed;
A character recognition device comprising:

A character recognition device for recognizing accented characters found in language systems such as German and French,
wherein accent symbols are extracted from the accented characters, the extracted accent symbols are identified, the language of the sentence in which the accented characters are written is thereby estimated, and a character dictionary of the estimated language is selected to perform character recognition.

In recognition of accented characters,
Extracting means for extracting accent marks from accented characters;
Identification means for identifying accent marks extracted by the extraction means;
Determining means for determining the language from the accent marks identified by the identifying means;
A character recognition device characterized by selecting a character dictionary of a language determined by the determination means and performing character recognition.

In a character recognition device that recognizes characters from either a document composed of words containing accented characters found in a language system such as German, or a document composed of words in which the accented characters are written with substitute characters,
A first word dictionary comprising a word string containing words containing accented characters;
A second word dictionary including a word string including characters in which accented characters are represented by alternative characters;
Determining means for determining the presence or absence of accent marks in the document;
Recognizing means for recognizing characters of the document;
Extracting means for extracting a word string from the character recognized by the recognition means;
Output means for outputting, as a word, a character string that matches when the word string extracted by the extraction means is compared with the first word dictionary, when the determination means determines that the document contains accent marks, and for outputting, as a word, a character string that matches when the word string extracted by the extraction means is compared with the second word dictionary, when the determination means determines that the document contains no accent marks;
A character recognition device comprising:

The apparatus of claim 11, wherein the accent mark is an umlaut.