JP4304920B2

JP4304920B2 - Character string recognition device and its program

Info

Publication number: JP4304920B2
Application number: JP2002166462A
Authority: JP
Inventors: 健永崎; 勝美丸川; 広新庄
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-06-07
Filing date: 2002-06-07
Publication date: 2009-07-29
Anticipated expiration: 2022-06-07
Also published as: JP2004013548A

Description

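[0001]
[Technical Field of the Invention]
The present invention relates to a character string recognition method, a character string recognition device, and a character string recognition processing program for recognizing character strings entered on a form.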
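[0002]
[Prior Art]
Characters entered on a form, which is a fixed-format document, are generally recognized as follows: the ruled-line information of each frame to be read is registered in a dictionary in advance, the ruled lines extracted from the input image are matched against the registered information, and the characters inside the target frame are segmented and recognized (see, for example, JP H10-49602 A). A frame generally contains both characters printed on the form in advance (hereinafter, pre-printed characters) and characters printed afterwards by a printer or the like (hereinafter, printed characters). The reading target is the printed characters. Known methods for reading a character string in which pre-printed and printed characters coexist include reading both kinds of characters together using character string notation knowledge (JP 2001-126010 A), and removing the pre-printed characters using pre-registered information on their placement within the frame so that only the printed characters are read (JP 2000-207488 A).
However, when pre-printed and printed characters coexist in the same frame, two characters may overlap and form a single character pattern (the overlapping character problem), and the methods above cannot handle this.
The overlapping character problem can be addressed by applying character segmentation (described in JP H08-243506 A and elsewhere) to cut the pre-printed and printed characters apart beforehand, and then reading the string using character string notation knowledge. Adjusting the parameters of the segmentation process makes it possible to split overlapping characters into finer patterns.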
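[0003]
[Problems to Be Solved by the Invention]
However, the character segmentation described above relies on local information such as the size of the pattern to be cut and its pixel projection profile. When parameters tuned for overlapping characters are used, correctly formed character patterns are also cut excessively, which is the over-segmentation problem. Handling overlapping characters through segmentation parameters alone therefore lowers the overall accuracy of character string reading.
An object of the present invention is to provide a character string recognition method, a device for it, and a character string recognition processing program for reading character strings in which pre-printed characters (characters printed on the form in advance) and printed characters (characters printed or added afterwards by a printer or the like) coexist. When a pre-printed character and a printed character overlap into a single pattern, the method reads the printed character by removing the pre-printed character from the overlapping character, based on character string notation knowledge that distinguishes pre-printed from printed characters and using information from the surrounding pre-printed characters.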
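[0004]
[Means for Solving the Problems]
To solve the above problems, a representative invention disclosed in this application is, in outline, a character string recognition device comprising: means for reading a form image; storage means for storing character string notation knowledge that contains information distinguishing pre-printed characters printed on the form in advance from printed characters added afterwards; means for determining, using the character string notation knowledge, whether the form image contains a portion where a pre-printed character and a printed character overlap; and means for performing overlapping character recognition on that portion. Further, the means for performing overlapping character recognition applies several types of overlapping character recognition processing to each portion judged by the determining means to be an overlapping portion, compares the resulting recognition results, and outputs the selected result.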
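[0005]
[Embodiments of the Invention]
An embodiment of the present invention is described with reference to the drawings. FIG. 1 shows the processing flow from input of the form to be read up to invocation of the overlapping character processing unit. FIG. 2 is a conceptual diagram of example character lines to be read and the corresponding character string notation knowledge. FIG. 3 is a conceptual diagram of the processing that builds a candidate character network from the character line to be read and fixes a character string path using character string notation knowledge. FIG. 4 shows the flow that detects overlapping characters in a character string path, processes them, and fixes a new character string path. FIG. 5 is a table classifying overlapping characters by line thickness and degree of overlap. FIG. 6 is a conceptual diagram of thick-line pattern extraction using morphology. FIG. 7 is a conceptual diagram of pre-printed character removal using rectangle decomposition. FIG. 8 is a conceptual diagram of pre-printed character removal using pattern matching. FIG. 9 is an example configuration of a form recognition device incorporating the character string recognition method of the present invention. The embodiment is described in detail below with reference to these figures.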
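FIG. 1 is described in detail. The form recognition device of this embodiment first captures an image of the form to be read and converts it into electronic data (101). This step can be omitted when the form is already electronic data. Next, form analysis is performed on the electronic form image: ruled-line extraction, frame structure analysis, and, using those results, estimation of the position of the frames to be read (102). The techniques described in JP H06-052156 A, JP H09-305701 A, JP 2000-251012 A and elsewhere can be used for this. Next, based on the form analysis result, the character line image to be read is extracted (103). Next, character pattern candidates are segmented from the character line image and each pattern is identified (104). The segmented patterns together with their identification results are called the candidate character network. Next, candidate character string paths (each a pair of a character code sequence and a character pattern sequence) are output from the candidate character network using character string notation knowledge (105). Examples of character string notation knowledge are given in FIG. 2(c) and (d). Character string notation knowledge describes, as a context-free grammar (or a regular grammar), how the target text may be written. For example, the notation knowledge in FIG. 2(d) represents text in which "＊" is written first, followed by a word such as "血液化学検査" (blood chemistry test) or "処方箋料" (prescription fee), then by a number such as "１" or "１２", and finally by a word such as "項目" (items) or "点" (points). Next, the likelihood of each resulting character string path is evaluated from the character identification likelihoods, the arrangement of the character patterns, the likelihood of the notation knowledge applied, and so on, and the paths are ranked (106). If a character string path is judged to contain an overlapping character at this point, the overlapping character processing unit is invoked. Thus, in the present invention, even when pre-printed and printed characters overlap, multiple segmentation hypotheses are generated to separate them, the string is read using character string notation knowledge, overlapping characters that segmentation could not separate are detected, and re-segmentation, re-identification, and re-application of knowledge processing are carried out on those overlapping characters. The overlapping character problem arises frequently when documents are produced by printing onto pre-printed fixed-format forms. Most methods proposed so far treat it as a segmentation problem and respond by improving the segmentation process. The present invention instead detects overlapping characters using character string notation knowledge and handles them as a post-process, so overlapping characters can be read without side effects such as over-segmentation on the existing string reading process.
The overlapping character processing unit (107) takes the overlapping characters in a character string path, erases the pre-printed portion, identifies the printed character, and re-applies knowledge processing to recompute the character string path.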
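FIG. 2 is described in detail. FIG. 2 is a conceptual diagram showing example character lines to be read (a, b) and the corresponding character string notation knowledge (c, d). In the notation knowledge, the printed character part to be read (201) and the pre-printed character part (202) are described separately. Here, "N" in 201 denotes a digit string. The brackets ("[ ]") in 202 mean that this part may be written as "コード", "医コ", or "医コ：".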
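As an illustration of notation knowledge that tags each segment as pre-printed or printed, here is a minimal Python sketch. It assumes the knowledge can be written as a regular grammar and encodes it as a regular expression; the names `KNOWLEDGE`, `compile_knowledge`, and `parse` are hypothetical and do not come from the patent.

```python
import re

# Hypothetical notation knowledge for a line such as "医コ1234".
# Each segment is tagged "pre" (pre-printed) or "print" (printed, the reading target).
KNOWLEDGE = [
    ("pre", r"(?:コード|医コ：|医コ)"),  # allowed spellings of the pre-printed label
    ("print", r"[0-9]+"),                # "N": a digit string
]

def compile_knowledge(segments):
    """Join the segments into one regex with a named group per segment."""
    parts = [f"(?P<seg{i}>{pat})" for i, (_, pat) in enumerate(segments)]
    return re.compile("".join(parts))

def parse(text, segments=KNOWLEDGE):
    """Return [(kind, matched_text), ...] or None if the text violates the knowledge."""
    m = compile_knowledge(segments).fullmatch(text)
    if m is None:
        return None
    return [(kind, m.group(f"seg{i}")) for i, (kind, _) in enumerate(segments)]
```

Because every matched span carries its tag, later steps can tell which patterns in a reading are expected to be pre-printed and which are printed.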
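FIG. 3 is described in detail. FIG. 3 is a conceptual diagram of the processing performed by the knowledge processing unit (105). First, portions of the character line to be read (a) that appear to be character patterns are cut out in various ways, each candidate pattern is identified, and a candidate character network is built (b). The candidate character network holds at least the following: the character patterns, the ranked character codes obtained by character identification, the likelihood of each character code, and the connection relations between character patterns within the network. Next, a character string path (d) is computed from the candidate character network using the character string notation knowledge (c). A character code here means the code corresponding to a character concept such as "あ" (in the JIS X 0208 standard, for example, the hexadecimal code 2422). In this document, a sequence of character codes is called a character string, and a sequence of printed or handwritten character patterns, when specifically meant, is called a character pattern string.
A character string path holds a character string, the character pattern corresponding to each character code in the string, and the connection relations of those character patterns within the candidate character network. The character string notation knowledge is a database independent of the character string recognition program and is stored in an external storage device or the like. It can be described with a trie, a context-free grammar, and so on (see JP 2001-014311 A and elsewhere). A feature of this patent is that pre-printed and printed characters are read together while being described separately in the notation knowledge.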
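The relationship between the candidate character network and a character string path can be sketched as follows. This is an illustration under assumptions, not the patent's algorithm: the network is modelled as a list of edges between cut positions, every path is enumerated exhaustively (the text mentions dynamic programming or grammar parsing instead), and the score is a plain product of likelihoods. The names `enumerate_paths`, `best_path`, and `accepts` are hypothetical.

```python
from math import prod  # Python 3.8+

def enumerate_paths(edges, start, goal):
    """Yield (string, score) for every path through the candidate network.

    edges: list of (start_pos, end_pos, [(char, likelihood), ...]) with end_pos > start_pos.
    """
    by_start = {}
    for s, e, cands in edges:
        by_start.setdefault(s, []).append((e, cands))

    def walk(pos, chars, likes):
        if pos == goal:
            yield "".join(chars), prod(likes)
            return
        for end, cands in by_start.get(pos, []):
            for ch, like in cands:
                yield from walk(end, chars + [ch], likes + [like])

    yield from walk(start, [], [])

def best_path(edges, start, goal, accepts):
    """Return (score, string) of the best path that the notation knowledge accepts."""
    scored = [(score, text)
              for text, score in enumerate_paths(edges, start, goal)
              if accepts(text)]
    return max(scored) if scored else None
```

Here `accepts` stands in for the notation knowledge: any predicate that says whether a candidate string is allowed.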
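In ordinary knowledge processing the character string path is fixed by step 105. If the character string verification step 106 judges that an overlapping character is present, however, the path is passed to the overlapping character processing 107 for reprocessing. Step 107 is described in detail with FIG. 4. The overlapping character check in 106 is applied to every character string path listed as a candidate in 105. Whether a path contains an overlapping character is decided by scanning all character patterns in the path and comparing the size of each pattern's bounding rectangle and its identification likelihood against predetermined thresholds. For example, let the path be the character pattern sequence {P_1, P_2, ..., P_n}; for a pattern P_k let C_k be the corresponding character code, L_k its identification likelihood, and H_k and W_k the height and width of its bounding rectangle; and let TL_k, TH_k, and TW_k be the predetermined likelihood, height, and width thresholds. The presence of an overlapping character can then be judged by the condition (L_k < TL_k and (H_k > TH_k or W_k > TW_k)). The threshold values are determined experimentally from the average height, average width, and average identification likelihood of the pre-printed characters, among other statistics. Because step 106 uses this simple logic, the paths to be passed to the overlapping character processing unit (107) are narrowed down at low computational cost. The overlapping character processing unit then performs a detailed check and localization of overlapping characters on the narrowed-down paths (step 401, described later).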
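The screening condition above translates almost directly into code. The sketch below is a minimal Python rendering; it assumes, for brevity, one shared set of thresholds rather than the per-pattern thresholds TL_k, TH_k, TW_k of the text, and the `Pattern` class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    code: str          # C_k: character code assigned by identification
    likelihood: float  # L_k: identification likelihood
    height: int        # H_k: bounding-rectangle height
    width: int         # W_k: bounding-rectangle width

def may_overlap(p, tl, th, tw):
    """L_k < TL and (H_k > TH or W_k > TW)."""
    return p.likelihood < tl and (p.height > th or p.width > tw)

def path_needs_overlap_processing(path, tl, th, tw):
    """True if any pattern in the character string path passes the coarse test."""
    return any(may_overlap(p, tl, th, tw) for p in path)
```

The test is deliberately cheap: it only decides which paths are worth handing to the more expensive overlapping character processing.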
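FIG. 4 is described in detail. This process receives from the upper-level module a character string path that appears to contain overlapping characters, locates the overlapping portion, removes the pre-printed character, re-identifies the remaining printed character, and computes a more accurate character string path by re-matching against the knowledge. First, patterns in which a pre-printed and a printed character appear to overlap are estimated from the path, and the overlapping character candidates are listed (401). Next, information such as the position and size of the pre-printed character inside the overlapping portion is estimated from the pre-printed characters that were read normally outside the overlapping portion (402). Next, image processing and character identification are applied to the overlapping character candidates (403). This removes the pre-printed character, extracts the printed character to be read, and identifies it again. Next, knowledge matching is performed again on the result (404), which reads the printed character extracted from the overlapping character and computes the character string path. Finally, the character string path is verified using the character identification likelihoods, the character pattern layout, the likelihood of the notation knowledge used, and so on; its likelihood is evaluated, and its priority and whether to reject it are decided (405).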
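Step 403 in FIG. 4 is described further. Three image processing methods are applied in parallel to each pattern estimated to be an overlapping character in order to extract the printed character, and character identification is performed on each extraction result (406, 407, 408). Finally, the three identification results obtained from steps 406 to 408 are ranked by comparing their likelihoods, and the ranking is taken as the identification result for the overlapping character (409). Step 406 is called thick-line hypothesis processing: it extracts the thick-line pattern from the overlapping character so that only the printed character remains. Step 407 is called rectangle hypothesis processing: it judges how the pre-printed and printed characters overlap from rectangle information and removes the rectangle believed to be the pre-printed character. Step 408 is called pattern matching processing: it locates the pre-printed and printed characters by character pattern matching and removes the pre-printed character. The finer steps 410 to 418 are described below.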
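The method of estimating overlapping character candidates in 401 is described. A character string path is computed from the candidate character network built from the character line to be read and from the character string notation knowledge. Concrete algorithms include dynamic programming using the notation knowledge and context-free grammar parsing (see JP 2001-014311 A and elsewhere). In addition, because pre-printed and printed characters are distinguished in the knowledge, each pattern in the path can be classified as one or the other.
Taking the character string in FIG. 2(a) as an example, knowledge processing may read it as, among others:
(1) "平成１１年５月分県番１？コ１２、３３３４、５"
(2) "平成１１年５月分県番１３コ１２、３３３４、５"
(3) "平成１１年５月分県番１医コ１２、３３３４、５"
Here "？" denotes a character pattern that character identification could not read (an unreadable character).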
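In case (1), for example, the pattern corresponding to "？" is known to be unreadable. A pattern may be unreadable because the original pattern is faint, because it is an overlapping character pattern, because noise is mixed in, and so on. Of these, the case in which an overlapping character pattern caused the "？" can be inferred from information such as position and size comparison with other pre-printed characters, position and size comparison with other printed characters (preferably of the same type, for example digits with digits), and the number of connected components in the pattern. This is because a faint pattern shows no large change in size and does not spread in width (for horizontal writing) the way an overlapping character does, and because examining the number of connected components allows faintness to be judged with reasonable confidence. The same applies when noise is mixed in.
In case (2), the overlapping character pattern was read as "３". Here too, comparing the width, height, position, and identification likelihood among the printed characters in the string (digits, in this case) makes it possible to judge that "３" is a forced reading of an overlapping pattern. Likewise in case (3), the overlapping pattern was read as "医", and comparing the width, height, position, and identification likelihood among the pre-printed characters in the string makes it possible to judge that "医" results from forcing a reading of the overlapping character.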
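Concrete ways to build the overlapping character determiner include a neural network classifier that takes as input the width ratio, height ratio, identification likelihood, and so on among pre-printed characters or among printed characters; a linear or quadratic discriminant function; or a logic-based determiner. Determiners based on neural networks or on linear or quadratic discriminant functions require training on a large amount of data but can be highly accurate. A logic-based determiner has the advantage that, by incorporating heuristics, it can be accurate even with a relatively small amount of data.
The processing of 401 is also used when the character string verification unit 106 judges whether a character string path contains an overlapping character. When 106 judges that the string contains an overlapping character, control moves to overlapping character processing (107 in FIG. 1, and FIG. 4).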
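Pre-print information estimation (402) is described. This step estimates the position, size, and so on of the pre-printed character inside an overlapping character from the pre-printed characters near it. Taking the string in FIG. 2(a) as an example, "医" is the overlapping character. First, the position and size of "コ", which belongs to the same group as "医", are referenced. If "コ" is unreadable, the position and size of the pre-printed character are estimated from the nearest other characters, for example "県" or "番". For instance, the height and vertical position can be estimated from the fact that the height of pre-printed characters and their vertical position within the line are nearly constant; and if the code of the pre-printed character contained in the overlapping character is known, it can be estimated whether that character lies on the left or the right side of the overlapping pattern. The estimated position and size obtained here are used in later steps. Thus, in the present invention, even for patterns that conventional processing found difficult to segment, the first-stage string reading uses notation knowledge that distinguishes pre-printed from printed characters, so information such as the size and position of normal character patterns and the position where the pre-printed text should appear is easily computed, and using this information improves the accuracy of overlapping character discrimination.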
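A minimal sketch of this estimation step follows. It assumes boxes are given as (left, top, width, height) tuples, uses the median of the neighbouring pre-printed characters as the "nearly constant" size and vertical position, and takes the left or right placement as an input derived from the notation knowledge. The function name and parameters are hypothetical.

```python
from statistics import median

def estimate_preprint_box(overlap_box, neighbor_boxes, preprint_on_left):
    """Estimate where the pre-printed character sits inside an overlapping pattern.

    overlap_box:      (left, top, width, height) of the overlapping pattern
    neighbor_boxes:   boxes of nearby pre-printed characters that were read normally
    preprint_on_left: True if the knowledge places the pre-printed character
                      on the left side of the overlap, False for the right side
    """
    left, top, width, height = overlap_box
    est_top = median(b[1] for b in neighbor_boxes)
    est_w = min(median(b[2] for b in neighbor_boxes), width)
    est_h = median(b[3] for b in neighbor_boxes)
    est_left = left if preprint_on_left else left + width - est_w
    return (est_left, est_top, est_w, est_h)
```

The returned box is the kind of estimate that the later rectangle decomposition and pattern matching steps rely on to decide which part of the overlap to erase.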
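The three character extraction and identification processes performed in 403 (406 to 408) are described. As shown in FIG. 5, overlapping characters are classified by line thickness and degree of contact. 501 is the case where the strokes of the printed character are considerably thicker than those of the pre-printed character, 502 the case where they are somewhat thicker, and 503 the case where they are equal or thinner. 504 is the case where the printed and pre-printed characters touch at their left or right edges, 505 the case where they touch with one pattern partly inside the other, and 506 the case where the two overlap completely. Applying the three extraction methods to these cases covers the cases in the masked region shown as 507. For the cases in 508, the policy is to reject the identification on the basis of the identification likelihood and the like. As FIG. 5 shows, overlapping characters fall into several classes by form, and an effective extraction method exists for each form. Thick-line extraction works well when the pre-printed and printed characters differ in stroke thickness. Bounding-rectangle decomposition works well when the pre-printed and printed characters are printed at different heights. Character pattern matching plays a complementary role when the other two are difficult to apply. The present invention applies these extraction processes in parallel to each pattern estimated to be an overlapping character, and thereby handles a wide variety of overlapping character cases.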
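The first overlapping character process uses extraction and identification of thick-line patterns (406). It is effective in the cases in FIG. 5 where the printed character is printed considerably thicker than the pre-printed character. Step 406 consists of three stages: thick-line pattern extraction (410), verification of the size and other properties of the extracted pattern (411), and character identification (412).
Step 410 uses morphology (described in "モルフォロジー", Corona Publishing, ISBN4-339-00664-5, and elsewhere). Morphology is a system of image processing operations that realizes thickening, thinning, smoothing, skeletonization, and similar operations by applying operations such as the Minkowski sum and Minkowski difference between two images, the target image A and a structuring element B.
FIG. 6 is a schematic diagram of morphological operations on a binary image. In this example the morphological opening is used with a 3×3-dot structuring element (b) to extract from the target image A (a) the patterns that are at least 3 dots thick (c). Structuring elements can be defined in many other ways, so thicker or thinner patterns can also be extracted selectively. As in this figure, opening is generally used for thick-line extraction. Opening is defined as follows.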
Opening: (A − B^s) + B
where "+" denotes the Minkowski sum of images, "−" the Minkowski difference of images, and B^s the reflection of image B about the origin.
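To make the opening concrete, the following Python sketch implements it on a binary image represented as a set of black-pixel coordinates. It is an illustration only: it assumes a symmetric structuring element (so B^s = B) and uses the usual erosion-then-dilation form; the helper names and the tiny test image are hypothetical.

```python
# 3x3 structuring element centred on the origin (symmetric, so B^s equals B).
SE_3X3 = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

def erode(pixels, se):
    """Keep a pixel only if the structuring element placed on it fits inside the image."""
    return {(x, y) for (x, y) in pixels
            if all((x + dx, y + dy) in pixels for dx, dy in se)}

def dilate(pixels, se):
    """Minkowski sum of the image and the structuring element."""
    return {(x + dx, y + dy) for (x, y) in pixels for dx, dy in se}

def opening(pixels, se):
    """Erosion followed by dilation: strokes thinner than the element disappear."""
    return dilate(erode(pixels, se), se)
```

For example, opening a 4×4 block joined to a one-pixel-wide line with `SE_3X3` leaves only the block, which mirrors how a thick printed stroke survives while a thin pre-printed stroke is removed.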
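[0006]
Thick-line extraction alone may also extract noise regions outside the thick digit characters that are actually wanted. After extraction, the pattern is therefore verified, for example by removing isolated regions that are smaller than the other printed characters by a threshold factor α or more (411). Character identification is then performed (412).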
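The second overlapping character process uses rectangle decomposition and identification of the character pattern (407). As shown in FIG. 5, it targets cases where the printed character (a digit here) has about the same stroke thickness as the pre-printed character but the characters do not overlap completely, the two characters differ in size, or they are printed at offset positions. Step 407 consists of three stages: rectangle decomposition to extract the printed character pattern (413), verification of the size and other properties of the extracted pattern (414), and character identification (415).
The rectangle decomposition in 413 exploits the fact that an overlap of character patterns that differ in size or are offset in position can be represented as an overlap of rectangles. When an overlapping character can be represented as two overlapping rectangles, the pre-printed character position and size estimated in 402 indicate which rectangle corresponds to the pre-printed character, so the pixels inside that rectangle can be erased and the printed character pattern extracted. FIG. 7 illustrates this. It shows the digit "４" overlapping the pre-printed character "コ". Rectangle decomposition represents the overlap of the patterns as an overlap of two rectangles: 701 is the rectangle for the digit "４" and 702 the rectangle for the pre-printed character "コ". The earlier position estimate identifies which of the two is the pre-printed character, and erasing the image inside that rectangle yields the digit pattern to be read.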
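Rectangle decomposition of an overlapping character proceeds as follows. First, the bounding rectangle of the overlapping character is found. Next, white-region rectangles are grown from each of its four corners. To keep isolated black regions from interfering, connected black regions below a certain size are deleted beforehand. The white regions found at the four corners are then normalized under the following conditions. For example, the sizes of the upper and lower white regions on the right side are normalized as follows. Let the size of the upper-right white region be (sx1, sy1) and that of the lower-right white region be (sx2, sy2). Each value is then rewritten by the following procedure.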

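The same is done for the white regions at the left edge. This normalization essentially sets the widths (sx1, sx2) of the upper and lower white regions at the right (or left) edge to the smaller of the two. However, to guard against spurious white regions caused by noise, white regions whose height is below a threshold β are ignored. Besides the method described here, rectangle decomposition can also be performed with morphological operations. Using the size of the pre-printed character pattern estimated beforehand, a structuring element 1 dot wide and as tall as the pre-printed character is prepared, and a closing is performed with it. This fills the small gaps inside the character with black pixels, so that the outer contour is approximated by a rectangle. Closing is defined as follows.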
Closing: (A + B^s) − B
where "+" denotes the Minkowski sum of images, "−" the Minkowski difference of images, and B^s the reflection of image B about the origin.
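The closing with a tall, one-dot-wide structuring element can be sketched the same way. This Python fragment is an illustration under assumptions: black pixels are stored as a set of coordinates, the element is symmetric about the origin (height 2k+1), and `vertical_se` is a hypothetical helper for building it from an estimated character height.

```python
def erode(pixels, se):
    """Keep a pixel only if the structuring element placed on it fits inside the image."""
    return {(x, y) for (x, y) in pixels
            if all((x + dx, y + dy) in pixels for dx, dy in se)}

def dilate(pixels, se):
    """Minkowski sum of the image and the structuring element."""
    return {(x + dx, y + dy) for (x, y) in pixels for dx, dy in se}

def closing(pixels, se):
    """Dilation followed by erosion: gaps smaller than the element are filled."""
    return erode(dilate(pixels, se), se)

def vertical_se(half_height):
    """1-dot-wide vertical element of height 2*half_height + 1, centred on the origin."""
    return [(0, dy) for dy in range(-half_height, half_height + 1)]
```

Applied column by column, the closing bridges vertical gaps between strokes, which is what lets the character's outline be treated as a solid rectangle.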
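[0007]
These processes extract the printed character. Pattern verification (414) and character identification (415) are then applied to the extracted character; they are the same as 411 and 412.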
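The third overlapping character process uses matching and identification of character patterns (408). FIG. 8 is a conceptual diagram of character pattern shape matching. In FIG. 8, the target figure (a) is matched against a pre-printed character pattern prepared in advance (b), and the matching portion is erased as the pre-printed character (c). In (c), the result of matching shows the printed character portion as 801 and the pre-printed character pattern as 802. As the matching algorithm, an XY-axis independent matching method using dynamic programming or the bending warp search method (屈曲ワープサーチ法) is used (see "Recognition of Handwritten Digits Using Template and Model Matching," Pattern Recognition, vol.24, no.5, pp.421-431, and elsewhere). Step 408 consists of three stages: locating and removing the pre-printed character by pattern matching (416), verification of the size and other properties of the extracted pattern (417), and character identification (418). Steps 417 and 418 are the same as 411 and 412, respectively.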
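The idea of locating the pre-printed pattern and erasing it can be shown with a much simpler matcher than the ones named above. The Python sketch below swaps in rigid template matching by exhaustive search over a list of offsets; it is not the dynamic-programming or warp-search matching of the text, and the function names are hypothetical. Images and templates are sets of black-pixel coordinates.

```python
def best_offset(image, template, search):
    """Return the offset in `search` at which the template covers the most black pixels."""
    def score(offset):
        ox, oy = offset
        return sum((x + ox, y + oy) in image for x, y in template)
    return max(search, key=score)

def remove_preprint(image, template, search):
    """Split the image into (printed part, pre-printed part) at the best template position."""
    ox, oy = best_offset(image, template, search)
    placed = {(x + ox, y + oy) for x, y in template}
    return image - placed, image & placed
```

Note that pixels shared by both characters are erased together with the pre-printed character in this sketch, so the remaining printed pattern may have small breaks where the two truly overlap.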
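The three processes above remove the pre-printed character, leave the printed character to be read (a digit here), and identify the pattern resulting from each process. The identification results are then ranked (409). Results are basically chosen in descending order of identification likelihood. However, for certain patterns (such as the digit "１"), errors such as reading noise that happens to remain after pre-printed character removal are common, so the identification likelihood is lowered for these patterns. Finally, knowledge processing is applied to the re-recognized overlapping character patterns to compute the character string path (404), which is verified to decide the final path (405). Thus the present invention estimates overlapping characters using notation knowledge, applies multiple character extraction and identification processes in parallel to each estimated pattern, selects the best of the identification results, and thereby achieves reading that is robust against segmentation errors, noise, and the like.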
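The ranking rule in step 409 can be sketched as below. The penalty factor and the set of penalized codes are assumptions made for illustration; the text only says that the likelihood of certain patterns such as "１" is lowered, not by how much. The function name `rank_results` is hypothetical.

```python
def rank_results(results, penalized=("1",), penalty=0.8):
    """Rank identification results from the three hypotheses.

    results: list of (hypothesis_name, code, likelihood)
    Codes in `penalized` have their likelihood multiplied by `penalty`
    (an assumed value) before sorting in descending order.
    """
    adjusted = [
        (name, code, like * penalty if code in penalized else like)
        for name, code, like in results
    ]
    return sorted(adjusted, key=lambda r: r[2], reverse=True)
```

With this rule, a "1" read from leftover noise needs a clearly higher raw likelihood than the other hypotheses before it is chosen.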
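Finally, FIG. 9 is described. It shows an example configuration of a form reading device incorporating the character string recognition device of the present invention. The device converts a form into electronic data with an image input device (901), stores the data in an external storage device (904) and memory (905), and performs the character string recognition of the present invention by having the central processing unit (906) read a program for executing the invention from a storage medium or the like. The required character string notation knowledge, form reading position definitions, and so on are stored in the external storage device (904). An operator can control this processing through an operation terminal (902), results are displayed on a display terminal (903), and data can be exchanged with external devices through a communication device (907).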
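As described above, this application discloses a character string recognition method for recognizing character strings entered on a form containing pre-printed characters. In an environment where characters printed in advance (pre-printed characters) and characters printed on the form afterwards by a printer or the like (printed characters) are mixed, the basic processing reads both together using multiple segmentation hypotheses and character string notation knowledge. The method has a function for detecting patterns in which pre-printed and printed characters overlap (overlapping characters). For these it removes the pre-printed character and extracts the printed character using techniques such as thick-line extraction, rectangle decomposition, and pattern matching, and then re-identifies the character and re-matches against the knowledge, so that the printed characters are read correctly.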
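[0008]
[Effects of the Invention]
As explained above, the character string recognition device of the present invention can recognize character strings even when pre-printed and printed characters overlap: it detects overlapping characters using notation knowledge that distinguishes pre-printed from printed characters, extracts the printed character from each overlapping character using several processes, and performs character re-identification and knowledge re-matching.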
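[Brief Description of the Drawings]
[FIG. 1] Flow diagram of character string recognition including overlapping character processing.
[FIG. 2] Overview of character lines to be read and character string notation knowledge.
[FIG. 3] Conceptual diagram of character string reading using character string notation knowledge.
[FIG. 4] Flow diagram of overlapping character processing.
[FIG. 5] Classification of overlapping character patterns.
[FIG. 6] Conceptual diagram of thick-line extraction by morphology.
[FIG. 7] Conceptual diagram of pre-printed character removal by rectangle decomposition.
[FIG. 8] Conceptual diagram of pre-printed character removal by pattern matching.
[FIG. 9] Example configuration of a form recognition device incorporating the overlapping character string reading function.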
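[Explanation of Reference Signs]
101 ... Image input unit
102 ... Form analysis unit
103 ... Target line extraction unit
104 ... Character pattern candidate generation unit
105 ... Knowledge processing unit
106 ... Character string verification unit
107 ... Overlapping character processing unit
201 ... Printed character part in the character string notation knowledge
202 ... Pre-printed character part in the character string notation knowledge
401 ... Process that lists overlapping character candidates
402 ... Process that estimates information on the pre-printed character in an overlapping character
403 ... Loop over all overlapping character candidates
404 ... Knowledge processing after overlapping character processing
405 ... Character string verification
406 ... Thick-line hypothesis processing
407 ... Rectangle hypothesis processing
408 ... Pattern matching processing
409 ... Identification result selection
410 ... Printed character extraction by thick-line pattern extraction
411 ... Pattern verification
412 ... Character identification
413 ... Printed character extraction by rectangle decomposition
414 ... Pattern verification
415 ... Character identification
416 ... Printed character extraction by pattern matching
417 ... Pattern verification
418 ... Character identification
501 ... Case where the printed character's strokes are considerably thicker than the pre-printed character's
502 ... Case where the printed character's strokes are somewhat thicker than the pre-printed character's
503 ... Case where the printed character's strokes are as thick as or thinner than the pre-printed character's
504 ... Case where the printed and pre-printed characters touch weakly
505 ... Case where the printed and pre-printed characters touch strongly
506 ... Case where the printed and pre-printed characters overlap completely
507 ... Cases treated as reading targets in the present invention
508 ... Cases rejected in the present invention
701 ... Portion judged to be the printed character rectangle in rectangle decomposition
702 ... Portion judged to be the pre-printed character rectangle in rectangle decomposition
801 ... Portion judged to be the printed character in character pattern matching
802 ... Portion judged to be the pre-printed character in character pattern matching
901 ... Image input device
902 ... Operation terminal
903 ... Display terminal
904 ... External storage device
905 ... Memory
906 ... Central processing unit
907 ... Communication device
[0001]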
[Technical Field of the Invention]
The present invention relates to a character string recognition method, a device, and a character string recognition processing program for recognizing a character string entered in a form.
[0002]
[Prior art]
Characters entered on a form, which is a fixed-format document, are generally recognized as follows: the ruled-line information of each frame to be read is registered in a dictionary in advance, the ruled lines extracted from the input image are matched against the registered information, and the characters inside the target frame are segmented and recognized (see, for example, JP H10-49602 A). A frame generally contains both characters printed on the form in advance (hereinafter, pre-printed characters) and characters printed afterwards by a printer or the like (hereinafter, printed characters). The reading target is the printed characters. Known methods for reading a character string in which pre-printed and printed characters coexist include reading both kinds of characters together using character string notation knowledge (JP 2001-126010 A), and removing the pre-printed characters using pre-registered information on their placement within the frame so that only the printed characters are read (JP 2000-207488 A).
However, when pre-printed characters and printed characters coexist in the same frame, there is a problem that two characters overlap to form one character pattern (overlapping character problem), but the above-described method cannot cope with this problem.
The overlapping character problem can be addressed by applying character segmentation (described in JP H08-243506 A and elsewhere) to cut the pre-printed and printed characters apart beforehand, and then reading the string using character string notation knowledge. Adjusting the parameters of the segmentation process makes it possible to split overlapping characters into finer patterns.
[0003]
[Problems to be solved by the invention]
However, the character segmentation described above relies on local information such as the size of the pattern to be cut and its pixel projection profile. When parameters tuned for overlapping characters are used, correctly formed character patterns are also cut excessively, which is the over-segmentation problem. Handling overlapping characters through segmentation parameters alone therefore lowers the overall accuracy of character string reading.
An object of the present invention is to provide a character string recognition method, a device for it, and a character string recognition processing program for reading character strings in which pre-printed characters (characters printed on the form in advance) and printed characters (characters printed or added afterwards by a printer or the like) coexist. When a pre-printed character and a printed character overlap into a single pattern, the method reads the printed character by removing the pre-printed character from the overlapping character, based on character string notation knowledge that distinguishes pre-printed from printed characters and using information from the surrounding pre-printed characters.
[0004]
[Means for Solving the Problems]
To solve the above problems, a representative invention disclosed in this application is, in outline, a character string recognition device comprising: means for reading a form image; storage means for storing character string notation knowledge that contains information distinguishing pre-printed characters printed on the form in advance from printed characters added afterwards; means for determining, using the character string notation knowledge, whether the form image contains a portion where a pre-printed character and a printed character overlap; and means for performing overlapping character recognition on that portion. Further, the means for performing overlapping character recognition applies several types of overlapping character recognition processing to each portion judged by the determining means to be an overlapping portion, compares the resulting recognition results, and outputs the selected result.
[0005]
[Embodiments of the Invention]
An embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows a processing flow from the input of a form to be read until the overlapping character processing unit is activated. FIG. 2 is a conceptual diagram of an example of a character line to be read and knowledge of character string notation corresponding thereto. FIG. 3 is a conceptual diagram of a process from creating a candidate character network from a character line to be read to determining a character string path using character string notation knowledge. FIG. 4 is a flow of processing from detection of overlapping characters in a character string path to processing and determination of a new character string path. FIG. 5 is a table in which overlapping characters are classified according to line thickness and overlapping state. FIG. 6 is a conceptual diagram of thick line pattern extraction processing using morphology. FIG. 7 is a conceptual diagram of preprinted character removal processing using rectangular decomposition. FIG. 8 is a conceptual diagram of preprinted character removal processing using pattern matching. FIG. 9 is a structural example of a form recognition apparatus incorporating the character string recognition method according to the present invention. Next, an embodiment of the invention will be described in detail with reference to these drawings.
FIG. 1 will be described in detail. The form recognition device of this embodiment first captures an image of the form to be read and converts it into electronic data (101). This step can be omitted when the form is already electronic data. Next, form analysis is performed on the electronic form image: ruled-line extraction, frame structure analysis, and, using those results, estimation of the position of the frames to be read (102). The techniques described in JP H06-052156 A, JP H09-305701 A, JP 2000-251012 A and elsewhere can be used for this. Next, based on the form analysis result, the character line image to be read is extracted (103). Next, character pattern candidates are segmented from the character line image and each pattern is identified (104). The segmented patterns together with their identification results are called the candidate character network. Next, candidate character string paths (each a pair of a character code sequence and a character pattern sequence) are output from the candidate character network using character string notation knowledge (105). Examples of character string notation knowledge are given in FIG. 2(c) and (d). Character string notation knowledge describes, as a context-free grammar (or a regular grammar), how the target text may be written. For example, the notation knowledge in FIG. 2(d) represents text in which "＊" is written first, followed by a word such as "血液化学検査" (blood chemistry test) or "処方箋料" (prescription fee), then by a number such as "１" or "１２", and finally by a word such as "項目" (items) or "点" (points). Next, the likelihood of each resulting character string path is evaluated from the character identification likelihoods, the arrangement of the character patterns, the likelihood of the notation knowledge applied, and so on, and the paths are ranked (106).
If a character string path is judged to contain an overlapping character at this point, the overlapping character processing unit is invoked. Thus, in the present invention, even when pre-printed and printed characters overlap, multiple segmentation hypotheses are generated to separate them, the string is read using character string notation knowledge, overlapping characters that segmentation could not separate are detected, and re-segmentation, re-identification, and re-application of knowledge processing are carried out on those overlapping characters. The overlapping character problem arises frequently when documents are produced by printing onto pre-printed fixed-format forms. Most methods proposed so far treat it as a segmentation problem and respond by improving the segmentation process. The present invention instead detects overlapping characters using character string notation knowledge and handles them as a post-process, so overlapping characters can be read without side effects such as over-segmentation on the existing string reading process.
The overlapping character processing unit (107) takes the overlapping characters in a character string path, erases the pre-printed portion, identifies the printed character, and re-applies knowledge processing to recompute the character string path.
FIG. 2 will be described in detail. FIG. 2 is a conceptual diagram showing example character lines to be read (a, b) and the corresponding character string notation knowledge (c, d). In the notation knowledge, the printed character part to be read (201) and the pre-printed character part (202) are described separately. Here, "N" in 201 denotes a digit string. The brackets ("[ ]") in 202 mean that this part may be written as "コード", "医コ", or "医コ：".
FIG. 3 will be described in detail. FIG. 3 is a conceptual diagram of the processing performed by the knowledge processing unit (105). First, portions of the character line to be read (a) that appear to be character patterns are cut out in various ways, each candidate pattern is identified, and a candidate character network is built (b). The candidate character network holds at least the following: the character patterns, the ranked character codes obtained by character identification, the likelihood of each character code, and the connection relations between character patterns within the network. Next, a character string path (d) is computed from the candidate character network using the character string notation knowledge (c). A character code here means the code corresponding to a character concept such as "あ" (in the JIS X 0208 standard, for example, the hexadecimal code 2422). In this document, a sequence of character codes is called a character string, and a sequence of printed or handwritten character patterns, when specifically meant, is called a character pattern string.
A character string path holds a character string, the character pattern corresponding to each character code in the string, and the connection relations of those character patterns within the candidate character network. The character string notation knowledge is a database independent of the character string recognition program and is stored in an external storage device or the like. It can be described with a trie, a context-free grammar, and so on (see JP 2001-014311 A and elsewhere). A feature of this patent is that pre-printed and printed characters are read together while being described separately in the notation knowledge.
In ordinary knowledge processing the character string path is fixed by step 105. If the character string verification step 106 judges that an overlapping character is present, however, the path is passed to the overlapping character processing 107 for reprocessing. Step 107 is described in detail with FIG. 4. The overlapping character check in 106 is applied to every character string path listed as a candidate in 105. Whether a path contains an overlapping character is decided by scanning all character patterns in the path and comparing the size of each pattern's bounding rectangle and its identification likelihood against predetermined thresholds. For example, let the path be the character pattern sequence {P_1, P_2, ..., P_n}; for a pattern P_k let C_k be the corresponding character code, L_k its identification likelihood, and H_k and W_k the height and width of its bounding rectangle; and let TL_k, TH_k, and TW_k be the predetermined likelihood, height, and width thresholds. The presence of an overlapping character can then be judged by the condition (L_k < TL_k and (H_k > TH_k or W_k > TW_k)). The threshold values are determined experimentally from the average height, average width, and average identification likelihood of the pre-printed characters, among other statistics. Because step 106 uses this simple logic, the paths to be passed to the overlapping character processing unit (107) are narrowed down at low computational cost.
The overlapping character processing unit performs detailed determination and position specification of the overlapping characters on the narrowed character string path group (processing 401, which will be described later).
FIG. 4 will be described in detail. This process receives from the upper-level module a character string path that appears to contain overlapping characters, locates the overlapping portion, removes the pre-printed character, re-identifies the remaining printed character, and computes a more accurate character string path by re-matching against the knowledge. First, patterns in which a pre-printed and a printed character appear to overlap are estimated from the path, and the overlapping character candidates are listed (401). Next, information such as the position and size of the pre-printed character inside the overlapping portion is estimated from the pre-printed characters that were read normally outside the overlapping portion (402). Next, image processing and character identification are applied to the overlapping character candidates (403). This removes the pre-printed character, extracts the printed character to be read, and identifies it again. Next, knowledge matching is performed again on the result (404), which reads the printed character extracted from the overlapping character and computes the character string path. Finally, the character string path is verified using the character identification likelihoods, the character pattern layout, the likelihood of the notation knowledge used, and so on; its likelihood is evaluated, and its priority and whether to reject it are decided (405).
The process of FIG. 4 403 will be further described in detail. For the pattern estimated as an overlapped character, three image processes are applied in parallel to extract the printed character, and character identification is performed for each extraction result (406, 407, 408). Finally, the identification results of the three printed characters obtained by the above processing (406 to 408) are ranked and determined as the character identification results of the overlapping characters (409). The likelihood of each character identification is compared to determine the rank of the identification character code (409). Reference numeral 406 denotes a thick line hypothesis process. This is a process for extracting a printed character by extracting a thick line pattern from an overlapped character. Reference numeral 407 denotes a rectangular hypothesis process. This is a process of discriminating the degree of overlap between the pre-printed characters and the print characters from the rectangle information and removing the rectangle that seems to be the pre-printed characters. Reference numeral 408 denotes a pattern matching process. This is processing for specifying a preprint character position and a print character position using character pattern matching and removing the preprint character. Further detailed processing 410 to 418 will be described in the following description.
A method of estimating 401 overlapping character candidates will be described. The character string path is calculated by using a candidate character network created from the character line to be read and knowledge of character string notation. Specific algorithms include dynamic programming using notation knowledge or context-free grammar analysis (described in Japanese Patent Laid-Open No. 2001-014311). Furthermore, since the pre-printed character and the printed character are classified based on knowledge, it can be determined whether each pattern in the character string path is either of them.
Taking the character string in FIG. 2(a) as an example, knowledge processing may read it as, among others:
(1) "平成１１年５月分県番１？コ１２、３３３４、５"
(2) "平成１１年５月分県番１３コ１２、３３３４、５"
(3) "平成１１年５月分県番１医コ１２、３３３４、５"
Here "？" denotes a character pattern that character identification could not read (an unreadable character).
For example, in the case of (1), it can be seen that the character pattern corresponding to “?” In the character string is unread. Possible causes of non-reading of the character pattern include the case where the original character pattern cannot be identified by rubbing, the case where it cannot be identified because it is an overlapping character pattern, or the case where it cannot be identified because of noise. Of these cases, in case of unread "?" Due to overlapping character pattern, position / size comparison with other pre-printed characters or position / size comparison with other print characters (same type By using information such as comparison, for example, numbers are preferable, and the number of connected components in the character pattern, it can be estimated that the overlapping character pattern may be the cause. This is because there is no large variation in the size of the character pattern if it is caused by rubbing, and it does not spread horizontally (in the case of horizontal writing) like overlapping characters. Also, if the number of connected components is examined, it can be calculated with a certain degree of reliability whether or not the friction is the cause. The same applies to cases where noise is mixed.
In case (2), the overlapping character pattern is read as "3". In this case too, by comparing the width, height, position, and character identification likelihood among the printed characters (numerals here) in the character string, it can be determined that "3" is the result of forcibly reading an overlapping character pattern. Similarly, in case (3), the overlapping character pattern is read as "medicine"; by comparing the width, height, position, and character identification likelihood among the pre-printed characters in the character string, it can be determined that "medicine" is the result of forcibly reading the overlapping characters.
As a specific method for determining overlapping characters, a determiner can be constructed that takes as input the width ratio and height ratio relative to the pre-printed characters or printed characters, the character identification likelihood, and so on. It may be built with a neural network, a linear discriminant function, or a quadratic discriminant function, or as a logic-based determiner. A determiner based on a neural network, a linear discriminant function, or a quadratic discriminant function can achieve high accuracy, although it requires training on a large amount of data. A logic-based determiner has the advantage that, by incorporating heuristics, accurate determination is possible even with a relatively small amount of data.
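A logic-based determiner of the kind mentioned here might look like the following minimal sketch. The `Pattern` fields, the function name `is_overlap_candidate`, and the threshold values 1.3 and 0.5 are illustrative assumptions and are not taken from this specification.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Pattern:
    width: float       # width of the circumscribed rectangle, in dots
    height: float      # height of the circumscribed rectangle, in dots
    likelihood: float  # character identification likelihood in [0, 1]


def is_overlap_candidate(target, same_type_neighbors,
                         width_ratio_max=1.3, likelihood_min=0.5):
    """Heuristic rule; both thresholds are hypothetical values."""
    # Reference width: typical width of characters of the same type
    # (for example, the other numerals in the character string).
    ref_width = median(p.width for p in same_type_neighbors)
    width_ratio = target.width / ref_width
    # An overlapping pattern spreads horizontally (horizontal writing)
    # and tends to be identified with a low likelihood.
    return width_ratio > width_ratio_max and target.likelihood < likelihood_min
```

A trained classifier could replace the fixed rule when enough labeled data is available; the rule form is shown because it needs no training data.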
Process 401 is also used when the character string verification unit 106 determines whether overlapping characters exist in the character string path. If unit 106 determines that overlapping characters exist in the character string, processing proceeds to the overlapping character processing (FIG. 1 and FIG. 4).
Next, the pre-print information estimation (402) is described. This process estimates the position, size, and so on of the pre-printed character inside an overlapping character from the pre-printed characters near it. Taking the character string shown in FIG. 2 (a) as an example, "medicine" is the overlapping character. First, the position and size of "ko", which belongs to the same group as "medicine", are referred to. If "ko" is unread, the position and size of the pre-printed character are estimated from the closest of the other characters, for example "prefecture" and "number". For example, because the height of pre-printed characters and their vertical position in the character line are almost constant, the height and vertical placement can be estimated; and because the code of the pre-printed character contained in the overlapping character is known, it can be estimated, for example, whether that character lies on the left or the right side of the overlapping character. The estimated position and size of the pre-printed character in the overlapping character pattern are used in the later processing. Thus, in the present invention, the first-stage character string reading uses character string notation knowledge that distinguishes pre-printed characters from printed characters, so that even for a pattern that conventional processing had difficulty cutting out, information such as the size and position of normally read patterns and the position where the pre-printed character should be is easily obtained. Using this information improves the accuracy of identifying overlapping characters.
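The estimation in 402 can be sketched as follows, assuming that boxes are given as (left, top, right, bottom) tuples in dots and that the side on which the pre-printed character lies is already known from the notation knowledge. The function name `estimate_preprint_box` and the box representation are hypothetical.

```python
def estimate_preprint_box(overlap_box, neighbor_boxes, side):
    """Estimate the pre-printed character box inside an overlapping pattern.

    overlap_box    : (left, top, right, bottom) of the overlapping pattern
    neighbor_boxes : boxes of pre-printed characters that were read normally
    side           : "left" or "right", where the pre-printed character lies
    """
    # Use the pre-printed character closest to the overlapping pattern.
    cx = (overlap_box[0] + overlap_box[2]) / 2
    nearest = min(neighbor_boxes, key=lambda b: abs((b[0] + b[2]) / 2 - cx))
    width = nearest[2] - nearest[0]
    # Height and vertical position are almost constant along the line,
    # so they are copied from the nearest pre-printed character.
    top, bottom = nearest[1], nearest[3]
    left, _, right, _ = overlap_box
    if side == "left":
        return (left, top, min(right, left + width), bottom)
    return (max(left, right - width), top, right, bottom)
```

Copying from the single nearest character follows the description; averaging over several neighbors would be an alternative when individual boxes are noisy.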
The three types of character extraction and identification processing (406 to 408) performed in 403 are described next. As shown in FIG. 5, overlapping characters are classified by line thickness and degree of contact. 501 is the case where the line of the printed character is sufficiently thicker than that of the pre-printed character, 502 is the case where it is slightly thicker, and 503 is the case where it is equal to or thinner than that of the pre-printed character. 504 is the case where the printed character and the pre-printed character touch at their left and right edges, 505 is the case where one pattern slightly intrudes into the other, and 506 is the case where the two overlap completely. The three kinds of character extraction handle the shaded cases indicated by 507. For the cases indicated by 508, the policy is to reject the character at identification time based on the likelihood of the identification result. Overlapping character cases can thus be classified into several types by form, and an effective character extraction process exists for each form. The thick line extraction process works well when the pre-printed and printed characters differ in line thickness. The circumscribed rectangle decomposition process works well when the pre-printed and printed characters are printed at different heights. The character pattern matching process works complementarily where these processes are difficult to apply.
In the present invention, a plurality of character extraction processes are applied in parallel to a pattern estimated to be an overlapping character, thereby realizing processing that can handle various overlapping character cases.
The first overlapping character process is extraction and identification of thick line patterns (406). It is effective in the case of FIG. 5 where the printed characters are printed sufficiently thicker than the pre-printed characters (case 501). Process 406 consists of three steps: a thick line pattern is extracted (410), the size of the extracted pattern is verified (411), and finally character identification is performed (412).
Process 410 uses morphology (described in "Morphology", Corona, ISBN4-339-00664-5, etc.). Morphology is a system of image processing operations that realizes processing such as thickening, shrinking, smoothing, and thinning of an image by applying operations such as the Minkowski sum and the Minkowski difference between two images, the image to be processed A and the structuring element B.
FIG. 6 shows a schematic diagram of a morphological operation on a binary image. In this example, a morphological opening operation with a 3 × 3 dot image as the structuring element (b) is applied to the target image A (a) to extract the pattern that is at least 3 dots thick (c). Structuring elements can be defined in many other ways, so that thicker or thinner patterns can be selectively extracted. As this figure shows, an opening operation is generally used to extract thick lines. The opening operation is defined as follows:
Opening operation: $(A - B^{s}) + B$
Here "+" is the Minkowski sum between images, "-" is the Minkowski difference between images, and $B^{s}$ is the figure symmetric to image B with respect to the origin.
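As an illustration of thick line extraction by opening, the following minimal sketch works on a binary image stored as a list of rows (1 = black). It assumes a k × k square structuring element centred on the origin, which is symmetric, so that B^s equals B; the function names are hypothetical.

```python
def erode(img, k=3):
    """Minkowski difference with a k x k square element (k odd)."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Black only if the whole k x k window is inside the image and black.
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
    return out


def dilate(img, k=3):
    """Minkowski sum with a k x k square element (k odd)."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Black if any pixel of the k x k window is black.
            out[y][x] = int(any(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
    return out


def opening(img, k=3):
    """Erosion followed by dilation: keeps only strokes at least k dots thick."""
    return dilate(erode(img, k), k)


image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # the 1-dot-thick line to the right is removed
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
thick = opening(image, 3)
```

In this example the 3 × 3 block survives and the thin horizontal stroke is erased, which mirrors the separation of a thick printed character from a thin pre-printed one.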
[0006]
If thick line extraction alone is used, noise regions may be extracted in addition to the thick-line numerals that should be extracted. Therefore, after extraction, the pattern is verified by removing isolated regions smaller than α times the size of the other printed characters, where α is a threshold (411). Character identification is then performed (412).
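The pattern verification in 411 can be sketched as a connected-component filter. The sketch below assumes 4-connectivity, measures region size by bounding-box area, and uses a hypothetical default of α = 0.2; none of these choices is specified in this description.

```python
from collections import deque


def remove_small_regions(img, ref_area, alpha=0.2):
    """Erase black regions whose bounding-box area is below alpha * ref_area.

    img      : binary image as a list of rows (1 = black)
    ref_area : bounding-box area of a typical printed character
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if not img[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first search collects one connected region.
            queue, pixels = deque([(sy, sx)]), []
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
            if area < alpha * ref_area:
                for y, x in pixels:
                    out[y][x] = 0
    return out
```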
The second overlapping character process is rectangular decomposition and identification of character patterns (407). Among the cases in FIG. 5 where the line thickness of the printed characters (numerals here) is about the same as that of the pre-printed characters, it covers the cases where the two characters do not overlap completely, that is, where they are printed in different sizes or at shifted positions. Process 407 consists of three steps: rectangular decomposition is performed to extract the printed character pattern (413), the size of the extracted pattern is verified (414), and finally character identification is performed (415).
The rectangular decomposition process 413 exploits the fact that character patterns of different sizes, or overlapping character patterns whose positions are shifted, can be expressed as an overlap of rectangles. If the overlapping character can be expressed as an overlap of two rectangles, the result of the pre-printed character position and size estimation performed in 402 shows which rectangle corresponds to the pre-printed character, so the pixels in that rectangle can be erased and the target printed character pattern extracted. FIG. 7 shows a conceptual diagram of this process, in which the numeral "4" and the pre-printed character "ko" overlap. By rectangular decomposition, the overlap of the character patterns is expressed as an overlap of two rectangles: rectangle 701 corresponds to the numeral "4" and rectangle 702 to the pre-printed character "ko". Because the pre-printed character position estimation performed beforehand identifies which of the two rectangles corresponds to the pre-printed character, the image in that rectangle is erased and the numeral pattern to be read is extracted.
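The erasing step can be sketched as follows, assuming that the two rectangles and the estimated pre-printed character box are already available as (left, top, right, bottom) tuples with exclusive right and bottom edges. Choosing the pre-printed rectangle by overlap area and keeping the pixels that also fall inside the printed character rectangle are assumptions of this sketch, as are the function names.

```python
def overlap_area(a, b):
    """Area shared by two rectangles (left, top, right, bottom)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def extract_print_pattern(img, rect_a, rect_b, est_preprint):
    """Erase the rectangle judged to be the pre-printed character.

    Returns the cleaned image and the rectangle of the printed character.
    """
    # The rectangle that agrees better with the estimated pre-print box
    # is taken to be the pre-printed character (assumption of this sketch).
    if overlap_area(rect_a, est_preprint) >= overlap_area(rect_b, est_preprint):
        pre, prt = rect_a, rect_b
    else:
        pre, prt = rect_b, rect_a
    out = [row[:] for row in img]
    for y in range(pre[1], pre[3]):
        for x in range(pre[0], pre[2]):
            inside_print = prt[0] <= x < prt[2] and prt[1] <= y < prt[3]
            # Assumption: pixels shared with the printed character rectangle
            # are kept so that the printed character is not cut.
            if not inside_print:
                out[y][x] = 0
    return out, prt
```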
The rectangular decomposition of an overlapping character is performed as follows. First, the circumscribed rectangle of the overlapping character is obtained. Next, a white-region rectangle is grown from each of the four corners of the circumscribed rectangle. Connected black regions at or below a certain size are deleted beforehand so that isolated black regions do not have an adverse effect. The white regions obtained at the four corners (upper left, upper right, lower left, lower right) are then normalized under the following conditions. For example, the sizes of the upper and lower white regions on the right side are normalized as follows. Let the size of the upper right white region be (sx1, sy1) and the size of the lower right white region be (sx2, sy2). Each value is then rewritten by the following processing.

The same processing is applied to the white region sizes at the left end. This is essentially a normalization that sets the horizontal widths (sx1, sx2) of the upper and lower white regions at the right end (or left end) to the smaller of the two. However, because an erroneous white region may be obtained under the influence of noise, a white region whose height is smaller than the threshold β is ignored. Besides the method described here, rectangular decomposition using morphological operations is also possible. Using the previously estimated size of the pre-printed character pattern, a structuring element one dot wide and as tall as the pre-printed character is prepared, and a closing operation is performed with it. The closing fills the small gaps inside the character so that the outer contour is approximated by a rectangle. The closing operation is defined as follows:
Closing operation: $(A + B^{s}) - B$
Here "+" is the Minkowski sum between images, "-" is the Minkowski difference between images, and $B^{s}$ is the figure symmetric to image B with respect to the origin.
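The closing with a structuring element one dot wide and n dots tall can be sketched as follows. The sketch assumes n is odd and the element is centred on the origin (so that B^s equals B), and it pads the image with blank rows so that the border does not distort the result; the function name is hypothetical.

```python
def closing_vertical(img, n):
    """Closing with a vertical element 1 dot wide and n dots tall (n odd).

    img is a binary image as a list of rows (1 = black).
    """
    r = n // 2
    w = len(img[0])
    # Pad r blank rows above and below the image.
    a = ([[0] * w for _ in range(r)]
         + [row[:] for row in img]
         + [[0] * w for _ in range(r)])
    h = len(a)
    # Minkowski sum: black if any pixel within r rows (same column) is black.
    dil = [[int(any(a[yy][x] for yy in range(max(0, y - r), min(h, y + r + 1))))
            for x in range(w)] for y in range(h)]
    # Minkowski difference: black only if all pixels within r rows are black.
    ero = [[int(y - r >= 0 and y + r < h
                and all(dil[yy][x] for yy in range(y - r, y + r + 1)))
            for x in range(w)] for y in range(h)]
    # Remove the padding again.
    return ero[r:h - r]
```

Vertical gaps shorter than n dots are filled column by column, so a character whose strokes leave white space between them becomes a solid block whose outline is close to a rectangle.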
[0007]
Printed characters are extracted by these processes. Pattern verification (414) and character identification (415) are then performed on the extracted printed characters, in the same way as 411 and 412.
The third overlapping character process is character pattern matching and identification (408). FIG. 8 shows a conceptual diagram of character pattern shape matching. In FIG. 8, pattern matching is performed on the target figure (a) using a pre-printed character pattern prepared in advance (b), and the matched portion is erased as the pre-printed character portion (c). In FIG. 8 (c), which shows the result of the pattern matching, the printed character portion is indicated by 801 and the pre-printed character pattern by 802. Usable matching algorithms include an XY-axis independent matching method using dynamic programming and a bending warp search method ("Recognition of Handwritten Digits Using Template and Model Matching," Pattern Recognition, vol. 24, no. 5, pp. 421-431). Process 408 consists of three steps: the position of the pre-printed character is identified by pattern matching and the character is removed (416), the size of the extracted pattern is verified (417), and finally character identification is performed (418). Processes 417 and 418 are the same as 411 and 412, respectively.
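The idea of step 416 can be illustrated with a deliberately simple rigid template match. This sketch is not the dynamic-programming or warp-search matching cited in the text: it slides a binary template over the image, scores each offset by the number of coinciding black pixels, and erases the template pixels at the best offset. The function name is hypothetical.

```python
def remove_preprint_by_matching(img, template):
    """Locate the pre-printed template by exhaustive search and erase it.

    img, template : binary images as lists of rows (1 = black)
    Returns the cleaned image and the (row, column) offset of the match.
    """
    h, w = len(img), len(img[0])
    th, tw = len(template), len(template[0])
    best_score, best_off = -1, (0, 0)
    for oy in range(h - th + 1):
        for ox in range(w - tw + 1):
            # Score: number of black pixels shared by image and template.
            score = sum(img[oy + y][ox + x] and template[y][x]
                        for y in range(th) for x in range(tw))
            if score > best_score:
                best_score, best_off = score, (oy, ox)
    oy, ox = best_off
    out = [row[:] for row in img]
    for y in range(th):
        for x in range(tw):
            if template[y][x]:
                out[oy + y][ox + x] = 0  # erase the matched pre-printed pixel
    return out, best_off
```

A rigid match only works when the pre-printed character is printed with little distortion; elastic matching of the kind cited in the text tolerates more deformation at a higher cost.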
Through the three processes described above, the pre-printed characters are removed, the printed characters to be read (numerals here) remain, and the pattern resulting from each process is identified. Next, the identification results are ranked (409). Basically, results are selected in descending order of character identification likelihood. However, for certain patterns (such as the numeral "1"), noise left over from pre-printed character removal is often misread as that pattern, so the likelihood of such results is lowered. Finally, knowledge processing is applied to the result of re-recognizing the overlapping character patterns to calculate a character string path (404), and the path is verified to determine the final character string path (405). Thus, in the present invention, overlapping characters are estimated using character string notation knowledge, multiple character extraction and identification processes are run in parallel on each pattern estimated to be an overlapping character, and the best of the identification results is selected, which realizes reading that is robust against cutting errors and noise.
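The ranking in 409 can be sketched as a sort on an adjusted likelihood. The tuple layout, the function name `rank_results`, the set of penalized characters, and the penalty factor 0.8 are illustrative assumptions; the description states only that the likelihood of error-prone patterns such as "1" is lowered.

```python
def rank_results(candidates, penalized=frozenset({"1"}), penalty=0.8):
    """Rank identification results from the three extraction processes.

    candidates : list of (method, character, likelihood) tuples
    penalized  : characters often produced by leftover noise
    penalty    : hypothetical factor applied to their likelihood
    """
    def adjusted(candidate):
        _, char, likelihood = candidate
        return likelihood * penalty if char in penalized else likelihood

    # Highest adjusted likelihood first.
    return sorted(candidates, key=adjusted, reverse=True)


results = [
    ("thick_line", "1", 0.90),  # likely residue after removal
    ("rectangle", "4", 0.80),
    ("matching", "4", 0.60),
]
ranked = rank_results(results)
```

With these values the "4" from rectangular decomposition is ranked first, because the "1" from thick line extraction is scaled down before comparison.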
Finally, FIG. 9 is described. It shows an example configuration of a form reading apparatus incorporating the character string recognition device according to the present invention. The apparatus converts a form into electronic data with the image input device (901) and stores it in the external storage device (904) and the memory (905); the central processing unit (906) reads a program for executing the present invention, recorded on a storage medium or the like, and performs the character string recognition processing of the present invention. The necessary character string notation knowledge, form reading position definitions, and the like are stored in the external storage device (904). An operator can control these processes through the operation terminal device (902), processing results and the like are displayed on the display terminal device (903), and data can be exchanged with external devices through the communication device (907).
As described above, the present application discloses a character string recognition method for recognizing a character string entered in a form that contains characters printed in advance. In an environment where characters printed in advance (pre-printed characters) and characters printed on the form afterward by a printer or the like (printed characters) are mixed, the basic processing reads both together using multiple character extraction hypotheses and character string notation knowledge. The method has a function for determining patterns in which a pre-printed character and a printed character overlap (overlapping characters); for overlapping characters, it removes the pre-printed character and extracts the printed character using techniques such as thick line extraction, rectangular decomposition, and pattern matching, and then reads the printed character correctly by re-identifying the character and repeating the knowledge collation.
[0008]
[Effect of the invention]
As described above, even when pre-printed characters and printed characters overlap, the character string recognition device according to the present invention determines the overlapping characters using character string notation knowledge that distinguishes pre-printed characters from printed characters. For the overlapping characters, it can recognize the character string by extracting the printed characters using a plurality of processes and then performing character re-identification and knowledge re-collation.
[Brief description of the drawings]
FIG. 1 is a flowchart of character string recognition including overlapping character processing.
FIG. 2 is a schematic diagram of reading target character lines and character string notation knowledge.
FIG. 3 is a conceptual diagram of character string reading using knowledge of character string notation.
FIG. 4 is a flowchart of overlapping character processing.
FIG. 5 is a classification diagram of overlapping character patterns.
FIG. 6 is a conceptual diagram of a thick line extraction process based on morphology.
FIG. 7 is a conceptual diagram of preprinted character removal processing by rectangular decomposition.
FIG. 8 is a conceptual diagram of preprinted character removal processing by pattern matching.
FIG. 9 shows a configuration example of a form recognition apparatus incorporating an overlapping character string reading function.
[Explanation of symbols]
101: Image input unit
102: Form analysis unit
103: Extraction unit of the reading target line
104: Character pattern candidate generation unit
105: Knowledge processing unit
106: Character string verification unit
107: Overlapping character processing unit
201: Printed character part in character string notation knowledge
202: Pre-printed character part in character string notation knowledge
401: Processing for enumerating overlapping character candidates
402: Processing for estimating pre-printed character information in overlapping characters
403: Loop processing for all overlapping character candidates
404: Knowledge processing after overlapping character processing
405: Character string verification process
406: Thick line hypothesis processing
407: Rectangle hypothesis processing
408: Pattern matching processing
409: Identification result selection process
410: Printed character extraction processing by thick line pattern extraction
411: Pattern verification process
412: Character identification processing
413: Printed character extraction processing by rectangular decomposition
414: Pattern verification process
415: Character identification processing
416: Printed character extraction processing by pattern matching
417: Pattern verification process
418: Character identification processing
501: Case where the line of the printed character is sufficiently thicker than that of the pre-printed character
502: Case where the line of the printed character is slightly thicker than that of the pre-printed character
503: Case where the line of the printed character is the same as or thinner than that of the pre-printed character
504: Case where printed characters and pre-printed characters are in weak contact
505: Case where printed characters and pre-printed characters are in strong contact
506: Case where the printed character and the pre-printed character completely overlap
507: Cases to be read by the present invention
508: Cases where reading is rejected in the present invention
701: Portion determined to be the printed character rectangle in the rectangular decomposition
702: Portion determined to be the pre-printed character rectangle in the rectangular decomposition
801: Portion determined to be the printed character in character pattern matching
802: Portion determined to be the pre-printed character in character pattern matching
901: Image input device
902: Operation terminal device
903: Display terminal device
904: External storage device
905: Memory
906: Central processing unit
907: Communication device

Claims

1. A character string recognition device comprising:
means for reading a form image;
storage means for storing character string notation knowledge having information for distinguishing between pre-printed characters printed in advance on the form image and printed characters added later;
means for determining, as a result of reading a character string using the character string notation knowledge, a portion that has become unread to be a portion where the pre-printed character and the printed character overlap; and
means for separating the pre-printed characters and the printed characters using a plurality of character-cutting hypotheses created for the character string, and performing overlapping character recognition processing using the character string notation knowledge.

2. The character string recognition device according to claim 1, wherein
the means for performing overlapping character recognition processing performs a plurality of types of overlapping character recognition processing on each portion determined to be an overlapping portion by the determining means,
the device further comprising means for comparing the plurality of recognition results and outputting a selected result.

3. The character string recognition device according to claim 1 or 2, wherein the overlapping portion determination means:
separates the pre-printed characters and the printed characters by creating a plurality of character-cutting hypotheses for the character string of the form image;
reads the character string using the character string notation knowledge; and
for a portion that cannot be separated by the character cutting, reads the character string using the character string notation knowledge and then makes the determination using pattern shape information and identification information corresponding to the read character string.

4. The character string recognition device according to claim 2, wherein the plurality of types of recognition processing include at least any two of:
thick line extraction processing that uses the difference in line thickness between the pre-printed character and the printed character;
rectangular decomposition that uses the difference in size of the circumscribed rectangles of the character patterns of the pre-printed character and the printed character; and
pattern matching using pre-printed character information held in the storage means.

5. A program for causing a computer to execute a character string recognition method comprising the steps of:
acquiring a form image;
reading, from storage means, character string notation knowledge having information for distinguishing between pre-printed characters printed in advance on the form image and printed characters added later;
determining, as a result of reading a character string from the form image using the character string notation knowledge, a portion that has become unread to be a portion where the pre-printed character and the printed character overlap;
separating the pre-printed characters and the printed characters using a plurality of character-cutting hypotheses created for the character string, and performing overlapping character recognition processing using the character string notation knowledge; and
comparing the results of the recognition processing and outputting a selected result.