JP3689485B2

JP3689485B2 - Form recognition method

Info

Publication number: JP3689485B2
Application number: JP11457396A
Authority: JP
Inventors: 広新庄; 好博嶋; 勝美丸川; 和樹中島
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-05-09
Filing date: 1996-05-09
Publication date: 2005-08-31
Anticipated expiration: 2016-05-09
Also published as: JPH09305701A

Description

【０００１】
【発明の属する技術分野】
本発明は帳票、特に、不動産に関する登記情報が記載された多様な帳票に関し，特に，登記済通知書から文字データを読み取り，自動的に入力する帳票認識方法に関する。
【０００２】
【従来の技術】
帳票の種類の識別に関する従来技術の例としては，以下のものが挙げられる。第１は，全ての種類の帳票に対して同じ位置に記載された帳票の種類を表すＩＤ番号を読み取ることにより，帳票の種類を識別する方式である。第２は，帳票の種類ごとに枠の構造が異なる場合に，枠の構造を識別することにより帳票の種類を識別する方式である。この例は，特開平７―１４１４６２号公報に記載されている。
【０００３】
【発明が解決しようとする課題】
不動産に関する登記済通知書は現在７種類ある。これらの帳票は不動産に関する課税のためのデ−タ入力に用いられるものであるが、この通知書には，帳票の種類を特定するＩＤ番号の記載がないため，ＩＤ番号読み取りにより帳票を識別する従来手法を用いることはできない。さらに，これらの帳票は，同じ種類であっても枠の形状が異なる非定型帳票であるため，枠の構造から帳票を識別する従来手法を用いることはできない。また，表題部の文字を読み取ることにより帳票を識別する従来手法を用いる場合には，帳票の識別精度は文字認識の精度に大きく依存するという問題がある。登記済通知書の帳票名は，「権利に関する土地登記済通知書」，「権利に関する建物登記済通知書（一般）」，「権利に関する建物登記済通知書（専有）」，「表示に関する土地登記済通知書」，「表示に関する建物登記済通知書（一般）」，「表示に関する建物登記済通知書（一棟）」，「表示に関する建物登記済通知書（専有）」の７種類である。このうち，「表示に関する建物登記済通知書（一般）」と「表示に関する建物登記済通知書（一棟）」は，一字しか違わないため，この二種類に対する識別精度が低くなる可能性がある。
【０００４】
そこで，本発明の第１の目的は，帳票の種類が多様な読み取り対象に対して，高精度な帳票識別手段を有する帳票認識手段を提案することである。
【０００５】
従来の下線検出方法では，枠線以外の罫線を下線としていたため，文字の横方向のストローク等のノイズ成分を下線として誤抽出する可能性があった。そこで，本発明の第２の目的は，高精度な下線検出手段を有する帳票認識手段を提案することである。
【０００６】
【課題を解決するための手段】
第１の観点では、この発明は、登記済通知書の表面画像を入力し文字を読み取る登記情報の認識方法であって，登記済通知書の画像から文字行を抽出する文字行抽出手段と，抽出した複数の文字行と枠との位置関係から帳票名の文字行を選択し文字行選択手段と，帳票名の文字行を読み取る文字識別手段から，登記済通知書の種類を識別する第１の帳票識別手段と，登記済通知書の画像から罫線を抽出する罫線抽出手段と，抽出した罫線から表の特徴を抽出する表特徴抽出手段と，表の特徴から登記済通知書の種類を識別する第２の帳票識別手段と，登記済通知書の画像から文字行を抽出する文字行抽出手段と，抽出した文字行を読み取る文字識別手段と，読み取り結果の中から帳票の項目名を選択する項目名選択手段と，項目名の組み合わせから登記済通知書の種類を識別する第３の帳票識別手段とを具備し，当該３つの手段の結果を組み合わせることにより，登記済通知書の種類を識別する帳票認識方法を提供する。
【０００７】
第２の観点では、この発明は、登記済通知書の表面画像を入力し文字を読み取る登記情報の認識方法であって，登記済通知書の画像から文字行と罫線を抽出する文字行抽出手段と，抽出した罫線から枠罫線と枠罫線でない罫線を区別する罫線種判定手段と，枠罫線でない罫線が含まれる枠内の文字行を検出する手段と，当該枠内の文字行と当該枠内の枠罫線でない罫線との位置関係から，当該枠内の枠罫線でない罫線が下線か否かを判定する下線検出手段を具備する帳票認識方法を提供する。
【０００８】
【発明の実施の形態】
以下、本発明の一実施例を詳細に説明する。なお、これにより本発明が限定されるものではない。
【０００９】
図１は、本発明の一実施例である登記情報システムの構成図である。登記情報の認識を行う認識部１０１と認識結果の修正を行う修正部１０５がネットワーク１０４により接続されており，入力センタ１１１において認識と修正を並行して行うことができる。処理の過程は，まずスキャナ１０２により登記済通知書１００の画像を入力する。次に，認識用計算機１０３では，文字および罫線の認識を行い，修正用計算機１０６において認識結果の修正確認を行う。また，辞書やコード表と照合チェックし，コードデータを出力する。認識結果は，通信制御用計算機１０７を介して，遠隔地にある計算センタ１１０にあるホスト計算機１０８に接続された登記情報データベース１０９に格納される。修正部１０５では，認識結果の一部を利用し，登記情報データベース１０９をアクセスし，登録済の登記情報を読み出す。当該読み出した登録情報と認識結果の一部を照合し，矛盾がないかどうかの検定を行う。
【００１０】
図２は，登記情報認識の処理過程を示すブロック図である。認識部１０１では，帳票画像を読み取り，修正部１０５に縮小画像２４８，枠座標２５０，下線座標２５２，文字行座標２５４，帳票種類２５６，認識結果ラティス２５８，文字座標２６０を出力する。修正部１０５では，これらの入力データをもとに，操作者が認識結果を修正する。画像入力部２００では，帳票表面の画像を白黒２値化して入力する。
【００１１】
入力した画像は，画像縮小部２０２と文字行画像抽出部２１８に出力される。画像縮小部２０２では，後続の処理の高速化のため帳票画像を縮小し，縮小画像２４８を出力する。縮小処理は，細い罫線が縮小後かすれないよう，画素ごとのＯＲ処理を行う。縮小した画像に対し，罫線抽出部２０４において実線と点線の罫線を抽出する。実線は，黒画素の連続するつながりをもとに抽出される。点線は，黒画素の連結成分の外接矩形の配置，サイズの拘束条件をもとに抽出される。枠抽出部２０６では，２０４で抽出した罫線から罫線が四方を取り囲む枠を求め，枠の頂点座標２５０を出力する。表特徴抽出部２０８では，２０６で抽出された枠の情報から，枠の集まりである表の特徴量を抽出する。この特徴量とは，縦横の罫線の本数や，罫線同士の接続関係，枠の位置関係等である。
【００１２】
一方，文字行抽出部２１２では，２０２から出力された縮小画像から，文字の集合である文字行を抽出する。ここでは，黒画素の連結成分うち，文字と推定される大きさの連結成分の外接矩形の頂点座標をもとに，文字の並びと推定される外接矩形を融合することにより，文字行を生成する。行―枠対応部２１４では，２１２で抽出した文字行の頂点座標と２０６で抽出した枠の頂点座標を比較することにより，各文字行がどの枠内に存在するか，もしくは枠外にあるかを判定し，枠ごとに含まれる文字行の頂点座標と枠外の文字行の頂点座標２５４を出力する。また，下線抽出部２１６では，２０４で抽出した罫線座標と，２０６で抽出した枠の頂点座標と，２１４で抽出した枠内の文字行座標とをもとに，下線を抽出して，下線の座標２５２を出力する。さらに，文字行画像抽出部２１８では，２１４で抽出された文字行座標をもとに，２００で入力された画像から文字行部分の画像を切り出す。文字切り出し・文字識別部２２０では，文字切り出し部２２２と文字識別部２２４が協調して，文字を１文字ずつ切り出し，その文字座標２６０を出力する。さらに，文字識別部２２４では，切り出した１文字分の画像パターンに対して，識別辞書２２６を用いて文字を識別する。帳票名照合部２２８では，文字識別部２２４の出力である文字識別結果を入力し，単語照合部２３０により帳票名辞書２３２に格納された帳票名単語と照合することにより帳票名についての認識結果の誤りを修正して帳票名を求める。
【００１３】
帳票名辞書２３２に格納された単語は，認識対象の帳票名である。認識対象の帳票名はあらかじめわかっており，帳票名は帳票の種類に１対１に対応する。さらに，項目照合部２３４では，２２８で照合されなかった文字認識結果を入力し，単語照合部２３６により項目辞書２３８に格納された項目名単語と照合することにより項目名についての認識結果の誤りを修正して項目名を求める。項目辞書２３８にされた単語は，認識対象の帳票内に記載された項目である。内容照合部２４０では，２３４で照合されなかった文字認識結果を入力し，単語照合部２４２により内容辞書２４４に格納された内容単語と照合することにより内容についての認識結果の誤りを修正する。ここで，「内容」とは帳票において，項目名に対して記載されている内容をさす。例えば，「地目」という項目に対する内容には「居宅」や「公園」などがある。内容辞書２４４に格納された単語は，認識対象の帳票内に記載された内容を記載する単語のうち，あらかじめ使用が決められている単語である。２４０の処理の結果出力される認識結果ラティス２５８は，１文字ごとに文字識別処理の結果である候補文字を類似度が高い順に並べたものである。この文字識別結果は，帳票名照合，項目照合，内容照合により誤りを修正してある。
【００１４】
一方，帳票識別部２４６では，表特徴抽出部２０８と帳票名照合部２２８と項目名照合部２３４の出力結果を入力し，表特徴と帳票名，項目名から帳票の種類を識別し，帳票種類２５６を出力する。
【００１５】
図３は、図２で示した登記情報認識の処理フローを示す図である。ステップ３００で画像を入力し，ステップ３０２で当該画像を縮小する。次いで，ステップ３０４で画像から罫線を抽出し，ステップ３０６で罫線から枠を抽出する。さらに，ステップ３０８で表の特徴を抽出する。また，ステップ３１０で当該縮小画像から文字行を抽出し，ステップ３１２で，抽出した行と枠とを対応付ける。また，ステップ３１４で，罫線と枠と文字行の座標から下線を抽出する。さらに，ステップ３１６で，文字行の座標値に基づいて帳票画像から文字行部分の画像のみを抽出する。ステップ３１８で，当該文字行画像を１文字ずつの画像に分割し，ステップ３２０で切り出された画像パターンに対して文字識別を実行する。ステップ３２２では，文字識別結果を帳票名の単語と照合して帳票名を識別する。ステップ３２４では，文字識別結果を項目名の単語と照合して項目名を識別する。ステップ３２６では，文字識別結果を内容の単語と照合して内容を識別する。ステップ３２８では，ステップ３０８の処理結果である表の特徴とステップ３２２の処理結果である帳票名とステップ３２４の処理結果である項目名から帳票の種類を識別する。ステップ３３０では，３００から３２８の処理で得た結果を出力する。
【００１６】
図４は，認識対象である登記済通知書の画像を，説明のために簡略的に示した図である。帳票画像４００の例では，帳票名「権利に関する建物登記済通知書（専有）」４０１が記載されており。横罫線４０２，４０４，４０６，４０８と縦罫線４１０，４１２，４１４，４１６が印刷されている。また，項目として「符号」４１８と「所在」４２０，「地目」４２２がある。「符号」の内容としては「１」（４２４）と「２」（４２６），「所在」の内容としては４２８と４３０に「国分寺市東恋ヶ窪１丁目２８０番地」が記載されている。「地目」の内容としては，「宅地」（４３２）と「公園」（４３４）が記載されている。さらに，内容４２４「１」，４２８「国分寺市東恋ヶ窪１丁目２８０番地」，４３２「宅地」には，それぞれ下線４３６，４３８，４４０が印刷されている。
【００１７】
図５は，図４の帳票画像に対する，図３のステップ３０４の罫線抽出処理結果を示すものである。（ａ）の５００は横罫線の抽出結果であり，（ｂ）の５２０は縦罫線の抽出結果である。（ａ）では，図４の横罫線４０２から４０８に相当する罫線として，それぞれ，５０２から５０８が抽出されている。下線４３６，４３８，４４０に相当する下線として，それぞれ，５１０，５１４，５１６が抽出されている。５１２と５１８は，「市東恋」の横ストロークをつなげることによって，罫線として抽出したものである。この離れた横ストロークが接続される現象は，横罫線を抽出する際に黒画素を横方向に収縮・膨張処理することにより，接近した黒画素が接続されることに起因する。また，（ｂ）では，図４の縦罫線４１０から４１６に相当する罫線として，それぞれ，５２２から５２８が抽出されている。
【００１８】
図６は，図４の帳票画像に対する，図３のステップ３０６の枠抽出処理結果を示すものである。６００は枠抽出結果である。６０２から６１８の９個の枠が抽出されている。
【００１９】
図７は，図４の帳票画像に対する，図３のステップ３１０の文字行抽出処理結果を示すものである。７００は文字行抽出結果である。図４の文字行４０１，４１８，４２０，４２２，４２４から４３４の文字行に対して，それぞれ７０２から７２０の文字行の外接矩形が抽出されている。
【００２０】
図８は，図３のステップ３１４の下線抽出処理に関する処理フローである。罫線抽出処理３０４，枠抽出処理３０６，文字行抽出処理３１０の結果を用いて
，ステップ８００では，枠を構成しない罫線を抽出する。ステップ８０２では，ステップ８００で抽出した罫線の本数分だけ，以下の処理を繰り返す。ステップ８０４では，文字行の座標と罫線の座標を比較する。比較の方法については図９と図１０を用いて説明する。ステップ８０６では，比較した値が基準を満たすか否かを判定する。基準値を満たす場合，ステップ８０８で，比較対象の罫線を下線とする。なお，上記ステップ８０８において抽出された２本の下線について，端点同士がが微小な間隔で離れており，延長線上に存在する場合には，１本の下線であるとすることもできる。また，上記ステップ８０８において抽出した下線の長さが基準値以下であれば，下線とみなさないとすることもできる。
【００２１】
図９は，図８の処理フローを説明するための帳票の枠の例である。横罫線９００と９０２，縦罫線９０４と９０６，文字行９０８，下線９１０が印刷されている。
【００２２】
図１０は，図９の例から罫線と文字行を抽出した結果である。この図を用いて下線の判定を説明する。下線判定処理は，文字行と同一枠内にある罫線の中で，文字行の下に位置し，文字行とほぼ同じ長さの罫線を下線と判定する。図１０において，１００７は文字が印刷されていた領域であり，１００８は１００７の外接矩形である。図９の９００から９１０の罫線は，それぞれ１０００から１０１０として抽出されている。さらに，１０１２は文字の横ストロークを罫線として抽出したものである。抽出された罫線の中から，枠を構成していない罫線として，１０１０と１０１２が抽出される。以下，１０１０を例として下線と判定される場合について説明し，１０１２を例として下線と判定されない場合を説明する。
【００２３】
図１０の１０１０について判定する。まず，罫線の下端のｙ座標と文字行の下端のｙ座標との差ｄ１１（１０１４）を求める。次に，罫線の上端のｙ座標と文字行の上端のｙ座標との差ｄ１２（１０１６）を求める。さらに，罫線のｘ方向の長さＬ１（１０１８）と文字行のｘ方向の長さＬｃ（１０２０）との差を求める。この値を基準値，α，β，γ１，γ２と比較する。ｄ１１が文字行より下でα未満であり，ｄ１２がβ以上であり，Ｌ１―Ｌｃがγ１以上γ２以下であれば，この罫線を下線とする。上記の処理の判定基準であるα，β，γ１，γ２の値は経験的に求めることができる。
【００２４】
例えば，αは，文字行と下線との間隔が一定であればその値を用いることができる。一定でなければ，枠の高さと文字の高さの差の１／２を用いることができる。βは，文字行の下端と下線との間隔と，文字の高さとが一定であれば，この２つの値の和を用いることができる。γ１とγ２の値は，一文字程度のマージンを見込んで，γ１は文字幅に（−１）をかけた値，γ２は文字幅等を用いることができる。上記のα，β，γ１，γ２の値の設定にあたっては，帳票の傾きや，線のかすれやつぶれ等に対して頑健性をもたせるため，マージンをもたせて値を設定することができる。また，ｄ１１の値の許容値について，負の値を許容すれば，下線が文字と重なる場合にも対応できる。
【００２５】
次に，図１０の１０１２について判定する。まず，罫線の下端のｙ座標と文字行の下端のｙ座標との差ｄ２１（１０２２）を求める。次に，罫線の上端のｙ座標と文字行の上端のｙ座標との差ｄ２２（１０２４）を求める。さらに，罫線のｘ方向の長さＬ２（１０２６）と文字行のｘ方向の長さＬｃ（１０２０）との差を求める。これらの値を上記α，β，γ１，γ２と比較した場合，ｄ２１は負の大きな値となり，ｄ２２はβより小さな値になるため，下線ではないと判定される。
【００２６】
なお，ここで用いたｄ１１，ｄ１２は文字の高さや枠の高さ等で正規化してもよい。また，Ｌ１とＬｃの差の代わりに比を比較してもよい。α，β，γ１，γ２の値は，比較対象の定義に合わせて設定する。
【００２７】
また，ここでは，罫線の下端のｙ座標と文字行の下端のｙ座標との差１０１４と，罫線の上端のｙ座標と文字行の上端のｙ座標との差１０１６，罫線のｘ方向の長さ（１０１８）と文字行のｘ方向の長さ（１０２０）との差の３つの評価値を用いたが，必要に応じてこの中の１つもしくは２つのみを用いていもよい。
【００２８】
図１１は，図３のステップ３１４下線抽出処理において，文字行の座標の代わりに文字の座標を用いた例である。図１０で説明した判定基準を用いて，枠を構成しない罫線１１０８と文字の外接矩形１１１２を比較することにより，１１０８は下線であると判定できる。また，枠を構成しない罫線１１１０と文字の外接矩形１１１４を比較することにより，１１１０は下線でないと判定できる。
【００２９】
図１２は，文字行内の一部の文字に対してのみ下線が印刷されている例である。枠１２００内に，文字行１２０２と下線１２０４が記載されている。図１１の方法を用いれば，文字行中の「１丁目２８０番」の文字のみに下線が印刷されていることを判定できる。
【００３０】
図１３は，図３のステップ３１４の下線抽出処理に関する別の処理フローである。登記済通知書では，図４の４３６，４３８，４４０のように同一線上に複数の下線が存在することが多い。一方，下線４３６は短いので，文字内の横方向のストロークと長さが変わらないため，罫線抽出の際に抽出もれする可能性がある。この処理では，罫線抽出の際に抽出もれする可能性のある短い下線を正しく抽出することを目的とする。このため，まず長い下線を抽出し，この下線の延長上にある罫線を下線と判定する。
【００３１】
以下，図１３の各ステップについて説明する。ステップ１３００では，長い下線のみを抽出する。この処理は，図８で示した処理等を用いて実現できる。ステップ１３０２では，横方向のランレングスデータのうち枠線を構成しないランレングスデータを抽出する。ステップ１３０４では，抽出したランレングスデータの個数分についてステップ１３０６と１３０８の処理を繰り返す。ステップ１３０６では，対象とするランレングスデータが下線の延長線上にあるか否かを判定する。延長線上にあれば，ステップ１３０８で下線を構成するランレングスデータであるとして抽出する。ステップ１３１０では，ステップ１３０８で下線を構成すると判定されたランレングスデータから構成される罫線を下線として抽出する。なお，上記ステップ１３１０において抽出された２本の下線について，端点同士がが微小な間隔で離れており，延長線上に存在する場合には，１本の下線であるとすることもできる。また，上記ステップ１３１０において抽出した下線の長さが基準値以下であれば，下線とみなさないとすることもできる。
【００３２】
図１４は，図１３の処理フローを説明するための帳票の枠の例である。横罫線１４００と１４０２，縦罫線１４０４から１４１０，下線１４１２から１４１６，文字行１４１８から１４２２が印刷されている。
【００３３】
図１５は，図１４の画像から枠を構成しない横方向のランレングスデータと長い下線とを抽出した結果である。１５００は図１３のステップ１３００で抽出された長い下線である。横方向のランレングスデータの連結成分のうち，１５０２と１５０４は１５００の延長線上１５０８から許容範囲ｗ（１５１０）以内にあるので，下線であると判定する。１５０６はｗよりも外にあるので，下線はないと判定する。
【００３４】
図１６は，図３のステップ３１４の文字行抽出処理に関する別の処理フローである。この処理では，枠を構成しない横方向のランレングスデータの長さの値をランの中点から傾き方向に投影して作成したヒストグラムを用いて下線を抽出する。以下，図１６の各ステップにてついて説明する。ステップ１６００では，横方向のランレングスデータのうち枠線を構成しないランレングスデータを抽出する。ステップ１６０２では，抽出したランレングスデータの長さの値を，ランの中点から傾き方向に投影してヒストグラムを作成する。ステップ１６０４では，ヒストグラムの山の数だけステップ１６０６とステップ１６０８の処理を繰り返す。ステップ１６０６では，投影値が基準値以上であるか否かを判定する。基準値以上であれば，ステップ１６０８で投影されたランレングスデータは下線を構成すると判定する。ステップ１６１０では，ステップ１６０８で下線を構成すると判定されたランレングスデータから下線を抽出する。なお，上記ステップ１６１０において抽出された２本の下線について，端点同士がが微小な間隔で離れており，延長線上に存在する場合には，１本の下線であるとすることもできる。また，上記ステップ１６１０において抽出した下線の長さが基準値以下であれば，下線とみなさないとすることもできる。
【００３５】
図１７は，図１４の画像から枠を構成しない横方向のランレングスデータを抽出し，ヒストグラムを作成した結果である。１７００から１７０６は図１６のステップ１６００で抽出された横方向のランレングスデータの連結成分である。ヒストグラム１７０８と１７１０は，ステップ１６０２で投影された結果である。ステップ１６０６において，１７０８と１７１０について，許容範囲ｗ（１７１２）の範囲内の面積を基準値と比較する。この場合，１７０８は基準値以上，１７１０は基準値未満であるとすると，１７００，１７０２，１７０４は下線であり，１７０６は下線ではないと判定できる。
【００３６】
図１８は，図３のステップ３２８の帳票識別処理に関する処理フローである。ステップ３０８では表の特徴量を抽出する。ステップ３２２では帳票名の単語照合結果を求める。ステップ３２４では項目名の単語照合結果を求める。ステップ１８００では，３０８，３２２，３２４の結果からそれぞれ導出される帳票の種類を用いて，多数決により帳票種類を識別する。
【００３７】
ステップ３０８で抽出する表の特徴としては，罫線の接続関係，枠の個数，枠の配置関係，縦罫線の本数，横罫線の本数等がある。罫線の接続関係が帳票の種類ごとに異なる場合には，特開平７―１４１４６２号公報に記載されている技術を用いて帳票の種類を特定できる。
【００３８】
【表１】

【００３９】
表１では，ステップ３０８で抽出する表の特徴の例として，認識対象である登記済通知書の縦の実線罫線の本数を示している。これにより，縦の実線罫線は７，８，１０，１１，１２，１６本のうちのいずれかでることがわかる。このうち，８本と１０本の場合を除けば，帳票の種類が一意に決定する。８本と１０本の場合も帳票種類の候補を挙げることができる。
【００４０】
また，ステップ３２２で照合する帳票名の単語は，帳票名全てを一つの単語として登録してもよく，「権利」「表示」，「建物」「土地」，「一般」，「専有」，「一棟」など特徴的な単語のみを登録してもよい。
【００４１】
【表２】

【００４２】
表２は，ステップ３０８で照合する項目名の中から一部を抜粋して示したものである。表２より，「所在」や「所」のように複数の帳票に共通する項目名や，「地積」や「一棟の建物番号」，「棟」，「表」のように帳票固有の項目名などがある。帳票固有の項目名をもたない種類の帳票でも，複数の項目を組み合わせて存在を判定することにより，「表示に関する建物登記済通知書（一般）」と「表示に関する建物登記済通知書（専有）」を除く５種類の帳票の種類を識別することができる。例えば，「床面積」の項目が存在し，「一棟の建物番号」の項目が存在しなければ「権利に関する建物登記済通知書（一般）」と識別することができる。
【００４３】
ステップ１８００では，ステップ３０８，３２２，３２４の結果を統合して帳票の種類を識別する。統合の手段としては，上記３つの結果の多数決を用いることができる。
【００４４】
ステップ１８００において，３０８，３２２，３２４の各ステップで，一意に帳票の種類を識別できない場合でも，各ステップの処理結果を相互に補完することによって，帳票の種類を識別することもできる。例えば，ステップ３０８において，縦の実線罫線の本数が８本抽出された場合，表１より帳票の種類は「表示に関する土地登記済通知書」，「表示に関する建物登記済通知書（一般）」，「表示に関する建物登記済通知書（専有）」の３種類が考えられる。しかし，ステップ３２４において，項目名「表」が抽出されれば，「表示に関する土地登記済通知書」であると一意に決定できる。
【００４５】
なお，ステップ１８００において，３０８，３２２，３２４の３つのステップの結果を用いるのではなく，２つのみを用いることもできる。
【００４６】
なお，ステップ１８００において，３０８，３２２，３２４の各ステップの結果を同等に扱うのではなく，一つのステップで得た結果から帳票を識別し，他のステップで得た結果は，帳票識別の結果を検証するために用いることもできる。
【００４７】
図１９は，本発明の一実施例である登記情報システムの構成図である。１０１から１０９の構成は図１に同じである。ソータ１９００は，認識部１０１で認識し，修正部１０５で修正した結果に基づき，登記済通知書を記載内容の優先度順に帳票１００をソートする。以下にソートの例を２つ挙げる。第一は，所在と地番に該当する文字から，町ごとに丁目，番地，号の順にソートする。第二は，作成日，番号の順にソートする。また，ソートする対象は，登記済通知書の帳票でも，認識結果のデータでもよい。
【００４８】
【発明の効果】
本発明の帳票認識方法によれば，登記済通知書のような非定型帳票に対しても高精度に帳票の種類を識別することができる。
【００４９】
また，本発明の帳票認識方法によれば，下線を文字中のストロークなどと間違うことなく，高精度に抽出することができる。
【００５０】
また，本発明の帳票認識方法によれば，帳票の認識結果に基づいて，帳票をソートすることができる。
【図面の簡単な説明】
【図１】本発明の一実施例である登記情報認識システムの構成図である。
【図２】登記情報認識の処理過程を示すブロック図である。
【図３】図２で示した登記情報認識のＰＡＤ図である。
【図４】認識対象である登記済通知書の画像の説明図である。
【図５】図４の画像に対して図３のステップ３０４の罫線抽出処理をした結果を示す図である。
【図６】図４の画像に対して図３のステップ３０６の枠抽出処理をした結果を示す図である。
【図７】図４の画像に対して図３のステップ３１０の文字行抽出処理をした結果を示す図である。
【図８】図３のステップ３１４の下線抽出処理に関するＰＡＤ図である。
【図９】下線抽出対象画像の説明図である。
【図１０】図９の画像に対して罫線と文字行を抽出した結果を示す図である。
【図１１】図９の画像に対して罫線を抽出し，文字を切り出した結果を示す図である。
【図１２】文字行の一部の文字に対してのみ下線が印刷されている画像の説明図である。
【図１３】図３のステップ３１４の下線抽出処理に関するＰＡＤ図である。
【図１４】下線抽出対象画像の説明図である。
【図１５】図１４の画像に対して枠線を構成しないランレングスデータと長い下線線を抽出した結果を示す図である。
【図１６】図３のステップ３１４の下線抽出処理に関するＰＡＤ図である。
【図１７】図１４の画像に対して枠線を構成しないランレングスデータを抽出し，ランレングスデータの長さを傾き方向に投影した結果を示す図である。
【図１８】図３のステップ３２８の帳票識別処理に関するＰＡＤ図である。
【図１９】本発明の一実施例である，ソート機能をもつ登記情報認識システムの構成図である。
【符号の説明】
２００…画像入力、２０４…罫線抽出、２０６…枠抽出、２０８…表特徴抽出、２４６…帳票識別、２２２…文字切り出し、２２４…文字識別、２３６…単語照合、２４０…内容照合
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to forms, in particular to the various forms on which registration information relating to real estate is recorded, and more particularly to a form recognition method that reads character data from registered notices and enters it automatically.
[0002]
[Prior art]
Examples of the prior art relating to identification of the form type include the following. The first is a method of identifying the form type by reading the ID number indicating the form type described at the same position for all types of forms. The second is a method of identifying the form type by identifying the frame structure when the frame structure is different for each form type. This example is described in JP-A-7-141462.
[0003]
[Problems to be solved by the invention]
There are currently seven types of registered notices relating to real estate. These forms are used for data entry for real-estate taxation. However, because the notices carry no ID number specifying the form type, the conventional method of identifying a form by reading an ID number cannot be used. Furthermore, these are non-standard forms whose frame shapes differ even within the same type, so the conventional method of identifying a form from its frame structure cannot be used either. In addition, when the conventional method of identifying a form by reading the characters in its title is used, the identification accuracy depends heavily on the accuracy of character recognition. The form names of the registered notices are of seven types: “Land registered notice on rights”, “Building registered notice on rights (general)”, “Building registered notice on rights (exclusive)”, “Land registered notice on display”, “Building registered notice on display (general)”, “Building registered notice on display (one building)”, and “Building registered notice on display (exclusive)”. Of these, “Building registered notice on display (general)” and “Building registered notice on display (one building)” differ by only one character, so the identification accuracy for these two types may be low.
[0004]
Accordingly, a first object of the present invention is to propose a form recognition means having high-accuracy form identification means for reading targets that comprise many types of forms.
[0005]
In the conventional underline detection method, every ruled line other than a frame line is treated as an underline, so noise components such as the horizontal strokes of characters may be erroneously extracted as underlines. Accordingly, a second object of the present invention is to propose a form recognition means having high-accuracy underline detection means.
[0006]
[Means for Solving the Problems]
In a first aspect, the present invention provides a form recognition method, which is a registration information recognition method that inputs a surface image of a registered notice and reads characters, comprising: first form identification means for identifying the type of the registered notice by using character line extraction means for extracting character lines from the image of the registered notice, character line selection means for selecting the character line of the form name from the positional relationship between the extracted character lines and the frames, and character identification means for reading the character line of the form name; second form identification means for identifying the type of the registered notice by using ruled line extraction means for extracting ruled lines from the image of the registered notice and table feature extraction means for extracting table features from the extracted ruled lines, the type being identified from the table features; and third form identification means for identifying the type of the registered notice by using character line extraction means for extracting character lines from the image of the registered notice, character identification means for reading the extracted character lines, and item name selection means for selecting item names of the form from the reading results, the type being identified from the combination of item names; wherein the type of the registered notice is identified by combining the results of the three means.
[0007]
In a second aspect, the present invention provides a form recognition method, which is a registration information recognition method that inputs a surface image of a registered notice and reads characters, comprising: character line extraction means for extracting character lines and ruled lines from the image of the registered notice; ruled line type determination means for distinguishing, among the extracted ruled lines, frame ruled lines from ruled lines that are not frame ruled lines; means for detecting the character lines in a frame that contains a ruled line that is not a frame ruled line; and underline detection means for determining, from the positional relationship between the character lines in that frame and the non-frame ruled line in that frame, whether the non-frame ruled line is an underline.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described in detail. Note that the present invention is not limited thereby.
[0009]
FIG. 1 is a configuration diagram of a registration information system according to an embodiment of the present invention. A recognition unit 101 that recognizes registration information and a correction unit 105 that corrects a recognition result are connected by a network 104, and recognition and correction can be performed in parallel in the input center 111. In the process, first, an image of the registered notice 100 is input by the scanner 102. Next, the recognition computer 103 recognizes characters and ruled lines, and the correction computer 106 confirms the correction of the recognition result. It also checks against a dictionary or code table and outputs code data. The recognition result is stored in the registration information database 109 connected to the host computer 108 in the remote calculation center 110 via the communication control computer 107. The correction unit 105 uses a part of the recognition result, accesses the registration information database 109, and reads the registered registration information. The read registration information and a part of the recognition result are collated to test whether there is any contradiction.
[0010]
FIG. 2 is a block diagram showing the registration information recognition process. The recognition unit 101 reads the form image and outputs the reduced image 248, frame coordinates 250, underline coordinates 252, character line coordinates 254, form type 256, recognition result lattice 258, and character coordinates 260 to the correction unit 105. In the correction unit 105, the operator corrects the recognition result based on these input data. The image input unit 200 binarizes the image of the form surface into black and white and inputs it.
[0011]
The input image is output to the image reduction unit 202 and the character line image extraction unit 218. The image reduction unit 202 reduces the form image and outputs a reduced image 248 for speeding up subsequent processing. In the reduction process, an OR process is performed for each pixel so that a thin ruled line is not faded after reduction. The ruled line extraction unit 204 extracts a solid line and a dotted ruled line for the reduced image. The solid line is extracted based on the continuous connection of black pixels. The dotted lines are extracted based on the arrangement and size constraints of the circumscribed rectangle of the connected components of black pixels. The frame extraction unit 206 obtains a frame in which the ruled line surrounds the four sides from the ruled line extracted in 204 and outputs the vertex coordinates 250 of the frame. The table feature extraction unit 208 extracts a table feature amount, which is a collection of frames, from the frame information extracted in 206. This feature amount includes the number of vertical and horizontal ruled lines, the connection relationship between ruled lines, the positional relationship of frames, and the like.
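The OR-based reduction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `reduce_or`, the list-of-rows image representation (1 = black, 0 = white), and the block size `factor` are assumptions introduced here.

```python
def reduce_or(image, factor=2):
    """Shrink a binary image by `factor`, OR-ing the pixels of each block.

    `image` is a list of equal-length rows of 0/1 values (assumed layout).
    A reduced pixel is black if ANY pixel in its source block is black,
    so a one-pixel-thick ruled line survives the reduction.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block_has_black = any(
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            )
            row.append(1 if block_has_black else 0)
        reduced.append(row)
    return reduced


# A one-pixel-thick horizontal line on a 4x4 image.
thin_line = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
half_size = reduce_or(thin_line, factor=2)
```

Plain subsampling that kept only every second row starting at row 0 would drop this line entirely, which is the fading that the OR operation is meant to avoid.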
[0012]
On the other hand, the character line extraction unit 212 extracts character lines, each a set of characters, from the reduced image output from 202. Here, among the connected components of black pixels, those whose size suggests a character are selected, and, based on the vertex coordinates of their circumscribed rectangles, the circumscribed rectangles presumed to form a sequence of characters are merged to generate a character line. The line-frame correspondence unit 214 compares the vertex coordinates of the character lines extracted in 212 with the vertex coordinates of the frames extracted in 206 to determine in which frame each character line lies, or whether it lies outside every frame, and outputs the vertex coordinates 254 of the character lines contained in each frame and of the character lines outside the frames. The underline extraction unit 216 extracts underlines based on the ruled line coordinates extracted in 204, the vertex coordinates of the frames extracted in 206, and the coordinates of the character lines within frames extracted in 214, and outputs the underline coordinates 252. Further, the character line image extraction unit 218 cuts out the images of the character line portions from the image input at 200, based on the character line coordinates extracted at 214. In the character segmentation and identification unit 220, the character segmentation unit 222 and the character identification unit 224 cooperate to cut out the characters one by one and output the character coordinates 260. Further, the character identification unit 224 identifies each cut-out single-character image pattern using the identification dictionary 226. The form name collation unit 228 takes as input the character identification result output from the character identification unit 224, and the word collation unit 230 collates it with the form name words stored in the form name dictionary 232, thereby correcting errors in the recognition result for the form name and obtaining the form name.
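As a rough illustration of the character line generation in 212 and the line-frame correspondence in 214, the sketch below merges character-sized bounding boxes into lines and then looks up the containing frame. The names `merge_into_lines` and `frame_of`, the `(x0, y0, x1, y1)` box layout, and the thresholds `max_char_size` and `max_gap` are all assumptions made for this example; the description does not specify them.

```python
def merge_into_lines(boxes, max_char_size=40, max_gap=10):
    """Merge character-sized bounding boxes into character lines.

    Boxes are (x0, y0, x1, y1) tuples. Boxes larger than `max_char_size`
    in either direction are ignored as non-characters (for example ruled
    lines). A box joins an existing line when it overlaps the line
    vertically and starts within `max_gap` pixels of the line's right edge.
    """
    char_boxes = [
        b for b in boxes
        if b[2] - b[0] <= max_char_size and b[3] - b[1] <= max_char_size
    ]
    lines = []
    for x0, y0, x1, y1 in sorted(char_boxes):  # left to right
        for i, (lx0, ly0, lx1, ly1) in enumerate(lines):
            overlaps_vertically = y0 <= ly1 and ly0 <= y1
            close_horizontally = x0 - lx1 <= max_gap
            if overlaps_vertically and close_horizontally:
                lines[i] = (min(lx0, x0), min(ly0, y0), max(lx1, x1), max(ly1, y1))
                break
        else:
            lines.append((x0, y0, x1, y1))
    return lines


def frame_of(line, frames):
    """Return the index of the frame that contains `line`, or None if outside all frames."""
    x0, y0, x1, y1 = line
    for i, (fx0, fy0, fx1, fy1) in enumerate(frames):
        if fx0 <= x0 and fy0 <= y0 and x1 <= fx1 and y1 <= fy1:
            return i
    return None


boxes = [(0, 0, 10, 10), (12, 1, 22, 11), (100, 0, 110, 10), (0, 50, 200, 52)]
lines = merge_into_lines(boxes)
frames = [(-5, -5, 50, 20)]
line_frames = [frame_of(line, frames) for line in lines]
```

The greedy left-to-right merge is only one possible grouping strategy; a production system would also need to handle skew and lines that touch frame borders.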
[0013]
The words stored in the form name dictionary 232 are the form names to be recognized. The form names to be recognized are known in advance, and each form name corresponds one-to-one to a form type. Further, the item collation unit 234 takes as input the character recognition results that were not matched in 228, and the word collation unit 236 collates them with the item name words stored in the item dictionary 238, thereby correcting errors in the recognition result for item names and obtaining the item names. The words stored in the item dictionary 238 are the items printed on the forms to be recognized. The content collation unit 240 takes as input the character recognition results that were not matched in 234, and the word collation unit 242 collates them with the content words stored in the content dictionary 244, thereby correcting errors in the recognition result for contents. Here, “content” refers to what is written for an item name on the form. For example, the contents for the item “land category” (地目) include “residence” and “park”. The words stored in the content dictionary 244 are those words, among the words describing the contents written on the forms to be recognized, whose use is determined in advance. The recognition result lattice 258 output as a result of the processing of 240 lists, for each character, the candidate characters produced by character identification in descending order of similarity. Errors in this character identification result have been corrected by form name collation, item collation, and content collation.
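One simple way to collate a recognition lattice with a dictionary is sketched below. The description does not state the matching algorithm, so the rank-sum cost, the same-length restriction, and the name `match_word` are assumptions chosen for illustration only.

```python
def match_word(lattice, dictionary):
    """Pick the dictionary word best supported by a recognition lattice.

    `lattice` is a list with one entry per character position; each entry
    lists candidate characters in descending order of similarity.
    A word matches only if every one of its characters appears among the
    candidates at the same position. The cost is the sum of candidate
    ranks, so words built from top-ranked candidates are preferred.
    Returns None when no dictionary word matches.
    """
    best_word, best_cost = None, None
    for word in dictionary:
        if len(word) != len(lattice):
            continue
        cost = 0
        for ch, candidates in zip(word, lattice):
            if ch not in candidates:
                break
            cost += candidates.index(ch)
        else:
            if best_cost is None or cost < best_cost:
                best_word, best_cost = word, cost
    return best_word


# The top candidates read "宅他", which is not a word; the second
# candidate at position 2 lets the dictionary recover "宅地".
lattice = [["宅", "宇"], ["他", "地"]]
content_dictionary = ["宅地", "公園", "居宅"]
corrected = match_word(lattice, content_dictionary)
```

Results that return `None` here correspond to the "not matched" recognition results that the description passes on to the next collation unit.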
[0014]
On the other hand, the form identification unit 246 takes as input the outputs of the table feature extraction unit 208, the form name collation unit 228, and the item collation unit 234, identifies the type of the form from the table features, the form name, and the item names, and outputs the form type 256.
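The description of the form identification step (paragraphs 0036, 0043 and 0044 of the original text) states that the three results can be combined by majority vote and that results which are individually ambiguous can complement one another. The sketch below shows one way this could work when each source yields a set of candidate form types. The function name `identify_form`, the short type labels, and the tie handling (returning `None`) are assumptions introduced for this example.

```python
from collections import Counter


def identify_form(table_candidates, name_candidates, item_candidates):
    """Combine three candidate sets by voting.

    Each argument is a collection of form types that the corresponding
    source (table features, form name, item names) considers possible.
    Every source casts one vote for each of its candidates. The type with
    the most votes wins; None is returned when there are no votes or the
    top two types tie.
    """
    votes = Counter()
    for candidates in (table_candidates, name_candidates, item_candidates):
        votes.update(set(candidates))
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: leave the decision to the operator
    return ranked[0][0]


# Table features alone leave three candidates; one form-specific item
# name narrows the result to a single type (the form name was unreadable).
table = {"display-land", "display-building-general", "display-building-exclusive"}
name = set()
items = {"display-land"}
form_type = identify_form(table, name, items)
```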
[0015]
FIG. 3 is a diagram showing the processing flow of the registration information recognition shown in FIG. 2. In step 300, an image is input, and in step 302, the image is reduced. Next, ruled lines are extracted from the image in step 304, and frames are extracted from the ruled lines in step 306. In step 308, table features are extracted. In step 310, character lines are extracted from the reduced image, and in step 312, the extracted lines are associated with frames. In step 314, underlines are extracted from the coordinates of the ruled lines, frames, and character lines. In step 316, only the images of the character line portions are extracted from the form image based on the coordinate values of the character lines. In step 318, each character line image is divided into single-character images, and in step 320, character identification is performed on the cut-out image patterns. In step 322, the form name is identified by collating the character identification result with the form name words. In step 324, the item names are identified by collating the character identification result with the item name words. In step 326, the contents are identified by collating the character identification result with the content words. In step 328, the type of the form is identified from the table features obtained in step 308, the form name obtained in step 322, and the item names obtained in step 324. In step 330, the results obtained in steps 300 to 328 are output.
[0016]
FIG. 4 is a simplified illustration, for the sake of explanation, of an image of a registered notice to be recognized. In the example of the form image 400, the form name “Building registered notice on rights (exclusive)” 401 is printed. Horizontal ruled lines 402, 404, 406, and 408 and vertical ruled lines 410, 412, 414, and 416 are printed. The items are “code” 418, “location” 420, and “land category” (地目) 422. “1” (424) and “2” (426) are written as the contents of “code”, and “280, Higashi-Koigakubo 1-chome, Kokubunji-shi” is written at 428 and 430 as the contents of “location”. As the contents of “land category”, “residential land” (432) and “park” (434) are written. Further, underlines 436, 438, and 440 are printed under the contents 424 “1”, 428 “280, Higashi-Koigakubo 1-chome, Kokubunji-shi”, and 432 “residential land”, respectively.
[0017]
FIG. 5 shows the result of the ruled line extraction of step 304 in FIG. 3 for the form image in FIG. 4. In (a), 500 is the result of extracting horizontal ruled lines, and in (b), 520 is the result of extracting vertical ruled lines. In (a), 502 to 508 are extracted as the ruled lines corresponding to the horizontal ruled lines 402 to 408 in FIG. 4, respectively. 510, 514, and 516 are extracted as the underlines corresponding to the underlines 436, 438, and 440, respectively. 512 and 518 were extracted as ruled lines because the horizontal strokes of the characters “市東恋” (from the address in FIG. 4) were joined together. This joining of separate horizontal strokes occurs because, when horizontal ruled lines are extracted, black pixels are contracted and expanded in the horizontal direction, which connects black pixels that lie close together. In (b), 522 to 528 are extracted as the ruled lines corresponding to the vertical ruled lines 410 to 416 in FIG. 4, respectively.
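The stroke-joining effect can be reproduced on a single pixel row. The sketch below assumes the operation is a horizontal expansion followed by a contraction (a morphological closing); the description only says that black pixels are contracted and expanded, so the order, the `radius` value, and the helper names are assumptions.

```python
def dilate_row(row, radius):
    """Horizontal expansion: a pixel turns black if any pixel within `radius` is black."""
    n = len(row)
    return [1 if any(row[max(0, i - radius):min(n, i + radius + 1)]) else 0
            for i in range(n)]


def erode_row(row, radius):
    """Horizontal contraction: a pixel stays black only if its whole window is black.

    Pixels outside the row are treated as white.
    """
    n = len(row)
    return [1 if i - radius >= 0 and i + radius < n
            and all(row[i - radius:i + radius + 1]) else 0
            for i in range(n)]


def close_row(row, radius):
    """Expansion followed by contraction; fills gaps of up to 2 * radius pixels."""
    return erode_row(dilate_row(row, radius), radius)


def black_runs(row):
    """Return (start, length) for each run of consecutive black pixels."""
    runs, start = [], None
    for i, value in enumerate(list(row) + [0]):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i - start))
            start = None
    return runs


# Two 3-pixel character strokes separated by a 2-pixel gap, plus an isolated dot.
row = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
before = black_runs(row)
after = black_runs(close_row(row, radius=1))
```

After the closing, the two short strokes form one 8-pixel run, which a length threshold would accept as a ruled line; this is how 512 and 518 can arise from character strokes.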
[0018]
FIG. 6 shows the result of frame extraction processing in step 306 in FIG. 3 for the form image in FIG. Reference numeral 600 denotes a frame extraction result. Nine frames 602 to 618 are extracted.
[0019]
FIG. 7 shows the result of the character line extraction of step 310 in FIG. 3 for the form image in FIG. 4. Reference numeral 700 denotes the character line extraction result. For the character lines 401, 418, 420, 422, and 424 to 434 in FIG. 4, the circumscribed rectangles 702 to 720 are extracted, respectively.
[0020]
FIG. 8 is a processing flow for the underline extraction of step 314 in FIG. 3. Using the results of the ruled line extraction 304, the frame extraction 306, and the character line extraction 310, step 800 extracts the ruled lines that do not constitute a frame. In step 802, the following processing is repeated for each ruled line extracted in step 800. In step 804, the coordinates of the character line and the coordinates of the ruled line are compared. The comparison method is described with reference to FIGS. 9 and 10. In step 806, it is determined whether the compared values satisfy the criteria. If they do, in step 808 the ruled line under comparison is taken to be an underline. Note that when two underlines extracted in step 808 have end points separated by only a minute gap and lie on each other's extension, they may be treated as a single underline. Further, if the length of an underline extracted in step 808 is equal to or less than a reference value, it may be treated as not being an underline.
[0021]
FIG. 9 is an example of a form frame for explaining the processing flow of FIG. 8. Horizontal ruled lines 900 and 902, vertical ruled lines 904 and 906, a character line 908, and an underline 910 are printed.
[0022]
FIG. 10 shows the result of extracting ruled lines and character lines from the example of FIG. 9. Underline determination is explained using this figure. In the underline determination, among the ruled lines within the same frame as a character line, a ruled line that lies below the character line and has almost the same length as the character line is determined to be an underline. In FIG. 10, 1007 is the area where characters were printed, and 1008 is the circumscribed rectangle of 1007. The ruled lines 900 to 910 in FIG. 9 are extracted as 1000 to 1010, respectively. Further, 1012 is a horizontal stroke of a character that was extracted as a ruled line. From the extracted ruled lines, 1010 and 1012 are extracted as ruled lines that do not constitute a frame. Below, 1010 is used as an example of a ruled line that is determined to be an underline, and 1012 as an example of one that is not.
[0023]
First, the determination is made for 1010 in FIG. 10. The difference d11 (1014) between the y coordinate of the lower end of the ruled line and the y coordinate of the lower end of the character line is obtained. Next, the difference d12 (1016) between the y coordinate of the upper end of the ruled line and the y coordinate of the upper end of the character line is obtained. Further, the difference between the length L1 (1018) of the ruled line in the x direction and the length Lc (1020) of the character line in the x direction is obtained. These values are compared with the reference values α, β, γ1, and γ2. If the ruled line lies below the character line with d11 less than α, d12 is equal to or greater than β, and L1 − Lc is between γ1 and γ2 inclusive, the ruled line is taken to be an underline. The values of α, β, γ1, and γ2, which are the criteria for this determination, can be determined empirically.
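The three comparisons can be written as a single predicate. The sketch below is an illustration under stated assumptions: image coordinates with y increasing downward, boxes given as `(x0, y0, x1, y1)`, differences computed as ruled line minus character line, and a hypothetical function name `is_underline`. The sample coordinates and threshold values are invented for the example.

```python
def is_underline(rule, line, alpha, beta, gamma1, gamma2):
    """Decide whether a non-frame ruled line underlines a character line.

    `rule` and `line` are (x0, y0, x1, y1) boxes with y increasing downward
    (an assumption of this sketch).
      d1: rule bottom minus line bottom; must satisfy 0 <= d1 < alpha
          (the rule lies below the text, but not far below)
      d2: rule top minus line top; must be at least beta
          (the rule does not run through the body of the text)
      length difference (rule minus line) must lie in [gamma1, gamma2]
    """
    rx0, ry0, rx1, ry1 = rule
    lx0, ly0, lx1, ly1 = line
    d1 = ry1 - ly1
    d2 = ry0 - ly0
    length_diff = (rx1 - rx0) - (lx1 - lx0)
    return 0 <= d1 < alpha and d2 >= beta and gamma1 <= length_diff <= gamma2


text_line = (10, 20, 110, 40)        # 100 px long, 20 px high
true_underline = (10, 44, 112, 45)   # just below the text, similar length
stroke_noise = (30, 28, 60, 29)      # horizontal stroke inside the text

params = dict(alpha=10, beta=20, gamma1=-20, gamma2=20)
underline_ok = is_underline(true_underline, text_line, **params)
stroke_ok = is_underline(stroke_noise, text_line, **params)
```

Dropping the lower bound on `d1` would correspond to the variant mentioned in the description in which negative values are tolerated so that underlines overlapping the characters are still accepted.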
[0024]
For example, if the distance between the character line and the underline is constant, that distance can be used as the value of α. If it is not constant, 1/2 of the difference between the frame height and the character height can be used. For β, if the distance between the lower end of the character line and the underline and the height of the characters are both constant, the sum of these two values can be used. The values of γ1 and γ2 allow a margin of about one character: γ1 can be the character width multiplied by (-1), and γ2 can be the character width or the like. In setting the values of α, β, γ1, and γ2, a margin can be added in order to provide robustness against inclination of the form and against blurred or smeared lines. In addition, if a negative value is allowed for d11, it is possible to cope with the case where the underline overlaps the characters.
[0025]
Next, the determination is made for 1012 in FIG. 10. First, the difference d21 (1022) between the y coordinate of the lower end of the ruled line and the y coordinate of the lower end of the character line is obtained. Next, the difference d22 (1024) between the y coordinate of the upper end of the ruled line and the y coordinate of the upper end of the character line is obtained. Further, the difference between the length L2 (1026) of the ruled line in the x direction and the length Lc (1020) of the character line in the x direction is obtained. When these values are compared with the above α, β, γ1, and γ2, d21 is a large negative value and d22 is smaller than β, so 1012 is determined not to be an underline.
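The comparison against α, β, γ1, and γ2 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment. It assumes image coordinates in which y increases downward, that each difference is taken as "ruled line minus character line", and that d11 has a lower bound (the hypothetical parameter `d1_min`, 0 by default). The names `Box`, `is_underline`, and all numeric values are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left


def is_underline(rule: Box, text: Box, alpha: int, beta: int,
                 gamma1: int, gamma2: int, d1_min: int = 0) -> bool:
    """Apply the three evaluation values to one non-frame ruled line.

    d1   : bottom of ruled line minus bottom of character line (d11 / d21)
    d2   : top of ruled line minus top of character line (d12 / d22)
    dlen : length of ruled line minus length of character line (L - Lc)
    d1_min is an assumed lower bound; a negative value tolerates overlap.
    """
    d1 = rule.bottom - text.bottom
    d2 = rule.top - text.top
    dlen = rule.width - text.width
    return (d1_min <= d1 < alpha) and (d2 >= beta) and (gamma1 <= dlen <= gamma2)


# Illustrative values only: a 30-pixel-high, 200-pixel-wide character line.
text_line = Box(left=100, top=50, right=300, bottom=80)
long_rule = Box(left=98, top=84, right=305, bottom=86)    # plays the role of 1010
stroke = Box(left=120, top=60, right=150, bottom=62)      # plays the role of 1012

print(is_underline(long_rule, text_line, alpha=10, beta=30, gamma1=-20, gamma2=20))
print(is_underline(stroke, text_line, alpha=10, beta=30, gamma1=-20, gamma2=20))
```

With these sample values the long ruled line passes all three tests, while the stroke is rejected because its d1 is negative and its d2 is below β, which mirrors the reasoning given for 1012.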
[0026]
Note that d11 and d12 used here may be normalized by the height of the characters, the height of the frame, or the like. Further, the ratio of L1 to Lc may be compared instead of their difference. The values of α, β, γ1, and γ2 are set according to how the compared quantities are defined.
[0027]
Also, three evaluation values are used here: the difference 1014 between the y coordinate of the lower end of the ruled line and the y coordinate of the lower end of the character line, the difference 1016 between the y coordinate of the upper end of the ruled line and the y coordinate of the upper end of the character line, and the difference between the length of the ruled line in the x direction (1018) and the length of the character line in the x direction (1020). However, only one or two of them may be used as necessary.
[0028]
FIG. 11 is an example in which character coordinates are used instead of character line coordinates in the underline extraction process in step 314 of FIG. 3. By comparing the ruled line 1108, which does not constitute a frame, with the circumscribed rectangle 1112 of a character using the determination criteria described with reference to FIG. 10, it can be determined that 1108 is an underline. Similarly, by comparing the ruled line 1110, which does not constitute a frame, with the circumscribed rectangle 1114 of a character, it can be determined that 1110 is not an underline.
[0029]
FIG. 12 is an example in which an underline is printed under only some of the characters in a character line. In a frame 1200, a character line 1202 and an underline 1204 are printed. If the method of FIG. 11 is used, it can be determined that the underline is printed only under the characters “1 chome number 280” in the character line.
[0030]
FIG. 13 is another processing flow for the underline extraction process in step 314 of FIG. 3. In the registered notification, a plurality of underlines are often present on the same line, as with 436, 438, and 440 in FIG. 4. However, because the underline 436 is short, its length is about the same as that of a horizontal stroke in a character, so it may be missed during ruled line extraction. The purpose of this processing is to correctly extract short underlines that may be missed during ruled line extraction. To this end, a long underline is extracted first, and a ruled line lying on the extension of that underline is then determined to be an underline.
[0031]
Hereinafter, each step of FIG. 13 will be described. In step 1300, only the long underlines are extracted. This can be realized using the underline determination processing described above. In step 1302, run length data that does not constitute a frame line is extracted from the run length data in the horizontal direction. In step 1304, the processing in steps 1306 and 1308 is repeated for each piece of extracted run length data. In step 1306, it is determined whether or not the target run length data lies on the extension of the underline. If it does, it is extracted in step 1308 as run length data constituting an underline. In step 1310, a ruled line composed of the run length data determined in step 1308 to constitute an underline is extracted as an underline. In addition, two underlines extracted in step 1310 whose end points are separated by only a minute interval and which lie on each other's extension can be regarded as one underline. Further, if the length of an underline extracted in step 1310 is equal to or less than a reference value, it can be regarded as not being an underline.
[0032]
FIG. 14 is an example of a form frame for explaining the processing flow of FIG. 13. Horizontal ruled lines 1400 and 1402, vertical ruled lines 1404 to 1410, underlines 1412 to 1416, and character lines 1418 to 1422 are printed.
[0033]
FIG. 15 shows the result of extracting, from the image of FIG. 14, the horizontal run length data that does not constitute a frame and the long underline. Reference numeral 1500 is the long underline extracted in step 1300 of FIG. 13. Of the connected components of the run length data in the horizontal direction, 1502 and 1504 are within the allowable range w (1510) from the extension line 1508 of 1500, and are therefore determined to be underlines. Since 1506 is outside w, it is determined not to be an underline.
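The extension-line test of step 1306 can be sketched as follows. This is a minimal sketch under stated assumptions: each connected component of run length data is reduced to a horizontal segment with a single y value, the extension line passes through the midpoint of the long underline, and the form skew is given as a slope (0 for an upright form). The names `Run`, `on_extension`, and the sample coordinates are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A horizontal connected component: x range and representative y."""
    x0: int
    x1: int
    y: float


def on_extension(long_underline: Run, candidate: Run, w: float,
                 slope: float = 0.0) -> bool:
    """Return True if the candidate lies within w of the extension line.

    The extension line passes through the midpoint of the long underline
    with the given slope (dy/dx); slope=0.0 assumes no skew.
    """
    ref_mid = (long_underline.x0 + long_underline.x1) / 2
    cand_mid = (candidate.x0 + candidate.x1) / 2
    expected_y = long_underline.y + slope * (cand_mid - ref_mid)
    return abs(candidate.y - expected_y) <= w


# Illustrative layout loosely following FIG. 15 (coordinates are made up).
long_ul = Run(x0=200, x1=600, y=100)      # role of 1500
candidates = [
    Run(x0=50, x1=90, y=101),             # role of 1502
    Run(x0=120, x1=160, y=99),            # role of 1504
    Run(x0=130, x1=150, y=80),            # role of 1506, a character stroke
]
flags = [on_extension(long_ul, c, w=3) for c in candidates]
print(flags)
```

Components accepted by this test would then be merged into underlines as in step 1310. Joining near-adjacent end points and discarding results that are too short are left out of the sketch.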
[0034]
FIG. 16 is another processing flow for the underline extraction process in step 314 of FIG. 3. In this process, underlines are extracted using a histogram created by projecting, in the inclination direction and from the midpoint of each run, the length of the horizontal run length data that does not constitute a frame. Hereinafter, each step in FIG. 16 will be described. In step 1600, run length data that does not constitute a frame line is extracted from the run length data in the horizontal direction. In step 1602, the length of each extracted run is projected in the inclination direction from the midpoint of the run to create a histogram. In step 1604, the processing in steps 1606 and 1608 is repeated for each peak in the histogram. In step 1606, it is determined whether or not the projection value is equal to or greater than a reference value. If it is, it is determined in step 1608 that the run length data projected onto that peak constitutes an underline. In step 1610, underlines are extracted from the run length data determined in step 1608 to constitute underlines. Note that two underlines extracted in step 1610 whose end points are separated by only a minute interval and which lie on each other's extension can be regarded as one underline. Further, if the length of an underline extracted in step 1610 is equal to or less than a reference value, it can be regarded as not being an underline.
[0035]
FIG. 17 shows the result of extracting, from the image of FIG. 14, the horizontal run length data that does not constitute a frame and creating a histogram. Reference numerals 1700 to 1706 are connected components of the horizontal run length data extracted in step 1600 of FIG. 16. Histograms 1708 and 1710 are the results of the projection in step 1602. In step 1606, for each of 1708 and 1710, the area within the allowable range w (1712) is compared with the reference value. In this case, if 1708 is equal to or greater than the reference value and 1710 is less than the reference value, it can be determined that 1700, 1702, and 1704 are underlines and that 1706 is not an underline.
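The projection-based flow of FIG. 16 can be sketched as follows. This is an illustrative sketch, not the embodiment itself. It assumes each run is given as `(x0, x1, y)` with inclusive pixel coordinates, that the skew is known as a slope, and that peaks are picked greedily (the tallest remaining bin first), which is a simplification chosen here because the source does not specify how peaks are located. The function name `find_underline_runs` and the sample data are hypothetical.

```python
from collections import defaultdict


def find_underline_runs(runs, slope, w, reference):
    """Flag the runs that belong to an underline.

    runs      : list of (x0, x1, y) horizontal runs outside the frame lines
    slope     : skew of the form as dy/dx (0.0 for an upright form)
    w         : allowable range around each histogram peak, in bins
    reference : minimum accumulated run length for a peak to be accepted
    """
    hist = defaultdict(int)
    bin_of = []
    for x0, x1, y in runs:
        mid_x = (x0 + x1) / 2
        # Project the run midpoint along the skew direction onto the y axis.
        b = round(y - slope * mid_x)
        hist[b] += x1 - x0 + 1            # accumulate the run length
        bin_of.append(b)

    accepted = set()
    remaining = dict(hist)
    while remaining:
        # Greedy peak picking (an assumption of this sketch).
        peak = max(remaining, key=remaining.get)
        window = [b for b in remaining if abs(b - peak) <= w]
        area = sum(remaining[b] for b in window)
        if area >= reference:             # corresponds to the test in step 1606
            accepted.update(window)       # corresponds to step 1608
        for b in window:
            del remaining[b]
    return [b in accepted for b in bin_of]


# Made-up runs: three pieces of underline near y=100 and one stroke at y=80.
sample_runs = [(50, 90, 100), (120, 160, 101), (200, 600, 100), (130, 150, 80)]
print(find_underline_runs(sample_runs, slope=0.0, w=2, reference=100))
```

Because the short pieces share a peak with the long underline, they are accepted even though each would be too short on its own, whereas the isolated stroke does not reach the reference value.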
[0036]
FIG. 18 is a process flow relating to the form identification process in step 328 of FIG. 3. In step 308, the feature amounts of the table are extracted. In step 322, the word matching result for the form name is obtained. In step 324, the word matching result for the item names is obtained. In step 1800, the form type is identified by a majority vote among the form types derived from the respective results of steps 308, 322, and 324.
[0037]
Features of the table extracted in step 308 include ruled line connection relationships, the number of frames, the frame arrangement relationship, the number of vertical ruled lines, the number of horizontal ruled lines, and the like. When the ruled line connection relationship differs for each form type, the form type can be specified using the technique described in Japanese Patent Laid-Open No. 7-141462.
[0038]
[Table 1]

[0039]
Table 1 shows, as an example of the table features extracted in step 308, the number of vertical solid ruled lines in each registered notification to be recognized. It can be seen that the number of vertical solid ruled lines is one of 7, 8, 10, 11, 12, and 16. Except in the cases of 8 and 10, the form type is uniquely determined. In the cases of 8 and 10, the candidate form types can be listed.
[0040]
For the form name words to be collated in step 322, each complete form name may be registered as a single word, or only characteristic words such as “right”, “display”, “building”, “land”, “general”, “exclusive”, and “one building” may be registered.
[0041]
[Table 2]

[0042]
Table 2 shows a part of the item names to be collated in step 324. As Table 2 shows, there are item names common to multiple forms, such as “location”, and item names specific to particular forms, such as “land area”, “building number of one building”, “building”, and “table”. Even for a form that has no form-specific item name, the form type can be determined from the combination of items that are present or absent, so that the five types of forms other than “Building registered notice regarding display (general)” and “Building registered notice regarding display (exclusive)” can be identified. For example, if the item “floor area” exists and the item “building number of one building” does not exist, the form can be identified as “Building registered notice regarding rights (general)”.
[0043]
In step 1800, the results of steps 308, 322, and 324 are integrated to identify the form type. As a means of integration, a majority vote among the above three results can be used.
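A majority vote over the three results can be sketched as follows. This is a minimal sketch, assuming each step yields at most one form type (or `None` when it cannot decide) and that a tie is reported as "undecided". The function name `majority_vote` and the short labels are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the form type named by most voters, or None on a tie.

    votes: one entry per identification step (table features, form name,
    item names); None means that step could not decide.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    # Treat a tie for first place as undecided (an assumption of this sketch).
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical labels standing in for the seven form names.
print(majority_vote(["display/land", "display/land", "rights/land"]))
print(majority_vote(["display/land", "rights/land", "rights/building"]))
```

When the vote is undecided, the form could be passed to the other integration strategies described in this section or to manual confirmation. That choice is outside the sketch.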
[0044]
In step 1800, even if none of steps 308, 322, and 324 can uniquely identify the form type on its own, the form type can be identified by letting the processing results of the steps complement one another. For example, when the number of vertical solid ruled lines extracted in step 308 corresponds in Table 1 to three candidate form types, namely “Land registered notice regarding display”, “Building registered notice regarding display (general)”, and “Building registered notice regarding display (exclusive)”, the form type cannot be determined from step 308 alone. However, if the item name “table” is extracted in step 324, the form can be uniquely determined to be “Land registered notice regarding display”.
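This mutual complementing can be viewed as intersecting the candidate sets produced by each step. The sketch below assumes each step returns a set of candidate form types and that an empty result means "no information". The function name `combine_candidates` is hypothetical, and the candidate sets simply restate the example in the text rather than the contents of Table 1 or Table 2.

```python
def combine_candidates(*candidate_sets):
    """Intersect the candidate form types reported by each step.

    Empty or None inputs are treated as "this step gave no information"
    and are skipped (an assumption of this sketch).
    """
    informative = [set(c) for c in candidate_sets if c]
    if not informative:
        return set()
    return set.intersection(*informative)


# Candidates from the ruled-line count (step 308), as in the example above.
from_table_features = {
    "Land registered notice regarding display",
    "Building registered notice regarding display (general)",
    "Building registered notice regarding display (exclusive)",
}
# The item name "table" (step 324) points to the land form in the example.
from_item_names = {"Land registered notice regarding display"}

result = combine_candidates(from_table_features, None, from_item_names)
print(result)
```

If the intersection contains exactly one form type, identification succeeds. If it is empty, the steps contradict one another and a fallback is needed.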
[0045]
In step 1800, instead of using the results of all three steps 308, 322, and 324, the results of only two of the steps can be used.
[0046]
In step 1800, the results of steps 308, 322, and 324 need not be handled equally. Instead, the form can be identified from the result obtained in one step, and the results obtained in the other steps can be used to verify that identification.
[0047]
FIG. 19 is a configuration diagram of a registration information system according to an embodiment of the present invention. The components 101 to 109 are the same as those in FIG. 1. The sorter 1900 sorts the forms 100 in priority order of the contents of the registered notices, based on the results recognized by the recognition unit 101 and corrected by the correction unit 105. Two examples of sorting are given here. The first sorts the forms, for each town, in order of chome, house number, and number, using the characters corresponding to the location and house number. The second sorts them by creation date and number. The objects to be sorted may be either the registered notice forms themselves or the recognition result data.
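The two sort orders can be sketched on recognition result data as follows. This is only a sketch: the record type `Notice` and its field names are hypothetical, and it assumes that the address has already been parsed into town, chome, house number, and number, and that the "number" used with the creation date is a serial number.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Notice:
    """Hypothetical recognition result for one registered notice."""
    town: str
    chome: int
    house_number: int
    number: int
    created: date
    serial: int


def sort_by_address(notices):
    # First example: per town, in order of chome, house number, and number.
    return sorted(notices, key=lambda n: (n.town, n.chome, n.house_number, n.number))


def sort_by_creation(notices):
    # Second example: by creation date, then by (assumed) serial number.
    return sorted(notices, key=lambda n: (n.created, n.serial))


# Made-up records for illustration.
records = [
    Notice("B-town", 1, 280, 2, date(1996, 5, 9), 12),
    Notice("A-town", 2, 15, 1, date(1996, 5, 9), 3),
    Notice("A-town", 1, 40, 7, date(1996, 4, 1), 8),
]
print([n.serial for n in sort_by_address(records)])
print([n.serial for n in sort_by_creation(records)])
```

The same keys could drive a physical sorter for the paper forms, since the text allows either the forms or the recognition result data to be sorted.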
[0048]
[Effects of the invention]
According to the form recognition method of the present invention, the type of form can be identified with high accuracy even for an atypical form such as a registered notice.
[0049]
Further, according to the form recognition method of the present invention, underlines can be extracted with high accuracy without strokes in characters being mistaken for underlines.
[0050]
Further, according to the form recognition method of the present invention, the forms can be sorted based on the form recognition result.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a registration information recognition system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a registration information recognition process.
FIG. 3 is a PAD of the registration information recognition shown in FIG. 2.
FIG. 4 is an explanatory diagram of an image of a registered notification that is a recognition target.
FIG. 5 is a diagram showing a result of the ruled line extraction processing in step 304 of FIG. 3 performed on the image of FIG. 4.
FIG. 6 is a diagram illustrating a result of the frame extraction processing in step 306 of FIG. 3 performed on the image of FIG. 4.
FIG. 7 is a diagram illustrating a result of the character line extraction processing in step 310 of FIG. 3 performed on the image of FIG. 4.
FIG. 8 is a PAD related to the underline extraction process in step 314 of FIG. 3.
FIG. 9 is an explanatory diagram of an underline extraction target image.
FIG. 10 is a diagram illustrating a result of extracting ruled lines and character lines from the image of FIG. 9.
FIG. 11 is a diagram illustrating a result of extracting ruled lines and cutting out characters from the image of FIG. 9.
FIG. 12 is an explanatory diagram of an image in which an underline is printed only for some characters in a character line.
FIG. 13 is a PAD related to the underline extraction process in step 314 of FIG. 3.
FIG. 14 is an explanatory diagram of an underline extraction target image.
FIG. 15 is a diagram illustrating a result of extracting run-length data that does not constitute a frame line and a long underline from the image of FIG. 14.
FIG. 16 is a PAD related to the underline extraction process in step 314 of FIG. 3.
FIG. 17 is a diagram illustrating a result of extracting run-length data that does not constitute a frame line from the image of FIG. 14 and projecting the length of the run-length data in the tilt direction.
FIG. 18 is a PAD related to the form identification process in step 328 of FIG. 3.
FIG. 19 is a configuration diagram of a registration information recognition system having a sort function according to an embodiment of the present invention.
[Explanation of symbols]
200: Image input, 204: Ruled line extraction, 206: Frame extraction, 208: Table feature extraction, 246: Form identification, 222: Character extraction, 224: Character identification, 236: Word verification, 240: Content verification

Claims

A form recognition method for reading characters from an input image of the surface of a form, wherein:
an image input unit inputs the image of the form as a binarized image;
a frame extraction unit extracts, from the image of the form, ruled lines detected based on the continuity of black pixels, and identifies the ruled lines that do not constitute frame ruled lines; and
an underline extraction unit extracts an underline from the ruled lines that do not constitute the frame ruled lines, detects black pixels existing on an extension line of the underline, and determines that the detected arrangement of black pixels is an underline.