JP4616522B2

JP4616522B2 - Document recognition apparatus, document image region identification method, program, and storage medium

Info

Publication number: JP4616522B2
Application number: JP2001211476A
Authority: JP
Inventors: 利夫宮澤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-07-12
Filing date: 2001-07-12
Publication date: 2011-01-19
Anticipated expiration: 2021-07-12
Also published as: JP2003030584A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書認識装置、文書画像の領域識別方法、プログラム及び記憶媒体に関する。
【０００２】
【従来の技術】
従来、文書画像中の文字列や文字領域（コラム）の識別方法としては、各種の方式が知られている。
【０００３】
例えば、特開平06-020092号公報には、文書画像中から空白部を抽出し、この空白部の繋がりからなる空白セパレータを領域分割線として扱って領域を分割することにより、文書画像中の文字列や文字領域（コラム）を抽出する方法が提案されている。
【０００４】
また、黒画素の射影ヒストグラムを利用し、黒画素の分布の高い部分を文字列の範囲とする方法も知られている（秋山、増田「周辺分布、線密度、外接矩形特徴を併用した文書画像の領域識別」電子通信学会論文誌８６／８ＶｏｌＪ６９−Ｄ））。
【０００５】
【発明が解決しようとする課題】
ところで、従来の文書画像中の文字列や文字領域（コラム）を抽出する方法によれば、領域識別処理の後に行われる文字認識のための行切り出し処理において、文字領域には印鑑や図領域などは混在していないものとして処理を行っている。
【０００６】
しかしながら、現実には、印鑑や図など文字以外の領域が、領域分割の結果として文字と判定された領域に入り込むことがある。このような場合には、従来の方法では、文書画像中の文字列や文字領域（コラム）から文字行を切り出す行切りだし処理を行うことができず、文字抽出精度が低下するという問題があった。
【０００７】
本発明の目的は、文字抽出精度を向上させることである。
【０００８】
【課題を解決するための手段】
本発明は、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書認識装置において、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出手段と、前記外接矩形抽出手段により抽出された前記矩形の座標の最小値を１画素ライン毎に求め、求めた前記矩形の座標の最小値が、前記文字領域属性と識別された領域の始点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の終点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定手段とを備える。
また、本発明は、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書認識装置において、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出手段と、前記外接矩形抽出手段により抽出された前記矩形の座標の最大値を１画素ライン毎に求め、求めた前記矩形の座標の最大値が、前記文字領域属性と識別された領域の終点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の始点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定手段とを備える。
また、本発明において、前記外接矩形抽出手段は、入力画像のオリジナル画像から、前記矩形の抽出を行う。
また、本発明において、前記文字領域属性と識別された領域を、前記行間部分で再分割する再分割手段を備える。
【０００９】
したがって、文字領域属性と識別された領域内に文字以外の領域が存在するか否かが判定され、文字領域属性と識別された領域内に文字以外の領域が存在すると判定された場合、当該文字領域属性と識別された領域が再分割される。これにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することが可能になるので、文字抽出精度を向上させることが可能になる。
【００１１】
また、文字領域属性と識別された領域内に文字以外の領域が存在するか否かの判定が容易になる。
【００１３】
また、再分割が容易になる。
【００１４】
また、本発明は、前記再分割手段により再分割された前記文字領域属性と識別された領域において、前記行間と判定された領域に存在する前記矩形に実線が存在するか否かにより、前記文字以外の領域が存在するか否かを判定する。
【００１５】
したがって、行間と判定された領域には文字は存在しないことから、文字以外の領域が存在するか否かの判定が容易になる。
【００１８】
また、本発明は、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書画像の領域識別方法であって、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出工程と、前記外接矩形抽出工程により抽出された前記矩形の座標の最小値を１画素ライン毎に求め、求めた前記矩形の座標の最小値が、前記文字領域属性と識別された領域の始点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の終点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定工程とを含む。
また、本発明は、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書画像の領域識別方法であって、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出工程と、前記外接矩形抽出工程により抽出された前記矩形の座標の最大値を１画素ライン毎に求め、求めた前記矩形の座標の最大値が、前記文字領域属性と識別された領域の終点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の始点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定工程とを含む。
また、本発明において、前記外接矩形抽出工程は、入力画像のオリジナル画像から、前記矩形の抽出を行う。
また、本発明において、前記文字領域属性と識別された領域を、前記行間部分で再分割する再分割工程を備える。
【００１９】
したがって、文字領域属性と識別された領域内に文字以外の領域が存在するか否かが判定され、文字領域属性と識別された領域内に文字以外の領域が存在すると判定された場合、当該文字領域属性と識別された領域が再分割される。これにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することが可能になるので、文字抽出精度を向上させることが可能になる。
【００２１】
また、文字領域属性と識別された領域内に文字以外の領域が存在するか否かの判定が容易になる。
【００２３】
また、再分割が容易になる。
【００２４】
また、本発明は、前記再分割工程により再分割された前記文字領域属性と識別された領域において、前記行間と判定された領域に存在する前記矩形に実線が存在するか否かにより、前記文字以外の領域が存在するか否かを判定する。
【００２５】
したがって、行間と判定された領域には文字は存在しないことから、文字以外の領域が存在するか否かの判定が容易になる。
【００２８】
また、本発明は、文書画像データ中に混在する文字領域と文字以外の領域との識別分類をコンピュータに実行させるためのプログラムであって、前記コンピュータに、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出機能と、前記外接矩形抽出機能により抽出された前記矩形の座標の最小値を１画素ライン毎に求め、求めた前記矩形の座標の最小値が、前記文字領域属性と識別された領域の始点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の終点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定機能とを実行させる。
また、本発明は、文書画像データ中に混在する文字領域と文字以外の領域との識別分類をコンピュータに実行させるためのプログラムであって、前記コンピュータに、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出機能と、前記外接矩形抽出機能により抽出された前記矩形の座標の最大値を１画素ライン毎に求め、求めた前記矩形の座標の最大値が、前記文字領域属性と識別された領域の終点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の始点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定機能とを実行させる。
また、本発明において、前記外接矩形抽出機能は、入力画像のオリジナル画像から、前記矩形の抽出を行う。
また、本発明において、前記文字領域属性と識別された領域を、前記行間部分で再分割する再分割機能を備える。
【００２９】
したがって、文字領域属性と識別された領域内に文字以外の領域が存在するか否かが判定され、文字領域属性と識別された領域内に文字以外の領域が存在すると判定された場合、当該文字領域属性と識別された領域が再分割される。これにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することが可能になるので、文字抽出精度を向上させることが可能になる。
【００３１】
また、文字領域属性と識別された領域内に文字以外の領域が存在するか否かの判定が容易になる。
【００３３】
また、再分割が容易になる。
【００３４】
また、本発明において、前記再分割機能により再分割された前記文字領域属性と識別された領域において、前記行間と判定された領域に存在する前記矩形に実線が存在するか否かにより、前記文字以外の領域が存在するか否かを判定する。
【００３５】
したがって、行間と判定された領域には文字は存在しないことから、文字以外の領域が存在するか否かの判定が容易になる。
【００３８】
また、本発明のコンピュータに読み取り可能な記憶媒体は、請求項１１ないし１５のいずれか一記載のプログラムを記憶している。
【００３９】
したがって、この記憶媒体をコンピュータにインストールすることにより、請求項１１ないし１５のいずれか一記載のプログラムと同様の作用を得ることが可能になる。
【００４０】
【発明の実施の形態】
本発明の実施の一形態を図１ないし図６に基づいて説明する。
【００４１】
図１は、文書認識装置１のハードウェア構成を概略的に示すブロック図である。図１に示すように、文書認識装置１は、この文書認識装置１の各部を集中的に制御するＣＰＵ（Central Processing Unit）２を備えており、このＣＰＵ２には、ＢＩＯＳなどを記憶した読出し専用メモリであるＲＯＭ（Read Only Memory）３と、各種データを書換え可能に記憶するＲＡＭ（Random Access Memory）４とがバス５で接続されている。さらにバス５には、外部記憶となるＨＤＤ（Hard Disk Drive）６と、ＣＤ（Compact Disc）−ＲＯＭ７を読み取るＣＤ−ＲＯＭドライブ８と、文書認識装置１とネットワーク９との通信を司る通信制御装置１０と、入力部として機能するキーボードやマウスなどの入力装置１１と、ＣＲＴ（Cathode Ray Tube）、ＬＣＤ（Liquid Crystal Display）などの出力装置１２と、画像入力部として機能するスキャナなどの画像入力装置１３とが、図示しないＩ／Ｏを介して接続されている。
【００４２】
ＲＡＭ４は、各種データを書換え可能に記憶する性質を有していることから、ＣＰＵ２の作業エリアとして機能する。
【００４３】
また、ＨＤＤ６には、各種のプログラムを格納するプログラムファイルが格納されている。
【００４４】
図１に示すＣＤ−ＲＯＭ７は、この発明の記憶媒体を実施するものであり、所定のプログラムが記憶されている。ＣＰＵ２は、ＣＤ−ＲＯＭ７に記憶されているプログラムをＣＤ−ＲＯＭドライブ８で読み取り、ＨＤＤ６にインストールする。これにより、文書認識装置１は、後述するような各種の処理を行なうことが可能な状態となる。
【００４５】
なお、記憶媒体としては、ＣＤ−ＲＯＭ７のみならず、ＤＶＤなどの各種の光ディスク、各種光磁気ディスク、フロッピーディスクなどの各種磁気ディスク等、半導体メモリ等の各種方式のメディアを用いることができる。また、通信制御装置１０を介してインターネットなどのネットワーク９からプログラムをダウンロードし、ＨＤＤ６にインストールするようにしてもよい。この場合に、送信側のサーバでプログラムを記憶している記憶装置も、この発明の記憶媒体である。なお、プログラムは、所定のＯＳ（Operating System）上で動作するものであってもよいし、その場合に後述の各種処理の一部の実行をＯＳに肩代わりさせるものであってもよいし、ワープロソフトなど所定のアプリケーションソフトやＯＳなどを構成する一群のプログラムファイルの一部として含まれているものであってもよい。
【００４６】
次に、文書認識装置１のＣＰＵ２がプログラムに基づいて制御されることにより実現される各種機能について説明する。図２は、文書認識装置１の機能ブロック図である。
【００４７】
領域識別部１４は、例えば画像入力装置１３から入力されてメモリ（ＲＡＭ４等）に記憶された文書画像を領域識別し、文字領域、表領域、図領域、写真領域などに分類する。なお、文書の領域属性は、黒ランの密度を用いて判断する等の手法により求めることが可能であるが、この手法は従来より公知であるため、その説明は省略する。
【００４８】
図領域抽出部１５は、領域識別部１４において文字領域として分類された領域内に、実線（印鑑や図等）が混入しているか否かを判定する。実線（印鑑や図等）が混入しているか否かの判定手法は従来より公知であるため、その説明は省略する。
【００４９】
領域分割部１６は、図領域抽出部１５において文字領域として分類された領域内に実線（印鑑や図等）が混入していると判断された場合、対象文字領域を再分割し、文字認識部１７に渡す。
【００５０】
文字認識部１７は、行切り出し処理及び文字切り出し処理によって１文字の文字を切り出すとともに、切り出した文字に対する文字認識処理のマッチング処理により、文字候補を選択する。
【００５１】
なお、図領域抽出部１５において文字領域として分類された領域内に実線（印鑑や図等）が混入していないと判断された場合は、図領域抽出部１５において文字領域として分類された領域はそのまま文字認識部１７に渡される。
【００５２】
ここで、本実施の形態の特長的な機能を発揮する図領域抽出部１５及び領域分割部１６における処理の流れについて図３を参照しつつ詳細に説明する。まず、ステップＳ１においては、領域識別部１４において文字領域として分類された領域について、領域座標データ（入力画像を１／４に圧縮した１／４圧縮画像で抽出された始点、終点のＸ，Ｙ座標）を用いて該当領域が縦長領域であるか否かを判断し、該当領域が縦長領域である場合には、該当領域を排除する（以降の処理を行わない）。
【００５３】
加えて、ステップＳ２においては、該当領域の行方向が「縦」であるか否かを判断し、該当領域の行方向が「縦」である場合には、該当領域を排除する（以降の処理を行わない）。
【００５４】
次いで、ステップＳ３において、候補領域の検出を行う。より詳細には、まず、上記の処理で検出された文字領域のオリジナル画像に対して矩形抽出処理を行い、矩形座標データを得る。ここに、外接矩形抽出手段の機能が実行される。ここで、１／４圧縮画像を用いないのは、圧縮画像を用いると矩形同士が接触して大きな矩形となってしまうからである。この後の処理で矩形座標情報から強制分割位置を推定するため、矩形同士が接触して大きな矩形となってしまった場合には、推定精度があがらないという問題が発生するためである。そして、このような矩形抽出処理の結果求まった文字領域内の矩形がすべて黒画素であると仮定し、各ラインごと（Ｙ座標ごと）に文字領域内で最も小さいＸ座標（図４に示す太実線：minＸs(y)）と、最も大きいＸ座標（図４に示す太破線：maxＸe(y)）とを求める。
【００５５】
minＸs(y)とmaxＸe(y)とのｙの値は、該当領域座標の始点(area.Ｙs)から終点(area.Ｙe)の値を取るが、ここで上記の範囲を０〜９９の１００個のデータに正規化する。
minＸs(y)，maxＸe(y) →（正規化）→ minＸs(Ｙ)，maxＸe(Ｙ)
但しＹ＝(ｙ−area.Ｙs)／(area.Ｙe−area.Ｙs)×１００
以上により、領域内矩形のＸ座標の最大値、最小値が各画素行ごとに求められる。
【００５６】
次いで、この領域内矩形のＸ座標の最大値、最小値の値から、複数行が接触しているか否かを判断する。複数行の左側が接触している例（図４（ａ）参照）では、Ｘ座標の最小値に注目し、最小値が領域の始点Ｘsに連続して寄っているところを行とする。また、行間は、Ｘ座標の最小値が領域の終点Ｘeに近くなることから、連続してＸeに寄っているところを行間とする。そして、「行−行間−行」の組み合わせが検出された領域を複数行が接触している（つまり、印鑑や図等が混入している）と判定し、図５に示すように、行間の中心で領域を強制分割する。なお、複数行の右側が接触している例（図４（ｂ）参照）では、Ｘ座標の最大値に注目し同様の処理を行うことになる。これにより、文字領域が再分割され、候補領域の検出処理（ステップＳ３）が終了する。ここに、非文字領域判定手段の機能及び再分割手段の機能が実行される。
【００５７】
最後に、ステップＳ４に進み、最終判定処理を実行する。最終判定は、再分割された文字領域内に実線（印鑑や図等）が混入しているか否かを判定するものであって、行間と判定された領域に存在する矩形に実線（印鑑や図等）が存在するか否かを判定し、矩形に実線（印鑑や図等）が存在する場合にはその矩形を図領域とするものである。
【００５８】
なお、上記では座標の凹凸情報から行を横に分割する例を説明したが、凹凸情報を用いて、図６に示すように凸部で図や写真と思われる図領域部分を縦方向に分割するようにしても良い。
【００５９】
また、これらの分割の後、文字認識を行った結果の確からしさを示す指標（確信度）を算出し、確信度が低い（確からしさが低い）部分は、図領域とすることで、より分割精度を向上させることも可能である。
【００６０】
なお、本実施の形態においては、行方向横向きである横書き文書に関して説明をしたが、これに限るものではなく、行方向縦向きである縦書き文書に適用することも可能である。
【００６１】
ここに、文字領域属性と識別された領域内に文字以外の領域が存在するか否かが判定され、文字領域属性と識別された領域内に文字以外の領域が存在すると判定された場合、当該文字領域属性と識別された領域が再分割される。これにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することが可能になるので、文字抽出精度を向上させることが可能になる。
【００６２】
【発明の効果】
本発明によれば、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書認識装置において、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出手段と、前記外接矩形抽出手段により抽出された前記矩形の座標の最小値を１画素ライン毎に求め、求めた前記矩形の座標の最小値が、前記文字領域属性と識別された領域の始点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の終点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定手段とを備える。また、本発明によれば、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書認識装置において、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出手段と、前記外接矩形抽出手段により抽出された前記矩形の座標の最大値を１画素ライン毎に求め、求めた前記矩形の座標の最大値が、前記文字領域属性と識別された領域の終点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の始点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定手段とを備える。また、前記外接矩形抽出手段は、入力画像のオリジナル画像から、前記矩形の抽出を行う。また、前記文字領域属性と識別された領域を、前記行間部分で再分割する再分割手段を備える。これにより、文字領域属性と識別された領域内に文字以外の領域が存在するか否かを判定し、文字領域属性と識別された領域内に文字以外の領域が存在すると判定した場合、当該文字領域属性と識別された領域を再分割することにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することができるので、文字抽出精度を向上させることができる。
【００６３】
また、本発明によれば、文字領域属性と識別された領域内に文字以外の領域が存在するか否かの判定を容易に行うことができる。
【００６４】
また、本発明によれば、再分割を容易に行うことができる。
【００６５】
また、本発明によれば、前記再分割手段により再分割された前記文字領域属性と識別された領域において、前記行間と判定された領域に存在する前記矩形に実線が存在するか否かにより、前記文字以外の領域が存在するか否かを判定することにより、行間と判定された領域には文字は存在しないことから、文字以外の領域が存在するか否かの判定を容易に行うことができる。
【００６７】
また、本発明によれば、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書画像の領域識別方法であって、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出工程と、前記外接矩形抽出工程により抽出された前記矩形の座標の最小値を１画素ライン毎に求め、求めた前記矩形の座標の最小値が、前記文字領域属性と識別された領域の始点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の終点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定工程とを含む。また、本発明によれば、文書画像データ中に混在する文字領域と文字以外の領域とを識別分類する文書画像の領域識別方法であって、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出工程と、前記外接矩形抽出工程により抽出された前記矩形の座標の最大値を１画素ライン毎に求め、求めた前記矩形の座標の最大値が、前記文字領域属性と識別された領域の終点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の始点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定工程とを含む。また、前記外接矩形抽出工程は、入力画像のオリジナル画像から、前記矩形の抽出を行う。また、前記文字領域属性と識別された領域を、前記行間部分で再分割する再分割工程を備える。これにより、文字領域属性と識別された領域内に文字以外の領域が存在するか否かを判定し、文字領域属性と識別された領域内に文字以外の領域が存在すると判定した場合、当該文字領域属性と識別された領域を再分割することにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することができるので、文字抽出精度を向上させることができる。
【００６８】
また、本発明によれば、文字領域属性と識別された領域内に文字以外の領域が存在するか否かの判定を容易に行うことができる。
【００６９】
また、本発明によれば、再分割を容易に行うことができる。
【００７０】
また、本発明によれば、前記再分割工程により再分割された前記文字領域属性と識別された領域において、前記行間と判定された領域に存在する前記矩形に実線が存在するか否かにより、前記文字以外の領域が存在するか否かを判定することにより、行間と判定された領域には文字は存在しないことから、文字以外の領域が存在するか否かの判定を容易に行うことができる。
【００７２】
また、本発明によれば、文書画像データ中に混在する文字領域と文字以外の領域との識別分類をコンピュータに実行させるためのプログラムであって、前記コンピュータに、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出機能と、前記外接矩形抽出機能により抽出された前記矩形の座標の最小値を１画素ライン毎に求め、求めた前記矩形の座標の最小値が、前記文字領域属性と識別された領域の始点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の終点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定機能とを実行させる。また、本発明によれば、文書画像データ中に混在する文字領域と文字以外の領域との識別分類をコンピュータに実行させるためのプログラムであって、前記コンピュータに、文字領域属性と識別された領域内に含まれる黒画素の連結成分に外接する矩形を前記文書画像データより抽出する外接矩形抽出機能と、前記外接矩形抽出機能により抽出された前記矩形の座標の最大値を１画素ライン毎に求め、求めた前記矩形の座標の最大値が、前記文字領域属性と識別された領域の終点に連続して寄っているところを文字行として判定し、前記文字領域属性と識別された領域の始点に連続して寄っているところを行間として判定し、「行−行間−行」の組み合わせが検出された場合に、前記文字領域属性と識別された領域内に、前記文字以外の領域が存在すると判定する非文字領域判定機能とを実行させる。また、前記外接矩形抽出機能は、入力画像のオリジナル画像から、前記矩形の抽出を行う。また、前記文字領域属性と識別された領域を、前記行間部分で再分割する再分割機能を備える。これにより、文字領域属性と識別された領域内に文字以外の領域が存在するか否かを判定し、文字領域属性と識別された領域内に文字以外の領域が存在すると判定した場合、当該文字領域属性と識別された領域を再分割することにより、印鑑や図など文字以外の領域が領域分割の結果として文字と判定された領域に入り込んだ場合であっても、当該文字領域属性と識別された領域を再度分割することで文字以外の領域を排除することができるので、文字抽出精度を向上させることができる。
【００７３】
また、本発明によれば、文字領域属性と識別された領域内に文字以外の領域が存在するか否かの判定を容易に行うことができる。
【００７４】
また、本発明によれば、再分割を容易に行うことができる。
【００７５】
また、本発明によれば、前記再分割機能により再分割された前記文字領域属性と識別された領域において、前記行間と判定された領域に存在する前記矩形に実線が存在するか否かにより、前記文字以外の領域が存在するか否かを判定することにより、行間と判定された領域には文字は存在しないことから、文字以外の領域が存在するか否かの判定を容易に行うことができる。
【００７７】
また、本発明のコンピュータに読み取り可能な記憶媒体によれば、上述したプログラムを記憶したことにより、この記憶媒体をコンピュータにインストールすることで、上述したプログラムと同様の作用・効果を得ることができる。
【図面の簡単な説明】
【図１】本発明の実施の一形態の文書認識装置のハードウェア構成を概略的に示すブロック図である。
【図２】文書認識装置の機能ブロック図である。
【図３】図領域抽出部及び領域分割部における処理の流れを示すフローチャートである。
【図４】領域内矩形抽出結果に基づいて領域内矩形のＸ座標の最大値、最小値を求めた例を示す説明図である。
【図５】強制分割位置の一例を示す説明図である。
【図６】強制分割位置の他の一例を示す説明図である。
【符号の説明】
１文書認識装置
７記憶媒体

[0001]
[Technical field of the invention]
The present invention relates to a document recognition apparatus, a document image area identification method, a program, and a storage medium that identify and classify character areas and non-character areas mixed in document image data.
[0002]
[Prior art]
Conventionally, various methods are known for identifying character strings and character areas (columns) in a document image.
[0003]
For example, Japanese Patent Laid-Open No. 06-020092 proposes a method that extracts blank parts from a document image and divides the image into regions by treating blank separators, formed by connected blank parts, as region dividing lines, thereby extracting character strings and character areas (columns) from the document image.
[0004]
Also known is a method that uses a projection histogram of black pixels and takes the portions where the black pixel distribution is high as the range of a character string (Akiyama and Masuda, "Area identification of document images using peripheral distribution, line density, and circumscribed rectangle features," Transactions of the Institute of Electronics and Communication Engineers, 86/8, Vol. J69-D).
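As a rough illustration of the projection-histogram idea mentioned above, the following sketch counts black pixels per row and reports the row ranges whose count reaches a threshold. The function name `text_line_ranges`, the list-of-lists binary image format (1 = black) and the `min_black` threshold are assumptions made for illustration; they are not taken from the cited paper.

```python
from typing import List, Tuple


def text_line_ranges(image: List[List[int]], min_black: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (start_row, end_row) ranges whose black-pixel count is >= min_black."""
    # Horizontal projection: number of black pixels (value 1) in each row.
    profile = [sum(row) for row in image]
    ranges = []
    start = None
    for y, count in enumerate(profile):
        if count >= min_black and start is None:
            start = y                      # a dense run begins
        elif count < min_black and start is not None:
            ranges.append((start, y - 1))  # the dense run ended on the previous row
            start = None
    if start is not None:
        ranges.append((start, len(profile) - 1))
    return ranges
```

A seal or figure that overlaps several text lines fills the rows between them with black pixels, so a profile like this no longer separates the lines; that is the situation the present invention addresses.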
[0005]
[Problems to be solved by the invention]
In the conventional methods for extracting character strings and character areas (columns) from a document image, the line segmentation process for character recognition, performed after area identification, assumes that no seals, figure areas, or the like are mixed into a character area.
[0006]
In reality, however, non-character areas such as seals and figures may enter an area determined to be characters as a result of area division. In such a case, the conventional methods cannot perform the line segmentation process that cuts out character lines from a character string or character area (column) in the document image, and character extraction accuracy decreases.
[0007]
An object of the present invention is to improve character extraction accuracy.
[0008]
[Means for Solving the Problems]
The present invention provides a document recognition apparatus that identifies and classifies character areas and non-character areas mixed in document image data, comprising: circumscribed rectangle extracting means for extracting, from the document image data, rectangles circumscribing connected components of black pixels contained in an area identified as having the character area attribute; and non-character area determination means for obtaining, for each pixel line, the minimum value of the coordinates of the rectangles extracted by the circumscribed rectangle extracting means, determining as character lines the places where the obtained minimum value stays continuously close to the start point of the area identified as having the character area attribute, determining as line spacing the places where it stays continuously close to the end point of that area, and determining, when a "line - line spacing - line" combination is detected, that a non-character area exists within the area identified as having the character area attribute.
The present invention also provides a document recognition apparatus that identifies and classifies character areas and non-character areas mixed in document image data, comprising: circumscribed rectangle extracting means for extracting, from the document image data, rectangles circumscribing connected components of black pixels contained in an area identified as having the character area attribute; and non-character area determination means for obtaining, for each pixel line, the maximum value of the coordinates of the rectangles extracted by the circumscribed rectangle extracting means, determining as character lines the places where the obtained maximum value stays continuously close to the end point of the area identified as having the character area attribute, determining as line spacing the places where it stays continuously close to the start point of that area, and determining, when a "line - line spacing - line" combination is detected, that a non-character area exists within the area identified as having the character area attribute.
In the present invention, the circumscribed rectangle extracting means extracts the rectangle from the original image of the input image.
The present invention further comprises subdivision means for subdividing the area identified as having the character area attribute at the line spacing portion.
[0009]
Therefore, it is determined whether a non-character area exists in the area identified as having the character area attribute, and if one is determined to exist, that area is subdivided. As a result, even if a non-character area such as a seal or a figure has entered an area determined to be characters as a result of area division, the non-character area can be excluded by dividing the area identified as having the character area attribute again, so character extraction accuracy can be improved.
[0011]
This also makes it easy to determine whether a non-character area exists in the area identified as having the character area attribute.
[0013]
Subdivision also becomes easier.
[0014]
In addition, in the present invention, within the area identified as having the character area attribute and subdivided by the subdivision means, whether a non-character area exists is determined according to whether a solid line exists in a rectangle located in the area determined to be line spacing.
[0015]
Therefore, since there is no character in the area determined as the line spacing, it is easy to determine whether there is an area other than the character.
[0018]
The present invention also provides a document image area identification method for identifying and classifying character areas and non-character areas mixed in document image data, including: a circumscribed rectangle extracting step of extracting, from the document image data, rectangles circumscribing connected components of black pixels contained in an area identified as having the character area attribute; and a non-character area determination step of obtaining, for each pixel line, the minimum value of the coordinates of the rectangles extracted in the circumscribed rectangle extracting step, determining as character lines the places where the obtained minimum value stays continuously close to the start point of the area identified as having the character area attribute, determining as line spacing the places where it stays continuously close to the end point of that area, and determining, when a "line - line spacing - line" combination is detected, that a non-character area exists within the area identified as having the character area attribute.
The present invention also provides a document image area identification method for identifying and classifying character areas and non-character areas mixed in document image data, including: a circumscribed rectangle extracting step of extracting, from the document image data, rectangles circumscribing connected components of black pixels contained in an area identified as having the character area attribute; and a non-character area determination step of obtaining, for each pixel line, the maximum value of the coordinates of the rectangles extracted in the circumscribed rectangle extracting step, determining as character lines the places where the obtained maximum value stays continuously close to the end point of the area identified as having the character area attribute, determining as line spacing the places where it stays continuously close to the start point of that area, and determining, when a "line - line spacing - line" combination is detected, that a non-character area exists within the area identified as having the character area attribute.
In the present invention, the circumscribed rectangle extracting step extracts the rectangles from the original image of the input image.
The present invention further includes a subdivision step of subdividing the area identified as having the character area attribute at the line spacing portion.
[0019]
Therefore, it is determined whether a non-character area exists in the area identified as having the character area attribute, and if one is determined to exist, that area is subdivided. As a result, even if a non-character area such as a seal or a figure has entered an area determined to be characters as a result of area division, the non-character area can be excluded by dividing the area identified as having the character area attribute again, so character extraction accuracy can be improved.
[0021]
This also makes it easy to determine whether a non-character area exists in the area identified as having the character area attribute.
[0023]
Subdivision also becomes easier.
[0024]
In addition, in the present invention, within the area identified as having the character area attribute and subdivided in the subdivision step, whether a non-character area exists is determined according to whether a solid line exists in a rectangle located in the area determined to be line spacing.
[0025]
Therefore, since there is no character in the area determined as the line spacing, it is easy to determine whether there is an area other than the character.
[0028]
The present invention also provides a program for causing a computer to identify and classify character areas and non-character areas mixed in document image data, the program causing the computer to execute: a circumscribed rectangle extraction function of extracting, from the document image data, rectangles circumscribing connected components of black pixels contained in an area identified as having the character area attribute; and a non-character area determination function of obtaining, for each pixel line, the minimum value of the coordinates of the rectangles extracted by the circumscribed rectangle extraction function, determining as character lines the places where the obtained minimum value stays continuously close to the start point of the area identified as having the character area attribute, determining as line spacing the places where it stays continuously close to the end point of that area, and determining, when a "line - line spacing - line" combination is detected, that a non-character area exists within the area identified as having the character area attribute.
The present invention also provides a program for causing a computer to identify and classify character areas and non-character areas mixed in document image data, the program causing the computer to execute: a circumscribed rectangle extraction function of extracting, from the document image data, rectangles circumscribing connected components of black pixels contained in an area identified as having the character area attribute; and a non-character area determination function of obtaining, for each pixel line, the maximum value of the coordinates of the rectangles extracted by the circumscribed rectangle extraction function, determining as character lines the places where the obtained maximum value stays continuously close to the end point of the area identified as having the character area attribute, determining as line spacing the places where it stays continuously close to the start point of that area, and determining, when a "line - line spacing - line" combination is detected, that a non-character area exists within the area identified as having the character area attribute.
In the present invention, the circumscribed rectangle extraction function extracts the rectangles from the original image of the input image.
The present invention further includes a subdivision function of subdividing the area identified as having the character area attribute at the line spacing portion.
[0029]
Therefore, it is determined whether a non-character area exists in the area identified as having the character area attribute, and if one is determined to exist, that area is subdivided. As a result, even if a non-character area such as a seal or a figure has entered an area determined to be characters as a result of area division, the non-character area can be excluded by dividing the area identified as having the character area attribute again, so character extraction accuracy can be improved.
[0031]
This also makes it easy to determine whether a non-character area exists in the area identified as having the character area attribute.
[0033]
Subdivision also becomes easier.
[0034]
In addition, in the present invention, within the area identified as having the character area attribute and subdivided by the subdivision function, whether a non-character area exists is determined according to whether a solid line exists in a rectangle located in the area determined to be line spacing.
[0035]
Therefore, since there is no character in the area determined as the line spacing, it is easy to determine whether there is an area other than the character.
[0038]
In addition, the computer-readable storage medium of the present invention stores the program according to any one of claims 11 to 15.
[0039]
Therefore, by installing this storage medium in a computer, it is possible to obtain the same operation as the program according to any one of claims 11 to 15.
[0040]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the present invention will be described with reference to FIGS. 1 to 6.
[0041]
FIG. 1 is a block diagram schematically showing the hardware configuration of the document recognition apparatus 1. As shown in FIG. 1, the document recognition apparatus 1 includes a CPU (Central Processing Unit) 2 that centrally controls each part of the document recognition apparatus 1. A ROM (Read Only Memory) 3, which is a read-only memory storing a BIOS and the like, and a RAM (Random Access Memory) 4, which stores various data in a rewritable manner, are connected to the CPU 2 by a bus 5. Also connected to the bus 5, via I/O (not shown), are an HDD (Hard Disk Drive) 6 serving as external storage, a CD-ROM drive 8 that reads a CD (Compact Disc)-ROM 7, a communication control device 10 that handles communication between the document recognition apparatus 1 and a network 9, an input device 11 such as a keyboard or mouse that functions as an input unit, an output device 12 such as a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display), and an image input device 13 such as a scanner that functions as an image input unit.
[0042]
The RAM 4 functions as a work area for the CPU 2 because it has the property of storing various data in a rewritable manner.
[0043]
The HDD 6 stores program files for storing various programs.
[0044]
A CD-ROM 7 shown in FIG. 1 implements the storage medium of the present invention, and stores a predetermined program. The CPU 2 reads the program stored in the CD-ROM 7 with the CD-ROM drive 8 and installs it in the HDD 6. As a result, the document recognition apparatus 1 is in a state in which various processes as described later can be performed.
[0045]
The storage medium is not limited to the CD-ROM 7; media of various types can be used, including optical disks such as DVDs, magneto-optical disks, magnetic disks such as floppy disks, and semiconductor memory. Alternatively, the program may be downloaded from the network 9, such as the Internet, via the communication control device 10 and installed on the HDD 6. In this case, the storage device that stores the program on the transmitting server is also a storage medium of the present invention. Note that the program may run on a predetermined OS (Operating System), in which case it may have the OS take over execution of part of the various processes described later, or it may be included as part of a group of program files constituting predetermined application software, such as word processing software, or an OS.
[0046]
Next, various functions realized when the CPU 2 of the document recognition apparatus 1 is controlled based on a program will be described. FIG. 2 is a functional block diagram of the document recognition apparatus 1.
[0047]
The area identifying unit 14 identifies areas of document images input from the image input device 13 and stored in a memory (RAM 4 or the like), for example, and classifies them into character areas, table areas, figure areas, photo areas, and the like. Note that the region attribute of the document can be obtained by a method such as judging using the density of black runs, but since this method is conventionally known, the description thereof is omitted.
[0048]
The figure area extraction unit 15 determines whether or not a solid line (such as a seal or a drawing) is mixed in the area classified as the character area by the area identification unit 14. Since a method for determining whether or not a solid line (a seal, a drawing, etc.) is mixed is conventionally known, the description thereof is omitted.
[0049]
When the figure area extraction unit 15 determines that a solid line (such as a seal or a drawing) is mixed into an area classified as a character area, the area dividing unit 16 subdivides the target character area and passes it to the character recognition unit 17.
[0050]
The character recognition unit 17 cuts out one character by line cut-out processing and character cut-out processing, and selects a character candidate by matching processing of character recognition processing for the cut-out character.
[0051]
If the figure area extraction unit 15 determines that no solid line (such as a seal or a drawing) is mixed into the area classified as a character area, that area is passed to the character recognition unit 17 as it is.
[0052]
Here, the flow of processing in the figure area extraction unit 15 and the area dividing unit 16, which provide the characteristic functions of the present embodiment, will be described in detail with reference to FIG. 3. First, in step S1, for each area classified as a character area by the area identifying unit 14, the area coordinate data (the X and Y coordinates of the start and end points, extracted from a 1/4 compressed image obtained by compressing the input image to 1/4) are used to determine whether the area is vertically long. If it is, the area is excluded (no further processing is performed on it).
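The check in step S1 can be sketched as follows. The source does not state the exact criterion for "vertically long", so this sketch simply assumes that an area is vertically long when its height exceeds its width; the `Area` class and the function name `is_vertically_long` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """Start point (xs, ys) and end point (xe, ye) of an area, in pixels."""
    xs: int
    ys: int
    xe: int
    ye: int


def is_vertically_long(area: Area) -> bool:
    # Assumption: "vertically long" means the height is greater than the width.
    width = area.xe - area.xs
    height = area.ye - area.ys
    return height > width
```

Areas for which this returns `True` would be skipped by the later steps.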
[0053]
In addition, in step S2, it is determined whether or not the line direction of the area is "vertical". If the line direction of the area is "vertical", the area is excluded (the subsequent processing is not performed).
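The two exclusion checks of steps S1 and S2 can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the `Region` class and `is_candidate` function are hypothetical names, and "vertically long" is interpreted here as height greater than width, which the text does not define precisely.

```python
from dataclasses import dataclass


@dataclass
class Region:
    xs: int  # start point X
    ys: int  # start point Y
    xe: int  # end point X
    ye: int  # end point Y
    line_direction: str  # "horizontal" or "vertical"


def is_candidate(region: Region) -> bool:
    """Return True if the region survives steps S1 and S2."""
    width = region.xe - region.xs
    height = region.ye - region.ys
    # Step S1: exclude vertically long areas (assumed to mean height > width).
    if height > width:
        return False
    # Step S2: exclude areas whose line direction is vertical.
    if region.line_direction == "vertical":
        return False
    return True
```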
[0054]
Next, in step S3, candidate areas are detected. More specifically, rectangle extraction processing is first performed on the original image of the character area that passed the above checks, to obtain rectangle coordinate data. Here, the function of the circumscribed rectangle extracting means is executed. The 1/4 compressed image is not used here because, in a compressed image, rectangles come into contact with each other and merge into large rectangles. Since the forced division position is estimated from the rectangle coordinate information in the subsequent processing, such merged rectangles would lower the estimation accuracy. Then, treating all the rectangles in the character area obtained by this rectangle extraction processing as filled with black pixels, the smallest X coordinate in the character area (thick solid line in FIG. 4: minXs(y)) and the largest X coordinate (thick broken line in FIG. 4: maxXe(y)) are obtained for each line (for each Y coordinate).
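The per-row minimum and maximum X coordinates described in step S3 can be computed from the list of circumscribed rectangles, for example as sketched below. The function name `row_extents`, the inclusive `(xs, ys, xe, ye)` tuple layout, and the use of `None` for rows that no rectangle covers are assumptions made for illustration. The text does not say how empty rows are represented.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (xs, ys, xe, ye), inclusive; assumed layout


def row_extents(
    rects: List[Rect], area_ys: int, area_ye: int
) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Return (minXs, maxXe) per pixel row from area_ys to area_ye inclusive.

    Each rectangle is treated as if it were filled with black pixels.
    Rows not covered by any rectangle get None (an assumption).
    """
    n = area_ye - area_ys + 1
    min_xs: List[Optional[int]] = [None] * n
    max_xe: List[Optional[int]] = [None] * n
    for xs, ys, xe, ye in rects:
        # Clip the rectangle's vertical extent to the area.
        for y in range(max(ys, area_ys), min(ye, area_ye) + 1):
            i = y - area_ys
            if min_xs[i] is None or xs < min_xs[i]:
                min_xs[i] = xs
            if max_xe[i] is None or xe > max_xe[i]:
                max_xe[i] = xe
    return min_xs, max_xe
```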
[0055]
The variable y of minXs(y) and maxXe(y) ranges from the start point (area.Ys) to the end point (area.Ye) of the area coordinates. These data are normalized as follows.
minXs(y), maxXe(y) → (normalization) → minXs(Y), maxXe(Y)
where Y = (y − area.Ys) / (area.Ye − area.Ys) × 100
In this way, the maximum value and the minimum value of the X coordinate of the rectangles in the area are obtained for each pixel row.
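The normalization of the row coordinate can be written directly from the formula above. The function name `normalize_y` and the error raised for a zero-height area are assumptions added for illustration.

```python
def normalize_y(y: int, area_ys: int, area_ye: int) -> float:
    """Map a row coordinate y in [area_ys, area_ye] to Y in [0, 100]."""
    if area_ye == area_ys:
        # A zero-height area has no defined normalization (assumed handling).
        raise ValueError("area must span at least two rows")
    return (y - area_ys) / (area_ye - area_ys) * 100
```

With this mapping the start point of the area corresponds to Y = 0 and the end point to Y = 100, regardless of the area's height in pixels.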
[0056]
Next, whether or not a plurality of lines are in contact with each other is determined based on the maximum and minimum values of the X coordinate of the rectangles in the area. In the example in which the left sides of a plurality of lines are in contact (see FIG. 4A), attention is paid to the minimum value of the X coordinate: a portion where the minimum value is continuously close to the start point Xs of the area is determined to be a line, and a portion where the minimum value is continuously close to the end point Xe of the area is determined to be a line space. An area in which the combination "line - line space - line" is detected is determined to be an area where a plurality of lines are in contact (that is, where a seal, a drawing, or the like is mixed in), and the area is forcibly divided at the center of the line space portion (see the explanatory diagram of the forced division position). In the example in which the right sides of a plurality of lines are in contact (see FIG. 4B), the same processing is performed with attention to the maximum value of the X coordinate. As a result, the character area is subdivided, and the candidate area detection process (step S3) ends. Here, the function of the non-character area determination means and the function of the subdivision means are executed.
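The "line - line space - line" detection on the minimum X coordinate can be sketched as below. This is a hedged illustration, not the patented implementation. The function names, the `near_ratio` tolerance for "close to" the start or end point, the `min_run` length for "continuously", and the treatment of empty rows (`None`) as line space are all assumptions that the text does not specify.

```python
from typing import List, Optional, Tuple


def classify_rows(
    min_xs: List[Optional[int]], area_xs: int, area_xe: int, near_ratio: float = 0.2
) -> List[str]:
    """Label each pixel row "line", "space", or "other" from its minimum X."""
    margin = near_ratio * (area_xe - area_xs)
    labels = []
    for x in min_xs:
        if x is None or area_xe - x <= margin:
            labels.append("space")  # minimum X is close to the end point Xe
        elif x - area_xs <= margin:
            labels.append("line")   # minimum X is close to the start point Xs
        else:
            labels.append("other")
    return labels


def find_runs(labels: List[str], min_run: int = 2) -> List[Tuple[str, int, int]]:
    """Return (label, first_row, last_row) for runs of at least min_run rows."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if i - start >= min_run:
                runs.append((labels[start], start, i - 1))
            start = i
    return runs


def forced_split_rows(
    min_xs: List[Optional[int]],
    area_xs: int,
    area_xe: int,
    near_ratio: float = 0.2,
    min_run: int = 2,
) -> List[int]:
    """Return row indices (relative to the top of the area) at which to split."""
    runs = find_runs(classify_rows(min_xs, area_xs, area_xe, near_ratio), min_run)
    splits = []
    # Runs shorter than min_run were dropped, so neighbours in this list
    # are consecutive surviving runs, not necessarily adjacent rows.
    for a, b, c in zip(runs, runs[1:], runs[2:]):
        if (a[0], b[0], c[0]) == ("line", "space", "line"):
            splits.append((b[1] + b[2]) // 2)  # center of the line space
    return splits
```

For the case in which the right sides of the lines are in contact, the same sketch could be applied to the maximum X coordinate with the roles of the start point and end point swapped.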
[0057]
Finally, the process proceeds to step S4, and a final determination process is executed. The final determination decides whether a solid line (such as a seal or a figure) is mixed in the subdivided character area. Specifically, it is checked whether a solid line (such as a seal or a figure) exists in a rectangle located in the area determined to be a line space, and when a solid line exists in the rectangle, the rectangle is treated as a figure region.
[0058]
Although the example described above divides the area horizontally between lines using the unevenness information of the coordinates, the same unevenness information may also be used, as shown in FIG. 6, to divide vertically at a convex portion a part considered to be a figure or a photograph.
[0059]
In addition, after these divisions, an index (confidence level) indicating the certainty of the character recognition result may be calculated, and parts with a low confidence level may be treated as figure regions and divided further, which can improve accuracy.
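The confidence-based refinement can be sketched as a simple relabeling pass. The function name `relabel_by_confidence`, the 0-to-1 confidence scale, and the default threshold of 0.5 are hypothetical. The text gives no concrete scale or threshold.

```python
from typing import List, Tuple

Area = Tuple[int, int, int, int]  # (xs, ys, xe, ye); assumed layout


def relabel_by_confidence(
    parts: List[Tuple[Area, float]], threshold: float = 0.5
) -> List[Tuple[Area, str]]:
    """Label each subdivided part "figure" if its recognition confidence is low."""
    return [
        (area, "figure" if confidence < threshold else "character")
        for area, confidence in parts
    ]
```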
[0060]
In the present embodiment, a horizontally written document, in which the line direction is horizontal, has been described. However, the present invention is not limited to this and can also be applied to a vertically written document, in which the line direction is vertical.
[0061]
Here, it is determined whether or not an area other than characters exists in the area identified as the character area attribute, and if it is determined that such an area exists, the area identified as the character area attribute is subdivided. As a result, even if an area other than characters, such as a seal or a figure, enters an area determined to be characters as a result of area division, the area other than characters can be excluded by dividing that area again, so the character extraction accuracy can be improved.
[0062]
[Effect of the invention]
According to the present invention, a document recognition apparatus that identifies and classifies character areas and non-character areas mixed in document image data comprises circumscribed rectangle extracting means and non-character area determination means. The circumscribed rectangle extracting means extracts, from the document image data, a rectangle circumscribing a connected component of black pixels included in the area identified as the character area attribute. The non-character area determination means obtains the minimum value of the coordinates of the extracted rectangle for each pixel line, determines as a character line a portion where the obtained minimum value is continuously close to the start point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the end point of that area, and, when a combination of "line - line space - line" is detected, determines that an area other than characters exists in the area identified as the character area attribute.
Moreover, according to the present invention, a document recognition apparatus that identifies and classifies character areas and non-character areas mixed in document image data comprises circumscribed rectangle extracting means and non-character area determination means. The circumscribed rectangle extracting means extracts, from the document image data, a rectangle circumscribing a connected component of black pixels included in the area identified as the character area attribute. The non-character area determination means obtains the maximum value of the coordinates of the extracted rectangle for each pixel line, determines as a character line a portion where the obtained maximum value is continuously close to the end point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the start point of that area, and, when a combination of "line - line space - line" is detected, determines that an area other than characters exists in the area identified as the character area attribute.
The circumscribed rectangle extracting means extracts the rectangle from the original image of the input image. The apparatus further comprises subdivision means for subdividing the area identified as the character area attribute at the line space portion.
Thus, it is determined whether or not an area other than characters exists in the area identified as the character area attribute, and if it is determined that such an area exists, the area identified as the character area attribute is subdivided. As a result, even if an area other than characters, such as a seal or a figure, enters an area determined to be characters as a result of area division, the area other than characters can be excluded by dividing that area again, so the character extraction accuracy can be improved.
[0063]
Further, according to the present invention, it is possible to easily determine whether or not an area other than characters exists in the area identified as the character area attribute.
[0064]
Further, according to the present invention, subdivision can be easily performed.
[0065]
Further, according to the present invention, in the area identified as the character area attribute and subdivided by the subdivision means, whether or not an area other than characters exists is determined depending on whether a solid line exists in a rectangle present in the area determined to be the line space. Since no characters exist in the area determined to be a line space, it is possible to easily determine whether or not an area other than characters exists.
[0067]
Further, according to the present invention, a document image region identification method for identifying and classifying character areas and non-character areas mixed in document image data includes a circumscribed rectangle extracting step and a non-character area determination step. The circumscribed rectangle extracting step extracts, from the document image data, a rectangle circumscribing a connected component of black pixels included in the area identified as the character area attribute. The non-character area determination step obtains the minimum value of the coordinates of the extracted rectangle for each pixel line, determines as a character line a portion where the obtained minimum value is continuously close to the start point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the end point of that area, and, when a combination of "line - line space - line" is detected, determines that an area other than characters exists in the area identified as the character area attribute.
Moreover, according to the present invention, a document image region identification method for identifying and classifying character areas and non-character areas mixed in document image data includes a circumscribed rectangle extracting step and a non-character area determination step. The circumscribed rectangle extracting step extracts, from the document image data, a rectangle circumscribing a connected component of black pixels included in the area identified as the character area attribute. The non-character area determination step obtains the maximum value of the coordinates of the extracted rectangle for each pixel line, determines as a character line a portion where the obtained maximum value is continuously close to the end point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the start point of that area, and, when a combination of "line - line space - line" is detected, determines that an area other than characters exists in the area identified as the character area attribute.
In the circumscribed rectangle extracting step, the rectangle is extracted from the original image of the input image. The method further includes a subdivision step of subdividing the area identified as the character area attribute at the line space portion.
Thus, it is determined whether or not an area other than characters exists in the area identified as the character area attribute, and if it is determined that such an area exists, the area identified as the character area attribute is subdivided. As a result, even if an area other than characters, such as a seal or a figure, enters an area determined to be characters as a result of area division, the area other than characters can be excluded by dividing that area again, so the character extraction accuracy can be improved.
[0068]
Further, according to the present invention, it is possible to easily determine whether or not an area other than characters exists in the area identified as the character area attribute.
[0069]
Further, according to the present invention, subdivision can be easily performed.
[0070]
Further, according to the present invention, in the area identified as the character area attribute and subdivided by the subdivision step, whether or not an area other than characters exists is determined depending on whether a solid line exists in a rectangle present in the area determined to be the line space. Since no characters exist in the area determined to be a line space, it is possible to easily determine whether or not an area other than characters exists.
[0072]
Further, according to the present invention, a program for causing a computer to identify and classify character areas and non-character areas mixed in document image data causes the computer to execute a circumscribed rectangle extracting function and a non-character area determination function. The circumscribed rectangle extracting function extracts, from the document image data, a rectangle circumscribing a connected component of black pixels included in the area identified as the character area attribute. The non-character area determination function obtains the minimum value of the coordinates of the extracted rectangle for each pixel line, determines as a character line a portion where the obtained minimum value is continuously close to the start point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the end point of that area, and, when a combination of "line - line space - line" is detected, determines that an area other than characters exists in the area identified as the character area attribute.
Moreover, according to the present invention, a program for causing a computer to identify and classify character areas and non-character areas mixed in document image data causes the computer to execute a circumscribed rectangle extracting function and a non-character area determination function. The circumscribed rectangle extracting function extracts, from the document image data, a rectangle circumscribing a connected component of black pixels included in the area identified as the character area attribute. The non-character area determination function obtains the maximum value of the coordinates of the extracted rectangle for each pixel line, determines as a character line a portion where the obtained maximum value is continuously close to the end point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the start point of that area, and, when a combination of "line - line space - line" is detected, determines that an area other than characters exists in the area identified as the character area attribute.
The circumscribed rectangle extracting function extracts the rectangle from the original image of the input image. The program further provides a subdivision function for subdividing the area identified as the character area attribute at the line space portion.
Thus, it is determined whether or not an area other than characters exists in the area identified as the character area attribute, and if it is determined that such an area exists, the area identified as the character area attribute is subdivided. As a result, even if an area other than characters, such as a seal or a figure, enters an area determined to be characters as a result of area division, the area other than characters can be excluded by dividing that area again, so the character extraction accuracy can be improved.
[0073]
Further, according to the present invention, it is possible to easily determine whether or not an area other than characters exists in the area identified as the character area attribute.
[0074]
Further, according to the present invention, subdivision can be easily performed.
[0075]
Further, according to the present invention, in the area identified as the character area attribute and subdivided by the subdivision function, whether or not an area other than characters exists is determined depending on whether a solid line exists in a rectangle present in the area determined to be the line space. Since no characters exist in the area determined to be a line space, it is possible to easily determine whether or not an area other than characters exists.
[0077]
Further, according to the computer-readable storage medium of the present invention, which stores the above-described program, installing this storage medium in a computer provides the same actions and effects as the above-described program.
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing a hardware configuration of a document recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a document recognition apparatus.
FIG. 3 is a flowchart showing the flow of processing in the figure area extraction unit and the area dividing unit.
FIG. 4 is an explanatory diagram showing an example in which the maximum value and the minimum value of the X coordinate of the rectangle within the area are obtained based on the result of extracting the rectangle within the area.
FIG. 5 is an explanatory diagram showing an example of a forced division position.
FIG. 6 is an explanatory diagram showing another example of forced division positions.
[Explanation of symbols]
1 Document recognition device
7 Storage media

Claims

In a document recognition apparatus that identifies and classifies character areas and non-character areas mixed in document image data,
Circumscribing rectangle extracting means for extracting a rectangle circumscribing a connected component of black pixels included in the region identified as the character region attribute from the document image data;
and non-character area determination means that obtains, for each pixel line, the minimum value of the coordinates of the rectangle extracted by the circumscribed rectangle extracting means, determines as a character line a portion where the obtained minimum value of the coordinates of the rectangle is continuously close to the start point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the end point of the area identified as the character area attribute, and, when a combination of "line - line space - line" is detected, determines that an area other than the character exists in the area identified as the character area attribute.

In a document recognition apparatus that identifies and classifies character areas and non-character areas mixed in document image data,
Circumscribing rectangle extracting means for extracting a rectangle circumscribing a connected component of black pixels included in the region identified as the character region attribute from the document image data;
and non-character area determination means that obtains, for each pixel line, the maximum value of the coordinates of the rectangle extracted by the circumscribed rectangle extracting means, determines as a character line a portion where the obtained maximum value of the coordinates of the rectangle is continuously close to the end point of the area identified as the character area attribute, determines as a line space a portion where it is continuously close to the start point of the area identified as the character area attribute, and, when a combination of "line - line space - line" is detected, determines that an area other than the character exists in the area identified as the character area attribute.

3. The document recognition apparatus according to claim 1, wherein the circumscribed rectangle extracting unit extracts the rectangle from an original image of the input image.

4. The document recognition apparatus according to claim 1, further comprising subdivision means for subdividing an area identified as the character area attribute at the line spacing portion.

5. The document recognition apparatus according to claim 4, wherein, in the area identified as the character area attribute and subdivided by the subdivision means, whether or not an area other than the character exists is determined depending on whether a solid line exists in a rectangle present in the area determined to be the line space.

A document image region identification method for identifying and classifying character regions and non-character regions mixed in document image data,
A circumscribed rectangle extracting step of extracting a rectangle circumscribing a connected component of black pixels included in the region identified as the character region attribute from the document image data;
and a non-character area determination step of obtaining, for each pixel line, the minimum value of the coordinates of the rectangle extracted by the circumscribed rectangle extracting step, determining as a character line a portion where the obtained minimum value of the coordinates of the rectangle is continuously close to the start point of the area identified as the character area attribute, determining as a line space a portion where it is continuously close to the end point of the area identified as the character area attribute, and, when a combination of "line - line space - line" is detected, determining that an area other than the character exists in the area identified as the character area attribute.

A document image region identification method for identifying and classifying character regions and non-character regions mixed in document image data,
A circumscribed rectangle extracting step of extracting a rectangle circumscribing a connected component of black pixels included in the region identified as the character region attribute from the document image data;
and a non-character area determination step of obtaining, for each pixel line, the maximum value of the coordinates of the rectangle extracted by the circumscribed rectangle extracting step, determining as a character line a portion where the obtained maximum value of the coordinates of the rectangle is continuously close to the end point of the area identified as the character area attribute, determining as a line space a portion where it is continuously close to the start point of the area identified as the character area attribute, and, when a combination of "line - line space - line" is detected, determining that an area other than the character exists in the area identified as the character area attribute.

8. The document image region identification method according to claim 6, wherein the circumscribed rectangle extracting step extracts the rectangle from an original image of the input image.

9. The document image region identification method according to claim 6, further comprising a subdivision step of subdividing the region identified as the character region attribute at the portion between lines.

10. The document image region identification method according to claim 9, wherein, in the area identified as the character area attribute and subdivided by the subdivision step, whether or not an area other than the character exists is determined depending on whether a solid line exists in a rectangle present in the area determined to be the line space.

A program for causing a computer to identify and classify character areas and non-character areas mixed in document image data,
In the computer,
A circumscribed rectangle extracting function for extracting a rectangle circumscribing a connected component of black pixels included in the region identified as the character region attribute from the document image data;
and a non-character area determination function of obtaining, for each pixel line, the minimum value of the coordinates of the rectangle extracted by the circumscribed rectangle extracting function, determining as a character line a portion where the obtained minimum value of the coordinates of the rectangle is continuously close to the start point of the area identified as the character area attribute, determining as a line space a portion where it is continuously close to the end point of the area identified as the character area attribute, and, when a combination of "line - line space - line" is detected, determining that an area other than the character exists in the area identified as the character area attribute, the program causing the computer to execute these functions.

A program for causing a computer to identify and classify character areas and non-character areas mixed in document image data,
In the computer,
A circumscribed rectangle extracting function for extracting a rectangle circumscribing a connected component of black pixels included in the region identified as the character region attribute from the document image data;
and a non-character area determination function of obtaining, for each pixel line, the maximum value of the coordinates of the rectangle extracted by the circumscribed rectangle extracting function, determining as a character line a portion where the obtained maximum value of the coordinates of the rectangle is continuously close to the end point of the area identified as the character area attribute, determining as a line space a portion where it is continuously close to the start point of the area identified as the character area attribute, and, when a combination of "line - line space - line" is detected, determining that an area other than the character exists in the area identified as the character area attribute, the program causing the computer to execute these functions.

The program according to claim 11 or 12, wherein the circumscribed rectangle extraction function extracts the rectangle from an original image of an input image.

The program according to any one of claims 11 to 13, further comprising a subdivision function for subdividing an area identified as the character area attribute at the portion between lines.

15. The program according to claim 14, wherein, in the area identified as the character area attribute and subdivided by the subdivision function, whether or not an area other than the character exists is determined depending on whether a solid line exists in a rectangle present in the area determined to be the line space.

A computer-readable storage medium storing the program according to any one of claims 11 to 15.