JP3668026B2

JP3668026B2 - Publication electronic processing equipment

Info

Publication number: JP3668026B2
Application number: JP36009698A
Authority: JP
Inventors: 晴信森
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1998-12-18
Filing date: 1998-12-18
Publication date: 2005-07-06
Anticipated expiration: 2018-12-18
Also published as: JP2000182055A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数枚の画像中より画像一枚一枚のレイアウト情報を自動的に抽出する装置に関するもので、例えば出版物の電子化処理を行うために利用される。
【０００２】
【従来の技術】
文書を画像として読み取り、自動的にレイアウト構造の抽出を行なう際に、文書の種類によってレイアウト構造が大きく異なることもあるため、抽出アルゴリズムを種類に応じて変更することで、抽出結果を向上させることができる。例えば、特許番号第２７４８９７４号公報のように定型書式のモデルを複数用意することで、モデルとの照合結果によって処理の制御を切り替えることが可能になる。
【０００３】
また、特開平８−１２９５５０号公報に示されるように、文書を画像として読み込み、入力された文書データを解析して文書データと画像データ領域に分離し、分離された文書データ領域から文字画像を切り出して文字認識する技術は公知である。
【０００４】
さらに、例えば、橋本新一郎編著「文字認識概論」Ｐ５７〜Ｐ９５、オーム社、１９８２に示されるように文字認識に係わる技術が公知である。
【０００５】
【発明が解決しようとする課題】
例えば書籍の場合、連続するページでは、同じレイアウト構造を持つページが連続している場合が多い。従来技術においては、ページ一枚ごとにレイアウト情報抽出アルゴリズムを選択して切り替えることを想定しているため、定形書式モデルを蓄えておくメモリが増大するという問題が生じる。また、モデルとの照合を各ページで行うため処理時間が長くなるという問題がある。また、モデルとの照合において、モデルの各項目の空間情報と、入力文書のイメージから切り出した文字列ブロックの空間情報との照合を取るため、モデル照合に間違いを発生しやすいという問題がある。
【０００６】
本発明は、レイアウト構造の変化を検出した時点で抽出アルゴリズムを切り替え、レイアウト構造に対する抽出精度の向上と処理速度の向上を図ることにより上記課題を解決することを目的とする。
【０００７】
【課題を解決するための手段】
本発明は、出版物の頁画像のレイアウト情報を認識する出版物電子化処理装置であって、複数の頁画像を連続して入力することが可能な画像入力手段と、画像入力手段により入力された頁画像を、白領域と黒領域の領域を含む頁画像として格納する入力画像記憶手段と、入力画像記憶手段に格納されている頁画像の黒画素に対して、黒画像に外接する矩形を求め、同じ高さの黒画素の矩形が横に並んでいる場合、また同じ幅の黒画素の矩形が縦に並んでいる場合に、矩形間の距離にかかわらず近接する二つの外接矩形を一つの外接矩形に統合することを繰り返すことにより、複数の矩形領域を抽出する第１レイアウト情報抽出手段と、入力画像記憶手段に格納されている画像の黒画素に対して、黒画像に外接する矩形を求め、同じ高さの黒画素の矩形が横に並んでいる場合、また同じ幅の黒画素の矩形が縦に並んでいる場合に、矩形間の距離が所定範囲内に近接する二つの外接矩形を一つの外接矩形に統合することを繰り返すことにより、複数の矩形領域を抽出する第２レイアウト情報抽出手段と、第１のレイアウト情報抽出手段により抽出された矩形領域に対して、文字行を抽出し、文字行の最後が数字かを調査する行末数字調査手段と、行末数字調査手段による調査の結果、頁内の全文字行の数に対する文字行の最後が数字である文字行の数の割合が所定値以上かを判定する行末数字多判定手段と、を有し、行末数字多判定手段により、頁内の全文字行の数に対する文字行の最後が数字である文字行の数の割合が所定値以上であると判定された場合には、該頁について第２レイアウト情報抽出手段を使用せず、該判定された頁の次の頁において、第１レイアウト情報抽出手段を使用し、その後に、頁内の全文字行の数に対する文字行の最後が数字である文字行の数の割合が所定値未満であると判定された場合には、該頁について第２レイアウト情報抽出手段を使用し、該判定された頁以降の頁において、第１レイアウト情報抽出手段、行末数字調査手段および行末数字多判定手段を使用せず、第２レイアウト情報抽出手段を使用するものである。
【０００８】
また、本発明は、出版物の頁画像のレイアウト情報を認識する出版物電子化処理装置であって、複数の頁画像を連続して入力することが可能な画像入力手段と、画像入力手段により入力された頁画像を、白領域と黒領域の領域を含む頁画像として格納する入力画像記憶手段と、入力画像記憶手段に格納されている頁画像の黒画素に対して、黒画像に外接する矩形を求め、同じ高さの黒画素の矩形が横に並んでいる場合、また同じ幅の黒画素の矩形が縦に並んでいる場合に、矩形間の距離にかかわらず近接する二つの外接矩形を一つの外接矩形に統合することを繰り返すことにより、複数の矩形領域を抽出する第１レイアウト情報抽出手段と、入力画像記憶手段に格納されている画像の黒画素に対して、黒画像に外接する矩形を求め、同じ高さの黒画素の矩形が横に並んでいる場合、また同じ幅の黒画素の矩形が縦に並んでいる場合に、矩形間の距離が所定範囲内に近接する二つの外接矩形を一つの外接矩形に統合することを繰り返すことにより、複数の矩形領域を抽出する第２レイアウト情報抽出手段と、第１のレイアウト情報抽出手段により抽出された矩形領域に対して、文字行を抽出し、文字行の最後が数字である場合に、該文字行の行頭が連続した数字で、かつ該連続した数字の前に特定のデリミッタがあるかを調査する行末数字調査手段と、行末数字調査手段による調査の結果、頁内の全文字行の数に対する、文字行の最後が数字であり、文字行の行頭が連続した数字で、かつ連続した数字の前に特定のデリミッタがある文字行の数の割合が所定値以上かを判定する行末数字多判定手段と、を有し、行末数字多判定手段により、頁内の全文字行の数に対する、文字行の最後が数字であり、文字行の行頭が連続した数字で、かつ連続した数字の前に特定のデリミッタがある文字行の数の割合が所定値以上と判定された場合には、該頁について第２レイアウト情報抽出手段を使用せず、該判定された頁の次の頁において、第１レイアウト情報抽出手段を使用し、その後に、頁内の全文字行の数に対して、文字行の最後が数字であり、該文字行の行頭が連続した数字で、かつ連続した数字の前に特定のデリミッタがある文字行の数の割合が所定値未満と判定された場合には、該頁について第２レイアウト情報抽出手段を使用し、該判定された頁以降の頁において、第１レイアウト情報抽出手段、行末数字調査手段および行末数字多判定手段を使用せず、第２レイアウト情報抽出手段を使用するものである。
【０００９】
さらに、本発明は、画像入力手段により入力された複数の頁画像について、最終頁からあらかじめ定められた頁以内の頁であるかを判定する最終所定頁判定手段を、さらに備え、最終所定頁判定手段により、最終頁からあらかじめ定められた頁以内の頁であると判定された場合は、その判定された頁以降の頁について、第１レイアウト情報抽出手段を使用するものである。
【００１２】
【発明の実施の形態】
以下に、本発明による画像処理装置の一実施例について、図１〜図５に基づき説明する。
【００１３】
図１は、本発明を適用した一実施例のシステムの構成を示す機能ブロック図である。１は画像を取り込む画像入力手段であり、スキャナ、ビデオカメラ、他のＰＣや画像メディアからの画像入力などが含まれる。本発明では説明を簡単にするため以降入力手段をスキャナーとして説明を加える。２は入力画像用メモリであり、ハードディスク、ＭＤ、ＭＯなどの磁気メモリ、ＲＯＭ、ＲＡＭ、スマートカードなどの半導体メモリで構成される。３はレイアウト情報格納用メモリであり、２と同様に磁気メモリ、半導体メモリによって構成される。４は文字認識用辞書であり、磁気メモリあるいは半導体メモリに格納されている。５は本発明の画像処理を実行するプログラムの格納されたプログラム用ＲＯＭであり、６はプログラム用ＲＯＭ５内のプログラムに従って処理の流れを制御するＣＰＵあるいは専用ハードなどにより構成される制御手段である。７はレイアウト情報を抽出するためのレイアウト情報抽出手段であり、レイアウト情報抽出手段７には少なくとも第１のレイアウト情報抽出手段と第２のレイアウト情報抽出手段が存在する。８はレイアウト情報抽出手段で抽出された情報を基にどのレイアウト情報を参照ページ以降に適用するかを判断する判定処理手段である。レイアウト情報抽出手段、判定手段については後で詳しく述べる。９はデータバスであり、本発明はバスの幅、速度等に関してなんらの制限を加えるものではない。
【００１４】
図２〜図４は、実施例の処理の流れを示すフローチャートである。初めに、図２のフローチャートの部分から説明を加える。
【００１５】
まず、画像処理装置の制御手段の初期化を行う。初期化にはｆｌａｇ、ｆｌａｇ＿ｂ、ｆｌａｇ＿ｃ、３種類の変数に初期データ０を書き込む処理が含まれる（Ｓ１）。また、ｐはページを管理する変数であり、初期値は１が設定される。ｆｌａｇは第１レイアウト情報抽出手段によりレイアウト情報の抽出を行うか第２レイアウト情報抽出手段によりレイアウト情報抽出を行うか判定するためのフラグであり、その値が０であれば第１レイアウト情報抽出手段で抽出を行い、１であれば第２レイアウト情報抽出手段でレイアウト情報抽出を行う。ｆｌａｇ＿ｂは処理当該ページの前ページがどのレイアウト情報抽出手段を用いたかを蓄えておくフラグであり、第１レイアウト情報抽出手段を用いる場合０、第２レイアウト情報抽出手段を用いる場合は１が格納されている。ｆｌａｇ＿ｃは数字行末が多くないページに対しカウントアップされるカウンターである。
【００１６】
次に画像入力手段１で読み取った画像を、入力画像用メモリ２に転送する（Ｓ２）。この時、入力画像は白領域を０、黒領域を１として格納する（０と１の役割は逆でもかまわない）。
【００１７】
続いて、制御手段６は、ｆｌａｇの値を参照する。ｆｌａｇの値が０であればＳ４へ移り、ｆｌａｇの値が１であれば１２へ移る（Ｓ３）。今の場合、初期化によりｆｌａｇは０に設定されているので、Ｓ４へ移ることになる。
【００１８】
次に、制御手段６は、入力画像用メモリ２に格納された画像に対し、プログラム用ＲＯＭ５の内容を元に第１レイアウト情報抽出手段を用いて、レイアウト情報を抽出する（Ｓ４）。ここで、レイアウト情報の抽出アルゴリズムは従来より様々な方法が提案されており（特開平８−１２９５５０号公報など）、限定する必要はないが、ここでは一例として、画像内の黒画素に対して黒画素連結領域に外接する矩形を求め（図５（ａ））、近接する二つの外接矩形を一つの外接矩形に統合していくことを繰り返して、複数の矩形領域を求める（図５（ｂ））方法を用いることにする。また、外接矩形を統合する際には、対象が横書き文書の場合は、同じ高さの矩形が横に並んでいるならば横方向の矩形間距離が大きくても統合し、対象が縦書き文書の場合は、同じ幅の矩形が縦に並んでいるならば縦方向の矩形間距離が大きくても統合する。なお、外接矩形の統合に際して別の条件を追加することも可能である。
【００１９】
また、制御手段６は、例えば特開平８−１２９５５０号公報に開示される方法で、入力画像用メモリ２に格納された画像のうちＳ４の処理で得られた矩形領域で囲まれた部分について、文字行を抽出し、文字認識用辞書４を使用して文字行ごとに文字認識処理（例えば、橋本新一郎編著「文字認識概論」Ｐ５７〜Ｐ９５、オーム社、１９８２に記載されている方法を用いる）を行ない、文字認識結果の中で文字行の最後にあるのが数字かどうかを調べる（Ｓ５）。
【００２０】
Ｓ５で調査した結果、文字行の行末が数字である行が多いか否かを調べ、多い場合は１３へ、多くない場合は１４へ処理を分岐する（Ｓ６）。ここで、行末が数字である文字行が多いか否かは、当該ページの文字行の数をｎ、その内行末が数字である行をｍとするとき、ｍ／ｎが一定値以上、例えば、０．５以上であれば多いとし、ｍ／ｎが０．５以下の場合でも、行末が数字である行が一定行以上、例えば、２行以上連続していれば多いと判断するようにする。
【００２１】
また、上記Ｓ６では文字行の行末が数字である文字行が多いか否かで判断を分岐するようにしたが、例えば、行末が数字であれば行頭の方へ文字認識を進め、数字が１個以上連続しており、かつ、数字の前に特定のデリミッタがある場合を判定基準とすることもできる。ここで、デリミッタには例えばスペース、…などが含まれる。
【００２２】
図３は、Ｓ６を受けた処理から始まる。文字行の行末が数字である文字行が多い場合は、第２レイアウト情報抽出手段を使用しないと判定して、Ｓ４の処理で得られた矩形領域の座標データをレイアウト情報格納用メモリ３に格納する（Ｓ７）。続いてｆｌａｇ＿ｂに１を設定し、ｆｌａｇ＿ｃに０を設定する（Ｓ８）。これらの設定により、当該ページの次のページは第２レイアウト情報抽出手段を用いず、第１レイアウト抽出手段を用いることになる。Ｓ８の後はＳ１５へ移る。
【００２３】
なお、ステップＳ６で行末が数字である行が多くない場合は、Ｓ９へ移る。Ｓ９では、まず、ｆｌａｇ＿ｂを参照し、その値が０か１かで処理を分岐する。ｆｌａｇ＿ｂが０の場合は、第２レイアウト情報抽出手段を用いるルーチンを通るので、ｆｌａｇ＿ｃの値を１つカウントアップする（Ｓ１０）。
【００２４】
Ｓ９でｆｌａｇ＿ｂが１である場合は、ｆｌａｇに１を設定し、当該ページの次のページ以降については第１レイアウト情報抽出手段を用いず、第２レイアウト情報抽出手段を用いてレイアウト情報を抽出するようにフラグを設定しておく（Ｓ１１）。Ｓ１０、Ｓ１１に続いてｆｌａｇ＿ｂに値０を設定し、当該ページは第２レイアウト情報抽出手段を用いることを記録しておく。
【００２５】
Ｓ１２に続いて、制御手段６は入力画像用メモリ２に格納された画像に対し、プログラム用ＲＯＭ５の内容を基に第２レイアウト情報抽出手段を用いてレイアウト情報を抽出（Ｓ１３）する。抽出アルゴリズムは限定する必要はないが、例えば、外接矩形を統合する際に矩形間距離が大きい（例えば画像内の黒画素連結領域外接矩形サイズの平均値の定数倍を越える）場合は統合しない点を除いて、Ｓ４と同様の処理を行う。その結果得られた矩形領域データをレイアウト情報格納用メモリに格納する（Ｓ１４）。
【００２６】
Ｓ８またはＳ１４に続いてＳ１５へ移る。Ｓ１５では、カウンタｆｌａｇ＿ｃを参照し、予め設定した値ｋ以上であるか否かにより処理を分岐する。ここでｋは、ユーザーが指定した値や、全ページ数を定数で割った値などを割り当てる。
【００２７】
Ｓ１５において、ｆｌａｇ＿ｃがｋ以上であれば、ｆｌａｇを１に設定し（Ｓ１６）、次のページからは第２レイアウト情報抽出手段を用いてレイアウト情報を抽出する準備を行い、Ｓ１７の処理へ移る。また、Ｓ１５でｆｌａｇ＿ｃがｋ未満である場合は何もせずにＳ１７の処理へ移る。
【００２８】
Ｓ１７は現在処理しているページｐを参照し、入力画像最終ページｐ＿ｌａｓｔから一定数のページｊ以内であるか否かを照合する処理ルーチンである。ｊは例えば、ドキュメントの索引ページ相当のものを予め設定しておく。今、ｐ＿ｌａｓｔ−ｐ＝ｊであれば、ｆｌａｇに０を設定し、ｐ以降のページに対し、第１レイアウト情報抽出手段を用いるように準備する（Ｓ１８）。なお、本実施例では、ｐ＿ｌａｓｔ−ｐ＝ｊを満足する場合、以降のページで第１レイアウト情報抽出手段を用いるようにしているが、第１レイアウト情報抽出手段の代わりに、別のレイアウト情報抽出手段、例えば第３レイアウト情報抽出手段を用意しておいて使用してもよい。
【００２９】
Ｓ１９は次ページが存在するか否かを照合する処理モジュールであり、次ページが存在する場合はｐをインクリメントしてからＳ２に戻り、最終ページになるまでこの処理を繰り返す。
【００３０】
以上により、簡便で精度のよいレイアウト抽出が可能になる。なお、上記実施例ではレイアウト情報抽出手段として第１、第２を有するものを基本として述べているが、これらの手段が２個以上になっても本発明の動作に支障はない。また、上記ｋ、ｊなどの変数を複数用意することにより複雑多種なレイアウトを有する文書にも本発明を適用することが可能である。
【００３１】
【発明の効果】
以上のように本発明の請求項１および請求項２に係る出版物電子化処理装置は、最後が数字となる文字行が多いページを目次ページや索引ページとみなして、目次や索引用のレイアウト情報抽出アルゴリズムと、本文用のレイアウト情報抽出アルゴリズムを切換えることができ、結果として簡便で精度の高いレイアウト抽出が可能になるという効果がある。
【００３２】
また、本発明の請求項３に係る出版物電子化処理装置は、例えば索引ページ用のレイアウト情報抽出アルゴリズム（第１レイアウト情報抽出手段）とそれ以外のページ用のレイアウト情報抽出アルゴリズム（第２レイアウト情報抽出手段）を用意しておき、それまで、それ以外のページ用のレイアウト情報抽出アルゴリズムを使用していても、データの最後近くで索引ページ用のレイアウト情報抽出アルゴリズムに切換えることができ、処理時間の短縮を図ることができるという効果がある。
【図面の簡単な説明】
【図１】本発明を適用した一実施例システムの構成を示す機能ブロック図である。
【図２】本発明の一実施例の動作を示すフローチャートである。
【図３】本発明の一実施例の動作を示すフローチャートである。
【図４】本発明の一実施例の動作を示すフローチャートである。
【図５】同図（ａ）、（ｂ）は文書画像内の黒画素に対して黒画素連結領域の外接矩形を求めた一例を表す図である。
【符号の説明】
１画像入力手段
２入力画像用メモリ
３レイアウト情報格納用メモリ
４文字認識用辞書
５プログラム用ＲＯＭ
６制御手段
７レイアウト情報抽出手段
８判定処理手段
Ｓ１初期化処理モジュール
Ｓ２入力手段で読み取った画像データを入力画像用メモリに転送する処理モジュール
Ｓ３第１あるいは第２のレイアウト情報抽出手段の選択を判定する処理モジュール
Ｓ４第１レイアウト情報抽出手段を実行する処理モジュール
Ｓ５領域内の文字認識処理を実行する処理モジュール
Ｓ６数字行末の多さを判定する処理モジュール
[0001]
[Technical field of the invention]
The present invention relates to an apparatus for automatically extracting layout information for each image from a plurality of images, and is used, for example, for digitizing a publication.
[0002]
[Prior art]
When a document is read as an image and its layout structure is extracted automatically, the layout structure may differ greatly depending on the type of document, so the extraction result can be improved by changing the extraction algorithm according to the type. For example, by preparing a plurality of fixed-format models as in Japanese Patent No. 2748974, processing control can be switched according to the result of matching against the models.
[0003]
Also, as disclosed in JP-A-8-129550, a technique is known in which a document is read as an image, the input document data is analyzed and separated into a document data area and an image data area, and character images are cut out from the separated document data area and recognized.
[0004]
Further, techniques related to character recognition are known, for example as described in “Character Recognition Overview”, edited by Shinichiro Hashimoto, pp. 57-95, Ohmsha, 1982.
[0005]
[Problems to be solved by the invention]
For example, in the case of a book, consecutive pages often share the same layout structure. The prior art assumes that a layout information extraction algorithm is selected and switched for every single page, which increases the memory needed to store the fixed-format models. In addition, processing time becomes long because matching against the models is performed on every page. Further, because matching compares the spatial information of each item of the model with the spatial information of character string blocks cut out from the image of the input document, model matching is prone to errors.
[0006]
An object of the present invention is to solve the above problems by switching the extraction algorithm at the point where a change in layout structure is detected, thereby improving both the extraction accuracy and the processing speed for the layout structure.
[0007]
[Means for Solving the Problems]
The present invention is a publication electronic processing apparatus for recognizing layout information of page images of a publication. It comprises: image input means capable of continuously inputting a plurality of page images; input image storage means for storing each page image input by the image input means as a page image including white areas and black areas; first layout information extraction means that obtains, for the black pixels of a page image stored in the input image storage means, rectangles circumscribing the black images and, when black pixel rectangles of the same height are lined up horizontally or black pixel rectangles of the same width are lined up vertically, extracts a plurality of rectangular areas by repeatedly integrating two adjacent circumscribed rectangles into one circumscribed rectangle regardless of the distance between the rectangles; second layout information extraction means that obtains, for the black pixels of an image stored in the input image storage means, rectangles circumscribing the black images and, when black pixel rectangles of the same height are lined up horizontally or black pixel rectangles of the same width are lined up vertically, extracts a plurality of rectangular areas by repeatedly integrating into one circumscribed rectangle two circumscribed rectangles whose distance from each other is within a predetermined range; end-of-line number survey means that extracts character lines from the rectangular areas extracted by the first layout information extraction means and checks whether the end of each character line is a number; and end-of-line number multi-determination means that determines, from the result of the survey by the end-of-line number survey means, whether the ratio of the number of character lines ending with a number to the number of all character lines in the page is equal to or greater than a predetermined value.
When the end-of-line number multi-determination means determines that the ratio of the number of character lines ending with a number to the number of all character lines in the page is equal to or greater than the predetermined value, the second layout information extraction means is not used for that page, and the first layout information extraction means is used on the page following the determined page. When it is thereafter determined that the ratio of the number of character lines ending with a number to the number of all character lines in the page is less than the predetermined value, the second layout information extraction means is used for that page, and for the pages after the determined page the second layout information extraction means is used without using the first layout information extraction means, the end-of-line number survey means, or the end-of-line number multi-determination means.
[0008]
The present invention is also a publication electronic processing apparatus for recognizing layout information of page images of a publication. It comprises: image input means capable of continuously inputting a plurality of page images; input image storage means for storing each page image input by the image input means as a page image including white areas and black areas; first layout information extraction means that obtains, for the black pixels of a page image stored in the input image storage means, rectangles circumscribing the black images and, when black pixel rectangles of the same height are lined up horizontally or black pixel rectangles of the same width are lined up vertically, extracts a plurality of rectangular areas by repeatedly integrating two adjacent circumscribed rectangles into one circumscribed rectangle regardless of the distance between the rectangles; second layout information extraction means that obtains, for the black pixels of an image stored in the input image storage means, rectangles circumscribing the black images and, when black pixel rectangles of the same height are lined up horizontally or black pixel rectangles of the same width are lined up vertically, extracts a plurality of rectangular areas by repeatedly integrating into one circumscribed rectangle two circumscribed rectangles whose distance from each other is within a predetermined range; end-of-line number survey means that extracts character lines from the rectangular areas extracted by the first layout information extraction means and, when the end of a character line is a number, checks whether the beginning of the character line is a run of consecutive numbers and whether a specific delimiter precedes the consecutive numbers; and end-of-line number multi-determination means that determines, from the result of the survey by the end-of-line number survey means, whether the ratio, to the number of all character lines in the page, of the number of character lines whose end is a number, whose beginning is a run of consecutive numbers, and in which a specific delimiter precedes the consecutive numbers is equal to or greater than a predetermined value.
When the end-of-line number multi-determination means determines that this ratio is equal to or greater than the predetermined value, the second layout information extraction means is not used for that page, and the first layout information extraction means is used on the page following the determined page. When it is thereafter determined that this ratio is less than the predetermined value, the second layout information extraction means is used for that page, and for the pages after the determined page the second layout information extraction means is used without using the first layout information extraction means, the end-of-line number survey means, or the end-of-line number multi-determination means.
[0009]
The present invention further includes final predetermined page determination means for determining whether each of the plurality of page images input by the image input means is a page within a predetermined number of pages from the final page. When the final predetermined page determination means determines that a page is within the predetermined number of pages from the final page, the first layout information extraction means is used for the pages from the determined page onward.
[0012]
[Embodiments of the invention]
Hereinafter, an embodiment of an image processing apparatus according to the present invention will be described with reference to FIGS. 1 to 5.
[0013]
FIG. 1 is a functional block diagram showing the system configuration of an embodiment to which the present invention is applied. Reference numeral 1 denotes image input means for capturing images, which includes a scanner, a video camera, and image input from another PC or from image media. To simplify the description, the input means is hereinafter described as a scanner. Reference numeral 2 denotes an input image memory, composed of magnetic memory such as a hard disk, MD, or MO, or semiconductor memory such as ROM, RAM, or a smart card. Reference numeral 3 denotes a layout information storage memory, composed of magnetic memory or semiconductor memory in the same way as 2. Reference numeral 4 denotes a character recognition dictionary, which is stored in magnetic memory or semiconductor memory. Reference numeral 5 denotes a program ROM that stores the program for executing the image processing of the present invention, and reference numeral 6 denotes control means, composed of a CPU or dedicated hardware, that controls the flow of processing in accordance with the program in the program ROM 5. Reference numeral 7 denotes layout information extraction means for extracting layout information; it includes at least first layout information extraction means and second layout information extraction means. Reference numeral 8 denotes determination processing means that determines, based on the information extracted by the layout information extraction means, which layout information is to be applied from the referenced page onward. The layout information extraction means and the determination means are described in detail later. Reference numeral 9 denotes a data bus; the present invention places no restrictions on the width, speed, or other properties of the bus.
[0014]
FIGS. 2 to 4 are flowcharts showing the flow of processing of the embodiment. The description begins with the portion shown in the flowchart of FIG. 2.
[0015]
First, the control means of the image processing apparatus is initialized. The initialization includes writing the initial value 0 to three variables, flag, flag_b, and flag_c (S1). In addition, p is a variable that tracks the page, and its initial value is set to 1. flag determines whether layout information is extracted by the first layout information extraction means or by the second layout information extraction means: if its value is 0, the first layout information extraction means is used, and if it is 1, the second layout information extraction means is used. flag_b records which layout information extraction means was used for the page preceding the page being processed; it stores 0 when the first layout information extraction means is used and 1 when the second layout information extraction means is used. flag_c is a counter that is incremented for pages that do not have many lines ending with a number.
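As a rough illustration of the state set up in S1, the variables can be grouped in a small structure. This is only a sketch: the class name `ControlState` and the helper `select_means` are hypothetical names introduced here, and the string labels "first" and "second" merely stand in for the two extraction means.

```python
from dataclasses import dataclass


@dataclass
class ControlState:
    """Variables written during initialization (S1); names follow the text."""
    flag: int = 0    # 0: use the first extraction means, 1: use the second
    flag_b: int = 0  # record of the means used for the previous page
    flag_c: int = 0  # counter of pages without many number-ending lines
    p: int = 1       # page currently being processed


def select_means(state: ControlState) -> str:
    """Hypothetical helper mirroring the flag test described for S3."""
    return "first" if state.flag == 0 else "second"


state = ControlState()
print(select_means(state))
```

Right after initialization `flag` is 0, so the first extraction means is selected for the first page.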
[0016]
Next, the image read by the image input means 1 is transferred to the input image memory 2 (S2). At this time, the input image is stored with white areas as 0 and black areas as 1 (the roles of 0 and 1 may be reversed).
[0017]
Subsequently, the control means 6 refers to the value of flag. If the value of flag is 0, the process proceeds to S4, and if the value of flag is 1, the process proceeds to 12 (S3). In this case, since flag has been set to 0 by the initialization, the process proceeds to S4.
[0018]
Next, the control means 6 extracts layout information from the image stored in the input image memory 2 using the first layout information extraction means, based on the contents of the program ROM 5 (S4). Various layout information extraction algorithms have been proposed (for example, Japanese Patent Laid-Open No. 8-129550), and the algorithm need not be limited. As one example, the method used here obtains, for the black pixels in the image, rectangles circumscribing the black pixel connected regions (FIG. 5A), and then repeatedly integrates two adjacent circumscribed rectangles into one circumscribed rectangle to obtain a plurality of rectangular areas (FIG. 5B). When integrating circumscribed rectangles, if the target is a horizontally written document, rectangles of the same height that are lined up horizontally are integrated even if the horizontal distance between them is large; if the target is a vertically written document, rectangles of the same width that are lined up vertically are integrated even if the vertical distance between them is large. Other conditions may also be added when integrating circumscribed rectangles.
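The example method for S4 can be sketched as follows for a horizontally written page. This is a minimal sketch under stated assumptions, not the patented implementation: 8-connectivity, the tolerance `tol`, and the test "same top edge and same height" for rectangles lined up horizontally are all choices made here for illustration, and the function names are hypothetical.

```python
from collections import deque


def bounding_boxes(img):
    """Return (x0, y0, x1, y1) for each 8-connected region of black (1) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0, y0, x1, y1 = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                x0, y0 = min(x0, cx), min(y0, cy)
                x1, y1 = max(x1, cx), max(y1, cy)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def merge_same_row(boxes, tol=1):
    """Union boxes with the same top and height (within tol pixels),
    regardless of the horizontal gap between them (assumed reading of S4)."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                same_height = abs((a[3] - a[1]) - (b[3] - b[1])) <= tol
                same_top = abs(a[1] - b[1]) <= tol
                if same_height and same_top:
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


page = [
    [1, 1, 0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
regions = merge_same_row(bounding_boxes(page))
print(regions)  # [(0, 0, 7, 1), (0, 3, 0, 3)]
```

The two blocks on the top rows are joined even though four empty columns separate them, which is the behaviour that lets a table-of-contents entry and its distant page number end up in one rectangle. A vertically written page would use the symmetric test on left edge and width.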
[0019]
Further, the control means 6 extracts character lines from the portions of the image stored in the input image memory 2 that are enclosed by the rectangular areas obtained in S4, using, for example, the method disclosed in Japanese Patent Laid-Open No. 8-129550. It then performs character recognition on each character line using the character recognition dictionary 4 (for example, by the method described in “Character Recognition Overview”, edited by Shinichiro Hashimoto, pp. 57-95, Ohmsha, 1982), and checks whether the last character of each line in the recognition result is a number (S5).
[0020]
Based on the result of S5, it is determined whether many character lines end with a number. If so, the process branches to 13; if not, it branches to 14 (S6). Here, let n be the number of character lines on the page and m the number of those lines that end with a number. Lines ending with a number are judged to be many if m/n is at or above a fixed value, for example 0.5. Even when m/n is 0.5 or less, they are judged to be many if lines ending with a number continue for at least a fixed number of consecutive lines, for example two or more.
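The S6 decision can be sketched directly from the two example conditions in the text (ratio m/n of at least 0.5, or at least two consecutive lines ending with a number). The function name and the idea of passing recognized lines as plain strings are assumptions made for illustration; `str.isdigit` is used as a stand-in for "the recognized last character is a number" and also accepts full-width digits.

```python
def has_many_numeric_line_ends(lines, ratio_threshold=0.5, run_threshold=2):
    """Return True when the page counts as having many number-ending lines."""
    n = len(lines)
    if n == 0:
        return False
    # True for each line whose last non-blank character is a digit.
    ends = [line.rstrip()[-1:].isdigit() for line in lines]
    m = sum(ends)
    if m / n >= ratio_threshold:
        return True
    # Fallback condition: a run of consecutive number-ending lines.
    longest = run = 0
    for ends_with_number in ends:
        run = run + 1 if ends_with_number else 0
        longest = max(longest, run)
    return longest >= run_threshold


toc = ["Chapter 1 .... 3", "Chapter 2 .... 15", "Preface"]
body = ["It was a quiet morning.", "Room 12", "Nobody came.", "The end."]
print(has_many_numeric_line_ends(toc), has_many_numeric_line_ends(body))
```

In the first sample two of three lines end with a number, so the ratio test passes; in the second only one of four does and there is no run of two, so the page is treated as body text.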
[0021]
In S6 above, the branch is decided by whether many character lines end with a number. Alternatively, the criterion can be, for example, that when a line ends with a number, character recognition is continued toward the beginning of the line, one or more numbers are found in succession, and a specific delimiter precedes the numbers. Here, the delimiter includes, for example, a space, an ellipsis (…), and the like.
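The alternative criterion (a trailing run of digits preceded by a delimiter) maps naturally onto a regular expression. The sketch below assumes the delimiter set is an ASCII space, a full-width space, or the ellipsis character; the text only gives "space, … and the like", so the exact set and the function name are illustrative.

```python
import re

# A delimiter (space, full-width space, or ellipsis) immediately followed by
# one or more digits that run to the end of the line.
TRAILING_NUMBER = re.compile(r"[ \u3000…]\d+\s*$")


def ends_with_delimited_number(line: str) -> bool:
    """Return True when the line ends with delimiter + digits."""
    return TRAILING_NUMBER.search(line) is not None


for sample in ["Introduction … 12", "Index…7", "Model X200", "No numbers here"]:
    print(sample, ends_with_delimited_number(sample))
```

A line such as "Model X200" ends with a number but has no delimiter in front of the digits, so it is rejected, which is the kind of body-text line this stricter criterion is meant to filter out.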
[0022]
FIG. 3 begins with the processing that follows S6. When many character lines end with a number, it is determined that the second layout information extraction means is not to be used, and the coordinate data of the rectangular areas obtained in S4 is stored in the layout information storage memory 3 (S7). Subsequently, flag_b is set to 1 and flag_c is set to 0 (S8). With these settings, the page following the current page uses the first layout information extraction means rather than the second. After S8, the process proceeds to S15.
[0023]
If not many lines end with a number in step S6, the process proceeds to S9. In S9, flag_b is first referred to, and the process branches depending on whether its value is 0 or 1. If flag_b is 0, the process passes through the routine that uses the second layout information extraction means, so the value of flag_c is incremented by one (S10).
[0024]
If flag_b is 1 in S9, flag is set to 1 so that, for the pages following the current page, layout information is extracted using the second layout information extraction means instead of the first (S11). Following S10 or S11, flag_b is set to 0 to record that the current page uses the second layout information extraction means (S12).
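The flag updates described for the two branches after S6 can be collected into one function. This is a sketch that follows the step descriptions literally (S8 sets flag_b to 1, the step after S10 and S11 resets flag_b to 0); the function name and the returned boolean are hypothetical conveniences.

```python
def update_after_s6(flag, flag_b, flag_c, many):
    """Return (flag, flag_b, flag_c, use_second) for the current page.

    many is the S6 verdict: True when many lines end with a number.
    """
    if many:
        # S7 keeps the rectangles from S4; S8 updates the bookkeeping.
        return flag, 1, 0, False
    if flag_b == 0:
        flag_c += 1   # S10: one more page without many number-ending lines
    else:
        flag = 1      # S11: following pages go straight to the second means
    flag_b = 0        # reset after S10 or S11
    return flag, flag_b, flag_c, True


print(update_after_s6(0, 0, 0, True))
print(update_after_s6(0, 1, 0, False))
print(update_after_s6(0, 0, 0, False))
```

The second call models the first body page after a run of table-of-contents pages: flag_b is still 1 from S8, so flag becomes 1 and the per-page number check is skipped from the next page on.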
[0025]
Following S12, the control means 6 extracts layout information from the image stored in the input image memory 2 using the second layout information extraction means, based on the contents of the program ROM 5 (S13). The extraction algorithm need not be limited; for example, the same processing as in S4 is performed, except that circumscribed rectangles are not integrated when the distance between them is large (for example, when it exceeds a constant multiple of the average size of the circumscribed rectangles of the black pixel connected regions in the image). The resulting rectangular area data is stored in the layout information storage memory (S14).
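A distance-limited merge of this kind might look like the sketch below for horizontal writing. Several points are assumptions: "size" is taken to be the mean box width, the constant `factor` is arbitrary, the limit is computed once from the input boxes, and rectangles are (x0, y0, x1, y1) tuples with inclusive pixel coordinates.

```python
def merge_with_gap_limit(boxes, factor=1.5, tol=1):
    """Union same-row boxes only when their horizontal gap is small enough."""
    boxes = list(boxes)
    if not boxes:
        return boxes
    mean_width = sum(b[2] - b[0] + 1 for b in boxes) / len(boxes)
    max_gap = factor * mean_width  # assumed "constant multiple of average size"
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                same_row = (abs((a[3] - a[1]) - (b[3] - b[1])) <= tol
                            and abs(a[1] - b[1]) <= tol)
                # Number of empty columns between the two boxes.
                gap = max(a[0], b[0]) - min(a[2], b[2]) - 1
                if same_row and gap <= max_gap:
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


chars = [(0, 0, 1, 1), (3, 0, 4, 1), (20, 0, 21, 1)]
print(merge_with_gap_limit(chars))  # [(0, 0, 4, 1), (20, 0, 21, 1)]
```

With a mean width of 2 pixels the limit is 3 columns, so the two neighbouring boxes join while the distant one stays separate, which keeps side-by-side columns of body text from being fused into a single line.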
[0026]
After S8 or S14, the process proceeds to S15. In S15, the counter flag_c is referred to, and the process branches depending on whether it is equal to or greater than a preset value k. Here, k is assigned, for example, a value specified by the user or the total number of pages divided by a constant.
[0027]
If flag_c is greater than or equal to k in S15, flag is set to 1 (S16) in preparation for extracting layout information with the second layout information extraction means from the next page onward, and the process proceeds to S17. If flag_c is less than k in S15, the process proceeds to S17 without doing anything.
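The counter rule of S15 and S16 is small enough to state in a few lines. In this sketch the divisor 10 and the lower bound of 1 on k are assumptions chosen only to make the example concrete; the text says no more than that k may be user-specified or derived from the total page count.

```python
def choose_k(total_pages, divisor=10, user_value=None):
    """Pick k: a user-specified value, or total pages divided by a constant."""
    if user_value is not None:
        return user_value
    return max(1, total_pages // divisor)


def step_s15_s16(flag, flag_c, k):
    """S15/S16: switch to the second means once flag_c reaches k."""
    return 1 if flag_c >= k else flag


k = choose_k(200)
print(k, step_s15_s16(0, 19, k), step_s15_s16(0, 20, k))
```

The effect is that a book with no table of contents at all still stops running the per-page number check after k ordinary pages.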
[0028]
S17 is a processing routine that refers to the page p currently being processed and checks whether it is within a fixed number of pages j from the last page p_last of the input images. For example, j is set in advance to a value corresponding to the index pages of the document. If p_last - p = j, flag is set to 0 in preparation for using the first layout information extraction means for the pages from p onward (S18). In this embodiment, the first layout information extraction means is used for the subsequent pages when p_last - p = j is satisfied; however, another layout information extraction means, for example a third layout information extraction means, may be prepared and used in its place.
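The end-of-document check of S17 and S18 can be written as a single comparison. The sketch follows the equality test stated in the text, p_last - p = j; the function name is hypothetical.

```python
def step_s17_s18(flag, p, p_last, j):
    """S17/S18: reset flag to 0 when page p is exactly j pages before p_last."""
    return 0 if p_last - p == j else flag


# A 100-page scan whose last 10 pages are assumed to be index pages.
print(step_s17_s18(1, 89, 100, 10), step_s17_s18(1, 90, 100, 10))
```

Because flag is consulted at the start of the next page, resetting it at page 90 means the first extraction means is tried again from page 91.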
[0029]
S19 is a processing module that checks whether a next page exists. If a next page exists, p is incremented and the process returns to S2; this processing is repeated until the last page is reached.
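Putting the steps together, the page loop from S1 to S19 can be simulated without any image processing by supplying the S6 verdict for each page as a boolean. This is a sketch under assumptions: extraction and recognition (S4, S5, S13) are stubbed out, and when flag is 1 the jump to "12" is assumed to lead directly to the second extraction means. The function name `simulate` is hypothetical.

```python
def simulate(many_per_page, k, j):
    """Return which extraction means each page ends up using.

    many_per_page[i] is the S6 verdict page i + 1 would get if examined.
    """
    flag = flag_b = flag_c = 0            # S1
    p_last = len(many_per_page)
    used = []
    for p in range(1, p_last + 1):        # S2, with S19 as the loop condition
        if flag == 0:                     # S3
            if many_per_page[p - 1]:      # S4 to S6 (stubbed)
                used.append("first")      # S7
                flag_b, flag_c = 1, 0     # S8
            else:
                if flag_b == 0:           # S9
                    flag_c += 1           # S10
                else:
                    flag = 1              # S11
                flag_b = 0                # S12
                used.append("second")     # S13, S14
        else:
            used.append("second")         # assumed target of the jump to 12
        if flag_c >= k:                   # S15
            flag = 1                      # S16
        if p_last - p == j:               # S17
            flag = 0                      # S18
    return used


# Cover page, two contents pages, six body pages, one index page.
book = [False, True, True, False, False, False, False, False, False, True]
print(simulate(book, k=3, j=1))
```

For this sample the contents pages (2 and 3) and the index page (10) use the first means and every other page uses the second; pages 5 to 9 skip the number check entirely because flag was set to 1 at page 4.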
[0030]
As described above, simple and accurate layout extraction becomes possible. Although the above embodiment has been described on the basis of a configuration having first and second layout information extraction means, the operation of the present invention is not hindered even if there are two or more of these means. Further, by preparing a plurality of variables such as k and j, the present invention can also be applied to documents with complex and varied layouts.
[0031]
[Effect of the invention]
As described above, the publication electronic processing apparatus according to claims 1 and 2 of the present invention regards a page with many character lines ending with a number as a table of contents page or an index page, and can switch between a layout information extraction algorithm for tables of contents and indexes and a layout information extraction algorithm for body text. As a result, simple and highly accurate layout extraction becomes possible.
[0032]
Further, the publication electronic processing apparatus according to claim 3 of the present invention prepares, for example, a layout information extraction algorithm for index pages (first layout information extraction means) and a layout information extraction algorithm for other pages (second layout information extraction means). Even if the algorithm for other pages has been in use up to that point, the apparatus can switch to the algorithm for index pages near the end of the data, which has the effect of shortening the processing time.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing a configuration of a system according to an embodiment to which the present invention is applied.
FIG. 2 is a flowchart showing the operation of an embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of an embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of one embodiment of the present invention.
FIGS. 5A and 5B are diagrams illustrating an example in which a circumscribed rectangle of a black pixel connection region is obtained for black pixels in a document image.
[Explanation of symbols]
1 Image input means
2 Input image memory
3 Layout information storage memory
4 Character recognition dictionary
5 Program ROM
6 Control means
7 Layout information extraction means
8 Determination processing means
S1 Initialization processing module
S2 Processing module that transfers the image data read by the input means to the input image memory
S3 Processing module that determines the selection of the first or second layout information extraction means
S4 Processing module that executes the first layout information extraction means
S5 Processing module that executes character recognition processing within the areas
S6 Processing module that determines whether many lines end with a number

Claims

A publication electronic processing apparatus for recognizing layout information of a page image of a publication ,
Image input means capable of continuously inputting a plurality of page images;
Input image storage means for storing the page image input by the image input means as a page image including a white area and a black area;
For a black pixel of a page image stored in the input image storage means, a rectangle circumscribing the black image is obtained, and when black pixels of the same height are arranged side by side, black pixels of the same width are also obtained. First layout information extraction for extracting a plurality of rectangular areas by repeating the integration of two circumscribed rectangles into one circumscribed rectangle regardless of the distance between the rectangles when the rectangles are arranged vertically Means,
For the black pixels of the image stored in the input image storage means, a rectangle circumscribing the black image is obtained, and when black pixels of the same height are arranged side by side, Second layout information for extracting a plurality of rectangular areas by repeating the integration of two circumscribed rectangles whose distances between rectangles are close within a predetermined range into a single circumscribed rectangle when the rectangles are arranged vertically Extraction means;
End-of-line number survey means for extracting character lines from the rectangular areas extracted by the first layout information extraction means and examining whether the end of each character line is a number;
End-of-line number multiple determination means for determining, as a result of the survey by the end-of-line number survey means, whether the ratio of the number of character lines ending in a number to the number of all character lines in the page is equal to or greater than a predetermined value,
Wherein, when the end-of-line number multiple determination means determines that the ratio of the number of character lines ending in a number to the number of all character lines in the page is equal to or greater than the predetermined value, the second layout information extraction means is not used for that page and the first layout information extraction means is used on the page following the determined page; and when it is thereafter determined that the ratio of the number of character lines ending in a number to the number of all character lines in the page is less than the predetermined value, the second layout information extraction means is used for that page, and on the pages after the determined page the second layout information extraction means is used without using the first layout information extraction means, the end-of-line number survey means, and the end-of-line number multiple determination means.
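
The page-by-page switching recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: each page is assumed to arrive as a list of already recognized text lines, and the names `ends_with_digit`, `digit_line_ratio`, `choose_extractors` and the threshold of 0.5 are hypothetical. The sketch further assumes that the first page falling below the threshold locks in the second extraction means for every later page.

```python
from typing import Iterable, List


def ends_with_digit(line: str) -> bool:
    """True when the last non-blank character of the line is a digit."""
    stripped = line.rstrip()
    return bool(stripped) and stripped[-1].isdigit()


def digit_line_ratio(lines: List[str]) -> float:
    """Share of lines on a page that end in a digit (0.0 for an empty page)."""
    if not lines:
        return 0.0
    return sum(ends_with_digit(line) for line in lines) / len(lines)


def choose_extractors(pages: Iterable[List[str]], threshold: float = 0.5) -> List[str]:
    """Return 'first' or 'second' for each page, in input order.

    While pages keep a high share of lines ending in a digit, the first
    (distance-independent) extraction is kept. The first page that falls
    below the threshold switches to the second extraction, and the ratio
    check is skipped for all later pages.
    """
    choices: List[str] = []
    locked = False  # becomes True once the second extraction is selected
    for lines in pages:
        if locked:
            choices.append("second")
        elif digit_line_ratio(lines) >= threshold:
            choices.append("first")
        else:
            choices.append("second")
            locked = True
    return choices
```

In this sketch a later page with many lines ending in digits (for example a numeric table) does not switch back to the first extraction, because the check is no longer run once the second extraction has been selected.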

A publication electronic processing apparatus for recognizing layout information of a page image of a publication,
Image input means capable of continuously inputting a plurality of page images;
Input image storage means for storing the page image input by the image input means as a page image including a white area and a black area;
First layout information extraction means for obtaining, for the black pixels of the page image stored in the input image storage means, rectangles circumscribing the black image and, when rectangles of black pixels of the same height are arranged side by side or rectangles of black pixels of the same width are arranged vertically, extracting a plurality of rectangular areas by repeatedly integrating two adjacent circumscribed rectangles into one circumscribed rectangle regardless of the distance between the rectangles;
Second layout information extraction means for obtaining, for the black pixels of the image stored in the input image storage means, rectangles circumscribing the black image and, when rectangles of black pixels of the same height are arranged side by side or rectangles of black pixels of the same width are arranged vertically, extracting a plurality of rectangular areas by repeatedly integrating two circumscribed rectangles whose mutual distance is within a predetermined range into one circumscribed rectangle;
End-of-line number survey means for extracting character lines from the rectangular areas extracted by the first layout information extraction means and, when the end of a character line is a number, examining whether the character line has consecutive numbers and whether there is a specific delimiter before the consecutive numbers;
End-of-line number multiple determination means for determining, as a result of the survey by the end-of-line number survey means, whether the ratio, to the number of all character lines in the page, of the number of character lines whose end is a number, whose beginning is a consecutive number, and which have a specific delimiter before the consecutive number is equal to or greater than a predetermined value,
Wherein, when the end-of-line number multiple determination means determines that the ratio, to the number of all character lines in the page, of the number of character lines whose end is a number, whose beginning is a consecutive number, and which have a specific delimiter before the consecutive number is equal to or greater than the predetermined value, the second layout information extraction means is not used for that page and the first layout information extraction means is used on the page following the determined page; and when it is thereafter determined that this ratio is less than the predetermined value, the second layout information extraction means is used for that page, and on the pages after the determined page the second layout information extraction means is used without using the first layout information extraction means, the end-of-line number survey means, and the end-of-line number multiple determination means.
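
The stricter line test of this claim can be sketched in Python as follows. This is a hedged illustration only: the claim does not say which characters count as the "specific delimiter", so the set used here (runs of periods, ellipses, middle dots, or hyphens) is an assumption, as are the names `TOC_LINE`, `is_toc_like`, and `toc_line_ratio`. The sketch checks only for a trailing run of digits preceded by such a delimiter; it does not model the claim's condition on the beginning of the line.

```python
import re
from typing import List

# Assumed delimiter set: "..", "…", "・・", or "--" (or longer runs),
# followed by optional spaces and a run of digits at the end of the line.
TOC_LINE = re.compile(r"(?:\.{2,}|…+|・{2,}|-{2,})\s*\d+\s*$")


def is_toc_like(line: str) -> bool:
    """True when the line ends in consecutive digits preceded by a delimiter."""
    return TOC_LINE.search(line) is not None


def toc_line_ratio(lines: List[str]) -> float:
    """Share of lines on a page that look like table-of-contents entries."""
    if not lines:
        return 0.0
    return sum(is_toc_like(line) for line in lines) / len(lines)
```

Compared with testing only the last character, this check rejects ordinary sentences that happen to end in a year or a quantity, such as "Published in 1998", while still accepting entries such as "Chapter 1 ........ 12".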

The publication electronic processing apparatus according to claim 1 or 2, further comprising final predetermined page determination means for determining whether a page, among the plurality of page images input by the image input means, is within a predetermined number of pages from the final page,
Wherein, when the final predetermined page determination means determines that the page is within the predetermined number of pages from the final page, the first layout information extraction means is used for the pages after the determined page.
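
The final-page rule of this claim can be sketched in Python as an override applied to a per-page list of extractor choices. The function name `apply_final_page_rule`, the labels `"first"` and `"second"`, and the parameter `last_n` are hypothetical. The sketch assumes that the total number of pages is known in advance and that the determined page itself is also processed with the first extraction means; the claim text leaves both points open.

```python
from typing import List


def apply_final_page_rule(choices: List[str], last_n: int) -> List[str]:
    """Force the first extraction on pages within last_n pages of the end.

    choices holds one label per page ('first' or 'second'), in page order.
    Pages whose 0-based index is at least len(choices) - last_n are
    switched to 'first'; all other pages keep their original label.
    """
    total = len(choices)
    return [
        "first" if index >= total - last_n else choice
        for index, choice in enumerate(choices)
    ]
```

For a five-page input with `last_n = 2`, only the last two pages are overridden, which corresponds to processing back matter such as an index with the distance-independent extraction.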